Funding: This work was supported by the National Natural Science Foundation of China (Nos. 61872370 and 61832017), the Beijing Outstanding Young Scientist Program (No. BJJWZYJH012019100020098), the Beijing Academy of Artificial Intelligence (BAAI), the Outstanding Innovative Talents Cultivation Funded Programs 2021 of Renmin University of China, and the Intelligent Social Governance Platform, Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, Renmin University of China.
Abstract: Web search provides a promising way for people to obtain information and has been extensively studied. With the surge of deep learning and large-scale pre-training techniques, various neural information retrieval models have been proposed, and they have been shown to improve search quality, especially ranking quality. All these existing search methods follow a common paradigm, i.e., index-retrieve-rerank: they first build an index of all documents based on document terms (a sparse inverted index) or representation vectors (a dense vector index), then retrieve candidate documents and rerank them with ranking models according to the similarity between the query and each document. In this paper, we explore a new paradigm of information retrieval that uses no explicit index, only a pre-trained model. Instead, all of the knowledge of the documents is encoded into model parameters, which can be regarded as a differentiable indexer and optimized in an end-to-end manner. Specifically, we propose a pre-trained model-based information retrieval (IR) system called DynamicRetriever, which directly returns document identifiers for a given query. Under this framework, we implement two variants to explore how to train the model from scratch and how to combine the advantages of dense retrieval models. Compared with existing search methods, the model-based IR system parameterizes the traditional static index with a pre-trained model, which converts the document semantic mapping into a dynamic and updatable process. Extensive experiments conducted on the public search benchmark Microsoft Machine Reading Comprehension (MS MARCO) verify the effectiveness and potential of our proposed new paradigm for information retrieval.
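For contrast with the model-based approach, the conventional index-retrieve-rerank pipeline that the abstract describes can be illustrated with a minimal Python sketch. This is not DynamicRetriever or any system from the paper: the toy corpus, the function names, and the term-overlap scorer are all assumptions chosen only to make the three stages concrete.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Index step: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def retrieve(index, query):
    """Retrieve step: every document sharing at least one term with the query."""
    candidates = set()
    for term in query.lower().split():
        candidates |= index.get(term, set())
    return candidates


def rerank(docs, candidates, query):
    """Rerank step: order candidates by a toy term-overlap score (hypothetical scorer)."""
    q_terms = Counter(query.lower().split())

    def score(doc_id):
        d_terms = Counter(docs[doc_id].lower().split())
        return sum(min(q_terms[t], d_terms[t]) for t in q_terms)

    # Highest score first; ties are broken by document id for determinism.
    return sorted(candidates, key=lambda d: (-score(d), d))


# Hypothetical three-document corpus.
docs = {
    "d1": "neural ranking models for web search",
    "d2": "shortest path planning in road networks",
    "d3": "dense vector index for neural retrieval",
}
index = build_inverted_index(docs)
query = "neural web search"
ranked = rerank(docs, retrieve(index, query), query)
```

In this sketch the index is a static data structure built once from the documents; the paradigm explored in the abstract replaces that structure with model parameters that map a query directly to document identifiers.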
Abstract: In recent years, there has been a surge of interest and rapid development in large-scale pre-training due to the explosive growth of both data and model parameters. Large-scale training has achieved impressive performance milestones across a wide range of practical problems, including natural language processing, computer vision, recommendation systems, robotics, and other basic research areas such as bioinformatics.
Funding: This work is partially supported by the National Natural Science Foundation of China under Grant Nos. 61402532 and 41371386, the Science Foundation of China University of Petroleum-Beijing under Grant No. 2462013YJRC031, the Excellent Talents of Beijing Program under Grant No. 2013D009051000003, the Beijing Nova Program, and the Open Research Fund Program of Shenzhen Key Laboratory of Spatial Smart Sensing and Services (Shenzhen University).
Abstract: With the increasing availability of real-time traffic information, dynamic spatial networks are pervasive nowadays, and path planning in dynamic spatial networks has become an important issue. In this light, we propose and investigate a novel problem of dynamically monitoring shortest paths in spatial networks (the DSPM query). When a traveler heads to a destination, his/her shortest path to the destination may change for two reasons: 1) the travel costs of some edges have been updated, and 2) the traveler deviates from the pre-planned path. Our target is to accelerate shortest path computation in dynamic spatial networks, and we believe that this study may be useful in many mobile applications, such as route planning and recommendation, car navigation and tracking, and location-based services in general. This problem is challenging for two reasons: 1) how to maintain and reuse existing computation results to accelerate subsequent computations, and 2) how to prune the search space effectively. To overcome these challenges, a filter-and-refinement paradigm is adopted. We maintain an expansion tree and define a pair of upper and lower bounds to prune the search space. A series of optimization techniques are developed to accelerate shortest path computation. The performance of the developed methods is studied in extensive experiments based on real spatial data.
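The idea of filtering out updates that cannot change the current path before doing any refinement can be illustrated with a small Python sketch. It does not implement the expansion tree or the upper and lower bounds from the abstract, whose definitions are not given here. It uses plain Dijkstra plus one simple, provably safe filter: a cost increase on an edge that is not on the current path leaves that path optimal. The graph, node names, and function names are assumptions.

```python
import heapq


def dijkstra(graph, source, target):
    """Return (cost, path) of a shortest path in a directed graph {u: {v: cost}}."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


def update_edge(graph, path, u, v, new_cost):
    """Apply a cost update; return True if the current path must be recomputed."""
    old_cost = graph[u][v]
    graph[u][v] = new_cost
    on_path = (u, v) in zip(path, path[1:])
    # Filter step: a cost increase on an edge outside the path cannot make
    # any other route cheaper, so the current path stays optimal.
    return on_path or new_cost < old_cost


# Hypothetical road network with non-negative travel costs.
graph = {
    "a": {"b": 1.0, "c": 4.0},
    "b": {"c": 1.0, "d": 5.0},
    "c": {"d": 1.0},
    "d": {},
}
cost, path = dijkstra(graph, "a", "d")

# An off-path increase is filtered out; an on-path change triggers refinement.
skip_case = update_edge(graph, path, "b", "d", 9.0)
refine_case = update_edge(graph, path, "b", "c", 6.0)
cost2, path2 = dijkstra(graph, "a", "d")
```

Only the second update forces a recomputation here, which is the kind of saving a filter-and-refinement scheme aims for; the bounds described in the abstract would presumably prune far more aggressively than this single rule.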
Abstract: Nowadays, many e-commerce websites allow users to log in with their existing social networking accounts. When a new user comes to an e-commerce website, it is interesting to study whether information from external social media platforms can be utilized to alleviate the cold-start problem. In this paper, we focus on a specific task in cross-site information sharing, i.e., leveraging the text posted by a user on a social media platform (termed social text) to infer his/her purchase preference over product categories on an e-commerce platform. To solve the task, a key problem is how to represent the social text effectively so that its information can be utilized on the e-commerce platform. We study two major kinds of text representation methods for predicting cross-site purchase preference: shallow textual features and deep textual features learned by deep neural network models. We conduct extensive experiments on a large linked dataset, and our experimental results indicate that it is promising to utilize social text for predicting purchase preference. In particular, the deep neural network approach shows stronger predictive ability when the number of categories becomes large.
Funding: The work was partially supported by the National Natural Science Foundation of China under grant numbers 61872369, 61832017 and 61502502.
Abstract: To develop a knowledge-aware recommender system, a key issue is how to obtain rich and structured knowledge base (KB) information for recommender system (RS) items. Existing data sets or methods either use side information from the original RSs (containing very few kinds of useful information) or utilize a private KB. In this paper, we present KB4Rec v1.0, a data set linking KB information for RSs. It links three widely used RS data sets with two popular KBs, namely Freebase and YAGO. Based on our linked data set, we first perform qualitative analysis experiments, and then we discuss the effect of two important factors (i.e., popularity and recency) on whether an RS item can be linked to a KB entity. Finally, we compare several knowledge-aware recommendation algorithms on our linked data set.
Abstract: Timeline generation is an important research task that can help users quickly understand the overall evolution of a given topic. Previous methods simply split the time span into fixed, equal time intervals without studying the role of the evolutionary patterns of the underlying topic in timeline generation. In addition, few of these methods take users' collective interests into consideration when generating timelines. We consider utilizing social media attention to address these two problems for two reasons: 1) social media is an important pool of real users' collective interests; 2) the information cascades generated in it might be good indicators of the boundaries of topic phases. Employing Twitter as a basis, we propose to incorporate topic phases and users' collective interests, both learnt from social media, into a unified timeline generation algorithm. We construct one informativeness-oriented and three interestingness-oriented evaluation sets over five topics. We demonstrate that the approach is very effective at generating timelines that are both informative and interesting. In addition, our idea naturally leads to a novel presentation of timelines, i.e., phase-based timelines, which can potentially improve user experience.
Funding and acknowledgements: The authors thank the anonymous reviewers for their valuable and constructive comments. The work was partially supported by the National Natural Science Foundation of China (Grant No. 61502502), the National Basic Research Program (973 Program) of China (2014CB340403), Beijing Natural Science Foundation (4162032), and the Open Fund of Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, North China University of Technology, China.
Abstract: Detecting and using bursty patterns to analyze text streams has been one of the fundamental approaches in many temporal text mining applications. So far, most existing studies have focused on developing methods to detect bursty features based purely on term frequency changes. Few have taken the semantic contexts of bursty features into consideration, and as a result the detected bursty features may not always be interesting and can be hard to interpret. In this article, we propose to model the contexts of bursty features using a language modeling approach. We propose two methods to estimate the context language models, based on sentence-level context and document-level context respectively. We then propose a novel topic diversity-based metric that uses the context models to find newsworthy bursty features. We also propose to use the context models to automatically assign meaningful tags to bursty features. Using a large corpus of news articles, we quantitatively show that the proposed context language models for bursty features can effectively help rank bursty features by their newsworthiness and assign meaningful tags to annotate them. We also use two example text mining applications to qualitatively demonstrate the usefulness of bursty feature ranking and tagging.
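As a rough illustration of what a sentence-level context language model for a feature could look like, the following Python sketch estimates a unigram distribution over the words that co-occur with a feature in the same sentence. The abstract does not specify the estimation method, so the add-one smoothing, the whitespace tokenization, the example sentences, and the function name are all assumptions rather than the authors' formulation.

```python
from collections import Counter


def sentence_context_model(sentences, feature, vocab):
    """Add-one smoothed unigram model over words sharing a sentence with `feature`."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        if feature in tokens:
            # Count every co-occurring word except the feature itself.
            counts.update(t for t in tokens if t != feature)
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


# Hypothetical mini-corpus of news sentences.
sentences = [
    "earthquake hits coastal city",
    "rescue teams reach city after earthquake",
    "stock market closes higher",
]
vocab = sorted({w for s in sentences for w in s.split()})
model = sentence_context_model(sentences, "earthquake", vocab)
```

Words such as "city" that appear in the feature's sentences receive more probability mass than unrelated words such as "stock", which is the signal a context-based ranking or tagging step could build on.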
Funding: This work was supported by the Fundamental Research Funds for the Central Universities (No. 2020JS005).
Abstract: Motif-based graph local clustering (MGLC) is a popular method for graph mining tasks due to its various applications. However, the traditional two-phase approach of precomputing motif weights before performing local clustering loses locality and is impractical for large graphs. While some attempts have been made to address the efficiency bottleneck, there is still no applicable algorithm for large-scale graphs with billions of edges. In this paper, we propose a purely local and index-free method called Index-free Triangle-based Graph Local Clustering (TGLC*) to solve the MGLC problem with respect to a triangle. TGLC* directly estimates the Personalized PageRank (PPR) vector using random walks with the desired triangle-weighted distribution and produces the clustering result using a standard sweep procedure. We demonstrate TGLC*'s scalability through theoretical analysis and its practical benefits through a novel visualization layout. TGLC* is the first algorithm to solve the MGLC problem without precomputing the motif weights. Extensive experiments on seven real-world large-scale datasets show that TGLC* is applicable and scalable for large graphs.
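The two building blocks named in the abstract, random-walk estimation of a PPR vector and a sweep procedure, can be sketched in Python as follows. This is the ordinary edge-based variant, not TGLC*: the walks below choose neighbors uniformly rather than following the triangle-weighted distribution, whose definition is not given here. The graph, the parameter values, and the function names are assumptions, and the sketch assumes an undirected graph without self-loops or isolated nodes.

```python
import random


def estimate_ppr(adj, seed, alpha=0.15, num_walks=10000, rng=None):
    """Monte Carlo estimate of the Personalized PageRank vector of `seed`.

    Each walk stops with probability `alpha` at every step; the fraction of
    walks that end at a node estimates that node's PPR value.
    """
    rng = rng or random.Random(0)
    counts = {}
    for _ in range(num_walks):
        node = seed
        while rng.random() > alpha:
            node = rng.choice(adj[node])
        counts[node] = counts.get(node, 0) + 1
    return {node: c / num_walks for node, c in counts.items()}


def sweep_cut(adj, ppr):
    """Order nodes by ppr/degree and return the prefix set with lowest conductance."""
    order = sorted(ppr, key=lambda n: ppr[n] / len(adj[n]), reverse=True)
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    in_set, vol, cut = set(), 0, 0
    best_set, best_phi = set(), float("inf")
    for node in order:
        in_set.add(node)
        vol += len(adj[node])
        # Edges to nodes already inside stop being cut edges;
        # the remaining edges of `node` become new cut edges.
        inside = sum(1 for nbr in adj[node] if nbr in in_set)
        cut += len(adj[node]) - 2 * inside
        denom = min(vol, total_vol - vol)
        if denom == 0:
            break
        phi = cut / denom
        if phi < best_phi:
            best_phi, best_set = phi, set(in_set)
    return best_set, best_phi


# Hypothetical graph: two triangles joined by the single edge 2-3.
adj = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4, 5],
    4: [3, 5],
    5: [3, 4],
}
ppr = estimate_ppr(adj, seed=0, rng=random.Random(0))
cluster, phi = sweep_cut(adj, ppr)
```

Both functions touch only nodes reached by the walks, which is what makes this style of clustering local; replacing the uniform neighbor choice with a motif-aware one is where a triangle-based method would differ.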