In software testing, the quality of test cases is crucial, but manual generation is time-consuming. Various automatic test case generation methods exist, and choosing among them requires careful selection based on program features. Current evaluation methods compare a limited set of metrics; they do not scale to a larger number of metrics or consider the relative importance of each metric to the final assessment. To address this, we propose an evaluation tool, the Test Case Generation Evaluator (TCGE), based on the learning to rank (L2R) algorithm. Unlike previous approaches, our method comprehensively evaluates algorithms by considering multiple metrics, resulting in a more reasoned assessment. The main principle of the TCGE is to form feature vectors from the properties that concern the tester. Through training, the feature vectors are sorted to generate a list, with the order of the methods on the list determined according to their effectiveness on the tested assembly. We implement TCGE using three L2R algorithms: Listnet, LambdaMART, and RFLambdaMART. Evaluation employs a dataset with features of classical test case generation algorithms and three metrics: Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP), and Mean Reciprocal Rank (MRR). Results demonstrate the TCGE's superior effectiveness in evaluating test case generation algorithms compared to other methods. Among the three L2R algorithms, RFLambdaMART proves the most effective, achieving an accuracy above 96.5%, surpassing LambdaMART by 2% and Listnet by 1.5%. Consequently, the TCGE framework exhibits significant application value in the evaluation of test case generation algorithms.
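The three ranking metrics named in this abstract can be illustrated with a minimal Python sketch. It uses the common textbook definitions (exponential gain for NDCG, binary relevance for MAP and MRR); the abstract does not say which variants the authors used, and the `ranked_lists` data below is hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of graded labels listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def average_precision(relevances):
    """Average precision; any label > 0 is treated as relevant."""
    hits, total = 0, 0.0
    for pos, rel in enumerate(relevances, start=1):
        if rel > 0:
            hits += 1
            total += hits / pos
    return total / hits if hits else 0.0

def reciprocal_rank(relevances):
    """1 / position of the first relevant item, or 0 if there is none."""
    for pos, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / pos
    return 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical data: each inner list holds the true relevance labels of the
# items in the order a ranker returned them.
ranked_lists = [
    [3, 2, 0, 1],
    [0, 1, 0, 0],
]

ndcg_mean = mean(ndcg(r) for r in ranked_lists)
map_score = mean(average_precision(r) for r in ranked_lists)
mrr_score = mean(reciprocal_rank(r) for r in ranked_lists)
print(f"NDCG={ndcg_mean:.4f} MAP={map_score:.4f} MRR={mrr_score:.4f}")
```

All three scores lie in [0, 1], and a perfectly ordered list scores 1 on each.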
While China has become the largest online market in the world with approximately 1 billion internet users, Baidu runs the world's largest Chinese search engine, serving hundreds of millions of daily active users and responding to billions of queries per day. To handle the diverse query requests from users at web scale, Baidu has made tremendous efforts in understanding users' queries, retrieving relevant content from a pool of trillions of webpages, and ranking the most relevant webpages at the top of the results. Among the components used in Baidu search, learning to rank (LTR) plays a critical role, and an extremely large number of queries together with relevant webpages must be labelled in a timely manner to train and update the online LTR models. To reduce the cost and time consumption of query/webpage labelling, we study in this work the problem of active learning to rank (active LTR), which selects unlabelled queries for annotation and training. Specifically, we first investigate the ranking entropy (RE) criterion, which characterizes the entropy of relevant webpages under a query produced by a sequence of online LTR models updated by different checkpoints, using a query-by-committee (QBC) method. Then, we explore a new criterion, prediction variances (PV), that measures the variance of prediction results for all relevant webpages under a query. Our empirical studies find that RE may favor low-frequency queries from the pool for labelling, while PV prioritizes high-frequency queries more. Finally, we combine these two complementary criteria as the sample selection strategies for active learning. Extensive experiments with comparisons to baseline algorithms show that the proposed approach could train LTR models to achieve higher discounted cumulative gain (i.e., a relative improvement in DCG4 of 1.38%) with the same budgeted labelling efforts.
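The two selection criteria can be sketched in Python as follows. The abstract does not give formulas, so both definitions are assumptions: `ranking_entropy` uses a common query-by-committee formulation (binary vote entropy over document pairs), and `prediction_variance` reads PV as the variance of one model's scores across a query's webpages. Function names and data are hypothetical.

```python
import math
from itertools import combinations
from statistics import pvariance

def ranking_entropy(committee_scores):
    """Assumed QBC ranking entropy for one query.

    committee_scores[m][d] is the score that committee member m (for
    example, one model checkpoint) gives to webpage d. For every webpage
    pair, members vote on which page ranks higher (a tie counts as a vote
    for the second page); the binary entropy of the vote split is averaged
    over all pairs.
    """
    n_models = len(committee_scores)
    n_docs = len(committee_scores[0])
    pairs = list(combinations(range(n_docs), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        votes = sum(1 for scores in committee_scores if scores[i] > scores[j])
        p = votes / n_models
        if 0.0 < p < 1.0:
            total -= p * math.log2(p) + (1 - p) * math.log2(1 - p)
    return total / len(pairs)

def prediction_variance(scores):
    """Assumed PV: population variance of one model's scores for a query."""
    return pvariance(scores)

# Hypothetical committee outputs for two queries with three webpages each.
scores_agree = [[0.9, 0.5, 0.1], [0.8, 0.6, 0.2], [0.95, 0.4, 0.05]]
scores_split = [[0.9, 0.5, 0.1], [0.5, 0.9, 0.1],
                [0.9, 0.5, 0.1], [0.5, 0.9, 0.1]]

print(ranking_entropy(scores_agree))   # committee agrees on every pair
print(ranking_entropy(scores_split))   # committee splits on one pair
print(prediction_variance([0.9, 0.5, 0.1]))
```

Under these assumptions, a query with high RE is one the checkpoints order inconsistently, while a query with high PV is one whose webpages receive widely spread scores. How the paper combines the two criteria is not stated in the abstract and is not modelled here.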
For the complex questions of Chinese question answering systems, we propose an answer extraction method that combines discourse structure features. The method uses the relevance of questions and answers to learn to rank the answers. First, the method analyses the question to generate a query string, and then submits the query string to search engines to retrieve relevant documents. Second, the method segments the retrieved documents and identifies the most relevant candidate answers; in addition, it uses the rhetorical relations of rhetorical structure theory to determine the inherent relationship between paragraphs or sentences and to generate the candidate answer paragraphs or sentences. Third, we construct the answer ranking model, extract five feature groups, and adopt the Ranking Support Vector Machine (SVM) algorithm to train the ranking model. Finally, the method re-ranks the answers with the trained model and finds the optimal answers. Experiments show that the proposed method combined with discourse structure features can effectively improve the answer extraction accuracy and the quality of non-factoid answers. The Mean Reciprocal Rank (MRR) of the answer extraction reaches 69.53%.
Current search engines in most geospatial data portals tend to induce users to focus on one single data characteristic dimension (e.g., popularity or release date). This approach largely fails to take account of users' multidimensional preferences for geospatial data, and hence is likely to result in a less than optimal user experience in discovering the most applicable dataset. This study reports a machine learning framework to address the ranking challenge, the fundamental obstacle in geospatial data discovery, by (1) identifying a number of ranking features of geospatial data to represent users' multidimensional preferences by considering semantics, user behavior, spatial similarity, and static dataset metadata attributes; (2) applying a machine learning method to automatically learn a ranking function; and (3) proposing a system architecture to combine existing search-oriented open source software, a semantic knowledge base, ranking feature extraction, and a machine learning algorithm. Results show that the machine learning approach outperforms other methods in terms of both precision at K and normalized discounted cumulative gain. As an early attempt at utilizing machine learning to improve search ranking in the geospatial domain, we expect this work to set an example for further research and open the door towards intelligent geospatial data discovery.
Plagiarism source retrieval is the core task of plagiarism detection. It has become the standard for plagiarism detection to use the queries extracted from suspicious documents to retrieve the plagiarism sources. Generating queries from a suspicious document is one of the most important steps in plagiarism source retrieval. Heuristic-based query generation methods are widely used in current research. Each heuristic-based method has its own advantages, and no one statistically outperforms the others on all suspicious document segments when generating queries for source retrieval. Further improvements on heuristic methods for source retrieval rely mainly on the experience of experts. This leads to difficulties in putting forward new heuristic methods that can overcome the shortcomings of the existing ones. This paper paves the way for a new statistical machine learning approach to select the best queries from the candidates. The statistical machine learning approach to query generation for source retrieval is formulated as a ranking framework. Specifically, it aims to achieve the optimal source retrieval performance for each suspicious document segment. The proposed method exploits learning to rank to generate queries from the candidates. To our knowledge, our work is the first research to apply machine learning methods to resolve the problem of query generation for source retrieval. To solve the essential problem of an absence of training data for learning to rank, the building of training samples for source retrieval is also conducted. We rigorously evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. The experimental results show that our proposed machine-learning-based query generation method yields statistically significant improvements over the established baselines in source retrieval effectiveness.
Existing visual scene understanding methods mainly focus on identifying coarse-grained concepts about visual objects and their relationships, largely neglecting fine-grained scene understanding. In fact, many data-driven applications on the Web (e.g., news-reading and e-shopping) require accurate recognition of much finer-grained concepts as entities and proper linking of them to a knowledge graph (KG), which can take their performance to the next level. In light of this, we identify a new research task in this paper: visual entity linking for fine-grained scene understanding. To accomplish the task, we first extract features of candidate entities from different modalities, i.e., visual features, textual features, and KG features. Then, we design a deep modal-attention neural network-based learning-to-rank method which aggregates all features and maps visual objects to the entities in the KG. Extensive experimental results on the newly constructed dataset show that our proposed method is effective, as it significantly improves accuracy from 66.46% to 83.16% compared with baselines.
The rapid development of online services and information overload has inspired the fast development of recommender systems, among which collaborative filtering algorithms and model-based recommendation approaches are widely exploited. For instance, matrix factorization (MF) has demonstrated successful achievements and advantages in assisting internet users in finding information of interest. These existing models focus on the prediction of users' ratings on unknown items. The performance is usually evaluated by the metric root mean square error (RMSE). However, achieving good performance in terms of RMSE does not always guarantee a good ranking performance. Therefore, in this paper, we advocate treating recommendation as a ranking problem. Normalized discounted cumulative gain (NDCG) is chosen as the optimization target when evaluating the ranking accuracy. Specifically, we present three ranking-oriented recommender algorithms: NSMF, AdaMF, and AdaNSMF. NSMF builds an NDCG-approximated loss function for matrix factorization. AdaMF adaptively combines component MF recommenders with a boosting method. To combine the advantages of both algorithms, we propose AdaNSMF, which is a hybrid of NSMF and AdaMF, and show its superiority in both ranking accuracy and model generalization. In addition, we compare our proposed approaches with state-of-the-art recommendation algorithms. The comparison studies confirm the advantage of our proposed approaches.
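The claim that a good RMSE does not guarantee a good ranking can be checked with a small worked example in Python. The ratings and the two sets of predictions are hypothetical, and NDCG uses the common exponential-gain definition, which may differ from the variant used in the paper.

```python
import math

def rmse(true_ratings, predictions):
    """Root mean square error between true and predicted ratings."""
    squared = [(t - p) ** 2 for t, p in zip(true_ratings, predictions)]
    return math.sqrt(sum(squared) / len(squared))

def dcg(relevances):
    """Discounted cumulative gain of labels listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))

def ndcg_of_predictions(true_ratings, predictions):
    """NDCG of the item ordering induced by sorting on predicted score."""
    order = sorted(range(len(predictions)), key=lambda i: -predictions[i])
    ranked = [true_ratings[i] for i in order]
    return dcg(ranked) / dcg(sorted(true_ratings, reverse=True))

# Hypothetical true ratings for three items and two hypothetical predictors.
true_ratings = [5, 4, 1]
pred_a = [3.0, 3.1, 2.9]   # close to the ratings, but swaps the top two items
pred_b = [9.0, 8.0, 7.0]   # far off in scale, but orders the items correctly

rmse_a, rmse_b = rmse(true_ratings, pred_a), rmse(true_ratings, pred_b)
ndcg_a = ndcg_of_predictions(true_ratings, pred_a)
ndcg_b = ndcg_of_predictions(true_ratings, pred_b)
print(f"A: RMSE={rmse_a:.3f} NDCG={ndcg_a:.3f}")
print(f"B: RMSE={rmse_b:.3f} NDCG={ndcg_b:.3f}")
```

Predictor A wins on RMSE while predictor B wins on NDCG, which is the mismatch that motivates optimizing a ranking objective directly.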
Listwise approaches are an important class of learning to rank, which utilizes automatic learning techniques to discover useful information. Most previous research on listwise approaches has focused on optimizing ranking models using weights and has used imprecisely labeled training data; optimizing ranking models using features was largely ignored, which hindered the continuous performance improvement of these approaches. To address the limitations of previous listwise work, we propose a quasi-KNN model to discover the ranking of features and employ a rank addition rule to calculate the weight of the combination. On this basis, we propose three listwise algorithms: FeatureRank, BL-FeatureRank, and DiffRank. The experimental results show that our proposed algorithms can be applied to a strictly ordered ranking training set and gain better performance than state-of-the-art listwise algorithms.
As one of the state-of-the-art automated function prediction (AFP) methods, NetGO 2.0 integrates multi-source information to improve performance. However, it mainly utilizes proteins with experimentally supported functional annotations, without leveraging valuable information from the vast number of unannotated proteins. Recently, protein language models have been proposed to learn informative representations [e.g., Evolutionary Scale Modeling (ESM)-1b embedding] from protein sequences based on self-supervision. Here, we represented each protein by ESM-1b and used logistic regression (LR) to train a new model, LR-ESM, for AFP. The experimental results showed that LR-ESM achieved performance comparable to the best-performing component of NetGO 2.0. Therefore, by incorporating LR-ESM into NetGO 2.0, we developed NetGO 3.0 to improve the performance of AFP extensively.
Funding (active learning to rank for Baidu search): This work was supported in part by the National Key R&D Program of China (No. 2021ZD0110303).
Funding (answer extraction for Chinese question answering): Supported by the National Nature Science Foundation of China under Grants No. 60863011, No. 61175068, No. 61100205, and No. 60873001; the Fundamental Research Funds for the Central Universities under Grant No. 2009RC0212; the National Innovation Fund for Technology based Firms under Grant No. 11C26215305905; and the Open Fund of Software Engineering Key Laboratory of Yunnan Province under Grant No. 2011SE14.
Funding (geospatial data discovery ranking): NSF I/UCRC (Grant No. IIP-1338925); NSF EarthCube (Grant No. ICER-1540998); NASA AIST Program (Grant No. NNX15AM85G).
Funding (query generation for plagiarism source retrieval): Supported by the National Social Science Foundation of China (No. 14CTQ032) and the National Natural Science Foundation of China (No. 61370170).
Funding (NetGO 3.0): Supported by the National Natural Science Foundation of China (Grant Nos. 61872094 and 62272105), the Shanghai Municipal Science and Technology Major Project (Grant No. 2018SHZDZX01), the ZJ Lab, and the Shanghai Research Center for Brain Science and Brain-Inspired Intelligence Technology. Shaojun Wang and Ronghui You have been supported by the 111 Project (Grant No. B18015), the Shanghai Municipal Science and Technology Major Project (Grant No. 2017SHZDZX01), and the Information Technology Facility, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute for Biological Sciences, Chinese Academy of Sciences. Yi Xiong has been supported by the National Natural Science Foundation of China (Grant Nos. 61832019 and 62172274).