Funding: Supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2020-0-01592) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2019R1F1A1058548).
Abstract: The growing collection of scientific data in various web repositories is referred to as Scientific Big Data, as it fulfills the four "V's" of Big Data: volume, variety, velocity, and veracity. This phenomenon has created new opportunities for startups; for instance, extracting pertinent research papers from enormous knowledge repositories using innovative methods has become an important task for researchers and entrepreneurs. Traditionally, the content of papers is compared to list the relevant papers from a repository. This conventional method results in a long list of papers that is often impossible to interpret productively. Therefore, there is a pressing need for a novel approach that intelligently utilizes the available data. The primary element of the scientific knowledge base is the research article, which consists of logical sections such as the Abstract, Introduction, Related Work, Methodology, Results, and Conclusion. This study utilizes these logical sections because they hold significant potential for finding relevant papers. Comprehensive experiments were performed to determine the role of a logical-section-based term indexing method in improving the quality of results (i.e., retrieving relevant papers). We proposed, implemented, and evaluated a logical-section-based content comparison method and compared it against a standard term indexing method. The section-based approach outperformed the standard content-based approach in identifying relevant documents across all classified topics of computer science. Overall, the proposed approach extracted 14% more relevant results from the entire dataset. As the experimental results suggest that employing a finer content similarity technique improves the quality of results, the proposed approach lays a foundation for knowledge-based startups.
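The contrast between whole-document and section-based comparison can be illustrated with a minimal sketch. The abstract does not state how terms are weighted or how per-section scores are combined, so everything below is an assumption for illustration: raw term-frequency vectors, cosine similarity, an equal-weight average over shared sections, and the hypothetical function names `whole_document_similarity` and `section_similarity`. It is not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def whole_document_similarity(paper_a, paper_b):
    """Baseline: treat each paper as a single bag of words, ignoring sections."""
    bag_a = Counter(tokenize(" ".join(paper_a.values())))
    bag_b = Counter(tokenize(" ".join(paper_b.values())))
    return cosine(bag_a, bag_b)


def section_similarity(paper_a, paper_b):
    """Compare matching sections and average the scores.

    Equal weighting of sections is an assumption made for this sketch.
    """
    shared = paper_a.keys() & paper_b.keys()
    if not shared:
        return 0.0
    scores = [
        cosine(Counter(tokenize(paper_a[s])), Counter(tokenize(paper_b[s])))
        for s in shared
    ]
    return sum(scores) / len(scores)


# Toy papers (hypothetical data): the same terms appear in different sections.
paper_a = {"abstract": "citation analysis", "methodology": "neural network"}
paper_b = {"abstract": "neural network", "methodology": "citation analysis"}

print(whole_document_similarity(paper_a, paper_b))
print(section_similarity(paper_a, paper_b))
```

In this toy case the two papers contain exactly the same words, so the whole-document baseline treats them as identical, while the section-wise comparison finds no overlap because the terms play different roles in each paper. That is one plausible reading of why a finer-grained comparison could shorten the list of candidate papers.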
Abstract: Citation-based recommendations of relevant research papers can be generated primarily with three citation models: (1) Bibliographic Coupling, (2) Co-Citation, and (3) Direct Citation. Millions of new scholarly articles are published every year. This flux of scientific information makes it challenging to devise techniques that help researchers find the papers most relevant to the paper at hand. In this study, we deployed an in-text citation analysis that extends the Direct Citation Model to discover the degree of relevancy among scientific papers. For this purpose, the relationship between citing and cited articles is categorized as weak, medium, or strong. As an experiment, around 5,000 research papers were crawled from CiteSeerX. These papers were parsed to identify in-text citation frequencies. Subsequently, 0.1 million references of those articles were extracted, and their in-text citation frequencies were computed. A comprehensive benchmark dataset was established based on a user study. Afterwards, the results were validated using least-squares approximation by a quadratic polynomial. It was found that the degree of relevancy between scientific papers is a quadratic polynomial that increases or decreases with the in-text citation frequency of a cited article. Furthermore, the results of the proposed model were compared with state-of-the-art techniques using a well-known measure, the normalized Discounted Cumulative Gain (nDCG). The proposed method achieved an nDCG score of 0.89, whereas the state-of-the-art Content-based, Bibliographic-Coupling-based, and Metadata-based models achieved nDCG values of 0.65, 0.54, and 0.51, respectively. These results indicate that the proposed mechanism may be applied in future information retrieval systems for better results.
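For readers unfamiliar with the evaluation measure, the following is a minimal sketch of how an nDCG score can be computed for one ranked list. The abstract does not say which nDCG variant was used, so the linear-gain form (relevance divided by the log2 of rank + 1) is an assumption here, as is the hypothetical mapping of the weak, medium, and strong categories to the graded relevance values 1, 2, and 3.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), with ranks starting at 1."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal (descending) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    if ideal == 0:
        return 0.0
    return dcg(relevances) / ideal


# Hypothetical graded relevance of recommended papers, in ranked order:
# 3 = strong, 2 = medium, 1 = weak, 0 = not relevant (an assumed mapping).
ranking = [3, 1, 2, 0]
print(ndcg(ranking))
```

A ranking that already lists papers in descending order of relevance scores 1.0; placing a weakly related paper above a strongly related one lowers the score, which is what makes the measure suitable for comparing recommendation models.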