The major problem of most current information-model approaches is that individual words provide unreliable evidence about the content of texts. When the document is short, e.g., when only the abstract is available, the word-use variability problem has a substantial impact on Information Retrieval (IR) performance. To solve this problem, a new approach to short-document retrieval named the Reference Document Model (RDM) is put forward in this letter. RDM obtains the statistical semantics of the query/document by pseudo feedback, for both the query and the document, from reference documents. The contributions of this model are three-fold: (1) pseudo feedback for both the query and the document; (2) building the query model and the document model from reference documents; (3) flexible indexing units, which can be any linguistic elements such as documents, paragraphs, sentences, n-grams, terms, or characters. For short-document retrieval, RDM achieves significant improvements over classical probabilistic models on the ad hoc retrieval task on Text REtrieval Conference (TREC) test sets. Results also show that the shorter the document, the better the RDM performance.
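The core idea of pseudo feedback from reference documents can be sketched as follows. This is an illustrative toy, not the paper's actual RDM estimator: the overlap-based ranking, the `top_k` and `alpha` parameters, and the linear interpolation are all simplifying assumptions.

```python
from collections import Counter

def expand_model(text_terms, reference_docs, top_k=3, alpha=0.6):
    """Interpolate a short text's term distribution with the distributions
    of its top-k closest reference documents (pseudo feedback)."""
    base = Counter(text_terms)
    # Rank reference documents by term overlap with the short text (toy score).
    ranked = sorted(reference_docs,
                    key=lambda d: len(set(d) & set(text_terms)), reverse=True)
    feedback = Counter()
    for doc in ranked[:top_k]:
        feedback.update(doc)
    vocab = set(base) | set(feedback)
    n_base = sum(base.values()) or 1
    n_fb = sum(feedback.values()) or 1
    # P(w) = alpha * P_text(w) + (1 - alpha) * P_feedback(w)
    return {w: alpha * base[w] / n_base + (1 - alpha) * feedback[w] / n_fb
            for w in vocab}
```

Because the same expansion can be applied to the query and to each short document, both sides end up with richer term distributions than their raw text provides.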
With the development of big data, all walks of life have begun to adopt big data to serve their own enterprises and departments, and university digital libraries are no exception. The most cumbersome task in the management of university libraries is document retrieval. This article uses a Hadoop-based algorithm to extract semantic keywords and then calculates semantic similarity following the literature-retrieval keyword calculation process. A fast-matching method is used to determine the weight of each keyword, ensuring efficient and accurate document retrieval in digital libraries and thus completing the design of a Hadoop-based document retrieval method for university digital libraries.
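A minimal sketch of keyword weighting and matching, assuming relative term frequency as the weight and a simple sum of matched weights as the score; the abstract does not specify its formulas, so both choices here are hypothetical stand-ins for the paper's fast-matching method.

```python
def keyword_weights(doc_terms, keywords):
    """Weight each extracted keyword by its relative frequency in the document."""
    total = len(doc_terms) or 1
    return {k: doc_terms.count(k) / total for k in keywords}

def score(query_keywords, doc_weights):
    """Score a document by summing the weights of the matched query keywords."""
    return sum(doc_weights.get(k, 0.0) for k in query_keywords)
```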
To quantitatively analyze the main figures, fields, agencies, and levels of sponge city research in China, and to clarify the research focus and hot spots of each year, 3152 research articles on sponge cities published in domestic academic journals from 2004 to 2016 were collected via the Full-Text Database of Chinese Sci-tech Periodicals and other retrieval tools and analyzed with bibliometric methods. It is found that since the concept of "sponge city" was first proposed in 2012, sponge city research has involved 40 subject fields and has mainly been published in 32 journals, dominated by natural science research (1427 articles). Researchers come mainly from colleges and universities, some design institutes, and the Chinese Academy of Sciences. The findings can offer guidance for further research on, and the construction of, ecological cities in China.
Academic literature retrieval concerns the selection of papers that are most likely to match a user's information needs. Most retrieval systems are limited to list-output models, in which the retrieval results are isolated from each other. In this paper, we aim to uncover the relationships between retrieval results and propose a method to build structured retrieval results for academic literature, which we call a paper evolution graph (PEG). The PEG describes the evolution of diverse aspects of input queries through several evolution chains of papers. By using author, citation, and content information, PEGs can uncover various underlying relationships among papers and present the evolution of articles from multiple viewpoints. Our system supports three types of input queries: keyword query, single-paper query, and two-paper query. The construction of a PEG consists mainly of three steps. First, the papers are soft-clustered into communities via metagraph factorization, during which the topic distribution of each paper is obtained. Second, topically cohesive evolution chains are extracted from the communities that are relevant to the query; each chain focuses on one aspect of the query. Finally, the extracted chains are combined to generate a PEG, which fully covers all the topics of the query. Experimental results on a real-world dataset demonstrate that the proposed method can construct meaningful PEGs.
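The chain-extraction step can be illustrated with a greedy sketch: walk papers in chronological order and extend the chain only while topical cohesion holds. The `sim` function, the `threshold`, and the greedy strategy are assumptions for illustration; the paper's actual extraction operates on metagraph-factorization topic distributions.

```python
def extract_chain(papers, sim, threshold=0.3):
    """Greedily build one evolution chain: visit papers in chronological
    order, appending a paper only if it is topically similar enough to
    the chain's most recent element."""
    chain = []
    for paper in sorted(papers, key=lambda p: p["year"]):
        if not chain or sim(chain[-1], paper) >= threshold:
            chain.append(paper)
    return chain
```

Running this once per query-relevant community yields one chain per aspect; combining the chains gives the graph structure.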
This paper proposes a novel method for subtopic segmentation of Web documents; more effective retrieval results can be obtained by using subtopic segmentation. The proposed method segments subtopics hierarchically and identifies the boundary of each subtopic. Based on the term-frequency matrix, the method measures the similarity between adjacent blocks, such as paragraphs and passages. In a real-world sample experiment, the macro-averaged precision and recall reach 73.4% and 82.5%, and the micro-averaged precision and recall reach 72.9% and 83.1%. Moreover, the method is equally applicable to other Asian languages such as Japanese and Korean, as well as to Western languages.
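The adjacent-block similarity idea can be sketched as follows, assuming cosine similarity over term-frequency vectors and a fixed dip threshold for boundary placement; the abstract does not name its similarity measure or boundary rule, so both are illustrative assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    num = sum(a[t] * b[t] for t in a if t in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def boundaries(blocks, threshold=0.2):
    """Place a subtopic boundary after block i whenever the similarity
    between block i and block i+1 dips below the threshold."""
    vecs = [Counter(b) for b in blocks]
    return [i for i in range(len(vecs) - 1)
            if cosine(vecs[i], vecs[i + 1]) < threshold]
```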
A document layout can be more informative than merely a document's visual and structural appearance. Thus, document layout analysis (DLA) is considered a necessary prerequisite for advanced processing and detailed document image analysis to be further used in several applications and for different objectives. This research extends traditional DLA approaches and introduces the concept of semantic document layout analysis (SDLA) by proposing a novel framework for semantic layout analysis and characterization of handwritten manuscripts. The proposed SDLA approach enables the derivation of implicit information and semantic characteristics, which can be effectively utilized in dozens of practical applications for various purposes, bridging the semantic gap and providing more understandable, high-level document image analysis and more invariant characterization via absolute and relative labeling. The approach is validated and evaluated on a large dataset of Arabic handwritten manuscripts with complex layouts. The experimental work shows promising results in terms of accurate and effective semantic characteristic-based clustering and retrieval of handwritten manuscripts. It also indicates the expected efficacy of the proposed approach in automating and facilitating many functional, real-life tasks such as effort estimation and pricing of the transcription or typing of such complex manuscripts.
This paper introduces a new enhanced Arabic stemming algorithm for the information retrieval problem, especially in medical documents. The proposed algorithm is a light stemming algorithm for extracting stems and roots from the input data. One of the main challenges facing a light stemming algorithm is cutting off the input word to extract the initial segments: when the light stemmer starts from strong initial segments, the finally extracted stems and roots are more accurate. Therefore, a new enhanced segmentation based on a Directed Acyclic Graph (DAG) model is utilized. In addition to extracting strong initial segments, the two main procedures (stem extraction and root extraction) are also reinforced with more efficient operators to improve the final outputs. To validate the proposed enhanced stemmer, four data sets are used. The stems and roots produced by the proposed light stemmer are compared with the results obtained from five other well-known Arabic light stemmers on the same data sets. This evaluation shows that the proposed enhanced stemmer outperforms the comparative stemmers.
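The segmentation-graph idea behind light stemming can be sketched as below: each candidate (prefix, stem, suffix) split is a path through a small prefix-stem-suffix graph, and one path is selected. The affix lists here are hypothetical transliterated placeholders, and the longest-affix selection rule is an assumption; a real Arabic stemmer works on Arabic script with attested affix inventories and richer scoring.

```python
# Hypothetical affix lists (transliterated placeholders, not a real inventory).
PREFIXES = ["al", "wa", "bi"]
SUFFIXES = ["at", "un", "a"]

def candidate_segments(word):
    """Enumerate (prefix, stem, suffix) splits: each split corresponds to
    one path through a prefix -> stem -> suffix segmentation graph."""
    for p in [""] + [x for x in PREFIXES if word.startswith(x)]:
        rest = word[len(p):]
        for s in [""] + [x for x in SUFFIXES if rest.endswith(x)]:
            stem = rest[:len(rest) - len(s)] if s else rest
            if len(stem) >= 2:          # reject degenerate stems
                yield (p, stem, s)

def light_stem(word):
    """Pick the candidate path that strips the most affix material."""
    return max(candidate_segments(word),
               key=lambda seg: len(seg[0]) + len(seg[2]))[1]
```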
To efficiently retrieve relevant documents from the rapid proliferation of large information collections, a novel immune algorithm for document query optimization is proposed. The essential idea of the immune algorithm is that the crossover and mutation operators are constructed according to the characteristics of information retrieval. An immune operator is adopted to avoid degeneracy. The relevant documents retrieved are merged into a single document list according to a ranking formula. Experimental results show that the novel immune algorithm leads to substantial improvements in relevant-document retrieval effectiveness.
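The crossover and mutation operators can be sketched on query term-weight vectors (antibodies). This is a generic sketch under assumed representations, not the paper's operators; in particular, the immune (anti-degeneracy) step is only indicated in a comment, since its fitness test depends on the retrieval setup.

```python
import random

def crossover(q1, q2):
    """Single-point crossover on two query term-weight vectors."""
    point = random.randrange(1, len(q1))
    return q1[:point] + q2[point:]

def mutate(q, rate=0.1):
    """Randomly perturb term weights. In a full immune algorithm, an
    immune operator would then reject mutants whose retrieval fitness
    degenerates below the parent's, avoiding degeneracy."""
    return [w + random.uniform(-0.1, 0.1) if random.random() < rate else w
            for w in q]
```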
Funding: Supported by the Funds of Heilongjiang Outstanding Young Teacher (1151G037).
Funding: Supported by the Science Research Fund of Yunnan Provincial Education Department (2017ZZX090); a School-level Key Project of Kunming University (XJZD1602); and "Study on Key Technology of Ecological Planting Mode of Balcony Vegetables" under the provincial-level "Quality Project" of Kunming University (Innovation and Entrepreneurship Training Program for College Students).
Funding: Project supported by the National Key R&D Program of China (No. 2018YFB0505000) and the National Natural Science Foundation of China (No. 61571393).
Funding: Supported by the National High Technology Research and Development Program of China (2002AA119050).
Funding: This research was supported and funded by the KAU Scientific Endowment, King Abdulaziz University, Jeddah, Saudi Arabia.
Funding: The National High-Tech Development 863 Program of China (No. 2003AA1Z2610).