Funding: Supported by the National Natural Science Foundation of China (90604025).
Abstract: Keyword extraction is an important research topic in information retrieval. This paper gives a specification of keywords in Chinese news documents based on an analysis of the linguistic characteristics of news documents, and then proposes a new keyword extraction method based on tf-idf with multiple strategies. The approach selects candidate keywords of uni-, bi-, and tri-grams, and then defines features according to their morphological characteristics and context information. Moreover, the paper proposes several strategies to amend incomplete words produced by word segmentation and to find unknown potential keywords in news documents. Experimental results show that the proposed method significantly outperforms the baseline method. We also applied it to retrospective event detection; experimental results show that both the accuracy and the efficiency of retrospective event detection in news can be significantly improved.
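As a rough illustration of the candidate-scoring step described in this abstract (not the paper's multi-strategy method), the sketch below scores uni-, bi-, and tri-gram candidates by tf-idf with scikit-learn; the toy corpus and the top_k parameter are hypothetical placeholders.

```python
# Illustrative sketch only: tf-idf over uni-, bi-, and tri-gram candidates.
# The corpus and top_k are hypothetical placeholders, not the paper's data.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "stock market rallies after central bank rate decision",
    "central bank signals further rate cuts this year",
    "local team wins national football championship",
]

# Candidate keywords are all uni-, bi-, and tri-grams in each document.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
tfidf = vectorizer.fit_transform(corpus)

def top_keywords(doc_index, top_k=5):
    """Return the top_k candidate keywords of one document by tf-idf weight."""
    weights = tfidf[doc_index].toarray().ravel()
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(terms, weights), key=lambda t: t[1], reverse=True)
    return [term for term, score in ranked[:top_k] if score > 0]

print(top_keywords(0))
```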
Funding: The work of Tham Vo is supported by Lac Hong University and funded by Thu Dau Mot University (No. DT.20-031). The work of Phuc Do is funded by Vietnam National University, Ho Chi Minh City (No. DS2020-26-01).
Abstract: As the smart-city trend, especially artificial intelligence, data science, and the Internet of Things, has attracted a lot of attention, many researchers have created various smart applications to improve people's quality of life. Because it is essential to automatically collect and exploit information in the era of Industry 4.0, a variety of models have been proposed for solving storage problems and for efficient data mining. In this paper, we present our proposed system, the Trendy Keyword Extraction System (TKES), which is designed to extract trendy keywords from text streams. The system also supports storing, analyzing, and visualizing documents coming from text streams. The system first collects daily articles automatically, then ranks the importance of keywords by their frequency of occurrence in order to find trendy keywords using the Burst Detection Algorithm proposed in this paper, which builds on Kleinberg's idea. A burst is defined as a period of time during which a keyword is continuously and unusually popular over the text stream, and the identification of bursts is known as burst detection. The results of user requests can be displayed visually. Furthermore, we devise a method to find a trendy keyword set, defined as a set of keywords that belong to the same burst. This work also describes the datasets used in our experiments and reports processing-speed tests of our two proposed algorithms.
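The abstract does not give the Burst Detection Algorithm itself, so the sketch below is only a simplified, threshold-based stand-in for the same idea: a keyword is treated as bursting on days when its frequency rises well above its usual level, and consecutive bursty days form one burst. The daily_counts series and the z threshold are hypothetical.

```python
# Simplified stand-in for burst detection (not the paper's Kleinberg-based
# algorithm): flag days where the count exceeds mean + z * standard deviation,
# then merge consecutive flagged days into bursts.
from statistics import mean, stdev

def detect_bursts(daily_counts, z=2.0):
    """Return (start_day, end_day) index pairs of contiguous bursty days."""
    mu, sigma = mean(daily_counts), stdev(daily_counts) or 1.0
    bursty = [c > mu + z * sigma for c in daily_counts]
    bursts, start = [], None
    for day, flag in enumerate(bursty):
        if flag and start is None:
            start = day
        elif not flag and start is not None:
            bursts.append((start, day - 1))
            start = None
    if start is not None:
        bursts.append((start, len(bursty) - 1))
    return bursts

# Toy series: the keyword spikes on days 5 and 6, which form one burst.
print(detect_bursts([2, 3, 2, 1, 2, 15, 18, 3, 2, 2], z=1.5))  # [(5, 6)]
```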
Abstract: Many algorithms have been implemented for the problem of document categorization. The majority of work in this area has been done for English text, while very few approaches have been introduced for Arabic text. The nature of Arabic text is different from that of English text, and the preprocessing of Arabic text is more challenging, because Arabic is a highly inflectional and derivational language, which makes document mining a hard and complex task. In this paper, we present an automatic Arabic document classification system based on the kNN algorithm. We also develop an approach that addresses keyword extraction and reduction using a Document Frequency (DF) threshold method. The results indicate that the ability of kNN to deal with Arabic text outperforms the other existing systems. The proposed system reached a 0.95 micro-recall score on 850 Arabic texts in 6 different categories.
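A minimal sketch of the two ideas in this abstract, assuming documents are already tokenized and preprocessed: feature reduction by a document frequency (DF) threshold, then kNN classification with scikit-learn. The romanized toy tokens stand in for real Arabic words and are not the 850-document corpus used in the paper.

```python
# Illustrative only: DF-threshold feature selection plus kNN classification.
# Tokens are romanized placeholders standing in for preprocessed Arabic words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_docs = ["iqtisad souq ashum", "riyada kura qadam mubara", "iqtisad bank faida"]
train_labels = ["economy", "sports", "economy"]

# min_df is the DF threshold; on a real corpus a larger value prunes rare terms.
vectorizer = CountVectorizer(min_df=1)
X_train = vectorizer.fit_transform(train_docs)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, train_labels)

X_test = vectorizer.transform(["souq ashum bank"])
print(knn.predict(X_test))  # expected: ['economy']
```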
Funding: Supported by the National Key R&D Program of China (No. 2018YFE0205502) and the National Natural Science Foundation of China (No. 61672108).
Abstract: The internet is an abundant source of news every day, so efficient algorithms to extract keywords from text are important for obtaining information quickly. However, the precision and recall of mature keyword extraction algorithms need improvement. TextRank, which is derived from the PageRank algorithm, uses word graphs to spread the weight of words, but its keyword weight propagation focuses only on word frequency. To improve the performance of the algorithm, we propose Semantic Clustering TextRank (SCTR), a semantic-clustering news keyword extraction algorithm based on TextRank. First, the word vectors generated by the Bidirectional Encoder Representations from Transformers (BERT) model are clustered with k-means to capture semantic clusters. Then, the clustering results are used to construct a TextRank weight transfer probability matrix. Finally, the word graph is computed iteratively and keywords are extracted. The test target of this experiment is a Chinese news corpus. The results of the experiment on this text set show that the SCTR algorithm achieves greater precision, recall, and F1 value than the traditional TextRank and Term Frequency-Inverse Document Frequency (TF-IDF) algorithms.
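A simplified sketch of the pipeline this abstract describes, with random vectors standing in for BERT embeddings and a uniform graph standing in for real co-occurrence counts: cluster the word vectors with k-means, bias a TextRank-style transition matrix toward same-cluster edges, and rank words by power iteration. The cluster count, damping factor, and boost value are hypothetical choices, not the paper's settings.

```python
# Illustrative SCTR-style sketch: k-means over word vectors, a cluster-biased
# transfer matrix, and TextRank-style power iteration. Vectors are random
# stand-ins for BERT embeddings; all hyperparameters are hypothetical.
import numpy as np
from sklearn.cluster import KMeans

words = ["economy", "bank", "rate", "football", "match", "team"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(words), 8))      # stand-in for BERT word vectors

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# A real system would build this adjacency from co-occurrence windows;
# a dense uniform graph keeps the sketch short.
n = len(words)
A = np.ones((n, n)) - np.eye(n)
boost = 2.0
for i in range(n):
    for j in range(n):
        if i != j and clusters[i] == clusters[j]:
            A[i, j] *= boost                     # favor semantically close words

P = A / A.sum(axis=1, keepdims=True)             # row-stochastic transfer matrix
damping, score = 0.85, np.full(n, 1.0 / n)
for _ in range(50):                              # TextRank-style iteration
    score = (1 - damping) / n + damping * P.T @ score

print(sorted(zip(words, score.round(3)), key=lambda t: -t[1])[:3])
```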
Abstract: Graph-based methods are among the most widely used unsupervised approaches for keyword extraction. In this approach, words are linked according to their co-occurrences within the document. Afterwards, graph-based ranking algorithms are used to rank words, and those with the highest scores are selected as keywords. Although graph-based methods are effective for keyword extraction, they rank words merely based on word-graph topology. In fact, we have various kinds of prior knowledge about how likely words are to be keywords; this knowledge may be frequency-based, position-based, or semantic-based. In this paper, we propose to incorporate prior knowledge into graph-based methods for keyword extraction and investigate the contributions of the prior knowledge. Experiments reveal that prior knowledge can significantly improve the performance of graph-based keyword extraction. Moreover, by combining prior knowledge with neighborhood knowledge, we achieve the best results compared to previous graph-based methods.
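One common way to fold such prior knowledge into a graph-based ranker is a biased restart vector in PageRank; the paper may combine priors differently, so the sketch below is only illustrative, and the toy graph, prior values, and damping factor are hypothetical.

```python
# Illustrative only: prior knowledge injected as a personalized (biased)
# restart vector in a PageRank-style word ranker.
import numpy as np

words = ["model", "keyword", "graph", "the", "extraction"]
# Symmetric toy co-occurrence adjacency matrix.
A = np.array([
    [0, 2, 1, 3, 1],
    [2, 0, 2, 2, 3],
    [1, 2, 0, 1, 2],
    [3, 2, 1, 0, 1],
    [1, 3, 2, 1, 0],
], dtype=float)

# Prior knowledge (frequency-, position-, or semantics-based): function words
# such as "the" get a very low prior, content words a higher one.
prior = np.array([0.20, 0.30, 0.20, 0.01, 0.29])
prior /= prior.sum()

P = A / A.sum(axis=1, keepdims=True)
damping, score = 0.85, np.full(len(words), 1 / len(words))
for _ in range(100):
    score = (1 - damping) * prior + damping * P.T @ score  # prior-biased restart

print(sorted(zip(words, score.round(3)), key=lambda t: -t[1]))
```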
Funding: Supported by the Major Program of the National Natural Science Foundation of China (Grant No. 91938301), the National Defense Equipment Advance Research Shared Technology Program of China (41402050301-170441402065), and the Sichuan Science and Technology Major Project on New Generation Artificial Intelligence (2018GZDZX0034).
Abstract: Keyword extraction is a branch of natural language processing that plays an important role in many tasks, such as long-text classification, automatic summarization, machine translation, and dialogue systems, all of which need high-quality keywords as a starting point. In this paper, we propose a deep learning network called the deep neural semantic network (DNSN) to solve the problem of short-text keyword extraction. It maps short texts and words into the same semantic space, obtains their semantic vectors simultaneously, and then computes the similarity between the short text and each word to extract the top-ranked words as keywords. Bidirectional Encoder Representations from Transformers (BERT) is first used to obtain the initial semantic feature vectors of the short text and the words; these vectors are then fed into a residual network to obtain the final semantic vectors of the short text and the words in the same vector space. Finally, keywords are extracted by calculating the similarity between the short text and the words. Compared with existing baseline models, including Frequency, Term Frequency-Inverse Document Frequency (TF-IDF), and TextRank, the proposed model is superior in precision, recall, and F-score on the same test dataset. In the best case, the precision, recall, and F-score are 6.79%, 5.67%, and 11.08% higher than the baseline model, respectively.
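A minimal sketch of the final ranking step described here, with random vectors standing in for the BERT-plus-residual-network outputs: rank candidate words by cosine similarity to the short-text vector and keep the top ones as keywords. The vector dimension and word list are hypothetical.

```python
# Illustrative only: cosine-similarity ranking of candidate words against a
# short-text semantic vector. Random vectors stand in for learned embeddings.
import numpy as np

rng = np.random.default_rng(1)
text_vec = rng.normal(size=128)                          # short-text vector
word_vecs = {w: rng.normal(size=128)
             for w in ["neural", "network", "keyword", "banana", "semantic"]}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ranked = sorted(word_vecs, key=lambda w: cosine(text_vec, word_vecs[w]),
                reverse=True)
print(ranked[:3])   # the top-ranked words become the extracted keywords
```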
Funding: Supported by the National Natural Science Foundation of China (60803160 and 61272110), the Key Projects of the National Social Science Foundation of China (11&ZD189), the Natural Science Foundation of Hubei Province (2013CFB334), the Natural Science Foundation of the Educational Agency of Hubei Province (Q20101110), the State Key Lab of Software Engineering Open Foundation of Wuhan University (SKLSE2012-09-07), the Teaching Research Project of Hubei Province (2011s005), and the Wuhan Key Technology Support Program (2013010602010216).
Abstract: Microblogs are social platforms with huge user communities and massive amounts of data. We propose a semantic recommendation mechanism for microblogs based on sentiment analysis. First, keywords and sentiment words are extracted by natural language processing, including segmentation, lexical analysis, and strategy selection. Then, we query a background knowledge base built on linked open data (LOD) with the users' basic information. The experimental results show that, with sentiment analysis and semantic queries, the accuracy of recommendation falls within the range of 70%-89%. Compared with traditional recommendation methods, this method satisfies users' requirements much better.
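The abstract leaves the scoring details to the paper, so the sketch below is only a heavily simplified stand-in: score a post's sentiment with a tiny lexicon and recommend catalog items that share keywords with positively framed topics (the real system instead queries an LOD knowledge base). The lexicon, post, keywords, and catalog are hypothetical.

```python
# Heavily simplified stand-in for keyword- and sentiment-driven recommendation;
# the actual system queries a linked-open-data knowledge base.
POSITIVE = {"love", "great", "enjoy"}
NEGATIVE = {"hate", "boring", "bad"}

def sentiment(tokens):
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

post = "love the new marathon training plan and enjoy the morning runs".split()
keywords = {"marathon", "training", "gym"}
catalog = {
    "running shoes": {"marathon", "training"},
    "gym pass": {"gym"},
    "cookbook": {"recipes"},
}

if sentiment(post) > 0:   # only recommend when the overall tone is positive
    recommendations = [item for item, tags in catalog.items() if tags & keywords]
    print(recommendations)
```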
Funding: Supported by the National Natural Science Foundation of China (Nos. 61170196 and 61202140) and by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative, administered by the IDM Programme Office.
Abstract: Microblogging services provide a novel and popular communication scheme for Web users to share information and express opinions by publishing short posts, which usually reflect the users' daily lives. We can thus model the users' daily status and interests according to their posts. Because of the high complexity and large volume of microblog users' posts, it is necessary to provide a quick summary of users' life status, both for personal users and for commercial services. It is non-trivial to summarize the life status of microblog users, particularly when the summary covers a long period. In this paper, we present a compact interactive visualization prototype, LifeCircle, as an efficient summary for exploring the long-term life status of microblog users. The radial visualization provides multiple views of a given microblog user, including annual topics, monthly keywords, monthly sentiments, and temporal trends of posts. We tightly integrate interactive visualization with novel, state-of-the-art microblogging analytics to maximize their advantages. We implement LifeCircle on Sina Weibo, the most popular microblogging service in China, and illustrate the effectiveness of our prototype with various case studies. Results show that our prototype makes users nostalgic and encourages them to reminisce about past events, which helps them to better understand themselves and others.
Abstract: This work presents a spoken dialog summarization system with HAPPINESS/SUFFERING factor recognition. The semantic content of the spoken dialog is compressed and classified into factor categories. The transcription from automatic speech recognition is processed through the Chinese Knowledge and Information Processing segmentation system. The proposed system also adopts part-of-speech tags to effectively select and rank keywords. Finally, HAPPINESS/SUFFERING factor recognition is performed with the proposed point-wise mutual information measure. Compared with the original method, performance is improved by applying the significance scores of keywords. The experimental results show that the average precision for factor recognition in the outside test reaches 73.5%, which demonstrates the feasibility and potential of the proposed system.
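For reference, a minimal sketch of point-wise mutual information between a keyword and a factor category, estimated from co-occurrence counts; the paper's keyword significance weighting is not reproduced here, and the toy counts are hypothetical.

```python
# Illustrative PMI between a keyword x and a factor category y:
#   PMI(x, y) = log( P(x, y) / (P(x) * P(y)) )
import math

def pmi(count_xy, count_x, count_y, total):
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return math.log(p_xy / (p_x * p_y))

# Toy counts: the keyword appears in 30 dialog segments, the HAPPINESS factor
# in 40, they co-occur in 20, out of 200 segments overall.
print(round(pmi(20, 30, 40, 200), 3))   # positive PMI -> strong association
```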
Funding: This work is partially supported by the National Natural Science Foundation of China (Grant Nos. 61502115, 61602326, U1636103, and U1536207) and the Fundamental Research Funds for the Central Universities (Grant Nos. 3262017T12, 3262017T18, 3262018T02, and 3262018T58).
Abstract: With the popularity of social media, there has been an increasing interest in user profiling and its applications. This paper presents our system, UIR-SIST, for the User Profiling Technology Evaluation Campaign in SMP CUP 2017. UIR-SIST aims to complete three tasks: keyword extraction from blogs, user interest labeling, and user growth value prediction. To this end, we first extract keywords for a user's blog, including the blog itself, blogs on the same topic, and other blogs published by the same user. Then a unified neural network model based on a convolutional neural network (CNN) is constructed for user interest tagging. Finally, we adopt a stacking model to predict user growth value. We eventually received sixth place, with evaluation scores of 0.563, 0.378, and 0.751 on the three tasks, respectively.
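A minimal sketch of a stacking model for a growth-value style regression task, using scikit-learn's StackingRegressor; the base learners, meta-learner, and synthetic features are hypothetical, not the configuration used by UIR-SIST.

```python
# Illustrative stacking regressor: base models feed a meta-learner that
# combines their predictions. Data is synthetic, not real user features.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
                ("ridge", Ridge())],
    final_estimator=Ridge(),            # meta-learner over base predictions
)
stack.fit(X_train, y_train)
print(round(stack.score(X_test, y_test), 3))   # R^2 on held-out data
```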