Abstract
Keyword extraction is an important step in natural language processing (NLP): it mines the topic of a text and summarizes its content in a few words, and it is widely used in information retrieval and text mining. Selected keywords must have three properties: they are easy to understand, highly relevant to the text, and cover the content of the text well. This paper improves the TextRank algorithm. A text is divided into several parts, a keyword graph is built for each part, and a number of keywords are extracted from each part. The candidates are then scored on a combination of factors, including word frequency, length, position, and part of speech, and the final keywords are selected. Experiments show that the accuracy of the proposed algorithm is 2.3% higher than that of the traditional TextRank algorithm. The traditional TextRank algorithm divides the text by sentence; this division is too fine-grained and severs the connections between sentences. The improved TextRank algorithm alleviates this problem and improves the efficiency of the algorithm.
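The pipeline described in the abstract (split the text into parts, build a keyword graph per part, extract candidates from each part, then rescore them) can be illustrated with a minimal sketch. This is not the paper's implementation. The regex tokenizer, the stopword list, splitting by token count, the co-occurrence window, and the scoring weights are all assumptions made for illustration, and the part-of-speech factor is omitted because it would require a tagger.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for",
             "on", "as", "by", "with", "that", "this", "are", "be"}


def tokenize(text):
    """Lowercase alphabetic tokens, minus stopwords and very short words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS and len(w) > 2]


def textrank(tokens, window=3, damping=0.85, iterations=30):
    """PageRank-style scores on an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    return scores


def extract_keywords(text, n_parts=3, per_part=5, top_n=5):
    """Rank each part separately, then rescore the pooled candidates."""
    tokens = tokenize(text)
    if not tokens:
        return []
    size = -(-len(tokens) // n_parts)  # ceiling division
    candidates = set()
    for start in range(0, len(tokens), size):
        ranks = textrank(tokens[start:start + size])
        best = sorted(ranks, key=lambda w: (-ranks[w], w))[:per_part]
        candidates.update(best)

    freq = Counter(tokens)
    first_pos = {}
    for i, w in enumerate(tokens):
        first_pos.setdefault(w, i)
    max_freq = max(freq.values())
    max_len = max(len(w) for w in tokens)

    def score(w):
        # Assumed weights: frequency 0.5, length 0.2, early position 0.3.
        f = freq[w] / max_freq
        l = len(w) / max_len
        p = 1 - first_pos[w] / len(tokens)
        return 0.5 * f + 0.2 * l + 0.3 * p

    return sorted(candidates, key=lambda w: (-score(w), w))[:top_n]
```

Calling `extract_keywords(document_text)` returns at most `top_n` words. For Chinese text, the regex tokenizer would need to be replaced by a word segmenter, and a part-of-speech term could be added to `score` if tags are available.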
Author
王俊玲
WANG Jun-ling (School of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266500, China)
Source
Software Guide (《软件导刊》)
2021, No. 4, pp. 49-52 (4 pages)
Funding
National Key Research and Development Program of China (2017YFC0804406)
Key Research and Development Program of Shandong Province (2016ZDJS02A05).