Abstract
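At present, extracting key information from forestry texts faces two problems: extraction is approached mainly from the keyword perspective and ignores the information type of words; and forestry texts on the Internet have no unified descriptive structure, which makes word information types hard to extract. To address this, a method for extracting key information from forestry texts based on improved TextRank and cluster filtering is proposed, in which key information is represented in two parts, "keyword + information type". First, keywords are extracted and vectorized with Word2Vec; TextRank is then improved by building a graph model that incorporates word feature values and edge weights, and the stable graph obtained after iterative convergence is merged and clustered to form clusters. Next, a cluster quality evaluation formula is designed to filter the clusters, and TextRank is applied again to form the final cluster set. Finally, the clusters are labeled with information types. For a test text, the information type of each word is obtained by comparing the distance between the keyword vector and the cluster centre vectors, and the information type is combined with the keyword to give the key information of the text. In experiments on 2000 forestry texts related to forestry policy news, the final cluster set had a compactness of 0.9680, a separation of 0.0572 and an F1-measure of 0.8871. Keywords were manually labeled for 400 of these texts, and the proposed keyword extraction method was compared with six algorithms such as TextRank and TF-IDF. The results show that the proposed method performed well on MRR, Bpref, accuracy and F1-measure, indicating that it has advantages in extracting keywords from forestry texts.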
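The abstract does not give the formula used to combine word feature values and edge weights inside TextRank, so the following is only a minimal sketch of one common way to do it. It assumes, purely for illustration, that the word feature values act as a personalized teleport prior and that edge weights split each node's score among its neighbours. The function name `weighted_textrank`, its parameters and the toy words are hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple


def weighted_textrank(
    edges: Dict[Tuple[str, str], float],
    node_feature: Dict[str, float],
    damping: float = 0.85,
    tol: float = 1e-8,
    max_iter: int = 200,
) -> Dict[str, float]:
    """Rank words on an undirected weighted graph (illustrative only).

    ASSUMPTION: node_feature is normalized into a teleport prior and
    edge weights define the transition probabilities. The paper's actual
    fusion formula is not stated in the abstract.
    """
    nodes = sorted(node_feature)
    neighbours: Dict[str, Dict[str, float]] = {n: {} for n in nodes}
    for (u, v), w in edges.items():
        neighbours[u][v] = w
        neighbours[v][u] = w

    total_feature = sum(node_feature.values())
    prior = {n: node_feature[n] / total_feature for n in nodes}
    strength = {n: sum(neighbours[n].values()) for n in nodes}

    score = dict(prior)
    for _ in range(max_iter):
        new_score = {}
        for n in nodes:
            # Each neighbour m passes on a share of its score
            # proportional to the weight of the edge (m, n).
            incoming = sum(
                score[m] * w / strength[m]
                for m, w in neighbours[n].items()
                if strength[m] > 0
            )
            new_score[n] = (1 - damping) * prior[n] + damping * incoming
        delta = max(abs(new_score[n] - score[n]) for n in nodes)
        score = new_score
        if delta < tol:  # scores have stabilized
            break
    return score


# Toy graph with made-up weights and feature values.
toy_edges = {("forest", "policy"): 0.9, ("policy", "fire"): 0.4, ("forest", "fire"): 0.2}
toy_features = {"forest": 3.0, "policy": 2.0, "fire": 1.0}
toy_scores = weighted_textrank(toy_edges, toy_features)
```

In this sketch the iteration stops once the largest score change falls below `tol`, which plays the role of the "stable graph" after convergence. Isolated words with no edges would only keep their teleport share, so a real pipeline would need to decide how to handle them.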
There are two main problems in obtaining key information from forestry texts. First, key information is considered mainly from the perspective of keywords, and the information types of words are neglected. Second, forestry texts on the Internet have no unified descriptive structure, which makes it difficult to extract word information types. By combining the two elements of "keywords + information types", a method for extracting key information from forestry texts was proposed based on improved TextRank and cluster filtering. The main steps were as follows. The first step was to extract the text keywords according to the keyword extraction formula. The second step was to represent the keywords as vectors with Word2Vec. The third step was to improve the TextRank algorithm, mainly by merging the word features and introducing edge weights to construct the graph model of the text. The fourth step was to obtain stable graph structures through iterative convergence and then merge them to form clusters. The quality of the clusters was evaluated from three aspects: the uniformity of the element distribution, the size of the clusters, and the universality of the clusters. The fifth step was to form the final cluster set in combination with the TextRank algorithm. The final step was to label the final clusters with information types. The data used in the experiments were 2000 forestry texts related to forestry policies and news. The experimental results showed that the compactness of the final cluster set was 0.9680, its separation was 0.0572, and its F1-measure was 0.8871, which indicated that the information types of the clusters could be clearly marked. For the keywords of a text, the information type was obtained by calculating the cosine similarity between the keyword vector and the cluster centres. The combination of keywords and information types constituted the key information of a forestry text. In addition, 400 texts were manually labeled. Compared with six algorithms such as TextRank and TF-IDF, the proposed method achieved better results in MRR, Bpref, accuracy and F1-measure, which showed that it had advantages in extracting keywords from forestry texts.
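The final matching step, in which a keyword takes the information type of the most similar cluster centre, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the three-dimensional toy vectors and the labels "policy" and "species" are all hypothetical, and real Word2Vec vectors would have far more dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_information_type(
    keyword_vector: Sequence[float],
    cluster_centres: Dict[str, Sequence[float]],
) -> str:
    """Return the label of the cluster centre most similar to the keyword."""
    return max(
        cluster_centres,
        key=lambda label: cosine_similarity(keyword_vector, cluster_centres[label]),
    )


def key_information(
    keyword_vectors: Dict[str, Sequence[float]],
    cluster_centres: Dict[str, Sequence[float]],
) -> List[Tuple[str, str]]:
    """Pair each keyword with its information type ("keyword + information type")."""
    return [
        (word, assign_information_type(vec, cluster_centres))
        for word, vec in keyword_vectors.items()
    ]


# Hypothetical labeled cluster centres and keyword vectors.
centres = {"policy": [1.0, 0.0, 0.0], "species": [0.0, 1.0, 0.0]}
keywords = {"regulation": [0.9, 0.1, 0.0], "pine": [0.2, 0.8, 0.1]}
pairs = key_information(keywords, centres)
```

Cosine similarity is used here because the English abstract names it. A distance-based comparison would only change the `max` over similarities into a `min` over distances.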
Authors
陈志泊
李钰曼
许福
冯国明
师栋瑜
崔晓晖
CHEN Zhibo; LI Yuman; XU Fu; FENG Guoming; SHI Dongyu; CUI Xiaohui (School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China; China United Network Communications Group Co., Ltd., Beijing 100033, China; China Telecom System Integration Co., Ltd., Beijing 100035, China)
Source
《农业机械学报》
EI
CAS
CSCD
PKU Core
2020, No. 5, pp. 207-214, 172 (9 pages in total)
Transactions of the Chinese Society for Agricultural Machinery
Funding
National Natural Science Foundation of China (61772078)
Beijing Forestry University Hot Spot Tracking Project (2018BLRD18).