Abstract
Keyword extraction is a fundamental task in natural language research and is widely applied in many fields. Using 8,045 news articles from "Red China" (《红色中华》) provided by the authors' university library as source data, this paper first cleans the data to remove noise and then parses the structure of each article. On that basis it computes, for each word, a TFIDF weight, a word position weight, a part-of-speech weight, a word length weight, and a word span weight, and combines them into a composite weight. The eight words with the highest composite weight are taken as the article's keywords. The improved algorithm, the classical TFIDF algorithm, and expert annotation are compared on three metrics: precision, recall, and F1. The improved algorithm outperforms the classical TFIDF algorithm on all three metrics and comes close to expert annotation, so it merits wider adoption.
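The scoring pipeline summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration: the multiplicative combination of the five weights, the numeric values in `POS_WEIGHT` and `TITLE_BONUS`, the definitions of the span and length weights, and all function names are hypothetical and are not taken from the paper. Input is assumed to be already segmented and part-of-speech tagged.

```python
import math

# All numeric values below are illustrative assumptions, not values from the paper.
POS_WEIGHT = {"n": 1.0, "v": 0.6, "a": 0.4}  # hypothetical part-of-speech weights
DEFAULT_POS_WEIGHT = 0.2                      # hypothetical weight for other tags
TITLE_BONUS = 2.0                             # hypothetical position weight for title words


def tfidf(word, doc_tokens, corpus):
    """Classic TF-IDF with a smoothed IDF so the value is always positive."""
    tf = doc_tokens.count(word) / len(doc_tokens)
    df = sum(1 for doc in corpus if word in doc)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1
    return tf * idf


def span_weight(word, doc_tokens):
    """Fraction of the document covered from the first to the last occurrence."""
    first = doc_tokens.index(word)
    last = len(doc_tokens) - 1 - doc_tokens[::-1].index(word)
    return (last - first + 1) / len(doc_tokens)


def combined_scores(title_tokens, body, corpus):
    """Score every word in `body`, a list of (word, pos_tag) pairs.

    The five weights are multiplied together; this combination rule is an
    assumption, since the abstract only says the weights are considered jointly.
    """
    tokens = [word for word, _ in body]
    pos_of = dict(body)  # keeps the last tag seen for each word
    max_len = max(len(word) for word in tokens)
    scores = {}
    for word in set(tokens):
        position = TITLE_BONUS if word in title_tokens else 1.0
        pos = POS_WEIGHT.get(pos_of[word], DEFAULT_POS_WEIGHT)
        length = len(word) / max_len
        span = 1 + span_weight(word, tokens)
        scores[word] = tfidf(word, tokens, corpus) * position * pos * length * span
    return scores


def top_keywords(scores, k=8):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def precision_recall_f1(predicted, gold):
    """Compare extracted keywords against a reference (e.g. expert) keyword set."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

The default `k=8` mirrors the eight keywords per article stated in the abstract. For Chinese text, segmentation and tagging would have to be done by a separate tool before calling `combined_scores`.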
Author
牛永洁
NIU Yong-jie (College of Mathematics & Computer Science, Yan'an University, Yan'an 716000, China)
Source
《电子设计工程》
2019, Issue 13, pp. 11-15 (5 pages)
Electronic Design Engineering
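Funding
National Social Science Fund of China project (18BTQ042)
Yan'an University Special Project for Research on Continuing Education Teaching Reform (YDJY2016-11)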
Keywords
TFIDF
part of speech
word span
word length
word position