Abstract
Existing keyword extraction algorithms require large amounts of training data and time, have difficulty segmenting common words, and are affected by noise in Internet documents. To address these problems, a fast TF-IWF-based algorithm for extracting keywords from domain documents is proposed. The algorithm computes term weights from simple statistics combined with heuristic knowledge such as word length, position, and part of speech, and improves the speed and accuracy of keyword extraction through document purification and domain-dictionary-based word segmentation. Experiments on 523 documents in the student mental health domain show that the keywords extracted by the algorithm are of higher quality than those obtained with TF-IDF, and that the algorithm runs in O(n) time.
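The abstract does not give the weighting formula, so the following is only a minimal sketch of how a TF-IWF-style score with heuristic factors might look. It assumes the common definition of inverse word frequency (the logarithm of the total number of word occurrences in the corpus divided by the occurrences of the given word). The function name `tf_iwf_keywords`, the add-one smoothing, and the length and title-position multipliers are hypothetical choices made for illustration, not the paper's values. Part-of-speech weighting is omitted because it would require a tagger.

```python
import math
from collections import Counter


def tf_iwf_keywords(doc_tokens, corpus_counts, title_tokens=(), top_k=5):
    """Rank the words of one segmented document by a TF-IWF-style weight.

    doc_tokens    -- list of words from an already segmented document
    corpus_counts -- mapping from word to its occurrence count in the corpus
    title_tokens  -- words appearing in the title (position heuristic)
    """
    corpus_total = sum(corpus_counts.values())
    doc_counts = Counter(doc_tokens)
    doc_total = len(doc_tokens)
    title = set(title_tokens)

    scores = {}
    for word, count in doc_counts.items():
        # Term frequency within this document.
        tf = count / doc_total
        # Inverse word frequency; the +1 smoothing for unseen words is an assumption.
        iwf = math.log(corpus_total / (corpus_counts.get(word, 0) + 1))
        # Hypothetical heuristic factors: longer words and title words score higher.
        length_boost = 1.0 + 0.1 * min(len(word), 4)
        position_boost = 1.5 if word in title else 1.0
        scores[word] = tf * iwf * length_boost * position_boost

    # Highest-scoring words first.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data, invented for illustration only.
sample_corpus = Counter({"的": 1000, "学生": 50, "心理健康": 5, "焦虑": 3})
sample_doc = ["学生", "的", "心理健康", "的", "焦虑", "心理健康"]
print(tf_iwf_keywords(sample_doc, sample_corpus, title_tokens=["心理健康"], top_k=4))
```

Because the corpus counts are precomputed, the scoring loop makes a single pass over the document's distinct words and needs no model training, which is in the spirit of the speed claim above. The final sort over distinct words is an extra cost introduced by this sketch.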
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
PKU Core Journal (北大核心)
2011, No. 6, pp. 2142-2145 (4 pages)
Keywords
keyword extraction
Chinese word segmentation
domain dictionary
heuristic knowledge
time complexity