Algorithm of Document Subject Extraction Based on Word Relevancy
Abstract: On the premise that the words occurring most frequently in a document reflect its subject, a subject extraction algorithm for Chinese documents was designed. The algorithm first preprocesses the target document, then computes the occurrence frequency of each word, and takes the few most frequent words as the subject of the document. The relevancy between words serves as a reference factor when computing the occurrence frequency, and word relevancy itself is computed with a method based on the Chinese knowledge base HowNet (《知网》). Experiments show that the algorithm achieves high accuracy.
Author: Yuan Xiaofeng (袁晓峰)
Source: Journal of Chengdu University (Natural Science Edition) (《成都大学学报(自然科学版)》), 2012, No. 4, pp. 367-369 (3 pages)
Keywords: word relevancy; occurrence frequency; HowNet; subject extraction
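
The pipeline described in the abstract (count word occurrences, adjust the counts using word relevancy, keep the top few words) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's method: the abstract gives no formula, so the scoring rule here (a word's own count plus the counts of other words weighted by their relevancy to it) is assumed, and `extract_subject`, `toy_relevancy`, and `TOY_RELEVANCY` are hypothetical names. The toy table stands in for a HowNet-based relevancy measure, and the input is assumed to be already preprocessed (segmented, stop words removed).

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence


def extract_subject(
    words: Sequence[str],
    relevancy: Callable[[str, str], float],
    top_k: int = 3,
) -> List[str]:
    """Return the top_k words ranked by relevancy-adjusted frequency.

    Assumption (not from the paper): a word's adjusted frequency is its own
    count plus the counts of the other words, each weighted by its relevancy
    to that word.
    """
    counts = Counter(words)
    scores: Dict[str, float] = {}
    for w, c in counts.items():
        related = sum(relevancy(w, v) * n for v, n in counts.items() if v != w)
        scores[w] = c + related
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


# Hypothetical relevancy table standing in for a HowNet-based measure.
TOY_RELEVANCY = {
    frozenset(("computer", "software")): 0.8,
    frozenset(("computer", "network")): 0.6,
}


def toy_relevancy(a: str, b: str) -> float:
    """Look up a symmetric relevancy value; unrelated pairs score 0.0."""
    return TOY_RELEVANCY.get(frozenset((a, b)), 0.0)


# Assumed to be the output of preprocessing (segmentation, stop-word removal).
tokens = ["computer", "software", "software", "network",
          "weather", "weather", "weather"]
print(extract_subject(tokens, toy_relevancy, top_k=2))
```

In this toy input, "computer" occurs only once yet ranks above "weather" (three occurrences), because the related words "software" and "network" contribute to its adjusted score. With a relevancy function that always returns 0.0, the ranking reduces to plain frequency counting.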