Semantic-based Feature Extraction Method for Document

Cited by: 10
Abstract: Selecting feature words from Chinese text is an important part of text processing and has a strong influence on text classification. Existing text feature extraction methods have several shortcomings: they produce high-dimensional feature vectors, depend on the training set, and ignore low-frequency keywords. The proposed method uses the synonym thesaurus 《同义词词林》 (Tongyici Cilin) to compute the semantic distance between words, selects the topic-related words of each category with a density clustering algorithm, and finally selects feature words from the topic-related words using the information gain algorithm. With the macro-F and micro-F values as evaluation metrics, one validation experiment and one comparison experiment show that the method selects text features more effectively than other classical algorithms.
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2016, No. 2, pp. 254-258 (5 pages)
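Funding: National High Technology Research and Development Program of China (2009AA062802); National Natural Science Foundation of China (60473125); CNPC Petroleum Science and Technology Innovation Fund for Young and Middle-aged Researchers (05E7013); sub-project of a National Major Special Project (G5800-08-ZS-WX)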
Keywords: Feature word; Semantic distance; Information gain; Text classification
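
The abstract names information gain as the final selection step but does not give the formula used in the paper. The following is a minimal sketch of the standard textbook definition, assuming each document is a set of words and a term is scored by its presence or absence. The function names, the `candidates` list (standing in for the topic-related words produced by the clustering step, which is not shown), and the cutoff `k` are all hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` occurs.

    docs   -- list of sets of words, one set per document (assumed format)
    labels -- list of class labels, aligned with docs
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


def select_features(docs, labels, candidates, k):
    """Keep the k candidate words with the highest information gain."""
    ranked = sorted(
        candidates,
        key=lambda t: information_gain(docs, labels, t),
        reverse=True,
    )
    return ranked[:k]


# Toy data, invented purely for illustration.
docs = [{"oil", "well"}, {"oil", "price"}, {"match", "goal"}, {"goal", "price"}]
labels = ["energy", "energy", "sports", "sports"]
top_words = select_features(docs, labels, ["price", "oil", "goal"], 2)
```

In this toy data, "oil" and "goal" each separate the two classes perfectly, while "price" appears once in each class and carries no information, so it is the word that is dropped.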