
Text feature word extraction method based on word association (cited by: 10)

Text feature word selection based on relationship between words
Abstract: Describing the features of a text is one of the basic tasks in automatic text processing. Text features are currently described mostly with a weighted Vector Space Model (VSM), which generally relies on statistical and empirical weighting algorithms: the weight of each feature dimension is its TF-IDF value. This approach struggles to emphasize the features that are key to a text's content, and it does not capture the relationships between words, which matter in information extraction. To address this shortcoming, a new feature selection and term weighting method based on key terms and word co-occurrence frequency is proposed. Building on TF-IDF, the method uses the structural information of the text and applies mutual information theory to extract the words that are key to the text's content. The weight calculation combines word position, word dependence and relationships, word frequency, and document frequency. This emphasizes the contribution of key terms, compensates for some defects of using the TF-IDF weighting function alone, and lets the feature vector of a text carry information about word-to-word relationships. In experiments with a KNN classifier (the English abstract of this record names SVM instead), the method shows a clear improvement in average classification accuracy over the traditional TF-IDF method.
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2007, No. 12, pp. 3009-3012 (4 pages)
Keywords: word relationship; word co-occurrence; Vector Space Model (VSM); feature selection; term weighting
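
The abstract gives no formulas, so the following is only a minimal Python sketch of the general idea it describes: start from TF-IDF weights, measure word association with mutual information over sentence-level co-occurrence, and raise the weight of terms that sit in a prominent position or are associated with key terms. The function names, the use of title terms as "key terms", and the parameters `title_boost` and `alpha` are all illustrative assumptions, not the authors' actual method.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf(docs):
    """Baseline TF-IDF. docs is a list of token lists; returns one
    {term: weight} dict per document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append(
            {t: (c / len(d)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


def pmi(sentences):
    """Pointwise mutual information for word pairs that co-occur in a
    sentence. Keys are alphabetically sorted (a, b) tuples."""
    n = len(sentences)
    word = Counter(t for s in sentences for t in set(s))
    pair = Counter(
        p for s in sentences for p in combinations(sorted(set(s)), 2)
    )
    return {
        (a, b): math.log(c * n / (word[a] * word[b]))
        for (a, b), c in pair.items()
    }


def reweight(weights, pmi_scores, title_terms, title_boost=2.0, alpha=0.5):
    """Hypothetical combination rule (an assumption, not from the paper):
    multiply the TF-IDF weight by a position factor for title terms and by
    a factor that grows with positive PMI toward any title term."""
    title_terms = set(title_terms)
    out = {}
    for t, w in weights.items():
        assoc = sum(
            max(v, 0.0)
            for (a, b), v in pmi_scores.items()
            if t in (a, b) and ({a, b} - {t}) & title_terms
        )
        position = title_boost if t in title_terms else 1.0
        out[t] = w * position * (1.0 + alpha * assoc)
    return out
```

With this rule, a term that never co-occurs with a title term keeps its plain TF-IDF weight, so the sketch reduces to the baseline when no association information is available.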