
N-grams Feature Reduction and Weighting Based on the Gini Coefficient
Abstract: Most existing methods for weighting n-grams features are based on occurrence frequency. This approach has two problems. First, an n-grams feature is built from several words, so its frequency depends on all of those words occurring together (the intersection of the individual word occurrences), and the resulting frequency is often too small to give a satisfactory weight. Second, some of the words that make up an n-grams feature may be irrelevant to classification, and traditional methods cannot process the feature any further. To weight n-grams features better and to process them further, the Gini coefficient and the Lorenz curve are used to reduce and weight the words within each n-grams feature, which finally yields a weight for the feature as a whole. Experiments with a support vector machine show that n-grams features reduced and weighted by the Gini coefficient give better classification results than weighting methods such as TF (Term Frequency), which verifies the effectiveness of the algorithm.
Source: Journal of Huaiyin Institute of Technology (《淮阴工学院学报》), CAS, 2016, Issue 1, pp. 25-28 (4 pages)
Keywords: n-grams feature; Gini coefficient; Lorenz curve; support vector machine (SVM)
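
The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that each word is described by its occurrence counts per class, that the Gini coefficient is taken as one minus twice the area under the Lorenz curve of those counts, that words scoring below a cutoff are dropped, and that the n-grams weight is the mean score of the words that remain. The names `lorenz_curve`, `gini`, `reduce_and_weight`, and `threshold` are hypothetical.

```python
def lorenz_curve(counts):
    """Return Lorenz curve points (population share, cumulative share)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must not sum to zero")
    ordered = sorted(counts)          # Lorenz curve needs ascending order
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, value in enumerate(ordered, start=1):
        running += value
        points.append((i / n, running / total))
    return points


def gini(counts):
    """Gini coefficient: 1 - 2 * (area under the Lorenz curve)."""
    points = lorenz_curve(counts)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2   # trapezoid rule
    return 1 - 2 * area


def reduce_and_weight(ngram, class_counts, threshold=0.2):
    """Drop words whose Gini score is below `threshold` (an assumed cutoff)
    and weight the n-gram by the mean score of the surviving words."""
    kept = {}
    for word in ngram:
        score = gini(class_counts[word])
        if score >= threshold:
            kept[word] = score
    weight = sum(kept.values()) / len(kept) if kept else 0.0
    return list(kept), weight


# Made-up per-class counts for two classes, for illustration only.
class_counts = {"the": [5, 5], "stock": [0, 10], "market": [1, 9]}
words, weight = reduce_and_weight(("the", "stock", "market"), class_counts)
print(words, weight)
```

In this toy data "the" is spread evenly over both classes, so its score is 0 and it is removed, while "stock" and "market" are concentrated in one class and are kept. With this definition the maximum score for k classes is (k - 1) / k, so a real implementation might normalize it.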