Abstract
This paper analyzes the shortcomings of the traditional mutual information (MI) feature selection algorithm. To address the problem that MI may assign excessively high weights to low-frequency feature words, the algorithm is improved using two strongly informative feature indicators, word frequency and concentration. The result is an Improved Mutual Information Algorithm based on Word Frequency and Text Category (the improved MIFC). Experimental results show that the feature space extracted by the improved MIFC algorithm achieves higher accuracy than that of the traditional MI algorithm.
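The low-frequency problem mentioned in the abstract can be illustrated with a minimal sketch of the standard pointwise MI score, MI(t, c) = log2(P(t, c) / (P(t) P(c))), estimated from document counts. This is the traditional score only; the abstract does not give the improved MIFC formula, so none is shown here. The function name `pointwise_mi` and all counts below are hypothetical values chosen for illustration.

```python
import math


def pointwise_mi(n_tc, n_t, n_c, n):
    """Standard pointwise MI between a term t and a class c.

    n_tc: documents in class c that contain t
    n_t:  documents that contain t
    n_c:  documents in class c
    n:    total documents
    """
    if n_tc == 0:
        return float("-inf")  # term never co-occurs with the class
    # P(t,c) / (P(t) * P(c)) simplifies to (n_tc * n) / (n_t * n_c)
    return math.log2((n_tc * n) / (n_t * n_c))


# Hypothetical corpus: 1000 documents, 100 of them in class c.
# A term seen in a single document (which happens to be in c):
rare = pointwise_mi(n_tc=1, n_t=1, n_c=100, n=1000)
# A term seen in 100 documents, 80 of them in c:
common = pointwise_mi(n_tc=80, n_t=100, n_c=100, n=1000)

print(rare > common)  # the one-off term outranks the frequent one
```

Under these assumed counts the term that occurs once scores higher than the term covering 80 of the class's 100 documents, which is the kind of bias that weighting by word frequency and concentration is meant to counteract.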
Source
Journal of Jinggangshan University (Natural Science)
2013, Issue 3, pp. 41-44 (4 pages)
Funding
Shanghai Municipal Science and Technology Commission International Cooperation Fund Project (10510712500)
Keywords
mutual information
feature selection
word frequency
text category
MIFC