Abstract
When n-grams are used as text classification features, classification accuracy tends to decrease, and the redundancy and relevance among words are usually ignored when the n-grams are weighted. To address these problems, an n-gram feature weighting algorithm based on relevance and semantics is proposed. During text preprocessing, feature reduction is applied to the n-grams to lower their internal redundancy. The n-grams are then weighted according to the relevance between their constituent words and the classes, and the semantic similarity between the n-grams and the test set. Experiments on the Sogou Chinese news corpus and the NetEase text classification corpus show that the proposed algorithm selects n-gram features with high class relevance and low redundancy, and reduces data sparseness when quantifying the test set.
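The abstract does not specify the exact relevance or similarity measures, so the Python sketch below is only illustrative: it assumes mutual information as the word-class relevance measure and simple vocabulary overlap as the n-gram/test-set similarity, and the helper names word_class_mi and ngram_weight are hypothetical, not from the paper.

from math import log

def word_class_mi(docs, labels):
    """Estimate mutual information I(word; class) from labeled training documents.

    docs: list of sets of words; labels: class label per document.
    """
    n = len(docs)
    vocab = set(w for d in docs for w in d)
    classes = set(labels)
    mi = {}
    for w in vocab:
        present = [w in d for d in docs]
        score = 0.0
        for c in classes:
            for has_w in (True, False):
                n_joint = sum(1 for p, l in zip(present, labels) if p == has_w and l == c)
                if n_joint == 0:
                    continue
                p_joint = n_joint / n
                p_w = sum(1 for p in present if p == has_w) / n
                p_c = sum(1 for l in labels if l == c) / n
                score += p_joint * log(p_joint / (p_w * p_c))
        mi[w] = score
    return mi

def ngram_weight(ngram, mi, test_vocab):
    """Average word-class relevance of the n-gram, scaled by its overlap with the test-set vocabulary."""
    relevance = sum(mi.get(w, 0.0) for w in ngram) / len(ngram)
    overlap = sum(1 for w in ngram if w in test_vocab) / len(ngram)
    return relevance * overlap

# Toy usage (illustrative data only):
docs = [{"股市", "上涨"}, {"球队", "夺冠"}]
labels = ["finance", "sports"]
mi = word_class_mi(docs, labels)
print(ngram_weight(("股市", "上涨"), mi, test_vocab={"股市", "指数"}))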
Source
《模式识别与人工智能》
EI
CSCD
Peking University Core Journals (北大核心)
2015, No. 11, pp. 992-1001 (10 pages)
Pattern Recognition and Artificial Intelligence
Funding
National Natural Science Foundation of China (No.70971059)
Liaoning Provincial Innovation Team Project (No.2009T045)
Program for the Development of Distinguished Young Scholars in Universities of Liaoning Province (No.LJQ2012027)