Abstract
KNN short text classification algorithms improve classification accuracy by expanding the content of short texts, but this expansion lowers classification efficiency. To address this, we use the chi-square statistic (CHI) to extract the category features of each category in the training space. Based on the similarity between the samples of each category and that category's features, the existing training space is split and refined so that each category becomes several training subsets, each containing part of the category's samples. For a given test text, the samples of the training subsets whose category features are highly similar to the test text are extracted from the refined training space to reconstruct a training set for that text. This reduces the number of text pairs that the KNN short text classification algorithm must compare and thus improves its efficiency. Experiments show that, compared with the KNN short text classification algorithm based on HowNet semantics, the proposed algorithm improves classification efficiency by nearly 50% and also improves classification accuracy to some extent.
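The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact splitting rule or similarity measure, so the sketch assumes one training subset per (category, feature word) pair and substitutes plain Jaccard token overlap for the HowNet-based semantic similarity. All function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def chi_square(a, b, c, d):
    """Chi-square score of a term for a class from a 2x2 document-count table.
    a: in class with term, b: outside class with term,
    c: in class without term, d: outside class without term."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def category_features(docs, labels, top_k=3):
    """Select the top_k positively correlated terms per class by chi-square."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    class_size = Counter(labels)
    class_df = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        class_df[y].update(set(doc))
    feats = {}
    for y, size in class_size.items():
        scores = {}
        for t, a in class_df[y].items():
            b, c = df[t] - a, size - a
            d = n - size - b
            if a * d > b * c:  # keep terms that favour this class
                scores[t] = chi_square(a, b, c, d)
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        feats[y] = set(ranked[:top_k])
    return feats

def split_training_set(docs, labels, feats):
    """Assumed splitting rule: one subset per (class, feature word),
    holding the indices of that class's samples containing the word."""
    subsets = defaultdict(list)
    for i, (doc, y) in enumerate(zip(docs, labels)):
        for t in feats[y] & set(doc):
            subsets[(y, t)].append(i)
    return subsets

def jaccard(a, b):
    """Token-overlap similarity (stand-in for HowNet semantic similarity)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def classify(test_doc, docs, labels, subsets, k=3):
    """Rebuild a reduced training set for test_doc, then vote among the
    k nearest candidates. Returns (predicted label, candidates compared)."""
    words = set(test_doc)
    candidates = {i for (_, t), idx in subsets.items() if t in words for i in idx}
    if not candidates:  # no feature matched: fall back to the full set
        candidates = set(range(len(docs)))
    nearest = sorted(candidates, key=lambda i: (-jaccard(test_doc, docs[i]), i))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0], len(candidates)

# Toy, already-tokenized corpus (illustrative data only).
docs = [
    ["stock", "market", "rises"],
    ["bank", "raises", "interest", "rate"],
    ["stock", "prices", "fall"],
    ["team", "wins", "match"],
    ["player", "scores", "goal"],
    ["team", "signs", "player"],
]
labels = ["finance"] * 3 + ["sports"] * 3

feats = category_features(docs, labels)
subsets = split_training_set(docs, labels, feats)
label, compared = classify(["stock", "market", "crash"], docs, labels, subsets)
print(label, compared)
```

On this toy corpus the test text is compared against only the samples in the subsets whose feature word it contains, rather than against all six training samples, which is the source of the efficiency gain the abstract reports. The size of that gain on real data depends on the splitting rule and similarity measure, both of which are assumptions here.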
Source
Computer Engineering & Science (《计算机工程与科学》)
Indexed in CSCD and the Peking University Core Journals list (北大核心)
2018, No. 1, pp. 148-154 (7 pages)
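Funding
National Natural Science Foundation of China (11547148)
Humanities and Social Sciences Youth Fund of the Ministry of Education (15YJC790061)
Science and Technology Research Project of the Chongqing Municipal Education Commission (16SKGH133)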
Keywords
short text classification
KNN classification
category feature
HowNet
efficiency