Abstract
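To address the shortcomings of existing rule-based and statistics-based text classification methods, a novel method that fuses rules with K-nearest-neighbor classification is proposed. First, the traditional vector space model used to describe text features is extended, and the concrete extended model is given. Then, a rule representation based on the extended model is proposed, and each rule is assigned a strength coefficient, by which the recognized texts can be ranked by level. Finally, a threshold is set, and texts whose level falls below it are filtered out. The method effectively excludes texts misrecognized by K-nearest-neighbor classification, thereby improving classification accuracy to some extent. Test experiments on a small data set show that the method is effective and feasible.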
Text classification methods are either rule-based or statistical, and each has its strengths and weaknesses. To address their respective shortcomings, a text classification method that combines K-nearest-neighbor (KNN) classification with rules is proposed. The conventional vector space model (VSM) description is extended, and the extended VSM is described in detail. Based on it, a representation for rules is presented. Each rule is assigned a coefficient indicating its accuracy; the classified documents are ranked by this coefficient, and those whose coefficient falls below a given threshold are filtered out. Documents misidentified by the KNN method are thereby excluded, and precision and recall are improved to a certain extent. Experimental results show that the method is effective and feasible.
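The pipeline described in the abstract (KNN labeling, rule coefficients, ranking, threshold filtering) might be sketched roughly as follows. This is only an illustrative reading, not the authors' implementation. The abstract does not specify the extended VSM or the rule format, so the sketch assumes plain bag-of-words vectors with cosine similarity, a hypothetical rule form of (required terms, label, coefficient), and made-up names such as `RULES`, `rule_strength`, and `classify_and_filter`.

```python
import math
from collections import Counter


def vectorize(text):
    """Plain bag-of-words counts (assumption: the paper's extended VSM is not given here)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def knn_label(doc, training, k=3):
    """Majority label among the k training documents most similar to doc."""
    ranked = sorted(training, key=lambda item: cosine(doc, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical rule format: (required terms, label, strength coefficient in [0, 1]).
RULES = [
    ({"match", "goal"}, "sports", 0.9),
    ({"team"}, "sports", 0.4),
    ({"stock", "market"}, "finance", 0.8),
]


def rule_strength(doc, label, rules=RULES):
    """Largest coefficient among rules for this label whose terms all occur in doc."""
    matched = [c for terms, rule_label, c in rules
               if rule_label == label and terms.issubset(doc)]
    return max(matched, default=0.0)


def classify_and_filter(texts, training, threshold=0.5, k=3):
    """Label each text with KNN, score it with the rules, drop low scores, rank the rest."""
    results = []
    for text in texts:
        doc = vectorize(text)
        label = knn_label(doc, training, k)
        strength = rule_strength(doc, label)
        if strength >= threshold:
            results.append((text, label, strength))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Toy data, invented purely for illustration.
TRAINING = [
    (vectorize("the team scored a goal in the match"), "sports"),
    (vectorize("goal keeper saved the match"), "sports"),
    (vectorize("stock market fell today"), "finance"),
    (vectorize("market stock prices rise"), "finance"),
]
TEXTS = ["late goal wins the match", "stock market rally", "the team bought stock"]

kept = classify_and_filter(TEXTS, TRAINING)
for text, label, strength in kept:
    print(label, strength, text)
```

In this toy run, the third text is labeled by KNN but matches no rule for that label, so it falls below the threshold and is dropped. That mirrors the abstract's idea of using rules to screen out documents that KNN may have misidentified.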
Source
《长江大学学报(自科版)(上旬)》
CAS
2008, No. 2, pp. 92-95 (4 pages)
JOURNAL OF YANGTZE UNIVERSITY (NATURAL SCIENCE EDITION) SCI & ENG
Funding
Supported by the Natural Science Foundation of Heilongjiang Province (Grant No. 11521013)