Abstract
A Chinese patent document automatic classification system is implemented using the KNN algorithm. Because the very large volume of patent documents makes classification inefficient, a sample-pruning technique is used to delete redundant training samples, which improves the efficiency of the classifier. Pruning, however, lets noisy documents accumulate, which degrades KNN classification performance; to counter this, the system uses information gain to select feature terms from the patent documents, weakening the influence of noisy documents on KNN classification. Experiments show that sample pruning together with information-gain-based feature term selection effectively reduces the size of the training set and improves KNN classification accuracy.
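The information-gain step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the system's formula, so this uses the standard presence-based definition, IG(t) = H(C) - P(t)H(C|t) - P(not t)H(C|not t). It also assumes each document has already been segmented into a set of terms. The function names, sample terms, and class labels are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Information gain of `term`, based on its presence or absence in each document.

    docs   : list of sets of terms (assumed to be pre-segmented text)
    labels : list of class labels, one per document
    """
    n = len(docs)
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n_with = sum(with_term.values())
    n_without = n - n_with
    h_class = entropy(list(Counter(labels).values()))
    h_conditional = (
        (n_with / n) * entropy(list(with_term.values()))
        + (n_without / n) * entropy(list(without_term.values()))
    )
    return h_class - h_conditional


def select_terms(docs, labels, k):
    """Return the k terms with the highest information gain (ties broken alphabetically)."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Hypothetical toy corpus: two classes, two documents each.
docs = [
    {"engine", "valve"},
    {"engine", "piston"},
    {"protein", "cell"},
    {"protein", "gene"},
]
labels = ["F", "F", "C", "C"]
top_terms = select_terms(docs, labels, 2)
```

In this toy corpus the terms "engine" and "protein" each separate the two classes perfectly, so they rank first. A KNN classifier would then compare documents using only the selected terms, which is one plausible way that feature selection could reduce the influence of noisy documents.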
Source
Journal of Jiaying University (《嘉应学院学报》)
2010, No. 2, pp. 24-29 (6 pages)
Funding
Soft Science Research Program of the Guangdong Intellectual Property Office (GDIP2008-C16)
Meizhou Science Research Project (08KJ08)