Abstract
Taking the automatic classification of product data in comparison-shopping search as its application background, this paper examines the classification of short text data. It compares the characteristics of commonly used text categorization algorithms, in particular Naive Bayes (NB) and k-Nearest Neighbor (k-NN), and on that basis proposes a multi-classifier scheme that combines the two. When the NB classification cannot be trusted, the item is classified again with k-NN, and the intermediate results of NB are reused as a reference for pruning in k-NN. Experimental data show that, at a time complexity close to that of NB, the method markedly improves the precision and recall of short text classification and meets the requirements of practical application.
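The two-stage idea in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the abstract does not say how an NB result is judged untrustworthy or how NB's intermediate results prune k-NN. Here the NB result is assumed untrustworthy when the gap between the two best log-scores is below a `margin`, and pruning is assumed to mean restricting the k-NN search to training items from the `top_m` classes ranked highest by NB. The class name `NBThenKNN` and all parameters are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Whitespace tokenizer; a real system would use a proper word segmenter."""
    return text.lower().split()


class NBThenKNN:
    """Hypothetical sketch: multinomial NB first, k-NN only when NB is unsure."""

    def __init__(self, margin=1.0, top_m=2, k=3):
        self.margin = margin  # assumed trust rule: minimum gap between top two NB log-scores
        self.top_m = top_m    # assumed pruning rule: k-NN only searches NB's top_m classes
        self.k = k

    def fit(self, texts, labels):
        self.docs = [(Counter(tokenize(t)), y) for t, y in zip(texts, labels)]
        self.class_docs = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for bag, y in self.docs:
            self.word_counts[y].update(bag)
        self.vocab = {w for bag, _ in self.docs for w in bag}
        return self

    def nb_scores(self, text):
        """Return (class, log-score) pairs, best first, with Laplace smoothing."""
        n_docs, v = len(self.docs), len(self.vocab)
        scores = {}
        for c, n_c in self.class_docs.items():
            total = sum(self.word_counts[c].values())
            s = math.log(n_c / n_docs)
            for w in tokenize(text):
                s += math.log((self.word_counts[c][w] + 1) / (total + v))
            scores[c] = s
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    @staticmethod
    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a if w in b)
        na = math.sqrt(sum(x * x for x in a.values()))
        nb = math.sqrt(sum(x * x for x in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def predict(self, text):
        """Return (label, stage), where stage is "nb" or "knn"."""
        ranked = self.nb_scores(text)
        if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= self.margin:
            return ranked[0][0], "nb"
        # NB is not trusted: reuse its ranking to prune the k-NN search space.
        candidates = {c for c, _ in ranked[: self.top_m]}
        query = Counter(tokenize(text))
        nearest = sorted(
            ((self.cosine(query, bag), y) for bag, y in self.docs if y in candidates),
            key=lambda pair: pair[0],
            reverse=True,
        )[: self.k]
        votes = Counter()
        for sim, y in nearest:
            votes[y] += sim  # similarity-weighted vote
        return votes.most_common(1)[0][0], "knn"


clf = NBThenKNN().fit(
    ["nokia mobile phone", "samsung mobile phone",
     "canon digital camera", "nikon digital camera"],
    ["phone", "phone", "camera", "camera"],
)
print(clf.predict("nokia mobile phone"))  # clear NB margin, so NB decides
print(clf.predict("samsung camera"))      # narrow NB margin, so k-NN decides
```

Because k-NN runs only on the items NB cannot decide, and only over the candidate classes NB proposes, most items cost a single NB pass, which is consistent with the abstract's claim of a time complexity close to that of NB.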
Source
《微型电脑应用》
2007, No. 2, pp. 19-21, 4-5 (3 pages in total)
Microcomputer Applications
Keywords
Text categorization
Short text
Naive Bayes (NB)
k-Nearest Neighbor (k-NN)