Automatic Classification of Short Text Data (短文本数据的自动分类)

Short Text Categorization


Abstract: Taking the automatic classification of product data in comparison-shopping search as its application background, this paper examines the classification of short text data. It compares the characteristics of commonly used text categorization algorithms and, on that basis, proposes a multi-classifier scheme that combines k-Nearest Neighbor (k-NN) with Naive Bayes (NB): when the NB classification is not trustworthy, the item is reclassified with k-NN, and the intermediate results of NB are reused as a reference when pruning in k-NN. Experimental data show that, at a time complexity close to that of NB, the method markedly improves the precision and recall of short text classification and meets the requirements of practical use.
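The two-stage idea in the abstract (NB first, k-NN only when NB is not trustworthy, with NB's intermediate results narrowing the k-NN search) can be illustrated with a minimal Python sketch. The abstract does not say how trustworthiness is measured or how the pruning works, so everything specific here is an assumption made for illustration: the posterior-margin test (margin), restricting k-NN to the top_m classes ranked by NB, cosine similarity over word counts, and the toy product titles. It is not the authors' implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """Collect the counts a multinomial NB needs from (tokens, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def nb_posteriors(model, tokens):
    """Return {label: posterior} from Laplace-smoothed multinomial NB."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    log_scores = {}
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for t in tokens:
            score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    top = max(log_scores.values())  # subtract the max for numerical stability
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(samples, model, tokens, margin=0.5, top_m=2, k=3):
    """Return (label, stage), where stage is 'nb' or 'knn'."""
    post = nb_posteriors(model, tokens)
    ranked = sorted(post, key=post.get, reverse=True)
    # Assumed trust test: NB is accepted when its best class leads by `margin`.
    if len(ranked) == 1 or post[ranked[0]] - post[ranked[1]] >= margin:
        return ranked[0], "nb"
    # Assumed pruning: k-NN only looks at samples from NB's top_m classes.
    candidates = set(ranked[:top_m])
    query = Counter(tokens)
    neighbours = sorted(
        ((cosine(query, Counter(toks)), label)
         for toks, label in samples if label in candidates),
        key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter()
    for sim, label in neighbours:
        votes[label] += sim  # similarity-weighted vote
    if not votes or max(votes.values()) == 0:
        return ranked[0], "nb"  # no usable neighbour: keep the NB answer
    return votes.most_common(1)[0][0], "knn"

# Hypothetical product titles, already tokenized.
samples = [
    (["nokia", "mobile", "phone"], "phones"),
    (["samsung", "mobile", "phone"], "phones"),
    (["canon", "digital", "camera"], "cameras"),
    (["sony", "digital", "camera"], "cameras"),
]
model = train_nb(samples)
print(classify(samples, model, ["nokia", "mobile", "phone"]))
print(classify(samples, model, ["sony", "phone"]))
```

Because the NB posteriors are already computed in the first stage, reusing their ranking to limit the k-NN candidates costs nothing extra. That is one plausible reading of how the combined scheme stays close to NB's running time.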

Authors: 宋东风 (Song Dongfeng), 张志浩 (Zhang Zhihao)

Affiliation: Department of Computer Science, Tongji University

Source: 《微型电脑应用》 (Microcomputer Applications), 2007, No. 2, pp. 19-21, 4-5 (3 pages in total)

Keywords: text categorization; short text; Naive Bayes (NB); k-Nearest Neighbor (k-NN)

Classification number: F724.6 [Economics and Management: Industrial Economics]




