A Web Text Classification Method Based on an Improved Bayesian Algorithm (cited by: 1)

A Modified Complement Naive Bayes for Chinese Web Page Classification


Abstract: Complement naive Bayes classifiers lose accuracy on skewed data sets because of overfitting. This paper proposes a modified complement naive Bayes algorithm that uses a better estimate of the class prior probability. Experiments on the effect of data skew show that the modified algorithm keeps a high classification accuracy on skewed data, stays robust as the skew changes, and achieves higher precision than other naive Bayes algorithms. The paper also discusses how Web page classification differs from plain text classification, builds a title-weighted document vector space model, and analyzes the effect of the title weight on classifier precision. Experiments show that the title-weighted vector space model improves classification precision by 5% on average compared with plain text classification.
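The abstract gives no formulas, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It uses the standard complement naive Bayes weights from Rennie et al. (2003), not the paper's modified prior estimate, and the title weighting simply counts each title term `title_weight` times. The function names, the `title_weight` and `alpha` parameters, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def title_weighted_counts(title_tokens, body_tokens, title_weight=3.0):
    """Term counts where each title occurrence counts title_weight times (assumed scheme)."""
    counts = Counter()
    for tok in body_tokens:
        counts[tok] += 1.0
    for tok in title_tokens:
        counts[tok] += title_weight
    return counts


def train_complement_nb(docs, labels, alpha=1.0):
    """Standard complement naive Bayes: weight[c][t] = log P(t | every class except c)."""
    vocab = sorted({t for d in docs for t in d})
    classes = sorted(set(labels))
    total = Counter()
    per_class = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        for term, value in doc.items():
            per_class[label][term] += value
            total[term] += value
    grand_total = sum(total.values())
    weights = {}
    for c in classes:
        class_total = sum(per_class[c].values())
        # Smoothed mass of all terms in the complement of class c.
        denom = grand_total - class_total + alpha * len(vocab)
        weights[c] = {
            t: math.log((total[t] - per_class[c][t] + alpha) / denom) for t in vocab
        }
    return weights


def predict(weights, doc):
    """Choose the class whose complement fits the document worst (lowest score)."""
    def score(c):
        return sum(v * weights[c][t] for t, v in doc.items() if t in weights[c])
    return min(weights, key=score)


# Toy data: (title tokens, body tokens, label). Purely illustrative.
train = [
    (["football", "match"], ["team", "goal", "match"], "sports"),
    (["league", "final"], ["goal", "team", "win"], "sports"),
    (["stock", "market"], ["bank", "price", "market"], "finance"),
]
docs = [title_weighted_counts(title, body) for title, body, _ in train]
labels = [label for _, _, label in train]
model = train_complement_nb(docs, labels)
query = title_weighted_counts(["market"], ["price", "team"])
print(predict(model, query))
```

Because each class is scored against the pooled counts of all the other classes, a class with few training documents is less penalized than in ordinary naive Bayes, which is the usual motivation for the complement formulation on skewed data.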

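Authors: 徐小伟 (Xu Xiaowei), 成亚谊 (Cheng Yayi)

Affiliation: College of Computer Science, Sichuan University (四川大学计算机学院)

Source: 《现代计算机（中旬刊）》 (Modern Computer), 2012, No. 4, pp. 3-7 (5 pages)

Funding: National 863 High-Tech Program (No. 2008AA01Z119)

Keywords: naive Bayes; complement naive Bayes; Web text classification; skewed data distribution

Classification code: TP393.08 [Automation and Computer Technology: Computer Application Technology]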


