Improved automatic classification algorithm for software bug reports in a cloud environment


Abstract: User-submitted software bug reports tend to be arbitrary, subjective, and short, so automatic classification accuracy is low and considerable manual intervention is required. As the Internet grows, the number of submitted bug reports keeps increasing, and improving the accuracy of automatic classification over massive data is attracting growing attention. By improving Term Frequency-Inverse Document Frequency (TF-IDF) to account for how a term's occurrence both across classes and within a class affects text classification, an improved multinomial Naive Bayes (NB) algorithm based on a software bug report dataset is proposed. A distributed version of the algorithm is also implemented on the Hadoop platform using the MapReduce computing model. Experimental results show that the improved multinomial Naive Bayes algorithm raises the F1 measure to 71%, which is 27 percentage points higher than the original algorithm. On massive data, running time can be shortened by adding compute nodes, giving good execution efficiency.
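The general idea of feeding TF-IDF weights into a multinomial Naive Bayes classifier can be sketched as follows. This is only an illustrative baseline under stated assumptions: the abstract does not give the authors' inter-class and intra-class adjustment, so plain smoothed TF-IDF is used as a stand-in, and the names `train`, `predict`, `DOCS`, and `LABELS` as well as the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels, alpha=1.0):
    """Fit a TF-IDF-weighted multinomial Naive Bayes model.

    docs   : list of token lists (one per bug report)
    labels : list of class names, aligned with docs
    alpha  : Laplace smoothing constant
    NOTE: plain TF-IDF stands in for the paper's improved weighting,
    which is not specified in the abstract.
    """
    n_docs = len(docs)
    vocab = sorted({t for d in docs for t in d})
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    # Smoothed IDF, so that no term gets a zero or undefined weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    # Per-class sum of tf * idf for every term seen in that class.
    weight = defaultdict(lambda: defaultdict(float))
    for tokens, y in zip(docs, labels):
        for t, tf in Counter(tokens).items():
            weight[y][t] += tf * idf[t]

    model = {"prior": {}, "loglik": {}}
    for y, n_y in Counter(labels).items():
        total = sum(weight[y].values())
        model["prior"][y] = math.log(n_y / n_docs)
        # Smoothed log-likelihood of each vocabulary term given class y.
        model["loglik"][y] = {
            t: math.log((weight[y].get(t, 0.0) + alpha)
                        / (total + alpha * len(vocab)))
            for t in vocab
        }
    return model


def predict(model, tokens):
    """Return the class with the highest log-posterior score."""
    scores = {}
    for y, prior in model["prior"].items():
        loglik = model["loglik"][y]
        # Terms outside the training vocabulary are ignored.
        scores[y] = prior + sum(loglik[t] for t in tokens if t in loglik)
    return max(scores, key=scores.get)


# Hypothetical toy corpus of tokenized bug reports.
DOCS = [
    ["crash", "on", "startup"],
    ["crash", "null", "pointer"],
    ["button", "color", "wrong"],
    ["font", "color", "ui"],
]
LABELS = ["core", "core", "ui", "ui"]

model = train(DOCS, LABELS)
print(predict(model, ["crash", "pointer"]))
print(predict(model, ["color", "font"]))
```

The per-class weight sums and document frequencies are plain additive counts, which is why this kind of training step maps naturally onto a map-and-reduce counting pattern; how the authors actually laid out their Hadoop jobs is not described in this record.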

Authors: HUANG Wei (黄伟), LIN Jie (林劼), JIANG Yu'e (江育娥)

Affiliation: Faculty of Software, Fujian Normal University (福建师范大学软件学院)

Source: Journal of Computer Applications (《计算机应用》), 2016, No. 5, pp. 1212-1215, 1221 (5 pages). Indexed in CSCD and the Peking University Core Journals list.

Funding: National Natural Science Foundation of China (61472082); Natural Science Foundation of Fujian Province (2014J01220)

Keywords: multinomial Naive Bayes; bug report; automatic text classification; Term Frequency-Inverse Document Frequency (TF-IDF); cloud computing

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]


