Journal Article

Bayesian classification algorithm of dynamic data stream based on bootstrap

Cited by: 3
Abstract: Dynamic data streams are characterized by large volume, rapid change, costly random access, and the difficulty of storing detailed data, so mining them places high demands on computing power and storage capacity. To address these characteristics, a Bayesian classification algorithm for dynamic data streams based on bootstrap sampling is proposed. The algorithm processes and analyzes the stream with a sliding window model, taking the data of each window as the basic unit of analysis. Bootstrap sampling is used to prune and optimize the attributes of the data to be classified, which resolves the multicollinearity among attributes. Drawing on the characteristics of the Bayesian algorithm, a dynamic incremental storage tree is used to store the sample data stream, achieving static, finite, distortion-free storage of an unbounded dynamic data stream and thereby addressing the hardest problem in dynamic data stream mining: data storage. The optimized data are classified with an all-Bayesian classifier and a k-Bayesian classifier, both of which are updated in real time according to the properties of the stream. The algorithm thus relaxes the attribute-independence assumption of Bayesian classification and removes the restriction of traditional Bayesian classifiers to static data. Experimental tests show that the bootstrap-based Bayesian classification achieves high timeliness and accuracy.
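The abstract only outlines the pipeline (sliding window over the stream, bootstrap resampling of each window, incrementally updated Bayesian classifiers), so a minimal sketch may help make the data flow concrete. The sketch below is not the authors' implementation: the window size, the plain count dictionaries standing in for the paper's dynamic incremental storage tree, and the single naive Bayes model standing in for the all-Bayes/k-Bayes pair are all illustrative assumptions, and the bootstrap-based attribute pruning is not reproduced.

```python
# Minimal sketch (illustrative only, not the paper's code) of the pipeline in the
# abstract: tumbling windows over a labelled data stream, a bootstrap resample of
# each full window, and a naive Bayes classifier whose counts are updated
# incrementally. Window size, the count dictionaries (in place of the paper's
# dynamic incremental storage tree), and the single classifier (in place of the
# all-Bayes / k-Bayes pair) are assumptions made for illustration.
import math
import random
from collections import defaultdict


class IncrementalNaiveBayes:
    """Naive Bayes over discrete attributes; counts only grow, never recomputed."""

    def __init__(self):
        self.total = 0
        self.class_counts = defaultdict(int)   # class label -> #examples
        self.value_counts = defaultdict(int)   # (class, attribute index, value) -> #examples

    def update(self, x, y):
        """Absorb one labelled example (x is a tuple of discrete attribute values)."""
        self.total += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.value_counts[(y, i, v)] += 1

    def predict(self, x):
        """Return the class with the highest Laplace-smoothed log-posterior score."""
        best_class, best_score = None, float("-inf")
        for c, n_c in self.class_counts.items():
            score = math.log(n_c / self.total)
            for i, v in enumerate(x):
                # Add-one smoothing; the denominator assumes binary attributes here.
                score += math.log((self.value_counts[(c, i, v)] + 1) / (n_c + 2))
            if score > best_score:
                best_class, best_score = c, score
        return best_class


def bootstrap_sample(window, rng):
    """Draw len(window) examples from the window uniformly with replacement."""
    return [rng.choice(window) for _ in window]


def process_stream(stream, window_size=100, seed=0):
    """Consume (x, y) pairs window by window and keep the model up to date."""
    rng = random.Random(seed)
    model = IncrementalNaiveBayes()
    buffer = []
    for x, y in stream:
        buffer.append((x, y))
        if len(buffer) == window_size:
            # Each full window is resampled by bootstrap before being absorbed,
            # so the model tracks recent data without storing the raw stream.
            for bx, by in bootstrap_sample(buffer, rng):
                model.update(bx, by)
            buffer.clear()
    return model


if __name__ == "__main__":
    rng = random.Random(1)

    def toy_stream(n):
        # Two binary attributes; class "A" favours 0s and class "B" favours 1s.
        for _ in range(n):
            y = rng.choice(["A", "B"])
            bias = 0 if y == "A" else 1
            yield tuple(bias if rng.random() < 0.8 else 1 - bias for _ in range(2)), y

    model = process_stream(toy_stream(2000), window_size=100)
    print(model.predict((0, 0)), model.predict((1, 1)))   # expected: A B
```

Because the model keeps only running counts, each processed window adds to the stored statistics and the raw examples can be discarded, which is the spirit of the static, finite storage the abstract describes; the actual algorithm additionally prunes correlated attributes via bootstrap and maintains two cooperating classifiers.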
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2011, No. 8, pp. 118-121, 142 (5 pages).
Funding: National Natural Science Foundation of China (No.70671094); Zhejiang Science and Technology Program (No.2008C14061); Zhejiang Provincial Natural Science Foundation Key Project (No.Z1091224); Zhejiang Provincial Natural Science Foundation (No.Y1090617).
Keywords: data stream; bootstrap; Bayesian classification; sliding window; incremental storage tree

References (11)

  • 1 Widmer G, Kubat M. Learning in the presence of concept drift and hidden contexts[J]. Machine Learning, 1996, 23(1): 69-101.
  • 2 Hulten G, Spencer L, Domingos P. Mining time-changing data streams[C]//Proc of the Int'l Conf on Knowledge Discovery and Data Mining. New York: ACM Press, 2001: 97-106.
  • 3 Wang Hai-xun, Han Jia-wei. Mining concept-drifting data streams using ensemble classifiers[C]//Proc of the Int'l Conf on Knowledge Discovery and Data Mining. New York: ACM Press, 2003.
  • 4 Xie Q H. An efficient approach for mining concept-drifting data streams[D]. Tainan, China: National University of Tainan, 2004.
  • 5 Webb G I, Boughton J R, Wang Z. Aggregating one-dependence estimators[J]. Machine Learning, 2005, 58(1): 5-24.
  • 6 Giannella C, Han J, Pei J, et al. Mining frequent patterns in data streams at multiple time granularities[J]. Next Generation Data Mining, 2003: 191-212.
  • 7 Mozina M, Demsar J, Kattan M, et al. Nomograms for visualization of naive Bayesian classifier[C]//Proc of PKDD-2004, 2004: 337-348.
  • 8 Shi Hongbo, Huang Houkuan, Wang Zhihai. Boosting-based combined TAN classifiers[J]. 计算机研究与发展 (Journal of Computer Research and Development), 2004, 41(2): 340-345. (Cited by: 14)
  • 9 Liu Junqiang, Sun Xiaoying, Zhuang Yueting, Pan Yunhe. High-performance algorithms for mining closed patterns[J]. 软件学报 (Journal of Software), 2004, 15(1): 94-102. (Cited by: 19)
  • 10 Efron B. Bootstrap methods: Another look at the jackknife[J]. Ann Statist, 1979, 7(1): 1-26.
