Journal Article

Bayesian classification algorithm of dynamic data stream based on bootstrap

Cited by: 3
Abstract: Dynamic data streams are characterized by large volume, rapid change, costly random access, and the difficulty of storing detailed data, so mining them places high demands on computing power and storage capacity. To address these characteristics, a Bayesian classification algorithm for dynamic data streams based on bootstrap sampling is proposed. The algorithm processes and analyzes the stream with a sliding window model, taking the data of each window as the basic unit of analysis. Bootstrap sampling is used to prune and optimize the attributes of the data to be classified, which resolves the multicollinearity among attributes. Drawing on the characteristics of the Bayesian algorithm, a dynamic incremental storage tree is used to store the sample data stream, achieving static, finite, distortion-free storage of an unbounded dynamic data stream and thereby addressing the hardest problem in dynamic data stream mining: data storage. The optimized data are classified with an all-Bayesian classifier and a k-Bayesian classifier, both of which are updated in real time according to the properties of the stream. The algorithm thus relaxes the attribute-independence assumption of Bayesian classification and removes the restriction of traditional Bayesian classifiers to static data. Experimental tests show that the bootstrap-based Bayesian classification achieves high timeliness and accuracy.
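The abstract only outlines the pipeline (sliding window over the stream, bootstrap resampling of each window, incrementally updated Bayesian classifiers), so a minimal sketch may help make the data flow concrete. The sketch below is not the authors' implementation: the window size, the plain count dictionaries standing in for the paper's dynamic incremental storage tree, and the single naive Bayes model standing in for the all-Bayes/k-Bayes pair are all illustrative assumptions, and the bootstrap-based attribute pruning is not reproduced.

```python
# Minimal sketch (illustrative only, not the paper's code) of the pipeline in the
# abstract: tumbling windows over a labelled data stream, a bootstrap resample of
# each full window, and a naive Bayes classifier whose counts are updated
# incrementally. Window size, the count dictionaries (in place of the paper's
# dynamic incremental storage tree), and the single classifier (in place of the
# all-Bayes / k-Bayes pair) are assumptions made for illustration.
import math
import random
from collections import defaultdict


class IncrementalNaiveBayes:
    """Naive Bayes over discrete attributes; counts only grow, never recomputed."""

    def __init__(self):
        self.total = 0
        self.class_counts = defaultdict(int)   # class label -> #examples
        self.value_counts = defaultdict(int)   # (class, attribute index, value) -> #examples

    def update(self, x, y):
        """Absorb one labelled example (x is a tuple of discrete attribute values)."""
        self.total += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.value_counts[(y, i, v)] += 1

    def predict(self, x):
        """Return the class with the highest Laplace-smoothed log-posterior score."""
        best_class, best_score = None, float("-inf")
        for c, n_c in self.class_counts.items():
            score = math.log(n_c / self.total)
            for i, v in enumerate(x):
                # Add-one smoothing; the denominator assumes binary attributes here.
                score += math.log((self.value_counts[(c, i, v)] + 1) / (n_c + 2))
            if score > best_score:
                best_class, best_score = c, score
        return best_class


def bootstrap_sample(window, rng):
    """Draw len(window) examples from the window uniformly with replacement."""
    return [rng.choice(window) for _ in window]


def process_stream(stream, window_size=100, seed=0):
    """Consume (x, y) pairs window by window and keep the model up to date."""
    rng = random.Random(seed)
    model = IncrementalNaiveBayes()
    buffer = []
    for x, y in stream:
        buffer.append((x, y))
        if len(buffer) == window_size:
            # Each full window is resampled by bootstrap before being absorbed,
            # so the model tracks recent data without storing the raw stream.
            for bx, by in bootstrap_sample(buffer, rng):
                model.update(bx, by)
            buffer.clear()
    return model


if __name__ == "__main__":
    rng = random.Random(1)

    def toy_stream(n):
        # Two binary attributes; class "A" favours 0s and class "B" favours 1s.
        for _ in range(n):
            y = rng.choice(["A", "B"])
            bias = 0 if y == "A" else 1
            yield tuple(bias if rng.random() < 0.8 else 1 - bias for _ in range(2)), y

    model = process_stream(toy_stream(2000), window_size=100)
    print(model.predict((0, 0)), model.predict((1, 1)))   # expected: A B
```

Because the model keeps only running counts, each processed window adds to the stored statistics and the raw examples can be discarded, which is the spirit of the static, finite storage the abstract describes; the actual algorithm additionally prunes correlated attributes via bootstrap and maintains two cooperating classifiers.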
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2011, No. 8, pp. 118-121, 142 (5 pages).
Funding: National Natural Science Foundation of China (No.70671094); Zhejiang Science and Technology Program (No.2008C14061); Zhejiang Provincial Natural Science Foundation Key Project (No.Z1091224); Zhejiang Provincial Natural Science Foundation (No.Y1090617).
Keywords: data stream; bootstrap; Bayesian classification; sliding window; incremental storage tree

References (11)

  • 1 Widmer G, Kubat M. Learning in the presence of concept drift and hidden contexts[J]. Machine Learning, 1996, 23(1): 69-101.
  • 2 Hulten G, Spencer L, Domingos P. Mining time-changing data streams[C]//Proc of the Int'l Conf on Knowledge Discovery and Data Mining. New York: ACM Press, 2001: 97-106.
  • 3 Wang Hai-xun, Han Jia-wei. Mining concept-drifting data streams using ensemble classifiers[C]//Proc of the Int'l Conf on Knowledge Discovery and Data Mining. New York: ACM Press, 2003.
  • 4 Xie Q H. An efficient approach for mining concept-drifting data streams[D]. Tainan, China: National University of Tainan, 2004.
  • 5 Webb G I, Boughton J R, Wang Z. Aggregating one-dependence estimators[J]. Machine Learning, 2005, 58(1): 5-24.
  • 6 Giannella C, Han J, Pei J, et al. Mining frequent patterns in data streams at multiple time granularities[J]. Next Generation Data Mining, 2003: 191-212.
  • 7 Mozina M, Demsar J, Kattan M, et al. Nomograms for visualization of naive Bayesian classifier[C]//Proc of PKDD-2004, 2004: 337-348.
  • 8 Shi Hongbo, Huang Houkuan, Wang Zhihai. Boosting-based combined TAN classifiers[J]. 计算机研究与发展 (Journal of Computer Research and Development), 2004, 41(2): 340-345. (Cited by: 14)
  • 9 Liu Junqiang, Sun Xiaoying, Zhuang Yueting, Pan Yunhe. High-performance algorithms for mining closed patterns[J]. 软件学报 (Journal of Software), 2004, 15(1): 94-102. (Cited by: 19)
  • 10 Efron B. Bootstrap methods: Another look at the jackknife[J]. Ann Statist, 1979, 7(1): 1-26.
