期刊文献+

流式大数据下随机森林方法及应用 被引量:5

Random Forest Method and Application in Stream Big Data Systems
下载PDF
导出
摘要 流式计算形态下的大数据分析一直是当前需要解决的问题,而且研究成果和实践经验较少。随机森林方法是目前应用较多的分类算法,但在流式计算应用场景中,数据所呈现出来的实时性、易失性、无序性等特征会使得算法准确度逐渐降低。针对这个问题,分析了随机森林的算法特点,提出了根据决策树的准确度进行随机森林剪枝的思路。同时为了适应数据的变化,结合准确度间隔的概念提出生成、验证并补充新决策树的方法,最终形成可以不断随数据更新的随机森林,满足流式大数据环境对算法的要求。使用实际数据对改进后方法的可行性进行了验证,证明新方法在真实流式大数据场景中有着更高的分类准确度,最后分析讨论了随机森林方法如何进一步研究改进的主题。 Stream computing is an important form widely applied classification algorithms at present. only huge number of features but also constantly c of big data computing. Random forest method is one of the most From the actual requirements, random forest method faces not hanging data pattern over time. The algorithm without self renewal and adaptive algorithm will gradually reduce over time. accuracy of a random forest Aiming at this problem, this paper analyzes the characteristics of random forest algorithm, gives a new pruning idea according to the accuracy of the decision trees. In order to adapt to the change of data, a new random method based on margin is presented. This new method can update itself constantly and can be applied in stream big data environments. Using the actual data, the new method is verified has higher accuracy in classification, and analysis and discussion of how to further re- search and improve the random forest method in big data environment.
出处 《西北工业大学学报》 EI CAS CSCD 北大核心 2015年第6期1055-1061,共7页 Journal of Northwestern Polytechnical University
关键词 决策树 随机森林方法 大数据 流式计算 社交网站 搜索引擎 分类器 剪枝 客户评分 分布式系统 decision tree, random forest, big data, stream computing, social network, searching engine, classifi-er, pruning, customer rating, distributed system
  • 相关文献

参考文献11

二级参考文献342

  • 1Zhou MQ, Zhang R, Zeng DD, Qian WN, Zhou AY. Join optimization in the MapReduce environment for column-wise data store. In: Fang YF, Huang ZX, eds. Proc. of the SKG. Ningbo: IEEE Computer Society, 2010.97-104. [doi: 10.1109/SKG.2010.18].
  • 2Afrati FN, Ullman JD. Optimizing joins in a Map-Reduce environment. In: Manolescu I, Spaecapietra S, Teubner J, Kitsuregawa M, Leger A, Naumann F, Ailamaki A, Ozcan F, eds. Proc. of the EDBT. Lausanne: ACM Press, 2010. 99-110. [doi: 10.1145/ 1739041.1739056].
  • 3Sandholm T, Lai K. MapReduce optimization using regulated dynamic prioritization. In: Douceur JR, Greenberg AG, Bonald T, Nieh J, eds. Proc. of the SIGMETRICS. Seattle: ACM Press, 2009. 299-310. [doi: 10.1145/1555349.1555384].
  • 4Hoefler T, Lumsdaine A, Dongarra J. Towards; efficient MapReduce using MPI. In: Oster P, ed. Proc. of the EuroPVM/MPI. Berlin: Springer-Verlag, 2009. 240-249. [doi: 10.100'7/978-3-642-03770-2_30].
  • 5Nykiel T, Potamias M, Mishra C, Kollios G, Koudas N. MRShare: Sharing across multiple queries in MapReduce. PVLDB, 2010, 3(1-2):494-505.
  • 6Kambatla K, Rapolu N, Jagannathan S, Grama A. Asynchronous algorithms in MapReduce. In: Moreira JE, Matsuoka S, Pakin S, Cortes T, eds. Proc. of the CLUSTER. Crete: IEEE Press, 2010. 245-254. [doi: 10.1109/CLUSTER.2010.30].
  • 7Polo J, Carrera D, Becerra Y, Torres J, Ayguad6 E, Steinder M, Whalley I. Performance-Driven task co-scheduling for MapReduce environments. In: Tonouchi T, Kim MS, eds. Proc. of the 1EEE Network Operations and Management Symp. (NOMS). Osaka: IEEE Press, 2010. 373-380. [doi: 10.1109/NOMS.2010.5488494].
  • 8Zaharia M, Konwinski A, Joseph AD, Katz R, Stoica I. Improving MapReduce performance in heterogeneous environments. In: Draves R, van Renesse R, eds. Proc. of the ODSI. Berkeley: USENIX Association, 2008.29-42.
  • 9Xie J, Yin S, Ruan XJ, Ding ZY, Tian Y, Majors J, Manzanares A, Qin X. Improving MapReduce performance through data placement in heterogeneous Hadoop clusters. In: Taufer M, Rfinger G, Du ZH, eds. Proc. of the Workshop on Heterogeneity in Computing (IPDPS 2010). Atlanta: IEEE Press, 2010. 1-9. [doi: 10.1109/IPDPSW.2010.5470880].
  • 10Polo J, Carrera D, Becerra Y, Beltran V, Torres J, Ayguad6 E. Performance management of accelerated MapReduce workloads in heterogeneous clusters. In: Qin F, Barolli L, Cho SY, eds. Proc. of the ICPP. San Diego: IEEE Press, 2010. 653-662. [doi: 10.1109/ ICPP.2010.73].

共引文献4744

同被引文献38

引证文献5

二级引证文献99

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部