Research on a Sensitivity-Based Random Forest Classification Algorithm in a Hadoop Environment

Original title: Hadoop环境下基于敏感度的随机森林分类算法研究

Cited by: 2

Abstract: In big data settings, the Random Forest classification algorithm tends to neglect minority-class samples and runs inefficiently when applied to imbalanced datasets. To address these problems, this paper proposes a sensitivity-based Random Forest classification algorithm for the Hadoop environment. The algorithm borrows methods from feature selection in text classification and is parallelized on the Hadoop cloud computing platform using the MapReduce programming model. Experiments compare the proposed algorithm with the traditional Random Forest classification algorithm on imbalanced data. The results show that the proposed algorithm significantly improves on the performance of the traditional Random Forest classification algorithm and is efficient and easy to scale.
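The abstract describes the parallelization only at a high level, so the following is a minimal sketch of the general map/reduce pattern for ensemble training, not the paper's method. It simulates the two phases in plain Python rather than on Hadoop: each map task trains one weak learner on a bootstrap sample of its data partition, and the reduce step combines predictions by majority vote. The class-balanced bootstrap is a generic stand-in for handling imbalance and is not the sensitivity-based technique the paper proposes. All names here (`balanced_bootstrap`, `train_stump`, `map_task`, `reduce_vote`) and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balanced_bootstrap(partition, rng):
    """Draw the same number of samples (with replacement) from every class."""
    by_label = defaultdict(list)
    for features, label in partition:
        by_label[label].append((features, label))
    k = min(len(rows) for rows in by_label.values())  # minority-class size
    sample = []
    for label in sorted(by_label):
        sample.extend(rng.choice(by_label[label]) for _ in range(k))
    return sample


def train_stump(sample):
    """Fit a one-split threshold classifier, a stand-in for a full decision tree."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    # Fallback: no usable split, always predict the majority label.
    best = (len(sample) + 1, 0, float("inf"), majority, majority)
    for f in range(len(sample[0][0])):
        for x, _ in sample:
            t = x[f]
            left = [y for xi, y in sample if xi[f] <= t]
            right = [y for xi, y in sample if xi[f] > t]
            if not right:
                continue  # threshold does not separate anything
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]  # (feature index, threshold, left label, right label)


def map_task(partition, seed):
    """Map phase: one learner per data partition, trained independently."""
    rng = random.Random(seed)
    return train_stump(balanced_bootstrap(partition, rng))


def reduce_vote(stumps, x):
    """Reduce phase: majority vote over the learners' predictions for x."""
    votes = Counter(
        (l_lab if x[f] <= t else r_lab) for f, t, l_lab, r_lab in stumps
    )
    return votes.most_common(1)[0][0]


# Toy imbalanced data (assumed): label 0 is the majority class, label 1 the minority.
partitions = [
    [([0.0], 0), ([1.0], 0), ([2.0], 0), ([3.0], 0), ([9.0], 1)],
    [([0.5], 0), ([1.5], 0), ([2.5], 0), ([3.5], 0), ([8.0], 1)],
    [([1.0], 0), ([2.0], 0), ([3.0], 0), ([4.0], 0), ([7.0], 1)],
]
stumps = [map_task(p, seed) for seed, p in enumerate(partitions)]
print(reduce_vote(stumps, [10.0]), reduce_vote(stumps, [-1.0]))
```

Because the map tasks share no state, they could in principle run on separate nodes, which is the property a MapReduce implementation relies on; a real deployment would replace the list comprehension with Hadoop jobs and the stumps with full trees.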

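Authors: Meng Haidong (孟海东), Ji Xiaoqing (冀小青), Xiao Yinlong (肖银龙), Song Yuchen (宋宇辰)

Affiliation: School of Information Engineering, Inner Mongolia University of Science and Technology (内蒙古科技大学信息工程学院)

Source: Journal of Inner Mongolia University of Science and Technology (《内蒙古科技大学学报》), CAS, 2016, No. 3, pp. 297-301 (5 pages)

Funding: National Natural Science Foundation of China (Grant No. 71363040)

Keywords: classification; cloud computing; MapReduce; Random Forest; feature selection

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]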
