
MapReduce based distributed improved random forest model for graduates career classification (基于MapReduce的分布式改进随机森林学生就业数据分类模型研究) Cited by: 7
Abstract Educational data mining (EDM), the application of data mining technology to education, is a frontier research area in educational informatization and is attracting growing attention from educators and data scientists. EDM models data samples from the education field and uses statistical machine learning models to study and predict test data sets. In the "big data" era, as the volume of data to be processed keeps growing, existing data mining models hit the computational limits of a single processing node, and distributed computing frameworks for big data processing have emerged in response. With these frameworks, machine learning models for mining university graduate employment data can meet the demands of future large-scale data processing and can support data mining, tailored recommendation, and decision-making in information integration systems with very large data sets. Against this background, this study first compares the classification performance of existing data models on the target data. It then proposes an improved random forest model that introduces weighting coefficients on the input features when computing each feature's information gain and uses the result as the criterion for choosing the optimal feature to split on; simulation tests measure how much the improved model raises classification performance relative to the existing models. To address the single-node performance bottleneck in classifying massive data sets, the study further proposes a large-scale student employment data classification and prediction model based on a distributed version of the improved random forest algorithm. The MapReduce distributed computing framework is used to serialize the trained model to, and deserialize it from, the local disk and the distributed file system, which extends the improved random forest classifier to distributed, large-scale data classification.
Source Systems Engineering-Theory & Practice (《系统工程理论与实践》), indexed in EI, CSSCI, CSCD, and the Peking University Core Journals list; 2017, No. 5, pp. 1383-1392 (10 pages)
Funding National Natural Science Foundation of China (71690234)
Keywords machine learning; data classification model; big data processing; MapReduce
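The abstract describes the split criterion only in words: a weighting coefficient on each input feature enters the information-gain calculation. The record does not give the formula, so the following is a minimal sketch under one assumed reading, in which a feature's score is its ordinary information gain multiplied by its weight. The function names, the sample rows, and the weights are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Ordinary information gain of splitting on one categorical feature."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split_feature(rows, labels, weights):
    """Pick the feature with the largest weight * information gain.

    ASSUMPTION: multiplying the gain by a per-feature weight is only one
    possible form of the weighted criterion; the paper's equation may differ.
    """
    return max(weights, key=lambda f: weights[f] * information_gain(rows, labels, f))


# Hypothetical toy records: two categorical features and a career outcome.
rows = [
    {"major": "cs", "gpa": "high"},
    {"major": "cs", "gpa": "high"},
    {"major": "cs", "gpa": "low"},
    {"major": "math", "gpa": "low"},
    {"major": "math", "gpa": "low"},
    {"major": "math", "gpa": "low"},
]
labels = ["industry", "industry", "industry", "study", "study", "industry"]

if __name__ == "__main__":
    print(best_split_feature(rows, labels, {"major": 1.0, "gpa": 1.0}))
    print(best_split_feature(rows, labels, {"major": 1.0, "gpa": 2.0}))
```

With equal weights the sketch selects "major" on this toy data; doubling the weight on "gpa" changes the selection to "gpa", which illustrates how feature weights can redirect the choice of split.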
