Abstract
To explore the intrinsic relationship between student grades and graduation destinations, a parallel Canopy-K-means algorithm based on Hadoop is proposed and used for analysis. First, the initial Canopy center points are determined according to the "minimum-maximum principle" and a fast, coarse clustering is performed; the resulting centers serve as the initial cluster centers for the K-means algorithm, and the procedure is parallelized on the MapReduce computing framework. Then, using the academic records of the 2017 graduates of Xi'an Polytechnic University, a mining and analysis experiment on massive educational data is carried out: students with the same type of graduation destination are clustered, and the relationship between each graduation destination and the courses is analyzed. The experimental results show that, when processing massive data, the improved Canopy-K-means algorithm improves clustering convergence speed by about 2.1 times and accuracy by about 15% compared with the traditional K-means algorithm, giving a better clustering effect.
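The seeding idea described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' Hadoop implementation: it assumes that the "minimum-maximum principle" means each new center is the point whose minimum distance to the already chosen centers is largest, it starts from the first point, and it takes a fixed number of centers `k`. The function names and the two-course grade vectors are hypothetical. In a MapReduce version, the assignment step would correspond to the map phase and the center update to the reduce phase.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def minmax_centers(points, k):
    """Pick k seeds: each new seed maximizes its minimum distance
    to the seeds chosen so far (assumed reading of the min-max principle)."""
    centers = [points[0]]  # assumption: start from the first point
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(nxt)
    return centers

def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iteration starting from the given seeds."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step (the "map" side in a MapReduce formulation).
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[idx].append(p)
        # Update step (the "reduce" side): mean of each cluster;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [min(range(len(centers)), key=lambda i: dist(p, centers[i]))
              for p in points]
    return centers, labels

# Hypothetical grade vectors (two course scores per student).
points = [(90, 85), (88, 92), (60, 55), (62, 58), (75, 40), (78, 42)]
seeds = minmax_centers(points, 3)
centers, labels = kmeans(points, seeds)
```

Because the seeds are spread far apart before K-means starts, the iteration on this toy data settles after a single center update, which is the intuition behind the faster convergence the abstract reports.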
Authors
郭卫霞
薛涛
李婷
GUO Weixia; XUE Tao; LI Ting (School of Computer Science, Xi'an Polytechnic University, Xi'an 710048, China)
Source
《西安工程大学学报》
CAS
2018, No. 6, pp. 705-712 (8 pages)
Journal of Xi’an Polytechnic University
Funding
General Project of the Shaanxi Provincial Natural Science Basic Research Plan (2018JQ6103)