Abstract
Traditional clustering algorithms lack efficient time and space complexity on very large-scale data sets and are prone to falling into local optima. To address this, an Improved Gray Wolf Clustering Algorithm (IGWCA) is proposed that combines gray wolf behavior rules with gray wolf hunting strategies and introduces the Dirichlet distribution as a prior. A comparative analysis of IGWCA against other clustering algorithms on benchmark data sets shows that IGWCA has strong exploration and exploitation capabilities as well as low dispersion. IGWCA is then parallelized with the MapReduce model of the Hadoop framework (IGWCA on MapReduce, IGWCA-MR). The clustering quality of IGWCA-MR is verified using F-Measure and average running time, and its running time and speedup are verified on real data sets. The experimental results show that IGWCA-MR can effectively solve the clustering problem for very large-scale data sets and is an efficient alternative algorithm.
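For orientation, the gray wolf search that the abstract builds on can be sketched as follows. This is a minimal illustration of the canonical Grey Wolf Optimizer applied to centroid search, not the IGWCA of the paper: the Dirichlet prior, the behavior rules, and the MapReduce parallelization are not reproduced, and the function names (`sse`, `gwo_cluster`), the sum-of-squared-error fitness, and all parameter defaults are assumptions made for this sketch.

```python
import random


def sse(centroids, points):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
        for p in points
    )


def gwo_cluster(points, k, n_wolves=12, n_iter=60, seed=0):
    """Canonical Grey Wolf Optimizer searching for k centroids (illustrative)."""
    rng = random.Random(seed)
    dim = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dim)]
    hi = [max(p[d] for p in points) for d in range(dim)]
    size = k * dim  # each wolf is k centroids flattened into one vector

    def unflatten(w):
        return [w[j * dim:(j + 1) * dim] for j in range(k)]

    def fitness(w):
        return sse(unflatten(w), points)

    # Assumption: wolves start uniformly at random inside the data bounds.
    wolves = [[rng.uniform(lo[i % dim], hi[i % dim]) for i in range(size)]
              for _ in range(n_wolves)]
    best, best_fit = None, float("inf")

    for t in range(n_iter):
        # The three fittest wolves (alpha, beta, delta) lead the hunt.
        ranked = sorted(wolves, key=fitness)
        alpha, beta, delta = ranked[0], ranked[1], ranked[2]
        alpha_fit = fitness(alpha)
        if alpha_fit < best_fit:
            best, best_fit = list(alpha), alpha_fit
        a = 2.0 * (1.0 - t / n_iter)  # decays from 2 toward 0
        moved = []
        for w in wolves:
            new_w = []
            for i in range(size):
                estimate = 0.0
                for leader in (alpha, beta, delta):
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    D = abs(C * leader[i] - w[i])
                    estimate += leader[i] - A * D
                x = estimate / 3.0  # average of the three leader estimates
                new_w.append(min(max(x, lo[i % dim]), hi[i % dim]))
            moved.append(new_w)
        wolves = moved

    # Also consider the population produced by the last update.
    final = min(wolves, key=fitness)
    if fitness(final) < best_fit:
        best, best_fit = list(final), fitness(final)
    return unflatten(best), best_fit
```

Calling `gwo_cluster(points, k=2)` on a list of equal-length numeric tuples returns the best centroids found and their fitness. Large values of `a` early in the run push wolves away from the leaders (exploration), while small values later pull them in (exploitation), which is the trade-off the abstract refers to.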
Authors
ZHAO Yan (赵彦)
SUN Jun (孙俊)
Affiliations: Internet of Things Engineering College, Jiangsu Vocational College of Information Technology, Wuxi 214153, China; International Joint Laboratory of Pattern Recognition and Artificial Intelligence, Jiangnan University, Wuxi 214122, China
Source
Telecommunication Engineering (《电讯技术》)
PKU Core Journal
2020, No. 10, pp. 1214-1221 (8 pages)
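Funding
National Natural Science Foundation of China (61672263)
Natural Science Foundation of Jiangsu Province (BK20131097)
Jiangsu Higher Vocational College Teachers' Professional Leader High-End Training (Individual Visiting Scholar Training) Fund Project (2019GRGDYX015)
2017 Jiangsu Universities "Qing Lan Project" Fund (Su Jiao Shi [2017] No. 15)
Jiangsu Province Fifth-Phase "333 Project" Third-Level Training Candidate Fund (Su Ren Cai Ban [2018] No. 6)
College Teaching Team Project (Su Xin Yuan Jiao [2020] No. 4)
College Research Project (JSITKY201804)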
Keywords
big data analysis
data mining
clustering algorithm
gray wolf algorithm
Dirichlet distribution