Abstract
To improve the time efficiency of the genetic k-means algorithm and the accuracy of its clustering results, a parallel genetic k-means algorithm on the Hadoop platform is proposed, based on the coarse-grained parallel design of genetic algorithms. Each subpopulation is numbered, and this number identifies the individuals that belong to it; the cluster centers an individual contains, together with its fitness, form the value. The identifier and the value together serve as the input for an individual. A population migration strategy is also designed to avoid premature convergence during parallelization. Experiments on several data sets show that, on larger data sets, the parallel genetic k-means algorithm clearly outperforms the traditional serial algorithm in both running time and final clustering results.
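The record layout and migration step described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' Hadoop implementation: the fitness function, tournament selection, Gaussian mutation, ring-shaped migration, and all function names (`evolve`, `migrate`, `run`) are assumptions made for illustration, since the abstract does not specify them. Each subpopulation is stored under its number, and each individual is a `(cluster_centers, fitness)` pair.

```python
import random

def sse(centers, points):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sum((p - c) ** 2 for p, c in zip(pt, ctr)) for ctr in centers)
               for pt in points)

def fitness(centers, points):
    """Assumed fitness (higher is better); the abstract does not define one."""
    return 1.0 / (1.0 + sse(centers, points))

def kmeans_step(centers, points):
    """One k-means update: assign points to nearest center, recompute means."""
    groups = [[] for _ in centers]
    for pt in points:
        dists = [sum((p - c) ** 2 for p, c in zip(pt, ctr)) for ctr in centers]
        groups[dists.index(min(dists))].append(pt)
    # An empty cluster keeps its previous center.
    return [tuple(sum(col) / len(g) for col in zip(*g)) if g else ctr
            for g, ctr in zip(groups, centers)]

def evolve(subpop, points, rng):
    """Evolve one subpopulation independently (the part a mapper could run)."""
    children = [max(subpop, key=lambda ind: ind[1])]  # elitism: keep the best
    while len(children) < len(subpop):
        a, b = rng.sample(subpop, 2)                  # binary tournament
        parent = a if a[1] >= b[1] else b
        centers = [tuple(x + rng.gauss(0, 0.1) for x in ctr) for ctr in parent[0]]
        centers = kmeans_step(centers, points)        # local k-means refinement
        children.append((centers, fitness(centers, points)))
    return children

def migrate(islands):
    """Assumed ring migration: best of the previous island replaces the worst here."""
    ids = sorted(islands)
    bests = {i: max(islands[i], key=lambda ind: ind[1]) for i in ids}
    for pos, i in enumerate(ids):
        src = ids[pos - 1]                            # previous island in the ring
        worst = min(range(len(islands[i])), key=lambda j: islands[i][j][1])
        islands[i][worst] = bests[src]
    return islands

def run(points, k=2, n_islands=3, size=4, generations=5, seed=0):
    """Return the best (centers, fitness) pair found across all subpopulations."""
    rng = random.Random(seed)
    islands = {}
    for i in range(n_islands):                        # key: subpopulation number
        pop = []
        for _ in range(size):
            centers = rng.sample(points, k)           # value: centers + fitness
            pop.append((centers, fitness(centers, points)))
        islands[i] = pop
    for _ in range(generations):
        islands = {i: evolve(pop, points, rng) for i, pop in islands.items()}
        islands = migrate(islands)
    return max((ind for pop in islands.values() for ind in pop),
               key=lambda ind: ind[1])
```

Calling `run(points)` on a list of equal-length numeric tuples returns the fittest individual found. In an actual Hadoop job the `evolve` step would run per key inside the map or reduce phase and migration would happen between iterations; here both are simulated in one process.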
Source
《计算机工程与设计》
CSCD
Peking University Core Journal
2014, Issue 2, pp. 657-660 (4 pages)
Computer Engineering and Design
Funding
Key Project of the Natural Science Research Fund of the Anhui Provincial Department of Education (2011A006)