Abstract
To improve clustering accuracy and efficiency for massive high-dimensional, small-sample-size data, this paper proposes a high-dimensional big data clustering system based on a recursive memetic algorithm and cloud-based distributed computing. An iterative clustering system is designed on the Spark distributed computing platform and consists of two stages: recursive memetic feature reduction and density-based clustering. The former takes the clustering accuracy on gene microarray data as the primary objective and the number of features as the secondary objective, and recursively reduces the feature space. The latter is a density-based clustering algorithm built on hesitant fuzzy set theory, which uses a weighted hesitant fuzzy set correlation coefficient to measure the distance between data points. Simulation experiments on both synthetic and clinical datasets show that the proposed algorithm performs well in clustering accuracy, scalability, and time efficiency.
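The abstract does not give the formula for the weighted hesitant fuzzy set correlation coefficient, so the following is only a minimal sketch of one commonly used form, not the paper's implementation. It assumes each data point is a list of hesitant fuzzy elements (one per feature, each a non-empty list of membership degrees in [0, 1]), that shorter elements are padded by repeating their largest value, and that distance is taken as one minus the correlation. The function names `pad`, `weighted_correlation`, and `distance` are hypothetical.

```python
from math import sqrt


def pad(element, length):
    """Sort a hesitant fuzzy element and extend it to `length`.

    Assumption: padding repeats the largest membership value.
    """
    values = sorted(element)
    return values + [values[-1]] * (length - len(values))


def weighted_correlation(a, b, weights):
    """Weighted correlation coefficient between two hesitant fuzzy sets.

    a, b    : lists of hesitant fuzzy elements, one per feature
    weights : one non-negative weight per feature
    """
    cross = energy_a = energy_b = 0.0
    for h_a, h_b, w in zip(a, b, weights):
        n = max(len(h_a), len(h_b))
        p, q = pad(h_a, n), pad(h_b, n)
        # Average the products over the n aligned membership values.
        cross += w * sum(x * y for x, y in zip(p, q)) / n
        energy_a += w * sum(x * x for x in p) / n
        energy_b += w * sum(y * y for y in q) / n
    denom = sqrt(energy_a * energy_b)
    return cross / denom if denom > 0 else 0.0


def distance(a, b, weights):
    """Assumed conversion from correlation to distance: 1 - correlation."""
    return max(0.0, 1.0 - weighted_correlation(a, b, weights))


if __name__ == "__main__":
    point_a = [[0.2, 0.4], [0.7]]
    point_b = [[0.3], [0.6, 0.8]]
    feature_weights = [0.5, 0.5]
    print(distance(point_a, point_b, feature_weights))
```

A density-based clusterer could call `distance` wherever it would otherwise use a Euclidean distance. How the feature weights are chosen is left open here, since the abstract does not specify it.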
Author
Wang Chaoying (王超英), Dongguan Polytechnic, Dongguan 523808, Guangdong, China
Source
Computer Applications and Software (《计算机应用与软件》)
Indexed in the Peking University Core Journals list (北大核心)
2021, Issue 4, pp. 295-304 (10 pages)
Keywords
Big data analysis
High-dimensional small-sample-size data
Memetic algorithm
Distributed computing
Hesitant fuzzy set