Abstract
In data mining, when the data to be clustered are sparse, continuing to use the traditional Euclidean distance as the clustering metric degrades both clustering quality and efficiency. Inspired by the KL divergence from information theory, this paper proposes a KL-divergence-based similarity measure, implemented on the open-source Spark data framework, to optimize currently used clustering algorithms. First, pre-clustering is used to analyze the overall distribution of the data. Then, KL divergence is used as the clustering distance metric, so that the information carried by the elements of the data sets is fully exploited to measure the relationships between different data sets and to guide clustering, which alleviates the data sparsity problem to a certain extent. The whole process runs on the Spark distributed data processing framework and makes full use of the cluster's capacity to improve the accuracy of data processing and the time efficiency of the algorithm. Using KL divergence as the distance metric also takes the information hidden in the data into account, which improves clustering quality. Finally, an experiment verifies the effectiveness of the proposed algorithm.
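The abstract names KL divergence as the clustering distance but does not give a formula. The following minimal Python sketch illustrates the general idea only. It assumes each record is a non-negative vector that can be normalized into a discrete distribution, and it uses a symmetrised KL divergence with additive smoothing for zero entries. The function names, the `eps` smoothing constant, and the symmetrisation are assumptions of this sketch, not the authors' stated method, and the Spark integration is not shown.

```python
import math


def normalize(counts, eps=1e-9):
    """Turn a non-negative vector into a probability distribution.

    eps is added to every entry so that zero entries (common in sparse
    data) do not cause log(0) or division by zero. This smoothing is an
    assumption of the sketch, not something stated in the abstract.
    """
    smoothed = [c + eps for c in counts]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i), measured in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetric_kl(x, y, eps=1e-9):
    """Symmetrised KL divergence, so the result does not depend on argument order."""
    p, q = normalize(x, eps), normalize(y, eps)
    return kl_divergence(p, q) + kl_divergence(q, p)


def assign_to_nearest(point, centers, eps=1e-9):
    """Return the index of the center with the smallest symmetric KL divergence to point."""
    return min(range(len(centers)),
               key=lambda i: symmetric_kl(point, centers[i], eps))
```

In a k-means-style loop, `assign_to_nearest` would take the place of the Euclidean nearest-center step. Because each record is assigned independently of the others, that step could in principle be applied per record in a distributed map operation.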
Authors
赵玉明
舒红平
魏培阳
刘魁
ZHAO Yuming; SHU Hongping; WEI Peiyang; LIU Kui (College of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China; Key Laboratory of Software Automatic Generation and Intelligent Information Service, Chengdu University of Information Technology, Chengdu 610225, China)
Source
《现代电子技术》
Peking University Core Journal
2020, No. 8, pp. 52-55, 59 (5 pages)
Modern Electronics Technique
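Funding
Science and Technology Support Project of the Sichuan Provincial Department of Science and Technology (18ZDYF3256)
Scientific Research Funding Project of the Sichuan Provincial Department of Education (18ZB0126)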
Keywords
clustering algorithm optimization
Spark
data distribution analysis
data clustering
clustering analysis
data processing