Abstract
Clustering is one of the fundamental tasks in data mining and machine learning. Because traditional clustering methods build assumptions about cluster structure into their design, they perform poorly on datasets that violate those assumptions, especially large, high-dimensional ones. This paper introduces the concept of maximum average entropy rate and uses it to design a graph-based correlation clustering algorithm. The algorithm decomposes the correlation clustering problem into several independent single-cluster optimization problems and uses neighborhoods, in the form of a neighboring connected graph, to remove the limitation that big data imposes on correlation clustering. In the implementation, heuristic neighbor searching and cluster generation simplify the search for the optimal neighborhood and for the correlation clustering, respectively, and a graph-iteration method is designed for distributed computing platforms. Compared with other clustering algorithms, the proposed algorithm improves computational efficiency while remaining relatively flexible in its cluster-structure assumptions, so it applies to data with a variety of distributions. In clustering experiments, its F1-measure and purity values were better than those of six other clustering algorithms, and on high-dimensional big datasets its running time was far lower than that of the other algorithms.
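The abstract reports purity as one of its two evaluation measures without defining it. As a minimal sketch, assuming the standard textbook definition (each cluster is credited with its most frequent ground-truth class, and the credited counts are divided by the number of points), purity could be computed as follows. The function name `purity` and its arguments are illustrative and are not taken from the paper.

```python
from collections import Counter


def purity(true_labels, cluster_ids):
    """Fraction of points that belong to the majority true class of their cluster.

    Assumes the standard definition of purity; the paper may use a variant.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("at least one point is required")

    # Count the true labels that occur inside each predicted cluster.
    members = {}
    for truth, cluster in zip(true_labels, cluster_ids):
        members.setdefault(cluster, Counter())[truth] += 1

    # Each cluster contributes the size of its largest true class.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(true_labels)
```

For example, with true labels `["a", "a", "b", "b"]` and cluster assignments `[0, 0, 0, 1]`, cluster 0 contributes 2 points and cluster 1 contributes 1, giving a purity of 0.75. Purity rises trivially as the number of clusters grows, which is one reason it is usually reported alongside a second measure such as the F1-measure.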
Authors
张俪文
王涛
罗坚
杨树森
徐宗本
Liwen ZHANG; Tao WANG; Jian LUO; Shusen YANG; Zongben XU (Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China; School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China)
Source
《中国科学:信息科学》
CSCD
Peking University Core Journal (北大核心)
2019, Issue 12, pp. 1572-1585 (14 pages)
Scientia Sinica (Informationis)
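Funding
National Natural Science Foundation of China (Grant Nos. 61772410, 61802298, 11690011, U1811461)
National Key Research and Development Program of China (Grant No. 2017YFB1010004)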
Keywords
clustering
correlation clustering
entropy rate
graph-based clustering
big data