Abstract
This paper addresses the evolving clustering problem for data streams and proposes an optimization model based on fuzzy maximum entropy. Fuzzy membership expresses the fuzziness of the cluster partition, and information entropy describes its effectiveness. On this basis, an objective function is defined, and the clustering of each data subset within a sliding window is treated as an optimization problem, so that the clustering result describes the intrinsic structure of the data while preserving the continuity of the clustering model between adjacent windows. The solution of the optimization problem serves as the basis for concept drift detection, which ensures the validity of the detection results and helps capture trends in the cluster structure. In simulation experiments, synthetic and real datasets are used to verify the proposed algorithm, which is compared with several evolving clustering methods in terms of clustering accuracy, concept drift detection accuracy, and computational efficiency. The results show that, under the same conditions, the algorithm achieves significantly higher clustering accuracy and concept drift detection accuracy than the other clustering algorithms, while also reducing computation time and memory usage.
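The abstract does not give the objective function, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes the standard maximum-entropy fuzzy clustering update (memberships proportional to exp(-d^2 / gamma)) and adds a hypothetical quadratic penalty, weighted by `lam`, that pulls each center toward its position in the previous window. The names `memberships`, `cluster_window`, `drift_score`, `gamma`, and `lam` are illustrative assumptions, as is using the largest center displacement as a drift indicator.

```python
import math


def memberships(point, centers, gamma):
    """Maximum-entropy memberships of one point: u_k proportional to exp(-d_k^2 / gamma)."""
    d2 = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    shift = min(d2)  # subtract the smallest distance for numerical stability
    weights = [math.exp(-(d - shift) / gamma) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]


def cluster_window(window, prev_centers, gamma=1.0, lam=0.5, iters=50):
    """Cluster one sliding window, starting from the previous window's centers.

    Assumed objective (hypothetical, not taken from the paper):
        sum_ik u_ik * d_ik^2 + gamma * sum_ik u_ik * ln(u_ik)
        + lam * sum_k ||c_k - prev_k||^2
    The last term is the assumed continuity penalty between adjacent windows.
    """
    centers = [list(c) for c in prev_centers]
    dim = len(prev_centers[0])
    for _ in range(iters):
        u = [memberships(x, centers, gamma) for x in window]
        for k in range(len(centers)):
            mass = sum(row[k] for row in u)
            for j in range(dim):
                weighted = sum(row[k] * x[j] for row, x in zip(u, window))
                # Closed-form minimizer of the assumed objective for center k.
                centers[k][j] = (weighted + lam * prev_centers[k][j]) / (mass + lam)
    return centers


def drift_score(prev_centers, centers):
    """Largest center displacement between adjacent windows (assumed drift indicator)."""
    return max(math.dist(p, c) for p, c in zip(prev_centers, centers))


if __name__ == "__main__":
    prev = [[0.0, 0.0], [5.0, 5.0]]
    window = [[0.1, -0.1], [-0.2, 0.1], [7.9, 8.1], [8.1, 7.9]]
    new_centers = cluster_window(window, prev)
    print(new_centers, drift_score(prev, new_centers))
```

In this sketch a larger `lam` keeps the model closer to the previous window, and a drift would be flagged when `drift_score` exceeds a threshold chosen by the user. Both choices are illustrative and are not taken from the paper.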
Source
Scientia Sinica Informationis (《中国科学:信息科学》)
CSCD
Peking University Core Journals (北大核心)
2017, Issue 11, pp. 1464-1482 (19 pages)
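Funding
Supported by the Key Program of the National Natural Science Foundation of China (Grant Nos. 61432011, U1435212), the National Natural Science Foundation of China (Grant No. 61673249), and the Shanxi Province Youth Science and Technology Research Fund (Grant No. 201701D221097).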
Keywords
data stream, evolving clustering, optimization model, fuzzy membership, information entropy