Abstract
To address the "chain data" problem in data stream clustering, together with the high dimensionality, sparsity, and multiple topics of large-scale text streams, the class centroid, radius, and judging distance used during clustering are redefined on the basis of the Squeezer clustering algorithm. An improved algorithm is proposed: a data preprocessing stage is added to improve clustering accuracy, and projected clustering is used to improve clustering efficiency and to attach semantics to the clusters. Clustering experiments on an Internet news corpus show that the proposed algorithm achieves a substantial improvement in clustering quality at a small cost in speed, and that it clearly outperforms the Squeezer algorithm.
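The abstract does not give the algorithm's definitions, so the following is only a generic illustration of the family of methods it refers to: a one-pass, threshold-based clusterer over sparse term-weight vectors. Every name here (cosine_distance, Cluster, one_pass_cluster, threshold, top_terms) is hypothetical, cosine distance is an assumed stand-in for the paper's judging distance, the radius is not modeled, and keeping the heaviest centroid terms is only a crude stand-in for projected clustering.

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse vectors (dict: term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


class Cluster:
    """Hypothetical cluster summary: running sum of member vectors plus a count."""

    def __init__(self):
        self.total = defaultdict(float)
        self.size = 0

    def add(self, vec):
        for term, weight in vec.items():
            self.total[term] += weight
        self.size += 1

    def centroid(self):
        # Assumed definition: arithmetic mean of the member vectors.
        return {t: w / self.size for t, w in self.total.items()}

    def top_terms(self, k):
        # The k heaviest centroid terms, usable as a rough label for the cluster.
        c = self.centroid()
        return sorted(c, key=lambda t: (-c[t], t))[:k]


def one_pass_cluster(stream, threshold):
    """Assign each incoming vector to the nearest cluster, or open a new one."""
    clusters, labels = [], []
    for vec in stream:
        best, best_d = None, None
        for i, c in enumerate(clusters):
            d = cosine_distance(vec, c.centroid())
            if best_d is None or d < best_d:
                best, best_d = i, d
        if best is None or best_d > threshold:
            clusters.append(Cluster())
            best = len(clusters) - 1
        clusters[best].add(vec)
        labels.append(best)
    return clusters, labels


docs = [
    {"stock": 1.0, "market": 1.0},
    {"stock": 1.0, "market": 1.0, "bank": 1.0},
    {"football": 1.0, "goal": 1.0},
    {"goal": 1.0, "match": 1.0, "football": 1.0},
]
clusters, labels = one_pass_cluster(docs, threshold=0.5)
```

Each document is read once and compared only against cluster summaries, which is what makes this style of method suitable for streams; the threshold value 0.5 is arbitrary and chosen only for the toy data.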
Source
《控制与决策》
EI
CSCD
Peking University Core Journal
2012, No. 4, pp. 542-546 (5 pages)
Control and Decision
Funding
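National Natural Science Foundation of China (90924020)
Doctoral Program Foundation of the Ministry of Education of China (200800060005)
Alibaba Young Scholars Support Program (Huoshui Program, Ali-2010-B-6)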