
Text stream clustering based on the Squeezer algorithm (cited by: 3)
Abstract: To address the "chain data" problem in data stream clustering, together with the high dimensionality, sparsity, and multi-topic nature of text streams, this paper builds on the Squeezer clustering algorithm and redefines the cluster centroid, radius, and judging distance used during clustering. The improved algorithm adds a data preprocessing stage to raise clustering accuracy, and a projected clustering stage to raise clustering efficiency and attach semantics to the clusters so that they are easier to interpret. Experiments on an Internet news corpus show that the proposed algorithm achieves a large improvement in clustering quality at a small cost in speed, and that it significantly outperforms the original Squeezer algorithm.
Source: Control and Decision (《控制与决策》), indexed in EI, CSCD, and the Peking University Core Journals list; 2012, No. 4, pp. 542-546 (5 pages)
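Funding: National Natural Science Foundation of China (90924020); Doctoral Program Foundation of the Ministry of Education of China (200800060005); Alibaba Young Scholars Support Program (Huoshui Plan, Ali-2010-B-6)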
Keywords: text stream; Squeezer algorithm; projected clustering
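
The abstract does not give the redefined centroid, radius, or judging distance, so the following is only a minimal sketch of the general one-pass idea behind Squeezer-style clustering: each arriving document joins its most similar existing cluster if the similarity reaches a threshold, and otherwise starts a new cluster. Cosine similarity over raw term counts, the summed-count cluster summary, and the names `cosine`, `one_pass_cluster`, and `threshold` are all assumptions made for illustration, not the paper's definitions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dict-like)."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def one_pass_cluster(docs, threshold=0.3):
    """Cluster documents in arrival order with a single pass.

    Each cluster is summarised by the summed term counts of its members
    (an assumed stand-in for the paper's redefined centroid).
    Returns (labels, summaries).
    """
    summaries = []  # one Counter per cluster
    labels = []
    for text in docs:
        vec = Counter(text.lower().split())
        best, best_sim = -1, 0.0
        for i, summary in enumerate(summaries):
            sim = cosine(vec, summary)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            summaries[best].update(vec)  # absorb the document into the cluster
            labels.append(best)
        else:
            summaries.append(vec)  # no cluster is close enough: open a new one
            labels.append(len(summaries) - 1)
    return labels, summaries


if __name__ == "__main__":
    stream = [
        "stock market falls",
        "market stock rises",
        "football match tonight",
        "football match result",
    ]
    labels, summaries = one_pass_cluster(stream)
    print(labels)
    # The heaviest terms of each summary give a rough, assumed cluster label.
    for summary in summaries:
        print(summary.most_common(2))
```

A single pass keeps the cost per document proportional to the number of clusters, which is why this family of algorithms suits streams. The preprocessing and projected clustering stages described in the abstract are not modelled here.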


