Text stream clustering based on Squeezer algorithm (基于Squeezer算法的文本数据流聚类)

Cited by: 3


Abstract: To address the "chain data" problem in data stream clustering, together with the high dimensionality, sparsity, multiple topics, and large scale of text streams, a modified Squeezer clustering algorithm is proposed. It builds on the Squeezer algorithm, incorporates the idea of projected clustering, and redefines the cluster centroid, radius, and judging distance used during clustering. A preprocessing stage is added to improve clustering accuracy, and a projected clustering stage is added to improve efficiency and attach semantics to the clusters so that they are easier to interpret. Experiments on an Internet news corpus show that the proposed algorithm trades a small loss in speed for a large improvement in clustering quality and significantly outperforms the original Squeezer algorithm.
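The abstract does not give the redefined centroid, radius, or judging distance, so the following is only a minimal sketch of the general one-pass idea, not the authors' method. It assumes a mean term-frequency centroid, cosine similarity as the judging measure, and a fixed `threshold`; the names `Cluster`, `squeezer_stream`, and `top_terms` are hypothetical. The `top_terms` helper is a simple stand-in for the role projected clustering plays in attaching semantics to a cluster.

```python
from collections import Counter
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class Cluster:
    """Running summary of one cluster: summed term counts and size."""

    def __init__(self):
        self.term_sum = Counter()
        self.size = 0

    def add(self, doc):
        self.term_sum.update(doc)
        self.size += 1

    def centroid(self):
        # Assumed definition: mean term frequency per document.
        return {t: c / self.size for t, c in self.term_sum.items()}

    def top_terms(self, k=3):
        # Assumed stand-in for projection: keep the k heaviest dimensions
        # as a readable label (ties broken alphabetically).
        ranked = sorted(self.term_sum.items(), key=lambda kv: (-kv[1], kv[0]))
        return [t for t, _ in ranked[:k]]


def squeezer_stream(docs, threshold=0.3):
    """One pass over the stream: each document joins the most similar
    existing cluster, or opens a new one if no similarity reaches
    the threshold."""
    clusters, labels = [], []
    for tokens in docs:
        doc = Counter(tokens)
        best, best_sim = None, -1.0
        for i, c in enumerate(clusters):
            sim = cosine(doc, c.centroid())
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < threshold:
            clusters.append(Cluster())
            best = len(clusters) - 1
        clusters[best].add(doc)
        labels.append(best)
    return clusters, labels


# Toy stream of already tokenized documents (illustrative data only).
stream = [
    ["stock", "market", "bank"],
    ["football", "match", "goal"],
    ["bank", "stock", "loan"],
    ["goal", "football", "league"],
]
clusters, labels = squeezer_stream(stream, threshold=0.3)
```

Because each document is compared only against cluster summaries and never against earlier documents, the stream is read once. The threshold plays the role the abstract assigns to the judging distance, and its value here is arbitrary.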

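Authors: You Weijia (尤薇佳), Liu Lu (刘鲁), Liu Dan (刘丹), Li Ming (李明)

Institutions: School of Economics and Management, Beihang University; School of Business Administration, China University of Petroleum

Source: Control and Decision (《控制与决策》), indexed in EI, CSCD, and the PKU Core list; 2012, No. 4, pp. 542-546 (5 pages)

Funding: National Natural Science Foundation of China (90924020); Doctoral Program Foundation of the Ministry of Education (200800060005); Alibaba Young Scholars Support Program (Living Water Program, Ali-2010-B-6)

Keywords: text stream; Squeezer algorithm; projected clustering

Classification: TP311 [Automation and Computer Technology - Computer Software and Theory]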

Citation network
Related literature

References (10)

1. Aggarwal C C, Yu P S. On clustering massive text and categorical data streams[J]. Knowledge and Information Systems, 2010, 24(2): 171-196.
2. He Z, Xu X, Deng S. Squeezer: An efficient algorithm for clustering categorical data[J]. J of Computer Science and Technology, 2002, 17(5): 611-624.
3. Li Yan, Wang Huiwen, Ye Ming, Liu Dan. Large-scale matrix clustering analysis based on the Squeezer algorithm[J]. Journal of Beijing University of Aeronautics and Astronautics, 2009, 35(12): 1499-1502. Cited by: 2
4. Aggarwal C C, Han J, Wang J, et al. A framework for projected clustering of high dimensional data streams[C]. Proc of VLDB. Toronto: Morgan Kaufmann, 2004: 852-863.
5. Aggarwal C C, Wolf J L, Yu P S, et al. Fast algorithms for projected clustering[M]. New York: ACM, 1999: 61-72.
6. Liu Dan. Research on text mining methods and techniques in customer knowledge management[D]. Beijing: School of Economics and Management, Beihang University, 2010: 63-64.
7. Singhal A. Modern information retrieval: A brief overview[J]. IEEE Data Engineering Bulletin, 2001, 24(4): 35-43.
8. Internet Web News Corpus from Sogou. SogouCreduced[EB/OL]. [2009-12-01]. http://www.sogou.com/labs/resources.html.
9. Halkidi M, Batistakis Y, Vazirgiannis M. On clustering validation techniques[J]. J of Intelligent Information Systems, 2001, 17(2/3): 107-145.
10. Wu J, Xiong H, Chen J. Adapting the right measures for K-means clustering[C]. KDD'09 Proc of the 15th ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM Press, 2009: 877-885.

Second-level references (6)

1. Hu Qinglin, Ye Nianyu, Zhu Mingfu. A survey of clustering algorithms in data mining[J]. Computer & Digital Engineering, 2007, 35(2): 17-20. Cited by: 36
2. Ye Ming, Wang Huiwen, Wang Lanhui. Application of improved hierarchical clustering method to classification of curves[C]//The 9th International Conference on Industrial Management. Beijing: China Aviation Industry Press, 2008: 325-330.
3. Oyanagi S, Kubota K, Nakase A. Application of matrix clustering to web log analysis and access prediction[C]//Proceedings of the ACM Web KDD Workshop on Mining Log Data across all Customer Touch Points. Berlin: Springer-Verlag, 2001.
4. Chen Zumin, Zhou Jiasheng. Introduction to Matrix Theory[M]. Beijing: Beihang University Press, 1998: 281-288.
5. Li Yan, Ye Ming, Wang Huiwen, et al. A data streams clustering algorithm based on interval data[C]//The 38th International Conference on Computers and Industrial Engineering. Beijing: Publishing House of Electronics Industry, 2008: 2775-2778.
6. He Zengyou, Xu Xiaofei, Deng Shengchun. Squeezer: An Efficient Algorithm for Clustering Categorical Data[J]. Journal of Computer Science & Technology, 2002, 17(5): 611-624. Cited by: 32

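Co-citing literature (1)

1. Tao Bairui, Li Qinglong, Miao Fengjuan, Guo Qin, Shao Hui. Application of a codebook clustering vector quantization algorithm to speaker recognition[J]. Journal of Henan University of Science and Technology (Natural Science), 2016, 37(1): 35-39. Cited by: 4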

Co-cited literature (25)

1. Chen Ling, Zou Lingjun, Tu Li. A real-time clustering algorithm for multiple data streams[J]. Journal of Computer Applications, 2007, 27(8): 1976-1979. Cited by: 2
2. Fan Ming, Meng Xiaofeng. Data Mining: Concepts and Techniques[M]. 2nd ed. Beijing: China Machine Press, 2007: 195-196.
3. Fan Ming, Meng Xiaofeng. Data Mining: Concepts and Techniques[M]. 2nd ed. Beijing: China Machine Press, 2007: 306-320.
4. He Z, Xu X, Deng S. Squeezer: An efficient algorithm for clustering categorical data[J]. J of Computer Science and Technology, 2002, 17(5): 611-624.
5. Zheng Guanghuan, Lin Jinxian. Research on K-median-based clustering algorithms over data streams[C]. Proceedings of the 2006 National Conference on Open Distributed and Parallel Computing (III). 2006.
6. O'Callaghan L, Mishra N, Meyerson A, et al. Streaming-data algorithms for high-quality clustering[C]//IEEE International Conference on Data Engineering. San Jose: IEEE Computer Society, 2002: 685-694.
7. Aggarwal C C, Han Jiawei, Wang Jianyong, et al. A framework for clustering evolving data streams[C]//29th International Conference on Very Large Data Bases. Berlin: Morgan Kaufmann Publishers, 2003: 81-92.
8. Kinnunen T, Li H. An overview of text-independent speaker recognition: from features to supervectors[J]. Speech Communication, 2010, 52(1): 12-40.
9. Gill M K, Kaur R, Kaur J. Vector quantization based speaker identification[J]. International Journal of Computer Applications, 2010, 4(2): 975-8887.
10. Singh S, Rajan E G. Vector quantization approach for speaker recognition using MFCC and inverted MFCC[J]. International Journal of Computer Applications, 2011, 17(1): 975-8887.

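Citing literature (3)

1. Cheng Junfeng, Wang Zhihe, Liu Jia, Pan Lina. A one-pass data stream clustering algorithm based on sliding windows[J]. Journal of Capital Normal University (Natural Science Edition), 2014, 35(4): 38-40. Cited by: 1
2. Cheng Junfeng. Clustering techniques in data stream mining[J]. Journal of Hengshui University, 2015, 17(1): 16-18.
3. Tao Bairui, Li Qinglong, Miao Fengjuan, Guo Qin, Shao Hui. Application of a codebook clustering vector quantization algorithm to speaker recognition[J]. Journal of Henan University of Science and Technology (Natural Science), 2016, 37(1): 35-39. Cited by: 4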

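Second-level citing literature (5)

1. He Mingliang, Chen Zemao, Zuo Jin. A clustering anomaly detection algorithm based on a multi-window mechanism[J]. Netinfo Security, 2016(11): 33-39. Cited by: 6
2. Qiu Baozhi, He Yanfang. Convergence proof of the multi-view kernel K-means clustering algorithm[J]. Journal of Zhengzhou University (Natural Science Edition), 2017, 49(3): 32-38. Cited by: 4
3. Pan Gang, Wu Shiyun, Sun Linping, Xu Baolei, Zhou Siji. Research on an intelligent car control system based on speech recognition technology[J]. Electronic Design Engineering, 2019, 27(7): 118-123. Cited by: 11
4. He Zanyuan, Wang Kai, Ji Lixin. Hardware implementation of a speaker recognition system based on vector quantization[J]. Modern Electronics Technique, 2022, 45(1): 171-175.
5. Li Yuxiao, Wu Chuansheng, Liu Wen, Li Huanhuan. Ship trajectory clustering with affinity propagation and spectral clustering[J]. Journal of Henan University of Science and Technology (Natural Science), 2018, 39(1): 35-40. Cited by: 4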

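1. Zhang Xinmeng, Jiang Shengyi. A clustering algorithm for uncertain categorical data based on similarity probability[J]. Journal of Shandong University (Engineering Science), 2011, 41(3): 12-16. Cited by: 5
2. Wang Chao, Ni Zhiwei, Zhu Xiaohu. An outlier mining algorithm for data streams based on the Squeezer algorithm[J]. Computer Technology and Development, 2008, 18(1): 87-89. Cited by: 1
3. A new method for text data stream classification[J]. Science & Technology Review, 2008, 26(5): 15-15.
4. Li Yan, Wang Huiwen, Ye Ming, Liu Dan. Large-scale matrix clustering analysis based on the Squeezer algorithm[J]. Journal of Beijing University of Aeronautics and Astronautics, 2009, 35(12): 1499-1502. Cited by: 2
5. Zhao Yuhai, Wu Xiaoxin, Liu Zhiyong, Yin Ying. An effective projected clustering algorithm for genes[J]. Journal of Guangxi Normal University (Natural Science Edition), 2009, 27(1): 105-108. Cited by: 1
6. Cao Jianping, Wang Hui, Xia Youqing, Qiao Fengcai, Zhang Xin. A dual-channel online topic evolution model based on LDA[J]. Acta Automatica Sinica, 2014, 40(12): 2877-2886. Cited by: 15
7. Huang Liguo, Chen Weiqi, Wang Shitong. A projected clustering method based on the Parzen window[J]. Journal of Guangxi Normal University (Natural Science Edition), 2006, 24(4): 70-73. Cited by: 2
8. Yang Fengxiang, Peng Kaiwei, Tang Rui, Ma Jian. A new clustering method applied to the reassembly of shredded Chinese documents[J]. Electronic Technology & Software Engineering, 2015(23): 95-95.
9. Huang Liguo, Wang Shitong. PCMF: a projected clustering algorithm based on Mean-Shift[J]. Computer Engineering, 2007, 33(18): 233-235. Cited by: 1
10. Zhang Yuhong, Chen Wei, Hu Xuegang. An adaptive classification method for incompletely labeled text data streams[J]. Computer Science, 2016, 43(12): 179-182.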
