Research on a Large-Scale Text Clustering Algorithm Based on Clustering Features (cited by: 5)

Abstract: 1. Introduction. With the rapid development of the Internet, people can obtain more and more information online, but an excess of information often leaves them disoriented. Classifying information is an effective way to make it usable, and clustering is a commonly used technique for dividing texts into categories; its distinguishing feature is that it finds a cluster partition of a given text collection without requiring a training set [1-5].

Large-scale text processing has become a major challenge with the rapid growth of the Internet and the resulting information explosion. Clustering is an effective way to address this problem. This paper presents an incremental algorithm, Multi-Level CFK-means, for large-scale text clustering. By using the clustering features (CF) structure, the algorithm retains and exploits more cluster information. Clustering results are obtained quickly in a single scan of the data. The algorithm's computation and file-exchange time is several times lower than that of the k-means algorithm, while its accuracy is almost equal to that of k-means. Comparative experiments on the Reuters text collection demonstrate the algorithm's effectiveness.
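The abstract does not spell out the CF structure or the Multi-Level CFK-means procedure, so the following is only a minimal sketch of the general idea, assuming the clustering feature is the (N, LS, SS) triple defined in the BIRCH paper cited as reference [8]. The class name `ClusteringFeature`, the function `one_pass_cluster`, and the `threshold` parameter are hypothetical names introduced for illustration; the single-scan assignment rule shown here is a simple nearest-centroid heuristic, not the authors' algorithm.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClusteringFeature:
    """BIRCH-style CF triple (N, LS, SS) summarizing a set of vectors."""
    n: int = 0                                     # N: number of vectors absorbed
    ls: List[float] = field(default_factory=list)  # LS: component-wise linear sum
    ss: float = 0.0                                # SS: sum of squared norms

    def add(self, x: List[float]) -> None:
        """Absorb one vector without storing it."""
        if not self.ls:
            self.ls = [0.0] * len(x)
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other: "ClusteringFeature") -> None:
        """CF additivity: summaries of disjoint sets add component-wise."""
        if other.n == 0:
            return
        if not self.ls:
            self.ls = [0.0] * len(other.ls)
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self) -> List[float]:
        return [v / self.n for v in self.ls]

    def radius(self) -> float:
        """Root-mean-square distance of the absorbed vectors from the centroid."""
        c = self.centroid()
        variance = self.ss / self.n - sum(v * v for v in c)
        return math.sqrt(max(variance, 0.0))  # clamp tiny negative rounding error


def one_pass_cluster(points, threshold):
    """Single scan: absorb each point into the nearest CF within `threshold`,
    otherwise start a new CF. `threshold` is a hypothetical tuning parameter."""
    cfs = []
    for x in points:
        best, best_dist = None, float("inf")
        for cf in cfs:
            d = math.dist(cf.centroid(), x)
            if d < best_dist:
                best, best_dist = cf, d
        if best is not None and best_dist <= threshold:
            best.add(x)
        else:
            cf = ClusteringFeature()
            cf.add(x)
            cfs.append(cf)
    return cfs
```

Because each summary carries enough information to recover its centroid and radius, a point can be compared against existing clusters without revisiting the documents already absorbed, which is what makes a single scan of the data plausible. In a text setting the input vectors would presumably be term-weight vectors in the vector space model of reference [7], but that pairing is an assumption here.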

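Authors: Tang Chunsheng (唐春生), Jin Yihui (金以慧)

Affiliation: Department of Automation, Tsinghua University (清华大学自动化系)

Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal (北大核心), 2002, No. 9, pp. 13-15 (3 pages)

Keywords: information processing; clustering features (CF); large-scale text; clustering algorithm; computer; Multi-level CFK-means algorithm; text clustering

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]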

References (9)

[1] Yang Y. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1999, 1(1/2): 67-88
[2] Jain A K, Farrokhnia F. Unsupervised texture segmentation using Gabor filters. Pattern Recognition, 1991, 24(13): 1167-1186
[3] Anderberg M R. Cluster analysis for applications. New York, NY: Academic Press, Inc., 1973
[4] Larsen B, Aone C. Fast and effective text mining using linear-time document clustering. In: KDD-99, San Diego, California, 1999
[5] Salton G. Developments in automatic text retrieval. Science, 1991, 253: 974-980
[6] Jain A K, Murty M N, Flynn P J. Data clustering: A review. ACM Computing Surveys, 1999, 31(3): 264-323
[7] Salton G, et al. A vector space model for automatic indexing. Communications of the ACM, 1975, 18: 613-620
[8] Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, ACM, 1996. 103-114
[9] http://www.research.att.com/~lewis/reuters21578.html
