
Research on a Large-Scale Text Clustering Algorithm Based on Clustering Features

Cited by: 5
Abstract: 1. Introduction. With the rapid growth of the Internet, people can obtain ever more information online, but an excess of information often leaves users lost. Classifying information is an effective way to make it usable, and clustering is a technique commonly used to partition texts into categories; its distinguishing feature is that it finds a partition of a given text collection without requiring a training set [1-5].

Large-scale text processing has become a major challenge with the rapid growth of the Internet and the resulting information explosion, and clustering is an effective way to address it. This paper presents an incremental algorithm for large-scale text clustering, called Multi-Level CFK-means. By using the clustering features (CF) structure, the algorithm preserves and exploits more information about each cluster, and it produces clustering results quickly in a single scan of the data. Its computation and file-exchange time is several times lower than that of the k-means algorithm, while the accuracy of its results is almost equal to that of k-means. The effectiveness of the algorithm is demonstrated by comparative experiments on the Reuters text collection.
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2002, No. 9, pp. 13-15 (3 pages)
Keywords: information processing; clustering features (CF); large-scale text clustering algorithm; computer; Multi-level CFK-means algorithm; text clustering
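The abstract names the clustering features (CF) structure and a single scan of the data but does not spell out the Multi-Level CFK-means procedure. As a rough illustration only, the sketch below uses the CF triple as defined for BIRCH in reference [8] (count, linear sum, sum of squared norms) and a simple one-pass nearest-centroid assignment. The class `ClusteringFeature`, the function `one_pass_assign`, and the seed centroids are hypothetical names chosen for this example, not the authors' implementation.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class ClusteringFeature:
    """BIRCH-style cluster summary (N, LS, SS); an assumption, see lead-in."""
    n: int = 0                                     # number of member vectors
    ls: List[float] = field(default_factory=list)  # per-dimension linear sum
    ss: float = 0.0                                # sum of squared norms

    def add(self, x: Sequence[float]) -> None:
        """Absorb one vector without storing it."""
        if not self.ls:
            self.ls = [0.0] * len(x)
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        """CFs are additive: merging two clusters adds their summaries."""
        if self.n == 0:
            return ClusteringFeature(other.n, list(other.ls), other.ss)
        if other.n == 0:
            return ClusteringFeature(self.n, list(self.ls), self.ss)
        ls = [a + b for a, b in zip(self.ls, other.ls)]
        return ClusteringFeature(self.n + other.n, ls, self.ss + other.ss)

    def centroid(self) -> List[float]:
        return [v / self.n for v in self.ls]

    def radius(self) -> float:
        """Root-mean-square distance of members from the centroid."""
        c2 = sum(v * v for v in self.centroid())
        return math.sqrt(max(0.0, self.ss / self.n - c2))


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def one_pass_assign(
    vectors: Sequence[Sequence[float]], seeds: Sequence[Sequence[float]]
) -> Tuple[List[ClusteringFeature], List[int]]:
    """Single scan: put each vector in the nearest cluster and update its CF.

    An empty cluster is represented by its (hypothetical) seed centroid.
    """
    cfs = [ClusteringFeature() for _ in seeds]
    labels = []
    for x in vectors:
        centers = [cf.centroid() if cf.n else list(s) for cf, s in zip(cfs, seeds)]
        j = min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
        cfs[j].add(x)
        labels.append(j)
    return cfs, labels


docs = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
cfs, labels = one_pass_assign(docs, seeds=[[0.0, 0.0], [10.0, 10.0]])
```

Because each cluster is held as a fixed-size summary rather than as a list of its members, centroids and radii can be updated or merged without rereading the documents, which is the property a single-scan method depends on.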

References (9)

  • 1. Yang Y. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1999, 1(1/2): 67-88
  • 2. Jain A K, Farrokhnia F. Unsupervised texture segmentation using Gabor filters. Pattern Recognition, 1991, 24(12): 1167-1186
  • 3. Anderberg M R. Cluster Analysis for Applications. New York, NY: Academic Press, Inc., 1973
  • 4. Larsen B, Aone C. Fast and effective text mining using linear-time document clustering. In: KDD-99, San Diego, California, 1999
  • 5. Salton G. Developments in automatic text retrieval. Science, 1991, 253: 974-980
  • 6. Jain A K, Murty M N, Flynn P J. Data clustering: A review. ACM Computing Surveys, 1999, 31(3): 264-323
  • 7. Salton G, et al. A vector space model for automatic indexing. Communications of the ACM, 1975, 18: 613-620
  • 8. Zhang T, Ramakrishnan R, Livny M. BIRCH: An efficient data clustering method for very large databases. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, ACM, 1996. 103-114
  • 9. http://www.research.att.com/~lewis/reuters21578.html
