

DKTC: A Method of Chinese Text Clustering
Abstract: Building on an analysis of two classic clustering algorithms, DBSCAN and K-means, and taking the characteristics of Chinese text data into account, this paper combines and improves the two methods and proposes a Chinese text clustering method, DKTC. The algorithm determines the number of clusters automatically, is insensitive to noise and outliers, and is insensitive to the order in which the data are input. In addition, it is more efficient than DBSCAN. Experiments show that DKTC can cluster Chinese text and that its clustering quality improves to some degree on both traditional DBSCAN and K-means.
Source: Library and Information Service (《图书情报工作》), CSSCI, Peking University Core Journal, 2009, No. 1, pp. 109-112, 33 (5 pages in total)
Keywords: text clustering; clustering algorithm; Chinese information processing










