Abstract
Building on an analysis of two classic clustering algorithms, DBSCAN and K-means, and taking the characteristics of Chinese text data into account, this paper combines and improves the two methods to propose a Chinese text clustering method, DKTC. The algorithm determines the number of clusters automatically, is insensitive to noise and outliers, and is insensitive to the order in which the data are input. It is also more efficient than DBSCAN. Experiments show that DKTC can cluster Chinese text and that its clustering results improve to some degree on those of traditional DBSCAN and K-means.
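The abstract does not spell out how DKTC combines the two algorithms, so the following is only an illustrative sketch of one common way to pair them, not the authors' method. It assumes a density pass (plain DBSCAN) is used to find the number of clusters and to set noise aside, and that the resulting cluster centroids then seed K-means. The function names, the Euclidean distance, and the parameters `eps` and `min_pts` are all assumptions made for illustration. Real Chinese text would first need word segmentation and vectorization (for example TF-IDF), which is omitted here.

```python
import math

def dist(a, b):
    # Euclidean distance; an assumption, text vectors often use cosine instead.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dbscan(points, eps, min_pts):
    """Plain DBSCAN. Returns one label per point, with -1 meaning noise."""
    labels = [None] * len(points)
    cluster = -1

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # core point: keep expanding
                queue.extend(j_nbrs)
    return labels

def centroid(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def kmeans(points, centers, iters=20):
    """Lloyd's K-means starting from the given centers."""
    assign = []
    for _ in range(iters):
        assign = [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == c]
            new_centers.append(centroid(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return assign, centers

def density_seeded_kmeans(points, eps, min_pts):
    """Hypothetical combination: DBSCAN picks k and the seeds, K-means refines."""
    labels = dbscan(points, eps, min_pts)
    k = max(labels, default=-1) + 1
    if k == 0:
        return labels, []
    seeds = [centroid([p for p, l in zip(points, labels) if l == c])
             for c in range(k)]
    kept = [p for p, l in zip(points, labels) if l != -1]
    assign, centers = kmeans(kept, seeds)
    it = iter(assign)
    final = [next(it) if l != -1 else -1 for l in labels]  # noise stays -1
    return final, centers

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
    print(density_seeded_kmeans(data, eps=1.5, min_pts=3))
```

In this sketch the number of clusters is never passed in by the caller, and points flagged as noise are excluded from the K-means step, which mirrors the two properties the abstract claims. Whether DKTC achieves them this way is not stated in the source.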
Source
Library and Information Service (《图书情报工作》)
CSSCI
Peking University Core Journals (北大核心)
2009, No. 1, pp. 109-112, 33 (5 pages in total)
Keywords
text clustering
clustering algorithm
Chinese information processing