Abstract
This paper analyzes how the choice of the number of clusters affects clustering quality on large data sets, and reviews several prevailing views on indices for measuring clustering quality. Using the concept of text similarity, it studies the problem of finding the optimal number of clusters for text semantics and proposes CTBP, an algorithm that determines the optimal number of text clusters during the clustering process itself. The main idea is to extract one term from each text vector in the text vector set, arrange the extracted terms in order of similarity, and partition them incrementally, layer by layer, to obtain the number of clusters that corresponds to the optimal partition. In this way a single scan of the data yields multiple statistics, from which the optimal solution is then computed. Experimental results indicate that the algorithm achieves both high clustering quality and high efficiency.
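The abstract does not give the details of CTBP, so the following is only a loose sketch of the general idea it describes: order one scalar per document, cut the ordered sequence one layer at a time, and keep the layer whose partition scores best. The function name `estimate_cluster_count`, the use of a single similarity score per document, and the gap-ratio quality criterion are all assumptions made for illustration, not the authors' method.

```python
from typing import List, Tuple


def estimate_cluster_count(scores: List[float], max_k: int) -> Tuple[int, List[List[float]]]:
    """Illustrative only (not the published CTBP algorithm).

    Assumes each document has already been reduced to one similarity score.
    """
    xs = sorted(scores)

    # One scan over the ordered scores collects every neighbour gap
    # together with the index at which a cut would be placed.
    gaps = [(xs[i] - xs[i - 1], i) for i in range(1, len(xs))]
    gaps.sort(reverse=True)  # widest gaps first

    best_k, best_ratio = 1, 0.0
    # Layer k cuts at the k-1 widest gaps; each layer adds one more cut.
    for k in range(2, min(max_k, len(gaps)) + 1):
        separation = gaps[k - 2][0]  # narrowest gap that is cut at layer k
        cohesion = gaps[k - 1][0]    # widest gap left inside a cluster
        if cohesion == 0:
            ratio = float("inf") if separation > 0 else 0.0
        else:
            ratio = separation / cohesion  # assumed quality criterion
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio

    # Rebuild the partition that corresponds to the chosen layer.
    cut_positions = sorted(i for _, i in gaps[: best_k - 1])
    bounds = [0] + cut_positions + [len(xs)]
    clusters = [xs[a:b] for a, b in zip(bounds, bounds[1:])]
    return best_k, clusters


if __name__ == "__main__":
    demo = [0.1, 0.12, 0.11, 0.5, 0.52, 0.9, 0.91, 0.89]
    k, clusters = estimate_cluster_count(demo, max_k=6)
    print(k, clusters)
```

Because the gaps are gathered in one pass over the sorted scores, every candidate layer can be scored without rescanning the data, which loosely mirrors the single-scan property claimed in the abstract.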
Source
《计算机工程与设计》
CSCD
PKU Core Journal (北大核心)
2010, No. 9, pp. 2034-2036, 2100 (4 pages in total)
Computer Engineering and Design
Keywords
text clustering
number of clusters
increment
partition
CTBP