摘要
文本聚类是自然语言处理中的一项重要研究课题,主要应用于信息检索和Web挖掘等领域。其中的关键是文本的表示和聚类算法。在层次聚类的基础上,提出了一种新的基于边界距离的层次聚类算法,该方法通过选择两个类间边缘样本点的距离作为类间距离,有效地利用类的边界信息,提高类间距离计算的准确性。综合考虑不同词性特征对文本的贡献,采用多向量模型对文本进行表示。不同文本集上的实验表明,基于边界距离的多向量文本聚类算法取得了较好的性能。
Document clustering is an important task of natural language processing and is widely applicable in areas such as information retrieval and web mining.The representation of document and the clustering algorithm are the key issues of document clustering.In order to improve the precision of distance calculation,this paper put forward a novel border distance based document clustering approach,which chooses the average of distances between documents at the border of different clusters as the similarity between this pairwise of clusters and takes advantage of the border information of the clusters.Considering the contribution of different kinds of terms,documents are represented by multi-vector.Experimental results of different corpus have shown that the proposed approach outperforms other widely used hierarchical clustering methods.
出处
《计算机工程与应用》
CSCD
北大核心
2008年第3期198-201,共4页
Computer Engineering and Applications
基金
国家高技术研究发展计划(863)(the National High- Tech Research and Development Plan of China under Grant No.2006AA01Z148)
教育部科学技术研究重点项目(the Scientific Key Project of Ministry of Education of China under Grant No.207148)
关键词
距离计算
文本表示
多向量
文本聚类
distance computation
document representation
multi-vector
document clustering