Abstract
Sample-weighted clustering has attracted attention only recently, and several questions remain open. For example, does structural information among the objects being clustered help sample-weighted clustering, and how can such structural information be converted automatically into weights for the samples or objects? To address these questions, this paper takes academic papers as the objects to be clustered and K-Means as the base clustering algorithm. The PageRank value of each paper is computed from the citation relationships among the papers and used as that paper's weight, yielding a new sample-weighted text clustering algorithm. Experimental results show that clustering weighted by the papers' PageRank values improves text clustering performance. The algorithm can be extended to Web page clustering, using each page's PageRank value as its weight to improve the clustering of Web pages.
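The idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the PageRank weight enters K-Means, so the version here assumes the simplest option: each centroid is the weighted mean of its member documents. The function names, the damping factor of 0.85, the fixed initial centroids, and the toy data are all assumptions made for illustration, not details taken from the paper.

```python
def pagerank(citations, n, damping=0.85, iters=100):
    """Power-iteration PageRank over a citation graph.

    citations: list of (citing, cited) index pairs; n: number of papers.
    The damping factor 0.85 is a conventional choice, assumed here.
    """
    out_links = [[] for _ in range(n)]
    for src, dst in citations:
        out_links[src].append(dst)
    rank = [1.0 / n] * n
    for _ in range(iters):
        # Papers that cite nothing spread their rank evenly over all papers.
        dangling = sum(rank[i] for i in range(n) if not out_links[i])
        new_rank = [(1.0 - damping) / n + damping * dangling / n] * n
        for src in range(n):
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def weighted_kmeans(points, weights, init_centroids, iters=50):
    """K-Means where each centroid is the weighted mean of its members.

    Assumption: the weight only affects the centroid update step.
    """
    centroids = [list(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: weighted mean of the points assigned to each cluster.
        for k in range(len(centroids)):
            members = [i for i, lab in enumerate(labels) if lab == k]
            total = sum(weights[i] for i in members)
            if total > 0:  # keep the old centroid if the cluster is empty
                dim = len(centroids[k])
                centroids[k] = [
                    sum(weights[i] * points[i][d] for i in members) / total
                    for d in range(dim)
                ]
    return labels, centroids


# Toy data (hypothetical): four "papers" as 2-D feature vectors.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
# (citing, cited) pairs: papers 1 and 2 cite paper 0, paper 3 cites paper 2.
citations = [(1, 0), (2, 0), (3, 2)]

weights = pagerank(citations, n=len(points))
labels, centroids = weighted_kmeans(
    points, weights, init_centroids=[points[0], points[2]]
)
```

In this toy run the more heavily cited paper in each cluster pulls the centroid toward itself, which is the effect the weighting is meant to have. A real experiment would use term-vector representations of the papers and a less naive initialization.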
Source
《情报学报》
CSSCI
Peking University Core Journal (北大核心)
2008, No. 1, pp. 42-48 (7 pages)
Journal of the China Society for Scientific and Technical Information
Funding
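This research was supported by the subproject "Research and Application of Dynamic Monitoring Technology for Science and Technology Hotspots" of a Key Project of the National Science and Technology Support Program during the 11th Five-Year Plan (2006BAH03804), and by the 2006 Jiangsu Province Graduate Education Innovation Project.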