Document Clustering Algorithm Based on Sample Weighting (基于样本加权的文本聚类算法研究)

Cited by: 10
Abstract Sample-weighted clustering has attracted attention only recently, and several questions remain open. For example, does structural information among the objects being clustered help sample-weighted clustering, and how can that structural information be converted automatically into weights for the samples or objects? To address these questions, this paper takes academic papers as the clustering objects and K-Means as the base clustering algorithm, computes the PageRank value of each paper from the citation relationships among the papers, and uses that value as the paper's weight, yielding a new sample-weighted document clustering algorithm. Experimental results show that clustering weighted by the papers' PageRank values improves document clustering performance. The algorithm can be extended to Web page clustering, where each page's PageRank value is used as its weight to improve the clustering results.
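The idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the objective function or the update rule, so the code below rests on assumptions: a standard PageRank with a damping factor of 0.85 and uniform handling of papers without outgoing citations, and a K-Means variant in which each centroid is the weight-averaged mean of its members. The function names `pagerank` and `weighted_kmeans` and all parameters are hypothetical, and the paper's actual formulation may differ.

```python
import random


def pagerank(citations, n, damping=0.85, iterations=50):
    """Power-iteration PageRank over documents 0..n-1.

    citations: list of (citing, cited) index pairs (assumed input format).
    """
    out_links = [[] for _ in range(n)]
    for citing, cited in citations:
        out_links[citing].append(cited)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / n] * n
        for src in range(n):
            if out_links[src]:
                # Split this paper's rank among the papers it cites.
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
            else:
                # Assumption: a paper citing nothing spreads its rank evenly.
                share = damping * rank[src] / n
                for dst in range(n):
                    new_rank[dst] += share
        rank = new_rank
    return rank


def weighted_kmeans(vectors, weights, k, iterations=20, seed=0):
    """K-Means in which each centroid is the weighted mean of its members."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    dim = len(vectors[0])
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: heavily weighted documents pull the centroid harder.
        for c in range(k):
            members = [i for i in range(len(vectors)) if labels[i] == c]
            total = sum(weights[i] for i in members)
            if total > 0:
                centroids[c] = [
                    sum(weights[i] * vectors[i][d] for i in members) / total
                    for d in range(dim)
                ]
    return labels, centroids
```

In use, one would build a term vector for each paper (for example TF-IDF, which is also an assumption here), call `pagerank` on the citation pairs, and pass the resulting ranks as `weights` to `weighted_kmeans`. Frequently cited papers then have more influence on where each cluster centre ends up.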
Source Journal of the China Society for Scientific and Technical Information (《情报学报》), indexed in CSSCI and the Peking University Core Journals list, 2008, No. 1, pp. 42-48 (7 pages)
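Funding This research was supported by the sub-project "Research and Application of Dynamic Monitoring Technology for Science and Technology Hotspots" of a key project of the National Science and Technology Support Program during the 11th Five-Year Plan (2006BAH03804), and by the 2006 Jiangsu Province Graduate Education Innovation Project.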
Keywords document clustering; sample-weighted clustering; PageRank; citation frequency

