A Clustering Framework Based on Space Mapping and Rescaling (cited by: 2)

A Mapping and Rescaling Framework for Document Clustering
Abstract: Traditional clustering algorithms are usually built on an explicit model and rarely generalize that model to fit different data, so they suffer from model mismatch when the distribution of the real data does not satisfy the model assumptions. To address this problem, this paper proposes a clustering framework based on space mapping and rescaling (referred to as the M-R framework). Specifically, the M-R framework first maps the corpus into a coordinate system built from a set of highly discriminative directions, so that the distribution statistics of each cluster can be collected along the corresponding dimensions. Based on these statistics, it then rescales each coordinate axis to normalize the distribution of the clusters in line with the model assumptions. The two steps are executed iteratively along with the clustering algorithm until it converges. To evaluate the framework, the paper applies it to the traditional k-means algorithm and to the state-of-the-art spectral clustering algorithm Ncut. Experiments on well-known standard benchmark corpora show that both algorithms improve on every dataset when the M-R framework is applied.
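The loop described in the abstract (map, collect per-cluster statistics, rescale, re-cluster, repeat until convergence) can be sketched with k-means as follows. This is only an illustrative sketch, not the paper's method: the abstract does not say how the discriminative directions or the distribution statistics are computed, so the code assumes unit-length cluster centroids as the coordinate axes and the pooled within-cluster standard deviation on each axis as the rescaling factor. The function names `mr_kmeans`, `mean_rows`, and `nearest` are hypothetical.

```python
import math
import random


def mean_rows(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return dists.index(min(dists))


def mr_kmeans(X, k, n_iter=50, seed=0):
    """Toy k-means with a mapping step and a rescaling step in every iteration."""
    rng = random.Random(seed)
    centroids = [list(x) for x in rng.sample(X, k)]
    labels = [nearest(x, centroids) for x in X]

    for _ in range(n_iter):
        # Update centroids in the original space; an empty cluster keeps its old one.
        for c in range(k):
            members = [x for x, lab in zip(X, labels) if lab == c]
            if members:
                centroids[c] = mean_rows(members)

        # Mapping (ASSUMED choice): unit-length centroids act as coordinate axes.
        axes = []
        for cen in centroids:
            norm = math.sqrt(sum(v * v for v in cen)) or 1.0
            axes.append([v / norm for v in cen])

        def project(v):
            return [sum(a * b for a, b in zip(v, axis)) for axis in axes]

        Z = [project(x) for x in X]
        # The projection is linear, so a mapped centroid is the mean of its mapped cluster.
        centers = [project(cen) for cen in centroids]

        # Statistics (ASSUMED choice): pooled within-cluster std. deviation per axis.
        scale = []
        for j in range(k):
            var = sum((z[j] - centers[lab][j]) ** 2 for z, lab in zip(Z, labels)) / len(X)
            scale.append(math.sqrt(var) or 1.0)  # leave a zero-spread axis unscaled

        # Rescaling: divide every axis by its spread so clusters have comparable scale.
        Z = [[v / s for v, s in zip(z, scale)] for z in Z]
        centers = [[v / s for v, s in zip(cen, scale)] for cen in centers]

        # Ordinary k-means assignment, carried out in the mapped and rescaled space.
        new_labels = [nearest(z, centers) for z in Z]
        if new_labels == labels:
            break  # assignments are stable, so the loop has converged
        labels = new_labels

    return labels
```

For instance, `mr_kmeans([[10.0, 0.0], [10.0, 1.0], [0.0, 10.0], [1.0, 10.0]], k=2)` puts the first two points in one cluster and the last two in the other; which cluster receives label 0 depends on the random initialization. The paper's actual direction selection and normalization rules may differ from the choices assumed here.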
Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University Core Journal, 2010, No. 3, pp. 81-88 (8 pages)
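Funding: National Basic Research Program of China (973 Program), grant 2007CB311100; Key Program of the National Natural Science Foundation of China, grant 60933005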
Keywords: computer application; Chinese information processing; document clustering; space mapping; rescaling; model misfit
