A Clustering Framework Based on Space Mapping and Rescaling (一种基于空间映射及尺度变换的聚类框架). Cited by: 2

A Mapping and Rescaling Framework for Document Clustering


Abstract: Traditional clustering algorithms are usually built on explicit models and rarely generalize those models to fit different data, so they suffer from model mismatch when the distribution of the real data does not fit the model assumptions. To address this problem, this paper proposes a clustering framework based on space mapping and rescaling (referred to as the M-R framework). Specifically, the M-R framework first maps the corpus into a coordinate system built from a set of highly discriminative directions, so that the distribution statistics of each cluster can be collected on the corresponding dimension. It then rescales each coordinate axis according to these statistics to normalize the distribution of the clusters in the corpus. The two steps are performed iteratively along with the clustering algorithm until it converges. To evaluate the framework, it is applied to traditional k-means and to the state-of-the-art spectral clustering algorithm Ncut. Experiments on well-known standard benchmark corpora show that both k-means and spectral clustering achieve performance improvements on all datasets when combined with the M-R framework.
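The map-then-rescale loop described in the abstract can be illustrated with a small sketch. This is not the paper's actual procedure, which the abstract does not specify. It assumes that the discriminative directions are the unit vectors from the global mean to each cluster centroid, that rescaling means dividing each axis by the standard deviation of its own cluster along that axis, and that k-means is seeded by a simple farthest-point rule. The names `kmeans`, `farthest_point_init`, and `mr_kmeans` are hypothetical.

```python
import numpy as np


def kmeans(X, k, init, n_iter=50):
    """Plain Lloyd's k-means starting from the given centroids."""
    centers = init.copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers


def farthest_point_init(X, k, rng):
    """Assumed seeding: a random first centroid, then repeatedly the farthest point."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    return np.array(centers)


def mr_kmeans(X, k, n_rounds=10, seed=0, eps=1e-8):
    """Alternate mapping, rescaling, and k-means until the labels stop changing."""
    rng = np.random.default_rng(seed)
    labels, centers = kmeans(X, k, farthest_point_init(X, k, rng))
    for _ in range(n_rounds):
        # Mapping (assumption): one axis per cluster, pointing from the
        # global mean towards that cluster's centroid.
        mean = X.mean(axis=0)
        dirs = centers - mean
        dirs = dirs / (np.linalg.norm(dirs, axis=1, keepdims=True) + eps)
        Z = (X - mean) @ dirs.T  # shape (n_samples, k)

        # Rescaling (assumption): divide axis j by the spread of cluster j
        # measured along axis j, so clusters have comparable scale.
        scale = np.array([
            Z[labels == j, j].std() if np.any(labels == j) else 1.0
            for j in range(k)
        ])
        Z = Z / (scale + eps)

        # Re-cluster in the rescaled space, seeded with the current clusters.
        init_z = np.vstack([
            Z[labels == j].mean(axis=0) if np.any(labels == j)
            else Z[rng.integers(len(Z))]
            for j in range(k)
        ])
        new_labels, _ = kmeans(Z, k, init_z)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update the centroids in the original space for the next mapping.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

Calling `mr_kmeans(X, k)` on an `(n_samples, n_features)` float array returns one integer label per row. For documents, `X` would typically be a dense TF-IDF matrix, which is another assumption, since the abstract does not describe the document representation.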

Authors: Zeng Yiling (曾依灵), Xu Hongbo (许洪波), Wu Gaowei (吴高巍), Cheng Xueqi (程学旗), Bai Shuo (白硕)

Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; Shanghai Stock Exchange

Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2010, No. 3, pp. 81-88 (8 pages)

Funding: National 973 Basic Research Program of China (2007CB311100); Key Program of the National Natural Science Foundation of China (60933005)

Keywords: computer application; Chinese information processing; document clustering; space mapping; rescaling; model misfit

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]


