期刊文献+

基于Kolmogorov复杂性的文本聚类算法改进 被引量:4

Improved Text Clustering Algorithm Based on Kolmogorov Complexity
下载PDF
导出
摘要 基于Kolmogorov复杂性的聚类算法虽然具有普适性、参数无关性的优点,但是应用到文本内容语义信息聚类时往往准确率较低。针对这一问题,提出了一种基于特征扩展的文本聚类改进算法——DEF-KC算法。该算法通过引用百度百科中特定词条的信息,对预处理过的文本中的关键词进行特征扩展,从而提高特征词的主题贡献度,增强文本的结构辨识度,并通过选取特定压缩算法近似计算Kolmogorov复杂性得到文本相似度,最后使用谱聚类算法进行聚类。实验结果表明,与传统的基于Kolmogorov复杂性的文本聚类算法相比,使用该算法时聚类准确率和召回率均得到了较大提升。 Clustering algorithm based on Kolmogorov complexity has the advantages of generality,parameter independence,but always shows low accuracy when applied to the text semantic information clustering.In order to solve this problem,this paper proposed a text clustering algorithm based on feature extension-DEF-KC.For improving keyword's theme contribution,DEF-KC applies feature extension to the keyword in the pretreated text by referencing information of specific entry in a baidu encyclopedia,and calculates the text similarity by approximate Kolmogorov complexity of the text.Finally it clusters text using spectral clustering algorithm.The experimental results show that the proposed algorithm has much better accuracy and recall rate compared to the traditional text clustering algorithm based on Kolmogorov complexity.
出处 《计算机科学》 CSCD 北大核心 2016年第5期243-246,共4页 Computer Science
基金 国家自然科学基金(61363028)资助
关键词 Kolmogorov复杂性 文本聚类 特征扩展 谱聚类 Kolmogorov complexity Text clustering Feature extension Spectral clustering
  • 相关文献

参考文献11

  • 1明均仁.基于本体图的文本聚类模型研究[J].情报科学,2013,31(2):29-33. 被引量:6
  • 2Nikvand N,Wang Z.Generic image similarity based on Kolmo-gorov complexity[C]∥2010 17th IEEE International Conference on Image Processing(ICIP).IEEE,2010:309-312.
  • 3Zhang L,Zhuang Y,Yuan Z.A program plagiarism detection mo-del based on information distance and clustering[C]∥The 2007 International Conference on Intelligent Pervasive Computing,2007(IPC).IEEE,2007:431-436.
  • 4Ukil A.Application of Kolmogorov complexity in anomaly de-tection[C]∥2010 16th Asia-Pacific Conference on Communications(APCC).IEEE,2010:141-146.
  • 5Belabbes S,Richard G.On Using SVM and Kolmogorov Complexity for Spam Filtering[C]∥FLAIRS Conference.2008:130-135.
  • 6Geweniger T,Schleif F M,Hasenfuss A,et al.Comparison of cluster algorithms for the analysis of text data using kolmogorov complexity[M]∥Advances in Neuro-Information Processing.Springer Berlin Heidelberg,2009:61-69.
  • 7Vita,nyi,Paul M B.Information Distance in Multiples[J].IEEE Transactions on Information Theory ,2011,57(4):2451-2456.
  • 8Vitanyi P M B,Balbach F J,Cilibrasi R L,et al.Normalized information distance[M]∥Information Theory and Statistical Learning.Springer US,2009:45-82.
  • 9Tao Xiao-lei.Study of Kolmogorov Complexity Based Clustering Algorithms[D].Nanjing:Nanjing University of Aeronautics and Astronautics,2013(in Chinese).
  • 10王会青,陈俊杰.基于图划分的谱聚类方法的研究[J].计算机工程与设计,2011,32(1):289-292. 被引量:16

二级参考文献55

  • 1樊兴华,孙茂松.一种高性能的两类中文文本分类方法[J].计算机学报,2006,29(1):124-131. 被引量:70
  • 2张华平.计算所汉语词法分析系统ICTCLAS[EB/OL].[2002-08-16].http://www.nip.org.cn/project/project.php?pwj_id=6.
  • 3Meila M,Shi J.A random walks view of spectral segmentation [C]. 8th International Workshop on Artificial Intelligence and Statistics,2001.
  • 4Brand M,Kun H A.Unifying theorem for spectral embedding and clustering[C].Key West,Florida: Proceeding of the 9th International Conference on Artificial Intelligence and Statistics,2003.
  • 5Dhillon I, Guan Y, Kulis B. A unified view of kernel k-means, spectral clustering and graph cuts[R]. Technical report, UTCS, 2004.
  • 6Xing E P, Jordan M I.On semidefine relaxation for normalized k- cut and connections to spectral clustering[R].Technical Report, UCB/CSD-03-1265, EECS Department, University of California, Berkeley, 2003.
  • 7Meila M, Xu L. Multiway cuts and spectral clustering [R]. U. Washington Tech Report.2003.
  • 8NG A Y, JORDAN M I,WEISS Y.On spectral clustering: Analysis and an algorithm [C]. Proceedings of the 14th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press,2002:849-856.
  • 9HAN Jiawei,KAMBER M.Data mining: concept and techniques [M].America:Morgan Kaufmann Publishers,2001:223-260.
  • 10Zhang B,Hsu M,Dayal U.K-harmonic means - a data clustering algorithm[R].HP Technical Report, 1999.

共引文献69

同被引文献34

引证文献4

二级引证文献9

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部