
A Fast Clustering Algorithm for Abnormal and Short Texts (面向变异短文本的快速聚类算法)

Cited by: 17
Abstract: This paper proposes a fast and effective clustering algorithm for the short texts that have appeared in large numbers in chat language and mobile short messages in recent years. Because these texts are irregular in form and highly similar to one another, we call them abnormal short texts. Building on an existing web page de-duplication algorithm, the method adopts a feature string extraction scheme tailored to the characteristics of abnormal short texts and incorporates the idea of compression coding, which speeds up processing. Experiments show that a clustering system based on this algorithm can process more than a million abnormal short texts per hour with relatively high accuracy.
Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2007, No. 2, pp. 63-68 (6 pages)
Keywords: artificial intelligence; pattern recognition; retrieval; feature string; clustering
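
The abstract names two ingredients, a feature string extracted from each text and a compressed encoding of that string, without giving the extraction rule. The following is a minimal sketch of that general idea only, not the paper's method. It assumes that inserted symbols and spacing are the main source of variation, uses a hypothetical feature string (the first and last `k` normalized characters), and swaps in CRC32 as a stand-in for the compression coding step. All function names and the parameter `k` are illustrative.

```python
import unicodedata
import zlib
from collections import defaultdict


def normalize(text: str) -> str:
    """Keep only letters and digits, lowercased.

    Assumption: noise such as inserted punctuation, spaces, and
    full-width characters is what makes near-identical messages differ.
    """
    text = unicodedata.normalize("NFKC", text)  # fold full-width forms
    return "".join(ch.lower() for ch in text if ch.isalnum())


def feature_string(text: str, k: int = 4) -> str:
    """Hypothetical feature string: first k and last k normalized characters."""
    core = normalize(text)
    if len(core) <= 2 * k:
        return core
    return core[:k] + core[-k:]


def feature_code(text: str, k: int = 4) -> int:
    """Compress the feature string into a 32-bit integer key (CRC32 stand-in)."""
    return zlib.crc32(feature_string(text, k).encode("utf-8"))


def cluster(texts, k: int = 4):
    """Group texts that share a feature code, in a single pass over the input."""
    buckets = defaultdict(list)
    for t in texts:
        buckets[feature_code(t, k)].append(t)
    return list(buckets.values())


if __name__ == "__main__":
    messages = [
        "Call 555-0100 now for a FREE prize!!",
        "c a l l 5550100 now for a free prize",
        "Meeting moved to 3pm tomorrow",
    ]
    for group in cluster(messages):
        print(group)
```

Because each text is reduced to a small integer key and placed with one hash-table lookup, the cost grows linearly with the number of texts, which is the property that makes throughput on the order of millions of texts per hour plausible. A design this simple will miss variants that alter the first or last characters, so a real system would need a more robust extraction rule.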

References (11)

  • 1. 吴平博, 陈群秀, 马亮. 基于特征串的大规模中文网页快速去重算法研究 [A fast feature-string-based de-duplication algorithm for large-scale Chinese web pages][J]. 中文信息学报 (Journal of Chinese Information Processing), 2003, 17(2): 28-35. Cited by: 41
  • 2. 张刚, 刘挺, 郑实福, 车万祥, 李生. 大规模网页快速去重算法 [A fast de-duplication algorithm for large-scale web pages][A]. 中国中文信息学学会二十周年学术会论文集(续集)[C]. 2001: 18-25.
  • 3. J. W. Kirriemuir, P. Willett. Identification of duplicate and near-duplicate full-text records in database search outputs using hierarchic cluster analysis[J]. Program: Automated Library and Information Systems, 1995, 29(3): 241-256.
  • 4. 孙学刚, 陈群秀, 马亮. 基于主题的Web文档聚类研究 [Topic-based clustering of Web documents][J]. 中文信息学报 (Journal of Chinese Information Processing), 2003, 17(3): 21-26. Cited by: 31
  • 5. G. Karypis, E. H. Han, V. Kumar. Chameleon: A hierarchical clustering algorithm using dynamic modeling[J]. IEEE Computer, 1999, 32(8): 68-75.
  • 6. Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Modern Information Retrieval[M]. Addison Wesley, 2004.
  • 7. 陈儒, 张宇, 刘挺. 面向中文特定信息变异的过滤技术研究 [Filtering techniques for variations of specific Chinese information][J]. 高技术通讯, 2005, 15(9): 7-12. Cited by: 7
  • 8. 王滨华, 石志刚. 基于散列关键词的大规模网页去重算法 [A hashed-keyword-based de-duplication algorithm for large-scale web pages][J]. 高性能计算技术, 2004(5): 35-38. Cited by: 1
  • 9. Thomas H. Cormen, Charles E. Leiserson. Introduction to Algorithms[M]. Second Edition. The MIT Press, 2002.
  • 10. B. Larsen, C. Aone. Fast and Effective Text Mining Using Linear-time Document Clustering[J]. In: KDD'99, San Diego, California: 16-22.


