Abstract
This paper addresses the short texts that have appeared in large numbers in chat language and mobile phone short messages in recent years, and proposes a fast and effective clustering algorithm for them. Because these texts are irregular in form and highly similar to one another, we call them abnormal short texts. Building on an existing web-page duplicate-removal algorithm, and taking the characteristics of abnormal short texts into account, we adopt a dedicated feature string extraction method and incorporate the idea of compression coding, which speeds up processing. Experiments show that a clustering system based on this algorithm can process large volumes of abnormal short texts at a rate of more than one million per hour, with relatively high accuracy.
Source
《中文信息学报》
CSCD
Peking University Core Journal (北大核心)
2007, No. 2, pp. 63-68 (6 pages)
Journal of Chinese Information Processing
Keywords
artificial intelligence
pattern recognition
retrieval
feature string
clustering