Abstract
Short texts show little variation in word frequency and have a simple structure, so deduplication algorithms built on conventional feature selection methods are poorly suited to them. This paper proposes a deduplication algorithm tailored to short text that iteratively re-weights features. It generates a fingerprint for each short text with SimHash and clusters the fingerprints using Shared Nearest Neighbor (SNN). Features are then added to or removed from the initial feature set according to the clustering result, and the process is repeated until convergence, which yields duplicate detection for short text. Experimental results on two real datasets show that the method fits short text well and removes duplicates more effectively than existing text deduplication algorithms.
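The fingerprinting and neighbor-sharing steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the hash function, fingerprint width, neighborhood size, or the rule for adding and deleting features, so the MD5-based token hash, the 64-bit width, the parameter `k`, and all function names below are assumptions. The SNN similarity follows the common Jarvis-Patrick definition (shared neighbors are counted only when two points appear in each other's neighbor lists). The iterative feature update is left out because the abstract gives no details.

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """Weighted SimHash: every token votes +w or -w on each bit position.

    Assumption: token weight = term frequency, token hash = first
    bits/8 bytes of MD5 (so bits must be a multiple of 8, at most 128).
    """
    votes = [0] * bits
    for token, weight in Counter(tokens).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    # A bit of the fingerprint is set when its vote total is positive.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def knn(fingerprints, k):
    """For each index, the set of its k nearest other indices (Hamming)."""
    n = len(fingerprints)
    result = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: (hamming(fingerprints[i], fingerprints[j]), j),
        )
        result.append(set(others[:k]))
    return result


def snn_similarity(neighbors, i, j):
    """Shared-neighbor count, nonzero only for mutual neighbors."""
    if i in neighbors[j] and j in neighbors[i]:
        return len(neighbors[i] & neighbors[j])
    return 0
```

In such a sketch, texts whose fingerprints have a high SNN similarity would be grouped as candidate duplicates. The threshold, like `k`, would have to be tuned on the data.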
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
PKU Core Journal (北大核心)
2015, No. 12, pp. 54-57, 63 (5 pages in total)
Funding
National Key Technology R&D Program of China (2012BAH13F02)
Science and Technology Commission of Shanghai Municipality (12511502403, 12511509602)
Keywords
SimHash algorithm
Shared Nearest Neighbor(SNN)
iteration
feature selection
short text
duplicate removal