Abstract
Traditional text deduplication techniques are limited to removing items that are literally identical or highly similar, which is insufficient for some domains such as news deduplication. To remove "topic-duplicate" news reports, a two-layer short-text deduplication technique is proposed. It adds a semantic deduplication layer on top of the literal deduplication layer and combines multiple word vector space models to capture the semantic features of the text, in an attempt to detect duplicate news items that belong to the same topic. Experiments show that, compared with purely literal text deduplication, the algorithm substantially improves detection recall without reducing detection precision. The algorithm has been applied in the "View of Technology" (科技视界) news service system, with good results.
Authors
蒋旦
张翔
JIANG Dan, ZHANG Xiang (School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China)
Source
《网络新媒体技术》
2017, No. 1, pp. 45-51 (7 pages)
Network New Media Technology
Funding
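Chinese Academy of Sciences Strategic Priority Research Program project "Sea-Cloud Collaborative Real-Time Processing System for Massive Network Data Streams" (Grant No. XDA060112030)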
Keywords
text deduplication, inverted index, semantic similarity, word vector