
Research on Short-Text Duplication Detection Method Based on Semantics (cited by: 2)
Abstract: Traditional text deduplication is limited to removing information that is literally identical or highly similar, which does not meet the needs of specific domains such as news deduplication. To remove "topic-duplicate" news reports, a two-layer short-text deduplication technique is proposed. On top of a literal deduplication layer it adds a semantic deduplication layer, which combines multiple word vector space models to capture the semantic features of a text and attempts to detect duplicate news items that belong to the same topic. Experiments show that, compared with a purely literal text deduplication algorithm, the proposed algorithm substantially improves detection recall without lowering detection precision. It has been applied in the "科技视界" ("View of Technology") news service system with good results.
Authors: JIANG Dan (蒋旦), ZHANG Xiang (张翔) (School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China)
Source: Network New Media Technology (《网络新媒体技术》), 2017, No. 1, pp. 45-51 (7 pages)
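Funding: Chinese Academy of Sciences pilot project "海量网络数据流海云协同实时处理系统" (Sea-Cloud Collaborative Real-Time Processing System for Massive Network Data Streams), No. XDA060112030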
Keywords: text deduplication; inverted index; semantic similarity; word vector
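
The two-layer scheme described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's features, thresholds, or word vector models, so everything concrete here is an assumption made for illustration: character-bigram Jaccard similarity for the literal layer, an inverted index from bigrams to stored documents for candidate lookup, averaged word vectors with cosine similarity for the semantic layer, whitespace tokenization (Chinese text would need a word segmenter), and the hypothetical names `TwoLayerDeduplicator`, `MODEL_A`, and `MODEL_B` with hand-made toy vectors.

```python
import math
from collections import defaultdict

# Toy word vector "models" (hypothetical values, not trained embeddings).
MODEL_A = {"rocket": [1.0, 0.0], "launch": [0.9, 0.1], "spacecraft": [0.95, 0.05],
           "liftoff": [0.9, 0.1], "stock": [0.0, 1.0], "market": [0.1, 0.9]}
MODEL_B = {"rocket": [0.8, 0.2], "launch": [0.7, 0.3], "spacecraft": [0.8, 0.1],
           "liftoff": [0.75, 0.25], "stock": [0.1, 0.9], "market": [0.2, 0.8]}

def shingles(text, n=2):
    """Character n-grams of the lowercased, whitespace-free text."""
    text = "".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def text_vector(tokens, model):
    """Average the vectors of the tokens that one model knows."""
    vecs = [model[t] for t in tokens if t in model]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def semantic_similarity(tokens_a, tokens_b, models):
    """Mean cosine similarity over several word vector models."""
    scores = []
    for model in models:
        va, vb = text_vector(tokens_a, model), text_vector(tokens_b, model)
        if va is not None and vb is not None:
            scores.append(cosine(va, vb))
    return sum(scores) / len(scores) if scores else 0.0

class TwoLayerDeduplicator:
    """Literal layer first, semantic layer second (thresholds are assumed)."""

    def __init__(self, models, literal_threshold=0.8, semantic_threshold=0.9):
        self.models = models
        self.literal_threshold = literal_threshold
        self.semantic_threshold = semantic_threshold
        self.docs = []                 # (shingle set, token list) per kept text
        self.index = defaultdict(set)  # inverted index: shingle -> doc ids

    def check_and_add(self, text):
        """Return 'literal' or 'semantic' for a duplicate; else store, return None."""
        sh = shingles(text)
        tokens = text.lower().split()
        # Layer 1: only documents sharing at least one shingle are candidates.
        candidates = set()
        for s in sh:
            candidates |= self.index.get(s, set())
        for doc_id in candidates:
            if jaccard(sh, self.docs[doc_id][0]) >= self.literal_threshold:
                return "literal"
        # Layer 2: compare averaged word vectors against every kept text.
        for _, kept_tokens in self.docs:
            score = semantic_similarity(tokens, kept_tokens, self.models)
            if score >= self.semantic_threshold:
                return "semantic"
        doc_id = len(self.docs)
        self.docs.append((sh, tokens))
        for s in sh:
            self.index[s].add(doc_id)
        return None

if __name__ == "__main__":
    dedup = TwoLayerDeduplicator([MODEL_A, MODEL_B])
    for headline in ["rocket launch", "Rocket  launch",
                     "spacecraft liftoff", "stock market"]:
        print(headline, "->", dedup.check_and_add(headline))
```

In this sketch the semantic layer scans every kept text, which is acceptable for a toy corpus; a production system would presumably restrict candidates, for example by time window or topic keywords, but the abstract does not say how the paper handles this.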