Sentence Near-Duplicate Detection Based on Low-IDF-SIG
Abstract: With the explosive growth of data on the Internet, a large amount of duplicated data has appeared, causing serious problems for search engines, opinion mining, and many other Web applications. Most existing near-duplicate detection algorithms work at the document level and cannot effectively detect cases where only a part of two documents is copied from one another. Sentence-level near-duplicate detection is a necessary step toward solving this problem. This paper proposes a fast and effective sentence-level feature extraction method, the Low-IDF-Sig algorithm, which extracts improved Shingle features from a sentence according to selected antecedents in order to represent the sentence's content. Experimental results on a real corpus show that the proposed algorithm improves both the efficiency and the precision of sentence-level near-duplicate detection.
Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2011, No. 1, pp. 123-128 (6 pages)
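Funding: National Natural Science Foundation of China (61073069, 61003092); National High Technology Research and Development Program of China (863 Program) (2009AA01A346)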
Keywords: near-duplicate detection; feature extraction; Low-IDF-SIG
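
The abstract describes the approach only at a high level: pick a set of antecedent words, then build shingle-like features anchored at those words to represent each sentence. The following is a minimal Python sketch of that general idea, not the paper's actual procedure. Several points are assumptions: that the antecedents are the lowest-IDF (most common) words, as the algorithm's name suggests; that each feature is an antecedent followed by the next few non-antecedent words; and that sentences are compared with Jaccard similarity. The function names, the `k` and `span` parameters, and the regex tokenizer are all hypothetical; a Chinese corpus would need a proper word segmenter instead.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    # Placeholder tokenizer (assumption): lowercase runs of word characters.
    return re.findall(r"\w+", sentence.lower())


def idf_table(sentences):
    # Inverse document frequency, treating each sentence as one document.
    n = len(sentences)
    df = Counter()
    for s in sentences:
        df.update(set(tokenize(s)))
    return {w: math.log(n / c) for w, c in df.items()}


def pick_antecedents(idf, k):
    # Assumption: antecedents are the k words with the lowest IDF.
    # Ties are broken alphabetically so the result is deterministic.
    return set(sorted(idf, key=lambda w: (idf[w], w))[:k])


def signature(sentence, antecedents, span=2):
    # For every antecedent occurrence, emit a tuple made of the antecedent
    # and up to `span` following non-antecedent tokens.
    tokens = tokenize(sentence)
    sig = set()
    for i, tok in enumerate(tokens):
        if tok in antecedents:
            following = [t for t in tokens[i + 1:] if t not in antecedents][:span]
            if following:
                sig.add((tok, *following))
    return sig


def jaccard(a, b):
    # Set overlap between two signatures; defined as 0.0 when both are empty.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

In use, one would build the IDF table once over the corpus, select the antecedents, compute a signature per sentence, and flag sentence pairs whose Jaccard similarity exceeds a chosen threshold. The threshold, like `k` and `span`, would have to be tuned on real data.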