Sentence Near-Duplicate Detection Based on Low-IDF-SIG (基于Low-IDF-SIG的句子重复检测)

Abstract: With the explosive growth of data on the Internet, a large amount of duplicated content has appeared, causing serious problems for search engines, opinion mining, and many other Web applications. Most existing near-duplicate detection algorithms work at the document level and cannot effectively detect cases in which only part of one document is a copy of part of another. Sentence-level near-duplicate detection is a necessary step toward solving this problem. This paper proposes a fast and effective sentence-level feature extraction method, the Low-IDF-Sig algorithm, which extracts improved Shingle features from a sentence according to selected antecedent words in order to represent the sentence's content. Experimental results on a real corpus show that the proposed algorithm improves both the efficiency and the precision of sentence-level near-duplicate detection.
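The abstract only outlines the feature extraction step, so the following Python sketch is an illustration under stated assumptions, not the authors' implementation. It assumes that antecedents are tokens whose IDF falls below a threshold, that each signature is an antecedent followed by the next few non-antecedent tokens (in the style of SpotSigs), and that sentences are compared by Jaccard similarity over their signature sets. The function names, the `max_idf` threshold, and the `chain_len` parameter are all hypothetical.

```python
import math
from collections import Counter


def idf_table(sentences):
    """IDF per token, treating each tokenized sentence as one document."""
    n = len(sentences)
    df = Counter(tok for sent in sentences for tok in set(sent))
    return {tok: math.log(n / count) for tok, count in df.items()}


def select_antecedents(idf, max_idf):
    """Assumed selection rule: keep tokens whose IDF is at most max_idf."""
    return {tok for tok, value in idf.items() if value <= max_idf}


def low_idf_signatures(tokens, antecedents, chain_len=2):
    """Shingle-like signatures anchored at antecedent tokens.

    For each antecedent occurrence, take the next chain_len tokens that
    are not antecedents themselves. Occurrences with too few following
    tokens are skipped (an assumption of this sketch).
    """
    sigs = set()
    for i, tok in enumerate(tokens):
        if tok not in antecedents:
            continue
        chain = [t for t in tokens[i + 1:] if t not in antecedents][:chain_len]
        if len(chain) == chain_len:
            sigs.add((tok, *chain))
    return sigs


def jaccard(a, b):
    """Jaccard similarity of two signature sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

To use it, one would tokenize the sentences, build the IDF table once over the corpus, choose the antecedent set, and then flag sentence pairs whose signature sets exceed a chosen Jaccard threshold. For Chinese text a word segmenter would be needed before this step, which the sketch leaves out.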

Authors: Yu Haomin (俞昊旻), Zhang Yue (张玥), Zhang Qi (张奇), Huang Xuanjing (黄萱菁)

Affiliation: School of Computer Science and Technology, Fudan University

Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University Core Journal, 2011, No. 1, pp. 123-128 (6 pages)

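Funding: National Natural Science Foundation of China (61073069, 61003092); National High Technology Research and Development Program of China (863 Program) (2009AA01A346)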

Keywords: near-duplicate detection; feature extraction; Low-IDF-SIG

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]


