Natural Language Text Copy Detection Algorithm
Abstract: Copy detection checks whether documents contain duplicated content and reports the results to the user. The algorithm presented in this article combines two copy detection techniques: fingerprint comparison and word-frequency statistics. The text is first preprocessed, for example by filtering out prepositions and articles, and fingerprint comparison is then used to judge the similarity between natural paragraphs. Next, each paragraph is treated as a small unit of the whole document and assigned a weight, and a weighted statistic based on word occurrence frequencies is used to judge the similarity of the full text.
Author: Yang Da (杨达)
Source: Computer and Information Technology (《电脑与信息技术》), 2014, Issue 4, pp. 11-14 (4 pages)
Keywords: copy detection; text fingerprint; word frequency
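
The abstract outlines two stages: fingerprint comparison between paragraphs after preprocessing, then a weighted word-frequency statistic over the whole document. The Python sketch below illustrates one way such a pipeline could look; it is not the author's implementation. The stop-word list, the use of hashed 3-word shingles as the fingerprint, the Jaccard overlap, the cosine measure, and the length-based paragraph weights are all assumptions made for illustration, since the abstract does not specify them.

```python
import hashlib
import re
from collections import Counter
from math import sqrt

# Assumed stop-word list (illustrative only; the paper does not give its own).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "for", "with", "by", "from"}


def preprocess(paragraph):
    """Lowercase, tokenize, and drop stop words such as prepositions and articles."""
    tokens = re.findall(r"[a-z0-9]+", paragraph.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def fingerprint(tokens, k=3):
    """Hash every run of k consecutive tokens; the set of hashes is the fingerprint."""
    if not tokens:
        return set()
    if len(tokens) < k:
        grams = [" ".join(tokens)]
    else:
        grams = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    return {hashlib.sha1(g.encode("utf-8")).hexdigest() for g in grams}


def paragraph_similarity(p1, p2, k=3):
    """Jaccard overlap of two paragraph fingerprints, in the range [0, 1]."""
    f1 = fingerprint(preprocess(p1), k)
    f2 = fingerprint(preprocess(p2), k)
    if not f1 or not f2:
        return 0.0
    return len(f1 & f2) / len(f1 | f2)


def cosine(c1, c2):
    """Cosine similarity between two word-frequency Counters."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0


def document_similarity(doc1, doc2):
    """Weighted word-frequency similarity of doc1 against doc2.

    Each document is a list of paragraph strings. Every paragraph of doc1 is
    weighted by its share of doc1's tokens (an assumed weighting) and scored
    by its best word-frequency match among the paragraphs of doc2.
    """
    paras1 = [preprocess(p) for p in doc1]
    freqs2 = [Counter(preprocess(p)) for p in doc2]
    total = sum(len(tokens) for tokens in paras1)
    if total == 0 or not freqs2:
        return 0.0
    score = 0.0
    for tokens in paras1:
        weight = len(tokens) / total
        best = max(cosine(Counter(tokens), f) for f in freqs2)
        score += weight * best
    return score
```

In this sketch, `paragraph_similarity` would flag near-verbatim paragraphs, while `document_similarity` gives a single score for the full text. Note that `document_similarity` is asymmetric: it measures how much of the first document is covered by the second.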