
Research on the Copy Detection Based on the Similarity of Sentences (基于句子相似度的文档复制检测算法研究). Cited by: 9
Abstract: This paper proposes a document copy detection technique based on sentence similarity. The technique captures a document's global features while also taking its structural information into account. Earlier detection algorithms could not handle both at once; addressing that limitation improves detection accuracy. The paper closes by comparing the detection results of the new algorithm with those of other typical algorithms. The experiments show that the algorithm is feasible.
Author: Qin Xinguo (秦新国)
Source: New Technology of Library and Information Service (《现代图书情报技术》), indexed in CSSCI and the Peking University Core Journals list, 2007, No. 11, pp. 63-66 (4 pages)
Keywords: Document copy detection; Sentence similarity; Fingerprints
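The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of sentence-level copy detection, not the author's method. It assumes sentences are compared by Jaccard similarity over character bigrams (a stand-in for the fingerprints named in the keywords), and the function names, the bigram size, and the 0.6 threshold are all hypothetical choices made for illustration.

```python
import re


def split_sentences(text):
    """Split text on common Chinese and English sentence terminators."""
    parts = re.split(r"[。!?!?.]+", text)
    return [p.strip() for p in parts if p.strip()]


def shingles(sentence, n=2):
    """Return the set of character n-grams of a sentence, ignoring whitespace."""
    s = re.sub(r"\s+", "", sentence.lower())
    if len(s) <= n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def sentence_similarity(a, b):
    """Jaccard similarity of the two sentences' n-gram sets (assumed measure)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def copied_fraction(suspect, source, threshold=0.6):
    """Fraction of suspect sentences whose best match in source meets the threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    suspect_sents = split_sentences(suspect)
    source_sents = split_sentences(source)
    if not suspect_sents or not source_sents:
        return 0.0
    hits = sum(
        1 for s in suspect_sents
        if max(sentence_similarity(s, t) for t in source_sents) >= threshold
    )
    return hits / len(suspect_sents)
```

Working at the sentence level keeps some of the document's structure in the comparison, while the aggregated fraction gives a single document-level score. A real system would likely hash the n-grams into fingerprints and index them rather than compare every sentence pair.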











