
Research Progress and New Methods in Duplication Detection of Academic Papers (cited by: 1)

Review and New Ideas on Duplication Detection of Articles
Abstract This paper reviews the state of research, in China and abroad, on duplication detection for academic papers, with attention to the problems of existing retrieval models and text similarity algorithms, and proposes new research directions aimed at improving recall and precision: (1) building a corpus of academic papers for a single discipline; (2) using information theory to build a statistical language model for that discipline on top of the corpus; (3) designing a similarity algorithm that reflects how academic papers are plagiarized, by assigning a different weight function to each metadata item that describes the semantics of a resource; (4) testing the model and algorithm with the Lemur toolkit on standard TREC document collections; and (5) comparing the results experimentally with the Turnitin plagiarism detection system to evaluate the efficiency and effectiveness of the model and algorithm.
Author Wang Xiuhong (王秀红)
Source Library and Information Service (《图书情报工作》), indexed in CSSCI and the Peking University Core Journals list, 2009, No. 5, pp. 111-114 (4 pages)
Funding One of the research outputs of the Jiangsu University Doctoral Student Innovation Fund project "Detection Models and Algorithms for Plagiarism in Academic Papers" (project No. CX08B-18X)
Keywords academic papers; duplication detection; plagiarism detection; statistical language model; text similarity algorithm
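
The abstract's idea of giving different metadata items different weights when computing similarity can be illustrated with a minimal sketch. It is not the author's algorithm: the field names, the constant weights in `FIELD_WEIGHTS`, and the use of plain term-frequency cosine similarity are all assumptions made for illustration, and the paper's weight functions may well differ.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between whitespace-tokenized term-frequency vectors."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ta[t] * tb[t] for t in ta.keys() & tb.keys())
    norm_a = math.sqrt(sum(v * v for v in ta.values()))
    norm_b = math.sqrt(sum(v * v for v in tb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


# Hypothetical constant weights per metadata field; the source gives no values.
FIELD_WEIGHTS = {"title": 0.2, "abstract": 0.3, "keywords": 0.1, "body": 0.4}


def weighted_similarity(doc_a: dict, doc_b: dict, weights: dict = FIELD_WEIGHTS) -> float:
    """Weighted average of per-field cosine similarities.

    Only fields present in both documents contribute, and the result is
    renormalized by the sum of the weights actually used.
    """
    total = 0.0
    weight_sum = 0.0
    for field, w in weights.items():
        if field in doc_a and field in doc_b:
            total += w * cosine(doc_a[field], doc_b[field])
            weight_sum += w
    return total / weight_sum if weight_sum else 0.0
```

Calling `weighted_similarity(paper_a, paper_b)` on two dictionaries keyed by field name returns a score between 0 and 1. Renormalizing by the weights actually used keeps scores comparable when some fields are missing, which is one possible design choice among several.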












