Abstract
Reviews the current state of research on duplication detection for academic papers in China and abroad and, in view of the open problems, proposes new directions for future work aimed at improving recall and precision: building a corpus of academic papers in a given discipline; using information theory as a tool to build a statistical language model for that discipline based on the corpus; designing a similarity algorithm, tailored to the characteristics of academic-paper plagiarism, that assigns different weight functions to the metadata elements describing the semantic content of a resource; testing the model and algorithm with the Lemur toolkit on standard TREC document collections; and comparing against the Turnitin plagiarism detection system to evaluate the efficiency and effectiveness of the detection computation.
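The similarity computation outlined above, in which different metadata elements receive different weights, can be sketched as follows. This is a minimal illustration only, not the authors' algorithm: the field names, weight values, and the use of cosine similarity over bag-of-words vectors are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def metadata_similarity(doc_a: dict, doc_b: dict, weights: dict) -> float:
    # Weighted sum of per-field similarities; each metadata element
    # contributes according to its (hypothetical) weight function.
    score = 0.0
    for field, w in weights.items():
        terms_a = Counter(doc_a.get(field, "").lower().split())
        terms_b = Counter(doc_b.get(field, "").lower().split())
        score += w * cosine(terms_a, terms_b)
    return score

# Hypothetical weights: title overlap counts more than abstract overlap.
WEIGHTS = {"title": 0.5, "keywords": 0.3, "abstract": 0.2}

doc1 = {"title": "plagiarism detection model",
        "keywords": "plagiarism detection language model",
        "abstract": "a statistical language model for duplication detection"}
doc2 = {"title": "plagiarism detection model",
        "keywords": "text similarity algorithm",
        "abstract": "text similarity for plagiarism detection"}
print(round(metadata_similarity(doc1, doc2, WEIGHTS), 3))
```

With weights summing to 1, the score stays in [0, 1]; identical records score 1.0, disjoint records 0.0.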
Source
《图书情报工作》 (Library and Information Service)
CSSCI; Peking University Core Journal (北大核心)
2009, Issue 5, pp. 111-114 (4 pages)
Funding
One of the research outputs of the Jiangsu University Doctoral Innovation Fund project "A Model and Algorithm for Academic Paper Plagiarism Detection" (Project No. CX08B-18X).
Keywords
academic papers; duplication detection; plagiarism detection; statistical language model; text similarity algorithm