Abstract
Digital fingerprinting is an active research area in copyright protection. This paper proposes a text digital fingerprint based on Chinese word frequency for detecting plagiarism in Chinese-language documents. Word and character frequencies are counted over a reference corpus to build a hash word list. Using this list, a word-frequency fingerprint is generated for text of any length based on the principle of maximum entropy. The similarity of any two documents can then be estimated by computing the Hamming distance between their two fingerprints. The hash word list was built from the Wikipedia zhwiki-20121129-all-titles corpus, and experiments were run on four core journals in information science. The results indicate that the fingerprint identifies common forms of plagiarism well and is highly robust.
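The comparison step described in the abstract can be sketched in Python. This is only an illustration under stated assumptions, not the paper's algorithm: the abstract does not give the fingerprint length, the hash function, or the maximum-entropy construction, so the sketch substitutes a SimHash-style weighted bit vote for the fingerprint. The 64-bit length, the MD5-based `word_hash`, and the normalization in `similarity` are all hypothetical choices, and the input is assumed to be already segmented into words.

```python
import hashlib
from collections import Counter

BITS = 64  # assumed fingerprint length; the abstract does not state one


def word_hash(word: str) -> int:
    """Stable 64-bit hash of a word (MD5 prefix; a hypothetical choice)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint(words: list[str]) -> int:
    """SimHash-style stand-in for the paper's fingerprint.

    Each distinct word votes on every bit position, weighted by how
    often it occurs; a bit is set when its total vote is positive.
    """
    votes = [0] * BITS
    for word, count in Counter(words).items():
        h = word_hash(word)
        for i in range(BITS):
            votes[i] += count if (h >> i) & 1 else -count
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a: int, b: int) -> float:
    """Assumed normalization: 1.0 for identical fingerprints, 0.0 for all bits differing."""
    return 1.0 - hamming(a, b) / BITS
```

Because the fingerprint depends only on word counts, reordering the words of a text leaves it unchanged, which is one reason a frequency-based fingerprint can tolerate sentence shuffling.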
Source
Library and Information Service (《图书情报工作》)
CSSCI
Peking University Core Journals (北大核心)
2013, No. 15, pp. 88-95 (8 pages)
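Funding
This work is one of the research outputs of the following projects:
National Social Science Fund of China project "Research on Detecting 'Idea Plagiarism' (意抄) in Academic Literature" (Grant No. 12CTQ032)
Shandong Provincial Natural Science Foundation project "Research on Parallel Processing and Semantic Classification of Large-Scale Academic Literature" (Grant No. ZR2011GL025)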
Keywords
digital fingerprint
plagiarism detection
principle of maximum entropy