
Literature Similarity Detection Based on Digital Fingerprint (基于数字指纹的文献相似度检测研究), cited by: 7
Abstract: Digital fingerprinting, a copyright protection technique, is an active research area. For plagiarism detection in Chinese literature, this paper proposes a text digital fingerprint based on Chinese word frequency. Word and character frequencies are counted over a reference corpus to form a hash word table. From this table, a fingerprint based on word-frequency features is generated for text of any length, following the principle of maximum entropy. For any two documents, the Hamming distance between the two corresponding fingerprints gives an estimate of their similarity. The hash word table was built from the Wikipedia zhwiki-20121129-all-titles corpus, and experiments were run on four core journals in information science. The results show that the fingerprint identifies and detects common forms of plagiarism well and is highly robust.
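Source: Library and Information Service (《图书情报工作》), CSSCI, PKU Core (北大核心), 2013, No. 15, pp. 88-95 (8 pages)
Funding: One of the research outputs of the National Social Science Fund of China project "Research on Detecting 'Idea Plagiarism' in Academic Literature" (No. 12CTQ032) and the Shandong Provincial Natural Science Foundation project "Research on Parallel Processing and Semantic Classification of Large-Scale Academic Literature" (No. ZR2011GL025)
Keywords: digital fingerprint; plagiarism detection; principle of maximum entropy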
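
The comparison step described in the abstract, estimating similarity from the Hamming distance between two fingerprints, can be sketched as follows. This is a minimal illustration and not the paper's implementation: the abstract does not say how long the fingerprints are or how distance is converted to similarity, so the integer bit-string representation, the `bits` parameter, and the `1 - distance / bits` mapping are all assumptions. The maximum-entropy fingerprint generation itself is not shown.

```python
def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Count the bit positions at which two fingerprints differ."""
    # XOR leaves a 1 exactly where the two fingerprints disagree.
    return bin(fp_a ^ fp_b).count("1")


def similarity(fp_a: int, fp_b: int, bits: int = 64) -> float:
    """Map Hamming distance to a similarity estimate in [0, 1].

    Assumption: fingerprints are `bits` wide and similarity is
    1 - distance / bits; the paper may use a different mapping.
    """
    if bits <= 0:
        raise ValueError("bits must be positive")
    return 1.0 - hamming_distance(fp_a, fp_b) / bits


if __name__ == "__main__":
    # Two hypothetical 8-bit fingerprints that differ in two positions.
    a = 0b10110100
    b = 0b10010110
    print(hamming_distance(a, b))
    print(similarity(a, b, bits=8))
```

Identical fingerprints give a similarity of 1.0, and each differing bit lowers the estimate by `1 / bits`, so a detection threshold would be chosen relative to the fingerprint length.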
