Application of Simhash Algorithm in Examination Questions Checking (cited by: 1)
Abstract: With the rise of online education platforms, duplicate checking of examination questions has emerged as a way to reduce the storage cost of very large question banks. This paper proposes applying an improved Simhash algorithm to examination question duplicate checking. First, the question text is segmented into words with the Jieba (结巴) word segmenter. Next, keyword weights are computed with TF-IDF combined with each word's part of speech and word length, with the aim of computing the Simhash signature value more accurately. Finally, similar questions are detected by Hamming distance with an index. Experimental results verify the feasibility of the scheme.
Source: Software Guide (《软件导刊》), 2018, No. 2, pp. 151-153, 157 (4 pages in total)
Keywords: examination checking; Simhash algorithm; Hamming distance; signature value
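
The pipeline summarized in the abstract (segment, weight keywords, compute a Simhash signature, compare by indexed Hamming distance) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the record does not give the weighting formula, so `toy_weights` is a hypothetical stand-in that uses only term frequency and word length (no IDF, no part of speech); the 64-bit signature width, the MD5-based token hash, the distance threshold of 3, and the block-based `SimhashIndex` are likewise assumptions chosen for illustration. Tokens are taken as given; whitespace splitting stands in for Jieba segmentation.

```python
import hashlib
from collections import Counter, defaultdict

BITS = 64  # assumed signature width; the record does not state one


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def toy_weights(tokens):
    """Hypothetical weighting: term frequency scaled by word length.

    Stand-in for the paper's TF-IDF + part-of-speech + word-length weight,
    whose exact formula is not given in this record.
    """
    tf = Counter(tokens)
    return {t: c * (1.0 + 0.1 * len(t)) for t, c in tf.items()}


def simhash(weights) -> int:
    """Weighted Simhash: each token votes +w or -w on every bit position."""
    votes = [0.0] * BITS
    for token, w in weights.items():
        h = token_hash(token)
        for i in range(BITS):
            votes[i] += w if (h >> i) & 1 else -w
    # A bit is set in the signature when its accumulated vote is positive.
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


class SimhashIndex:
    """Block index based on the pigeonhole principle.

    With more blocks than the distance threshold, two signatures within the
    threshold must agree exactly on at least one block, so only signatures
    sharing a block need a full Hamming comparison.
    """

    def __init__(self, max_distance=3, blocks=4):
        assert blocks > max_distance
        self.k = max_distance
        self.blocks = blocks
        self.width = BITS // blocks
        self.buckets = defaultdict(set)
        self.signatures = {}

    def _keys(self, sig):
        mask = (1 << self.width) - 1
        return [(b, (sig >> (b * self.width)) & mask) for b in range(self.blocks)]

    def add(self, doc_id, sig):
        self.signatures[doc_id] = sig
        for key in self._keys(sig):
            self.buckets[key].add(doc_id)

    def query(self, sig):
        """Return ids of stored signatures within max_distance of sig."""
        candidates = set()
        for key in self._keys(sig):
            candidates |= self.buckets.get(key, set())
        return sorted(d for d in candidates
                      if hamming(self.signatures[d], sig) <= self.k)
```

A question bank would be loaded by calling `index.add(question_id, simhash(toy_weights(tokens)))` for each question, and a new question checked with `index.query(...)` on its own signature. For Chinese text, the token list would presumably come from a segmenter such as `jieba.cut` (or `jieba.posseg.cut` when part-of-speech tags are needed for the weights) rather than from whitespace splitting.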
