A plagiarism detection approach for Chinese documents based on semantic textual similarity
Abstract: To counter edits made to evade plagiarism detection, such as synonym substitution and paraphrasing, we propose a plagiarism detection approach for Chinese documents based on semantic textual similarity. First, each document is split into sentences, and word2vec is used to represent every word of a sentence as a vector, which serves as the input to a convolutional neural network (CNN). The CNN then extracts and filters sentence features, computes the difference between the two sentences of a pair, and outputs their similarity; sentence pairs with high similarity are treated as plagiarism candidates. Finally, copy-and-paste documents and semantically similar documents are used as data to compare the proposed method with the traditional moving-window fingerprint feature extraction method. The proposed method is tested on a large, publicly available Tencent Cloud text similarity dataset and applied to plagiarism detection in student homework. The results show that the traditional fingerprint method accurately finds identical fragments in two documents but is sensitive to noise when the text is only semantically similar, whereas the proposed approach can identify the semantically similar parts of documents.
Authors: HU Buhuan (胡布焕); ZHANG Jing (张晶); ZHANG Ling (张凌) (Guangdong Province Key Laboratory of Computer Network, College of Computer Science and Technology, South China University of Technology, Guangzhou 510006, Guangdong Province, P. R. China)
Source: Journal of Shenzhen University (Science and Engineering) (《深圳大学学报(理工版)》), 2020, Issue S01, pp. 107-111 (5 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list (北大核心).
Funding: Supported by a China Education and Research Network (中国教育和科研计算机网) project (NGII20190615).
Keywords: computer science; natural language processing; plagiarism detection; semantic similarity; word vector representation
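
The contrast drawn in the abstract, between a moving-window fingerprint and a similarity score computed from word vectors, can be illustrated with a minimal sketch. This is not the authors' implementation. The hand-written three-dimensional `TOY_VECTORS` table is a hypothetical stand-in for a trained word2vec model. Averaging word vectors and taking the cosine is a deliberately simple substitute for the CNN described in the abstract. The winnowing parameters `k` and `w` are assumptions, since the record does not give the baseline's settings. Sentences are assumed to be already segmented into words.

```python
import math
import re
import zlib

# Hypothetical toy word vectors; a real system would load a trained word2vec model.
TOY_VECTORS = {
    "学生": (1.0, 0.0, 0.0),
    "抄袭": (0.0, 1.0, 0.1),
    "剽窃": (0.0, 0.9, 0.2),  # near-synonym of 抄袭, so its vector is close
    "作业": (0.0, 0.0, 1.0),
}


def split_sentences(text):
    """Split a document into sentences on Chinese or Western end punctuation."""
    return [s.strip() for s in re.split(r"[。！？!?]+", text) if s.strip()]


def winnow(text, k=2, w=2):
    """Moving-window fingerprint: keep the minimum k-gram hash of every window of w hashes."""
    hashes = [zlib.crc32(text[i:i + k].encode("utf-8"))
              for i in range(len(text) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) <= w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def jaccard(a, b):
    """Overlap of two fingerprint sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sentence_vector(tokens, vectors=TOY_VECTORS):
    """Average the vectors of known tokens (a simple stand-in for CNN features)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return (0.0, 0.0, 0.0)
    return tuple(sum(dim) / len(known) for dim in zip(*known))


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is the zero vector."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


if __name__ == "__main__":
    print(split_sentences("学生抄袭作业。今天天气很好！"))
    original = ["学生", "抄袭", "作业"]
    rewritten = ["学生", "剽窃", "作业"]  # synonym substitution
    print("fingerprint overlap:",
          jaccard(winnow("".join(original)), winnow("".join(rewritten))))
    print("vector similarity:",
          cosine(sentence_vector(original), sentence_vector(rewritten)))
```

With these toy inputs, replacing one word with a synonym changes the character k-grams around it, so the fingerprint overlap drops, while the averaged vectors stay close. In practice a similarity threshold for flagging a sentence pair would have to be tuned on labeled data.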