Abstract
Copy detection determines whether documents contain duplicated content and reports the results to the user. The algorithm in this article combines two copy-detection techniques: fingerprint comparison and word-frequency statistics. The text is first preprocessed, for example by filtering out prepositions and articles, and fingerprint comparison is then used to judge the similarity between paragraphs. Each paragraph is then treated as a small unit of the whole document and assigned a weight, and a weighted statistical method based on word frequency is used to judge the similarity of the full text.
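The pipeline outlined in the abstract might be sketched roughly as follows. This is only an illustration under stated assumptions, not the article's implementation: the abstract does not give the stop-word list, the fingerprinting scheme, the paragraph weights, or the similarity formulas. Here the stop-word set is a small hypothetical sample, fingerprints are assumed to be MD5 hashes of 3-word shingles compared by Jaccard overlap, paragraph weights are assumed to be each paragraph's share of the document's words, and word-frequency similarity is assumed to be cosine similarity. All function names are hypothetical.

```python
import hashlib
import re
from collections import Counter
from math import sqrt

# Assumption: a tiny sample stop-word list (articles and prepositions).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "for", "with", "by", "from"}


def preprocess(paragraph):
    """Lowercase, split into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", paragraph.lower())
    return [w for w in words if w not in STOP_WORDS]


def fingerprint(words, k=3):
    """Assumed scheme: the set of MD5 hashes of all k-word shingles."""
    if not words:
        return set()
    count = max(1, len(words) - k + 1)
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(count)
    }


def paragraph_similarity(p1, p2, k=3):
    """Jaccard overlap of the two paragraphs' fingerprint sets."""
    f1 = fingerprint(preprocess(p1), k)
    f2 = fingerprint(preprocess(p2), k)
    if not f1 or not f2:
        return 0.0
    return len(f1 & f2) / len(f1 | f2)


def cosine(c1, c2):
    """Cosine similarity of two word-frequency Counters."""
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0


def document_similarity(doc_a, doc_b):
    """Weighted word-frequency similarity of doc_a against doc_b.

    Assumption: paragraphs are separated by blank lines, and each
    paragraph of doc_a is weighted by its share of doc_a's words.
    """
    paras_a = [p for p in (preprocess(s) for s in doc_a.split("\n\n")) if p]
    paras_b = [p for p in (preprocess(s) for s in doc_b.split("\n\n")) if p]
    total = sum(len(p) for p in paras_a)
    if total == 0 or not paras_b:
        return 0.0
    score = 0.0
    for p in paras_a:
        weight = len(p) / total
        # Score each paragraph against its best match in the other document.
        best = max(cosine(Counter(p), Counter(q)) for q in paras_b)
        score += weight * best
    return score
```

In this sketch, `paragraph_similarity` would be used to flag individual paragraphs that look copied, while `document_similarity` gives a single score for the full text. Any threshold for reporting a match would have to be chosen by the user, since the abstract does not state one.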
Source
Computer and Information Technology (《电脑与信息技术》)
2014, No. 4, pp. 11-14 (4 pages)
Keywords
copy detection
text fingerprint
word frequency