Abstract
To improve the retrieval efficiency of search engines and strengthen intellectual property protection, this paper proposes a duplication-check algorithm for Chinese texts that combines results from Chinese linguistics with natural language processing. By introducing the concept of the "verb as headword" and extending the scope of stop words, selected verbs in a text are assembled into verb sequences that serve as feature strings; applying a string-matching algorithm to these sequences yields the grammatical similarity between Chinese texts. In parallel, text features are extracted and weighted with TF-IDF to compute the semantic similarity between the texts. Combining the grammatical and semantic similarities gives an overall similarity score, which can be used to judge whether two Chinese documents share the same content and to detect duplicate submissions effectively.
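The following is a minimal sketch of the two-part similarity described in the abstract, assuming jieba for Chinese segmentation and POS tagging and scikit-learn for TF-IDF. The concrete choices here (difflib's SequenceMatcher as the string matcher, cosine similarity over TF-IDF vectors, and the alpha weighting) are illustrative stand-ins, not the paper's exact formulas.

```python
# Sketch of the abstract's idea: grammatical similarity from verb sequences
# plus semantic similarity from TF-IDF, combined into one score.
# Assumptions: jieba for segmentation/POS tagging; SequenceMatcher, cosine
# similarity, and the alpha weight are illustrative, not the paper's method.
from difflib import SequenceMatcher

import jieba.posseg as pseg
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def verb_sequence(text: str) -> str:
    """Keep only verbs (POS tags starting with 'v') as the feature string."""
    return " ".join(w.word for w in pseg.cut(text) if w.flag.startswith("v"))


def grammatical_similarity(a: str, b: str) -> float:
    """String-match the two verb sequences (stand-in for the paper's matcher)."""
    return SequenceMatcher(None, verb_sequence(a), verb_sequence(b)).ratio()


def semantic_similarity(a: str, b: str) -> float:
    """Cosine similarity over TF-IDF vectors of the segmented texts."""
    def segment(t: str) -> str:
        return " ".join(w.word for w in pseg.cut(t))

    # token_pattern keeps single-character Chinese words that the default drops.
    tfidf = TfidfVectorizer(token_pattern=r"(?u)\S+").fit_transform(
        [segment(a), segment(b)]
    )
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])


def duplication_score(a: str, b: str, alpha: float = 0.5) -> float:
    """Weighted combination of grammatical and semantic similarity in [0, 1]."""
    return alpha * grammatical_similarity(a, b) + (1 - alpha) * semantic_similarity(a, b)
```

In use, two documents would be flagged as a suspected duplicate when `duplication_score` exceeds some threshold chosen on labeled data; the threshold and the weight `alpha` are tuning parameters not specified in the abstract.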
Source
Computer Simulation (《计算机仿真》)
CSCD
2007, No. 12, pp. 312-314 (3 pages)
Keywords
Grammatical similarity
Semantic similarity
Duplication check