Abstract
Objective: To construct a new text duplicate-detection algorithm that modifies the traditional Shingling web-page deduplication algorithm, in order to improve the accuracy of text similarity computation and to raise precision and recall. Methods: The traditional Shingling algorithm is modified as follows: meaningless function words are first deleted from the text, the text is then segmented according to its semantics, and a text similarity formula is used to compute the similarity between texts. Results: The algorithm improves the accuracy of text similarity computation; precision increases by about 10% and recall by about 5%. Conclusion: Experiments show that the proposed algorithm is simple to implement, feasible, and effective for text similarity computation, and that it has certain advantages over the traditional approach.
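The three steps named in the abstract (remove function words, segment by meaning, compute similarity) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not give the function-word list, the segmentation rule, or the similarity formula. Here the list `FUNCTION_WORDS` is hypothetical, clause punctuation stands in for semantic segmentation, and the Jaccard resemblance used by classic Shingling stands in for the paper's formula. The helper names `semantic_chunks` and `resemblance` are likewise invented for illustration.

```python
import re

# Hypothetical list of function words; the paper's actual list is not given.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are"}

# Assumption: clause-level punctuation approximates a semantic boundary.
CLAUSE_BOUNDARY = re.compile(r"[.,;:!?。,;:!?]+")


def semantic_chunks(text):
    """Split text into clauses and drop function words from each clause."""
    chunks = set()
    for clause in CLAUSE_BOUNDARY.split(text.lower()):
        # Whitespace tokenization; Chinese text would need a word segmenter.
        words = [w for w in clause.split() if w not in FUNCTION_WORDS]
        if words:
            chunks.add(" ".join(words))
    return chunks


def resemblance(text_a, text_b):
    """Jaccard resemblance of the two chunk sets (assumed similarity formula)."""
    a, b = semantic_chunks(text_a), semantic_chunks(text_b)
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc1 = "The cat sat on the mat, and the dog barked."
    doc2 = "A cat sat on a mat; the bird sang."
    print(semantic_chunks(doc1))
    print(resemblance(doc1, doc2))
```

Because function words are stripped before comparison, the clauses "The cat sat on the mat" and "A cat sat on a mat" reduce to the same chunk, which is the kind of match a fixed-length word shingle over the raw text would miss.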
Source
Journal of Shenyang Jianzhu University (Natural Science) (《沈阳建筑大学学报(自然科学版)》)
Indexed in: CAS, PKU Core (北大核心)
2011, No. 4, pp. 771-775 (5 pages)
Funding
Fund project of the Education Department of Liaoning Province (L2010449)