Abstract
This paper studies an efficient duplicate-detection (replicas detection) algorithm for text information. It automatically groups and ranks similar listings on e-commerce websites, which substantially improves the efficiency and accuracy of information review. Tests show that the algorithm is highly effective when the number of items is between 100 and 1,000: the pairwise comparison of 1,000 text items completes within 2 seconds. Beyond 1,000 items, computation time rises sharply. Precision can be tuned by adjusting the relevant parameters of the algorithm. For very short items (fewer than 10 characters), the algorithm can be combined with the Levenshtein algorithm to make the duplicate detection more flexible.
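The abstract does not describe the main algorithm, so the following is only a minimal sketch of the short-text fallback it mentions: comparing items of fewer than 10 characters by Levenshtein (edit) distance. The names `levenshtein`, `similarity`, `short_duplicates`, and `SHORT_TEXT_LIMIT`, as well as the 0.8 threshold and the length normalization, are illustrative assumptions and are not taken from the paper.

```python
SHORT_TEXT_LIMIT = 10  # "fewer than 10 characters", per the abstract


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # iterate the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def short_duplicates(items, threshold=0.8):
    """Return index pairs (i, j) of short items whose similarity reaches threshold."""
    pairs = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i], items[j]
            if len(a) < SHORT_TEXT_LIMIT and len(b) < SHORT_TEXT_LIMIT:
                if similarity(a, b) >= threshold:
                    pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    listings = ["red dress", "red dres", "blue jean"]
    print(short_duplicates(listings))
```

Raising or lowering `threshold` is one simple way to trade precision against recall in this sketch. The comparison is quadratic in the number of items, so it is only meant for the short-text subset, not for the full listing set.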
Source
Computer Applications and Software
CSCD
2009, Issue 1, pp. 197-199 (3 pages)
Keywords
Replicas detection
Algorithm
E-commerce