Abstract
With the rapid development of Internet technology, online forums (BBS) have become an important place for people to obtain information and express opinions. However, the large volume of repeated comments has become a new problem for systems that acquire and supervise public opinion information on forums, so effective detection and removal of repeated comments is crucial. Because repeated comments within a given period are numerous, dense, and highly similar in content, a repeated-comment detection method based on SHA-1 is proposed. The method computes the Hash values of each comment at sentence and paragraph granularity, counts the identical fingerprints in the Hash table to measure the similarity between comments, and finally applies a given similarity threshold to decide whether a comment is a repeat. Experimental results show that the method detects and removes repeated comments effectively and outperforms traditional methods.
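The pipeline described in the abstract (fingerprint each sentence and paragraph with SHA-1, count shared fingerprints, compare against a threshold) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the similarity formula, so the ratio of shared fingerprints to the smaller fingerprint set is an assumption, as are the sentence delimiters, the function names, and the default threshold of 0.6.

```python
import hashlib
import re
from typing import Set

# Assumed delimiters; the paper's actual segmentation rules are not stated.
_PARAGRAPH_SPLIT = re.compile(r"\n+")
_SENTENCE_SPLIT = re.compile(r"[。！？!?；;.]+")


def fingerprints(comment: str) -> Set[str]:
    """SHA-1 fingerprints of every paragraph and every sentence in a comment."""
    blocks = []
    for paragraph in _PARAGRAPH_SPLIT.split(comment):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        blocks.append(paragraph)  # paragraph-level block
        for sentence in _SENTENCE_SPLIT.split(paragraph):
            sentence = sentence.strip()
            if sentence:
                blocks.append(sentence)  # sentence-level block
    return {hashlib.sha1(b.encode("utf-8")).hexdigest() for b in blocks}


def similarity(a: str, b: str) -> float:
    """Shared fingerprints divided by the smaller set size (assumed formula)."""
    fa, fb = fingerprints(a), fingerprints(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / min(len(fa), len(fb))


def is_repeated(a: str, b: str, threshold: float = 0.6) -> bool:
    """Flag two comments as repeats when similarity reaches the threshold.

    The default threshold is a placeholder, not a value from the paper.
    """
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    first = "今天天气很好。我们去公园吧。"
    second = "今天天气很好。我要去上班了。"
    print(similarity(first, first), is_repeated(first, first))
    print(similarity(first, second), is_repeated(first, second))
```

Because SHA-1 only matches byte-identical blocks, a sketch like this catches copied sentences and paragraphs but not paraphrases. Any normalization before hashing (whitespace, punctuation, case) would be a further assumption.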
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journal
2009, Issue B12, pp. 263-266 (4 pages)
Funding
National High Technology Research and Development Program of China (863 Program), grant 2007AA01Z439
Keywords
public opinion information
repeated comment
similarity calculation
Hash table