Detection of Repetitive Reviews Based on Hash Technology (基于Hash技术的重复性评论检测) Cited by: 4
Abstract: With the rapid development of Internet technology, online forums (BBS) have become an important place for people to obtain information and express opinions. However, the large volume of repeated reviews has become a new problem for systems that acquire and supervise forum public opinion information, so detecting and removing repeated reviews effectively is crucial. Because repeated reviews posted within a given period are numerous, dense, and highly similar in content, the paper proposes a repeated-review detection method based on SHA-1. The method computes the Hash value of each review at sentence and paragraph granularity, counts the identical fingerprints in the Hash table to measure the similarity between reviews, and then applies a given similarity threshold to decide whether a review is a repeat. The experimental results show that the method detects and removes repeated reviews effectively and outperforms traditional methods.
Source: Journal of Computer Applications (《计算机应用》), CSCD, PKU Core (北大核心), 2009, Issue B12, pp. 263-266 (4 pages)
Funding: National 863 Program of China (2007AA01Z439)
Keywords: public opinion information; repeated comment; similarity calculation; Hash table
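
The pipeline summarized in the abstract (SHA-1 fingerprints at sentence and paragraph granularity, a count of identical fingerprints, then a similarity threshold) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the sentence delimiters, the similarity formula, or the threshold value, so the `SENTENCE_END` pattern, the "shared fingerprints divided by the smaller fingerprint set" ratio, the default `threshold=0.6`, and all function names below are hypothetical.

```python
import hashlib
import re
from collections import Counter

# Assumed delimiters; the abstract does not say how sentences are split.
SENTENCE_END = re.compile(r"[。！？!?.;；]+")


def fingerprints(review):
    """Return the SHA-1 digests of every paragraph and sentence in a review."""
    blocks = []
    for paragraph in review.splitlines():
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        blocks.append(paragraph)  # paragraph-level block
        blocks.extend(s.strip() for s in SENTENCE_END.split(paragraph))
    return {hashlib.sha1(b.encode("utf-8")).hexdigest() for b in blocks if b}


def similarity(a, b):
    """Assumed measure: shared fingerprints over the smaller fingerprint set."""
    fa, fb = fingerprints(a), fingerprints(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / min(len(fa), len(fb))


def deduplicate(reviews, threshold=0.6):
    """Keep a review only if it is not too similar to any earlier kept review."""
    table = {}  # hash table: fingerprint -> indices of kept reviews containing it
    kept = []   # (review, fingerprint set) pairs, in arrival order
    for review in reviews:
        fps = fingerprints(review)
        # Count how many fingerprints each kept review shares with this one.
        shared = Counter(i for fp in fps for i in table.get(fp, ()))
        is_repeat = any(
            count / min(len(fps), len(kept[i][1])) >= threshold
            for i, count in shared.items()
        )
        if is_repeat:
            continue
        index = len(kept)
        kept.append((review, fps))
        for fp in fps:
            table.setdefault(fp, []).append(index)
    return [review for review, _ in kept]


if __name__ == "__main__":
    posts = ["这个产品很好。我很喜欢！", "这个产品很好。我很喜欢！", "物流太慢了。"]
    print(deduplicate(posts))
```

Looking candidates up through the fingerprint table means a new review is compared only with earlier reviews that share at least one block with it, which suits the high-volume, high-density setting the abstract describes. Because SHA-1 matches only byte-identical blocks, any text normalization (whitespace, punctuation, character width) would have to happen before hashing; none is applied in this sketch.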