AN EFFICIENT APPROACH TO COMMENT SPAM IDENTIFICATION (Cited by: 1)

Abstract: This paper proposes a novel approach to comment spam identification based on content analysis. Three main features are used: the number of links, content repetitiveness, and text similarity. In practice, content repetitiveness is determined by the length and frequency of the longest common substring, and text similarity is calculated using the vector space model. In preliminary experiments on comment spam identification, precision is as high as 93% for Chinese and 82% for English. The results show the validity and language independence of the approach. Compared with conventional spam filtering approaches, the method requires no training, no rule sets, and no link relationships. It can also handle new comments as well as existing ones.
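Source: Journal of Electronics (China) (电子科学学刊(英文版)), 2009, No. 5, pp. 644-650 (7 pages)
Funding: Supported by the National Natural Science Foundation of China (No. 60736044, 60803094)
Keywords: spam email; comment; identification; vector space model; experimental precision; filtering method; repetitiveness; similarity; comment spam; automatic identification; content analysis; blog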
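
The abstract names three content features but this record does not give their exact definitions, so the following is only a minimal Python sketch of how such features might be computed. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names, the URL pattern used to count links, the choice of character bigrams as vector space terms (picked here because they need no language-specific tokenizer), and the way substring frequency is counted.

```python
import math
import re
from collections import Counter

# Assumed link pattern: any http(s) URL up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def count_links(comment: str) -> int:
    """Feature 1 (assumed form): number of URLs found in the comment."""
    return len(URL_PATTERN.findall(comment))


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous substring shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def substring_frequency(sub: str, comments: list[str]) -> int:
    """Feature 2 helper (assumed form): how many comments contain sub."""
    return sum(sub in c for c in comments) if sub else 0


def char_bigrams(text: str) -> list[str]:
    """Hypothetical term extractor: lowercase character bigrams."""
    s = " ".join(text.lower().split())
    return [s[i:i + 2] for i in range(len(s) - 1)]


def cosine_similarity(a: str, b: str) -> float:
    """Feature 3: cosine of the angle between two term-frequency vectors."""
    va, vb = Counter(char_bigrams(a)), Counter(char_bigrams(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    comments = [
        "Great post! Visit http://spam.example for cheap pills",
        "Nice blog. Visit http://spam.example for cheap pills today",
        "I disagree with the second paragraph of this article.",
    ]
    shared = longest_common_substring(comments[0], comments[1])
    print(count_links(comments[0]))
    print(repr(shared), substring_frequency(shared, comments))
    print(round(cosine_similarity(comments[0], comments[2]), 3))
```

How the three values would be combined into a spam decision (thresholds, weights, or what each comment is compared against) is not stated in the abstract, so the sketch stops at feature extraction.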