This paper proposes a novel approach to comment spam identification based on content analysis. Three main features are used: the number of links, content repetitiveness, and text similarity. In practice, content repetitiveness is determined by the length and frequency of the longest common substring, and text similarity is calculated using the vector space model. In preliminary experiments on Chinese and English comments, precision reached 93% and 82%, respectively. The results show the validity and language independence of this approach. Compared with conventional spam filtering approaches, the method requires no training, no rule sets, and no link relationships. The proposed approach can also handle new comments as well as existing ones.
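The three features named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how links are detected, how text is tokenized, how term weights are chosen, or how the features are combined into a decision. The URL pattern, the `\w+` tokenizer, the raw term-frequency weights, and all function names below are assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Assumption: a link is anything that looks like an http(s) URL.
LINK_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def count_links(comment: str) -> int:
    """Feature 1: number of links in a comment."""
    return len(LINK_PATTERN.findall(comment))


def longest_common_substring(a: str, b: str) -> str:
    """Feature 2 (building block): longest contiguous substring shared by a and b.

    Classic dynamic programming over character positions, O(len(a) * len(b)).
    """
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match in a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


def substring_frequency(substring: str, comments: list[str]) -> int:
    """Hypothetical helper: how many comments contain the given substring."""
    if not substring:
        return 0
    return sum(1 for c in comments if substring in c)


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase word tokens; a real system for Chinese would
    # need word segmentation or character n-grams instead.
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Feature 3: vector space model similarity with raw term-frequency weights."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0
```

A caller might compute `count_links(comment)`, compare the comment against other comments with `longest_common_substring` and `substring_frequency`, and compare it against the post it replies to with `cosine_similarity`. Any thresholds applied to these values would have to come from the paper itself.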
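This review surveys current Chinese and international research on product review spam detection, summarizes its characteristics and shortcomings, and identifies development trends. Searching CNKI and Web of Science with keywords such as '虚假评论' (fake reviews) and 'review spam', followed by screening, yielded 54 relevant Chinese and international publications. These were classified and analyzed using literature analysis, with emphasis on refinements and innovations in detection features and detection methods, and on how methods differ across detection targets: spam reviews, spam reviewers, and spammer groups. The study finds that existing work on review spam detection can be divided into methods based on review content and methods based on review structure, reviewers, and the reviewed products. Future review spam detection should extract effective features and select and optimize detection methods according to the characteristics of the dataset.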
Funding: Supported by the National Natural Science Foundation of China (No. 60736044, 60803094)