
Identification of Product Review Spam by Random Forest (基于随机森林的产品垃圾评论识别)
Cited by: 12
Abstract: Current methods for identifying product review spam consider only the selection of review features and ignore the imbalance of the review data set. This paper therefore proposes a method for identifying product review spam based on random forests: the random forest model is built either by repeatedly drawing the same number of samples, with replacement, from the large and small classes, or by assigning the same weight to the large class and the small class as a whole. Experimental results on an Amazon data set show that the random forest method outperforms the other baseline methods.
Author: 何珑
Source: 《中文信息学报》 (Journal of Chinese Information Processing), CSCD, PKU Core (北大核心), 2015, No. 3, pp. 150-154, 161 (6 pages)
Funding: Natural Science Foundation of Fujian Province (2010J05133)
Keywords: product review spam; imbalance problem; random forest
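
The abstract names two ways of handling class imbalance when building the forest: drawing equally sized samples with replacement from the large and small classes, or giving both classes the same overall weight. The following is a minimal Python sketch of those two steps only, not the paper's implementation. The function names, the toy labels, and the default per-class sample size (the size of the smallest class) are assumptions made for illustration; tree construction and the review features are omitted.

```python
import random
from collections import Counter, defaultdict


def balanced_bootstrap(labels, rng, n_per_class=None):
    """Return sample indices for one tree (hypothetical helper).

    Draws the same number of indices, with replacement, from every class.
    n_per_class defaults to the smallest class size, which is an assumption:
    the abstract does not state the sample size.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    if n_per_class is None:
        n_per_class = min(len(members) for members in by_class.values())
    sample = []
    for members in by_class.values():
        # random.Random.choices samples with replacement.
        sample.extend(rng.choices(members, k=n_per_class))
    rng.shuffle(sample)
    return sample


def equal_class_weights(labels):
    """Return one weight per sample so that each class has total weight 1."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Toy data (invented): 95 genuine reviews (0) and 5 spam reviews (1).
labels = [0] * 95 + [1] * 5
rng = random.Random(0)

# Option 1: one balanced bootstrap sample, as would be drawn per tree.
indices = balanced_bootstrap(labels, rng)
sample_counts = Counter(labels[i] for i in indices)

# Option 2: per-sample weights that equalise the two classes overall.
weights = equal_class_weights(labels)
```

In a full forest, `balanced_bootstrap` would be called once per tree so that each tree sees both classes in equal proportion, while the weights from `equal_class_weights` would instead be passed to a learner that accepts per-sample weights.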