Abstract
Current methods for identifying product review spam consider only the selection of review features and ignore the imbalance of the review data set. This paper therefore proposes a random-forest-based method for identifying product review spam: either the same number of samples is repeatedly drawn with replacement from the large (majority) and small (minority) classes, or the two classes are assigned the same overall weight, and a random forest model is built on the result. Experimental results on an Amazon data set show that the random forest method outperforms the other baseline methods.
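The two balancing strategies named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `balanced_bootstrap` and `equal_total_class_weights`, the toy labels, and the choice of the smallest class size as the per-class draw count are all assumptions made for illustration. The sketch covers only the sampling and weighting steps; growing the trees of the forest is left out.

```python
import random
from collections import Counter, defaultdict


def balanced_bootstrap(labels, rng, n_per_class=None):
    """Draw the same number of indices from every class, with replacement."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    if n_per_class is None:
        # Assumption: draw as many samples per class as the smallest class has.
        n_per_class = min(len(ids) for ids in by_class.values())
    sample = []
    for ids in by_class.values():
        # random.Random.choices samples with replacement.
        sample.extend(rng.choices(ids, k=n_per_class))
    return sample


def equal_total_class_weights(labels):
    """Per-sample weight for each class so that every class has the same total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Hypothetical toy data: 5 spam reviews against 95 genuine ones.
labels = ["spam"] * 5 + ["ham"] * 95
rng = random.Random(0)

sample = balanced_bootstrap(labels, rng)      # indices for one tree's training set
weights = equal_total_class_weights(labels)   # alternative: reweight instead of resample
```

In a forest, `balanced_bootstrap` would be called once per tree so that each tree sees a different balanced sample, while the weighting variant would keep all samples and pass the per-class weights to the tree learner.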
Source
《中文信息学报》
CSCD
Peking University Core Journal (北大核心)
2015, Issue 3, pp. 150-154, 161 (6 pages)
Journal of Chinese Information Processing
Funding
Natural Science Foundation of Fujian Province (2010J05133)