Abstract
Gold standard data methods for screening fraudsters in crowdsourcing contests, and clustering-based outlier detection algorithms such as K-means and DBSCAN, depend on parameters fixed in advance and are poorly suited to large-scale data sets. To address this, an outlier detection algorithm based on a sample connectivity graph is proposed. First, an outlier detection algorithm is invoked repeatedly under given parameters to identify the outliers and clusters in the data. Second, the number of connections and the connection strength between every pair of samples are computed, and, given a lower bound δ on the connection strength, a connectivity graph over the samples is constructed from those strengths. Finally, each sample is labeled as a cluster node or an outlier according to its connectivity in the graph. Experimental results show that, with a relaxed range of parameter settings, the algorithm narrows the fluctuation in the number of detected outliers and improves outlier identification accuracy, outperforming the comparison algorithms and the classical gold standard data method.
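The three steps in the abstract can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not define the connection count or connection strength precisely, so the code assumes that the count is the number of runs in which two samples fall in the same cluster, that the strength is that count divided by the number of runs, and that a sample with no edge in the resulting graph is an outlier. DBSCAN is used as the repeated base detector, and the names `dbscan_labels`, `connectivity_outliers`, `points`, and `grid` are hypothetical.

```python
from itertools import combinations
from math import dist


def dbscan_labels(points, eps, min_pts):
    """Minimal DBSCAN: one cluster id per point, -1 for noise."""
    n = len(points)
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1               # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # border point claimed by this cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])
    return labels


def connectivity_outliers(points, param_grid, delta):
    """Return indices of samples left without an edge in the connectivity graph."""
    n = len(points)
    # Step 1 and 2: run the detector once per setting and count co-clusterings.
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for eps, min_pts in param_grid:
        labels = dbscan_labels(points, eps, min_pts)
        for i, j in counts:
            if labels[i] != -1 and labels[i] == labels[j]:
                counts[(i, j)] += 1
    # Assumed definition: strength = count / number of runs; keep edges >= delta.
    runs = len(param_grid)
    connected = set()
    for (i, j), count in counts.items():
        if count / runs >= delta:
            connected.update((i, j))
    # Step 3: samples with at least one edge are cluster nodes, the rest outliers.
    return [i for i in range(n) if i not in connected]


# Toy data: two tight groups plus one isolated sample (index 10).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4),
          (5.0, 5.0), (5.1, 5.2), (5.3, 5.1), (5.2, 5.3), (5.1, 4.9),
          (10.0, 0.0)]
# Deliberately wide (eps, min_pts) range, including a too-strict and a too-loose setting.
grid = [(0.1, 3), (0.5, 3), (1.0, 3), (2.0, 4), (3.0, 4), (20.0, 3)]
print(connectivity_outliers(points, grid, delta=0.6))
```

In this toy run the strictest setting marks every sample as noise and the loosest merges everything into one cluster, yet only the isolated sample stays below the δ threshold, which illustrates why aggregating over many runs may be less sensitive to any single parameter choice.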
Authors
XU Yan-jing (许艳静)
ZHU Jian-ming (朱建明)
DING Qing-yang (丁庆洋)
ZHUANG Xue-yang (庄雪扬)
School of Information, Central University of Finance and Economics, Beijing 100081, China
Source
Journal of Statistics and Information (《统计与信息论坛》)
CSSCI
Peking University Core Journal
2019, Issue 10, pp. 20-26 (7 pages)
Funding
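National Key R&D Program project "Research on Intelligent Service Transaction and Supervision Technology" (2017YFB1400700)
National Natural Science Foundation of China project "Research on Regularized Statistical Methods for High-Dimensional Big Data" (71701223)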
Keywords
crowdsourcing contest
user fraud
outlier detection
clustering algorithm
sample connectivity graph