Abstract
To analyze and mine group behavior patterns in big data held in distributed storage across a wide area network (WAN), this paper proposes a k-means clustering method, based on MapReduce and the artificial bee colony (ABC) algorithm, that executes in a distributed, parallel fashion over the WAN. Using the three MapReduce operations Map, Combine and Reduce, the method runs ABC-optimized k-means clustering in parallel across the WAN. The computationally expensive clustering operations run in parallel at each data source node, and the local clustering results are merged into much smaller intermediate results before being sent to the central node. This avoids moving and centralizing the big data and greatly reduces the total clustering time. Clustering is performed on the full data set, so the results are not affected by dimensionality reduction or sampling, which shrink the data, and clustering accuracy is preserved. The two clustering methods are compared on trajectory data of pedestrians crossing the road against red lights, collected by the road traffic monitoring system of a certain province, and the WAN distributed parallel clustering method is found to have better clustering characteristics.
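The Map, Combine and Reduce division described in the abstract can be illustrated with a minimal sketch of a single k-means iteration. This is not the paper's implementation: the function names (`map_phase`, `combine_phase`, `reduce_phase`, `kmeans_step`) are hypothetical, the data source nodes are simulated as in-memory lists, and the ABC-based optimization of the centroids is omitted, so the centroids are assumed to be given.

```python
def nearest(point, centroids):
    # Index of the centroid with the smallest squared Euclidean distance.
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def map_phase(points, centroids):
    # Map (runs at each data source node): emit (cluster index, (point, 1)).
    return [(nearest(p, centroids), (p, 1)) for p in points]


def combine_phase(pairs):
    # Combine: collapse many pairs into one (vector sum, count) per cluster,
    # so only a small intermediate result has to leave the node.
    acc = {}
    for j, (vec, n) in pairs:
        if j in acc:
            total, count = acc[j]
            acc[j] = ([a + b for a, b in zip(total, vec)], count + n)
        else:
            acc[j] = (list(vec), n)
    return acc


def reduce_phase(partials, centroids):
    # Reduce (runs at the central node): merge the per-node partial sums
    # and turn them into updated centroids.
    merged = combine_phase(
        [(j, sc) for part in partials for j, sc in part.items()]
    )
    updated = []
    for j, old in enumerate(centroids):
        if j in merged:
            total, count = merged[j]
            updated.append([x / count for x in total])
        else:
            updated.append(list(old))  # empty cluster: keep the old centroid
    return updated


def kmeans_step(nodes, centroids):
    # One iteration: each node maps and combines locally, the center reduces.
    partials = [combine_phase(map_phase(pts, centroids)) for pts in nodes]
    return reduce_phase(partials, centroids)


# Two simulated data source nodes holding 2-D points (illustrative data only).
nodes = [
    [[0.0, 0.0], [0.0, 2.0]],
    [[10.0, 10.0], [10.0, 12.0]],
]
centroids = kmeans_step(nodes, [[0.0, 0.0], [10.0, 10.0]])
```

In this sketch each node sends at most one (sum, count) pair per cluster to the center, regardless of how many points it holds, which is the property the abstract credits for avoiding the transfer of the raw data.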
Author
洪月华
HONG Yue-hua (Dept. of Computer, Guangxi Economic Management Cadre College, Nanning, Guangxi 530007)
Source
Journal of Yulin Normal University (《玉林师范学院学报》)
2019, No. 2, pp. 145-151 (7 pages)