摘要
随着数据获取设备的不断进步和数据获取技术的快速发展,如何分析和挖掘应用中快速产生的数据流成为亟待解决的问题.数据流的相似性连接返回两个数据流上相似的数据对,是分析和挖掘数据流的重要操作.相比于Lp范式距离,例如曼哈顿距离和欧氏距离,EMD距离(Earth Mover’s Distance)因其可以更准确地量化直方图元组之间的相似性而受到广泛关注,被广泛应用于解决基于内容的图像检索、冗余图像识别以及视频对象跟踪等重要应用问题.然而EMD距离的计算复杂度却高达三次方,阻碍了EMD距离在数据流相似性连接问题中的应用.该文基于开源的Apache Storm数据流分布式并行处理框架,设计并实现了基于EMD距离的数据流分布式相似性连接技术,命名为EMD-DDSJ技术.该技术在数据分发时维护了连接计算节点上的数据局部性,并基于该数据局部性增强了连接算法对不相似直方图元组对间EMD计算的过滤性能,提高了各个连接计算节点的执行效率.同时基于连接计算节点的代价模型,提出了基于反馈的负载均衡策略,有效提升EMD-DDSJ技术的整体执行性能.在真实数据集上的实验结果展示了该文提出的EMD-DDSJ技术的高效性和可扩展性,比相关最好的技术在处理吞吐率上最高提升了1.4倍,在元组平均处理延迟上最多降低了44%,并且随着相似性阈值或滑动窗口大小的增大该提升比率还会进一步增大.
With the continuous progress of data acquisition equipment and the rapid development of data acquisition technologies,analyzing and mining quick-generated data streams from practical applications become an urgent problem to solve.Similarity join over data streams,which returns all similar pairs of tuples from two data streams,is an important operation for analyzing and mining data streams.Compared with Lp norm distance functions,such as Manhattan distance and Euclidean distance,Earth Mover’s Distance(EMD)is more robust in quantifying the similarities of histogram-representative tuples and thus receives widespread attentions.EMD is widely applied to solve many important application problems,such as content-based image retrieval,duplicated image recognition and video object tracking.However,the computation of EMD owns cubic time complexity,which hampers the application of EMD in solving similarity join problem over data streams.Based on Apache Storm,which is an open source distributed parallel data stream processing framework,we design and implement a distributed similarity join technique for data streams based on EMD,named EMD-DDSJ.This technique partitions histogram-representative tuples in data streams based on their similarities in terms of EMD and thus maintains good data locality on each join-computing node.It is the good data locality that enhances the filtering effect of joining algorithm to unpromising EMD calculations between dissimilar histogram-representative tuple pairs,and therefore improves the execution efficiency of join-computing nodes.On the basis of the cost model of join-computing nodes,we also propose an effective load balancing strategyon the strength of feedback messages sent by join-computing nodes,which improves the overall performance of the EMD-DDSJ technique.Experimental results on physical-world datasets show the efficiency and good scalability of the proposed EMD-DDSJ technique.Particularly,the processing throughput of EMD-DDSJ technique is at most 1.4 times higher than that of the best related technique and the average tuple processing delay of EMD-DDSJ technique decreases up to 44% compared with the best related technique.Besides,with the growth of the similarity threshold or the size of sliding window,the throughput improvement ratio of EMD-DDSJ technique compared to the related technique will be further increased.
作者
许嘉
宋超
吕品
李陶深
XU Jia;SONG Chao;LV Pin;LI Tao-Shen(College of Computer, Electronics and Information, Guangxi University, Nanning 530004;Guangxi Collegss and Universitiss Key Labrrairry of Parallel and Distributed Computing , Nanning 530004;Guangxi Key Laboratory of Multimedia Communications Network Technology , Nanning 530004)
出处
《计算机学报》
EI
CSCD
北大核心
2019年第8期1779-1796,共18页
Chinese Journal of Computers
基金
国家自然科学基金(61402494)
广西自然科学基金青年基金(2015GXNSFBA139243)
“广西八桂学者”专项经费
广西大学科研基金资助项目(XGZ141182,XGZ150322)
广西高等教育本科教学改革工程项目重点项目(2017JGZ103)资助~~