摘要
由于蕴含事物发展规律,时序数据上的数据挖掘正成为大数据决策的重要组成部分.作为时序数据挖掘的一种基本操作,时序数据相似连接可以找出给定相似度度量下的所有相似时序数据对.研究表明,动态时间规整(Dynamic Time Warping,DTW)正在文本挖掘、趋势预测等越来越多的科学与社会应用领域中成为时序数据上目前最佳的相似性度量方法.该文首次提出采用DTW作为相似性度量方法的时序数据相似连接问题.特别地,该文首次提出了基于阈值和基于Top-k的两种DTW度量上的时间序列相似连接任务.除了服务于进一步的时序数据挖掘算法,这两个任务还具有机器翻译、关联检测等广泛的直接应用.但是,直接的相似连接方法因为时序数据的规模大、DTW计算复杂性高而不能在实际中工作.尽管存在很多基于DTW的索引和上下界计算方法,这些工作主要关注DTW度量上的快速检索而非相似连接.因此,这些方法都假设存在一个固定的时序数据作为查询,并根据查询使用时间和空间复杂度很高的方法构建索引或进行预计算.但在文中的相似连接问题中,所有时序数据都是查询,因此这些方法的构建索引和预计算的时间比直接的相似连接方法需要的处理时间还长.为此,该文针对两种相似连接任务提出了两个基于DTW上下界的剪裁框架用于减少准确DTW相似性的计算次数.基于划分,该文为DTW度量设计了新颖的上下界计算方案.由于细粒度的划分带来上下界接近准确的DTW相似性但需要更长的计算时间,而粗粒度的划分需要更短的计算时间和与准确DTW相似性有较大差距的上下界,该文设计了基于二分查找的机制来自动找到合适的划分粒度,实现了整体的高处理性能.面对单机不能容纳全部时序数据和运行时间长的情况,该文将提出的两种相似连接处理框架利用MapReduce并行计算框架扩展到了分布式环境.该文在两个真实数据集上验证了文中提出的DTW相似连接在实际应用中的效果,并在真实与合成数据集上进行了充分的实验,验证了文中方法的高效性.
Revealing evolution insights of things, time series mining is becoming an indispensable component of big data driven decision making. As a fundamental operation in time series mining, given a similarity measure, similarity join gathers all pairs of similar time series. It is demonstrated that DTW (Dynamic Time Warping)has served as the best measure in disparate domains ranging from scientific to social fields such as text mining or tendency prediction. In this paper, we for the first time propose to join similar time series with DTW as the similarity measure. Specifically, we for the first time define two tasks, the threshold based and the Top- k based similarity join under DTW. Besides to serve time series further mining tasks such as stock prediction, these two tasks can be directly applied to a wide spectrum of applications such as machine translation and delay-correlation detection. Unfortunately, trivial solutions suffer from the large scale nature of time series and high computational complexity of DTW. Numerous indexing techniques and various lower and upper bounds of DTW have been proposed. However, these works aim at similarity search rather than similarity join under DTW. In concrete, they assume that a fixed time series serves as a query and index or precomputation is performed on the query time series. It is time and space - consuming to construct index and precompute for the fixed time series. However, under our similarity join task, all time series serve as the fixed query and thus the index construction or precomputation time for all the time series is even beyond the execution time of the trivial solution and thus these techniques become impractical. To tame similarity join under DTW, we first propose two pruning based processing frameworks for the threshold-based and Top- k based similarity join tasks respectively. These two frameworks prune unnecessary calculation of accurate DTW similarity between time series by leveraging the cheap upper and lower bound of DTW measure. In this way, we further devise novel upper and lower bounds for DTW measure. Both bounds are developed on top on time series partition. Since fine - grained partition enables more accurate DTW similarity but consumes more execution time while coarse - grained partition results in less accurate DTW similarity but consumes less execution time, we develop a mechanism based on binary search to quickly tune the granularities of partitions automatically and thus enable the overall practical performance. When single machine cannot meet the requirement of performance or cannot hold the massive time series, we extend our processing frameworks to distributed environment. Specifically, we design a MapReduce implementation to our pruning based similarity join framework. We conduct extensive experiments to demonstrate the effectiveness and efficiency of our methods. First, we apply the two proposed similarity join tasks on two real world datasets to demonstrate that the threshold-based similarity join task can be used to find correlated power supplement sources and find the same entities in different languages. Then, we use both real world and synthetic datasets to demonstrate that our methods outperform existing solutions consistently under various lengths and volume of time series.
作者
周宁南
张孝
刘城山
王珊
ZHOU NingNan;ZHANG Xiao;LIU ChengShan;WANG Shan(Key Laboratory of Data Engineering and Knowledge Engineering of the Ministry of Education(Renmin University of China),Beijing 100872;Department of Information,Renmin University of China,Beijing 100872)
出处
《计算机学报》
EI
CSCD
北大核心
2018年第8期1798-1813,共16页
Chinese Journal of Computers
基金
国家重点研发计划项目(2016YFB10007002)
国家自然科学基金重点项目(61432006)资助~~
关键词
动态时间规整
时序数据
相似连接
划分剪枝
分布剪枝
dynamic time warping
time series
similarity join
partition-based pruning
distribution-based pruning