Funding: Supported by the National Natural Science Foundation of China (60573089) and the National 985 Project Foundation (985-2-DB-Y01).
Abstract: How to process aggregate queries over data streams efficiently and effectively has become a hot research topic in both the academic and industrial communities. To address this issue, this paper proposes a novel Linked-tree algorithm based on sliding windows. By introducing the concept of an area, the Linked-tree algorithm reuses many primary results from the previous window and thereby avoids many unnecessary repeated comparison operations between two successive windows. As a result, the execution efficiency of MAX queries improves dramatically. In addition, because memory usage depends on the number of areas rather than on the size of the sliding window, memory is saved considerably. Extensive experimental results show that the Linked-tree algorithm significantly outperforms the traditional SC (Simple Compared) algorithm and the Ranked-tree algorithm.
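The abstract does not describe the Linked-tree structure itself, so the following is only a minimal sketch of the general idea it relies on: answering a sliding-window MAX query by reusing results from the previous window instead of re-comparing every element. It uses the standard monotonic-deque technique, not the Linked-tree algorithm; the function name `sliding_max` and the count-based window are assumptions made for illustration.

```python
from collections import deque


def sliding_max(stream, window):
    """Yield the maximum of each full count-based window of `window` items.

    Illustrative monotonic-deque technique, NOT the paper's Linked-tree.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    # (index, value) pairs whose values are strictly decreasing from the front.
    candidates = deque()
    for i, x in enumerate(stream):
        # Older items that are <= x can never be the maximum again.
        while candidates and candidates[-1][1] <= x:
            candidates.pop()
        candidates.append((i, x))
        # Expire the front candidate once it has slid out of the window.
        if candidates[0][0] <= i - window:
            candidates.popleft()
        # Emit a result only once the first window is full.
        if i >= window - 1:
            yield candidates[0][1]


print(list(sliding_max([1, 3, 2, 5, 4, 1], 3)))  # [3, 5, 5, 5]
```

Each element is appended once and removed at most once, so the amortized cost per arrival is constant, and the deque holds only the surviving candidates rather than the whole window (although in the worst case, a strictly decreasing stream, that is still the full window).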
Abstract: Join is a critical operation when dealing with sliding windows over data streams. The literature offers many optimization strategies for sliding window joins, but the join sequence of multiple sliding windows is usually selected with a simple heuristic, which is ineffective. A graph-based approach is proposed to address this problem. First, the sliding window join model is introduced. In this model, a vertex represents a join operator and an edge indicates the join relationship among sliding windows. Vertex weights and edge weights represent the cost of a join and the reciprocity of join operators, respectively. A good query plan with minimal cost can then be found in the model. A complete join algorithm is thus presented that combines setting up the model, finding the optimal query plan, and executing the query plan. Experiments show that the graph-based approach is feasible and works better in the above environment.
Abstract: With the enhancement of data collection capabilities, massive streaming data have been accumulated in numerous application scenarios. Specifically, the issue of classifying data streams based on mobile sensors can be formalized as a multi-task multi-view learning problem, in which a specific task comprises multiple views with shared features collected from multiple sensors. Existing incremental learning methods are often single-task and single-view, and so cannot learn shared representations between related tasks and views. To address these challenges, an adaptive multi-task multi-view incremental learning framework for data stream classification, called MTMVIS, is proposed. First, an attention mechanism is used to align the sensor data of different views. In addition, MTMVIS uses adaptive Fisher regularization from the perspective of multi-task multi-view learning to overcome catastrophic forgetting in incremental learning. Experiments on two different datasets show that the proposed framework outperforms state-of-the-art baseline methods.
Funding: Supported by the "Hundred Talents Program" of CAS and the National Natural Science Foundation of China under Grant No. 60772034.
Abstract: Detecting duplicates in data streams is an important problem with a wide range of applications. In general, precisely detecting duplicates in an unbounded data stream is not feasible in most streaming scenarios; moreover, the elements in data streams are always time sensitive. This makes it particularly important to approximately detect duplicates among the newly arrived elements of a data stream within a fixed time frame. In this paper, we present a novel data structure, the Decaying Bloom Filter (DBF), an extension of the Counting Bloom Filter that effectively removes stale elements as new elements continuously arrive over sliding windows. Based on the DBF, we present an efficient algorithm to approximately detect duplicates over sliding windows. Our algorithm may produce false positives but, unlike many previous results, it does not produce false negatives. We analyze the time complexity and detection accuracy, and give a tight upper bound on the false positive rate. For a given space of G bits and sliding window size W, our algorithm has an amortized time complexity of O(√G/W). Both analytical and experimental results on synthetic data demonstrate that our algorithm is superior to previous results in both execution time and detection accuracy.
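The abstract gives no implementation details for the DBF, so the following is only a naive sketch of the decaying-counter idea it describes: counters are refreshed when an element arrives and age with every later arrival, so stale elements drop out of the filter. The class name `DecayingFilter`, the SHA-256 based hashing, and the decay-everything-per-arrival loop are all assumptions made for illustration; the loop costs time proportional to the number of counters and does not reproduce the amortized bound stated in the abstract.

```python
import hashlib


class DecayingFilter:
    """Naive decaying counter filter (illustrative, not the paper's DBF)."""

    def __init__(self, num_counters, num_hashes, window):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes
        self.window = window  # how many later arrivals an element survives

    def _slots(self, item):
        # Derive `num_hashes` counter positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.counters)

    def seen_recently(self, item):
        """Return True if `item` may have appeared in the previous `window` arrivals."""
        slots = list(self._slots(item))
        # All counters still positive: possible duplicate (may be a false positive).
        duplicate = all(self.counters[s] > 0 for s in slots)
        # Age every live counter by one arrival.
        self.counters = [c - 1 if c > 0 else 0 for c in self.counters]
        # Refresh the counters of the element that just arrived.
        for s in slots:
            self.counters[s] = self.window
        return duplicate


f = DecayingFilter(num_counters=1024, num_hashes=3, window=3)
print(f.seen_recently("a"))  # False: the filter is empty
print(f.seen_recently("a"))  # True: "a" arrived one step ago
```

Because an element's counters are only ever raised by colliding insertions, never lowered faster than one per arrival, an element seen within the previous `window` arrivals is always reported; collisions can only cause false positives, which matches the error behaviour the abstract describes.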
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 61025007, 61328202, 61173029, 61100024, 61332006, and 61073063; the National High Technology Research and Development 863 Program of China under Grant No. 2012AA011004; and the National Basic Research 973 Program of China under Grant No. 2011CB302200-G.
Abstract: Outlier detection on data streams is an important task in data mining. The challenges become even greater when the data are uncertain. This paper studies the problem of outlier detection on uncertain data streams. We propose Continuous Uncertain Outlier Detection (CUOD), which uses pruning to quickly determine the nature of uncertain elements and thus improve efficiency. Furthermore, we propose a pruning approach, Probability Pruning for Continuous Uncertain Outlier Detection (PCUOD), to reduce the detection cost. It estimates outlier probabilities and can effectively reduce the amount of computation. The cost of the incremental PCUOD algorithm can satisfy the demands of uncertain data streams. Finally, a new method for parameter-variable queries to CUOD is proposed, enabling the concurrent execution of different queries. To the best of our knowledge, this paper is the first work to perform outlier detection on uncertain data streams that can handle parameter-variable queries simultaneously. Our methods are verified using both real and synthetic data. The results show that they reduce the required storage and running time.
Funding: Supported by the Fundamental Research Funds for the Central Universities (Nos. FRF-TP-14025A1 and FRF-TP-15-025A2) and by the Key Technologies Research and Development Program of the 12th Five-Year Plan of China (No. 2013BAI13B06).
Abstract: In this paper, we study the skyline group problem over a data stream. An object dominates another object if it is not worse than the other object on all attributes and is better on at least one attribute. If an object is not dominated by any other object, it is a skyline object. The skyline group problem involves finding k-item groups that are not dominated by any other k-item group. Existing algorithms designed to find skyline groups can only process static data. However, in many applications data changes over time as a stream, and algorithms should be designed to support skyline group queries on dynamic data. In this paper, we propose new algorithms to find skyline groups over a data stream. We use three data structures, namely a hash table, a dominance graph, and a matrix, to store dominance information and update results incrementally. We conduct experiments on synthetic datasets to evaluate the performance of the proposed algorithms. The experimental results show that our algorithms can efficiently find skyline groups over a data stream.
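The dominance definition in this abstract can be written down directly. The sketch below covers only single objects in a static list, not the k-item groups or the incremental stream updates the paper targets; the function names and the convention that smaller attribute values are better are assumptions made for illustration.

```python
def dominates(a, b):
    """True if `a` dominates `b`, assuming smaller values are better."""
    not_worse_everywhere = all(x <= y for x, y in zip(a, b))
    better_somewhere = any(x < y for x, y in zip(a, b))
    return not_worse_everywhere and better_somewhere


def skyline(points):
    """Return the objects that no other object dominates (quadratic scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [(1, 4), (2, 2), (3, 3), (4, 1)]
print(skyline(points))  # [(1, 4), (2, 2), (4, 1)]
```

Here (3, 3) is dropped because (2, 2) is better on both attributes, while the other three objects each win on at least one attribute against every rival.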
Funding: European Commission and Generalitat Valenciana government [ACIF/2012/112] and [BEFPI/2014/067].
Abstract: Pushed by the Internet of Things (IoT) paradigm, modern sensor networks monitor a wide range of phenomena in areas such as environmental monitoring, health care, industrial processes, and smart cities. These networks provide a continuous pulse of the almost infinite activities happening in physical space and are thus key enablers for a Digital Earth Nervous System. Nevertheless, the rapid processing of these sensor data streams continues to challenge traditional data-handling solutions, and new approaches are needed. We propose a generic answer to this challenge, which has the potential to support any form of distributed real-time analysis. This neutral methodology follows a brokering approach to work with different kinds of data sources and uses web-based standards to achieve interoperability. As a proof of concept, we implemented the methodology to detect anomalies in real time and applied it to the area of environmental monitoring. The developed system is capable of detecting anomalies, generating notifications, and displaying the recent situation to the user.
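Abstract: Driven by the emerging demand for business processes, empowered by IoT big data, to sense the physical world faster and more accurately and to respond in time, an Internet of Things (IoT)-aware business micro-process modeling method is proposed. First, modeling is centered on a single IoT object and incorporates the idea of the MAPE-K (monitor, analysis, plan, execution and knowledge base) model: the behavioral states in the life cycle of an IoT object instance are mapped one-to-one to the states of a micro-process instance, realizing closed-loop automatic monitoring and adjustment of a single IoT object. Second, based on data obtained from IoT sensing devices, business rules are defined in the SASE+ language to extract the business events that are meaningful to the business process, which keeps irrelevant events from interfering with the macro-process. Finally, a prototype micro-process modeling tool is designed and, combined with a real case analysis, used to verify the effectiveness of the proposed modeling method; the method combines business processes with real-time streaming IoT sensing data and significantly reduces the number of business events that the macro-process has to handle.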
Funding: Project (50275150) supported by the National Natural Science Foundation of China; Project (20040533035) supported by the National Research Foundation for the Doctoral Program of Higher Education of China.
Abstract: A data stream management system (DSMS) provides convenient solutions to the problem of processing continuous queries on data streams. Previous approaches for scheduling these queries and their operators assume either that each operator runs in a separate thread or that all operators are combined into one query plan and run in a single thread. Both approaches suffer from severe drawbacks concerning thread overhead and stalls caused by expensive operators. To overcome these drawbacks, a novel approach called clustered operators scheduling (COS) is proposed, which adaptively clusters the operators of the query plan into a number of groups based on their selectivity and computing cost using S-mean clustering. An experimental evaluation demonstrates the potential benefits of COS over the other scheduling strategies. COS can provide an adaptive, flexible, reliable, scalable, and robust design for a continuous query processor.
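Abstract: Outlier detection is an important problem in data management, with wide applications in credit card fraud detection, industrial engineering process management, bank data processing, and other areas. The arrival of the big data era has intensified the demand for diversified outlier detection over large-scale streaming data, where different users may choose different types of data as outliers according to their own preferences. To address multiple outlier detection in a streaming data environment, a query processing framework, MQOD (Multiple Query of Outlier Detection), is proposed; it exploits the containment relationships among multiple query tasks to support multiple outlier detection tasks and thereby improve query efficiency. Within the MQOD framework, an HT-Grid index is built to support the management of streaming data; the temporal characteristics of the sliding window are used to partition the window, and the partitioning result determines the range over which queries are executed, reducing unnecessary object accesses. The MQOD algorithm is validated on real and synthetic datasets, and the results demonstrate its efficiency.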