A novel data streams partitioning method is proposed to resolve problems of range-aggregation continuous queries over parallel streams for power industry.The first step of this method is to parallel sample the data,wh...A novel data streams partitioning method is proposed to resolve problems of range-aggregation continuous queries over parallel streams for power industry.The first step of this method is to parallel sample the data,which is implemented as an extended reservoir-sampling algorithm.A skip factor based on the change ratio of data-values is introduced to describe the distribution characteristics of data-values adaptively.The second step of this method is to partition the fluxes of data streams averagely,which is implemented with two alternative equal-depth histogram generating algorithms that fit the different cases:one for incremental maintenance based on heuristics and the other for periodical updates to generate an approximate partition vector.The experimental results on actual data prove that the method is efficient,practical and suitable for time-varying data streams processing.展开更多
Data stream management system (DSMS) provides convenient solutions to the problem of processing continuous queries on data streams.Previous approaches for scheduling these queries and their operators assume that each ...Data stream management system (DSMS) provides convenient solutions to the problem of processing continuous queries on data streams.Previous approaches for scheduling these queries and their operators assume that each operator runs in separate thread or all operators combine in one query plan and run in a single thread.Both approaches suffer from severe drawbacks concerning the thread overhead and the stalls due to expensive operators.To overcome these drawbacks,a novel approach called clustered operators scheduling (COS) is proposed that adaptively clusters operators of the query plan into a number of groups based on their selectivity and computing cost using S-mean clustering.Experimental evaluation is provided to demonstrate the potential benefits of COS scheduling over the other scheduling strategies.COS can provide adaptive,flexible,reliable,scalable and robust design for continuous query processor.展开更多
Top-k ranking of websites according to traffic volume is important for Internet Service Providers(ISPs) to understand network status and optimize network resources. However, the ranking result always has a big deviati...Top-k ranking of websites according to traffic volume is important for Internet Service Providers(ISPs) to understand network status and optimize network resources. However, the ranking result always has a big deviation with actual rank for the existence of unknown web traffic, which cannot be identified accurately under current techniques. In this paper, we introduce a novel method to approximate the actual rank. This method associates unknown web traffic with websites according to statistical probabilities. Then, we construct a probabilistic top-k query model to rank websites. We conduct several experiments by using real HTTP traffic traces collected from a commercial ISP covering an entire city in northern China. Experimental results show that the proposed techniques can reduce the deviation existing between the ground truth and the ranking results vastly. In addition, we find that the websites providing video service have higher ratio of unknown IP as well as higher ratio of unknown traffic than the websites providing text web page service. Specifically, we find that the top-3 video websites have more than 90% of unknown web traffic. All these findings are helpful for ISPs understanding network status and deploying Content Distributed Network(CDN).展开更多
基金The High Technology Research Plan of Jiangsu Prov-ince (No.BG2004034)the Foundation of Graduate Creative Program ofJiangsu Province (No.xm04-36).
文摘A novel data streams partitioning method is proposed to resolve problems of range-aggregation continuous queries over parallel streams for power industry.The first step of this method is to parallel sample the data,which is implemented as an extended reservoir-sampling algorithm.A skip factor based on the change ratio of data-values is introduced to describe the distribution characteristics of data-values adaptively.The second step of this method is to partition the fluxes of data streams averagely,which is implemented with two alternative equal-depth histogram generating algorithms that fit the different cases:one for incremental maintenance based on heuristics and the other for periodical updates to generate an approximate partition vector.The experimental results on actual data prove that the method is efficient,practical and suitable for time-varying data streams processing.
基金Project(50275150) supported by the National Natural Science Foundation of ChinaProject(20040533035) supported by the National Research Foundation for the Doctoral Program of Higher Education of China
文摘Data stream management system (DSMS) provides convenient solutions to the problem of processing continuous queries on data streams.Previous approaches for scheduling these queries and their operators assume that each operator runs in separate thread or all operators combine in one query plan and run in a single thread.Both approaches suffer from severe drawbacks concerning the thread overhead and the stalls due to expensive operators.To overcome these drawbacks,a novel approach called clustered operators scheduling (COS) is proposed that adaptively clusters operators of the query plan into a number of groups based on their selectivity and computing cost using S-mean clustering.Experimental evaluation is provided to demonstrate the potential benefits of COS scheduling over the other scheduling strategies.COS can provide adaptive,flexible,reliable,scalable and robust design for continuous query processor.
基金supported by 111 Project of China under Grant No.B08004
文摘Top-k ranking of websites according to traffic volume is important for Internet Service Providers(ISPs) to understand network status and optimize network resources. However, the ranking result always has a big deviation with actual rank for the existence of unknown web traffic, which cannot be identified accurately under current techniques. In this paper, we introduce a novel method to approximate the actual rank. This method associates unknown web traffic with websites according to statistical probabilities. Then, we construct a probabilistic top-k query model to rank websites. We conduct several experiments by using real HTTP traffic traces collected from a commercial ISP covering an entire city in northern China. Experimental results show that the proposed techniques can reduce the deviation existing between the ground truth and the ranking results vastly. In addition, we find that the websites providing video service have higher ratio of unknown IP as well as higher ratio of unknown traffic than the websites providing text web page service. Specifically, we find that the top-3 video websites have more than 90% of unknown web traffic. All these findings are helpful for ISPs understanding network status and deploying Content Distributed Network(CDN).