Abstract: The smart grid has attracted great attention in recent years. It is poised to transform a centralized, producer-controlled network into a decentralized, consumer-interactive network supported by fine-grained monitoring. Large-scale wireless sensor networks (WSNs) are considered one of the most promising technologies for implementing the smart grid. WSNs are applied in almost every aspect of the smart grid, including power generation, transmission, distribution, utilization, and dispatch. Query processing over WSN data in the power grid has become a hot topic because the volume of grid data is very large and the response-time requirements are strict. Top-k query processing is a good fit for these demands: it performs a cooperative query by aggregating each database object's degree of match for every query predicate and returning the k best-matching objects. This paper proposes a framework, based on a cluster-topology sensor network, that effectively applies top-k queries to wireless sensor networks in the smart grid. In the new method, local indices are used to optimize query routing and to process intermediate results inside each cluster, which cuts down data traffic, and the hierarchical join query is executed on the local results. In addition, top-k query results are verified by a clean-up process, and two schemes are adopted to handle node dynamicity, which further reduces communication cost. Case studies and experimental results show that the algorithm outperforms the existing one, with higher-quality results and better efficiency.
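The in-cluster processing idea can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes each cluster head sees every reading in its cluster, that ranking is by a single numeric score, and that local indices, the clean-up step, and node dynamicity are out of scope. The names `local_top_k` and `sink_top_k` are hypothetical. Under these assumptions, forwarding only each cluster's local top-k still yields the exact global top-k, because any globally top-k item is also in the top-k of its own cluster.

```python
import heapq

def local_top_k(readings, k):
    """Cluster head: keep only the k highest-scoring (score, node_id) pairs."""
    return heapq.nlargest(k, readings)

def sink_top_k(clusters, k):
    """Sink: merge the local top-k lists forwarded by every cluster head.

    Returns the global top-k and the number of items that were forwarded.
    """
    forwarded = [item for readings in clusters for item in local_top_k(readings, k)]
    return heapq.nlargest(k, forwarded), len(forwarded)

# Three hypothetical clusters with three sensor readings each.
clusters = [
    [(0.91, "a1"), (0.40, "a2"), (0.75, "a3")],
    [(0.88, "b1"), (0.95, "b2"), (0.10, "b3")],
    [(0.30, "c1"), (0.20, "c2"), (0.60, "c3")],
]
result, sent = sink_top_k(clusters, k=2)
```

With k = 2, the sink receives 6 items rather than all 9 readings, which is the kind of traffic reduction that processing intermediate results inside the cluster aims for.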
Funding: Supported by the 111 Project of China under Grant No. B08004.
Abstract: Top-k ranking of websites by traffic volume is important for Internet Service Providers (ISPs) to understand network status and optimize network resources. However, the ranking result often deviates substantially from the actual rank because of unknown web traffic, which cannot be identified accurately with current techniques. In this paper, we introduce a novel method to approximate the actual rank. The method associates unknown web traffic with websites according to statistical probabilities. We then construct a probabilistic top-k query model to rank websites. We conduct several experiments using real HTTP traffic traces collected from a commercial ISP covering an entire city in northern China. Experimental results show that the proposed techniques greatly reduce the deviation between the ground truth and the ranking results. In addition, we find that websites providing video services have a higher ratio of unknown IPs and a higher ratio of unknown traffic than websites providing text web pages. Specifically, more than 90% of the web traffic of the top-3 video websites is unknown. These findings help ISPs understand network status and deploy Content Distribution Networks (CDNs).
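The effect of attributing unknown traffic by probability can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's model: the per-site probabilities are taken as given, sites are ranked by expected traffic volume (a simplification of a probabilistic top-k query), and the site names and byte counts are made up.

```python
def expected_traffic(known, unknown_flows):
    """Add the expected share of each unknown flow to each site's known traffic.

    known: {site: bytes}
    unknown_flows: list of (bytes, {site: probability}) pairs (assumed inputs)
    """
    totals = dict(known)
    for volume, probs in unknown_flows:
        for site, p in probs.items():
            totals[site] = totals.get(site, 0.0) + p * volume
    return totals

def top_k(totals, k):
    """Rank sites by (expected) traffic volume, highest first."""
    return sorted(totals, key=totals.get, reverse=True)[:k]

# Hypothetical data: most of the video site's traffic is unidentified.
known = {"video.example": 100.0, "text.example": 300.0}
unknown_flows = [(900.0, {"video.example": 0.9, "text.example": 0.1})]

naive_ranking = top_k(known, 2)
totals = expected_traffic(known, unknown_flows)
ranking = top_k(totals, 2)
```

In this toy case the ranking based on identified traffic alone puts the text site first, while the ranking that includes the expected share of unknown traffic puts the video site first.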
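Abstract: Traditional sequential pattern mining (SPM) does not consider pattern repetition and ignores how the utility (unit price or profit) of each item and the pattern length affect user interest. To address this problem, the top-k one-off high average utility sequential pattern mining (TOUP) algorithm is proposed. TOUP consists of two core steps: average utility calculation and candidate pattern generation. First, the CSP (Calculation Support of Pattern) algorithm, based on the occurrence positions of each item and an item repetition relation array, is proposed to compute pattern support, which enables fast calculation of a pattern's average utility. Second, candidate patterns are generated by itemset extension and sequence extension, and a maximum average utility upper bound is proposed and used to prune candidate patterns effectively. Experimental results on five real datasets and one synthetic dataset show that, compared with the TOUP-dfs and HAOP-ms algorithms, TOUP reduces the number of candidate patterns by 38.5% to 99.8% and 0.9% to 77.6%, respectively, and reduces running time by 33.6% to 97.1% and 57.9% to 97.2%, respectively. TOUP therefore performs better and mines the patterns that interest users more efficiently.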
Abstract: With its tamper-proof and traceable properties, blockchain technology has been widely used in the field of data sharing. How to preserve individual privacy while enabling efficient data queries is one of the primary issues in secure data sharing. In this paper, we study verifiable keyword frequency (KF) queries with local differential privacy in blockchain. Data objects contain both numerical and keyword attributes; the latter are sensitive and require privacy protection. However, prior studies in blockchain face a trilemma in privacy protection and are unable to handle KF queries. We propose an efficient framework that protects data owners' privacy on keyword attributes while enabling quick and verifiable processing of KF queries. The framework computes an estimate of a keyword's frequency and is efficient in query time and verification object (VO) size. A utility-optimized local differential privacy technique is used for privacy protection. The data owner adds noise to the data locally, based on local differential privacy, so that an attacker cannot infer the owner of the keywords, while the difference in the probability distribution of the KF stays within the privacy budget. We propose the VB-cm tree as the authenticated data structure (ADS). The VB-cm tree combines the Verkle tree and the Count-Min sketch (CM-sketch) to lower the VO size and query time. The VB-cm tree uses vector commitment to verify the query results. The fixed-size CM-sketch, which summarizes the frequency of multiple keywords, is used to estimate the KF via hashing operations. We conduct an extensive evaluation of the proposed framework. The experimental results show that, compared to the Merkle B+ tree, the query time is reduced by 52.38% and the VO size is reduced by more than one order of magnitude.
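For readers unfamiliar with the Count-Min sketch mentioned above, a minimal standalone Python version follows. It covers only the generic data structure (a fixed-size table of counters updated through several hash functions), not the VB-cm tree, the Verkle commitments, or the differential-privacy noise. The class name, the default `width` and `depth`, and the use of SHA-256 to derive row hashes are assumptions made for illustration.

```python
import hashlib

class CountMinSketch:
    """Fixed-size frequency summary; estimates never fall below the true count."""

    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # Derive one hash per row by salting the key with the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.table[row][self._index(row, key)] for row in range(self.depth))

sketch = CountMinSketch()
for word in ["grid", "grid", "query", "grid"]:
    sketch.add(word)
```

Here `sketch.estimate("grid")` is at least 3 and `sketch.estimate("query")` is at least 1; hash collisions can push an estimate above the true count but never below it, and memory use stays fixed regardless of how many distinct keywords are added.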
Funding: Partially supported by NSFC under Grant Nos. 61832001 and 62272008, and the ZTE Industry-University-Institute Fund Project.
Abstract: Query processing in distributed database management systems (DBMSs) faces more challenges than in a single-node DBMS, such as more operators and more factors in cost models and metadata, even though query optimization in a single-node DBMS is already an NP-hard problem. Learned query optimizers (mainly for single-node DBMSs) have received attention because of their ability to capture data distributions and their flexibility in avoiding hand-crafted rules during refinement and adaptation to new hardware. In this paper, we focus on extending learned query optimizers to distributed DBMSs. Specifically, we propose one possible but general architecture for a learned query optimizer in the distributed context and highlight its differences from learned optimizers for single-node systems. In addition, we discuss the challenges and possible solutions.
Funding: This work is partially supported by the National Natural Science Fund for Distinguished Young Scholars of China under Grant No. 61322208, the National Basic Research 973 Program of China under Grant No. 2012CB316201, the National Natural Science Foundation of China under Grant Nos. 61272178 and 61572122, and the Key Program of the National Natural Science Foundation of China under Grant No. 61532021.
Abstract: The continuous top-k query over a sliding window is a fundamental problem in databases: it retrieves the k objects with the highest scores each time the window slides. Existing studies mainly adopt exact algorithms for this type of query, whose key idea is to maintain a subset of the objects in the window and retrieve answers from it. However, all existing algorithms are sensitive to query parameters and data distribution. In addition, they suffer from expensive incremental maintenance overhead and thus cannot satisfy real-time requirements. In this paper, we define a novel query, the (ε, δ)-approximate continuous top-k query, which returns approximate answers to the top-k query. To support this query efficiently, we propose a framework named PABF (Probabilistic Approximate Based Framework) for approximate top-k queries over a sliding window. We first maintain a self-adaptive pruning value, which filters out newly arrived objects whose probability of being a query result is less than 1 - δ. Objects that are not filtered are combined if the score difference among them is less than a threshold. To maintain these combined results efficiently, PABF also includes a multi-phase merging algorithm. Theoretical analysis indicates that, even in the worst case, maintaining each candidate requires only logarithmic complexity.
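To make the query semantics concrete, the following Python sketch implements a naive exact baseline for a count-based sliding window. It is not PABF and has none of its pruning or merging; it simply rescans the window on every arrival, which illustrates the kind of per-slide maintenance cost that motivates approximate methods. The class name `SlidingTopK` and the sample scores are hypothetical.

```python
from collections import deque
import heapq

class SlidingTopK:
    """Exact top-k over the last `window_size` objects, recomputed on each arrival."""

    def __init__(self, window_size, k):
        # A bounded deque drops the oldest object automatically when full.
        self.window = deque(maxlen=window_size)
        self.k = k

    def push(self, obj_id, score):
        self.window.append((score, obj_id))
        # Full rescan of the window: O(n log k) work per slide.
        return heapq.nlargest(self.k, self.window)

q = SlidingTopK(window_size=3, k=2)
answers = [q.push(i, s) for i, s in enumerate([5, 9, 1, 7, 3])]
```

After the fifth arrival the window holds the objects with scores 1, 7, and 3, so the answer is the objects scored 7 and 3; the object scored 9 has already expired.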
Funding: Supported by the National Natural Science Foundation of China (60673139, 60473073, 60573090).
Abstract: Caching is an important technique for improving the efficiency of query processing. Unfortunately, traditional caching mechanisms are not efficient for the deep Web because of storage space and dynamic maintenance limitations. In this paper, we present a cache mechanism for deep Web queries that is based on Top-K data sources (KDS-CM) instead of result records. By integrating techniques from IR and Top-K, a data reorganization strategy is presented to model KDS-CM. Measures for cache management and optimization are also proposed to improve cache performance. Experimental results show the benefits of KDS-CM in execution cost and dynamic maintenance compared with various alternative strategies.
Abstract: We propose an influential-set-based moving k keyword query processing model, which avoids the shortcoming of safe-region-based approaches that update cost and update frequency cannot be optimized simultaneously. Based on the model, we design a parallel query processing method and a parallel validation method for multicore processing platforms. The time complexities of the algorithms are O((log|D|+p.k)/p.k) and O(log p.k), respectively, both of which are O(1/k) times the time complexity of the state-of-the-art method. The experimental results confirm the superiority of our algorithms over the state-of-the-art method.
Funding: Supported by the Social Science Planning Foundation of Chongqing (Grant No. 2011QNCB28).
Abstract: Purpose: Existing research on predicting queries with news intent has tried to extract classification features from external knowledge bases. This paper presents how to apply features extracted from query logs to automatically identify news queries without using any external resources. Design/methodology/approach: First, we manually labeled 1,220 news queries from Sogou.com. Based on an analysis of these queries, we then identified three features of news queries in terms of query content, time of query occurrence, and user click behavior. Afterwards, we used 12 effective features proposed in the literature as a baseline and conducted experiments with a support vector machine (SVM) classifier. Finally, we compared the impact of the features used in this paper on the identification of news queries. Findings: Compared with the baseline features, the F-score improved from 0.6414 to 0.8368 after adding the three newly identified features, among which the burst point (bst) was the most effective for predicting news queries. In addition, query expression (qes) was more useful than query terms, and among the click behavior-based features, news URL was the most effective. Research limitations: Analyses based on features extracted from query logs may produce limited results. The segmentation tool used in this study is more widely applied to long texts than to short queries. Practical implications: The research will help general-purpose search engines address search intent for news events. Originality/value: Our approach provides a new perspective on recognizing queries with news intent without relying on large news corpora such as blogs or Twitter.