Abstract: To reduce disk access time, a database can be stored on several simultaneously accessible disks. In this paper, we are concerned with the dynamic d-attribute database allocation problem for range queries. An allocation method, called the coordinate modulo allocation method, is proposed to allocate data in a d-attribute database among disks so that the maximum disk access concurrency can be achieved for range queries. Our analysis and experiments show that the method achieves optimal or near-optimal parallelism for range queries. The paper gives the conditions under which the method is optimal, as well as worst-case bounds on its performance. In addition, a parallel algorithm for processing range queries is described at the end of the paper. The method has been used in the statistical and scientific database management system that we are designing.
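The abstract does not spell out the allocation rule, so the following is only a minimal sketch under an assumption: a grid bucket with integer coordinates is placed on the disk given by the sum of its coordinates modulo the number of disks. The names `disk_for_bucket` and `range_query_load` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product


def disk_for_bucket(coords, num_disks):
    """Assumed rule: sum the bucket's coordinates and reduce modulo the disk count."""
    return sum(coords) % num_disks


def range_query_load(lows, highs, num_disks):
    """Count how many buckets of an inclusive range query fall on each disk."""
    axes = [range(lo, hi + 1) for lo, hi in zip(lows, highs)]
    return Counter(disk_for_bucket(c, num_disks) for c in product(*axes))


# A 4 x 4 range query over 4 disks: the busiest disk bounds the parallel access time.
load = range_query_load((0, 0), (3, 3), 4)
busiest = max(load.values())
```

Under this assumed rule, the number of buckets on the busiest disk is the quantity a range query wants to minimize; the ideal value is the total bucket count divided by the number of disks, rounded up.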
Funding: Supported by the Natural Science Foundation of Jiangsu Province (BG2004034)
Abstract: Multidimensional data querying has attracted much interest in the database research community in recent years, yet many existing studies focus mainly on centralized systems. A solution to querying in a Peer-to-Peer (P2P) environment is proposed that achieves both low processing cost, in terms of the number of peers accessed and search messages sent, and balanced query loads among peers. The system is based on a balanced tree-structured P2P network. By partitioning the query space intelligently, the amount of query forwarding is effectively controlled, and the number of peers involved and search messages sent is also limited. Dynamic load balancing can be achieved during space partitioning and query resolution. Extensive experiments confirm the effectiveness and scalability of our algorithms on P2P networks.
Funding: the National Natural Science Foundation of China (No. 61931019).
Abstract: In a database-as-a-service (DaaS) model, a data owner stores data in a database server of a service provider, and the DaaS adopts encryption for data privacy and indexing for data queries. However, an attacker can obtain the original data's statistical information and distribution via the index distribution in the service provider's database. In this work, a novel indexing schema is proposed to satisfy privacy-preserving data management requirements, in which an attacker cannot obtain the data source distribution or statistical information from the index. The approach includes two parts: Hash-based indexing for encrypted data and correctness verification for range queries. The evaluation results demonstrate that the approach can hide the statistical information of the encrypted data distribution while still obtaining correct answers for range queries. Meanwhile, the approach achieves nearly 10 times and 35 times improvement on encrypted data publishing and indexing, respectively, compared with the state-of-the-art method, the order-preserving Hash-based function (OPHF).
Funding: National Natural Science Foundation of China (Grant Nos. 61772371 and 61972286)
Abstract: The volume of trajectory data has grown tremendously in recent years. How to maintain and compute such trajectory data effectively and efficiently has become a challenging task. In this paper, we propose a trajectory spatial and temporal compression framework, namely CLEAN. The key to spatial compression is to mine meaningful frequent trajectory patterns on the road network. By treating the mined patterns as dictionary items, long trajectories can be encoded by shorter paths, thus leading to a smaller space cost. An error-bounded temporal compression is carefully designed on top of the identified spatial patterns for a much lower space cost. Meanwhile, the patterns are also used to improve the performance of two trajectory applications, range query and clustering, without decompression overhead. Extensive experiments on real trajectory datasets validate that CLEAN significantly outperforms existing state-of-the-art approaches in terms of spatial-temporal compression and trajectory applications.
Funding: supported by the National Research Foundation of Korea under Grant No. 2010-0016165; supported by the IT R&D Program of MIC/IITA under Grant No. 2007-S-016-02.
Abstract: Due to the proliferation of the Internet and intranets, distributed storage systems have received a lot of attention. These systems span a large number of machines and store huge amounts of data for many users. In distributed storage systems, a row can be directly accessed using a row key. We concentrate on the problem of efficiently processing queries whose predicate is on a column rather than on a row key. In this paper, we present a cache management technique, called DICE, which maintains the results of range queries to support subsequent range queries. To accelerate the search of the cached query results, we use modified Interval Skip Lists. In addition, we devise a novel cache replacement policy, since DICE maintains an interval rather than a data item. Because our cache replacement policy considers the properties of intervals, our proposed technique is more efficient than traditional buffer replacement algorithms. Our experimental results demonstrate the efficiency of our proposed technique.
Funding: supported by the National High Technology Research and Development 863 Program of China under Grant No. 2007AA01Z404; the Program of Jiangsu Province under Grant No. BE2008135.
Abstract: Distance-based range search is crucial in many real applications. In particular, given a database and a query issuer, a distance-based range search retrieves all the objects in the database whose distances from the query issuer are less than or equal to a given threshold. Often, due to the accuracy of positioning devices, updating protocols or characteristics of applications (for example, location privacy protection), data obtained from the real world are imprecise or uncertain. Therefore, existing approaches over exact databases cannot be directly applied to the uncertain scenario. In this paper, we redefine the distance-based range query in the context of uncertain databases, namely the probabilistic uncertain distance-based range (PUDR) query, which obtains objects with confidence guarantees. We categorize the topological relationships between uncertain objects and uncertain search ranges into six cases and present the probability evaluation for each case. Experiments verify that our approach outperforms the Monte Carlo method used in most existing work in both precision and time cost for the uniform uncertainty distribution. The approach approximates, with acceptable errors, the probabilities of objects following other practical uncertainty distributions, such as the Gaussian distribution. Since evaluating a PUDR query requires accessing all the objects in the database, which is quite costly, we propose spatial pruning and probabilistic pruning techniques to reduce the search space. Two metrics, false positive rate and false negative rate, are introduced to measure the quality of query results. An extensive empirical study has been conducted to demonstrate the efficiency and effectiveness of our proposed algorithms under various experimental settings.
Abstract: I/O parallelism is considered a promising approach to achieving high performance in parallel data warehousing systems, where huge amounts of data and complex analytical queries have to be processed. This paper proposes a parallel secondary data cube storage structure (PHC for short) to efficiently support the processing of range sum queries and dynamic updates on data cubes using parallel computing systems. Based on PHC, two parallel algorithms for processing range sum queries and updates are also proposed. Both algorithms have the same time complexity, O(log^d n/P). The analytical and experimental results show that PHC and the parallel algorithms have high performance and achieve optimal speedup.
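The abstract does not describe the internals of PHC, so the following is not the paper's structure. It is a minimal sequential sketch of a well-known alternative, a two-dimensional Fenwick (binary indexed) tree, which illustrates how range sum queries and point updates on a data cube can both run in O(log^d n) time for d = 2. The class name `Fenwick2D` and its methods are chosen here for illustration only.

```python
class Fenwick2D:
    """Two-dimensional Fenwick tree: point update and range sum in O(log^2 n)."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        # Internal arrays are 1-based, so allocate one extra row and column.
        self.tree = [[0] * (cols + 1) for _ in range(rows + 1)]

    def update(self, r, c, delta):
        """Add delta to the cell at 0-based position (r, c)."""
        i = r + 1
        while i <= self.rows:
            j = c + 1
            while j <= self.cols:
                self.tree[i][j] += delta
                j += j & -j
            i += i & -i

    def _prefix(self, r, c):
        """Sum of all cells in rows [0, r) and columns [0, c)."""
        total = 0
        i = r
        while i > 0:
            j = c
            while j > 0:
                total += self.tree[i][j]
                j -= j & -j
            i -= i & -i
        return total

    def range_sum(self, r1, c1, r2, c2):
        """Sum over the inclusive rectangle (r1, c1) to (r2, c2) by inclusion-exclusion."""
        return (self._prefix(r2 + 1, c2 + 1) - self._prefix(r1, c2 + 1)
                - self._prefix(r2 + 1, c1) + self._prefix(r1, c1))


# Fill a 3 x 3 cube with the values 0..8 in row-major order.
cube = Fenwick2D(3, 3)
for r in range(3):
    for c in range(3):
        cube.update(r, c, r * 3 + c)
```

A parallel design would additionally have to partition such a structure across P processors, which this sketch does not attempt.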
Funding: supported by the National High Technology Research and Development 863 Program of China under Grant No. 2007AA01Z404; the National Research Foundation for the Doctoral Program of Higher Education of China under Grant No. 20103218110017; the Science & Technology Pillar Program of Jiangsu Province of China under Grant No. BE2008135; the Postdoctoral Science Foundation of China under Grant No. 20100481133.
Abstract: Data obtained from the real world are imprecise or uncertain due to the accuracy of positioning devices, updating protocols or characteristics of applications. On the other hand, users sometimes prefer to express their requests qualitatively with vague conditions, and in some applications different parts of the search region are not equally important. We address the problem of efficiently processing fuzzy range queries for uncertain moving objects whose whereabouts over time are not known exactly. The basic form of such a query is: find objects that are always/sometimes near the query issuer, with qualifying guarantees no less than a given threshold, during a given temporal interval. We model the location uncertainty of moving objects using probability density functions and describe the indeterminate boundary of the query range with a fuzzy set. We present the qualifying guarantee evaluation of objects, and propose pruning techniques based on the α-cut of the fuzzy set to shrink the search space efficiently. We also design rules to reject non-qualifying objects and validate qualifying objects in order to avoid unnecessary, costly numeric integrations in the refinement step. An extensive empirical study has been conducted to demonstrate the efficiency and effectiveness of algorithms under various experimental
Funding: supported by the National Natural Science Foundation of China (Nos. 62072460, 62076245, 61772538, 61772536, 61772537, 4212022).
Abstract: With the increasing popularity of location-based services (LBS), data outsourcing to clouds is an emerging paradigm that eases data management for LBS providers. Geometric range queries, which find points inside geometric areas (e.g., circles or polygons), are one of the fundamental search functions in LBS. To ensure data confidentiality, service users tend to encrypt the data before outsourcing it. However, only a few works on encrypted data consider geometric range queries, because high-dimensional calculations make these queries particularly hard. In this paper, we propose a novel scheme for geometric range queries that protects the privacy of both the data stored at a cloud server and the queries. Our scheme supports querying encrypted spatial data with irregularly shaped areas, achieves fast searches and enables dynamic updates. Experimental results over real-world spatial datasets demonstrate that our scheme requires fewer communication rounds and can speed up search time by 4× compared to state-of-the-art schemes, without any potentially visible leakage in the structure.
Funding: This work is partially supported by the Strategic Priority Program of Chinese Academy of Sciences under Grant No. XDB02040009, the Key Program of the National Natural Science Foundation of China under Grant No. 61532016, the Key Program of Cloud Computing and Big Data of the Ministry of the Science and Technology of China under Grant No. 2016YFB1000200, and Tencent Inc.
Abstract: Many NoSQL (Not Only SQL) databases have been proposed to store and query huge amounts of data. Some of them, such as BigTable, PNUTS, and HBase, can be modeled as distributed ordered tables (DOTs). Many additional indexing techniques have been presented to support queries on non-key columns for DOTs. However, there has been no comprehensive analysis or comparison of these techniques, which makes it difficult for users to select or propose a proper indexing technique for a given workload. This paper proposes a taxonomy based on six indexing issues to classify indexing techniques on DOTs and provides a comprehensive review of the state-of-the-art techniques. Based on the taxonomy, we propose a performance model named QSModel to estimate the query time and storage cost of these techniques, and run experiments on a practical workload from Tencent to evaluate this model. The results show that the maximum error rates of the query time and storage cost are 24.2% and 9.8%, respectively. Furthermore, we propose IndexComparator, an open source project that implements representative indexing techniques. Users can therefore select the best-fit indexing technique based on both theoretical analysis and practical experiments.
Funding: supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grant Nos. RGPIN-2020-06482, RGPIN-2016-06253 and CGSD2-503941-2017.
Abstract: When querying databases containing sensitive information, the privacy of individuals stored in the database has to be guaranteed. Such guarantees are provided by differentially private mechanisms, which add controlled noise to the query responses. However, most such mechanisms do not take into consideration the valid range of the query being posed. Thus, noisy responses that fall outside of this range may be produced. To rectify this and thereby improve the utility of the mechanism, the commonly used Laplace distribution can be truncated to the valid range of the query and then normalized. However, such a data-dependent normalization leaks additional information about the true query response, thereby violating the differential privacy guarantee. Here, we propose a new method that preserves the differential privacy guarantee through a careful determination of an appropriate scaling parameter for the Laplace distribution. We adapt the privacy guarantee in the context of the Laplace distribution to account for data-dependent normalization factors and study this guarantee for different classes of range constraint configurations. We provide derivations of the optimal scaling parameter (i.e., the minimal value that preserves differential privacy) for each class, or provide an approximation thereof. As a result of this work, one can use the Laplace distribution to answer queries in a range-adherent and differentially private manner. To demonstrate the benefits of our proposed method of normalization, we present an experimental comparison against other range-adherent mechanisms. We show that our proposed approach provides improved utility over the alternative mechanisms.
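The truncate-and-normalize step mentioned in the abstract can be sketched as inverse-CDF sampling from a Laplace distribution restricted to the valid range. This is only an illustration of the sampling step: the function names are hypothetical, and `scale` is left as an input because the abstract does not give the formula for the privacy-preserving scaling parameter. Using the usual untruncated scale here would not, according to the abstract, preserve differential privacy.

```python
import math
import random


def laplace_cdf(x, mu, b):
    """CDF of a Laplace distribution with location mu and scale b."""
    if x < mu:
        return 0.5 * math.exp((x - mu) / b)
    return 1.0 - 0.5 * math.exp(-(x - mu) / b)


def laplace_inv_cdf(p, mu, b):
    """Inverse CDF (quantile function) of the same Laplace distribution."""
    if p < 0.5:
        return mu + b * math.log(2.0 * p)
    return mu - b * math.log(2.0 * (1.0 - p))


def truncated_laplace(true_value, scale, lo, hi, rng=random):
    """Draw a noisy response from Laplace(true_value, scale) restricted to [lo, hi].

    `scale` must come from a separate privacy analysis (an assumption of this
    sketch); it is not computed here.
    """
    p_lo = laplace_cdf(lo, true_value, scale)
    p_hi = laplace_cdf(hi, true_value, scale)
    u = rng.uniform(p_lo, p_hi)  # uniform over the probability mass inside the range
    x = laplace_inv_cdf(u, true_value, scale)
    return min(max(x, lo), hi)  # guard against floating-point drift at the edges


# Example: a true count of 9 on a query whose valid range is [0, 10].
rng = random.Random(0)
samples = [truncated_laplace(9.0, 2.0, 0.0, 10.0, rng) for _ in range(1000)]
```

Sampling uniformly between the CDF values at the two bounds and inverting is equivalent to dividing the truncated density by its normalization constant, so every response falls inside the valid range without rejection loops.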