To solve the query processing correctness problem in semantic-based relational data integration, the semantics of SPARQL (simple protocol and RDF query language) queries is defined. During query rewriting, all relevant tables are found and decomposed into minimal connectable units. These units are joined according to the semantic queries to produce semantically correct query plans. Algorithms for query rewriting and transformation are presented, and their computational complexity is discussed. In the worst case, the query decomposition algorithm finishes in O(n^2) time and the query rewriting algorithm requires O(nm) time. Experiments verify the performance of the algorithms and show that when the query length is less than 8, the query processing algorithms provide satisfactory performance.
A defining characteristic of continuous queries over on-line data streams, possibly bounded by sliding windows, is the potentially infinite and time-evolving nature of their inputs and outputs. For the different update patterns of continuous queries, suitable data structures greatly improve query processing efficiency. In this paper, we propose a data structure suited to the weak non-monotonic update pattern, in which the lifetime of each tuple is known at generation time but lifetimes are not necessarily equal. The new data structure combines the ladder queue with the features of the weak non-monotonic update pattern. Experimental results show that the new data structure performs much better than the traditional calendar queue in many cases.
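The weak non-monotonic pattern described above (expiration time known on arrival, lifetimes of different lengths) can be illustrated with a small sketch. This is not the ladder-queue structure from the abstract: it swaps in a plain binary heap from Python's `heapq`, and the class name `ExpiryWindow` and its methods are hypothetical names chosen for illustration.

```python
import heapq
from itertools import count


class ExpiryWindow:
    """Holds tuples whose expiration time is known when they arrive."""

    def __init__(self):
        self._heap = []      # entries are (expire_at, seq, item)
        self._seq = count()  # tie-breaker so items themselves are never compared

    def insert(self, item, now, lifetime):
        # Lifetimes may differ per tuple; only the expiration time is stored.
        heapq.heappush(self._heap, (now + lifetime, next(self._seq), item))

    def expire(self, now):
        """Remove and return every item whose expiration time is <= now."""
        expired = []
        while self._heap and self._heap[0][0] <= now:
            expired.append(heapq.heappop(self._heap)[2])
        return expired

    def __len__(self):
        return len(self._heap)


window = ExpiryWindow()
window.insert("a", now=0, lifetime=5)
window.insert("b", now=1, lifetime=2)
window.insert("c", now=2, lifetime=10)
print(window.expire(now=4))   # "b" expires at time 3, so it is the only one returned
print(window.expire(now=12))  # "a" (time 5) comes out before "c" (time 12)
```

A heap gives logarithmic insertion and removal; calendar and ladder queues aim for better amortized behavior on this kind of workload, which is the comparison the abstract reports.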
Query processing in distributed database management systems (DBMSs) faces more challenges than in a single-node DBMS, such as more operators and more factors in cost models and metadata, and query optimization is already an NP-hard problem in the single-node case. Learned query optimizers (mainly for single-node DBMSs) have received attention because of their capability to capture data distributions and their flexibility in avoiding hand-crafted rules during refinement and adaptation to new hardware. In this paper, we focus on extending learned query optimizers to distributed DBMSs. Specifically, we propose one possible but general architecture for a learned query optimizer in the distributed context and highlight its differences from learned optimizers for single-node systems. In addition, we discuss the challenges and possible solutions.
Recent developments in wireless communication technologies and the popularity of smartphones are making location-based services (LBS) popular. However, sending queries to LBS servers with users' exact locations may threaten user privacy. Therefore, there has been much research on generating a cloaked query region to protect user privacy. Consequently, an efficient query processing algorithm for a query region is required. In this paper, we propose k-nearest neighbor (k-NN) query processing algorithms for a query region in road networks. To efficiently retrieve the k-NN points of interest (POIs), we make use of the Island index. We also propose a method that generates an adaptive Island index to improve query processing performance and storage usage. Finally, our performance analysis shows that our k-NN query processing algorithms outperform the existing k-Range Nearest Neighbor (kRNN) algorithm in terms of network expansion cost and query processing time.
This paper presents an extent join for computing path expressions that contain parent-children and ancestor-descendant operations, together with two path expression optimization rules, path-shortening and path-complementing. Path-shortening reduces the number of joins by shortening the path, while path-complementing optimizes path execution by using an equivalent complementary path expression to compute the original one. Experimental results show that the proposed algorithms are more efficient than traditional algorithms.
This paper introduces the constrained K closest pairs query, which retrieves the K closest pairs satisfying a given spatial constraint from two datasets. For datasets indexed by R-trees in spatial databases, three algorithms are presented for answering this kind of query. Among them, the two-phase Range+Join and Join+Range algorithms adopt a strategy that changes the execution order of the range and closest pairs queries, and the constrained heap-based algorithm uses extended distance functions to prune the search space and minimize the pruning distance. Experimental results show that the constrained heap-based algorithm has better applicability and performance than the two-phase algorithms.
Recently, in the area of big data, popular applications such as web search engines and recommendation systems have faced the problem of diversifying results during query processing. It is therefore both significant and essential to propose methods for big data that increase the diversity of the result set. In this paper, we first define the diversity of a set and the ability of an element to improve the overall diversity. Based on these definitions, we propose a diversification framework that performs well in terms of effectiveness and efficiency and has a theoretical guarantee on its probability of success. Second, we design implementation algorithms based on this framework for both numerical and string data. Third, for numerical and string data respectively, we carry out extensive experiments on real data to verify the performance of the proposed framework, and we also perform scalability experiments on synthetic data.
Cloud computing is a very promising paradigm of service-oriented computing. One major benefit of cloud computing is its elasticity, i.e., the system's capacity to provide and remove resources automatically at runtime. To exploit it, it is essential to design and implement an efficient and effective technique that takes full advantage of the system's potential flexibility. This paper presents a non-intrusive approach that monitors the performance of relational database management systems in a cloud infrastructure and automatically makes decisions to maximize the efficiency of the provider's environment while still satisfying the agreed-upon service level agreements (SLAs). Our experiments, conducted on Amazon's cloud infrastructure, confirm that our technique can automatically and dynamically adjust the system's allocated resources while observing the SLA.
In recent years there has been significant interest in peer-to-peer (P2P) environments in the data management community. However, almost all work so far has focused on exact query processing in current P2P data systems. The autonomy of peers has also not been considered sufficiently. In addition, the system cost is very high because shared data is published per document instead of per document set. In this paper, abstract indices (AbIx) are presented to implement content-based approximate queries in centralized, distributed and structured P2P data systems. They can be used to search as few peers as possible while returning as many results satisfying users' queries as possible, with high peer autonomy guaranteed. Abstract indices also have low system cost, improve query processing speed, and support very frequent updates and set-based information publishing. To verify the effectiveness of abstract indices, a simulator with 10,000 peers and over 3 million documents is built, and several metrics are proposed. The experimental results show that abstract indices work well in various P2P data systems.
Area query processing is significant for various applications of wireless sensor networks because it can request information about particular areas of the monitored environment. Existing query processing techniques cannot solve area queries. Intuitively, centralized processing at the base station can answer area queries by collecting information from all sensor nodes. However, this method is not suitable for energy-limited wireless sensor networks, since a large amount of energy is wasted reporting useless data. This motivates us to propose an energy-efficient in-network area query processing scheme. In our scheme, the monitored area is partitioned into grids, and a unique Gray code number is used to represent a Grid ID (GID), which is also an effective way to describe an area. Furthermore, a reporting tree is constructed to process area merging and data aggregation. Based on the properties of GIDs, subareas can be merged easily and useless data can be discarded as early as possible to reduce energy consumption. To answer continuous queries energy-efficiently, we also design an incremental update method that continuously generates query results. In essence, all of these strategies serve to reduce energy consumption. A thorough simulation study shows that our scheme is effective and energy-efficient.
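To make the Gray code idea in the sensor-network abstract above concrete, here is a minimal sketch of the standard binary-reflected Gray code. The abstract does not say how row and column codes are combined into a GID, so the `grid_id` function (concatenating a row code and a column code) is an assumption made purely for illustration.

```python
def gray_encode(n: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def gray_decode(g: int) -> int:
    """Inverse of gray_encode."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def grid_id(row: int, col: int, bits: int) -> int:
    """Hypothetical GID layout: Gray-coded row in the high bits, column in the low bits."""
    return (gray_encode(row) << bits) | gray_encode(col)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Neighbouring cells in a 4x4 grid get IDs that differ in exactly one bit.
print(hamming(grid_id(2, 1, 2), grid_id(2, 2, 2)))
print(hamming(grid_id(1, 3, 2), grid_id(2, 3, 2)))
```

Because adjacent cells differ in a single bit, two neighbouring cells can be described together by one ID with a wildcard bit, which is one plausible reading of why such GIDs make subareas easy to merge.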
To reduce disk access time, a database can be stored on several simultaneously accessible disks. In this paper, we are concerned with the dynamic d-attribute database allocation problem for range queries. An allocation method, called the coordinate modulo allocation method, is proposed to allocate the data of a d-attribute database among disks so that the maximum disk access concurrency can be achieved for range queries. Our analysis and experiments show that the method achieves optimal or near-optimal parallelism for range queries. The paper gives the conditions under which the method is optimal, as well as worst-case bounds on its performance. In addition, a parallel algorithm for processing range queries is described at the end of the paper. The method has been used in the statistical and scientific database management system that we are designing.
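The disk allocation abstract above does not spell out its allocation formula. As a hedged illustration of the general idea, the sketch below assumes the simplest modulo-style rule: a grid cell with integer coordinates is placed on disk `(sum of coordinates) mod M`. The function names and this exact rule are assumptions, not the paper's definition.

```python
from collections import Counter
from itertools import product


def disk_for_cell(coords, num_disks):
    """Assumed rule: disk index is the coordinate sum modulo the disk count."""
    return sum(coords) % num_disks


def disk_load(ranges, num_disks):
    """Count how many cells of a range query fall on each disk.

    ranges holds one inclusive (low, high) pair per attribute.
    """
    cells = product(*(range(lo, hi + 1) for lo, hi in ranges))
    return Counter(disk_for_cell(cell, num_disks) for cell in cells)


# A 4x4 range query on 4 disks touches every disk equally often,
# so all disks can be read in parallel.
load = disk_load([(0, 3), (0, 3)], num_disks=4)
print(sorted(load.items()))
```

With this rule, any run of M consecutive cells along one attribute lands on M different disks, which is the kind of concurrency for range queries that the abstract is concerned with.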
Multiple time series (MTS), which describe an object in multiple dimensions, build on single time series and have proved useful. In this paper, a new analytical method called α/β-Dominant-Skyline on MTS and a formal definition of the α/β-dominant skyline MTS are given. Three algorithms, called NL, BC and MFB, are also proposed to answer α/β-dominant skyline queries over MTS. Finally, experimental results on both synthetic and real data verify the correctness and effectiveness of the proposed method and algorithms.
Purpose: A resilient distributed processing technique (RDPT), in which the mapper and reducer are simplified with Spark contexts to support distributed parallel query processing. Design/methodology/approach: The proposed work is implemented in Pig Latin with Spark contexts to develop query processing in a distributed environment. Findings: Query processing in Hadoop relies on distributed processing with the MapReduce model. MapReduce distributes the work across different nodes through complex mapper and reducer implementations, and its results are valid only up to a certain data size. Originality/value: Pig supports the required parallel processing framework with the following constructs during query processing: FOREACH, FLATTEN, COGROUP.
The idea of the positional inverted index is exploited for indexing graph databases. The main idea is to use hash tables to prune a considerable portion of the graph database that cannot contain the answer set. These tables are implemented using column-based techniques and store the database graphs, frequent subgraphs and node neighborhoods. To check the remaining graphs exactly, vertex invariants are used for the isomorphism test, which can be implemented in parallel. The evaluation results indicate that the proposed method outperforms existing methods.
The unified multimedia query language (UMQL) is a powerful general-purpose multimedia query language that is well suited to multimedia information retrieval. This paper proposes a grammar analysis model to implement effective grammatical processing for the language. It separates the grammar analysis of a UMQL query specification into two phases, syntactic analysis and semantic analysis, and uses extended Backus-Naur form (EBNF) and logical algebra, respectively, to specify the restrictive grammar rules of each phase. As a result, the model can present error guidance for a query specification with incorrect grammar. The model not only suits the processing of UMQL queries well, but also offers guidance for other projects concerning query processing for descriptive query languages.
Since web-based GIS processes large volumes of spatial geographic information over the internet, the efficiency of spatial data query processing and transmission should be improved. This paper presents two efficient methods for this purpose: division transmission and progressive transmission. In the division transmission method, a map is divided into several parts, called “tiles”, and only the tiles a client requests are transmitted. In the progressive transmission method, a map is split into several phase views based on the significance of vertices, and when a client requests a spatial object, the server produces the target object and transmits it progressively. To realize these methods, the paper proposes the “tile division” and “priority order estimation” algorithms and the corresponding data transmission strategies. Compared with traditional methods such as “map total transmission” and “layer transmission”, the proposed web-based GIS data transmission increases data transmission efficiency by a great margin.
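As a rough illustration of the division transmission idea in the GIS abstract above, the sketch below computes which tiles a client viewport overlaps, so that only those tiles would need to be sent. The square tiles, the map origin at (0, 0) and the function name `tiles_for_viewport` are assumptions; the abstract does not describe its tile division algorithm in this detail.

```python
import math


def tiles_for_viewport(min_x, min_y, max_x, max_y, tile_size):
    """Return (col, row) indices of the square tiles that a viewport overlaps.

    Assumes the map origin is (0, 0) and that tile (c, r) covers
    [c * tile_size, (c + 1) * tile_size) on x and likewise on y.
    """
    first_col = math.floor(min_x / tile_size)
    last_col = math.floor(max_x / tile_size)
    first_row = math.floor(min_y / tile_size)
    last_row = math.floor(max_y / tile_size)
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# A 270 x 210 viewport over 100-unit tiles needs 12 tiles, not the whole map.
needed = tiles_for_viewport(150, 50, 420, 260, tile_size=100)
print(len(needed), needed[0], needed[-1])
```

A server following this scheme would look up or render only the listed tiles; a viewport edge that lies exactly on a tile boundary pulls in the neighbouring tile as well, which is a harmless over-approximation.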
HashQuery, a hash-area-based data dissemination protocol, is designed for wireless sensor networks. Using a hash function keyed on time, both mobile sinks and sensors can determine the same hash area. Sensors send information about the events they monitor to the hash area, and mobile sinks need only query that area instead of flooding the whole network, so much energy can be saved. In addition, the location of the hash area changes over time to balance energy consumption across the whole network. Theoretical analysis shows that the proposed protocol is energy-efficient, and simulation studies further show that with 5 sources and 5 sinks in the network, it saves at least 50% of the energy compared with the existing two-tier data dissemination (TTDD) protocol, especially in large-scale wireless sensor networks.
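The key property of the HashQuery abstract above is that sinks and sensors derive the same area from the current time without communicating. The sketch below shows one way that could work, under assumptions not stated in the abstract: loosely synchronized clocks, a fixed rotation period, a rectangular grid of candidate areas, and SHA-256 as the hash function. The name `hash_area` and all parameters are hypothetical.

```python
import hashlib


def hash_area(now_s, period_s, grid_cols, grid_rows):
    """Map the current time period to a (col, row) grid cell.

    Every node that evaluates this within the same period gets the same cell.
    """
    epoch = int(now_s // period_s)  # index of the current rotation period
    digest = hashlib.sha256(str(epoch).encode("ascii")).digest()
    cell = int.from_bytes(digest[:8], "big") % (grid_cols * grid_rows)
    row, col = divmod(cell, grid_cols)
    return col, row


# A sensor and a sink with slightly different clocks agree on the area.
sensor_view = hash_area(now_s=1000.2, period_s=60, grid_cols=8, grid_rows=8)
sink_view = hash_area(now_s=1015.9, period_s=60, grid_cols=8, grid_rows=8)
print(sensor_view == sink_view)
```

Because the epoch index changes every period, the selected cell moves around the grid over time, which matches the abstract's point about balancing energy consumption across the network.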
The k-median problem has attracted a number of researchers. However, few of them have considered both the dynamic environment and the issue of accuracy. In this paper, a new type of query, called the continuous median monitoring (CMM) query, is studied. It considers the k-median problem in a dynamic environment with an accuracy guarantee. A continuous group nearest neighbor based (CGB) algorithm and an average distance medoid (ADM) algorithm are proposed to solve the CMM problem. ADM is a hill-climbing algorithm that converges rapidly by checking only qualified candidates. Experiments show that ADM is more efficient than CGB and outperforms the classical PAM (partitioning around medoids) and CLARANS (clustering large applications based on randomized search) algorithms under various parameter settings.
Automated performance tuning of data management systems offers various benefits, such as improved performance, lower administration costs, and reduced workloads for database administrators (DBAs). Currently, DBAs tune the performance of database systems with little help from the database servers. In this paper, we propose a new technique for automated performance tuning of data management systems. First, we show how to use periods of low workload to improve performance during periods of high workload. We demonstrate that extending a database system with materialised views and indices while the workload is low may contribute to better performance in a subsequent period of high workload. The paper proposes several online algorithms for continuously processing estimated database workloads, for discovering the best plan of materialised view and index extensions, and for eliminating extensions that are no longer needed. We present experimental results that show how the proposed automated performance tuning technique improves the overall performance of a data management system.
Answering reachability queries is one of the fundamental graph operations. Existing approaches either accelerate index construction by building an index that covers only part of the reachability relationship, which may require costly traversal operations when answering a query, or accelerate query answering by building an index that covers the complete reachability relationship, which may be inefficient because complete node labels must be compared. We propose a novel labeling scheme, which covers the complete reachability relationship, to accelerate reachability query processing. The idea is to decompose the given directed acyclic graph (DAG) G into two subgraphs, G1 and G2. For G1, we propose to use topological labels consisting of two integers to answer all reachability queries. For G2, we construct 2-hop labels, as existing methods do, to answer the queries that cannot be answered by topological labels. The benefits of our method are twofold. On the one hand, it does not need to perform costly traversal operations when answering queries. On the other hand, it can answer most queries in constant time without comparing whole node labels. We confirm the efficiency of our approach with extensive experimental studies on 20 real datasets.
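The reachability abstract above does not define its two-integer topological labels. To illustrate how two integers per node can answer reachability in constant time, the sketch below swaps in the classic interval labeling for a tree: each node gets the range of preorder numbers in its subtree. This is a well-known technique shown for intuition only, not the paper's scheme, and the function names are hypothetical.

```python
def interval_labels(children, root):
    """Assign each tree node a (start, end) pair from a preorder walk.

    start is the node's preorder number; end is the largest preorder
    number inside the node's subtree.
    """
    start, end = {}, {}
    counter = 0
    stack = [(root, False)]
    while stack:
        node, exiting = stack.pop()
        if exiting:
            end[node] = counter - 1  # last number handed out in this subtree
            continue
        start[node] = counter
        counter += 1
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return {node: (start[node], end[node]) for node in start}


def reaches(labels, u, v):
    """u reaches v exactly when v's start lies inside u's interval."""
    start_u, end_u = labels[u]
    start_v, _ = labels[v]
    return start_u <= start_v <= end_u


tree = {"a": ["b", "c"], "b": ["d"]}
labels = interval_labels(tree, "a")
print(labels)
print(reaches(labels, "a", "d"), reaches(labels, "b", "c"))
```

Each query is two integer comparisons and no traversal. Interval labels alone only cover tree-shaped reachability, which is why general DAG schemes pair a cheap label like this with a second index for the remaining edges, as the abstract does with 2-hop labels.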
Funding (SPARQL query rewriting paper): Weaponry Equipment Pre-Research Foundation of PLA Equipment Ministry (No. 9140A06050409JB8102); Pre-Research Foundation of PLA University of Science and Technology (No. 2009JSJ11).
Funding (data stream update pattern paper): Natural Science Foundation of China (No. 60873030); National High Technology Research and Development Program of China (No. 2007AA01Z309); Defense Pre-Research Foundation of China (No. 9140A04010209JW0504 and No. 9140A15040208JW0501).
Funding (learned query optimizer paper): partially supported by NSFC under Grant Nos. 61832001 and 62272008, and by the ZTE Industry-University-Institute Fund Project.
Funding (k-NN query region paper): supported by the Korea Institute of Science and Technology Information (KISTI).
Funding (constrained K closest pairs paper): supported by the National Natural Science Foundation of China (60073045).
Funding (result diversification paper): partially supported by NSFC (Grant Nos. U1509216, U1866602, 61602129) and Microsoft Research Asia.
Funding (abstract indices paper): supported by the National Natural Science Foundation of China under Grant No. 60473077 and the Program for New Century Excellent Talents in University.
文摘In recent years there has been a significant interest in peer-to-peer (P2P) environments in the community of data management. However, almost all work, so far, is focused on exact query processing in current P2P data systems. The autonomy of peers also is not considered enough. In addition, the system cost is very high because the information publishing method of shared data is based on each document instead of document set. In this paper, abstract indices (AbIx) are presented to implement content-based approximate queries in centralized, distributed and structured P2P data systems. It can be used to search as few peers as possible but get as many returns satisfying users' queries as possible on the guarantee of high autonomy of peers. Also, abstract indices have low system cost, can improve the query processing speed, and support very frequent updates and the set information publishing method. In order to verify the effectiveness of abstract indices, a simulator of 10,000 peers, over 3 million documents is made, and several metrics are proposed. The experimental results show that abstract indices work well in various P2P data systems.
文摘Area query processing is significant for various applications of wireless sensor networks since it can request information of particular areas in the monitored environment. Existing query processing techniques cannot solve area queries. Intuitively, centralized processing on Base Station can accomplish area queries via collecting information from all sensor nodes. However, this method is not suitable for wireless sensor networks with limited energy since a large amount of energy is wasted for reporting useless data. This motivates us to propose an energy-efficient in-network area query processing scheme. In our scheme, the monitored area is partitioned into grids, and a unique gray code number is used to represent a Grid ID (GID), which is also an effective way to describe an area. Furthermore, a reporting tree is constructed to process area merging and data aggregations. Based on the properties of GIDs, subareas can be merged easily and useless data can be discarded as early as possible to reduce energy consumption. For energy-efficiently answering continuous queries, we also design an incremental update method to continuously generate query results. In essence, all of these strategies are pivots to conserve energy consumption. With a thorough simulation study, it is shown that our scheme is effective and energy-efficient
Abstract: To reduce disk access time, a database can be stored on several simultaneously accessible disks. In this paper, we are concerned with the dynamic d-attribute database allocation problem for range queries. An allocation method, called the coordinate modulo allocation method, is proposed to allocate the data of a d-attribute database among disks so that maximum disk access concurrency can be achieved for range queries. Our analysis and experiments show that the method achieves optimal or near-optimal parallelism for range queries. The paper gives the conditions under which the method is optimal, as well as worst-case bounds on its performance. In addition, a parallel algorithm for processing range queries is described at the end of the paper. The method has been used in the statistical and scientific database management system that we are designing.
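The abstract above does not spell out the allocation rule. A minimal sketch, assuming that each attribute domain is cut into intervals, that a data cell is identified by its tuple of interval indices, and that the cell goes to disk (sum of coordinates) mod M (this rule and all names below are assumptions for illustration), shows how a range query spreads over the disks:

```python
from collections import Counter
from itertools import product


def disk_for_cell(coords, num_disks):
    """Assumed rule: disk number = (sum of the cell's coordinates) mod num_disks."""
    return sum(coords) % num_disks


def range_query_load(ranges, num_disks):
    """Count how many cells of a range query fall on each disk.

    `ranges` holds one inclusive (low, high) pair of interval indices
    per attribute.
    """
    load = Counter()
    for cell in product(*(range(lo, hi + 1) for lo, hi in ranges)):
        load[disk_for_cell(cell, num_disks)] += 1
    return load


if __name__ == "__main__":
    # A 4 x 4 range query over two attributes, spread across 4 disks.
    print(dict(range_query_load([(0, 3), (0, 3)], num_disks=4)))
```

Under this assumed rule, the response time of a range query is governed by the most heavily loaded disk, so an even spread of cells is what yields high concurrency.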
Funding: Supported by the National Natural Science Foundation of China under Grant No. 61170064, the National High Technology Research and Development 863 Program of China under Grant No. 2013AA013204, and the Tsinghua National Laboratory for Information Science and Technology (TNLIST) Cross-Discipline Foundation.
Abstract: Multiple time series (MTS), which describe an object in multiple dimensions, build on single time series and have proved to be useful. In this paper, a new analytical method on MTS called α/β-Dominant-Skyline and a formal definition of the α/β-dominant skyline MTS are given. Three algorithms, called NL, BC and MFB, are also proposed to address α/β-dominant skyline queries over MTS. Finally, experimental results on both synthetic and real data verify the correctness and effectiveness of the proposed method and algorithms.
Abstract: Purpose: To present a resilient distributed processing technique (RDPT) in which the mapper and reducer are simplified with Spark contexts and support distributed parallel query processing. Design/methodology/approach: The proposed work is implemented in Pig Latin with Spark contexts to develop query processing in a distributed environment. Findings: Query processing in Hadoop relies on distributed processing with the MapReduce model. MapReduce distributes the work across different nodes through the implementation of complex mappers and reducers. Its results are valid only up to a certain data size. Originality/value: Pig supports the required parallel processing framework with the following constructs during the processing of queries: FOREACH, FLATTEN and COGROUP.
Abstract: The idea of the positional inverted index is exploited for indexing graph databases. The main idea is to use hash tables to prune a considerable portion of the graph database that cannot contain the answer set. These tables are implemented using column-based techniques and are used to store the graphs of the database, frequent subgraphs and the neighborhoods of nodes. For exact checking of the remaining graphs, vertex invariants are used in the isomorphism test, which can be implemented in parallel. The evaluation results indicate that the proposed method outperforms existing methods.
Funding: Supported by the National High-Tech Research and Development Plan of China under Grant No. 2006AA01Z430.
Abstract: The unified multimedia query language (UMQL) is a powerful general-purpose multimedia query language that is well suited to multimedia information retrieval. The paper proposes a grammar analysis model to implement effective grammatical processing for the language. It separates the grammar analysis of a UMQL query specification into two phases, syntactic analysis and semantic analysis, and uses extended Backus-Naur form (EBNF) and logical algebra, respectively, to specify the restrictive grammar rules of each phase. As a result, the model can present error guidance for a query specification with incorrect grammar. The model not only suits the processing of UMQL queries well, but also offers guidance for other projects concerned with query processing for descriptive query languages.
Abstract: Since web-based GIS processes large volumes of spatial geographic information over the Internet, the efficiency of spatial data query processing and transmission should be improved. This paper presents two efficient methods for this purpose: division transmission and progressive transmission. In the division transmission method, a map is divided into several parts, called “tiles”, and only the tiles requested by a client are transmitted. In the progressive transmission method, a map is split into several phase views based on the significance of vertices, and when a client requests a spatial object, the server produces the target object and then transmits it progressively. To realize these methods, the “tile division” and “priority order estimation” algorithms and the corresponding data transmission strategies are proposed in this paper. Compared with traditional methods such as “map total transmission” and “layer transmission”, the web-based GIS data transmission proposed in this paper increases data transmission efficiency by a great margin.
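The division transmission idea in the abstract above can be illustrated with a short sketch. It assumes square tiles of a fixed size laid out on a regular grid starting at the map origin, and a client that requests an axis-aligned viewport; the function name and parameters are hypothetical and are not the paper's "tile division" algorithm:

```python
import math


def tiles_for_viewport(min_x, min_y, max_x, max_y, tile_size):
    """Return the (col, row) indices of all tiles that overlap the viewport.

    Assumption: tile (col, row) covers the half-open square
    [col * tile_size, (col + 1) * tile_size) x [row * tile_size, (row + 1) * tile_size).
    """
    first_col = math.floor(min_x / tile_size)
    last_col = math.floor(max_x / tile_size)
    first_row = math.floor(min_y / tile_size)
    last_row = math.floor(max_y / tile_size)
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


if __name__ == "__main__":
    # A 200 x 100 viewport over 100-unit tiles touches 3 columns and 2 rows.
    print(tiles_for_viewport(50, 50, 250, 150, tile_size=100))
```

Only the tiles in the returned list would need to be sent, instead of the whole map or a whole layer.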
Funding: Project (07JJ1010) supported by Hunan Provincial Natural Science Foundation of China; Projects (2006AA01Z202, 2006AA01Z199) supported by the National High-Tech Research and Development Program of China; Project (7002102) supported by the City University of Hong Kong, Strategic Research Grant (SRG); Project (IRT-0661) supported by the Program for Changjiang Scholars and Innovative Research Team in University; Project (NCET-06-0686) supported by the Program for New Century Excellent Talents in University.
Abstract: HashQuery, a Hash-area-based data dissemination protocol, is designed for wireless sensor networks. Using a Hash function that takes time as the key, both mobile sinks and sensors can determine the same Hash area. The sensors send information about the events they monitor to the Hash area, and the mobile sinks need only query that area instead of flooding the whole network, so much energy can be saved. In addition, the location of the Hash area changes over time so as to balance energy consumption across the whole network. Theoretical analysis shows that the proposed protocol can be energy-efficient, and simulation studies further show that, with 5 sources and 5 sinks in the network, it saves at least 50% energy compared with the existing two-tier data dissemination (TTDD) protocol, especially in large-scale wireless sensor networks.
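The abstract above does not give the Hash function itself. The following sketch only illustrates the general idea of a time-keyed rendezvous area, assuming the field is divided into a grid of cells, time is cut into fixed-length periods, and SHA-256 stands in for the protocol's Hash function (all of these are assumptions, not details from the paper):

```python
import hashlib


def hash_area(timestamp, period, grid_cols, grid_rows):
    """Map the current time period to one grid cell (col, row).

    Every node that knows the time and the grid size computes the same
    cell, so sensors and mobile sinks can meet there without flooding.
    """
    epoch = int(timestamp // period)  # same value for the whole period
    digest = hashlib.sha256(str(epoch).encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % (grid_cols * grid_rows)
    return index % grid_cols, index // grid_cols


if __name__ == "__main__":
    # A sensor and a sink with slightly different clocks in the same period.
    print(hash_area(100, period=60, grid_cols=10, grid_rows=10))
    print(hash_area(119, period=60, grid_cols=10, grid_rows=10))
```

Because the key changes with each period, the selected cell moves around the field over time, which matches the abstract's point about balancing energy consumption.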
Abstract: The k-median problem has attracted a number of researchers. However, few of them have considered both the dynamic environment and the issue of accuracy. In this paper, a new type of query, called the continuous median monitoring (CMM) query, is studied. It considers the k-median problem in a dynamic environment with an accuracy guarantee. A continuous group nearest neighbor based (CGB) algorithm and an average distance medoid (ADM) algorithm are proposed to solve the CMM problem. ADM is a hill-climbing algorithm that converges rapidly by checking only qualified candidates. Experiments show that ADM is more efficient than CGB and outperforms the classical PAM (partitioning around medoids) and CLARANS (clustering large applications based on randomized search) algorithms under various parameter settings.
Abstract: Automated performance tuning of data management systems offers various benefits, such as improved performance, lower administration costs, and reduced workloads for database administrators (DBAs). Currently, DBAs tune the performance of database systems with little help from the database servers. In this paper, we propose a new technique for automated performance tuning of data management systems. First, we show how to use periods of low workload to improve performance in periods of high workload. We demonstrate that extending a database system with materialised views and indices while the workload is low may contribute to better performance in a subsequent period of high workload. The paper proposes several online algorithms for continuously processing estimated database workloads, for discovering the best plan for materialised view and index extensions, and for eliminating extensions that are no longer needed. We present the results of experiments that show how the proposed automated performance tuning technique improves the overall performance of a data management system.
Funding: This work was partly supported by the National Key R&D Program of China (Grant No. 2017YFB0309800); grants from the Natural Science Foundation of China (No. 61472339, No. 61303040, No. 61572421, No. 61272124); the Shanghai Alliance Program (LM201552); and Shanghai University of Engineering and Technology School-Enterprise cooperation projects (15)(DZ-025).
Abstract: Answering reachability queries is one of the fundamental graph operations. Existing approaches either accelerate index construction by building an index that covers only part of the reachability relationship, which may require costly traversal when answering a query, or accelerate query answering by building an index that covers the complete reachability relationship, which may be inefficient because complete node labels must be compared. We propose a novel labeling scheme, which covers the complete reachability relationship, to accelerate reachability query processing. The idea is to decompose the given directed acyclic graph (DAG) G into two subgraphs, G1 and G2. For G1, we propose to use topological labels consisting of two integers to answer all reachability queries. For G2, we construct 2-hop labels, as existing methods do, to answer queries that cannot be answered by topological labels. The benefits of our method are twofold. On one hand, it does not need to perform costly traversal when answering queries. On the other hand, it can quickly answer most queries in constant time without comparing whole node labels. We confirm the efficiency of our approach through extensive experimental studies using 20 real datasets.
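The abstract above does not define its two-integer topological labels. As an illustration of how a label made of two integers can answer a reachability query in constant time, here is a sketch of a well-known related technique, interval labeling over a tree. Treating the labeled subgraph as a rooted tree is an assumption made only for this example, and the code is not the paper's scheme:

```python
def interval_labels(children, root):
    """Label each tree node with (start, end): its pre-order number and
    the largest pre-order number found in its subtree."""
    labels = {}
    start = {}
    clock = 0
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            # All descendants are numbered; clock - 1 is the last one used.
            labels[node] = (start[node], clock - 1)
        else:
            start[node] = clock
            clock += 1
            stack.append((node, True))
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
    return labels


def reaches(labels, u, v):
    """u reaches v iff v's pre-order number lies inside u's interval."""
    u_start, u_end = labels[u]
    return u_start <= labels[v][0] <= u_end


if __name__ == "__main__":
    tree = {"a": ["b", "c"], "b": ["d"]}
    labels = interval_labels(tree, "a")
    print(labels)
    print(reaches(labels, "a", "d"), reaches(labels, "b", "c"))
```

Each query compares two pairs of integers and never traverses the graph. Edges that fall outside the tree would need a second mechanism, which is the role the abstract assigns to 2-hop labels on G2.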