Funding: This work is partially supported by the National Natural Science Fund for Distinguished Young Scholars of China under Grant No. 61322208, the National Basic Research 973 Program of China under Grant No. 2012CB316201, the National Natural Science Foundation of China under Grant Nos. 61272178 and 61572122, and the Key Program of the National Natural Science Foundation of China under Grant No. 61532021.
Abstract: The continuous top-k query over a sliding window is a fundamental problem in databases: it retrieves the k objects with the highest scores each time the window slides. Existing studies mainly adopt exact algorithms for this type of query, whose key idea is to maintain a subset of the objects in the window and to retrieve answers from that subset. However, all existing algorithms are sensitive to query parameters and data distribution. In addition, they incur expensive overhead for incremental maintenance and thus cannot satisfy real-time requirements. In this paper, we define a novel query, the (ε, δ)-approximate continuous top-k query, which returns approximate answers to the top-k query. To support this query efficiently over a sliding window, we propose a framework named PABF (Probabilistic Approximate Based Framework). We first maintain a self-adaptive pruning value, which filters out newly arrived objects whose probability of being a query result is less than 1 − δ. Objects that are not filtered out are combined if the score difference among them is less than a threshold. To maintain these combined results efficiently, PABF also includes a multi-phase merging algorithm. Theoretical analysis indicates that, even in the worst case, maintaining each candidate requires only logarithmic complexity.
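The exact approach that the abstract contrasts itself with (keep only a subset of the window and answer from it) can be illustrated with a small sketch. This is not PABF; it is the well-known dominance-counting idea, in which an object is discarded once k later arrivals have strictly higher scores, because those arrivals outlive it in the window. The class name `SlidingTopK`, the count-based window, and the linear scan on each insert are assumptions made for illustration only.

```python
from collections import deque
import heapq


class SlidingTopK:
    """Exact top-k over a count-based sliding window (illustrative sketch).

    Only objects that could still appear in some future top-k answer are
    kept. An object is dropped once k later-arriving objects have a
    strictly higher score, since all of them expire after it does.
    """

    def __init__(self, window_size, k):
        self.window_size = window_size
        self.k = k
        self.clock = 0  # arrival time of the next object
        # Each candidate is [arrival_time, score, dominated_count],
        # stored in arrival order.
        self.candidates = deque()

    def insert(self, score):
        now = self.clock
        self.clock += 1
        # Expire candidates that have fallen out of the window.
        while self.candidates and self.candidates[0][0] <= now - self.window_size:
            self.candidates.popleft()
        # The new object dominates every older candidate with a lower score.
        kept = deque()
        for cand in self.candidates:
            if cand[1] < score:
                cand[2] += 1
            if cand[2] < self.k:
                kept.append(cand)
        kept.append([now, score, 0])
        self.candidates = kept

    def top_k(self):
        """Return the k highest scores in the current window, descending."""
        return heapq.nlargest(self.k, (cand[1] for cand in self.candidates))
```

Calling `insert(score)` for each arriving object and then `top_k()` returns the same scores as sorting the whole window, while typically storing fewer objects. The per-insert scan over all candidates is the kind of maintenance cost the abstract describes as expensive.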
Funding: This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61272181, 61672145, 61572121 and U1401256.
Abstract: Massive XML data are increasingly generated for the representation, storage and exchange of web information, and twig query processing over such data has become a research focus. However, most traditional algorithms cannot be directly implemented in a distributed manner. Some existing distributed algorithms generate many useless intermediate results and, in most cases, execute many join operations over partial results; others require a priori knowledge of the query pattern before XML partitioning, storage and query processing, which is impractical for large-scale data or frequently arriving new queries. To improve efficiency and scalability, we propose a 3-phase distributed algorithm, DisT3, based on a node distribution mechanism that avoids unnecessary intermediate results. Furthermore, we propose a lightweight local index, ReP, together with an enhanced XML partitioning approach that works with an arbitrary partitioning strategy; based on ReP, we propose an improved 2-phase distributed algorithm, DisT2ReP, to further reduce the communication cost. After analyzing the performance guarantees, we conduct extensive experiments to verify the efficiency and scalability of the proposed algorithms in distributed twig query applications.
Funding: This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61472069, 61402089 and U1401256, the China Postdoctoral Science Foundation under Grant Nos. 2019T120216 and 2018M641705, and the Fundamental Research Funds for the Central Universities of China under Grant Nos. N2019007, N180408019 and N180101028.
Abstract: Massive amounts of data are written to blockchain systems for safekeeping. However, existing blockchain protocols still require each full node to store the entire chain. Most nodes quit because they cannot grow their storage space in step with the size of the data, and as the number of nodes decreases, the security of the blockchain is significantly reduced. We present SE-Chain, a novel scale-out blockchain model that improves storage scalability while ensuring safety and that achieves efficient retrieval. SE-Chain consists of three parts: the data layer, the processing layer and the storage layer. In the data layer, each transaction is stored in the AB-M tree (Adaptive Balanced Merkle tree), which adaptively combines the advantages of a balanced binary tree (quick retrieval) and a Merkle tree (quick verification). In the processing layer, each full node stores the part of the complete chain selected by the duplicate ratio regulation algorithm. Meanwhile, the node reliability verification method is used to increase the stability of full nodes and to reduce the risk of incomplete data recovery caused by the reduced number of duplicates in the storage layer. Experimental results on real datasets show that the query time of SE-Chain based on the AB-M tree is reduced by 17% when 16 nodes exist. Overall, SE-Chain greatly improves storage scalability and supports efficient querying of transactions.
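The "quick verification" property that the abstract attributes to the Merkle tree can be sketched as follows. This is a plain binary Merkle tree over SHA-256, not the AB-M tree, whose structure the abstract does not specify. The function names, the rule of duplicating the last hash on odd-sized levels, and the absence of leaf/internal domain separation are simplifying assumptions.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return all tree levels, from leaf hashes up to the single root hash."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [_h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: pad odd-sized levels by repeating the last hash.
            level = level + [level[-1]]
            levels[-1] = level
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def make_proof(levels, index):
    """Collect the sibling hashes on the path from leaf `index` to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        # Record whether the sibling sits to the left of the current node.
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(leaf, proof, root):
    """Recompute the root from one leaf and its proof; cost is O(log n) hashes."""
    acc = _h(leaf)
    for sibling_hash, sibling_is_left in proof:
        acc = _h(sibling_hash + acc) if sibling_is_left else _h(acc + sibling_hash)
    return acc == root
```

A node that holds only the root hash can check a transaction with `verify(tx, make_proof(levels, i), levels[-1][0])` without storing the other transactions, which is the property that makes partial storage of a chain verifiable.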
Funding: Yu-Rong Cheng and Guo-Ren Wang are supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61332006, 61332014, 61328202 and U1401256. Ye Yuan is supported by the NSFC under Grant No. 61572119 and the Fundamental Research Funds for the Central Universities of China under Grant Nos. N150402005 and N130504006. Lei Chen is supported by the NSFC under Grant No. 61328202.
Abstract: As the WWW provides more and more knowledge, querying and mining knowledge bases have attracted much research attention. Among all queries over knowledge bases, which are usually modelled as graphs, the keyword query is the most widely used. Although keyword query over graphs has been studied in depth for years, knowledge bases are special error-tolerant graphs, and the results of traditionally defined keyword queries on them fail to satisfy users. Thus, in this paper, we define a new keyword query specific to knowledge bases, called the confident r-clique, based on the r-clique definition for keyword queries on general graphs, which has been proved to be the best one. However, as we prove in the paper, finding the confident r-cliques is #P-hard. We therefore propose a filtering-and-verification framework to improve search efficiency. In the filtering phase, we develop the tightest upper bound of the confident r-clique and design an index, together with its search algorithm, that suits the large scale of knowledge bases well. In the verification phase, we develop an efficient sampling method to verify the final answers among the candidates remaining after the filtering phase. Extensive experiments demonstrate that the results derived from our new definition satisfy users' requirements better than those of the traditional r-clique definition, and that our algorithms are efficient.
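To give a sense of why sampling is a natural fit for a #P-hard verification step, the sketch below estimates, by Monte Carlo sampling of possible worlds, the probability that two nodes of an uncertain graph lie within r hops of each other. This is not the paper's method or its definition of a confident r-clique; the independent edge-probability model, the function names and the fixed sample count are all assumptions made for illustration.

```python
import random
from collections import deque


def within_r_hops(adj, src, dst, r):
    """Breadth-first search that stops expanding after r hops."""
    if src == dst:
        return True
    seen = {src}
    frontier = deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == r:
            continue
        for nxt in adj.get(node, ()):
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return False


def estimate_reach_probability(edges, src, dst, r, samples=2000, seed=0):
    """Estimate P(dst is within r hops of src) by sampling possible worlds.

    `edges` maps an undirected edge (u, v) to its existence probability.
    Assumption: edges exist independently of one another.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # Draw one possible world by keeping each edge with its probability.
        adj = {}
        for (u, v), p in edges.items():
            if rng.random() < p:
                adj.setdefault(u, []).append(v)
                adj.setdefault(v, []).append(u)
        hits += within_r_hops(adj, src, dst, r)
    return hits / samples
```

Each sample costs one graph traversal, and the estimate's standard error shrinks roughly with the square root of the number of samples, so accuracy can be traded against running time instead of enumerating every possible world.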