Join-aggregate is an important and widely used operation in database system. However, it is time-consuming to process join-aggregate query in big data environment, especially on MapReduce framework. The main bottlenec...Join-aggregate is an important and widely used operation in database system. However, it is time-consuming to process join-aggregate query in big data environment, especially on MapReduce framework. The main bottlenecks contain two aspects: lots of I/O caused by temporary data and heavy communication overhead between different data nodes during query processing. To overcome such disadvantages, we design a data structure called Reference Primary Key table (RPK-table) which stores the relationship of primary key and foreign key between tables. Based on this structure, we propose an improved algorithm on MapReduce framework for join-aggregate query. Experi-ments on TPC-H dataset demonstrate that our algorithm outperforms existing methods in terms of communication cost and query response time.展开更多
文摘Join-aggregate is an important and widely used operation in database system. However, it is time-consuming to process join-aggregate query in big data environment, especially on MapReduce framework. The main bottlenecks contain two aspects: lots of I/O caused by temporary data and heavy communication overhead between different data nodes during query processing. To overcome such disadvantages, we design a data structure called Reference Primary Key table (RPK-table) which stores the relationship of primary key and foreign key between tables. Based on this structure, we propose an improved algorithm on MapReduce framework for join-aggregate query. Experi-ments on TPC-H dataset demonstrate that our algorithm outperforms existing methods in terms of communication cost and query response time.