Abstract: The rapidly increasing scale of data warehouses is challenging today's data analytical technologies. A conventional data analytical platform processes data warehouse queries using a star schema: it normalizes the data into a fact table and a number of dimension tables, and during query processing it selectively joins the tables according to users' demands. This model is space-economical. However, it faces two problems when applied to big data. First, join is an expensive operation, which prevents a parallel database or a MapReduce-based system from achieving efficiency and scalability simultaneously. Second, join operations have to be executed repeatedly, even though numerous join results could be reused by different queries. In this paper, we propose a new query processing framework for data warehouses. It pushes the join operations partially to the pre-processing phase and partially to the post-processing phase, so that data warehouse queries can be transformed into massively parallel filter-aggregation operations on the fact table. In contrast to conventional query processing models, our approach is efficient, scalable, and stable despite the large number of tables involved in the join. It is especially suitable for a large-scale parallel data warehouse. Our empirical evaluation on Hadoop shows that our framework exhibits linear scalability and outperforms some existing approaches by an order of magnitude.
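The idea of turning a star-schema query into a filter-aggregation pass over the fact table can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the toy tables `dim_store` and `fact_sales`, the helper names, and the choice to resolve the dimension predicate into a key set before the scan and to attach dimension labels after it.

```python
from collections import defaultdict

# Hypothetical toy star schema: one dimension table and one fact table.
dim_store = {1: {"region": "east"}, 2: {"region": "west"}, 3: {"region": "east"}}
fact_sales = [
    {"store_id": 1, "amount": 10.0},
    {"store_id": 2, "amount": 5.0},
    {"store_id": 3, "amount": 7.5},
    {"store_id": 1, "amount": 2.5},
]

def matching_keys(dim, predicate):
    """Pre-processing: resolve a dimension predicate into a set of keys."""
    return {key for key, row in dim.items() if predicate(row)}

def filter_aggregate(fact, key_col, keys, measure):
    """Main phase: a single scan over the fact table, with no join."""
    totals = defaultdict(float)
    for row in fact:
        if row[key_col] in keys:
            totals[row[key_col]] += row[measure]
    return dict(totals)

def attach_labels(totals, dim, label):
    """Post-processing: join the small aggregated result to the dimension."""
    out = defaultdict(float)
    for key, total in totals.items():
        out[dim[key][label]] += total
    return dict(out)

east = matching_keys(dim_store, lambda r: r["region"] == "east")
per_store = filter_aggregate(fact_sales, "store_id", east, "amount")
per_region = attach_labels(per_store, dim_store, "region")
```

In this sketch the scan in `filter_aggregate` touches only the fact table, so it could be split across workers by partitioning `fact_sales`; the two join-like steps operate on the small dimension table and the small aggregated result.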
Funding: This work was sponsored by the National Key Basic Research Program of China (973 Program) (2014CB340403) and the National Natural Science Foundation of China (Grant Nos. 61170013, 61272138, and 61232007).
Abstract: MapReduce is a popular framework for large-scale data analysis. As data access is critical for MapReduce's performance, some recent work has applied different storage models, such as column-store or PAX-store, to MapReduce platforms. However, the data access patterns of different queries vary widely, and no single storage model achieves optimal performance on its own. In this paper, we study how MapReduce can benefit from the presence of two different column-store models: pure column-store and PAX-store. We propose a hybrid storage system called hybrid column-store (HC-store). Based on the characteristics of the incoming MapReduce tasks, our storage model can determine whether to access the underlying pure column-store or PAX-store. We studied the properties of the different storage models and created a cost model to decide the data access strategy at runtime. We have implemented HC-store on top of Hadoop. Our experimental results show that HC-store is able to outperform PAX-store and column-store, especially when confronted with diverse workloads.
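A runtime choice between the two layouts can be sketched as a comparison of estimated costs. The abstract does not give the cost model, so the formulas below are assumptions chosen only to show the shape of such a decision: the pure column-store is charged for the bytes of the accessed columns plus a hypothetical per-row cost for stitching columns back into tuples, while the PAX-store is charged for reading every column of each block. The function names and the `stitch_cost_per_row` parameter are likewise hypothetical.

```python
def column_store_cost(col_bytes, accessed, n_rows, stitch_cost_per_row=1.0):
    """Assumed cost: read only the accessed columns, then pay to
    reassemble tuples from separate column files."""
    io = sum(col_bytes[c] for c in accessed)
    stitch = stitch_cost_per_row * n_rows * max(len(accessed) - 1, 0)
    return io + stitch

def pax_store_cost(col_bytes):
    """Assumed cost: every block holds all columns, so all bytes are read,
    but tuples need no cross-file reassembly."""
    return sum(col_bytes.values())

def choose_store(col_bytes, accessed, n_rows, stitch_cost_per_row=1.0):
    """Pick the layout with the lower estimated cost for this task."""
    col_cost = column_store_cost(col_bytes, accessed, n_rows, stitch_cost_per_row)
    return "column" if col_cost < pax_store_cost(col_bytes) else "pax"

# Hypothetical table: four columns of 400 bytes each, 100 rows.
sizes = {"a": 400, "b": 400, "c": 400, "d": 400}
narrow = choose_store(sizes, ["a"], n_rows=100)
wide = choose_store(sizes, ["a", "b", "c", "d"], n_rows=100)
```

Under these assumed costs, a task that reads one column favors the pure column-store, and a task that reads all four favors the PAX-store, which matches the intuition that no single layout suits every access pattern.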
Funding: This work was supported by the National Natural Science Foundation of China (Grant No. 31570391 to WJ), the National Key R&D Program of China (Grant No. 2016YFC0503200), and the National Science Foundation of the USA (Grant No. 1911585 to RLM).
Abstract: Transposable elements (TEs) are a major determinant of eukaryotic genome size. The collective properties of a genomic TE community reveal the history of TE/host evolutionary dynamics and impact present-day host structure and function, from genome to organism levels. In rare cases, TE community/genome size has greatly expanded in animals, associated with increased cell size and changes to anatomy and physiology. Here, we characterize the TE landscape of the genome and transcriptome in an amphibian with a giant genome, the caecilian Ichthyophis bannanicus, which we show has a genome size of 12.2 Gb. Amphibians are an important model system because the clade includes independent cases of genomic gigantism. The I. bannanicus genome differs compositionally from other giant amphibian genomes, but shares a low rate of ectopic recombination-mediated deletion. We examine TE activity using expression and divergence plots; TEs account for 15% of somatic transcription, and most superfamilies appear active. We quantify TE diversity in the caecilian, as well as other vertebrates with a range of genome sizes, using diversity indices commonly applied in community ecology. We synthesize previous models that integrate TE abundance, diversity, and activity, and test whether the caecilian meets model predictions for genomes with high TE abundance. We propose thorough, consistent characterization of TEs to strengthen future comparative analyses. Such analyses will ultimately be required to reveal whether the divergent TE assemblages found across convergent gigantic genomes reflect fundamental shared features of TE/host genome evolutionary dynamics.
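The abstract does not name the diversity indices it uses. As an illustration only, the sketch below computes two indices that are standard in community ecology, the Shannon index and the Gini-Simpson index, treating TE superfamilies as "species" and their counts as abundances. The superfamily names and counts are hypothetical placeholders, not data from the study.

```python
import math

def proportions(counts):
    """Convert raw abundances into proportions, skipping empty categories."""
    total = sum(counts.values())
    return [n / total for n in counts.values() if n > 0]

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i)."""
    return -sum(p * math.log(p) for p in proportions(counts))

def gini_simpson_index(counts):
    """Gini-Simpson index 1 - sum(p_i^2): the chance that two random
    draws belong to different categories."""
    return 1.0 - sum(p * p for p in proportions(counts))

# Hypothetical genomes: one with evenly represented superfamilies,
# one dominated by a single superfamily.
even = {"superfamily_A": 25, "superfamily_B": 25,
        "superfamily_C": 25, "superfamily_D": 25}
skewed = {"superfamily_A": 97, "superfamily_B": 1,
          "superfamily_C": 1, "superfamily_D": 1}

h_even, h_skewed = shannon_index(even), shannon_index(skewed)
d_even, d_skewed = gini_simpson_index(even), gini_simpson_index(skewed)
```

Both indices are higher for the even community than for the skewed one, which is why they are useful for comparing TE assemblages that have the same number of superfamilies but very different abundance distributions.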