Abstract
In recent years, the number of RDF triples on the Internet has grown rapidly, and traditional centralized (single-machine) SPARQL query processing can no longer handle data at this scale or meet practical requirements. Existing distributed SPARQL processing systems fall into two classes: Hadoop-based and database-cluster-based. The former evaluate SPARQL queries mainly through a series of MapReduce jobs, which makes them inefficient; the latter inherit the poor scalability of traditional database clusters. This paper proposes a novel SPARQL query processing system, FusionDB, built on a distributed query processing engine and HDFS. FusionDB therefore benefits from both sides: it adopts established distributed database techniques such as distributed joins, pipelining, and load balancing, and it obtains good fault tolerance and high scalability from Hadoop. To further speed up query evaluation, FusionDB also builds Trojan indexes over HDFS files. Our experiments show that FusionDB clearly outperforms traditional systems.
Source
《计算机研究与发展》
EI
CSCD
PKU Core (Peking University Core Journals)
2015, Issue S1, pp. 139-142 (4 pages)
Journal of Computer Research and Development
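Funding
Renmin University of China Pre-research Commissioned (Team Fund) Project (14XNLQ06)
Fund project of the Beijing Engineering Laboratory of Heterogeneous Big Data Analysis, Mining and Integration Technology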