Abstract
To query large-scale RDF data efficiently, this paper studies how RDF data can be stored in the distributed database HBase and designs a two-phase, MapReduce-based query strategy consisting of a SPARQL preprocessing phase and a distributed query execution phase. In the preprocessing phase, JOVR, a query partitioning algorithm based on the correlation degree of SPARQL variables, is designed and implemented: it computes the correlation degree of the variables in a SPARQL query to determine the order in which the join variables are processed and then, according to those join variables, assigns the joins between SPARQL clauses to the minimum number of MapReduce jobs. The distributed query execution phase runs the MapReduce jobs produced by the preprocessing phase, so that the RDF data are queried in parallel. The strategy is validated on the LUBM benchmark data set. The results show that JOVR queries large-scale RDF data efficiently and exhibits strong stability and scalability.
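The abstract does not give the details of JOVR, so the following is only a minimal sketch of the general idea it describes: rank join variables by how many triple patterns they connect, then pack as many non-conflicting joins as possible into each MapReduce job. The correlation measure used here (the number of patterns in which a variable occurs), the greedy packing rule, and the names `variables`, `plan_jobs` and `QUERY` are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import defaultdict

def variables(pattern):
    """Return the SPARQL variables (terms starting with '?') of a triple pattern."""
    return {term for term in pattern if term.startswith("?")}

def plan_jobs(patterns):
    """Greedily group joins into jobs (illustrative assumption, not the paper's JOVR).

    Returns a list of jobs; each job is the list of variables joined in that job.
    """
    # Each intermediate relation is represented only by the variables it binds.
    relations = [variables(p) for p in patterns]
    jobs = []
    while len(relations) > 1:
        # Assumed correlation degree: number of relations that mention the variable.
        occurs = defaultdict(list)
        for i, rel in enumerate(relations):
            for v in rel:
                occurs[v].append(i)
        candidates = sorted(
            (v for v, idx in occurs.items() if len(idx) >= 2),
            key=lambda v: (-len(occurs[v]), v),  # highest degree first, name as tiebreak
        )
        if not candidates:
            break  # remaining relations share no variable; cross products are not handled
        used, job, merged = set(), [], []
        for v in candidates:
            # A relation may take part in only one join per job.
            free = [i for i in occurs[v] if i not in used]
            if len(free) >= 2:
                job.append(v)
                used.update(free)
                merged.append(set().union(*(relations[i] for i in free)))
        relations = merged + [r for i, r in enumerate(relations) if i not in used]
        jobs.append(job)
    return jobs

# A basic graph pattern shaped like LUBM Query 2 (prefixes abbreviated).
QUERY = [
    ("?x", "rdf:type", "ub:GraduateStudent"),
    ("?y", "rdf:type", "ub:University"),
    ("?z", "rdf:type", "ub:Department"),
    ("?x", "ub:memberOf", "?z"),
    ("?z", "ub:subOrganizationOf", "?y"),
    ("?x", "ub:undergraduateDegreeFrom", "?y"),
]

if __name__ == "__main__":
    for number, job in enumerate(plan_jobs(QUERY), start=1):
        print(f"job {number}: join on {', '.join(job)}")
```

Under these assumptions the six patterns are joined in two jobs: the first joins on `?x` and `?y` at the same time, and the second joins the intermediate results on `?z`. A real planner would also need to account for selectivity and for how the triples are laid out in HBase, which this sketch ignores.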
Source
《中南大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core (Peking University core journal list)
2017, No. 5, pp. 1218-1226 (9 pages)
Journal of Central South University: Science and Technology
Funding
Projects supported by the National Natural Science Foundation of China (61301136, 61572525, 61602525)