RDF partitioning for scalable SPARQL query processing

Abstract: The volume of RDF data has increased dramatically in recent years, and cloud computing platforms such as Hadoop are considered a good choice for processing queries over huge data sets because of their scalability. Previous work on evaluating SPARQL queries with Hadoop mainly focuses on reducing the number of joins through careful splitting of HDFS files and algorithms for generating Map/Reduce jobs. However, the way RDF data is partitioned can also affect system performance. Specifically, a good partitioning solution would greatly reduce or even totally avoid cross-node joins, and significantly cut the cost of query evaluation. Based on HadoopDB, this work processes SPARQL queries in a hybrid architecture, where Map/Reduce takes charge of the computing tasks, and RDF query engines like RDF-3X store the data and execute join operations. Based on an analysis of query workloads, this work proposes a novel algorithm for automatically partitioning RDF data and an approximate solution for physically placing the partitions in order to reduce data redundancy. It also discusses how to make a good trade-off between query evaluation efficiency and data redundancy. All of these proposed approaches have been evaluated by extensive experiments over large RDF data sets.

Authors: Xiaoyan WANG, Tao YANG, Jinchuan CHEN, Long HE, Xiaoyong DU

Affiliations: School of Information; Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education; Information Center; State Key Laboratory of Software Development Environment

Source: Frontiers of Computer Science (indexed in SCIE, EI, CSCD), 2015, Issue 6, pp. 919-933, 15 pages in total. Chinese title: 中国计算机科学前沿（英文版）

Keywords: RDF data, data partitioning, SPARQL query

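Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]; TU831.36 [Architectural Science: Heating, Gas Supply, Ventilation and Air Conditioning]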
