Distributed Index Construction for Big Data Streams (面向大数据流的分布式索引构建) Cited by: 4
Abstract: Efficient storage and indexing of big data streams remain difficult problems in the data field. For data streams that carry a time attribute, the stream is segmented by that attribute into consecutive time windows, and a distributed master-slave index structure based on a double-layer B+tree, called WB-Index, is proposed. The lower B+tree index is built over the stream tuples within each time window, and the upper B+tree index is built over the successive time windows. The lower index is constructed by combining sort-based batch loading with parallel sorting. The core idea is to divide each time window further into slices and to isolate the parallelizable operations from the others in the window: receiving the data stream and sorting the slices proceed in parallel, and the construction of the B+tree skeleton (a B+tree without values) for the time window and the merge-sorting operation are parallelized as well. These techniques speed up B+tree construction. Because the timestamps of time windows increase monotonically and without bound, the upper index is built with a split-less method that avoids node splitting and the associated memory movement, which improves space utilization and update efficiency. In WB-Index, stream tuples and indexes are stored separately, and as much of the double-layer B+tree index and hotspot data as possible is cached in memory to improve query performance. Theoretical analysis and experimental results both show that WB-Index supports efficient real-time stream writing and stream data querying, and that it is well suited to data stream scenarios with a time attribute.
Authors: YANG Liang-Huai (杨良怀); LU Chen-Xi (卢晨曦); FAN Yu-Lei (范玉雷); ZHU Zhen-Yang (朱镇洋); PAN Jian (潘建). Affiliations: School of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China; Zhijiang College, Zhejiang University of Technology, Shaoxing 312030, China.
Source: Journal of Software (《软件学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2021, No. 11, pp. 3576-3595 (20 pages).
Funding: National Key Research and Development Program of China (2020YFB1707700).
Keywords: big data; data stream; distributed index; B+tree
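
The two construction ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it makes several simplifying assumptions: keys are plain integers, the inner nodes of each tree are flattened to a single level of separators, and the parallel stream receiving, the B+tree skeleton, and the distributed master-slave layout are omitted. The names `FANOUT`, `build_lower_index`, `lower_lookup`, and `UpperIndex` are hypothetical and do not come from the paper.

```python
import heapq
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

FANOUT = 4  # hypothetical leaf capacity, kept small for illustration


def build_lower_index(slices):
    """Sort-based bulk load for one time window (assumed integer keys)."""
    # Sort every slice of the window in parallel.
    with ThreadPoolExecutor() as pool:
        sorted_slices = list(pool.map(sorted, slices))
    # Merge the sorted slices into one sorted run.
    merged = list(heapq.merge(*sorted_slices))
    # Pack the run bottom-up into leaves that are full except possibly the last.
    leaves = [merged[i:i + FANOUT] for i in range(0, len(merged), FANOUT)]
    # The first key of each leaf stands in for the inner-node separators.
    separators = [leaf[0] for leaf in leaves]
    return separators, leaves


def lower_lookup(index, key):
    """Return True if key is present in a lower index built above."""
    separators, leaves = index
    pos = bisect_right(separators, key) - 1
    return pos >= 0 and key in leaves[pos]


class UpperIndex:
    """Append-only index over window start timestamps; leaves are never split."""

    def __init__(self):
        self.leaves = [[]]  # each leaf holds (window_start, lower_index) pairs

    def append_window(self, window_start, lower_index):
        last = self.leaves[-1]
        if last and window_start <= last[-1][0]:
            raise ValueError("window timestamps must be strictly increasing")
        if len(last) == FANOUT:
            # Rightmost leaf is full: open a fresh leaf instead of splitting,
            # so no entries are moved and earlier leaves stay completely full.
            last = []
            self.leaves.append(last)
        last.append((window_start, lower_index))

    def find_window(self, ts):
        """Return the lower index of the latest window starting at or before ts."""
        starts = [leaf[0][0] for leaf in self.leaves if leaf]
        i = bisect_right(starts, ts) - 1
        if i < 0:
            return None
        leaf = self.leaves[i]
        j = bisect_right([start for start, _ in leaf], ts) - 1
        return leaf[j][1]


if __name__ == "__main__":
    upper = UpperIndex()
    for w in range(6):
        start = w * 10
        # Three unsorted slices per window, standing in for received stream data.
        slices = [[start + 7, start + 2], [start + 9, start + 4], [start + 1]]
        upper.append_window(start, build_lower_index(slices))
    print(len(upper.leaves), lower_lookup(upper.find_window(23), 22))
```

Because new window timestamps are always larger than every existing key, each insertion lands in the rightmost leaf. A conventional B+tree would split that leaf in half when it overflows and leave both halves partly empty, whereas the sketch simply starts a new leaf, which is the intuition behind the space-utilization claim in the abstract.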