Abstract
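As the Internet develops, users face the challenges of large-scale network traffic data and strict processing-time requirements, which calls for solving key problems in data collection, real-time processing, storage organization, and query retrieval. To this end, this paper proposes a distributed data aggregation and query platform. A hierarchical architecture in semi-synchronous and semi-asynchronous mode supports the collection of ultra-large-scale traffic data. Message caching with multi-partition queues, parallel distributed stream processing, and attribute-partitioned data loading are combined and optimized to achieve efficient real-time processing. Virtual-partition data storage based on an abstract data access driver manages heterogeneous data uniformly and scales well. An asynchronously built hierarchical index architecture enables fast retrieval of data packets. Together, these provide users with an integrated system offering low latency, high throughput, and fast queries. Experiments show that the platform has good performance and scalability, that its main stages improve by varying degrees of several times or more, and that it has been applied in a real system.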
With the continuous development and explosive growth of the Internet, users face the challenges of massive network flow data and strict real-time processing requirements. Key problems in data collection, real-time processing, storage organization, and query retrieval for massive network flows must therefore be addressed. This paper proposes a distributed real-time data aggregation and query platform. It collects large-scale network flow data through a hierarchical architecture in semi-synchronous and semi-asynchronous mode. It achieves efficient real-time processing by combining message caching in multi-partition queues, parallel distributed stream processing, and data loading based on attribute partitioning. Scalability is provided by virtual-partition data storage based on an abstract data access driver. The platform also achieves rapid retrieval of massive data through the asynchronous construction of a hierarchical index, and ultimately provides users with an integrated system offering low latency, high throughput, and fast queries. Experiments show that the platform achieves good performance and scalability, with significant performance improvements. The proposed platform has been applied in several practical systems.
Authors
郭庆
朱一凡
谢莹莹
张榆
陈小兵
GUO Qing; ZHU Yi-fan; XIE Ying-ying; ZHANG Yu; CHEN Xiao-bing (School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China; Bigdata Department, Dawning Information Industry Co., Ltd., Beijing 100193, China)
Source
《小型微型计算机系统》
CSCD
PKU Core Journal (北大核心)
2020, No. 6, pp. 1314-1320 (7 pages in total)
Journal of Chinese Computer Systems
Funding
Supported by the National Key Research and Development Program of China (2016YFC0802602).