Abstract
HBase provides no native secondary indexes, its built-in Coprocessor mechanism is unstable, and retrieval over massive data sets is slow. To address these problems, this paper designs ELHBase (Elasticsearch Indexing HBase), a new HBase secondary indexing scheme based on Elasticsearch. The scheme uses Flume, Kafka, HBase and Elasticsearch to build a big data processing framework for data collection, high-speed parsing and ingestion. A custom Flume Sink collects the data, generates a corresponding ID for each record and writes them to Kafka; the messages are then parsed, the data is stored in HBase, and the corresponding ID is stored in Elasticsearch as the index. Without relying on Coprocessors, the scheme adds an interface that queries Elasticsearch directly, using its efficient, flexible and varied search capabilities to retrieve massive HBase data quickly. It thereby jointly addresses the low indexing performance of HBase, the instability of Coprocessors, and the unsuitability of Elasticsearch for storing large volumes of data. Finally, secondary index performance is compared experimentally against SIHBase and hindex; the results show that the scheme writes faster and more stably than SIHBase and queries far faster than hindex.
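The data flow described in the abstract (ID generation at the sink, buffering in Kafka, data to HBase, IDs to Elasticsearch, then a two-step query) can be illustrated with a minimal sketch. This is not the paper's implementation: the deque, dict and inverted-index structures below are in-memory stand-ins for Kafka, HBase and Elasticsearch, and every function and field name is a hypothetical chosen for illustration.

```python
import uuid
from collections import defaultdict, deque

# In-memory stand-ins (assumptions for illustration only, not real clients).
kafka_topic = deque()        # stands in for a Kafka topic
hbase_table = {}             # row key -> full record (stands in for HBase)
es_index = defaultdict(set)  # (field, value) -> row keys (stands in for Elasticsearch)


def sink_collect(record):
    """Mimic the custom Flume Sink: attach a generated ID and enqueue the message."""
    row_id = uuid.uuid4().hex
    kafka_topic.append({"id": row_id, "data": record})
    return row_id


def consume_and_store(indexed_fields):
    """Drain the queue: full record goes to 'HBase', its ID goes to the index."""
    while kafka_topic:
        msg = kafka_topic.popleft()
        hbase_table[msg["id"]] = msg["data"]
        for field in indexed_fields:
            if field in msg["data"]:
                es_index[(field, msg["data"][field])].add(msg["id"])


def query(field, value):
    """Two-step read: find row keys in the index, then fetch the rows by key."""
    row_ids = es_index.get((field, value), set())
    return [hbase_table[r] for r in sorted(row_ids)]
```

For example, after calling `sink_collect` on a few records and then `consume_and_store(["city"])`, a call such as `query("city", "Shanghai")` resolves row keys through the index first and never scans the table, which is the point of keeping the index outside HBase and avoiding Coprocessors.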
Author
GUO Xue-feng (郭雪峰), Network Security Technology R&D Center, The Third Research Institute of Ministry of Public Security, Shanghai 201204, China
Source
Computer Knowledge and Technology (《电脑知识与技术》)
2020, No. 1, pp. 5-7 (3 pages)