Abstract
To improve the efficiency of data query and analysis in big data environments, this paper proposes an optimized inverted index method, the RAM FS directory (RFDirectory), a combined memory-disk index built on in-memory computing and batch updating. A posting-list management technique that combines random access memory (RAM) and disk is implemented on top of Lucene. Newly added data are first written to a cache and then periodically written to the on-disk index structure, which improves the write performance of the inverted index. Queries are answered by integrating the multi-block inverted structures held on disk and in RAM, giving users efficient query and analysis results. Experimental results show that, in a big data environment, the index construction time of RFDirectory is reduced to 50% of that of the disk index (FSDirectory) and the memory index (RAMDirectory), and the time to return the search results for a single keyword is reduced by nearly 15%.
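The write path and query path described in the abstract can be illustrated with a toy model. The sketch below is not the paper's implementation and does not use the Lucene API: the class name `BufferedInvertedIndex`, the `flush_threshold` parameter, and the count-based flush trigger (standing in for the periodic flush) are all assumptions made for illustration, and the "disk" segments are plain in-memory dictionaries.

```python
from collections import defaultdict


class BufferedInvertedIndex:
    """Toy inverted index: a mutable RAM buffer plus immutable flushed segments.

    Hypothetical illustration only; the flushed segments stand in for
    on-disk index blocks and are kept in memory here.
    """

    def __init__(self, flush_threshold=1000):
        # Assumed trigger: flush after this many buffered documents.
        self.flush_threshold = flush_threshold
        self.buffer = defaultdict(set)  # term -> set of doc ids (RAM part)
        self.buffered_docs = 0
        self.segments = []              # list of {term: frozenset(doc ids)}

    def add(self, doc_id, text):
        """Index a document into the RAM buffer; flush when the batch is full."""
        for term in text.lower().split():
            self.buffer[term].add(doc_id)
        self.buffered_docs += 1
        if self.buffered_docs >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Move the buffered postings into a new immutable segment in one batch."""
        if self.buffer:
            segment = {t: frozenset(ids) for t, ids in self.buffer.items()}
            self.segments.append(segment)
        self.buffer = defaultdict(set)
        self.buffered_docs = 0

    def search(self, term):
        """Merge postings for one term from the RAM buffer and every segment."""
        term = term.lower()
        hits = set(self.buffer.get(term, ()))
        for segment in self.segments:
            hits |= segment.get(term, frozenset())
        return sorted(hits)
```

For example, with `flush_threshold=2`, adding three documents leaves two of them in a flushed segment and one in the buffer, yet `search` still returns matches from both, which is the property the combined memory-disk index relies on.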
Source
Journal of Nanjing University of Science and Technology (《南京理工大学学报》)
Indexed in: EI, CAS, CSCD, Peking University Core Journals (北大核心)
2015, No. 3, pp. 260-265 (6 pages)
Keywords
big data
Lucene
in-memory computing
batch update
inverted index
posting list
cache
RAM index
disk index
multi-block inverted structure