Abstract
To address the inefficiency of single-machine retrieval on massive data sets, this paper builds a distributed computing model for fast retrieval over massive compound collections and proposes a segmented hash algorithm based on a divide-and-conquer strategy. For continuous numeric attributes that are poorly suited to hash retrieval, such as molecular weight and the lipid-water partition coefficient (logP), a continuous-attribute discretization model is designed to discretize them. Experimental results show that, when searching large compound files, the model retrieves range information quickly and effectively, avoids repeated retrieval over the massive data, substantially reduces the memory and time required for compound retrieval, and is stably scalable and efficient.
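The idea of discretizing a continuous attribute so that a hash lookup can answer range queries can be illustrated with a minimal single-machine sketch. It is not the paper's algorithm: the abstract does not specify the binning scheme, so equal-width bins are assumed here, and `BIN_WIDTH`, `bin_index`, `build_index`, and `range_query` are hypothetical names introduced only for illustration.

```python
import math
from collections import defaultdict

BIN_WIDTH = 50.0  # assumed equal-width bin size; the paper's scheme may differ


def bin_index(value, width=BIN_WIDTH):
    """Map a continuous value to a discrete bin number usable as a hash key."""
    return math.floor(value / width)


def build_index(compounds, width=BIN_WIDTH):
    """Group (name, molecular_weight) pairs into hash buckets keyed by bin."""
    index = defaultdict(list)
    for name, mw in compounds:
        index[bin_index(mw, width)].append((name, mw))
    return index


def range_query(index, low, high, width=BIN_WIDTH):
    """Return names with low <= molecular weight <= high.

    Only the buckets whose bins overlap [low, high] are visited; the exact
    bounds are then checked inside those buckets.
    """
    hits = []
    for b in range(bin_index(low, width), bin_index(high, width) + 1):
        for name, mw in index.get(b, []):
            if low <= mw <= high:
                hits.append(name)
    return sorted(hits)


compounds = [
    ("ethanol", 46.07),
    ("benzene", 78.11),
    ("aspirin", 180.16),
    ("caffeine", 194.19),
    ("glucose", 180.16),
]
index = build_index(compounds)
result = range_query(index, 150.0, 200.0)
print(result)
```

In a distributed setting, each bucket (or group of buckets) could be assigned to a different worker, so a range query touches only the workers that hold overlapping bins; that partitioning is likewise an assumption of this sketch.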
Source
《计算机与应用化学》
CAS
2015, No. 7, pp. 885-888 (4 pages)
Computers and Applied Chemistry
Keywords
parallel computing
cheminformatics
massive data
continuous attribute discretization
hash