Funding: supported by the China National Science Foundation (Grant 61223003) and the ZTE Industry-Academia-Research Cooperation Funds.
Abstract: Relational database management systems are usually deployed on single-node machines and have strict limitations in terms of data structure. This means they do not work well with big data, and NoSQL has been proposed as a solution. To make data querying more efficient, indexes and memory cache techniques are used in NoSQL databases. In this paper, we propose a hierarchical indexing mechanism and a prototype distributed data-storage system, called HMIBase, which has hierarchical indexes for non-primary keys in tables and makes data querying more efficient. HMIBase uses HBase as the underlying data store and creates a memory cache for more efficient data transmission. HMIBase supports coprocessors for processing update requests. It also provides a client with query and update APIs, and a server that handles RPCs from the client and executes jobs. To improve the cache hit ratio, we propose a memory cache replacement strategy for HMIBase, called the Hot Score algorithm. The experimental results show that the Hot Score algorithm outperforms other cache-replacement strategies.
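The abstract does not say how the Hot Score algorithm scores entries, so the following is only a minimal sketch of the general idea of score-based cache replacement. The class name `ScoredCache`, the `half_life` parameter, and the scoring rule (hit count with exponential decay over logical time) are all assumptions made for illustration, not the formula used in HMIBase.

```python
import math


class ScoredCache:
    """Fixed-capacity cache that evicts the entry with the lowest score.

    Hypothetical scoring rule: each access adds 1 to an entry's score,
    and the score decays exponentially with logical time since the last
    access. This is an illustrative assumption, NOT the Hot Score formula.
    """

    def __init__(self, capacity, half_life=10.0):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.decay = math.log(2) / half_life  # decay rate per logical tick
        self.clock = 0                        # advanced on every get/put
        self.values = {}                      # key -> cached value
        self.stats = {}                       # key -> (score, last access tick)

    def _current_score(self, key):
        # Score decayed from the last access tick to the current tick.
        score, last = self.stats[key]
        return score * math.exp(-self.decay * (self.clock - last))

    def _touch(self, key):
        # Record one more hit on top of the decayed score.
        self.stats[key] = (self._current_score(key) + 1.0, self.clock)

    def get(self, key):
        self.clock += 1
        if key not in self.values:
            return None  # cache miss
        self._touch(key)
        return self.values[key]

    def put(self, key, value):
        self.clock += 1
        if key not in self.values and len(self.values) >= self.capacity:
            # Evict the entry whose decayed score is currently lowest.
            victim = min(self.values, key=self._current_score)
            del self.values[victim]
            del self.stats[victim]
        if key not in self.stats:
            self.stats[key] = (0.0, self.clock)
        self.values[key] = value
        self._touch(key)
```

With a capacity of two, an entry that is read repeatedly survives the insertion of a third key, while the entry that was written once and never read is the one evicted. Unlike plain LRU, a single recent access does not automatically protect an entry that is otherwise cold.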
Funding: partially supported by the National Natural Science Foundation of China (No. 61572250) and the Science and Technology Program of Jiangsu Province (No. BE2017155).
Abstract: Recently, topic models such as Latent Dirichlet Allocation (LDA) have been widely used in large-scale web mining. Many large-scale LDA training systems have been developed, which usually prefer a customized design from top to bottom with sophisticated synchronization support. We propose an LDA training system named ZenLDA, which follows a generalized design for the distributed data-parallel platform. The novelty of ZenLDA consists of three main aspects: (1) it converts the commonly used serial Collapsed Gibbs Sampling (CGS) inference algorithm to a Monte-Carlo Collapsed Bayesian (MCCB) estimation method, which is embarrassingly parallel; (2) it decomposes the LDA inference formula into parts that can be sampled more efficiently to reduce computation complexity; (3) it proposes a distributed LDA training framework, which represents the corpus as a directed graph with the parameters annotated as corresponding vertices, and implements ZenLDA and other well-known inference methods based on Spark. Experimental results indicate that MCCB converges with accuracy similar to that of CGS, while running much faster. On top of MCCB, the ZenLDA formula decomposition was the fastest of the well-known inference methods compared. ZenLDA also showed good scalability when dealing with large-scale topic models on the data-parallel platform. Overall, ZenLDA achieves computing performance comparable to, or even better than, that of state-of-the-art dedicated systems.
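For context on aspects (1) and (2), the standard CGS update for LDA with symmetric priors can be written as below. The notation is an assumption introduced here, not taken from the abstract: $n_{d,k}$ is the number of tokens in document $d$ assigned to topic $k$, $n_{k,w}$ the number of times word $w$ is assigned to topic $k$, $n_k$ the total number of tokens in topic $k$, $V$ the vocabulary size, and the superscript $\neg i$ excludes the current token. The second line shows one well-known way to split the formula into separately sampled parts (the SparseLDA-style decomposition); the abstract does not give ZenLDA's own decomposition, which may differ.

```latex
\begin{align}
p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w})
  &\propto \bigl(n_{d,k}^{\neg i} + \alpha\bigr)\,
           \frac{n_{k,w}^{\neg i} + \beta}{n_{k}^{\neg i} + V\beta} \\
  &= \frac{\alpha\beta}{n_{k}^{\neg i} + V\beta}
   + \frac{n_{d,k}^{\neg i}\,\beta}{n_{k}^{\neg i} + V\beta}
   + \frac{\bigl(\alpha + n_{d,k}^{\neg i}\bigr)\, n_{k,w}^{\neg i}}{n_{k}^{\neg i} + V\beta}
\end{align}
```

The first term is independent of the document and the word, the second is nonzero only for topics present in the document, and the third only for topics in which the word occurs, so the last two can be sampled over sparse sets of topics.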
Funding: supported by the National Basic Research Program of China (2014CB910202) and the National Natural Science Foundation of China (11672317, 31771015).
Abstract: Dear Editor, Transmembrane proteins with β-barrel topology are mainly found in the outer membranes (OMs) of Gram-negative bacteria, mitochondria and chloroplasts (Wimley, 2003). These proteins usually contain even numbers of β-strands, ranging from 8 to 36. To achieve an overall cylindrical topology, the polypeptide chain of a β-barrel outer membrane protein (OMP) must fold to form a series of anti-parallel β-strands, with each β-strand hydrogen-bonding to its neighboring strands (Otzen and Andersen, 2013). The folding and insertion of a β-barrel OMP in vivo requires an evolutionarily conserved multiprotein complex termed the β-barrel assembly machinery (BAM) complex (Noinaj et al., 2015).