Funding: Project supported by the National Key Research and Development Program of China (No. 2021YFB3101100).
Abstract: In distributed storage systems, replication and erasure coding (EC) are common methods for data redundancy. Compared with replication, EC has better storage efficiency but incurs higher update overhead. Moreover, consistency and reliability problems caused by concurrent updates bring new challenges to applications of EC. Many works focus on optimizing EC itself, including algorithm optimization and novel data update methods, but lack solutions for the consistency and reliability problems. In this paper, we introduce a storage system that decouples data updating from EC encoding, namely decoupled data updating and coding (DDUC), and propose a data placement policy that combines replication and parity blocks. For an (N, M) EC system, the data are placed as N groups of M+1 replicas, and redundant data blocks of the same stripe are placed on the parity nodes, so that the parity nodes can autonomously perform local EC encoding. Based on this policy, a two-phase data update method is implemented in which data are updated in replica mode in phase 1, and EC encoding is done independently by the parity nodes in phase 2. This solves the problem of data reliability degradation caused by concurrent updates while ensuring high concurrency performance. The system also uses two hardware features of persistent memory (PMem), byte addressability and eight-byte atomic writes, to implement a lightweight logging mechanism that improves performance while ensuring data consistency. Experimental results show that the concurrent access performance of the proposed storage system is 1.70–3.73 times that of the state-of-the-art storage system Ceph, and its latency is only 3.4%–5.9% that of Ceph.
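The placement policy and two-phase update described in this abstract can be illustrated with a small sketch. This is one possible reading, not the authors' implementation: it assumes that each of the N data blocks of a stripe is stored on its own data node and copied to all M parity nodes (giving M+1 replicas), and it uses byte-wise XOR as a stand-in for a real erasure code. The names `place_stripe`, `xor_parity`, and `ParityNode` are hypothetical.

```python
from functools import reduce
from operator import xor


def place_stripe(n, m):
    """Map each data block index to the nodes holding a replica (assumed layout)."""
    parity_nodes = [f"P{j}" for j in range(m)]
    # Block i lives on data node D{i} and on every parity node: M+1 replicas.
    return {i: [f"D{i}"] + parity_nodes for i in range(n)}


def xor_parity(blocks):
    """Byte-wise XOR of equal-length blocks (illustrative single-parity code)."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


class ParityNode:
    """Holds replicas of all N data blocks of one stripe, so it can encode locally."""

    def __init__(self, n):
        self.replicas = [None] * n  # phase-1 copies of the stripe's data blocks
        self.parity = None          # phase-2 result

    def phase1_write(self, index, block):
        # Phase 1: a plain replica update, with no encoding on the write path.
        self.replicas[index] = block

    def phase2_encode(self):
        # Phase 2: encode from local replicas only, without contacting data nodes.
        self.parity = xor_parity(self.replicas)
        return self.parity


layout = place_stripe(3, 2)
node = ParityNode(3)
for i, block in enumerate([b"\x01\x02", b"\x04\x08", b"\x10\x20"]):
    node.phase1_write(i, block)
parity = node.phase2_encode()
```

Because the parity node already holds every data block of the stripe, phase 2 needs no network traffic in this sketch. A real (N, M) code with M > 1 would replace XOR with an encoder such as Reed-Solomon, with each parity node computing its own parity block.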
Funding: This work was supported by the Youth Program of the National Science Foundation of China (61702189).
Abstract: Many key-value stores use RDMA to optimize messaging and data transmission between the application layer and the storage layer, but most of them provide only point-wise operations. A skiplist-based store can support both point operations and range queries, but its CPU-intensive access operations, combined with a high-speed network, easily drive the storage layer into a CPU bottleneck. The common solution to this problem is to offload some operations to the application layer and use RDMA to bypass the CPU and perform remote access directly, but this method has only been used in hash-table-based stores. In this paper, we present RS-store, a skiplist-based key-value store with RDMA, which can overcome the CPU bottleneck of the storage layer by enabling two access modes: local access and remote access. In RS-store, we design a novel data structure, R-skiplist, to reduce the communication cost of remote access, and implement a latch-free concurrency control mechanism to coordinate concurrent operations across the two access modes. RS-store also supports client-active range queries, which reduce the storage layer's CPU consumption. Finally, we evaluate RS-store on an RDMA-capable cluster. Experimental results show that RS-store achieves up to 2x improvements over RDMA-enabled RocksDB in throughput and application scalability.
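To show why a skiplist supports both point operations and range queries, here is a minimal sketch of a textbook skiplist. It is single-threaded and in-memory; it is not R-skiplist and models neither RDMA access nor the latch-free concurrency control of RS-store. The class and method names are hypothetical.

```python
import random


class _Node:
    __slots__ = ("key", "value", "forward")

    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        self.forward = [None] * level  # forward[i] is the next node at level i


class SkipList:
    MAX_LEVEL = 16

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._head = _Node(None, None, self.MAX_LEVEL)
        self._level = 1  # number of levels currently in use

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self._rng.random() < 0.5:
            level += 1
        return level

    def _predecessors(self, key):
        """For each level, find the last node whose key is smaller than key."""
        update = [self._head] * self.MAX_LEVEL
        node = self._head
        for i in reversed(range(self._level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def put(self, key, value):
        update = self._predecessors(key)
        nxt = update[0].forward[0]
        if nxt is not None and nxt.key == key:
            nxt.value = value  # overwrite an existing key
            return
        level = self._random_level()
        self._level = max(self._level, level)
        node = _Node(key, value, level)
        for i in range(level):
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def get(self, key):
        """Point lookup: descend the levels, then check the level-0 successor."""
        node = self._predecessors(key)[0].forward[0]
        return node.value if node is not None and node.key == key else None

    def range(self, lo, hi):
        """Range query: locate lo once, then walk the sorted level-0 list up to hi."""
        node = self._predecessors(lo)[0].forward[0]
        out = []
        while node is not None and node.key <= hi:
            out.append((node.key, node.value))
            node = node.forward[0]
        return out


store = SkipList(seed=1)
for k in [5, 1, 9, 3, 7]:
    store.put(k, str(k))
```

The bottom level is a sorted linked list, so a range query costs one search plus a sequential walk. A hash table keeps no key order, which is why it offers only point operations.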
Abstract: 1 Introduction and main contributions. Emerging hardware such as remote direct memory access (RDMA)-capable networks and persistent memory (PM) is promising for building fast, highly available in-memory key-value stores. The recent advent of Intel Optane DC Persistent Memory Modules (Optane DCPMM) brings that future closer. However, existing studies that combine the two devices cannot deliver the desired performance because of their two-phase protocols for log shipping, and most of them were designed on emulated PM and perform sub-optimally on real PM hardware.