Due to the development of technology in medicine, millions of health-related records such as scanned images are generated. Storing and handling this massive volume of data is a great challenge. Healthcare data is stored in cloud-fog storage environments. This cloud-fog based health model allows users to obtain health-related data from different sources, and duplicated information also accumulates in the background. It therefore requires additional storage, increases data acquisition time, and leads to insecure data replication in the environment. This paper proposes to eliminate duplicate data using a window-size chunking algorithm with a biased-sampling-based Bloom filter, and to secure the health data using the Advanced Signature-Based Encryption (ASE) algorithm in the Fog-Cloud Environment (WCA-BF+ASE). WCA-BF+ASE eliminates duplicate copies of data, minimizes storage space and maintenance cost, and stores the data efficiently and in a highly secured manner. For security level in the cloud storage environment, the Window Size Chunking Algorithm (WSCA) achieved 86.5%, Two Thresholds Two Divisors (TTTD) 80%, Ordinal in Python (ORD) 84.4%, and Bloom Filter (BF) 82%, whereas the proposed work achieved better secure storage at 97%. In addition, after applying the deduplication process, the proposed WCA-BF+ASE method required less storage space across various file sizes: only 10 KB for a 200 MB file, 22 KB for 400 MB, 35 KB for 600 MB, 38 KB for 800 MB, and 40 KB for 1000 MB.
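The detection step described above, fixed-window chunking followed by a Bloom filter membership test, can be illustrated with a short sketch. This is a generic illustration under assumed parameters (4 KB windows, SHA-256 fingerprints, a 1 Mbit filter); it is not the paper's WCA-BF+ASE implementation and omits the biased sampling and ASE encryption stages.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, data: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(2, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, data: bytes):
        for pos in self._positions(data):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, data: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(data))


def deduplicate(stream: bytes, window_size: int = 4096):
    """Split the stream into fixed-size windows and keep only chunks not stored before."""
    bf = BloomFilter()
    unique_chunks = {}  # fingerprint -> chunk; stands in for the real chunk store
    for offset in range(0, len(stream), window_size):
        chunk = stream[offset:offset + window_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        # The Bloom filter cheaply rules out chunks that were definitely never stored;
        # the exact index check guards against Bloom-filter false positives.
        if bf.might_contain(chunk) and fingerprint in unique_chunks:
            continue  # duplicate: a real system would store only a reference here
        bf.add(chunk)
        unique_chunks[fingerprint] = chunk
    return unique_chunks


if __name__ == "__main__":
    scan = b"A" * 10000 + b"B" * 10000 + b"A" * 10000  # repeated image-like regions
    print(len(deduplicate(scan)), "unique chunks kept")
```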
Cloud storage is essential for managing user data stored in and retrieved from distributed data centres. The storage service is offered on a pay-per-use basis, charged by the size of the collected data. Because the massive amount of data stored in the data centre contains similar information and file structures kept in multiple copies, duplication increases storage space. Existing deduplication systems do not achieve efficient data reduction because of inaccuracy in finding similar data, which adds complexity and increases storage consumption and cost. To resolve this problem, this paper proposes an efficient storage reduction method called Hash-Indexing Block-based Deduplication (HIBD) based on Segmented Bind Linkage (SBL) methods for reducing storage in a cloud environment. Initially, preprocessing is done using a sparse augmentation technique. The preprocessed files are then segmented into blocks to build a hash index. The block contents are compared with other files through Semantic Content Source Deduplication (SCSD), which identifies similar content between files. Based on the content presence count, Distance Vector Weightage Correlation (DVWC) estimates the document similarity weight, and related files are grouped into a cluster. Finally, segmented bind linkage compares the documents to find duplicate content in the cluster using the similarity weight based on the coefficient match case. This implementation helps identify data redundancy efficiently and reduces the service cost of distributed cloud storage.
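A minimal sketch of the block-level idea follows: segment each file into fixed-size blocks, hash-index the blocks, and group files whose block overlap crosses a threshold. The block size, the Jaccard-style similarity weight, and the 0.5 threshold are assumptions for illustration, not the paper's SCSD, DVWC, or SBL definitions.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 1024  # illustrative block size


def block_hashes(content: bytes):
    """Segment a file into fixed-size blocks and fingerprint each block."""
    return {
        hashlib.sha256(content[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(content), BLOCK_SIZE)
    }


def build_hash_index(files: dict):
    """Map each block fingerprint to the set of files that contain it."""
    index = defaultdict(set)
    for name, content in files.items():
        for fp in block_hashes(content):
            index[fp].add(name)
    return index


def similarity_weight(a: bytes, b: bytes) -> float:
    """Shared-block ratio between two files (a simple stand-in for a similarity weight)."""
    ha, hb = block_hashes(a), block_hashes(b)
    return len(ha & hb) / max(len(ha | hb), 1)


def cluster_similar(files: dict, threshold: float = 0.5):
    """Greedily group files whose block overlap exceeds the threshold."""
    clusters = []
    for name in files:
        for cluster in clusters:
            if similarity_weight(files[name], files[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


if __name__ == "__main__":
    files = {
        "a.txt": b"x" * 4096,
        "b.txt": b"x" * 4096 + b"y" * 512,
        "c.txt": b"z" * 4096,
    }
    index = build_hash_index(files)
    shared = sum(1 for owners in index.values() if len(owners) > 1)
    print(shared, "block fingerprint(s) appear in more than one file")
    print(cluster_similar(files))  # a.txt and b.txt end up in the same cluster
```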
Virtualization is the backbone of cloud computing, which is a developing and widely used paradigm. By finding and merging identical memory pages, memory deduplication improves memory efficiency in virtualized systems. Kernel Same-page Merging (KSM) is a Linux service for sharing memory pages in virtualized environments. Memory deduplication is vulnerable to a memory disclosure attack, which uses covert channel establishment to reveal the contents of other co-located virtual machines. To avoid a memory disclosure attack, sharing of identical pages within a single user's virtual machine is permitted, but sharing of contents between different users is forbidden. In our proposed approach, virtual machines with similar operating systems among the active domains on a node are recognised and organised into a homogeneous batch, and memory deduplication is performed inside that batch to improve page-sharing efficiency. Compared to memory deduplication applied to the entire host, the implementation results demonstrate a significant increase in the number of pages shared when memory deduplication is applied batch-wise, although CPU (central processing unit) consumption also increases.
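The batching step itself is simple to picture: group the active domains on a node by guest operating system and run page merging only within each group. The sketch below shows only that grouping under assumed (VM, OS) inputs; the actual page sharing would still be done by KSM or a similar service.

```python
from collections import defaultdict


def batch_by_guest_os(active_domains):
    """Group co-located virtual machines into homogeneous batches by guest OS,
    so page merging runs only among VMs that are likely to share pages."""
    batches = defaultdict(list)
    for vm_name, guest_os in active_domains:
        batches[guest_os].append(vm_name)
    return dict(batches)


if __name__ == "__main__":
    domains = [("vm1", "ubuntu-20.04"), ("vm2", "centos-7"),
               ("vm3", "ubuntu-20.04"), ("vm4", "centos-7")]
    for guest_os, batch in batch_by_guest_os(domains).items():
        print(guest_os, "->", batch)  # deduplication (e.g., KSM) would run per batch
```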
Data deduplication, as a compression method, has been widely used in most backup systems to improve bandwidth and space efficiency. As the data to be backed up explodes, the two main challenges in data deduplication are the CPU-intensive chunking and hashing work and the I/O-intensive disk-index access latency. However, CPU-intensive work has been vastly parallelized and sped up by multi-core and many-core processors, so I/O latency is likely becoming the bottleneck in data deduplication. To alleviate the challenge of I/O latency in multi-core systems, a multi-threaded deduplication (Multi-Dedup) architecture was proposed. The main idea of Multi-Dedup is to use parallel deduplication threads to hide the I/O latency. A prefix-based concurrent index was designed to maintain the internal consistency of the deduplication index with low synchronization overhead. In addition, a collision-free cache array was designed to preserve locality and similarity within the parallel threads. In experiments on various real-world datasets, Multi-Dedup achieves 3-5 times performance improvement when incorporated with the locality-based ChunkStash and local-similarity based SiLo methods. Multi-Dedup also dramatically decreases synchronization overhead and achieves 1.5-2 times performance improvement compared to traditional lock-based synchronization methods.
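The prefix-based concurrent index can be pictured as a fingerprint table sharded by the leading bits of each fingerprint, with one lock per shard, so parallel deduplication threads rarely contend. The sketch below is a minimal illustration under assumed parameters (16 shards, SHA-1 fingerprints, four worker threads); it is not the Multi-Dedup implementation and leaves out the collision-free cache array.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor


class PrefixConcurrentIndex:
    """Fingerprint index sharded by prefix so threads mostly lock disjoint shards."""

    def __init__(self, num_shards: int = 16):
        self.num_shards = num_shards
        self.shards = [dict() for _ in range(num_shards)]
        self.locks = [threading.Lock() for _ in range(num_shards)]

    def _shard(self, fingerprint: str) -> int:
        # The first hex digits of the fingerprint (its "prefix") pick the shard.
        return int(fingerprint[:4], 16) % self.num_shards

    def lookup_or_insert(self, fingerprint: str, location: int) -> bool:
        """Return True if the fingerprint was already present (a duplicate chunk)."""
        shard_id = self._shard(fingerprint)
        with self.locks[shard_id]:
            if fingerprint in self.shards[shard_id]:
                return True
            self.shards[shard_id][fingerprint] = location
            return False


def dedup_worker(index: PrefixConcurrentIndex, chunks):
    duplicates = 0
    for location, chunk in chunks:
        fp = hashlib.sha1(chunk).hexdigest()
        if index.lookup_or_insert(fp, location):
            duplicates += 1
    return duplicates


if __name__ == "__main__":
    index = PrefixConcurrentIndex()
    data = [(i, bytes([i % 64]) * 4096) for i in range(4096)]  # many repeated chunks
    quarters = [data[i::4] for i in range(4)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        total_dups = sum(pool.map(lambda part: dedup_worker(index, part), quarters))
    print("duplicate chunks found:", total_dups)
```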
The tremendous development of cloud computing with related technologies has been unexpected. However, centralized cloud storage faces a few challenges such as latency, storage, and packet drop in the network. Cloud storage attracts attention due to its huge data capacity and its role in securing secret information. Most developments in cloud storage have been positive, apart from a better cost model and effectiveness, but data leakage remains a billion-dollar security question for consumers. Traditional data security techniques are usually based on cryptographic methods, but these approaches may not be able to withstand an attack from the interior of the cloud server. So we suggest a model called multi-layer storage (MLS) based on security using elliptic curve cryptography (ECC). The suggested model focuses on the significance of cloud storage along with data protection and removing duplicates at the initial level. Based on divide-and-combine methodologies, the data are divided into three parts. The first two portions of the data are stored in the local system and in fog nodes, secured using an encoding and decoding technique. The remaining part of the encrypted data is saved in the cloud. The viability of our model has been tested in terms of safety measures and test evaluation, and it is truly a powerful complement to existing methods in cloud storage.
A significant number of cloud storage environments already implement deduplication technology. Due to the nature of the cloud environment, a storage server capable of accommodating large-capacity storage is required, and as storage capacity increases, additional storage solutions are needed. By leveraging deduplication, the cost problem can be fundamentally solved. However, deduplication poses privacy concerns due to its very structure. In this paper, we point out the privacy infringement problem and propose a new deduplication technique to solve it. In the proposed technique, since the user's map structure and files are not stored on the server, the file uploader list cannot be obtained through analysis of the server's meta-information, so the user's privacy is maintained. In addition, a personal identification number (PIN) can be used to solve the file ownership problem, providing advantages such as safety against insider breaches and sniffing attacks. The proposed mechanism requires approximately 100 ms of additional time to add an IDRef that distinguishes user-file pairs during typical deduplication; for smaller file sizes, the time required for the additional operations is similar to the base operation time, but it becomes relatively smaller as the file size grows.
Cloud computing technology is the culmination of technical advancements in computer networks, hardware, and software capabilities that collectively gave rise to computing as a utility. It offers a plethora of utilities to its clients worldwide in a very cost-effective way, and this feature is enticing users and companies to migrate their infrastructure to the cloud platform. Swayed by its gigantic capacity and easy access, clients upload replicated data to the cloud, resulting in an unnecessary crunch of storage in data centres. Many data compression techniques came to the rescue, but none could serve the purpose for a capacity as large as the cloud; hence, research turned to deduplicating the data and harvesting the space in existing storage capacity that was being wasted on duplicate data. For providing better cloud services through scalable provisioning of resources, interoperability has brought many Cloud Service Providers (CSPs) under one umbrella, termed a Cloud Federation. Many policies have been devised for private and public cloud deployment models for searching out and eradicating replicated copies using hashing techniques, whereas the exploration for duplicate copies is not restricted to any one type of CSP but spans the set of public or private CSPs contributing to the federation. It was found that even with advanced deduplication techniques for federated clouds, due to the different nature of CSPs, a single file may be stored in both the private and the public group of the same cloud federation, which can be handled if an optimized deduplication strategy is rendered to address this issue. Therefore, this study aims to further optimize a deduplication strategy for the federated cloud environment and suggests a central management agent for the federation. Since no directly relevant prior work was found, in this paper the concept of a federation agent has been implemented, using file-level deduplication to accomplish this approach.
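At file level, the federation agent's role reduces to keeping one hash index across all member CSPs and answering "is this file already stored somewhere in the federation?" before a member keeps a new copy. The sketch below is a minimal illustration of that role; the agent API, the CSP names, and SHA-256 whole-file hashing are assumptions, not the paper's design.

```python
import hashlib


class FederationAgent:
    """Central agent holding a federation-wide file-hash index."""

    def __init__(self):
        self.index = {}  # file hash -> (csp_name, file_name) of the single stored copy

    def register(self, csp_name: str, file_name: str, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        if digest in self.index:
            owner_csp, owner_file = self.index[digest]
            return f"duplicate of {owner_file} already stored at {owner_csp}"
        self.index[digest] = (csp_name, file_name)
        return f"stored at {csp_name}"


if __name__ == "__main__":
    agent = FederationAgent()
    report = b"quarterly-report-data"
    # The same file uploaded to a private and a public member of the federation.
    print(agent.register("private-csp-1", "report.pdf", report))
    print(agent.register("public-csp-2", "report_copy.pdf", report))
```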
Modern backup systems exploit data deduplication technology to save storage space while suffering from the fragmentation problem caused by deduplication. Fragmentation degrades restore performance because restoring requires reading chunks that are scattered across different containers. To improve restore performance, the state-of-the-art History Aware Rewriting Algorithm (HAR) collects fragmented chunks in the last backup and rewrites them in the next backup. However, because it rewrites fragmented chunks only in the next backup, HAR fails to eliminate the internal fragmentation caused by self-referenced chunks (chunks that appear two or more times in a backup) in the current backup, thus degrading restore performance. In this paper, we propose Selectively Rewriting Self-Referenced Chunks (SRSC), a scheme that designs a buffer to simulate a restore cache, identifies internal fragmentation in the cache, and selectively rewrites the affected chunks. Our experimental results based on two real-world datasets show that SRSC improves restore performance by 45% with an acceptable sacrifice of the deduplication ratio.
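The selection step can be approximated as follows: replay the current backup's chunk sequence through a small simulated restore cache and flag the self-referenced chunks whose repeat references would miss it. The sketch below uses an assumed FIFO cache of four slots purely for illustration; SRSC's actual buffer management and rewrite decision differ.

```python
from collections import deque


def select_self_referenced_rewrites(chunk_stream, cache_slots: int = 4):
    """Simulate a small restore cache and flag self-referenced chunks whose
    repeat references would miss it (internal-fragmentation candidates)."""
    cache = deque(maxlen=cache_slots)  # FIFO stand-in for the restore cache
    seen, rewrites = set(), set()
    for fingerprint in chunk_stream:
        if fingerprint in seen and fingerprint not in cache:
            rewrites.add(fingerprint)  # a repeat use the cache would not absorb
        seen.add(fingerprint)
        if fingerprint not in cache:
            cache.append(fingerprint)
    return rewrites


if __name__ == "__main__":
    stream = ["A", "B", "C", "D", "E", "F", "A", "B"]  # A and B recur far apart
    print(select_self_referenced_rewrites(stream))      # flags A and B for rewriting
```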
In cloud storage architectures, deduplication encrypted with a convergent key is one of the important data compression technologies, effectively improving the utilization of space and bandwidth. To further refine the usage scenarios for various user permissions and enhance users' data security, we propose a blockchain-based differentially authorized deduplication system. The proposed system optimizes the traditional Proof of Vote (PoV) consensus algorithm and simplifies the existing differential authorization process to realize credible management and dynamic update of authority. Based on the decentralized property of blockchain, we overcome the centralized single-point-of-failure problem of traditional differentially authorized deduplication systems. Besides, the operations of legitimate users are recorded in blocks to ensure the traceability of behaviors.
In deduplication, the index-lookup disk bottleneck is a major obstacle limiting the throughput of backup processes. One way to minimize this issue and boost speed is to use very coarse-grained chunks for deduplication, at the cost of low storage saving and limited scalability. Another way is to distribute the deduplication process among multiple nodes, but this approach introduces the storage-node island effect and also incurs high communication cost. In this paper, we explore dCACH, a content-aware clustered and hierarchical deduplication system that implements a hybrid of inline coarse-grained and offline fine-grained distributed deduplication, where routing decisions are made for a set of files instead of single files. It utilizes Bloom filters for detecting similarity between a data stream and previous data streams and performs stateful routing, which solves the storage-node island problem. Moreover, it exploits the negligibly small amount of content shared among chunks from different file types to create groups of files and deduplicate each group in its own fingerprint index space. It implements hierarchical deduplication to reduce the size of fingerprint indexes at the global level, where only files and big segments are deduplicated. Locality is created and exploited first using the big segments deduplicated at the global level and second by routing a set of consecutive files together to one storage node. Furthermore, using a Bloom filter for similarity detection between streams has low communication and computation cost while achieving duplicate elimination performance comparable to single-node deduplication. dCACH is evaluated using a prototype deployed on a server environment distributed over four separate machines. It is shown to be 10x faster than Extreme_Binn with minimal communication overhead, while its duplicate elimination effectiveness is on a par with a single-node deduplication system.
In cloud storage, client-side deduplication is widely used to reduce storage and communication costs. In client-side deduplication, if the cloud server detects that the user's outsourced data have already been stored, the client does not need to re-upload the data. However, the information on whether data need to be uploaded can be used as a side channel, which adversaries can exploit to compromise data privacy. In this paper, we propose a new threat model against side-channel attacks. Different from existing schemes, the adversary can learn the approximate ratio of stored chunks to unstored chunks in outsourced files, and this ratio affects the probability that the adversary compromises data privacy through side-channel attacks. Under this threat model, we design two defense schemes to minimize privacy leakage, both of which design interaction protocols between clients and the server during deduplication checks to reduce the probability that the adversary compromises data privacy. We analyze the security of our schemes and evaluate their performance based on a real-world dataset. Compared with existing schemes, our schemes better mitigate data privacy leakage and have a slightly lower communication cost.
Data deduplication (dedupe for short) is a special data compression technique. It has been widely adopted to save backup time as well as storage space, particularly in backup storage systems; therefore, most dedupe research has primarily focused on improving dedupe write performance. However, dedupe read performance in backup storage is also crucial for storage recovery. This paper designs a new dedupe storage read cache for backup applications that improves read performance by exploiting a special characteristic: the read sequence is the same as the write sequence. Consequently, for better cache utilization, it looks ahead for future references within a moving window and evicts the victim in the cache whose next access is farthest in the future. Moreover, to further improve read cache performance, it maintains a small log buffer to judiciously cache data chunks that will be accessed soon. Extensive experiments with real-world backup workloads demonstrate that the proposed read cache scheme improves read performance by up to 64.3%.
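Because restores replay the write order, the cache can peek at the upcoming references and evict the resident chunk whose next use is farthest away, a windowed form of Belady's rule. The sketch below illustrates only that eviction idea under assumed sizes (a 3-entry cache, an 8-reference look-ahead window) and omits the log buffer.

```python
def restore_with_lookahead(read_sequence, cache_size: int = 3, window: int = 8):
    """Simulate a read cache that evicts the chunk reused farthest in the look-ahead window."""
    cache, hits = set(), 0
    for i, chunk in enumerate(read_sequence):
        if chunk in cache:
            hits += 1
            continue
        if len(cache) >= cache_size:
            upcoming = read_sequence[i + 1:i + 1 + window]

            def next_use(c):
                # Distance to the next reference of a cached chunk within the window
                # (chunks never reused in the window get the largest distance).
                return upcoming.index(c) if c in upcoming else window + 1

            victim = max(cache, key=next_use)
            cache.discard(victim)
        cache.add(chunk)
    return hits


if __name__ == "__main__":
    # The read sequence mirrors the write sequence of a deduplicated backup stream.
    sequence = ["A", "B", "C", "A", "D", "B", "A", "E", "C", "A"]
    print("cache hits:", restore_with_lookahead(sequence))
```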
Data deduplication for file communication across a wide area network (WAN), in applications such as file synchronization and mirroring in cloud environments, usually achieves significant bandwidth saving at the cost of significant deduplication time overheads. These overheads include the time required for data deduplication at the two geographically distributed nodes (e.g., the disk access bottleneck) and the duplication query/answer operations between the sender and the receiver, since each query or answer introduces at least one round-trip time (RTT) of latency. In this paper, we present a data deduplication system across WAN with metadata feedback and metadata utilization (MFMU) in order to harness these deduplication-related time overheads. In the proposed MFMU system, selective metadata feedback from the receiver to the sender is introduced to reduce the number of duplication query/answer operations. In addition, to harness the metadata-related disk I/O operations at the receiver, as well as the bandwidth overhead introduced by the metadata feedback, a metadata utilization component based on a hysteresis hash re-chunking mechanism is introduced. Our experimental results demonstrate that MFMU achieves an average of 20%-40% deduplication acceleration without the bandwidth saving ratio being reduced by the metadata feedback, compared with the "baseline" content-defined chunking (CDC) used in LBFS (Low-Bandwidth Network File System) and existing state-of-the-art Bimodal chunking based data deduplication solutions.
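The baseline mentioned above, content-defined chunking, places chunk boundaries wherever a rolling hash over the last few bytes matches a target pattern, so an insertion shifts offsets without destroying most chunk matches. Below is a minimal CDC sketch; the additive rolling hash, window size, mask, and size limits are illustrative assumptions rather than the LBFS or Bimodal parameters.

```python
import hashlib
import os


def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3FF,
               min_size: int = 256, max_size: int = 4096):
    """Content-defined chunking: cut where a rolling hash of the last `window`
    bytes matches the mask, so boundaries follow content rather than offsets."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte                       # cheap additive rolling hash, for illustration
        if i - start >= window:
            rolling -= data[i - window]       # slide the window forward
        size = i - start + 1
        if (size >= min_size and (rolling & mask) == mask) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


if __name__ == "__main__":
    fingerprint = lambda chunk: hashlib.sha256(chunk).hexdigest()
    original = os.urandom(64 * 1024)
    edited = original[:1000] + b"INSERTED BYTES" + original[1000:]  # shifts all later offsets
    shared = set(map(fingerprint, cdc_chunks(original))) & set(map(fingerprint, cdc_chunks(edited)))
    print(len(shared), "chunks still shared after the insertion")
```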
Deduplication technology has been increasingly used to reduce storage costs. Though it has been successfully applied to backup and archival systems, existing techniques can hardly be deployed in primary storage systems due to the associated latency cost of detecting duplicated data, where every unit has to be checked against a substantially large fingerprint index before it is written. In this paper we introduce Leach, a self-learning in-memory fingerprint cache for inline primary storage that reduces the write cost in a deduplication system. Leach is motivated by the characteristics of real-world I/O workloads: the access patterns of duplicated data are highly skewed. Leach adopts a splay tree to organize the on-disk fingerprint index, automatically learns the access patterns, and maintains hot working sets in cache memory, with the goal of servicing the majority of duplicated data detection. Leveraging the working-set property, Leach provides optimizations to reduce the cost of splay operations on the fingerprint index and of cache updates. In comprehensive experiments on several real-world datasets, Leach outperforms the conventional LRU (least recently used) cache policy by reducing the number of cache misses, and significantly improves write performance without greatly affecting cache hits.
As data grow rapidly in data centers, inline cluster deduplication has been widely used to improve storage efficiency and data reliability. However, cluster deduplication systems face several challenges: the data deduplication rate decreases as the number of deduplication server nodes increases, data routing incurs high communication overhead, and load balancing is needed to improve system throughput. In this paper, we propose a well-performing cluster deduplication system called AR-Dedupe. Experimental results on two real datasets demonstrate that, through a new data routing algorithm, AR-Dedupe can achieve a high data deduplication rate with low communication overhead while keeping the system load well balanced. In addition, we utilize an application-aware mechanism to speed up the indexing of handprints in the routing server, which yields a 30% performance improvement.
Updatable block-level message-locked encryption (MLE) can efficiently update encrypted data, and public auditing can verify the integrity of cloud storage data by utilizing a third-party auditor (TPA). However, few schemes support both updatable block-level deduplication and public auditing. In this paper, an updatable block-level deduplication scheme with efficient auditing is proposed based on a tree-based authenticated structure. In the proposed scheme, the cloud server (CS) performs block-level deduplication, and the TPA carries out integrity auditing tasks. When a data block is updated, the ciphertext and auditing tags can be updated efficiently. The security analysis demonstrates that the proposed scheme achieves privacy under chosen-distribution attacks in secure deduplication and resists uncheatable chosen-distribution attacks (UNC-CDA) in proof of ownership (PoW). Furthermore, the integrity auditing process is proven secure under adaptive chosen-message attacks. Compared with previous relevant schemes, the proposed scheme achieves better functionality and higher efficiency.
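A tree-based authenticated structure allows an update to one block to refresh the authentication metadata by recomputing only the path from that leaf to the root instead of rehashing everything. The Merkle-style sketch below illustrates that update path; it is a generic illustration, not the paper's authenticated structure, tag format, or MLE construction.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_merkle(blocks):
    """Return every level of a Merkle tree over the block hashes (leaves first)."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # duplicate the last node on odd-sized levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def update_block(levels, index, new_block):
    """Update one leaf and recompute only the hashes on its path to the root."""
    levels[0][index] = h(new_block)
    for depth in range(1, len(levels)):
        index //= 2
        lower = levels[depth - 1]
        left = lower[2 * index]
        right = lower[2 * index + 1] if 2 * index + 1 < len(lower) else left
        levels[depth][index] = h(left + right)
    return levels[-1][0]  # new root


if __name__ == "__main__":
    blocks = [b"block-%d" % i for i in range(4)]
    levels = build_merkle(blocks)
    old_root = levels[-1][0]
    new_root = update_block(levels, 2, b"block-2-updated")
    print("root changed after a single-block update:", old_root != new_root)
```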
Data deduplication has been widely utilized in large-scale storage systems, particularly backup systems. Data deduplication systems typically divide data streams into chunks and identify redundant chunks by comparing chunk fingerprints. Maintaining all fingerprints in memory is not cost-effective because fingerprint indexes are typically very large. Many data deduplication systems therefore maintain a fingerprint cache in memory and exploit fingerprint prefetching to accelerate the deduplication process. Although fingerprint prefetching can improve the performance of data deduplication systems by leveraging workload locality, inaccurately prefetched fingerprints may pollute the cache by evicting useful fingerprints. We observed that most of the prefetched fingerprints in a wide variety of applications are never used or used only once, which severely limits the performance of data deduplication systems. We introduce PreCache, a prefetch-aware fingerprint cache management scheme for data deduplication systems, to alleviate prefetch-related cache pollution. We propose three prefetch-aware fingerprint cache replacement policies (PreCache-UNU, PreCache-UOO, and PreCache-MIX) to handle different types of cache pollution. Additionally, we propose an adaptive policy selector to select suitable policies for prefetch requests. We implement PreCache on two representative data deduplication systems (Block Locality Caching and SiLo) and evaluate its performance using three real-world workloads (Kernel, MacOS, and Homes). The experimental results reveal that PreCache improves deduplication throughput by up to 32.22% owing to a reduction in on-disk fingerprint index lookups and an improvement of the deduplication ratio achieved by mitigating prefetch-related fingerprint cache pollution.
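The underlying observation, that many prefetched fingerprints are never referenced, suggests replacement policies that prefer to evict prefetched-but-unused entries before demand-loaded ones. The sketch below layers that preference on a plain LRU order; the eviction rule and cache size are illustrative assumptions, not the paper's UNU, UOO, or MIX policies.

```python
from collections import OrderedDict


class PrefetchAwareCache:
    """LRU-ordered fingerprint cache that evicts prefetched-but-unused entries first."""

    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.entries = OrderedDict()  # fingerprint -> True if prefetched and never hit

    def _evict(self):
        # Prefer the oldest prefetched entry that was never referenced ...
        for fp, unused_prefetch in self.entries.items():
            if unused_prefetch:
                del self.entries[fp]
                return
        # ... otherwise fall back to plain LRU eviction.
        self.entries.popitem(last=False)

    def insert(self, fingerprint: str, prefetched: bool):
        if fingerprint not in self.entries and len(self.entries) >= self.capacity:
            self._evict()
        self.entries[fingerprint] = prefetched
        self.entries.move_to_end(fingerprint)

    def lookup(self, fingerprint: str) -> bool:
        if fingerprint in self.entries:
            self.entries[fingerprint] = False  # a hit: no longer an "unused prefetch"
            self.entries.move_to_end(fingerprint)
            return True
        return False


if __name__ == "__main__":
    cache = PrefetchAwareCache(capacity=3)
    cache.insert("fp-demand", prefetched=False)
    for fp in ("fp-p1", "fp-p2", "fp-p3"):   # prefetched neighbours that are never used
        cache.insert(fp, prefetched=True)
    print("demand-loaded entry survived:", cache.lookup("fp-demand"))
```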
Storage auditing and client-side deduplication techniques have been proposed to assure data integrity and improve storage efficiency, respectively. Recently, a few schemes have started to consider these two different aspects together. However, these schemes either only support plaintext data files or have been proven insecure. In this paper, we propose a public auditing scheme for cloud storage systems in which deduplication of encrypted data and data integrity checking can be achieved within the same framework. The cloud server can correctly check the ownership for new owners, and the auditor can correctly check the integrity of deduplicated data. Our scheme supports deduplication of encrypted data by using proxy re-encryption and also achieves deduplication of data tags by aggregating the tags from different owners. The analysis and experimental results show that our scheme is provably secure and efficient.
Deduplication has been commonly used in both enterprise storage systems and cloud storage. To overcome the performance challenge of selective restore operations in deduplication systems, a solid-state-drive-based (i.e., SSD-based) read cache can be deployed to speed them up by dynamically caching popular restore contents. Unfortunately, the frequent data updates induced by classical cache schemes (e.g., LRU and LFU) significantly shorten SSDs' lifetime while slowing down I/O processes in SSDs. To address this problem, we propose a new solution, LOP-Cache, which greatly improves the write durability of SSDs as well as I/O performance by enlarging the proportion of long-term popular (LOP) data among the data written into the SSD-based cache. LOP-Cache keeps LOP data in the SSD cache for a long period to decrease the number of cache replacements. Furthermore, it prevents unpopular or unnecessary data in deduplication containers from being written into the SSD cache. We implemented LOP-Cache in a prototype deduplication system to evaluate its performance. Our experimental results indicate that LOP-Cache shortens the latency of selective restore by an average of 37.3% at the cost of a small SSD-based cache with only 5.56% of the capacity of the deduplicated data. Importantly, LOP-Cache improves SSDs' lifetime by a factor of 9.77. The evidence shows that LOP-Cache offers a cost-efficient SSD-based read cache solution to boost selective restore performance for deduplication systems.
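The admission idea can be sketched as a simple filter: a container's data is written into the SSD cache only after its access count shows it is popular, so one-off restore data never wears the SSD. The popularity threshold and counters below are illustrative assumptions rather than LOP-Cache's actual policy.

```python
from collections import Counter


class PopularityAdmissionCache:
    """Admit an item into the (SSD-backed) cache only after `threshold` accesses."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.access_counts = Counter()  # kept in RAM; cheap to update
        self.ssd_cache = {}             # stands in for the SSD read cache
        self.ssd_writes = 0             # proxy for SSD wear

    def read(self, key: str, load_from_disk):
        if key in self.ssd_cache:
            return self.ssd_cache[key]
        value = load_from_disk(key)
        self.access_counts[key] += 1
        if self.access_counts[key] >= self.threshold:  # long-term popular: admit it
            self.ssd_cache[key] = value
            self.ssd_writes += 1
        return value


if __name__ == "__main__":
    cache = PopularityAdmissionCache(threshold=3)
    fetch = lambda key: f"container-{key}"  # pretend HDD container read
    workload = ["hot"] * 5 + ["cold-1", "cold-2", "cold-3"] + ["hot"] * 5
    for key in workload:
        cache.read(key, fetch)
    print("SSD writes:", cache.ssd_writes, "(only the popular container was admitted)")
```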
Ciphertext-policy attribute-based searchable encryption (CP-ABSE) can achieve fine-grained access control for data sharing and retrieval, and secure deduplication can save storage space by eliminating duplicate copies. However, few schemes support both searchable encryption and secure deduplication. In this paper, a large-universe CP-ABSE scheme supporting secure block-level deduplication is proposed under a hybrid cloud mechanism. In the proposed scheme, after the ciphertext is inserted into a Bloom filter tree (BFT), the private cloud can perform fine-grained deduplication efficiently by matching tags, and the public cloud can search efficiently using a homomorphic searchable method and keyword matching. Finally, the proposed scheme achieves privacy under chosen-distribution attacks block-level (PRV-CDA-B) secure deduplication and match-concealing (MC) searchable security. Compared with existing schemes, the proposed scheme has the advantage of simultaneously supporting fine-grained access control, block-level deduplication, and efficient search.