Funding: supported in part by Hankuk University of Foreign Studies' Research Fund for 2023 and in part by the National Research Foundation of Korea (NRF) grant funded by the Ministry of Science and ICT, Korea (No. 2021R1F1A1045933).
Abstract: The Internet of Things (IoT) and cloud technologies have encouraged massive data storage at central repositories. Software-defined networks (SDN) support the processing of data and restrict the transmission of duplicate values. A data de-duplication mechanism is therefore necessary to reduce communication costs and storage overhead. Existing state-of-the-art schemes suffer from computational overhead due to deterministic or random tree-based tag generation, which further increases as the file size grows. This paper presents an efficient file-level de-duplication scheme (EFDS) in which the cost of creating tags is reduced by employing a hash table with a key-value pair for each block of the file. Further, an algorithm for hash-table-based duplicate block identification and storage (HDBIS) is presented, based on fingerprints, that maintains a linked list of duplicate blocks at the same index. Hash tables normally have constant time complexity for looking up, inserting, and deleting stored data regardless of the input size. The experimental results show that the proposed EFDS scheme performs better than its counterparts.
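To make the hash-table mechanism described in this abstract concrete, the following is a minimal Python sketch of block-level de-duplication with fingerprint chaining at a shared table index. The block size, table size, and SHA-256 fingerprints are assumptions chosen for illustration; this is not the authors' EFDS/HDBIS implementation.

```python
import hashlib

BLOCK_SIZE = 4096      # assumed block size; the abstract does not fix this value
N_BUCKETS = 1 << 16    # assumed hash-table size

# table[i] is a chain (a Python list standing in for a linked list) of
# (fingerprint, block) pairs whose fingerprints map to the same index.
table = [[] for _ in range(N_BUCKETS)]


def store_block(block: bytes) -> str:
    """Store a block only if its fingerprint is not already chained at its index."""
    fp = hashlib.sha256(block).hexdigest()
    bucket = table[int(fp, 16) % N_BUCKETS]
    if all(existing_fp != fp for existing_fp, _ in bucket):
        bucket.append((fp, block))           # new unique block
    return fp                                # the tag recorded for the file


def store_file(data: bytes) -> list:
    """Split a file into fixed-size blocks and return its list of block tags."""
    return [store_block(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]


tags_a = store_file(b"hello world" * 1000)
tags_b = store_file(b"hello world" * 1000)   # duplicate file adds no new blocks
assert tags_a == tags_b
```

Lookup, insertion, and chaining here are all average-constant-time dictionary-style operations, which is the property the abstract relies on to keep tag creation cheap as files grow.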
Abstract: Evidence-based literature reviews play a vital role in contemporary research, facilitating the synthesis of knowledge from multiple sources to inform decision-making and scientific advancements. Within this framework, de-duplication emerges as a part of the process for ensuring the integrity and reliability of evidence extraction. This opinion review delves into the evolution of de-duplication, highlights its importance in evidence synthesis, explores various de-duplication methods, discusses evolving technologies, and proposes best practices. By addressing ethical considerations, this paper emphasizes the significance of de-duplication as a cornerstone for quality in evidence-based literature reviews.
Funding: supported by the National Natural Science Foundation of China (No. 60673001) and the State Key Development Program of Basic Research of China (No. 2004CB318203).
Abstract: Based on variable-sized chunking, this paper proposes a content-aware chunking scheme, called CAC, that does not assume fully random file contents but considers the characteristics of the file types. CAC uses a candidate anchor histogram and file-type-specific knowledge to refine how anchors are determined when performing de-duplication of file data, and it enforces the selected average chunk size. CAC finds more chunks, which in turn produces smaller average chunks and a better reduction in data. We present a detailed evaluation of CAC, and the experimental results show that this scheme can improve the chunking compression ratio for file types whose bytes are not randomly distributed (by 11.3% to 16.7%, depending on the dataset) and improve the write throughput by 9.7% on average.
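As background for the anchor-based chunking described in this abstract, the following is a minimal content-defined chunking sketch in Python. It cuts a chunk wherever a window checksum hits an anchor condition, subject to minimum and maximum chunk sizes; the window size, target average, and Adler-32 checksum are assumptions, and CAC's candidate anchor histogram and file-type knowledge are not reproduced here.

```python
import zlib

WINDOW = 48                      # assumed sliding-window size
AVG_CHUNK = 8192                 # assumed target average chunk size
MIN_CHUNK, MAX_CHUNK = 2048, 65536


def chunk(data: bytes):
    """Yield variable-sized chunks, cutting where the window checksum hits the anchor."""
    start = 0
    for i in range(len(data)):
        length = i - start + 1
        if length < MIN_CHUNK:
            continue
        window = data[i - WINDOW + 1:i + 1]
        # Recomputing the checksum per byte keeps the sketch short; a real
        # chunker would use a rolling hash updated in O(1) per byte.
        at_anchor = zlib.adler32(window) % AVG_CHUNK == 0
        if at_anchor or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]


payload = bytes(range(256)) * 1000
assert b"".join(chunk(payload)) == payload
```

Because cut points depend on content rather than fixed offsets, an insertion early in a file shifts only the chunks around it, which is what lets anchor selection influence how many duplicate chunks are found.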
Funding: supported by the National Basic Research Program (973) of China (No. 2004CB318201), the National High-Tech Research and Development Program (863) of China (No. 2008AA01A402), and the National Natural Science Foundation of China (Nos. 60703046 and 60873028).
Abstract: Apart from high space efficiency, other demanding requirements for enterprise de-duplication backup are high performance, high scalability, and availability for large-scale distributed environments. The main challenge is reducing the significant disk input/output (I/O) overhead caused by constantly accessing the disk to identify duplicate chunks. Existing inline de-duplication approaches mainly rely on duplicate locality to avoid the disk bottleneck and thus suffer from degradation under workloads with poor duplicate locality. This paper presents Chunkfarm, a post-processing de-duplication backup system designed to improve capacity, throughput, and scalability for de-duplication. Chunkfarm performs de-duplication backup using the hash join algorithm, which turns the notoriously random and small disk I/Os of fingerprint lookups and updates into large sequential disk I/Os, hence achieving high write throughput that is not influenced by workload locality. More importantly, by decentralizing fingerprint lookup and update, Chunkfarm supports a cluster of servers performing de-duplication backup in parallel; it is hence conducive to distributed implementation and thus applicable to large-scale and distributed storage systems.
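To illustrate the hash-join idea this abstract builds on, the sketch below batches new fingerprints, partitions both the batch and the stored index by a hash of the fingerprint, and joins the matching partitions. The partition count and names are assumptions, and a real system would stream each stored partition sequentially from disk rather than hold everything in memory; this is not Chunkfarm's implementation.

```python
import hashlib
from collections import defaultdict

N_PARTITIONS = 4               # assumed partition count


def partition(fingerprints):
    """Group fingerprints by a hash-derived partition number."""
    parts = defaultdict(set)
    for fp in fingerprints:
        parts[int(fp[:4], 16) % N_PARTITIONS].add(fp)
    return parts


def dedup_batch(new_fps, stored_fps):
    """Return fingerprints from new_fps that are not already in stored_fps.

    Both sides are partitioned the same way, so each pair of partitions is
    joined independently; on disk this becomes one sequential scan per
    partition instead of one random lookup per fingerprint.
    """
    new_parts, stored_parts = partition(new_fps), partition(stored_fps)
    unique = []
    for p in range(N_PARTITIONS):
        unique.extend(new_parts[p] - stored_parts[p])
    return unique


stored = {hashlib.sha1(bytes([b])).hexdigest() for b in range(100)}
incoming = [hashlib.sha1(bytes([b])).hexdigest() for b in range(50, 150)]
print(len(dedup_batch(incoming, stored)))    # 50 fingerprints are genuinely new
```

Because each partition can be joined on a different server, the same partitioning step is also what makes the lookup work easy to spread across a cluster.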
Funding: This work is supported by the Key Research and Development Project of Sichuan Province (Nos. 2021YFSY0012, 2020YFG0307, and 2021YFG0332), the Key Research and Development Project of Chengdu (No. 2019-YF05-02028-GX), the Innovation Team of Quantum Security Communication of Sichuan Province (No. 17TD0009), and the Academic and Technical Leaders Training Funding Support Projects of Sichuan Province (No. 2016120080102643).
Abstract: With the growing maturity of blockchain technology, its peer-to-peer model and fully duplicated data storage pattern enable blockchain to act as a distributed ledger in untrustworthy environments. Blockchain storage has also become a research hotspot in industry, finance, and academia due to its security, and its unique data storage management model is gradually becoming a key technology for applications in various fields. However, with the increasing amount of data written into the blockchain, blockchain systems face many practical problems, such as high storage space occupation, low data flexibility and availability, low retrieval efficiency, and poor scalability. To address these problems, this paper combines off-chain storage technology and de-duplication technology to optimize the blockchain storage model. Firstly, this paper adopts a double-chain model to reduce the data storage of the major chain system, which stores a small amount of primary data and supervises the vice chain through an Application Programming Interface (API). The vice chain stores a large number of copies of data as well as non-transactional data. Our model divides the vice chain storage system into two layers, a storage layer and a processing layer. In the processing layer, de-duplication technology is applied to reduce the redundancy of vice chain data. Our double-chain storage model with high scalability enhances data flexibility, is more suitable as a distributed storage system, and performs well in data retrieval.
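The sketch below illustrates the division of labor described in this abstract: a main chain keeps lightweight references while an off-chain ("vice") store de-duplicates the bulky payloads by content hash. Class and field names are illustrative assumptions, not the paper's design or API.

```python
import hashlib


class ViceStore:
    """Off-chain (vice) storage layer that de-duplicates payloads by content hash."""

    def __init__(self):
        self.blobs = {}                       # content hash -> payload, stored once

    def put(self, payload: bytes) -> str:
        h = hashlib.sha256(payload).hexdigest()
        self.blobs.setdefault(h, payload)     # duplicate payloads are dropped here
        return h


class MainChain:
    """Main chain keeps only lightweight references to off-chain payloads."""

    def __init__(self, vice: ViceStore):
        self.vice = vice
        self.blocks = []                      # list of (tx id, payload hash)

    def append(self, tx_id: str, payload: bytes):
        self.blocks.append((tx_id, self.vice.put(payload)))


vice = ViceStore()
chain = MainChain(vice)
chain.append("tx1", b"large replicated document")
chain.append("tx2", b"large replicated document")   # same payload, stored once off-chain
assert len(chain.blocks) == 2 and len(vice.blobs) == 1
```

The main chain's growth is bounded by the size of the references it records, while redundancy in the bulky data is absorbed by the de-duplicating off-chain layer, which is the storage saving the double-chain model targets.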
Funding: supported by the National High-Tech Research and Development (863) Program of China (No. 2015AA01A303).
Abstract: With the rise of various cloud services, the problem of redundant data is more prominent in cloud storage systems. How to assign a set of documents to a distributed file system so as to reduce storage space while also ensuring access efficiency as much as possible is an urgent problem that needs to be solved. Space efficiency mainly relies on data de-duplication technologies, while access efficiency requires gathering files with high similarity on the same server. Based on the study of other data de-duplication technologies, especially the Similarity-Aware Partitioning (SAP) algorithm, this paper proposes the Frequency and Similarity-Aware Partitioning (FSAP) algorithm for cloud storage. The FSAP algorithm is a more reasonable data partitioning algorithm than the SAP algorithm. Meanwhile, this paper proposes the Space-Time Utility Maximization Model (STUMM), which is useful in balancing the relationship between space efficiency and access efficiency. Finally, this paper uses 100 web files downloaded from CNN for testing, and the results show that, relative to the algorithms associated with the SAP algorithm (including the SAP-Space-Delta algorithm and the SAP-Space-Dedup algorithm), the FSAP algorithm based on STUMM achieves a higher compression ratio and a more balanced distribution of data blocks.
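As a rough illustration of similarity-aware placement in the SAP/FSAP family, the sketch below assigns each incoming file to the server whose stored chunk fingerprints overlap it most, so similar files cluster together and de-duplicate well. The server count, block size, and tie-breaking rule are assumptions; the FSAP algorithm itself additionally weighs chunk frequency and the STUMM space-time trade-off, which are not reproduced here.

```python
import hashlib

N_SERVERS = 3                                 # assumed cluster size
BLOCK = 1024                                  # assumed chunk size for fingerprinting
servers = [set() for _ in range(N_SERVERS)]   # chunk fingerprints held per server


def fingerprints(data: bytes):
    """Fingerprint a file as the set of hashes of its fixed-size chunks."""
    return {hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}


def place(data: bytes) -> int:
    """Send the file to the server with the greatest fingerprint overlap."""
    fps = fingerprints(data)
    best = max(range(N_SERVERS),
               key=lambda s: (len(servers[s] & fps), -len(servers[s])))
    servers[best] |= fps
    return best


a = place(b"shared header " * 200 + b"report A")
b = place(b"shared header " * 200 + b"report B")   # similar file lands on the same server
assert a == b
```

Placing similar files on the same server improves both goals the abstract names: shared chunks are stored once, and a client reading related documents touches fewer servers.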