Abstract
As data volumes grow within data centers, cloud storage models face several issues in storing data and in shifting it within an adequate time frame. This study aims to develop a distributed deduplication model that achieves scalable throughput and capacity by using many data servers to deduplicate data in parallel with minimal loss. This paper proposes a new cloud storage model based on distributed deduplication with fingerprint index management (DDFI). The DDFI model operates in three main stages. In the first stage, it uses an effective routing technique based on the similarity level of the data, which reduces network overhead by rapidly identifying storage locations. In the second stage, duplicate data are identified using the MD5 algorithm. In the final stage, a fingerprint index management process is executed, where the fingerprint index comprises the fingerprint and the corresponding position details of every written chunk. To optimize deduplication performance, the DDFI model maintains the fingerprint index in memory and writes it to disk only when the cloud database system is idle. The simulation results show that the presented DDFI model achieves a higher deduplication ratio (DR) with minimal network bandwidth overhead. The detailed comparative analysis indicates that the presented DDFI model offers the highest relative DR, the best deduplication performance, and the lowest read and write bandwidth.
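The second and third stages described above (MD5-based duplicate identification and an in-memory fingerprint index flushed to disk during idle periods) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class and method names, the `(server_id, offset)` position format, and the explicit `flush_when_idle` call are assumptions introduced for clarity.

```python
import hashlib


class FingerprintIndex:
    """In-memory fingerprint index mapping chunk fingerprints to position details.

    New entries are buffered and persisted only when the store is idle,
    sketched here as an explicit flush call.
    """

    def __init__(self):
        self.index = {}   # fingerprint -> (server_id, offset)
        self.dirty = []   # fingerprints not yet written to disk

    @staticmethod
    def fingerprint(chunk: bytes) -> str:
        # MD5 digest of the chunk serves as its fingerprint, as in DDFI
        return hashlib.md5(chunk).hexdigest()

    def is_duplicate(self, chunk: bytes) -> bool:
        # A chunk is a duplicate if its fingerprint is already indexed
        return self.fingerprint(chunk) in self.index

    def add(self, chunk: bytes, server_id: int, offset: int) -> None:
        # Record position details for a newly written (non-duplicate) chunk
        fp = self.fingerprint(chunk)
        if fp not in self.index:
            self.index[fp] = (server_id, offset)
            self.dirty.append(fp)

    def flush_when_idle(self) -> list:
        # Placeholder for persisting buffered entries while the system is idle;
        # a real system would write self.dirty to a disk-resident index here.
        persisted = list(self.dirty)
        self.dirty.clear()
        return persisted


idx = FingerprintIndex()
idx.add(b"chunk-A", server_id=0, offset=0)
print(idx.is_duplicate(b"chunk-A"))  # True: already written, can be skipped
print(idx.is_duplicate(b"chunk-B"))  # False: must be stored
```

Keeping the index in memory avoids a disk lookup per incoming chunk; the trade-off is that buffered entries must eventually be persisted, which is why the paper defers writes to idle periods.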
Saraswathi, S.S., Malarvizhi, N. Distributed deduplication with fingerprint index management model for big data storage in the cloud. Evol. Intel. 14, 683–690 (2021). https://doi.org/10.1007/s12065-020-00395-8