A Novel approach of data deduplication for distributed storage

  • Authors

    • Shubhanshi Singhal
    • Akanksha Kaushik
    • Pooja Sharma
    2018-03-10
    https://doi.org/10.14419/ijet.v7i2.4.10040
  • Data Deduplication, Chunking, Fingerprints, Indexing, Genetic Algorithm
  • Due to drastic growth of digital data, data deduplication has become a standard component of modern backup systems. It reduces data redundancy, saves storage space, and simplifies the management of data chunks. This process is performed in three steps: chunking, fingerprinting, and indexing of fingerprints. In chunking, data files are divided into the chunks and the chunk boundary is decided by the value of the divisor. For each chunk, a unique identifying value is computed using a hash signature (i.e. MD-5, SHA-1, SHA-256), known as fingerprint. At last, these fingerprints are stored in the index to detect redundant chunks means chunks having the same fingerprint values. In chunking, the chunk size is an important factor that should be optimal for better performance of deduplication system. Genetic algorithm (GA) is gaining much popularity and can be applied to find the best value of the divisor. Secondly, indexing also enhances the performance of the system by reducing the search time. Binary search tree (BST) based indexing has the time complexity of  which is minimum among the searching algorithm. A new model is proposed by associating GA to find the value of the divisor. It is the first attempt when GA is applied in the field of data deduplication. The second improvement in the proposed system is that BST index tree is applied to index the fingerprints. The performance of the proposed system is evaluated on VMDK, Linux, and Quanto datasets and a good improvement is achieved in deduplication ratio.

  • References

    1. style='font-size:8.0pt'>
    2. style='mso-spacerun:yes'> ADDIN EN.REFLIST
    3. field-separator'>[1] D. Reinsel, J. Gantz, and J. Rydning, "Data Age 2025: The Evolution of Data to Life-Critical," Seagate, An IDC White Paper2017.

      [2] Q. He, Z. Li, and X. Zhang, "Data deduplication techniques," in Future Information Technology and Management Engineering (FITME), 2010 International Conference on, 2010, pp. 430-433.

      [3] C. Bo, Z. F. Li, and W. Can, "Research on chunking algorithms of data de-duplication," in Proceedings of the 2012 International Conference on Communication, Electronics and Automation Engineering, 2013, pp. 1019-1025.

      [4] V. Henson, "An Analysis of Compare-by-hash," in HotOS, 2003, pp. 13-18.

      [5] S. He, C. Zhang, and P. Hao, "Comparative study of features for fingerprint indexing," in Image Processing (ICIP), 2009 16th IEEE International Conference on, 2009, pp. 2749-2752.

      [6] W. Xia, H. Jiang, D. Feng, F. Douglis, P. Shilane, Y. Hua, et al., "A comprehensive study of the past, present, and future of data deduplication," Proceedings of the IEEE, vol. 104, pp. 1681-1710, 2016.

      [7] F. Guo and P. Efstathopoulos, "Building a High-performance Deduplication System," in USENIX annual technical conference, 2011.

      [8] A. Venish and K. Siva Sankar, "Study of Chunking Algorithm in Data Deduplication," in International Conference on Soft Computing Systems (ICSCS), 2016, pp. 13-20.

      [9] K. Eshghi and H. K. Tang, "A framework for analyzing and improving content-based chunking algorithms," Hewlett-Packard Labs Technical Report TR, vol. 30, 2005.

      [10] T.-S. Moh and B. Chang, "A running time improvement for the two thresholds two divisors algorithm," in Proceedings of the 48th Annual Southeast Regional Conference, 2010, p. 69.

      [11] I. Lkhagvasuren, J. M. So, J. G. Lee, C. Yoo, and Y. W. Ko, "Byte-index Chunking algorithm for data deduplication system," International Journal of Security and its Applications, vol. 7, pp. 415-424, 2013.

      [12] W. Xia, Y. Zhou, H. Jiang, D. Feng, Y. Hua, Y. Hu, et al., "FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication," in USENIX Annual Technical Conference, 2016, pp. 101-114.

      [13] J. J. Grefenstette, "How genetic algorithms work: A critical look at implicit parallelism," in Proc. 3rd Int. Joint Conf. on Genetic Algorithms (ICGA89), 1989.

      [14] S. Quinlan and S. Dorward, "Venti: A New Approach to Archival Storage," in FAST, 2002, pp. 89-101.

      [15] T. White, Hadoop: The definitive Guide: O'Reilly Media, Inc, 2012.

      [16] V. Tarasov, A. Mudrankit, W. Buik, P. Shilane, G. Kuenning, and E. Zadok, "Generating Realistic Datasets for Deduplication Analysis," in USENIX Annual Technical Conference, 2012, pp. 261-272.

      [17] T. L. Foundation, "The Linux Kernel Archives," ed, 2017.

      [18] M. Oberhumer, "LZO real-time data compression library," in User Manual for LZO, 0.28 ed, 1997.

      [19] M. R. Nelson, "LZW data compression," Dr. Dobb's Journal, vol. 14, pp. 29-36, 1989.

      [20] A. Z. Broder, "Some applications of Rabin’s fingerprinting method," in Sequences II, ed: Springer, 1993, pp. 143-152.

      [21] J. D. Cohen, "Recursive hashing functions for n-grams," ACM Transactions on Information Systems (TOIS), vol. 15, pp. 291-320, 1997.

      [22] W. J. Bolosky, S. Corbin, D. Goebel, and J. R. Douceur, "Single instance storage in Windows 2000," in Proceedings of the 4th USENIX Windows Systems Symposium, 2000, pp. 13-24.

      [23] A. Muthitacharoen, B. Chen, and D. Mazieres, "A low-bandwidth network file system," in ACM SIGOPS Operating Systems Review, 2001, pp. 174-187.

      [24] C. Yu, C. Zhang, Y. Mao, and F. Li, "Leap-based content defined chunking—theory and implementation," in Mass Storage Systems and Technologies (MSST), 2015 31st Symposium on, 2015, pp. 1-12.

      [25] E. Kruus, C. Ungureanu, and C. Dubnicki, "Bimodal Content Defined Chunking for Backup Streams," in Fast, 2010, pp. 239-252.

      [26] I. Lkhagvasuren, J. M. So, J. G. Lee, and Y. W. Ko, "Multi-level Byte Index Chunking Approach for File Synchronization," 2013.

      [27] Y. Zhang, H. Jiang, D. Feng, W. Xia, M. Fu, F. Huang, et al., "AE: An asymmetric extremum content defined chunking algorithm for fast and bandwidth-efficient data deduplication," in Computer Communications (INFOCOM), 2015 IEEE Conference on, 2015, pp. 1337-1345.

      [28] D. Teodosiu, N. Bjorner, Y. Gurevich, M. Manasse, and J. Porkka, "Optimizing file replication over limited-bandwidth networks using remote differential compression," 2006.

      [29] D. T. Meyer and W. J. Bolosky, "A study of practical deduplication," ACM Transactions on Storage (TOS), vol. 7, p. 14, 2012.

      [30] J. Black, "Compare-by-Hash: A Reasoned Analysis," in USENIX Annual Technical Conference, General Track, 2006, pp. 85-90.

      [31] M. Stevens, E. Bursztein, P. Karpman, A. Albertini, and Y. Markov, "The first collision for full SHA-1," in Annual International Cryptology Conference, 2017, pp. 570-596.

      [32] A. W. Appel, "Verification of a cryptographic primitive: SHA-256," ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 37, p. 7, 2015.

      [33] K. Srinivasan, T. Bisson, G. R. Goodson, and K. Voruganti, "iDedup: latency-aware, inline data deduplication for primary storage," in FAST, 2012, pp. 1-14.

      [34] J. Wei, H. Jiang, K. Zhou, and D. Feng, "MAD2: A scalable high-throughput exact deduplication approach for network backup services," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, 2010, pp. 1-14.

      [35] B. Zhu, K. Li, and R. H. Patterson, "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System," in Fast, 2008, pp. 1-14.

      [36] ZFS, ed, 2012.

      [37] I. Drago, M. Mellia, M. M Munafo, A. Sperotto, R. Sadre, and A. Pras, "Inside dropbox: understanding personal cloud storage services," in Proceedings of the 2012 Internet Measurement Conference, 2012, pp. 481-494.

      [38] J. H. Holland, Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence: MIT press, 1992.

      [39] N. M. Razali and J. Geraghty, "Genetic algorithm performance with different selection strategies in solving TSP," in Proceedings of the world congress on engineering, 2011, pp. 1134-1139.

      [40] L. Brown and L. Gruenwald, "Tree-based indexes for image data," Journal of Visual Communication and Image Representation, vol. 9, pp. 300-313, 1998.

      [41] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezis, and P. Camble, "Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality," in Fast, 2009, pp. 111-123.

    4. style='font-size:8.0pt'>
  • Downloads

  • How to Cite

    Singhal, S., Kaushik, A., & Sharma, P. (2018). A Novel approach of data deduplication for distributed storage. International Journal of Engineering & Technology, 7(2.4), 46-52. https://doi.org/10.14419/ijet.v7i2.4.10040