Compression of text files using genomic code compression algorithm

  • Abstract

    Text files occupy a substantial amount of memory or disk space, and transmitting them across a network requires considerable bandwidth. Compression is particularly useful in telecommunications and information technology because it lets devices transmit or store the same amount of data in fewer bits. Text compression techniques segment an English passage by observing its patterns and assign alternative symbols to larger patterns of text. Compression algorithms are used to reduce the volume of stored information and the cost of data storage. Compressing large collections of information can also improve retrieval time. Novel lossless compression algorithms have been introduced to achieve better compression ratios. In this work, existing compression mechanisms designed specifically for text files and for deoxyribonucleic acid (DNA) sequence files are analyzed. Their performance is compared in terms of compression ratio, time taken to compress and decompress the sequence, and file size. In the proposed work, the input file is converted to DNA format and a DNA compression procedure is then applied.
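The abstract does not specify how text is converted to DNA format, so the following is only a minimal sketch under stated assumptions: each byte of the UTF-8 text is split into four 2-bit pairs, and each pair is mapped to one base with the hypothetical assignment A=00, C=01, G=10, T=11. The function names `text_to_dna`, `dna_to_text`, and `compression_ratio` are illustrative and do not come from the paper, and the ratio is taken as original size divided by compressed size, which is one common convention among several.

```python
# Hypothetical 2-bit mapping (A=00, C=01, G=10, T=11).
# The paper's actual text-to-DNA mapping is not given in the abstract.
BASES = "ACGT"


def text_to_dna(text: str) -> str:
    """Encode UTF-8 bytes as bases, four bases per byte, high bit pair first."""
    out = []
    for byte in text.encode("utf-8"):
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_text(dna: str) -> str:
    """Invert text_to_dna; raises ValueError on malformed input."""
    if len(dna) % 4 != 0:
        raise ValueError("DNA string length must be a multiple of 4")
    data = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            # str.index raises ValueError for characters outside "ACGT"
            byte = (byte << 2) | BASES.index(base)
        data.append(byte)
    return data.decode("utf-8")


def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Assumed convention: original size divided by compressed size."""
    if compressed_size <= 0:
        raise ValueError("compressed_size must be positive")
    return original_size / compressed_size


if __name__ == "__main__":
    sample = "Hello, DNA!"
    dna = text_to_dna(sample)
    assert dna_to_text(dna) == sample  # the mapping is lossless
    print(dna)
```

Stored as one character per base, this intermediate DNA string is four times longer than the input, so any saving has to come from the DNA compression stage that follows, which is not sketched here.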



  • Keywords

    Data compression, text compression, lossy and lossless compression, DNA, bases, bit reduction, hexadecimal format, variable-length code, Huffman codes.

Article ID: 13399
DOI: 10.14419/ijet.v7i2.31.13399

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.