Effective processing of unstructured data using python in Hadoop map reduce

  • Authors

    • K Kousalya
    • Shaik Javed Parvez
    2018-04-20
    https://doi.org/10.14419/ijet.v7i2.21.12456
  • Hadoop map reduce, unstructured data, streaming, performance, non-java based
  • Today, most newly generated data is unstructured, which makes handling such a wide range of data difficult. This paper proposes processing unstructured text data effectively in Hadoop MapReduce using Python. Apache Hadoop is an open-source platform that relies heavily on the MapReduce framework, which is popular and effective for processing unstructured data in parallel. MapReduce has two stages, map and reduce: the input is split into small blocks, and worker nodes process the individual blocks in parallel. MapReduce programs are generally written in Java, but Hadoop Streaming allows the mapper and reducer to be written in other languages such as Python. In this paper, we show an alternative way of processing growing unstructured content using Python, and we compare the performance of Java-based and non-Java-based programs.
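
The abstract describes writing the mapper and reducer in Python through Hadoop Streaming, which passes records to each stage over standard input and output. The following is a minimal word-count sketch of that idea, not the authors' implementation: the function names (`mapper`, `reducer`, `run_locally`), the tokenizing rule, and the tab-separated `word<TAB>count` record layout are all assumptions chosen for illustration.

```python
import re
import sys
from itertools import groupby

# Hypothetical tokenizer: lowercase letters, digits, and apostrophes form a word.
TOKEN = re.compile(r"[a-z0-9']+")


def mapper(lines):
    """Map stage: emit one 'word<TAB>1' record per token in the raw text."""
    for line in lines:
        for word in TOKEN.findall(line.lower()):
            yield f"{word}\t1"


def reducer(sorted_records):
    """Reduce stage: sum the counts of each run of identical keys.

    Assumes the records arrive sorted by key, as they do after the
    shuffle/sort step that sits between the two stages.
    """
    pairs = (rec.rstrip("\n").split("\t", 1) for rec in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in one process, for testing without a cluster."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    # Streaming-style entry point: read records from stdin, write them to stdout.
    stage = mapper if sys.argv[1] == "map" else reducer
    for record in stage(sys.stdin):
        print(record)
```

Under these assumptions, the same script could be supplied to the Hadoop Streaming jar as both the `-mapper` command (with the argument `map`) and the `-reducer` command (with the argument `reduce`), while `run_locally` lets the logic be checked on a small sample before submitting a job.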

     

  • How to Cite

    Kousalya, K., & Javed Parvez, S. (2018). Effective processing of unstructured data using python in Hadoop map reduce. International Journal of Engineering & Technology, 7(2.21), 417-419. https://doi.org/10.14419/ijet.v7i2.21.12456