Effective processing of unstructured data using python in Hadoop map reduce

K Kousalya; Shaik Javed Parvez

doi:10.14419/ijet.v7i2.21.12456

Received date: May 4, 2018

Accepted date: May 4, 2018

Published date: April 20, 2018

Keywords:

Hadoop map reduce, unstructured data, streaming, performance, non-java based

Abstract

Most of the data being generated today is unstructured, which makes its volume and variety difficult to handle. This paper proposes processing unstructured text data effectively in Hadoop MapReduce using Python. Apache Hadoop is an open-source platform built around the MapReduce framework, which is popular and effective for processing unstructured data in parallel. MapReduce has two stages, map and reduce (described here as transform and repository): the input is split into small blocks, and worker nodes process the individual blocks in parallel. MapReduce programs are generally written in Java, but Hadoop Streaming allows the mapper and reducer to be written in other languages such as Python. In this paper, we show an alternative way of processing growing volumes of unstructured content using Python, and we compare the performance of Java-based and non-Java-based programs.
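As an illustration of the Hadoop Streaming approach described in the abstract, the following is a minimal word-count sketch, not the authors' implementation. It assumes the usual streaming convention that the mapper and reducer read lines from standard input and write tab-separated key/value records to standard output. The file name `wordcount.py`, the function names, and the `map`/`reduce` command-line argument are hypothetical choices made for this example.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per token of raw text."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each word.

    Assumes the input is sorted by key, which the shuffle/sort step
    between the map and reduce stages provides.
    """
    # Skip malformed records that carry no tab-separated count.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if "\t" in line)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: the stage is chosen by an explicit argument,
# e.g. `wordcount.py map` or `wordcount.py reduce`.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

The same script can be checked locally without a cluster by emulating the shuffle step with `sort`, for example `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`, where `input.txt` stands for any plain-text file.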
