Performance evaluation and resource optimization of cloud-based parallel Hadoop clusters with an intelligent scheduler.

  • Abstract

    Data generated by real-time information systems are inherently incremental. Processing such large volumes of incremental data at scale requires a parallel processing system such as a Hadoop-based cluster. A major challenge in any cluster-based system is using the system's resources efficiently. This research proposes a model architecture for a Hadoop cluster with two additional integrated components: a super node, which manages the cluster's computations, and a mediation manager, which handles performance monitoring and evaluation. The super node is equipped with an intelligent (adaptive) scheduler that schedules each job with optimal resources. The scheduler is termed intelligent because it automatically decides which resource to use for which computation by cross-mapping resources and jobs with a genetic algorithm that finds the best-matching resource. The mediation node deploys Ganglia, a standard monitoring tool for Hadoop clusters, to collect and record the cluster's performance parameters. Overall, the system schedules different jobs with optimal use of resources, thereby achieving better efficiency than Hadoop's native Capacity Scheduler. The system is deployed on top of an OpenNebula cloud environment for scalability.
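
    The abstract does not specify the encoding, fitness function, or operators of the genetic algorithm, so the following is only a minimal sketch of how a job-to-resource mapping could be evolved. It assumes each job has a single cost figure, each node a single speed figure, and that the goal is to minimize the finish time of the busiest node. The names makespan, evolve_schedule, job_costs, and node_speeds are hypothetical and are not taken from the paper.

```python
import random


def makespan(assignment, job_costs, node_speeds):
    """Finish time of the busiest node for a given job -> node assignment."""
    load = [0.0] * len(node_speeds)
    for job, node in enumerate(assignment):
        load[node] += job_costs[job] / node_speeds[node]
    return max(load)


def evolve_schedule(job_costs, node_speeds, pop_size=30, generations=60,
                    mutation_rate=0.1, seed=0):
    """Evolve a job -> node mapping (assumed setup: >= 2 jobs, pop_size >= 4)."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    n_jobs, n_nodes = len(job_costs), len(node_speeds)

    def fitness(assignment):
        # Lower is better: the assumed objective is the makespan.
        return makespan(assignment, job_costs, node_speeds)

    # Chromosome: position = job index, value = node index.
    population = [[rng.randrange(n_nodes) for _ in range(n_jobs)]
                  for _ in range(pop_size)]

    for _ in range(generations):
        population.sort(key=fitness)
        survivors = population[:pop_size // 2]  # keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_jobs)  # single-point crossover
            child = mother[:cut] + father[cut:]
            for job in range(n_jobs):  # random reassignment as mutation
                if rng.random() < mutation_rate:
                    child[job] = rng.randrange(n_nodes)
            children.append(child)
        population = survivors + children

    return min(population, key=fitness)


# Illustrative data only: six jobs and three nodes of differing speed.
job_costs = [8, 3, 5, 9, 2, 7]
node_speeds = [1.0, 2.0, 1.5]
best = evolve_schedule(job_costs, node_speeds)
print(best, makespan(best, job_costs, node_speeds))
```

    Keeping the better half of each generation means the best mapping found so far is never lost. A scheduler of the kind described would presumably replace the single cost and speed figures with the performance parameters recorded through Ganglia, but that coupling is an assumption here.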






  • Keywords

    Big Data; Hadoop; Parallel Processing; Intelligent Scheduler; Ganglia Monitor; Super Node; Mediation Manager.





Article ID: 13372
DOI: 10.14419/ijet.v7i4.13372

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.