An XML based Web Crawler with Page Revisit Policy and Updation in Local Repository of Search Engine

  • Authors

    • Jyoti Mor, Ansal University, Gurugram
    • Dr Dinesh Rai, Ansal University, Gurugram
    • Dr Naresh Kumar, MSIT, New Delhi
    2018-06-23
    https://doi.org/10.14419/ijet.v7i3.12924
  • Keywords

    WWW, Search Engine, Web Crawler, Network Resources, Page Revisit.
  • Abstract

    Search engines find it difficult to keep their online repositories up to date across a large collection of web pages. Major search engines run hundreds of web crawlers that crawl the WWW day and night and send the downloaded web pages over the network to be stored in the search engine's database. This results in overutilization of network resources such as bandwidth and CPU cycles. This paper proposes an architecture that tries to reduce the utilization of shared network resources with the help of an advanced XML-based approach. The architecture, based on focused crawling, is trained to download only high-quality data from the internet, leaving behind web pages that are not relevant to the desired domain. The paper describes a detailed layout of the proposed system, which is capable of reducing the load on the network and reducing the problems that arise from the residency of mobile agents at the remote server.

  • How to Cite

    Mor, J., Rai, D., & Kumar, N. (2018). An XML based Web Crawler with Page Revisit Policy and Updation in Local Repository of Search Engine. International Journal of Engineering & Technology, 7(3), 1119-1123. https://doi.org/10.14419/ijet.v7i3.12924