Removing Duplicate URLs based on URL Normalization and Query Parameter
Keywords: URL Normalization, Query Parameter, Categorization, Duplicate URLs, Execution Time.
Searching is a core requirement of web users, and the quality of search results depends on the crawler. Users rely on search engines to obtain information in various forms: text, images, audio, and video. A search engine answers queries from its indexed database, which is built from the URLs collected by a crawler. Some URLs lead, directly or indirectly, to the same page, and crawling and indexing URLs with identical content wastes resources. Such duplicate results arise from a poor crawling algorithm, a low-quality ranking algorithm, or a degraded user experience. The challenge is to detect and eliminate duplicate and near-duplicate documents in order to improve the performance of a search engine. This paper proposes a web crawler that restricts crawling to a particular category to discard irrelevant URLs and applies URL normalization to remove duplicate URLs within that category. Results are analyzed in terms of total URLs fetched, duplicate URLs, and query execution time.
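The paper does not publish its implementation, but the URL-normalization step it describes can be sketched as follows. This is a minimal illustration, assuming the common syntactic rules (lowercasing the scheme and host, dropping the default port, discarding the fragment, and sorting query parameters so that parameter order does not create distinct URLs); the function name and rule set are illustrative, not the authors' code.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Return a canonical form of `url` so syntactic duplicates compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    # Keep the port only if it is not the default for the scheme.
    port = parts.port
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    # An empty path and "/" address the same resource.
    path = parts.path or "/"
    # Sort query parameters: ?a=1&b=2 and ?b=2&a=1 become identical.
    query = urlencode(sorted(parse_qsl(parts.query)))
    # Fragments never reach the server, so they are dropped.
    return urlunsplit((scheme, host, path, query, ""))

# De-duplicate a crawl frontier by normalized form.
seen = set()
unique = []
for url in ["HTTP://Example.com:80/page?b=2&a=1",
            "http://example.com/page?a=1&b=2#top"]:
    canon = normalize_url(url)
    if canon not in seen:
        seen.add(canon)
        unique.append(canon)
```

Here both input URLs normalize to the same string, so only one entry survives in `unique`; a crawler would apply this check before fetching or indexing a page.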