Removing Duplicate URLs based on URL Normalization and Query Parameter
Keywords: URL Normalization, Query Parameter, Categorization, Duplicate URLs, Execution Time.
Searching is a core requirement of web users, and the quality of search results depends on the crawler. Users rely on search engines to obtain information in various forms: text, images, audio, and video. A search engine answers queries from its indexed database, which is built from the URLs collected by a crawler. Some URLs lead, directly or indirectly, to the same page, and crawling and indexing URLs with identical content wastes resources. Such duplicate results arise from a poor crawling algorithm, a low-quality ranking algorithm, or a degraded user experience. The challenge is to detect and eliminate duplicate and near-duplicate documents in order to improve the performance of a search engine. This paper proposes a web crawler that restricts crawling to a particular category to discard irrelevant URLs and applies URL normalization to remove duplicate URLs within that category. Results are analyzed in terms of total URLs fetched, duplicate URLs, and query execution time.
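The paper does not publish its implementation, but the URL-normalization step it describes can be sketched as follows. This is a minimal illustration, assuming the common syntactic rules (lowercasing the scheme and host, dropping the default port, discarding the fragment, and sorting query parameters so that parameter order does not create distinct URLs); the function name and rule set are illustrative, not the authors' code.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Return a canonical form of `url` so syntactic duplicates compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    # Keep the port only if it is not the default for the scheme.
    port = parts.port
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    # An empty path and "/" address the same resource.
    path = parts.path or "/"
    # Sort query parameters: ?a=1&b=2 and ?b=2&a=1 become identical.
    query = urlencode(sorted(parse_qsl(parts.query)))
    # Fragments never reach the server, so they are dropped.
    return urlunsplit((scheme, host, path, query, ""))

# De-duplicate a crawl frontier by normalized form.
seen = set()
unique = []
for url in ["HTTP://Example.com:80/page?b=2&a=1",
            "http://example.com/page?a=1&b=2#top"]:
    canon = normalize_url(url)
    if canon not in seen:
        seen.add(canon)
        unique.append(canon)
```

Here both input URLs normalize to the same string, so only one entry survives in `unique`; a crawler would apply this check before fetching or indexing a page.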