Duplicate detection and elimination in XML data for a data warehouse

Ghaith O. Mahdi; Murtadha M. Hamad

doi:10.14419/ijet.v7i4.20419

Authors

Ghaith O. Mahdi
University of Anbar
Murtadha M. Hamad
University of Anbar

Received date: September 28, 2018

Accepted date: December 16, 2018

Published date: May 27, 2019

DOI:

https://doi.org/10.14419/ijet.v7i4.20419

Keywords:

Blocking Technique, Levenshtein Distance, Smith Waterman Similarity, ANN (Back Propagation)

Abstract

The volume of data has grown significantly in recent decades, and because data are collected from many sources in different formats, duplicate data have become a problem. Duplicates arise when the same data exist in different representations, so they must be removed to obtain a clean data set. This study is concerned with cleaning data that have a complex hierarchical structure in a data warehouse, by detecting duplicates in large databases in order to increase the efficiency of data mining. The proposed duplicate detection system passes through three stages. The first stage (pre-processing) has two parts. The first part eliminates exact matches, which removes many identical elements completely; this saves considerable time and effort by keeping those elements out of the processing stage, which is known for its complexity. The second part applies a blocking technique based on Levenshtein distance to minimize the number of comparisons and to block elements more accurately than traditional blocking techniques. Together, these steps improve the data set. The second stage (processing) computes the similarity ratio between each pair of elements within each block using the Smith-Waterman similarity algorithm. The third stage (classification) identifies each element as duplicate or non-duplicate using an Artificial Neural Network (Back-Propagation), with a threshold of 0.65 determined from the results obtained.
The efficiency of the proposed system is reflected in its accuracy, which approaches 100% by reducing the number of false negatives and false positives relative to the true positives.
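
The three stages can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the greedy blocking rule, the `max_distance` value, the Smith-Waterman scoring parameters, and the normalisation are all assumptions made for illustration. A plain 0.65 cut-off on the similarity score also stands in for the back-propagation neural network, which the abstract does not describe in enough detail to reproduce.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def smith_waterman_similarity(a: str, b: str, match=2, mismatch=-1, gap=-1) -> float:
    """Best local-alignment score, scaled to [0, 1].

    Scaling by the best score the shorter string could reach is an
    assumption; the abstract does not say how ratios are normalised.
    """
    if not a or not b:
        return 0.0
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best / (match * min(len(a), len(b)))


def drop_exact_matches(records):
    """Pre-processing, part 1: keep the first copy of each identical record."""
    return list(dict.fromkeys(records))


def block_by_levenshtein(records, max_distance=3):
    """Pre-processing, part 2 (hypothetical rule): a record joins the first
    block whose first member is within max_distance edits of it."""
    blocks = []
    for rec in records:
        for block in blocks:
            if levenshtein(rec, block[0]) <= max_distance:
                block.append(rec)
                break
        else:
            blocks.append([rec])
    return blocks


def candidate_duplicates(records, threshold=0.65, max_distance=3):
    """Compare pairs only inside each block; a fixed threshold replaces
    the neural-network classifier used in the paper."""
    pairs = []
    for block in block_by_levenshtein(drop_exact_matches(records), max_distance):
        for a, b in combinations(block, 2):
            sim = smith_waterman_similarity(a, b)
            if sim >= threshold:
                pairs.append((a, b, sim))
    return pairs


if __name__ == "__main__":
    names = ["John Smith", "Jon Smith", "John Smith", "Mary Jones"]
    for a, b, sim in candidate_duplicates(names):
        print(f"{a!r} ~ {b!r}: {sim:.2f}")
```

Here each record is a flat string. For hierarchical XML elements, one plausible adaptation is to flatten the text content of each element before comparison, but that choice is likewise an assumption.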

