Duplicate detection and elimination in XML data for a data warehouse

  • Abstract

    Due to the significant increase in the volume of data in recent decades, and because data are collected from multiple sources in different formats, the problem of duplicate data has emerged. Such duplicates must be cleaned in order to obtain a pure data set. The main concern of this study is cleaning XML data, which is known for its complex hierarchical structure, in a data warehouse; this is achieved by detecting duplicates in large databases in order to increase the efficiency of data mining. The proposed system for eliminating duplicate elements passes through three stages. The first stage (the pre-processing stage) consists of two parts. The first part is exact-match elimination, which removes the many elements that are completely identical; this saves considerable time and effort by preventing such elements from entering the processing stage, which is usually known for its complexity. In the second part, a blocking technique based on the Levenshtein distance is used to minimize the number of comparisons and to achieve higher blocking accuracy than traditional methods. These processes improve the quality of the data set. In the second stage (the processing stage), the similarity ratio between each pair of elements within each block is computed using the Smith-Waterman similarity algorithm. The third stage is the classification stage, in which each element is identified as either duplicate or non-duplicate. An Artificial Neural Network (back-propagation) is used for this purpose, with a threshold of 0.65 determined from the results obtained.
The efficiency of the proposed system is reflected in the accuracy obtained, which approaches 100% through reducing the numbers of "false negatives" and "false positives" relative to the "true positives".
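The pre-processing stage described above can be sketched in code. The following is a minimal illustration, not the paper's actual implementation: the element strings, the distance threshold of 2, and the helper functions are all illustrative assumptions.

```python
# Hypothetical sketch of the pre-processing stage: exact-match elimination
# followed by Levenshtein-distance blocking. The threshold max_dist=2 and
# the sample data are assumptions for illustration only.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def preprocess(elements, max_dist=2):
    """Drop exact duplicates, then group near matches into blocks."""
    unique = list(dict.fromkeys(elements))  # exact-match elimination
    blocks = []                             # each block holds similar elements
    for el in unique:
        for block in blocks:
            if levenshtein(el, block[0]) <= max_dist:
                block.append(el)
                break
        else:
            blocks.append([el])
    return blocks

blocks = preprocess(["John Smith", "John Smith", "Jon Smith", "Mary Jones"])
# "John Smith" and "Jon Smith" land in one block; "Mary Jones" in another.
```

Only elements within the same block are then compared pairwise in the processing stage, which is what reduces the number of comparisons.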



  • Keywords

    Blocking Technique; Levenshtein Distance; Smith-Waterman Similarity; ANN (Back-Propagation)
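The Smith-Waterman similarity named above can be sketched as a local-alignment score normalized to [0, 1], so that it can be compared against the 0.65 threshold mentioned in the abstract. The scoring parameters (match = 2, mismatch = -1, gap = -1) are assumptions for illustration; the paper's actual settings are not given here, and the real system classifies with a neural network rather than a bare threshold.

```python
# Illustrative Smith-Waterman local-alignment similarity between two
# non-empty strings, normalized by the best possible score (a perfect
# match of the shorter string). Parameters are assumed, not the paper's.

def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-1) -> float:
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # DP matrix, zero-initialized
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best / (match * min(len(a), len(b)))

def classify(a: str, b: str, threshold=0.65) -> str:
    """Toy threshold classifier standing in for the ANN stage."""
    return "duplicate" if smith_waterman(a, b) >= threshold else "non-duplicate"
```

For example, "John Smith" versus "Jon Smith" aligns almost perfectly and scores well above 0.65, while unrelated names score far below it.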

  • References

      [1] M. R. Pawar, “Efficient Duplicate Detection and Elimination in Hierarchical Multimedia Data,” vol. 122, no. 12, pp. 15–21, 2015. https://doi.org/10.5120/21751-5018.

      [2] A. A. Abraham and S. D. Kanmani, “A Novel Approach for the Effective Detection of Duplicates in XML Data,” Int. J. Comput. Eng. Res., vol. 4, pp. 82–87, 2014.

      [3] M. M. Hamad and S. S. Sami, “Using Q-Gram and Fuzzy Logic Algorithms for Eliminating Data Warehouse Duplications,” 2016.

      [4] S. Gaikwad and N. Bogiri, “Levenshtein distance algorithm for efficient and effective XML duplicate detection,” IEEE Int. Conf. Comput. Commun. Control. IC4 2015, 2016. https://doi.org/10.1109/IC4.2015.7375698.

      [5] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, “Eliminating Fuzzy Duplicates in Data Warehouses,” VLDB ’02 Proc. 28th Int. Conf. Very Large Databases, pp. 586–597, 2002. https://doi.org/10.1016/B978-155860869-6/50058-5.

      [6] M. Weis and F. Naumann, “DogmatiX Tracks down Duplicates in XML,” Proc. 2005 ACM SIGMOD Int. Conf. Manag. Data, pp. 431–442, 2005. https://doi.org/10.1145/1066157.1066207.

      [7] L. Leitão, P. Calado, and M. Weis, “Structure-based inference of xml similarity for fuzzy duplicate detection,” 16th ACM Conf. Inf. Knowl. Manag., pp. 293–302, 2007. https://doi.org/10.1145/1321440.1321483.

      [8] A. R. Petkar and V. B. Patil, “Duplicate Detection in Hierarchical Data Using XPath,” IOSR J. Comput. Eng., vol. 17, no. 6, 2015.

      [9] A. N. Mehta, “Similarity Detection for XML Data,” Int. J. Adv. Res. Sci. Eng., vol. 5, no. 1, pp. 152–157, 2016.

      [10] B. Dhake, S. S. Lomte, Y. R. Nagargoje, and R. A. Auti, “Duplicate Detection in Hierarchical Data Using Improved Network Pruning Algorithm,” Compusoft, vol. 4, no. 6, pp. 7838–7850, 2015.

      [11] “Performance Evaluation of Blocking Methods for Duplicate Record Detection,” thesis, 2010.

      [12] U. Draisbach and F. Naumann, “A generalization of blocking and windowing algorithms for duplicate detection,” Proc. - 2011 Int. Conf. Data Knowl. Eng. ICDKE 2011, pp. 18–24, 2011. https://doi.org/10.1109/ICDKE.2011.6053920.

      [13] J. Aruna, “Identification of Duplication Records For Query Results from Real Time Databases,” B. S. Abdur Rahman University, 2012.

      [14] R. Haldar and D. Mukhopadhyay, “Levenshtein Distance Technique in Dictionary Lookup Methods: An Improved Approach,” arXiv:1101.1232, pp. 286–293, 2011.

      [15] G. Recchia and M. Louwerse, “A Comparison of String Similarity Measures for Toponym Matching,” 2013.

      [16] C. Ling, K. Benkrid, and T. Hamada, “A parameterisable and scalable smith-Waterman algorithm implementation on CUDA-compatible GPUs,” 2009 IEEE 7th Symp. Appl. Specif. Process. SASP 2009, pp. 94–100, 2009. https://doi.org/10.1109/SASP.2009.5226343.

      [17] L. Hasan, Z. Al-Ars, and Z. Nawaz, “A Novel Approach for Accelerating the Smith-Waterman Algorithm using Recursive Variable Expansion,” Proc. 19th Annu. …, 2008. https://doi.org/10.1109/IDT.2008.4802483.

      [18] M. Bilenko and R. J. Mooney, “Adaptive Blocking: Learning to Scale Up Record Linkage,” Proc. Sixth IEEE Int. Conf. Data Min., pp. 87–96, 2006. https://doi.org/10.1109/ICDM.2006.13.




Article ID: 20419
DOI: 10.14419/ijet.v7i4.20419

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.