Using Word Based Features for Word Clustering

  • Authors

    • Farhan Nashwan
    • Mohsen Rashwan
    • Sherif Abdou
    • Hassanin Al-Barhamtoshy
    2019-03-01
    https://doi.org/10.14419/ijet.v8i1.11.28085
  • holistic technique, clustering, optical character recognition, lexicon reduction
  • In the holistic technique (HT), an Arabic word is described as a whole by computing many possible features over the entire word. This paper examines how the HT can be used in Arabic OCR systems, specifically to reduce the number of candidate words for each input word. For a single font and a single size, we reduced the candidates to around 115, with accuracy over 99%, from a large lexicon of more than 356K words. A vocabulary of this size gives good coverage of the Arabic language. The problem facing the OCR classifier is therefore greatly reduced, and much higher accuracy can be expected from OCR systems.
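
The lexicon-reduction idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not list the features or the distance measure, so the three features per word, the Euclidean distance, the toy lexicon, and the names `reduce_lexicon` and `reduction_factor` are all assumptions made for illustration. Each lexicon word is represented by one whole-word feature vector, and an input word keeps only the `k` lexicon entries closest to it.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reduce_lexicon(query, lexicon, k):
    """Return the k lexicon words whose holistic feature vectors are
    closest to the query vector, nearest first."""
    return heapq.nsmallest(k, lexicon, key=lambda w: euclidean(query, lexicon[w]))


def reduction_factor(lexicon_size, candidates_kept):
    """How many times smaller the candidate list is than the full lexicon."""
    return lexicon_size / candidates_kept


# Hypothetical toy lexicon (values invented for illustration):
# word -> (sub-word count, dot count, width/height ratio)
LEXICON = {
    "word_a": (2, 2, 2.5),
    "word_b": (2, 2, 2.6),
    "word_c": (1, 2, 2.2),
    "word_d": (3, 1, 3.8),
    "word_e": (2, 2, 1.4),
    "word_f": (2, 1, 1.5),
}

# Feature vector measured on the input word image (assumed values).
query = (2, 2, 2.5)
candidates = reduce_lexicon(query, LEXICON, k=3)
print(candidates)

# Scale of the reduction reported in the abstract: about 115 candidates
# kept out of a lexicon of roughly 356K words.
print(round(reduction_factor(356_000, 115)))
```

With the figures quoted in the abstract, the classifier would choose among roughly 115 words instead of more than 356K, a candidate list about 3,000 times smaller.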

     

     

  • How to Cite

    Nashwan, F., Rashwan, M., Abdou, S., & Al-Barhamtoshy, H. (2019). Using Word Based Features for Word Clustering. International Journal of Engineering & Technology, 8(1.11), 25-32. https://doi.org/10.14419/ijet.v8i1.11.28085