A Proposed Method for Reducing the Dimension of Arabic Documents

  • Abstract

    Dimensionality reduction is an essential data preprocessing technique for large-scale and streaming data classification tasks. It can be used to improve both the efficiency and the effectiveness of document classification. Traditional dimensionality reduction approaches fall into two categories: feature extraction and feature selection. Techniques in the feature extraction category are typically more effective than those in the feature selection category. Representing Arabic texts compactly reduces their dimensionality, which in turn facilitates the processes applied to them, such as similarity measurement and text classification. Although researchers have proposed many methods to solve this problem, in this paper we introduce an effective method for representing Arabic texts at the smallest size. The method is based on the structure, or form, of Arabic words: it removes all prefixes and suffixes from the words in a text and also removes redundant and meaningless words. Removing these affixes and words is intended to reduce the size of the texts and hence their dimension. The experimental results show that the proposed method reduces the size of the text representation by 42%, taking into account the origin of the texts and the words that reduce the dimension.
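
The two steps the abstract describes, affix removal and deletion of meaningless words, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the affix lists, the stop-word set, the `min_stem` threshold, and all function names are assumptions chosen for the example, and the abstract does not specify the actual lists or how the 42% figure is computed.

```python
# Illustrative sketch only. The affix lists and stop words below are
# assumptions for demonstration, not the lists used in the paper.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ة"]
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}

# Try longer affixes first so that "وال" is preferred over "ال".
_PREFIXES = sorted(PREFIXES, key=len, reverse=True)
_SUFFIXES = sorted(SUFFIXES, key=len, reverse=True)


def strip_affixes(word, min_stem=3):
    """Remove at most one prefix and one suffix, keeping >= min_stem letters."""
    for prefix in _PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    for suffix in _SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[:-len(suffix)]
            break
    return word


def reduce_tokens(text):
    """Drop stop words, then strip affixes from the remaining tokens."""
    return [strip_affixes(t) for t in text.split() if t not in STOP_WORDS]


def vocabulary_reduction(text):
    """Fraction by which the number of distinct terms shrinks."""
    before = set(text.split())
    if not before:
        return 0.0
    after = set(reduce_tokens(text))
    return 1 - len(after) / len(before)


if __name__ == "__main__":
    sample = "الكتاب كتاب في الكتاب"
    print(reduce_tokens(sample))
    print(vocabulary_reduction(sample))
```

Here the reduction is measured on the number of distinct terms, which is the dimension of a bag-of-words representation. That measure is one plausible reading of "size of text representation" and is an assumption of this sketch. A rule-based stripper like this one can also over-stem words whose first or last letters only resemble an affix, which is why the sketch keeps a minimum stem length.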



  • Keywords

    Arabic Text Processing; Stop-word List; Text dimensionality reduction.




Article ID: 23423
DOI: 10.14419/ijet.v7i3.28.23423

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.