Term Weighting Vs. Logistic Regression Performance on E-Commerce Data

  • Abstract

    Text categorization is often a difficult problem. Although many text categorization algorithms have been developed, they are not always as accurate as expected: some are highly accurate in particular cases, while others perform well in different ones. In this work, we compare two well-known text categorization methods: term weighting and logistic regression. The dataset comes from our previous start-up, “Ume Market Network”, an online peer-to-peer e-commerce system that was synchronized with Facebook sales groups. Every offer in this dataset must be categorized as either a sale or a purchase offer; the problem is therefore a classical binary categorization task on a text dataset containing formal as well as colloquial expressions in English, Italian, and German. After the ambiguities were resolved, logistic regression outperformed term weighting by around 25% in accuracy.
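
    To make the comparison concrete, the following is a minimal sketch of the two kinds of classifier on a binary sale/purchase task. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the toy offers, the smoothed log-ratio weighting scheme, the bag-of-words features, and all function names are hypothetical choices made here.

```python
import math
from collections import Counter

# Hypothetical toy offers (NOT the paper's dataset): 1 = sale, 0 = purchase.
TRAIN = [
    ("selling my old bike", 1),
    ("for sale used laptop", 1),
    ("vendo bicicletta usata", 1),
    ("verkaufe sofa guter zustand", 1),
    ("looking for a cheap bike", 0),
    ("want to buy a laptop", 0),
    ("cerco bicicletta", 0),
    ("suche sofa", 0),
]


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real data needs more care)."""
    return text.lower().split()


def train_term_weights(data):
    """Assumed supervised weighting: smoothed log-ratio of per-class document frequencies."""
    doc_freq = {0: Counter(), 1: Counter()}
    n_docs = Counter()
    for text, label in data:
        n_docs[label] += 1
        doc_freq[label].update(set(tokenize(text)))
    vocab = set(doc_freq[0]) | set(doc_freq[1])
    return {
        term: math.log(
            ((doc_freq[1][term] + 1) / (n_docs[1] + 2))
            / ((doc_freq[0][term] + 1) / (n_docs[0] + 2))
        )
        for term in vocab
    }


def predict_term_weights(weights, text):
    """Sum the weights of known terms; a positive total means 'sale'."""
    score = sum(weights.get(term, 0.0) for term in tokenize(text))
    return 1 if score > 0 else 0


def train_logreg(data, epochs=200, lr=0.5):
    """Logistic regression on term counts, fitted by stochastic gradient ascent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in data:
            counts = Counter(tokenize(text))
            z = bias + sum(weights.get(t, 0.0) * c for t, c in counts.items())
            prob = 1.0 / (1.0 + math.exp(-z))
            error = label - prob  # gradient of the log-likelihood w.r.t. z
            bias += lr * error
            for term, count in counts.items():
                weights[term] = weights.get(term, 0.0) + lr * error * count
    return weights, bias


def predict_logreg(model, text):
    """Predict 'sale' (1) when the linear score is positive, i.e. P(sale) > 0.5."""
    weights, bias = model
    counts = Counter(tokenize(text))
    z = bias + sum(weights.get(t, 0.0) * c for t, c in counts.items())
    return 1 if z > 0 else 0


term_weights = train_term_weights(TRAIN)
logreg_model = train_logreg(TRAIN)

for offer in ["vendo laptop", "suche bike"]:
    print(offer, predict_term_weights(term_weights, offer), predict_logreg(logreg_model, offer))
```

    In this sketch the term-weighting classifier scores each term independently from class frequencies, whereas logistic regression fits all term weights and a bias jointly against the labels. The toy data is far too small to say anything about the relative accuracy of the two approaches.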

  • Keywords

    Machine Learning; Logistic Regression; Supervised Term Weighting; Text Categorization; Multiclass Classification


Article ID: 22738
DOI: 10.14419/ijet.v7i4.35.22738

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.