A novel cluster-based feature selection and document classification model on high-dimensional TREC data

  • Abstract

    The features of TREC text documents, and of the documents relevant or similar to them, are difficult to analyze with traditional document similarity measures. As the size of the TREC repository increases, finding relevant clustered documents in a large collection of unstructured documents becomes a challenging task. Traditional document similarity and classification models are implemented on homogeneous TREC data to find the essential features of document entities that are similar to the TREC documents. Most traditional models are also applicable only to limited text document sets. The main issues with traditional text mining models on the TREC repository are: 1) each document is represented as a vector with many sparse values; 2) the models fail to find the semantic similarity between documents within and across clusters; and 3) the mean squared error rate is high. In this paper, a novel feature selection based clustering and classification model is proposed for a large number of different TREC repositories. Traditional latent semantic indexing and document clustering models fail to find topic relevance on large TREC clinical text document sets because of their computational memory and time requirements. The proposed document feature selection and cluster-based classification model is applied to TREC clinical benchmark datasets. The experimental results show that the proposed model is more efficient than the existing models in terms of computational memory, accuracy, and error rate.
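    The first two issues listed above can be illustrated with a minimal sketch. It is not the proposed model: the three toy documents, the whitespace tokenizer, and the helper names `term_vectors`, `sparsity`, and `cosine` are assumptions made for illustration only. It builds plain term-count vectors and compares them with cosine similarity, a common traditional measure.

```python
import math
from collections import Counter


def term_vectors(docs):
    """Build term-count vectors over the shared vocabulary of all documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One entry per vocabulary term; terms absent from the document are 0.
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def sparsity(vector):
    """Fraction of entries in the vector that are zero."""
    return sum(1 for v in vector if v == 0) / len(vector)


def cosine(a, b):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy corpus, not taken from any TREC dataset.
docs = [
    "patient reports chest pain",
    "patient reports severe headache",
    "trial measures drug response",
]
vocab, vectors = term_vectors(docs)

for i, vec in enumerate(vectors):
    print(f"doc {i}: sparsity = {sparsity(vec):.2f}")
print("cosine(doc 0, doc 1) =", cosine(vectors[0], vectors[1]))
print("cosine(doc 0, doc 2) =", cosine(vectors[0], vectors[2]))
```

    Even with ten vocabulary terms, most entries of each vector are zero, and the share of zeros grows as the vocabulary grows. The first and third documents are both clinical in topic but share no terms, so their cosine similarity is zero, which is the kind of semantic relationship a purely lexical measure cannot capture.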

  • Keywords

    TREC Datasets, Information Retrieval, Document Clustering and Classification.

Article ID: 10146
DOI: 10.14419/ijet.v7i1.1.10146

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.