A novel cluster based feature selection and document classification model on high dimension TREC data


  • Lalitha Kumari
  • Ch. Satyanarayana





TREC Datasets, Information Retrieval, Document Clustering And Classification.


Analyzing the features of TREC text documents and finding their relevant similar documents is difficult with traditional document similarity measures. As the TREC repository grows, finding relevant clustered documents in a large collection of unstructured documents becomes a challenging task. Traditional document similarity and classification models are implemented on homogeneous TREC data to find the essential features of document entities that are similar to the TREC documents. In addition, most traditional models are applicable only to limited text document sets. The main issues with traditional text mining models on the TREC repository are: (1) each document is represented as a highly sparse vector; (2) the models fail to capture the semantic similarity of documents within and between clusters; and (3) the mean squared error rate is high. In this paper, a novel feature selection based clustering and classification model is proposed for a large number of different TREC repositories. Traditional latent semantic indexing and document clustering models fail to find topic relevance on large TREC clinical text document sets because of their computational memory and time requirements. The proposed document feature selection and cluster based classification model is applied to TREC clinical benchmark datasets. The experimental results show that the proposed model is more efficient than the existing models in terms of computational memory, accuracy, and error rate.
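The first two issues listed in the abstract, sparse vector representations and intra-cluster versus inter-cluster similarity, can be illustrated with a minimal sketch. It is not the model proposed in the paper: the four toy documents, the cluster labels, the smoothed TF-IDF weighting, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build dense TF-IDF vectors for a list of tokenized documents."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (an assumed weighting choice).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append([term_freq[t] * idf[t] for t in vocab])
    return vocab, vectors


def sparsity(vectors):
    """Fraction of zero entries across all document vectors."""
    total = sum(len(v) for v in vectors)
    zeros = sum(1 for v in vectors for x in v if x == 0)
    return zeros / total


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pairwise_similarity(vectors, labels, same_cluster):
    """Mean cosine similarity over pairs inside (True) or across (False) clusters."""
    sims = [
        cosine(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
        if (labels[i] == labels[j]) == same_cluster
    ]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical toy corpus: two clinical-style and two genomics-style documents.
docs = [
    "patient fever cough treatment".split(),
    "patient cough antibiotic treatment".split(),
    "gene mutation protein expression".split(),
    "protein expression gene sequencing".split(),
]
labels = [0, 0, 1, 1]  # assumed cluster assignments

vocab, vectors = tfidf_vectors(docs)
sparsity_ratio = sparsity(vectors)
intra = mean_pairwise_similarity(vectors, labels, same_cluster=True)
inter = mean_pairwise_similarity(vectors, labels, same_cluster=False)

print(f"vocabulary size: {len(vocab)}")
print(f"sparsity: {sparsity_ratio:.2f}")
print(f"mean intra-cluster cosine: {intra:.3f}")
print(f"mean inter-cluster cosine: {inter:.3f}")
```

Even in this tiny corpus most vector entries are zero, and the share of zeros grows with vocabulary size. Purely lexical cosine similarity also scores documents that share no terms as completely dissimilar, whatever their meaning, which is the kind of gap a semantic similarity measure is meant to close.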



