Empirical Bayesian Binary Classification Forests Using Bootstrap Prior
Keywords: Binary Classification, Empirical Bayes, High-Dimensional, Random Forest
In this paper, we present a new method, the Empirical Bayesian Random Forest (EBRF), for the binary classification problem. The prior ingredient for the method was obtained using the bootstrap prior technique. EBRF explicitly addresses the low-accuracy problem of the Random Forest (RF) classifier that arises when the number of relevant input variables is small relative to the total number of input variables. The improvement was achieved by replacing the arbitrary subsample variable size with an empirical Bayesian estimate. The proposed and existing methods were illustrated on five high-dimensional microarray datasets derived from colon, breast, lymphoma and Central Nervous System (CNS) cancer tumours. Results from the data analysis revealed that EBRF provides higher accuracy, sensitivity, specificity and Area Under the Receiver Operating Characteristic Curve (AUC) than RF on most of the datasets used.
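The parameter the abstract refers to as the "subsample variable size" is RF's mtry (the number of candidate variables tried at each split), conventionally set to the arbitrary default sqrt(p). A minimal sketch of this baseline, using scikit-learn's `RandomForestClassifier` on synthetic high-dimensional data (not the paper's microarray sets, and not the EBRF estimator itself, whose derivation is given in the paper):

```python
# Sketch of the baseline RF classifier whose mtry (max_features) parameter
# EBRF replaces with an empirical Bayesian estimate. Data are synthetic:
# p = 500 inputs of which only 10 are informative, mimicking the
# high-dimensional, few-relevant-variables setting the paper targets.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Conventional arbitrary rule: mtry = floor(sqrt(p)). EBRF would substitute
# an empirical Bayesian estimate of the relevant-variable count here.
mtry_default = int(np.sqrt(X.shape[1]))  # 22 for p = 500

rf = RandomForestClassifier(n_estimators=500, max_features=mtry_default,
                            random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
```

When relevant variables are scarce, most sqrt(p)-sized candidate subsets contain only noise variables, which is the accuracy loss EBRF's data-driven choice of `max_features` is designed to mitigate.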
 Hernández B, Raftery AE, Pennington SR & Parnell AC (2018), Bayesian additive regression trees using Bayesian model averaging. Statistics and Computing 28(4), 869-890.
 Hastie T, Tibshirani R & Wainwright M (2015), Statistical learning with sparsity: the lasso and generalizations. CRC press.
 Sarica A, Cerasa A & Quattrone A (2017), Random Forest Algorithm for the Classification of Neuroimaging Data in Alzheimer's Disease: A Systematic Review. Frontiers in Aging Neuroscience 9, 329.
 Gündüz N & Fokoue E (2017), Predictive performances of implicitly and explicitly robust classifiers on high dimensional data. Communications Faculty of Sciences University of Ankara-Series A1 Mathematics and Statistics 66(2), 14-36.
 Banjoko AW, Yahya WB, Garba MK, Olaniran OR, Dauda KA & Olorede KO (2015), Efficient Support Vector Machine Classification of Diffuse Large B-Cell Lymphoma and Follicular Lymphoma mRNA Tissue Samples. Annals. Computer Science Series 13(2), 69-79.
 Olaniran OR, Olaniran SF, Yahya WB, Banjoko AW, Garba MK, Amusa LB & Gatta NF (2016), Improved Bayesian Feature Selection and Classification Methods Using Bootstrap Prior Techniques. Annals. Computer Science Series 14(2), 46-52.
 Olaniran OR & Abdullah MAA (2017), Gene Selection for Colon Cancer Classification using Bayesian Model Averaging of Linear and Quadratic Discriminants, Journal of Science and Technology: Special Issue on the Application of Science and Technology 9(3), 140-144.
 Breiman L (2001), Random forests. Machine Learning 45, 5-32.
 Kapelner A & Bleich J (2015), Prediction with missing data via Bayesian additive regression trees. Canadian Journal of Statistics 43(2), 224-239.
 Breiman L, Friedman J, Stone CJ & Olshen RA (1984), Classification and regression trees. CRC press.
 Genuer R, Poggi JM & Tuleau C (2008), Random Forests: some methodological insights. arXiv preprint arXiv:0811.3619.
 Huang BF & Boutros PC (2016), The parameter sensitivity of random forests. BMC bioinformatics 17(1), 331.
 Robnik-Šikonja M (2004, September), Improving random forests. In European Conference on Machine Learning (pp. 359-370). Springer, Berlin, Heidelberg.
 Boinee P, De Angelis A & Foresti GL (2005), Meta random forests. International Journal of Computational Intelligence 2(3), 138-147.
 Chaudhary A, Kolhe S & Kamal R (2016), An improved random forest classifier for multi-class classification. Information Processing in Agriculture 3(4), 215-222.
 Hwang K, Lee K & Park S (2017), Variable selection methods for multi-class classification using signomial function. Journal of the Operational Research Society 68(9), 1117-1130.
 Chipman HA, George EI & McCulloch RE (2010), BART: Bayesian additive regression trees. The Annals of Applied Statistics 4(1), 266-298.
 Pratola MT (2016), Efficient Metropolis-Hastings proposal mechanisms for Bayesian regression tree models. Bayesian Analysis 11(3), 885-911.
 Chipman HA, George EI & McCulloch RE (1998), Bayesian CART model search. Journal of the American Statistical Association 93(443), 935-948.
 Taddy M, Chen CS, Yu J & Wyle M (2015), Bayesian and empirical Bayesian forests. arXiv preprint arXiv:1502.02312.
 Efron B (1979), Bootstrap methods: Another look at the jackknife. Annals of Statistics 7, 1-26.
 Rubin DB (1981), The Bayesian bootstrap. The Annals of Statistics 9(1), 130-134.
 Efron B (2012), Large-scale inference: empirical Bayes methods for estimation, testing, and prediction (Vol. 1). Cambridge University Press.
 Olaniran OR & Yahya WB (2017), Bayesian Hypothesis Testing of Two Normal Samples using Bootstrap Prior Technique. Journal of Modern Applied Statistical Methods 16(2), 618-638.
 Wang S, Zhang J & Lawson AB (2016), A Bayesian normal mixture accelerated failure time spatial model and its application to prostate cancer. Statistical Methods in Medical Research 25(2), 793-806.
 Yahya WB, Olaniran OR & Ige SO (2014), On Bayesian Conjugate Normal Linear Regression and Ordinary Least Square Regression Methods: A Monte Carlo Study. Ilorin Journal of Science 1(1), 216-227.
 Olaniran OR & Abdullah MAA (2018), Bayesian Analysis of Extended Cox Model with Time-Varying Covariates using Bootstrap Prior. Journal of Modern Applied Statistical Methods, Accepted. In press.
 Peskun P (2016), Some Relationships and Properties of the Hypergeometric Distribution. arXiv preprint arXiv:1610.07554.
 Dyer D & Pierce RL (1993), On the choice of the prior distribution in hypergeometric sampling. Communications in Statistics-Theory and Methods 22(8), 2125-2146.
 Olaniran OR & Abdullah MAA (2018), BayesRandomForest: An R Implementation of Bayesian Random Forest for Regression Analysis of High-Dimensional Data. Romanian Statistical Review 66(1), 95-102.
 Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D & Levine AJ (1999), Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences 96(12), 6745-6750.
 Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME & Allen JC (2002), Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415(6870), 436-442.
 West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R & Nevins JR (2001), Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences 98(20), 11462-11467.
 Gravier E, Pierron G, Vincent-Salomon A, Gruel N, Raynal V, Savignoni A & Fourquet A (2010), A prognostic DNA signature for T1T2 node-negative breast cancer patients. Genes, Chromosomes and Cancer 49(12), 1125-1134.
 Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC & Ray TS (2002), Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature medicine 8(1), 68-74.
 Fawcett T (2006), An introduction to ROC analysis. Pattern recognition letters 27(8), 861-874.
 Ramey JA (2016), datamicroarray: Collection of Data Sets for Classification. https://github.com/ramhiser/datamicroarray, http://ramhiser.com.
 Yahya WB, Olaniran OR, Garba MK, Oloyede I, Banjoko AW, Dauda KA & Olorede KO (2016), A Test Procedure for Ordered Hypothesis of Population Proportions Against a Control. Turkiye Klinikleri Journal of Biostatistics 8(1).
 Jamil SAM, Abdullah MAA, Kek SL, Olaniran OR & Amran SE (2017, September), Simulation of parametric model towards the fixed covariate of right censored lung cancer data. In Journal of Physics: Conference Series 890(1), p. 012172.
 Adeleke AO, Samsudin NA, Mustapha A & Nawi NM (2017), Comparative analysis of text classification algorithms for automated labelling of Quranic verses. International Journal on Advanced Science, Engineering and Information Technology 7, 1419-1427.