A survey of machine learning techniques for genomic diseases and data sets

  • Abstract

    Since the earliest days of medical science, practitioners have struggled to visualize and analyze complex biological data. In the current era of genome-wide association studies (GWAS), interest in understanding the genotypes underlying complex diseases is growing rapidly. High-throughput molecular data now provide ample information about the whole genome and have popularized computational tools in genomics. Because genomic data are very large and high-dimensional, they cannot be analyzed effectively with conventional techniques; machine learning instead offers efficient computational techniques that improve with experience and can handle such vast, complex data sets. This article outlines machine learning techniques for examining genomic data on diseases, as well as epigenetic and proteomic data.





  • Keywords

    Machine Learning; ANN; KNN; RF; SVM; Genomic; Mutation.





Article ID: 11016
DOI: 10.14419/ijet.v7i4.11016

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.