A survey of machine learning techniques for genomic diseases and data sets
-
https://doi.org/10.14419/ijet.v7i4.11016
Received date: April 3, 2018
Accepted date: July 2, 2018
Published date: April 7, 2019
-
Machine Learning, ANN, KNN, RF, SVM, Genomic, Mutation.
Abstract
Since the early days of medical science, practitioners have struggled to visualize and analyze complex biological data. In the current era of genome-wide association studies (GWAS), interest in understanding the genotypes underlying complex diseases is growing rapidly. High-throughput molecular data now provide ample information about the whole genome and have popularized computational tools in genomics. Because genomic data are very large and high-dimensional, conventional techniques cannot analyze them effectively; machine learning offers efficient computational techniques that improve with experience and can handle these vast, complex data sets. This article outlines machine learning techniques for examining genomic data on diseases, as well as epigenetic and proteomic data.
-
How to Cite
Phogat, M., & Dharmender Kumar, D. (2019). A survey of machine learning techniques for genomic diseases and data sets. International Journal of Engineering and Technology, 7(4), 5533-5538. https://doi.org/10.14419/ijet.v7i4.11016
