Journal of Animal Breeding and Genomics (J Anim Breed Genom)
Indexed in KCI
OPEN ACCESS, PEER REVIEWED
pISSN 1226-5543
eISSN 2586-4297
Research Article

Selection of informative markers using machine learning approaches and genome-wide association studies to improve genomic prediction in Hanwoo cattle: a simulation study

1Department of Bio-Big Data, Chungnam National University, Daejeon 34134, Republic of Korea
2Department of Bio-AI Convergence, Chungnam National University, Daejeon 34134, Republic of Korea
3Division of Animal & Dairy Science, Chungnam National University, Daejeon 34134, Republic of Korea

Correspondence to Seung Hwan Lee, E-mail: slee46@cnu.ac.kr

Volume 8, Number 1, Pages 17-32, March 2024.
Journal of Animal Breeding and Genomics 2024, 8(1), 17-32. https://doi.org/10.12972/jabng.20240103
Received on 07 March, 2024, Revised on 17 March, 2024, Accepted on 28 March, 2024, Published on 31 March, 2024.
Copyright © 2024 Korean Society of Animal Breeding and Genetics.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0).

ABSTRACT

This study compares Gradient Boosting Machine (GBM), Extreme Gradient Boosting (XGBoost), and genome-wide association studies (GWAS) for selecting optimal subsets of single nucleotide polymorphisms (SNPs) for genomic prediction in cattle. Data were simulated with QMSim software for 6,000 animals and 47,841 SNPs, comprising 43,633 polygenic markers and 4,208 quantitative trait loci (QTL). Genomic prediction was performed by best linear unbiased prediction (BLUP) using the BLUPF90 program. Prediction accuracy was computed in three forms: Empirical all SNPs, Empirical QTL, and the theoretical accuracy, Accuracy PEV. Among the three models, GBM gave the highest Empirical all SNPs accuracy (0.79), followed by XGBoost (0.77) and GWAS (0.76). The Empirical QTL accuracy was almost equal across the three models. GWAS gave the highest theoretical accuracy (0.93), whereas GBM and XGBoost reached 0.86 and 0.85, respectively. Our results indicate that the three models performed comparably in genomic prediction; however, the subsets selected by GBM and by GWAS yielded higher prediction accuracies than the whole SNP set. The number of QTL selected, as a proportion of the total number of SNPs, was highest for GWAS. These observations can be validated using real data, which could enable further optimization of the analysis process.
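
The abstract names empirical and PEV-based accuracies without giving formulas. The sketch below illustrates one common convention in simulation studies and is not taken from the article: empirical accuracy is assumed to be the Pearson correlation between predicted and true breeding values, and theoretical accuracy is assumed to be sqrt(1 - PEV / sigma2_g), where PEV is the prediction error variance of an animal and sigma2_g the additive genetic variance. The function names and all numeric values are hypothetical toy inputs, not results from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def accuracy_from_pev(pev, sigma2_g):
    """Assumed theoretical accuracy for one animal: sqrt(1 - PEV / sigma2_g)."""
    return math.sqrt(1.0 - pev / sigma2_g)


# Toy values for illustration only (not data from the article).
true_bv = [0.5, -1.2, 0.3, 1.1, -0.7]   # simulated true breeding values
gebv = [0.4, -0.9, 0.1, 0.8, -0.5]      # predicted genomic breeding values

empirical = pearson(gebv, true_bv)                      # empirical accuracy
theoretical = accuracy_from_pev(pev=0.36, sigma2_g=1.0)  # PEV-based accuracy

print(f"empirical accuracy:   {empirical:.3f}")
print(f"theoretical accuracy: {theoretical:.3f}")
```

The empirical measure needs true breeding values, which are known only in simulated data, whereas the PEV-based measure can be computed from the prediction model alone. This may explain why the two can rank marker-selection methods differently.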

KEYWORDS

Extreme gradient boosting, Genome-wide association studies, Genomic prediction, Gradient boosting machine, Quantitative trait loci

ACKNOWLEDGEMENTS

This research was funded by the project "Deep learning modeling for genomic prediction to increase accuracy of breeding value in Korean and Israeli cattle populations" (2022K1A3A1A31093393), supported by the National Research Foundation of Korea.

CONFLICT OF INTERESTS

No potential conflict of interest relevant to this article is reported.

