Nested and Repeated Cross Validation for Classification Model With High-Dimensional Data

Yi Zhong; Jianghua He; Prabhakar Chalise

doi:10.15446/rce.v43n1.80000

Published

2020-01-01

Validación cruzada anidada y repetida para el modelo de clasificación con datos de alta dimensión

DOI:

https://doi.org/10.15446/rce.v43n1.80000

Keywords:

Area under ROC curve, Cross-validation, Elastic net, Random forest, Support vector machine (en)
Área bajo la curva ROC, Validación cruzada, Red elástica, Bosque aleatorio, Máquina de vectores de soporte (es)

Authors

Yi Zhong, University of Kansas Medical Center, Department of Biostatistics and Data Science
Jianghua He, University of Kansas Medical Center, Department of Biostatistics and Data Science
Prabhakar Chalise, University of Kansas Medical Center, Department of Biostatistics and Data Science

Abstract (en)

With the advent of high-throughput technologies, high-dimensional datasets are increasingly available. This has not only opened up new insights into biological systems but also posed analytical challenges. One important problem is the selection of an informative feature subset and the prediction of future outcomes. It is crucial that models are not overfitted and give accurate results with new data. In addition, reliable identification of informative features with high predictive power (feature selection) is of interest in clinical settings. We propose a two-step framework for feature selection and classification model construction, which utilizes a nested and repeated cross-validation method. We evaluated our approach using both simulated data and two publicly available gene expression datasets. The proposed method showed comparatively better predictive accuracy for new cases than the standard cross-validation method.

Abstract (es)

Con la llegada de las tecnologías de alto rendimiento, los conjuntos de datos de alta dimensión están cada vez más disponibles. Esto no sólo ha abierto una nueva visión acerca de los sistemas biológicos, sino que también plantea desafíos analíticos. Un problema importante es la selección de subconjuntos de variables y la predicción de resultados futuros. Es crucial que los modelos no sean sobreajustados y que den resultados precisos con nuevos datos. Además, la identificación confiable de variables informativas con alto poder predictivo (selección de características) es de interés en entornos clínicos. Proponemos un procedimiento de dos etapas para la selección de variables y la construcción de modelos de clasificación, el cual utiliza un método de validación cruzada anidada y repetida. Evaluamos nuestro enfoque utilizando tanto datos simulados como dos conjuntos de datos de expresión génica disponibles públicamente. El método propuesto mostró una precisión predictiva comparativamente mejor para casos nuevos en comparación con el método estándar de validación cruzada.
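
As a rough illustration of what "nested and repeated cross-validation" means in general, the sketch below may help. It is not the authors' two-step framework: the abstract does not give the fold counts, the selection rule, or the scoring used, so the names `k_fold_indices`, `select_candidate`, `nested_repeated_cv`, `fit_threshold`, and `accuracy`, the fold counts, and the synthetic data are all assumptions made for illustration. A toy single-feature threshold classifier and plain accuracy stand in for the elastic net, random forest, and support vector machine models and the area under the ROC curve named in the keywords. The point it shows is structural: the inner loop chooses a feature using only the outer training data, and the outer test fold is used once, for scoring.

```python
import random
from statistics import mean

def k_fold_indices(indices, k, rng):
    """Shuffle the given indices and split them into k roughly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def select_candidate(X, y, train_idx, candidates, fit, score, inner_k, rng):
    """Inner loop: pick the candidate with the best mean validation score."""
    inner_folds = k_fold_indices(train_idx, inner_k, rng)

    def cv_score(candidate):
        scores = []
        for val_idx in inner_folds:
            val = set(val_idx)
            fit_idx = [i for i in train_idx if i not in val]
            scores.append(score(fit(X, y, fit_idx, candidate), X, y, val_idx))
        return mean(scores)

    return max(candidates, key=cv_score)

def nested_repeated_cv(X, y, candidates, fit, score,
                       outer_k=5, inner_k=3, repeats=3, seed=0):
    """Return one held-out score per (repeat, outer fold) pair."""
    rng = random.Random(seed)
    n = len(y)
    outer_scores = []
    for _ in range(repeats):  # each repeat reshuffles the outer partition
        for test_idx in k_fold_indices(range(n), outer_k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(n) if i not in held_out]
            # Selection sees only the outer training data.
            best = select_candidate(X, y, train_idx, candidates,
                                    fit, score, inner_k, rng)
            model = fit(X, y, train_idx, best)
            outer_scores.append(score(model, X, y, test_idx))
    return outer_scores

def fit_threshold(X, y, idx, feature):
    """Toy classifier: cut one feature at the midpoint of the class means."""
    m0 = mean(X[i][feature] for i in idx if y[i] == 0)
    m1 = mean(X[i][feature] for i in idx if y[i] == 1)
    return feature, (m0 + m1) / 2, m1 >= m0

def accuracy(model, X, y, idx):
    """Fraction of samples in idx that the toy classifier labels correctly."""
    feature, cut, one_is_high = model
    hits = sum(((X[i][feature] > cut) == one_is_high) == (y[i] == 1)
               for i in idx)
    return hits / len(idx)

# Synthetic data (assumed): feature 0 carries the signal, features 1-4 are noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(60)]
X = [[label * 2.0 + data_rng.gauss(0, 0.5)]
     + [data_rng.gauss(0, 1) for _ in range(4)] for label in y]

scores = nested_repeated_cv(X, y, candidates=range(5),
                            fit=fit_threshold, score=accuracy)
print(len(scores), round(mean(scores), 3))
```

Here each entry of `scores` comes from a fold that played no part in choosing the feature, which is why the average is a less optimistic estimate than the inner validation scores used for selection.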

How to Cite

APA

Zhong, Y., He, J. and Chalise, P. (2020). Nested and Repeated Cross Validation for Classification Model With High-Dimensional Data. Revista Colombiana de Estadística, 43(1), 103–125. https://doi.org/10.15446/rce.v43n1.80000

CrossRef Cited-by

18

1. Roberto Dailey, Sam Bertelson, Jinki Kim, Dragan Djurdjanovic. (2024). Virtual Metrology of Critical Dimensions in Plasma Etch Processes Using Entire Optical Emission Spectrum. IEEE Transactions on Semiconductor Manufacturing, 37(3), p.363. https://doi.org/10.1109/TSM.2024.3416844.

2. Weijie Ding, Dianshu Liu. (2023). 露天矿数码电子雷管逐孔起爆条件下质点峰值振速预测. Earth Science-Journal of China University of Geosciences, 48(5), p.2000. https://doi.org/10.3799/dqkx.2022.144.

3. Frank Westad. (2021). A retrospective look at cross model validation and its applicability in vibrational spectroscopy. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, 255, p.119676. https://doi.org/10.1016/j.saa.2021.119676.

4. Ahmad Roumiani, Abbas Mofidi. (2022). Predicting ecological footprint based on global macro indicators in G-20 countries using machine learning approaches. Environmental Science and Pollution Research, 29(8), p.11736. https://doi.org/10.1007/s11356-021-16515-5.

5. John Noel Victorino, Yuko Shibata, Sozo Inoue, Tomohiro Shibata. (2021). Predicting Wearing-Off of Parkinson’s Disease Patients Using a Wrist-Worn Fitness Tracker and a Smartphone: A Case Study. Applied Sciences, 11(16), p.7354. https://doi.org/10.3390/app11167354.

6. Isabelle Ayoub, Bethany J. Wolf, Linyu Geng, Huijuan Song, Aastha Khatiwada, Betty P. Tsao, Jim C. Oates, Brad H. Rovin. (2022). Prediction models of treatment response in lupus nephritis. Kidney International, 101(2), p.379. https://doi.org/10.1016/j.kint.2021.11.014.

7. Jacqui Bergner, David Wallin, Sylvia Yang, John Rybczyk. (2024). Using Drone-Captured Imagery and a Digital Elevation Model to Differentiate Eelgrass Species: Padilla Bay, Washington. Journal of Coastal Research, 41(1). https://doi.org/10.2112/JCOASTRES-D-24-00014.1.

8. Patrik Gilley, Ke Zhang, Neman Abdoli, Youkabed Sadri, Laura Adhikari, Kar-Ming Fung, Yuchen Qiu. (2024). Utilizing a Pathomics Biomarker to Predict the Effectiveness of Bevacizumab in Ovarian Cancer Treatment. Bioengineering, 11(7), p.678. https://doi.org/10.3390/bioengineering11070678.

9. Jinlong He, Jialiang Ren, Guangming Niu, Aishi Liu, Qiong Wu, Shenghui Xie, Xueying Ma, Bo Li, Peng Wang, Jing Shen, Jianlin Wu, Yang Gao. (2022). Multiparametric MR radiomics in brain glioma: models comparation to predict biomarker status. BMC Medical Imaging, 22(1). https://doi.org/10.1186/s12880-022-00865-8.

10. Christian Blüthgen, Miriam Patella, André Euler, Bettina Baessler, Katharina Martini, Jochen von Spiczak, Didier Schneiter, Isabelle Opitz, Thomas Frauenfelder, Omar Sultan Al-Kadi. (2021). Computed tomography radiomics for the prediction of thymic epithelial tumor histology, TNM stage and myasthenia gravis. PLOS ONE, 16(12), p.e0261401. https://doi.org/10.1371/journal.pone.0261401.

11. Tomas Mendoza, Chia-Hsuan Lee, Chien-Hua Huang, Tien-Lung Sun. (2021). Random Forest for Automatic Feature Importance Estimation and Selection for Explainable Postural Stability of a Multi-Factor Clinical Test. Sensors, 21(17), p.5930. https://doi.org/10.3390/s21175930.

12. Ahmed Arafa, Marwa Radad, Mohammed Badawy, Nawal El-Fishawy. (2021). Regularized Logistic Regression Model for Cancer Classification. 2021 38th National Radio Science Conference (NRSC), p.251. https://doi.org/10.1109/NRSC52299.2021.9509831.

13. Matias F. Lucero, Carlos M. Hernández, Ana J. P. Carcedo, Ariel Zajdband, Pierre C. Guillevic, Rasmus Houborg, Kevin Hamilton, Ignacio A. Ciampitti. (2024). Enhancing Alfalfa Biomass Prediction: An Innovative Framework Using Remote Sensing Data. Remote Sensing, 16(18), p.3379. https://doi.org/10.3390/rs16183379.

14. Fatemeh Salehi, Luis I. Lopera Gonzalez, Sara Bayat, Arnd Kleyer, Dario Zanca, Alexander Brost, Georg Schett, Bjoern M. Eskofier. (2024). Machine Learning Prediction of Treatment Response to Biological Disease-Modifying Antirheumatic Drugs in Rheumatoid Arthritis. Journal of Clinical Medicine, 13(13), p.3890. https://doi.org/10.3390/jcm13133890.

15. Nils Brandenstein. (2022). Going beyond simplicity: Using machine learning to predict belief in conspiracy theories. European Journal of Social Psychology, 52(5-6), p.910. https://doi.org/10.1002/ejsp.2859.

16. Giuseppe D’Aniello, Giancarlo Fortino, Matteo Gaeta, Zia Ur Rehman. (2024). A data-driven approach for context definition in situation-aware wearable computing systems. 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE), p.1525. https://doi.org/10.1109/CASE59546.2024.10711334.

17. Nils Brandenstein, Kathrin Ackermann, Nicole Aeschbach, Jan Rummel. (2023). The key determinants of individual greenhouse gas emissions in Germany are mostly domain-specific. Communications Earth & Environment, 4(1). https://doi.org/10.1038/s43247-023-01092-x.

18. Anmol Saraf, Anupama Kowli. (2024). Appliance Ownership Prediction With Smart Meter Data. The 15th ACM International Conference on Future and Sustainable Energy Systems, p.633. https://doi.org/10.1145/3632775.3661989.

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).

Revista Colombiana de Estadística
