Nested and Repeated Cross Validation for Classification Model With High-Dimensional Data
Keywords:
Area under ROC curve, Cross-validation, Elastic net, Random forest, Support vector machine
With the advent of high-throughput technologies, high-dimensional datasets are increasingly available. These data have not only opened new insights into biological systems but also posed analytical challenges. One important problem is the selection of an informative feature subset and the prediction of future outcomes. It is crucial that models are not overfitted and give accurate predictions on new data. In addition, reliable identification of informative features with high predictive power (feature selection) is of interest in clinical settings. We propose a two-step framework for feature selection and classification model construction that uses a nested and repeated cross-validation method. We evaluated our approach on both simulated data and two publicly available gene expression datasets. The proposed method showed better predictive accuracy for new cases than the standard cross-validation method.
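The core idea of nested cross-validation can be illustrated with a minimal sketch (this is not the authors' exact two-step pipeline; the synthetic dataset, the elastic-net classifier, and the tuning grid are placeholder assumptions): an inner cross-validation loop tunes hyperparameters, while an outer, repeated cross-validation loop estimates predictive performance on folds that the inner loop never sees, which is what prevents the optimistic bias of standard cross-validation.

```python
# Sketch of nested, repeated cross-validation (illustrative only, not the
# authors' exact method): inner CV tunes hyperparameters, outer repeated
# CV estimates out-of-sample AUC on folds unseen by the tuning step.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     cross_val_score)

# Placeholder high-dimensional data: many features, few informative ones.
X, y = make_classification(n_samples=100, n_features=200,
                           n_informative=10, random_state=0)

# Inner loop: choose the elastic-net penalty strength by grid search.
inner_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)
model = GridSearchCV(
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, max_iter=2000),
    param_grid={"C": [0.01, 0.1, 1.0]},
    cv=inner_cv, scoring="roc_auc")

# Outer loop: repeated stratified CV; each outer test fold is held out
# from the inner-loop tuning, so the AUC estimate is nearly unbiased.
outer_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=2)
scores = cross_val_score(model, X, y, cv=outer_cv, scoring="roc_auc")
print(round(scores.mean(), 3))
```

Repeating the outer loop (here 3 repeats of 5 folds, giving 15 AUC estimates) reduces the variance that comes from any single random partition of a small sample.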
License
Copyright (c) 2020 Revista Colombiana de Estadística

This work is licensed under a Creative Commons Attribution 4.0 International License.
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).