Published

2025-01-01

Small Samples, New Viruses, Inputs for Decision-Making and Methodology: Bootstrap and Smote

Muestras pequeñas, nuevos virus, insumos para la toma de decisiones y metodología: Bootstrap y SMOTE

DOI:

https://doi.org/10.15446/rce.v48n1.113819

Keywords:

Death predictors, Early warning systems, New viruses, Small sample methodologies, SMOTE. (en)
Bootstrapping, Metodologías para muestras pequeñas, Nuevos virus, Predictores de mortalidad, Sistemas de alerta temprana, SMOTE. (es)

Authors

  • Wilson Fernando Rodríguez Gómez Universidad de La Sabana https://orcid.org/0000-0002-6326-5386
  • Martha Misas-Arango Universidad de La Sabana
  • Catherine Pereira-Villa Universidad de La Sabana
  • José E. Gomez-Gonzalez City University of New York

This study presents a methodology that combines resampling and oversampling techniques to address the challenges of limited and imbalanced data, specifically in the context of viral emergencies such as the COVID-19 pandemic. Using the bootstrap and SMOTE (Synthetic Minority Over-sampling Technique), the study conducts a retrospective analysis of COVID-19 patients and identifies those at higher risk of mortality. The proposed methodology improves the accuracy of predictions when data are limited and supports better decision-making in clinical triage systems. Applying these methods allows early and accurate identification of high-risk individuals, which helps optimize resource allocation and the timing of medical interventions. The results indicate that this combination of statistical techniques can strengthen health-system responses to new viral threats and provides a robust foundation for informed decision-making in medical emergencies.
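To make the combination of the two techniques concrete, the following is a minimal sketch in Python and not the authors' implementation. The abstract does not specify how the bootstrap and SMOTE are chained, so the sketch assumes one plausible arrangement: a stratified bootstrap in which the minority class is oversampled with SMOTE-style interpolation inside each replicate. The function names (`smote`, `bootstrap_with_smote`, `age_gap`), the choice of `k=3` neighbours, the summary statistic, and the toy patient records are all hypothetical and chosen only for illustration.

```python
import math
import random
import statistics


def smote(minority, n_new, k=3, rng=None):
    """Return n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority points")
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> nb
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic


def bootstrap_with_smote(majority, minority, stat, n_boot=200, seed=0):
    """Stratified bootstrap (an assumption of this sketch): each class is
    resampled with replacement, then the minority class is oversampled with
    SMOTE up to the majority size before the statistic is computed."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        maj_b = [rng.choice(majority) for _ in majority]
        min_b = [rng.choice(minority) for _ in minority]
        min_b += smote(min_b, len(maj_b) - len(min_b), rng=rng)
        replicates.append(stat(maj_b, min_b))
    return replicates


def age_gap(survivors, deceased):
    """Difference in mean age (first feature) between the two classes."""
    return (statistics.mean(p[0] for p in deceased)
            - statistics.mean(p[0] for p in survivors))


# Made-up toy records: (age in years, oxygen saturation in %).
survivors = [(34, 97), (41, 96), (29, 98), (52, 95), (47, 96),
             (38, 97), (60, 94), (45, 96)]
deceased = [(71, 88), (66, 90), (79, 85)]

# Five synthetic minority records bring the classes to 8 vs. 8.
synthetic = smote(deceased, n_new=5, rng=random.Random(1))

# Percentile bootstrap interval for the age gap across 200 replicates.
reps = sorted(bootstrap_with_smote(survivors, deceased, age_gap))
lower, upper = reps[int(0.025 * len(reps))], reps[int(0.975 * len(reps)) - 1]
print(f"95% percentile interval for the age gap: [{lower:.1f}, {upper:.1f}]")
```

In a real analysis the statistic would more likely be a coefficient or a predicted risk from a classifier fitted to each balanced replicate; the mean age gap is used here only to keep the sketch dependency-free. Because synthetic points are interpolations, they always lie between existing minority records and never extrapolate beyond the observed range.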

How to Cite

APA

Rodríguez Gómez, W. F., Misas-Arango, M., Pereira-Villa, C. and Gomez-Gonzalez, J. E. (2025). Small Samples, New Viruses, Inputs for Decision-Making and Methodology: Bootstrap and Smote. Revista Colombiana de Estadística, 48(1), 99–115. https://doi.org/10.15446/rce.v48n1.113819
