Full Model Selection Problem and Pipelines for Time-Series Databases: Contrasting Population-Based and Single-point Search Metaheuristics

Nancy Pérez-Castro; Héctor Gabriel Acosta-Mesa; Efrén Mezura-Montes; Nicandro Cruz-Ramírez

doi:10.15446/ing.investig.v41n3.79308

Published: 2021-07-06


Keywords: Full Model Selection, Temporal Databases, Time-Series


Authors

Nancy Pérez-Castro, University of Veracruz, https://orcid.org/0000-0002-1831-6148
Héctor Gabriel Acosta-Mesa, University of Veracruz, https://orcid.org/0000-0002-0935-7642
Efrén Mezura-Montes, University of Veracruz, https://orcid.org/0000-0002-1565-5267
Nicandro Cruz-Ramírez, University of Veracruz, https://orcid.org/0000-0002-0708-9875

Abstract

The increasing production of temporal data, especially time series, has motivated the extraction of valuable knowledge to understand phenomena or to support decision-making. As the availability of algorithms to process data increases, the problem of choosing the most suitable one becomes more prevalent. This problem is known as Full Model Selection (FMS), which consists of finding an appropriate set of methods and optimizing their hyperparameters to perform a set of structured tasks as a pipeline. Multiple approaches based on metaheuristics have been proposed to address this problem, in which automated pipelines are built to perform multiple tasks without much dependence on user knowledge. Most of these approaches propose pipelines to process non-temporal data. Motivated by this, this paper proposes an architecture for finding optimized pipelines for time-series tasks. A micro-differential evolution algorithm (µ-DE, a population-based metaheuristic) with different variants and continuous encoding is compared against a local search (LS, a single-point search) with binary and mixed encoding. Multiple experiments are carried out to analyze the performance of each approach on ten time-series databases. The final results suggest that the µ-DE approach with the rand/1/bin variant is useful for finding competitive pipelines without sacrificing performance, whereas a local search with binary encoding achieves the lowest misclassification error rates but has the highest computational cost during the training stage.
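To make the rand/1/bin idea concrete, the following is a minimal sketch of a differential evolution loop with a very small population and continuous encoding. It is not the authors' implementation: the population size, `f`, `cr`, the gene layout in `decode`, the method lists, and the `toy_error` objective (a stand-in for a cross-validated misclassification rate) are all assumptions chosen for illustration.

```python
import random

# Hypothetical method catalogues; the paper's actual search space is not given here.
SMOOTHERS = ["none", "moving_average", "savitzky_golay"]
CLASSIFIERS = ["knn", "svm", "tree"]


def decode(x):
    """Map a continuous vector in [0, 1]^3 to an (assumed) pipeline description."""
    s = SMOOTHERS[min(int(x[0] * len(SMOOTHERS)), len(SMOOTHERS) - 1)]
    c = CLASSIFIERS[min(int(x[1] * len(CLASSIFIERS)), len(CLASSIFIERS) - 1)]
    k = 1 + int(x[2] * 9)  # an integer hyperparameter in 1..10
    return s, c, k


def toy_error(x):
    """Stand-in objective (assumption): replace with a real validation error."""
    return sum((xi - 0.5) ** 2 for xi in x)


def micro_de_rand_1_bin(objective, dim, pop_size=6, f=0.5, cr=0.9,
                        generations=200, seed=0):
    """Minimise `objective` over [0, 1]^dim with DE/rand/1/bin."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    fitness = [objective(x) for x in pop]
    for _ in range(generations):
        new_pop, new_fit = [], []
        for i in range(pop_size):
            # rand/1: three distinct random individuals, all different from i.
            r1, r2, r3 = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees one mutated gene
            trial = []
            for j in range(dim):
                if rng.random() < cr or j == j_rand:
                    # bin: binomial crossover takes the mutant gene.
                    v = pop[r1][j] + f * (pop[r2][j] - pop[r3][j])
                    v = min(1.0, max(0.0, v))  # clip to the search bounds
                else:
                    v = pop[i][j]
                trial.append(v)
            t_fit = objective(trial)
            # Greedy one-to-one selection between target and trial.
            if t_fit <= fitness[i]:
                new_pop.append(trial)
                new_fit.append(t_fit)
            else:
                new_pop.append(pop[i])
                new_fit.append(fitness[i])
        pop, fitness = new_pop, new_fit
    best = min(range(pop_size), key=fitness.__getitem__)
    return pop[best], fitness[best]


best_vector, best_error = micro_de_rand_1_bin(toy_error, dim=3)
print(decode(best_vector), best_error)
```

In a real setting, `objective` would decode the vector, build the corresponding pipeline, and return its validation error, so each evaluation is expensive; this is one reason a small population can be attractive.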




License

This work is licensed under a Creative Commons Attribution 4.0 International License.



Ingeniería e Investigación
