Three Similarity Measures between One-Dimensional Data Sets
DOI: https://doi.org/10.15446/rce.v37n1.44359
Keywords: Data mining, Interval distance, Kernel methods, Non-parametric tests.
1. Universidad de Sevilla, Facultad de Ciencias Económicas y Empresariales, Departamento de Economía Aplicada I, Sevilla, Spain. Senior lecturer. Email: luisgon@us.es
2. Universidad de Sevilla, Facultad de Ciencias Económicas y Empresariales, Departamento de Economía Aplicada I, Sevilla, Spain. Senior lecturer. Email: gavi@us.es
3. Universidad de Sevilla, Facultad de Ciencias Económicas y Empresariales, Departamento de Economía Aplicada I, Sevilla, Spain. Senior lecturer. Email: velasco@us.es
Based on an interval distance, three functions are given to quantify similarities between one-dimensional data sets by using first-order statistics. The Glass Identification Database is used to illustrate how these similarity measures can be used to analyse a data set prior to its classification and/or to exclude dimensions. Furthermore, a non-parametric hypothesis test is designed to show how these similarity measures, based on random samples from two populations, can be used to decide whether those populations are identical. Two comparative analyses are also carried out against a parametric test and a non-parametric test. The new non-parametric test performs reasonably well in comparison with the classic tests.
Key words: Data mining, Interval distance, Kernel methods, Non-parametric tests.
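The abstract only summarizes the approach; the precise interval distance, the three similarity measures, and the test statistic are defined in the full text. As a rough, hypothetical sketch of the workflow in Python (summarize each one-dimensional sample by an interval built from first-order statistics, compare the intervals through an interval distance, and assess equality of the two populations with a permutation-style non-parametric test), one might write something like the following. The particular interval construction, distance, and resampling scheme used here are illustrative assumptions, not the paper's definitions.

import numpy as np

def summary_interval(x, k=1.0):
    # Summarize a one-dimensional sample by an interval built from
    # first-order statistics (here, assumed to be mean +/- k standard deviations).
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std(ddof=1)
    return (m - k * s, m + k * s)

def interval_distance(a, b):
    # Assumed distance between intervals: Euclidean distance between
    # their (midpoint, radius) representations.
    ca, ra = (a[0] + a[1]) / 2.0, (a[1] - a[0]) / 2.0
    cb, rb = (b[0] + b[1]) / 2.0, (b[1] - b[0]) / 2.0
    return float(np.hypot(ca - cb, ra - rb))

def similarity(x, y, k=1.0):
    # Map the interval distance into a similarity score in (0, 1].
    d = interval_distance(summary_interval(x, k), summary_interval(y, k))
    return 1.0 / (1.0 + d)

def permutation_test(x, y, n_perm=2000, seed=0):
    # Non-parametric test of H0: both samples come from the same population.
    # The observed similarity is compared with the similarities obtained after
    # randomly reassigning the pooled observations to two groups.
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    observed = similarity(x, y)
    pooled = np.concatenate([x, y])
    n = len(x)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if similarity(perm[:n], perm[n:]) <= observed:
            count += 1
    # Small p-value: the samples are less similar than expected under H0.
    return observed, (count + 1) / (n_perm + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 1.0, 80)
    y = rng.normal(0.5, 1.0, 80)
    sim, p = permutation_test(x, y)
    print(f"similarity = {sim:.3f}, permutation p-value = {p:.3f}")

Under these assumptions, a small p-value indicates that the two samples are less similar than would be expected if they came from the same population.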
Full text available in PDF
This article can be cited in LaTeX using the following BibTeX reference:
@ARTICLE{RCEv37n1a06,
AUTHOR = {Gonzalez-Abril, Luis and Gavilan, Jose M. and Velasco Morente, Francisco},
TITLE = {{Three Similarity Measures between One-Dimensional Data Sets}},
JOURNAL = {Revista Colombiana de Estadística},
YEAR = {2014},
volume = {37},
number = {1},
pages = {79-94}
}
References
Anguita, D., Ridella, S. & Sterpi, D. (2004), A new method for multiclass support vector machines, in ‘Proceedings of the IEEE IJCNN2004’, Budapest, Hungary.
Bach, F. R. & Jordan, M. I. (2003), ‘Kernel independent component analysis’, Journal of Machine Learning Research 3, 1–48.
Bache, K. & Lichman, M. (2013), ‘UCI machine learning repository’, http://archive.ics.uci.edu/ml, University of California, Irvine, School of Information and Computer Sciences.
Burrell, Q. L. (2005), ‘Measuring similarity of concentration between different informetric distributions: Two new approaches’, Journal of the American Society for Information Science and Technology 56(7), 704–714.
Chiu, D., Wong, A. & Cheung, B. (1991), Information discovery through hierarchical maximum entropy discretization and synthesis, in G. Piatetsky-Shapiro & W. J. Frawley, eds, ‘Knowledge Discovery in Databases’, MIT Press, pp. 125–140.
Cristianini, N. & Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press.
González-Abril, L., Cuberos, F. J., Velasco, F. & Ortega, J. A. (2009), ‘Ameva: An autonomous discretization algorithm’, Expert Systems with Applications 36(3), 5327–5332.
González-Abril, L., Velasco, F., Gavilán, J. & Sánchez-Reyes, L. (2010), ‘The similarity between the square of the coefficient of variation and the Gini index of a general random variable’, Revista de métodos cuantitativos para la economía y la empresa 10, 5–18.
González, L. & Gavilan, J. M. (2001), Una metodología para la construcción de histogramas. Aplicación a los ingresos de los hogares andaluces, in ‘XIV Reunión ASEPELT-Spain’.
González, L., Velasco, F., Angulo, C., Ortega, J. & Ruiz, F. (2004), ‘Sobre núcleos, distancias y similitudes entre intervalos’, Inteligencia Artificial 8(23), 113–119.
González, L., Velasco, F. & Gasca, R. (2005), ‘A study of the similarities between topics’, Computational Statistics 20(3), 465–479.
Hartigan, J. (1975), Clustering Algorithms, Wiley, New York.
Hsu, C.-W. & Lin, C.-J. (2002), ‘A comparison of methods for multiclass support vector machines’, IEEE Transactions on Neural Networks 13(2), 415–425.
Lee, J., Kim, M. & Lee, Y. (1993), ‘Information retrieval based on conceptual distance in is-a hierarchies’, Journal of Documentation 49(2), 188–207.
Lin, D. (1998), An information-theoretic definition of similarity, in ‘Proceedings of the Fifteenth International Conference on Machine Learning (ICML 1998)’, pp. 296–304.
Nielsen, J., Ghugre, N. & Panigrahy, A. (2004), ‘Affine and polynomial mutual information coregistration for artifact elimination in diffusion tensor imaging of newborns’, Magnetic Resonance Imaging 22, 1319–1323.
Parthasarathy, S. & Ogihara, M. (2000), ‘Exploiting dataset similarity for distributed mining’, http://ipdps.eece.unm.edu/2000/datamine/18000400.pdf.
Rada, R., Mili, H., Bicknell, E. & Blettner, M. (1989), ‘Development and application of a metric on semantic nets’, IEEE Transactions on Systems, Man, and Cybernetics 19(1), 17–30.
Salazar, D. A., Vélez, J. I. & Salazar, J. C. (2012), ‘Comparison between SVM and logistic regression: Which one is better to discriminate?’, Revista Colombiana de Estadística 35(2), 223–237.
Sheridan, R., Feuston, B., Maiorov, V. & Kearsley, S. (2004), ‘Similarity to molecules in the training set is a good discriminator for prediction accuracy in QSAR’, Journal of Chemical Information and Modeling 44, 1912–1928.
Schölkopf, B. & Smola, A. J. (2002), Learning with Kernels, MIT Press.
Vapnik, V. (1998), Statistical Learning Theory, John Wiley & Sons, Inc.
Wong, A. & Chiu, D. (1987), ‘Synthesizing statistical knowledge from incomplete mixed-mode data’, IEEE Transactions on Pattern Analysis and Machine Intelligence 9(6), 796–805.
CrossRef Cited-by
1. Divyansh Agarwal, Nancy R. Zhang. (2019). Semblance: An empirical similarity kernel on probability spaces. Science Advances, 5(12) https://doi.org/10.1126/sciadv.aau9630.
2. Yanjie Tong, Iris Tien. (2022). Time-Series Prediction in Nodal Networks Using Recurrent Neural Networks and a Pairwise-Gated Recurrent Unit Approach. ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part A: Civil Engineering, 8(2) https://doi.org/10.1061/AJRUA6.0001221.
3. Yurii Hodlevskyi, Tetiana Vakaliuk. (2024). Information Technology for Education, Science, and Technics. Lecture Notes on Data Engineering and Communications Technologies. 221, p.95. https://doi.org/10.1007/978-3-031-71801-4_8.
License
Copyright (c) 2014 Revista Colombiana de Estadística

This work is licensed under a Creative Commons Attribution 4.0 International License.
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).