Published

2014-01-01

Three Similarity Measures between One-Dimensional Data Sets

Tres medidas de similitud entre conjuntos de datos unidimensionales

DOI:

https://doi.org/10.15446/rce.v37n1.44359

Keywords:

Data mining, Interval distance, Kernel methods, Non-parametric tests. (en)
distancia entre intervalos, métodos del núcleo, minería de datos, tests no paramétricos (es)

Authors

  • Jose M. Gavilan Universidad de Sevilla
  • Francisco Velasco Morente Universidad de Sevilla
  • Luis Gonzalez-Abril Universidad de Sevilla

Based on an interval distance, three functions are given to quantify similarities between one-dimensional data sets by using first-order statistics. The Glass Identification Database is used to illustrate how these similarity measures can be used to analyse a data set prior to its classification and/or to exclude dimensions. Furthermore, a non-parametric hypothesis test is designed to show how these similarity measures, computed from random samples from two populations, can be used to decide whether the populations are identical. Two comparative analyses are also carried out, one with a parametric test and one with a non-parametric test. The new non-parametric test performs reasonably well in comparison with these classic tests.
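The abstract does not state the three similarity measures or the test statistic; those are defined in the full text. The following is therefore only a minimal sketch of the general idea of a similarity-based two-sample test, under loudly stated assumptions: each sample is summarised as a (centre, radius) interval built from its mean and standard deviation, the interval distance is plain Euclidean distance between those pairs, the similarity is exp(-distance), and the null distribution is obtained by permuting group labels. The names `to_interval`, `interval_distance`, `similarity` and `permutation_p_value`, and every one of these choices, are hypothetical stand-ins and not the authors' definitions.

```python
import math
import random
import statistics


def to_interval(sample):
    """Summarise a sample as (centre, radius) = (mean, population std).

    Assumption: illustrative summary only, not the paper's construction.
    """
    return statistics.fmean(sample), statistics.pstdev(sample)


def interval_distance(a, b):
    """Euclidean distance between two intervals given as (centre, radius)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def similarity(x, y):
    """Map the interval distance into (0, 1]; 1 means identical summaries."""
    return math.exp(-interval_distance(to_interval(x), to_interval(y)))


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Estimate P(similarity <= observed) under exchangeable group labels.

    A small p-value suggests the two samples are less similar than
    random relabelling of the pooled data would typically produce.
    """
    rng = random.Random(seed)
    observed = similarity(x, y)
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if similarity(pooled[:n_x], pooled[n_x:]) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    demo_rng = random.Random(1)
    a = [demo_rng.gauss(0.0, 1.0) for _ in range(50)]
    b = [demo_rng.gauss(1.5, 1.0) for _ in range(50)]
    print("similarity:", similarity(a, b))
    print("p-value:", permutation_p_value(a, b))
```

A permutation scheme is used here because it needs no distributional assumption, which matches the non-parametric spirit of the abstract; the test in the paper may calibrate its statistic differently.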

LUIS GONZALEZ-ABRIL (1), JOSE M. GAVILAN (2), FRANCISCO VELASCO MORENTE (3)

(1) Universidad de Sevilla, Facultad de Ciencias Económicas y Empresariales, Departamento de Economía Aplicada I, Sevilla, Spain. Senior lecturer. Email: luisgon@us.es
(2) Universidad de Sevilla, Facultad de Ciencias Económicas y Empresariales, Departamento de Economía Aplicada I, Sevilla, Spain. Senior lecturer. Email: gavi@us.es
(3) Universidad de Sevilla, Facultad de Ciencias Económicas y Empresariales, Departamento de Economía Aplicada I, Sevilla, Spain. Senior lecturer. Email: velasco@us.es

Full text available in PDF



[Received July 2013. Accepted January 2014]

This article can be cited in LaTeX using the following BibTeX reference:

@ARTICLE{RCEv37n1a06,
    AUTHOR  = {Gonzalez-Abril, Luis and Gavilan, Jose M. and Velasco Morente, Francisco},
    TITLE   = {{Three Similarity Measures between One-Dimensional Data Sets}},
    JOURNAL = {Revista Colombiana de Estadística},
    YEAR    = {2014},
    volume  = {37},
    number  = {1},
    pages   = {79-94}
}

CrossRef Cited-by

CrossRef citations: 3

1. Divyansh Agarwal, Nancy R. Zhang. (2019). Semblance: An empirical similarity kernel on probability spaces. Science Advances, 5(12) https://doi.org/10.1126/sciadv.aau9630.

2. Yanjie Tong, Iris Tien. (2022). Time-Series Prediction in Nodal Networks Using Recurrent Neural Networks and a Pairwise-Gated Recurrent Unit Approach. ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part A: Civil Engineering, 8(2) https://doi.org/10.1061/AJRUA6.0001221.

3. Yurii Hodlevskyi, Tetiana Vakaliuk. (2024). Information Technology for Education, Science, and Technics. Lecture Notes on Data Engineering and Communications Technologies. 221, p.95. https://doi.org/10.1007/978-3-031-71801-4_8.
