UNA REVISIÓN DE LOS ALGORITMOS DE PARTICIÓN MÁS COMUNES EN EL ANÁLISIS DE CONGLOMERADOS: UN ESTUDIO COMPARATIVO

Susana A. Leiva-Valdebenito; Francisco J. Torres-Avilés

Published

2010-07-01

UNA REVISIÓN DE LOS ALGORITMOS DE PARTICIÓN MÁS COMUNES EN EL ANÁLISIS DE CONGLOMERADOS: UN ESTUDIO COMPARATIVO

A REVIEW OF THE MOST COMMON PARTITION ALGORITHMS IN CLUSTER ANALYSIS: A COMPARATIVE STUDY

Keywords:

algoritmos de conglomerados, medida de similaridad, simulación (es)
Clustering algorithm, Similarity measure, Simulation (en)

Downloads

Authors

Susana A. Leiva-Valdebenito Universidad de Santiago de Chile
Francisco J. Torres-Avilés Universidad de Santiago de Chile

Abstract (es)
Abstract (en)

Este estudio está enfocado en comparar diversos métodos de partición del análisis de conglomerados, usualmente conocidos como métodos no jerárquicos. En este trabajo, se realizan estudios de simulación para comparar los resultados obtenidos al implementar los algoritmos k-medias, k-medianas, PAM y Clara cuando los datos son multivariados y de tipo continuo. Adicionalmente, se efectúa un estudio de simulación con el fin de comparar algoritmos de partición para datos cualitativos, confrontando la eficiencia de los algoritmos PAM y k-modas. La eficiencia de los algoritmos se compara usando el índice de Rand ajustado y la tasa de correcta clasificación. Finalmente, se aplican los algoritmos a bases de datos reales, las cuales poseen clases predefinidas.

This study is oriented to compare several partition methods in the context of cluster analysis, which are also called non hierarchical methods. In this work, a simulation study is performed to compare the results obtained from the implementation of the algorithms k-means, k-medians, PAM and CLARA when continuous multivariate information is available. Additionally, a study of simulation is presented to compare partition algorithms qualitative information, comparing the efficiency of the PAM and k-modes algorithms. The efficiency of the algorithms is compared using the Adjusted Rand Index and the correct classification rate. Finally, the algorithms are applied to real databases with predefined classes.

Untitled Document Una revisión de los algoritmos de partición más comunes en el análisis de conglomerados: un estudio comparativo

A Review of the Most Common Partition Algorithms in Cluster Analysis: A Comparative Study SUSANA A. LEIVA-VALDEBENITO1, FRANCISCO J. TORRES-AVILÉS2

1Universidad de Santiago de Chile, Facultad de Ciencia, Departamento de Matemática y Ciencia de la Computación, Santiago, Chile. Estudiante de Ingeniería Estadística. Email: susanaleivav@gmail.com
2Universidad de Santiago de Chile, Facultad de Ciencia, Departamento de Matemática y Ciencia de la Computación, Santiago, Chile. Profesor asistente. Email: francisco.torres@usach.cl

Resumen

Este estudio está enfocado en comparar diversos métodos de partición del análisis de conglomerados, usualmente conocidos como métodos no jerárquicos. En este trabajo, se realizan estudios de simulación para comparar los resultados obtenidos al implementar los algoritmos k-medias, k-medianas, PAM y Clara cuando los datos son multivariados y de tipo continuo. Adicionalmente, se efectúa un estudio de simulación con el fin de comparar algoritmos de partición para datos cualitativos, confrontando la eficiencia de los algoritmos PAM y k-modas. La eficiencia de los algoritmos se compara usando el índice de Rand ajustado y la tasa de correcta clasificación. Finalmente, se aplican los algoritmos a bases de datos reales, las cuales poseen clases predefinidas.

Palabras clave: algoritmos de conglomerados, medida de similaridad, simulación.

Abstract

This study is oriented to compare several partition methods in the context of cluster analysis, which are also called non hierarchical methods. In this work, a simulation study is performed to compare the results obtained from the implementation of the algorithms k-means, k-medians, PAM and CLARA when continuous multivariate information is available. Additionally, a study of simulation is presented to compare partition algorithms qualitative information, comparing the efficiency of the PAM and k-modes algorithms. The efficiency of the algorithms is compared using the Adjusted Rand Index and the correct classification rate. Finally, the algorithms are applied to real databases with predefined classes.

Key words: Clustering algorithm, Similarity measure, Simulation.

Texto completo disponible en PDF

Referencias

1. Anderberg, M. (1973), Cluster Analysis for Applications, Academic Press, New York, United States.

2. Anderson, B., Gross, D., Musicant, D., Ritz, A., Smith, T. & Steinberg, L. (2006), Adapting K-Medians to Generate Normalized Cluster Centers, 'Proceedings of the 2006 SIAM International Conference on Data Mining', Bethesda, p. 165-175.

3. Andreopoulos, B., An, A. & Wang, X. (2006), Clustering Mixed Numerical and Low Quality Categorical Data: Significance Metrics on a Yeast Example, 'ACM SIGMOD Workshop on Information Quality in Information Systems, IQIS 2005 Statistics Clustering Session', Baltimore, p. 87-98.

4. Der, G. & Everitt, B. S. (2006), Statistical Analysis of Medical Data using SAS, CRC Press, Boca Raton, United States.

5. Gower, J. C. (1971), 'A General Coefficient of Similarity and Some of its Properties', Biometrics 27, 623-637.

6. Hae, P. & Chi, J. (2009), 'A simple and Fast Algorithm for K-medoids Clustering', Expert Systems with Applications 36, 3336-3341.

7. Han, J., Kamber, M. & Tung, A. K. H. (2001), Spatial Clustering Methods in Data Mining: A Survey, 'Geographic Data Mining and Knowledge Discovery', Taylor & Francis.

8. Hartigan, J. (1975), Clustering Algorithms, John Wiley & Sons, Nueva York, United States.

9. He, Z., Deng, S. & Xu, X. (2005), Improving K-modes Algorithm Considering Frequencies of Attribute Values in Mode, Vol. 3801 of Lecture Notes in Computer Science, Springer Berlin - Heidelberg, p. 157-162.

10. He, Z., Xu, X. & Deng, S. (2007), 'Attribute Value Weighting in K-Modes Clustering', Computer Science e-Prints 1, 1-15.

11. Huang, Z. (1998), 'Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values', Data Mining and Knowledge Discovery 2, 283-304.

12. Hubert, L. & Arabie, P. (1985), 'Comparing Partitions', Journal of Classification 2, 193-218.

13. Kamber, M. & Han, J. (2006), Data Mining Concepts and Techniques, Morgan Kaufman Publishers, San Francisco, United States.

14. Kaufman, L. & Rousseeuw, P. (1987), Clustering by Means of Medoids, 'Statistical Data Analysis Based on the L1 Norm and Related Methods', North-Holland, p. 405-416.

15. Kaufman, L. & Rousseeuw, P. (1990), Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, New York, United States.

16. Leiva, S. (2008), Algoritmos de partición en el análisis de conglomerados: un estudio comparativo, Trabajo de Grado, Universidad de Santiago de Chile, Chile.

17. MacQueen, J. (1967), Some Methods for classification and Analysis of Multivariate Observations, 'Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability', Vol. 1, Symposium on mathematics, , , p. 281-297.

18. McGarigal, K., Cushman, S. & Stafford, S. (2000), Multivariate Statistics for Wildlife and Ecology Research, Springer Verlag, New York, United States.

19. Ng, M. K., Li, M. J., Huang, J. Z. & Zengyou, H. (2007), 'On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm', , IEEE Transactions on Pattern Analysis and Machine Intelligence 29(3), 503-507.

20. Ng, R. & Han, J. (1994), Efficient and Effective Clustering Methods for Spatial Data Mining, 'Proceeding of the 20th International Conference on Very Large Databases', p. 144-155.

21. Peña, D. (2002), Análisis de datos multivariantes, McGraw-Hill, Madrid, España.

22. Quinn, G. & Keough, M. (2002), Experimental Design and Data Analysis for Biologists, Cambridge University Press, Cambridge, UK.

23. R Development Core Team, (2010), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. *http://www.R-project.org

24. SAS Institute Inc., (2008), SAS/STAT 9.2 User's Guide, SAS Publishing, Cary, Carolina del Norte.

25. Velmurugan, T. & Santhanam, T. (2010), 'Computational Complexity between K-Means and K-Medoids Clustering Algorithms for Normal and Uniform Distributions of Data Points', Journal of Computer Science 6(3), 363-368.

[Recibido en mayo de 2010. Aceptado en octubre de 2010]

Este artículo se puede citar en LaTeX utilizando la siguiente referencia bibliográfica de BibTeX:

@ARTICLE{RCEv33n2a09,
    AUTHOR = {Leiva-Valdebenito, Susana A. and Torres-Avilés, Francisco J.},
    TITLE = {{Una revisión de los algoritmos de partición más comunes en el análisis de conglomerados: un estudio comparativo}},
    JOURNAL = {Revista Colombiana de Estadística},
    YEAR    = {2010},
    volume = {33},
    number = {2},
    pages = {321-339}
}

References

Abonyi, J. & Feil, B. (2007), Clustering Analysis for Data Mining and

System Identification, Birkhauser Verlag AG, Berlin, Germany.

Anderberg, M. (1973), Cluster Analysis for Applications, Academic Press, New York, United States.

Anderson, B., Gross, D., Musicant, D., Ritz, A., Smith, T. & Steinberg, L. (2006), Adapting K-Medians to Generate Normalized Cluster Centers, in 'Procee-dings of the 2006 SIAM International Conference on Data Mining', Bethesda, pp. 165-175.

Andreopoulos, B., An, A. & Wang, X. (2006), Clustering Mixed Numerical

and Low Quality Categorical Data: Significante Metrics on a Yeast Example, in `ACM SIGMOD Workshop on Information Quality in Information

Systems, IQIS 2005 Statistics Clustering Session', Baltimore, pp. 87-98.

Asuncion, A. & Newman, D. J. (2007), PCI machine learning repository'.

*http://www.ics.ucledu/—mleam/MLRepository-.html

Der, G. & Everitt, B. S. (2006), Statistical Analysis of Medical Data using SAS, CRC Press, Boca Raton, United States.

Gower, J. C. (1971), 'A General Coefficient of Similarity and Some of its Proper-ties', Biometrics 27, 623-637.

Hae, P. & Chi, J. (2009), 'A simple and Fast Algorithm for K-medoids Clustering', Expert Systems with Applications 36, 3336-3341.

Han, J., Kamber, M. & Tung, A. K. H. (2001), Spatial Clustering Methods in Data Mining: A Survey, in H. J. Miller & J. Han, eds, `Geographic Data Mining and Knowledge Discovery', Taylor & Francis.

Hartigan, J. (1975), Clustering Algorithms, John Wiley & Sons, Nueva York, United States.

He, Z., Deng, S. & Xu, X. (2005), Improving K-modes Algorithm Considering Frequencies of Attribute Values in Mode, Vol. 3801 of Lecture Notes in Computer Science, Springer Berlin - Heidelberg, pp. 157-162.

He, Z., Xu, X. & Deng, S. (2007), `Attribute Value Weighting in K-Modes

Clustering', Computer Science e-Prints 1, 1-15.

Huang, Z. (1998), Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values', Data Mining and Knowledge Discovery 2, 283-304.

Hubert, L Sc Arabie, P. (1985), `Comparing Partitions', Journal of

Classification 2, 193-218.

Kamber, M. & Han, J. (2006), Data Mining Concepts and Techniques, Morgan Kaufman Publishers, San Francisco, United States.

Kaufman, L. & Rousseeuw, P. (1987), Clustering by Means of Medoids, in D. Y., ed., Statistical Data Analysis Based on the L1 Norm and Related Methods', North-Holland, pp. 405-416.

Kaufman, L. & Rousseeuw, P. (1990), Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, New York, United States.

Leiva, S. (2008), Algoritmos de partición en el análisis de conglomerados: un estudio comparativo, Trabajo de grado, Universidad de Santiago de Chile, Chile.

MacQueen, J. (1967), Some Methods for classification and Analysis of

Multivariate Observations, in 'Proceedings of 5-th Berkeley Symposium on

Mathematical Statistics and Probability', Vol. 1, Sy-mposium on

mathematics, pp. 281-297.

McGarigal, K., Cushman, S. & Stafford, S. (2000), Multivariate Statistics for Wild-life and Ecology Research, Springer Verlag, New York, United States.

Ng, M. K., Li, M. J., Huang, J. Z. & Zengyou, H. (2007), 'On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm'„ IEEE Transactions on Pattern Analysis and Machine Intelligence 29(3), 503-507.

Ng, R. & Han, J. (1994), Efficient and Effective Clustering Methods for Spatial Data Mining, in 'Proceeding of the 20th International Conference on Very Large Databases', pp. 144-155.

Peña, D. (2002), Análisis de datos multivariantes, McGraw-Hill, Madrid, España.

Quinn, G. & Keough, M. (2002), Experimental Design and Data Analysis for Bio-logists, Cambridge University- Press, Cambridge, UK.

R Development Core Team (2010), R: A Language and Environment for

Statistical Computing, R Foundation for Statistical Computing, Vienna,

Austria. ISBN 3-900051-07-0. *http://www.R-project.org

SAS Institute Inc. (2008), SAS/STAT 9.2 User's Guide, SAS Publishing,

Cary, Carolina del Norte.

Velmurugan, T. & Santhanam, T. (2010), `Computationa1 Complexity between K-Means and K-Medoids Clustering Algorithms for Normal and Uniform Distributions of Data Points', Journal of Computer Science 6(3), 363-368.

How to Cite

APA

Leiva-Valdebenito, S. A. and Torres-Avilés, F. J. (2010). UNA REVISIÓN DE LOS ALGORITMOS DE PARTICIÓN MÁS COMUNES EN EL ANÁLISIS DE CONGLOMERADOS: UN ESTUDIO COMPARATIVO. Revista Colombiana de Estadística, 33(2), 321–339. https://revistas.unal.edu.co/index.php/estad/article/view/29880

ACM

[1]

Leiva-Valdebenito, S.A. and Torres-Avilés, F.J. 2010. UNA REVISIÓN DE LOS ALGORITMOS DE PARTICIÓN MÁS COMUNES EN EL ANÁLISIS DE CONGLOMERADOS: UN ESTUDIO COMPARATIVO. Revista Colombiana de Estadística. 33, 2 (Jul. 2010), 321–339.

ACS

(1)

Leiva-Valdebenito, S. A.; Torres-Avilés, F. J. UNA REVISIÓN DE LOS ALGORITMOS DE PARTICIÓN MÁS COMUNES EN EL ANÁLISIS DE CONGLOMERADOS: UN ESTUDIO COMPARATIVO. Rev. colomb. estad. 2010, 33, 321-339.

ABNT

LEIVA-VALDEBENITO, S. A.; TORRES-AVILÉS, F. J. UNA REVISIÓN DE LOS ALGORITMOS DE PARTICIÓN MÁS COMUNES EN EL ANÁLISIS DE CONGLOMERADOS: UN ESTUDIO COMPARATIVO. Revista Colombiana de Estadística, [S. l.], v. 33, n. 2, p. 321–339, 2010. Disponível em: https://revistas.unal.edu.co/index.php/estad/article/view/29880. Acesso em: 18 apr. 2024.

Chicago

Leiva-Valdebenito, Susana A., and Francisco J. Torres-Avilés. 2010. “UNA REVISIÓN DE LOS ALGORITMOS DE PARTICIÓN MÁS COMUNES EN EL ANÁLISIS DE CONGLOMERADOS: UN ESTUDIO COMPARATIVO”. Revista Colombiana De Estadística 33 (2):321-39. https://revistas.unal.edu.co/index.php/estad/article/view/29880.

Harvard

Leiva-Valdebenito, S. A. and Torres-Avilés, F. J. (2010) “UNA REVISIÓN DE LOS ALGORITMOS DE PARTICIÓN MÁS COMUNES EN EL ANÁLISIS DE CONGLOMERADOS: UN ESTUDIO COMPARATIVO”, Revista Colombiana de Estadística, 33(2), pp. 321–339. Available at: https://revistas.unal.edu.co/index.php/estad/article/view/29880 (Accessed: 18 April 2024).

IEEE

[1]

S. A. Leiva-Valdebenito and F. J. Torres-Avilés, “UNA REVISIÓN DE LOS ALGORITMOS DE PARTICIÓN MÁS COMUNES EN EL ANÁLISIS DE CONGLOMERADOS: UN ESTUDIO COMPARATIVO”, Rev. colomb. estad., vol. 33, no. 2, pp. 321–339, Jul. 2010.

MLA

Leiva-Valdebenito, S. A., and F. J. Torres-Avilés. “UNA REVISIÓN DE LOS ALGORITMOS DE PARTICIÓN MÁS COMUNES EN EL ANÁLISIS DE CONGLOMERADOS: UN ESTUDIO COMPARATIVO”. Revista Colombiana de Estadística, vol. 33, no. 2, July 2010, pp. 321-39, https://revistas.unal.edu.co/index.php/estad/article/view/29880.

Turabian

Leiva-Valdebenito, Susana A., and Francisco J. Torres-Avilés. “UNA REVISIÓN DE LOS ALGORITMOS DE PARTICIÓN MÁS COMUNES EN EL ANÁLISIS DE CONGLOMERADOS: UN ESTUDIO COMPARATIVO”. Revista Colombiana de Estadística 33, no. 2 (July 1, 2010): 321–339. Accessed April 18, 2024. https://revistas.unal.edu.co/index.php/estad/article/view/29880.

Vancouver

1.

Leiva-Valdebenito SA, Torres-Avilés FJ. UNA REVISIÓN DE LOS ALGORITMOS DE PARTICIÓN MÁS COMUNES EN EL ANÁLISIS DE CONGLOMERADOS: UN ESTUDIO COMPARATIVO. Rev. colomb. estad. [Internet]. 2010 Jul. 1 [cited 2024 Apr. 18];33(2):321-39. Available from: https://revistas.unal.edu.co/index.php/estad/article/view/29880

Download Citation

Article abstract page views

322

Downloads

Download data is not yet available.

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).

	IBN Publindex
	El Índice Bibliográfico Nacional Publindex es un sistema colombiano para la clasificación, actualización, escalafonamiento y certificación de las publicaciones científicas y tecnológicas. Es regido por COLCIENCIAS y el ICFES en Colombia.
	SciELO Colombia
	SciELO Colombia es una librería virtual para América Latina, el Caribe, España y Portugal, fue creada por FAPESP en el año de 1997 en Sao Pablo Brasil, actualmente en Colombia es gestionada por la Universidad Nacional de Colombia.
	REDIB
	Portal donde se muestran las revistas electrónicas españolas y latinoamericanas de acceso abierto (Open Access). Fue creado en España.
	Scopus
	Scopus es una base de datos bibliográfica de resúmenes y citas de artículos de revistas científicas. Cubre aproximadamente 19.500 títulos de más de 5.000 editores internacionales, incluyendo la cobertura de de 16.500 revistas.
	Latindex
	Latindex es producto de la cooperación de una red de instituciones latinoamericanas que funcionan de manera coordinada para reunir y diseminar información bibliográfica sobre las publicaciones científicas seriadas producidas en la región.
	Dialnet
	Dialnet es un portal de difusión de la producción científica hispana que inició su funcionamiento en el año 2001 especializado en ciencias humanas y sociales. Su base de datos, de acceso libre, fue creada por la Universidad de La Rioja (España).
	Zentralblatt Math
	Zentralblatt MATH (zbMATH) es el servicio de resumen y revisión más completo y de más larga duración del mundo en matemática pura y aplicada. Está editado por la European Mathematical Society (EMS), la Academia de Ciencias y Humanidades de Heidelberg y FIZ Karlsruhe. El trabajo editorial lo realiza la oficina de Berlín de FIZ Karlsruhe que, como miembro de la Asociación Leibniz, es una empresa sin fines de lucro y una organización reconocida de interés público. zbMATH es distribuido por Springer Nature.

Revista Colombiana de Estadística

Published

UNA REVISIÓN DE LOS ALGORITMOS DE PARTICIÓN MÁS COMUNES EN EL ANÁLISIS DE CONGLOMERADOS: UN ESTUDIO COMPARATIVO

A REVIEW OF THE MOST COMMON PARTITION ALGORITHMS IN CLUSTER ANALYSIS: A COMPARATIVE STUDY

Keywords:

Downloads

Authors

References

How to Cite

APA

ACM

ACS

ABNT

Chicago

Harvard

IEEE

MLA

Turabian

Vancouver

Download Citation

Article abstract page views

Downloads

License

Make a Submission

Information for Authors

Scimago Journal & Country Rank (SJR)

Keywords