https://doi.org/10.15446/rce.v37n2spe.47945

Visualization of Skewed Data: A Tool in R

Visualización de datos sesgados: una herramienta en R

RAYDONAL OSPINA [1], ANTONIO MARCOS LARANGEIRAS [2], ALEJANDRO C. FRERY [3]

[1] Universidade Federal de Pernambuco, Departamento de Estatística, Recife, Brazil. Professor. Email: rayospina@gmail.com
[2] Universidade Federal de Alagoas, Laboratório de Computação Científica e Análise Numérica, Maceió, Brazil. MSc Candidate. Email: amlarangeiras@gmail.com
[3] Universidade Federal de Alagoas, Laboratório de Computação Científica e Análise Numérica, Maceió, Brazil. Professor. Email: acfrery@gmail.com


Abstract

After discussing the main characteristics of the histogram and of several variations of the boxplot, this work presents a visualization tool specifically tailored to skewed data. The idea is to use various types of boxplots (the classical one, which is tuned for skewness of the data, the shifting boxplot, and the box-percentile plot), the violin plot, and the histogram overlaid with a nonparametric density estimate. The plots are presented so that additional information can be extracted from each one. Using example data from synthetic aperture radar images, we show that a good deal of information can be extracted by inspecting the output. We provide an implementation in R based on functions already available.

Key words: Exploratory Data Analysis, Skewed Data, Boxplot, Violin Plot, Visualization.


Summary

After discussing the main characteristics of the histogram and of several variations of the boxplot, we present a visualization tool specifically designed for handling skewed data. The idea is to use several types of boxplots (the classical one, which is adapted to account for the skewness of the data, the shifting boxplot, and the box-percentile plot), the violin plot, and the histogram with a nonparametric estimate of the density. The plots are presented so as to facilitate the extraction of additional information. Examples from synthetic aperture radar images show that a good deal of information can be extracted. An implementation in R, based on currently available functions, is provided.

Keywords: exploratory data analysis, boxplot, skewed data, violin plots, visualization.





[Received May 2014. Accepted October 2014]

This article can be cited in LaTeX using the following BibTeX reference:

@ARTICLE{RCEv37n2a08,
    AUTHOR  = {Ospina, Raydonal and Larangeiras, Antonio Marcos and Frery, Alejandro C.},
    TITLE   = {{Visualization of Skewed Data: A Tool in R}},
    JOURNAL = {Revista Colombiana de Estadística},
    YEAR    = {2014},
    volume  = {37},
    number  = {2},
    pages   = {399-417}
}