Published
2012-09-01
Sobre la entropía del español escrito
On the Entropy of Written Spanish
Keywords:
Law of large numbers, Shannon entropy, stochastic process, Zipf's law
This paper discusses the entropy of the Spanish language using a practical
method that computes the entropy of a text by direct computer processing.
As an example of application, thirty samples of Spanish text totaling
22.8 million characters are analyzed. Symbol lengths from n = 1 to 500 were
considered for both words and characters. The probability distribution of
the symbols was calculated using direct computer processing together with
the law of large numbers. An empirical relation between the entropy, the
length of the text (in characters), and the number of different words in the
text is presented. Statistical properties of the Spanish language, when
viewed as the output of a stochastic source, are also analyzed, such as
origin shift invariance, ergodicity, and the asymptotic equipartition
property.
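To make the idea of computing entropy by direct processing concrete, here is a minimal Python sketch. It is not the authors' implementation: the function name ngram_entropy, the use of overlapping (sliding-window) n-grams, and the sample sentence are all assumptions made for illustration. It estimates each symbol's probability by its relative frequency, which is the estimate that the law of large numbers justifies for long texts, and then applies Shannon's formula H = -sum(p * log2(p)).

```python
import math
from collections import Counter


def ngram_entropy(symbols, n=1):
    """Empirical Shannon entropy, in bits per n-gram symbol.

    `symbols` is a string (character symbols) or a list of words
    (word symbols). Overlapping n-grams are an assumption of this
    sketch, not something stated in the abstract.
    """
    total = len(symbols) - n + 1
    if total <= 0:
        raise ValueError("sequence is shorter than n")
    # Count every n-gram; tuples work for both strings and word lists.
    counts = Counter(tuple(symbols[i:i + n]) for i in range(total))
    # Relative frequency stands in for probability.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical sample text, for illustration only.
sample = "el perro y el gato"
char_entropy = ngram_entropy(sample, n=1)          # characters as symbols
word_entropy = ngram_entropy(sample.split(), n=1)  # words as symbols
```

For large n most n-grams occur only once, so this plain frequency estimate approaches log2 of the number of n-grams in the text and says little about the source itself; the sketch is only meant to show the counting step.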
License
Copyright (c) 2012 Revista Colombiana de Estadística

This work is licensed under a Creative Commons Attribution 4.0 International License.






