Web text corpus extraction system for linguistic tasks
Sistema de extracción de cuerpos de texto de la web para tareas lingüísticas
Keywords:
Web Corpus, crawler, unsupervised language learning, concurrent programming (en)corpus web, crawler, aprendizaje no supervisado de lenguajes, programación concurrente (es)
Internet content, used as text corpus for natural language learning, offers important characteristics for such task, like its huge volume, being permanently up-to-date with linguistic variants and having low time and resource costs regarding the traditional way that text is built for natural language machine learning tasks. This paper describes a system for the automatic extraction of large bodies of text from the Internet as a valuable tool for such learning tasks. A concurrent programming-based, hardware-use optimisation strategy significantly improving extraction performance is also presented. The strategies incorporated into the system for maximising hardware resource exploitation, thereby reducing extraction time are presented, as are extendibility (supporting digital-content formats) and adaptability (regarding how the system cleanses content for obtaining pure natural language samples). The experimental results obtained after processing one of the biggest Spanish domains on the internet, are presented (i.e. es.wikipedia.org). Such results are used for presenting initial conclusions about the validity and applicability of corpus directly extracted from Internet as morphological or syntactical learning input.
En este artículo se describe un sistema desarrollado para la extracción de grandes cuerpos de texto de Internet, teniendo como motivación el valor que ofrecen los ejemplos de lenguaje natural disponibles en la red para las tareas de aprendizaje no supervisado de dichos naturales, dado por características como su enorme volumen, permanente actualización respecto de las alteraciones del lenguaje, y bajo costo, en tiempo y recursos, en cuanto a los mecanismos tradicionales de construcción de corpus para esas tareas de aprendizaje. Se presentan las estrategias incorporadas al sistema con el fin de maximizar el aprovechamiento de los recursos de hardware y así reducir los tiempos de extracción, al igual que se presentan las características de extensibilidad para los formatos soportados, y adaptabilidad respecto a la manera como el sistema limpia los contenidos para obtener muestras de lenguaje natural puras. Al final del artículo se presentan los resultados experimentales obtenidos con uno de los dominios de contenido en español más grande de Internet: es.wikipedia.org, a través de los cuales se concluye sobre la validez y aplicabilidad de un corpus extraído directamente de la Internet para un eventual proceso de aprendizaje de morfología o sintaxis.
Downloads
References
Chomsky, N., Knowledge of Language: Its Nature, Origin, and Use., Praeger, 1986.
Clark, A., Unsupervised Language Acquisition: Theory and Practice., Tesis presentada a la Universidad Génova, para optar al grado de Doctor of Philosophy, Dicembre, 2002.
Parekh, R., Honavar, V., Grammar inference, automata induction, and language acquisition., 2000.
Buitelaar, P., Cimiano, P., Magnini, B., Ontology Learning from Text: Methods., Evaluation and Applications, Vol. 123 of Frontiers in Artificial Intelligence, IOS Press, 2005.
Navigli, R., Velardi, P., Gangemi, A., Ontology learning and its application to automated terminology translation., IEEE Intelligent Systems, Vol. 18, No. 1, 2003, pp. 22¬31.
Zhou, L., Ontology learning: state of the art and open issues., Information Technology and Management archive, Vol. 8, No. 3, September, 2007, pp. 241¬252.
Church, K. W., Mercer, R. L., Introduction to the special issue on computational linguistics using large corpora., Comput. Linguist., Vol. 19, No. 1, 1993, pp. 1¬24.
Marianne Hundt, N. N., Biewer, C., corpus Linguistics and the web., Language and Computers 59, Kenilworth: Rodopi, 2007.
Keller, F., Lapata, M., Using the web to obtain frequencies for unseen bigrams., Comput. Linguist., Vol. 29, No. 3, 2003, pp. 459¬484.
Kilgarriff, A., Grefenstette, G., Introduction to the special issue on the web as corpus., Computational Linguistics, Vol. 29, 2003, pp. 333¬347.
Miller, R. C., Bharat, K., Sphinx: a framework for creating personal, sitespecific web crawlers., in WWW7: Proceedings of the seventh international conference on World Wide web 7, (Amsterdam, The Netherlands, The
Netherlands), Elsevier Science Publishers B. V., 1998., pp. 119¬130.
Kehoe, A. R., webcorp: Applying the web to linguistics and linguistics to the web., in WWW2002 Conference, Honolulu, Hawaii, 2002.
Kehoe, A. M. G., New corpora from the web: making web text more 'text-like'., in Towards Multimedia in corpus Studies, electronic publication, University of Helsinki, 2007.
Mattson, G., Sanders, B. A., Massingill. B. L., Patterns for Parallel Programming., Addison-Wesley Professional, 2004.
Krishnamurthy, A., Yelick, K., Optimizing parallel programs with explicit synchronization., SIGPLAN Not. 30, 1995, pp. 96-204.
Gelbukh, A., Sidorov, G., Procesamiento automático del español con enfoque en recursos léxicos grandes., IPN, Mexico, 2006.
License
Copyright (c) 2009 Héctor Fabio Cadavid Rengifo, Jonatan Gómez Perdomo

This work is licensed under a Creative Commons Attribution 4.0 International License.
The authors or holders of the copyright for each article hereby confer exclusive, limited and free authorization on the Universidad Nacional de Colombia's journal Ingeniería e Investigación concerning the aforementioned article which, once it has been evaluated and approved, will be submitted for publication, in line with the following items:
1. The version which has been corrected according to the evaluators' suggestions will be remitted and it will be made clear whether the aforementioned article is an unedited document regarding which the rights to be authorized are held and total responsibility will be assumed by the authors for the content of the work being submitted to Ingeniería e Investigación, the Universidad Nacional de Colombia and third-parties;
2. The authorization conferred on the journal will come into force from the date on which it is included in the respective volume and issue of Ingeniería e Investigación in the Open Journal Systems and on the journal's main page (https://revistas.unal.edu.co/index.php/ingeinv), as well as in different databases and indices in which the publication is indexed;
3. The authors authorize the Universidad Nacional de Colombia's journal Ingeniería e Investigación to publish the document in whatever required format (printed, digital, electronic or whatsoever known or yet to be discovered form) and authorize Ingeniería e Investigación to include the work in any indices and/or search engines deemed necessary for promoting its diffusion;
4. The authors accept that such authorization is given free of charge and they, therefore, waive any right to receive remuneration from the publication, distribution, public communication and any use whatsoever referred to in the terms of this authorization.









