Finding protein sites using machine learning methods

Jaime Leonardo Bobadilla Molina; Fernando Niño; Tobías Mojica

doi:10.15446/ing.investig.v23n3.14696

Published

2003-07-01

Identificación de sitios en proteínas usando métodos de aprendizaje de máquina

Keywords:

bioinformatics, central dogma of biology, machine learning, support vector machines, protein tertiary structure

Authors

Jaime Leonardo Bobadilla Molina, Universidad Nacional de Colombia
Fernando Niño, Universidad Nacional de Colombia
Tobías Mojica, Universidad Nacional de Colombia

Abstract

The growing number of three-dimensional (3D) protein structures determined by X-ray and NMR technologies, together with structures predicted by computational methods, creates a need for automated methods that provide initial annotations. We have developed a new method for recognizing sites in three-dimensional protein structures. The method builds on a previously reported algorithm that describes protein microenvironments using physical and chemical properties at multiple levels of detail. The recognition method takes three inputs: (1) a set of sites that share a structural or functional role; (2) a set of control nonsites that lack this role; and (3) a single query site whose status is unknown. A support vector machine classifier is built from feature vectors in which each component represents a property in a given volume. Validation against an independent test set shows that this recognition approach has high sensitivity and specificity. We also describe the results of scanning four calcium-binding proteins (with the calcium removed) using a three-dimensional grid of probe points at 1.25 Å spacing. The system finds the sites in these proteins, flagging points at or near the binding sites. Our results show that property-based descriptions, combined with support vector machines, can be used to recognize protein sites in unannotated structures.
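The grid scan and the sensitivity/specificity validation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the bounding box, and the `is_site` callable (a stand-in for a trained support vector machine) are assumptions introduced for illustration only. Only the 1.25 Å spacing comes from the abstract.

```python
from itertools import product
from math import dist, floor


def probe_grid(min_corner, max_corner, spacing=1.25):
    """Return (x, y, z) probe points covering an axis-aligned box.

    Coordinates and spacing are in angstroms. The box corners are
    hypothetical inputs, e.g. the bounding box of a protein structure.
    """
    axes = []
    for lo, hi in zip(min_corner, max_corner):
        # Number of whole intervals that fit along this axis
        # (small epsilon guards against floating-point round-off).
        steps = floor((hi - lo) / spacing + 1e-9)
        axes.append([lo + i * spacing for i in range(steps + 1)])
    return list(product(*axes))


def scan_for_sites(points, is_site):
    """Keep the probe points that the classifier flags as sites.

    `is_site` is an assumed callable taking a point and returning a bool;
    in practice it would wrap feature extraction plus a trained classifier.
    """
    return [p for p in points if is_site(p)]


def sensitivity_specificity(labels, predictions):
    """Compute (sensitivity, specificity) from boolean labels and predictions.

    True means "site", False means "nonsite".
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    tn = sum(1 for y, p in pairs if not y and not p)
    fn = sum(1 for y, p in pairs if y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


if __name__ == "__main__":
    # Toy example: flag probe points within 1.5 angstroms of a made-up
    # "binding site" location; a real classifier would replace this rule.
    fake_site = (1.25, 1.25, 1.25)
    grid = probe_grid((0.0, 0.0, 0.0), (2.5, 2.5, 2.5))
    hits = scan_for_sites(grid, lambda p: dist(p, fake_site) <= 1.5)
    print(len(grid), "probe points,", len(hits), "flagged")
```

Passing the classifier as a callable keeps the scan independent of any particular machine learning library, so the same loop would work with whichever support vector machine implementation is available.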

How to Cite

APA

Bobadilla Molina, J. L., Niño, F. & Mojica, T. (2003). Finding protein sites using machine learning methods. Ingeniería e Investigación, 23(3), 5–11. https://doi.org/10.15446/ing.investig.v23n3.14696

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
