Published

2003-07-01

Finding protein sites using machine learning methods

Identificación de sitios en proteínas usando métodos de aprendizaje de máquina

DOI:

https://doi.org/10.15446/ing.investig.v23n3.14696

Keywords:

bioinformatics, machine learning, support vector machines, protein tertiary structure, central dogma of biology

Authors

  • Jaime Leonardo Bobadilla Molina Universidad Nacional de Colombia
  • Fernando Niño Universidad Nacional de Colombia
  • Tobías Mojica Universidad Nacional de Colombia

The growing number of three-dimensional (3D) protein structures determined by X-ray and NMR technologies, together with structures predicted by computational methods, creates a need for automated methods that provide initial annotations. We have developed a new method for recognizing sites in three-dimensional protein structures. The method builds on a previously reported algorithm that describes protein microenvironments using physical and chemical properties at multiple levels of detail. The recognition method takes three inputs: (1) a set of sites that share some structural or functional role; (2) a set of control nonsites that lack this role; and (3) a single query site. A support vector machine classifier is built from feature vectors in which each component represents a property in a given volume. Validation against an independent test set shows that this recognition approach has high sensitivity and specificity. We also describe the results of scanning four calcium-binding proteins (with the calcium removed) using a three-dimensional grid of probe points at 1.25 angstrom spacing. The system locates the sites by flagging grid points at or near the binding sites. Our results show that property-based descriptions combined with support vector machines can be used to recognize protein sites in unannotated structures.
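As a rough illustration of two steps named in the abstract, the sketch below lays a grid of probe points at 1.25 angstrom spacing over a structure and builds a feature vector in which each component counts one property in one volume. It is not the authors' implementation. The atom record layout (`xyz`, `props`), the property names, the use of concentric shells as the volumes, and the shell width and count are all assumptions made for the example. The resulting vectors are the kind of input a support vector machine classifier would be trained on; that training step is not shown.

```python
import math
from itertools import product


def probe_grid(atoms, spacing=1.25):
    """Return grid points (tuples) covering the bounding box of the atoms.

    `atoms` is a list of dicts with an "xyz" coordinate tuple; this record
    layout is an assumption of the sketch, not a format from the paper.
    """
    axes = []
    for values in zip(*(atom["xyz"] for atom in atoms)):
        low, high = min(values), max(values)
        steps = int(math.floor((high - low) / spacing)) + 1
        axes.append([low + i * spacing for i in range(steps)])
    return list(product(*axes))


def shell_features(point, atoms, properties, shell_width=1.0, n_shells=3):
    """Count atoms carrying each property in each concentric shell.

    Component index = property_index * n_shells + shell_index, so every
    component stands for one property in one volume. Shell width and
    shell count are hypothetical values chosen for illustration.
    """
    vector = [0.0] * (len(properties) * n_shells)
    for atom in atoms:
        distance = math.dist(point, atom["xyz"])
        shell = int(distance // shell_width)
        if shell >= n_shells:
            continue  # atom lies outside the outermost shell
        for p_index, name in enumerate(properties):
            if name in atom["props"]:
                vector[p_index * n_shells + shell] += 1.0
    return vector


# Toy structure with two atoms; the property labels are made up.
toy_atoms = [
    {"xyz": (0.0, 0.0, 0.0), "props": {"oxygen"}},
    {"xyz": (2.5, 0.0, 0.0), "props": {"oxygen", "charged"}},
]
toy_grid = probe_grid(toy_atoms)
toy_vector = shell_features(toy_grid[0], toy_atoms, ["oxygen", "charged"])
```

In a scan, each grid point would be converted to such a vector and passed to the trained classifier, and the points classified as sites would be reported.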



How to Cite

APA

Bobadilla Molina, J. L., Niño, F. and Mojica, T. (2003). Finding protein sites using machine learning methods. Ingeniería e Investigación, 23(3), 5–11. https://doi.org/10.15446/ing.investig.v23n3.14696

