Finding protein sites using machine learning methods
DOI: https://doi.org/10.15446/ing.investig.v23n3.14696
Keywords: bioinformatics, machine learning, support vector machines, protein tertiary structure, central dogma of biology
The growing number of three-dimensional (3D) protein structures determined by X-ray and NMR techniques, together with structures predicted by computational methods, creates a need for automated methods that provide initial annotations. We have developed a new method for recognizing sites in three-dimensional protein structures. Our method is based on a previously reported algorithm that describes protein microenvironments using physical and chemical properties at multiple levels of detail. The recognition method takes three inputs: (1) a set of sites that share some structural or functional role; (2) a set of control nonsites that lack this role; and (3) a single query site, for which it is not known whether it has that role. A support vector machine classifier is built from feature vectors in which each component represents a property in a given volume. Validation against an independent test set shows that this recognition approach has high sensitivity and specificity. We also describe the results of scanning four calcium-binding proteins (with the calcium removed) using a three-dimensional grid of probe points at 1.25 angstrom spacing. The system locates the sites in these proteins, flagging points at or near the binding sites. Our results show that property-based descriptions combined with support vector machines can be used to recognize protein sites in unannotated structures.
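The grid scan and the validation measures mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `probe_grid`, `scan`, and `sensitivity_specificity`, the bounding box, and the toy scoring function are all assumptions made for illustration. In particular, `toy_score` merely stands in for a trained support vector machine's decision value, since the abstract does not specify how the property-based feature vectors are computed.

```python
import itertools
import math


def probe_grid(lower, upper, spacing=1.25):
    """Return (x, y, z) probe points covering an axis-aligned box.

    `lower` and `upper` are opposite corners of the box in angstroms;
    `spacing` defaults to the 1.25 angstrom value quoted in the abstract.
    """
    axes = []
    for lo, hi in zip(lower, upper):
        # Number of grid steps that fit along this axis (small tolerance
        # guards against floating-point rounding at the upper edge).
        steps = int(math.floor((hi - lo) / spacing + 1e-9))
        axes.append([lo + i * spacing for i in range(steps + 1)])
    return list(itertools.product(*axes))


def scan(points, score, threshold=0.0):
    """Keep the probe points whose classifier score exceeds `threshold`.

    `score` is any callable mapping a point to a real number; in a real
    pipeline it would compute features around the point and evaluate a
    trained classifier (an assumption, not shown here).
    """
    return [p for p in points if score(p) > threshold]


def sensitivity_specificity(y_true, y_pred):
    """Compute (sensitivity, specificity) from 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical site centre; the toy score is positive within 1.5 angstroms.
SITE_CENTRE = (1.25, 1.25, 1.25)


def toy_score(point):
    return 1.5 - math.dist(point, SITE_CENTRE)


if __name__ == "__main__":
    grid = probe_grid((0.0, 0.0, 0.0), (2.5, 2.5, 2.5))
    hits = scan(grid, toy_score)
    print(len(grid), "probe points,", len(hits), "flagged near the site")
    print(sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 0]))
```

Passing the scoring function in as a parameter keeps the scan independent of the classifier, so the same loop would work with any model that returns a real-valued score per probe point.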
License
Copyright (c) 2003 Jaime Leonardo Bobadilla Molina, Fernando Niño, Tobías Mojica
This work is licensed under a Creative Commons Attribution 4.0 International License.