Published

2020-04-01

Speaker verification system based on articulatory information from ultrasound recordings


DOI:

https://doi.org/10.15446/dyna.v87n213.81772

Keywords:

speech processing, speaker verification, articulatory parameters, ultrasound, i-vectors, GMMs

Authors

Franklin Alexander Sepulveda Sepulveda, Dagoberto Porras-Plata and Milton Sarria-Paja

Current state-of-the-art speaker verification (SV) systems are known to be strongly affected by unexpected variability present during testing, such as environmental noise or changes in vocal effort. In this work, we analyze and evaluate articulatory information describing the tongue's movement as a means to improve the performance of speaker verification systems. We use a Spanish database that, in addition to the speech signals, includes articulatory information acquired with an ultrasound system. Two groups of features are proposed to represent the articulatory information, and the resulting performance is compared to that of an SV system trained only with acoustic information. Our results show that the proposed features contain highly discriminative information related to speaker identity; furthermore, they can be used to complement and improve existing systems by combining such information with cepstral coefficients at the feature level.
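The feature-level combination mentioned above amounts to concatenating, frame by frame, the acoustic cepstral features with articulatory features derived from the ultrasound tongue contours. The sketch below illustrates one way to do this. It is a minimal sketch under stated assumptions, not the authors' implementation: the librosa-based MFCC settings, the precomputed articulatory feature matrix, its frame rate artic_rate, and the linear interpolation used to align the two streams are all assumptions for illustration.

# Minimal sketch (assumptions, not the paper's code): feature-level fusion
# of acoustic MFCCs with articulatory features extracted from ultrasound
# frames, aligned by linear interpolation to the acoustic frame rate.
import numpy as np
import librosa

def fuse_features(wav_path, artic_feats, artic_rate, sr=16000,
                  n_mfcc=13, hop_s=0.010, win_s=0.025):
    """Return one concatenated MFCC + articulatory vector per acoustic frame.

    artic_feats: (n_ultrasound_frames, d) array, e.g. tongue-contour
                 parameters per ultrasound image (hypothetical input).
    artic_rate:  ultrasound frame rate in Hz.
    """
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                hop_length=int(hop_s * sr),
                                n_fft=int(win_s * sr)).T  # (n_frames, n_mfcc)

    # Time stamp of each frame in both streams.
    t_mfcc = np.arange(mfcc.shape[0]) * hop_s
    t_artic = np.arange(artic_feats.shape[0]) / artic_rate

    # Upsample the slower articulatory stream to the acoustic frame rate.
    artic_up = np.stack([np.interp(t_mfcc, t_artic, artic_feats[:, j])
                         for j in range(artic_feats.shape[1])], axis=1)

    # Feature-level fusion: concatenate per frame.
    return np.hstack([mfcc, artic_up])

The fused frame vectors can then be fed to a GMM- or i-vector-based back end exactly as plain cepstral features would be.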


How to cite

IEEE

[1]
F. A. Sepulveda Sepulveda, D. Porras-Plata, and M. Sarria-Paja, "Speaker verification system based on articulatory information from ultrasound recordings", DYNA, vol. 87, no. 213, pp. 9–16, Apr. 2020.

ACM

[1]
Sepulveda Sepulveda, F.A., Porras-Plata, D. and Sarria-Paja, M. 2020. Speaker verification system based on articulatory information from ultrasound recordings. DYNA. 87, 213 (Apr. 2020), 9–16. DOI: https://doi.org/10.15446/dyna.v87n213.81772.

ACS

(1)
Sepulveda Sepulveda, F. A.; Porras-Plata, D.; Sarria-Paja, M. Speaker verification system based on articulatory information from ultrasound recordings. DYNA 2020, 87, 9-16.

APA

Sepulveda Sepulveda, F. A., Porras-Plata, D. & Sarria-Paja, M. (2020). Speaker verification system based on articulatory information from ultrasound recordings. DYNA, 87(213), 9–16. https://doi.org/10.15446/dyna.v87n213.81772

ABNT

SEPULVEDA SEPULVEDA, F. A.; PORRAS-PLATA, D.; SARRIA-PAJA, M. Speaker verification system based on articulatory information from ultrasound recordings. DYNA, [S. l.], v. 87, n. 213, p. 9–16, 2020. DOI: 10.15446/dyna.v87n213.81772. Available at: https://revistas.unal.edu.co/index.php/dyna/article/view/81772. Accessed: 22 Mar. 2026.

Chicago

Sepulveda Sepulveda, Franklin Alexander, Dagoberto Porras-Plata, and Milton Sarria-Paja. 2020. "Speaker verification system based on articulatory information from ultrasound recordings". DYNA 87 (213): 9-16. https://doi.org/10.15446/dyna.v87n213.81772.

Harvard

Sepulveda Sepulveda, F. A., Porras-Plata, D. and Sarria-Paja, M. (2020) "Speaker verification system based on articulatory information from ultrasound recordings", DYNA, 87(213), pp. 9–16. doi: 10.15446/dyna.v87n213.81772.

MLA

Sepulveda Sepulveda, F. A., D. Porras-Plata, and M. Sarria-Paja. "Speaker verification system based on articulatory information from ultrasound recordings". DYNA, vol. 87, no. 213, April 2020, pp. 9-16, doi:10.15446/dyna.v87n213.81772.

Turabian

Sepulveda Sepulveda, Franklin Alexander, Dagoberto Porras-Plata, and Milton Sarria-Paja. "Speaker verification system based on articulatory information from ultrasound recordings". DYNA 87, no. 213 (April 1, 2020): 9–16. Accessed March 22, 2026. https://revistas.unal.edu.co/index.php/dyna/article/view/81772.

Vancouver

1.
Sepulveda Sepulveda FA, Porras-Plata D, Sarria-Paja M. Speaker verification system based on articulatory information from ultrasound recordings. DYNA [Internet]. 2020 Apr 1 [cited 2026 Mar 22];87(213):9-16. Available from: https://revistas.unal.edu.co/index.php/dyna/article/view/81772


CrossRef Cited-by

CrossRef citations: 1

1. Naren Arley Mantilla Ramírez, Iván Darío Porras Gómez, Alexander Sepúlveda Sepúlveda. (2022). Detección de especies maderables mediante sensores químicos de olor, aplicando regularización L1 y modelos de mezclas gaussianas. Revista Logos Ciencia & Tecnología, 15(1), p.8. https://doi.org/10.22335/rlct.v15i1.1642.
