Continuous lipreading based on acoustic temporal alignments

Published:2024-05-06 Issue:1 Volume:2024
ISSN:1687-4722
Container-title:EURASIP Journal on Audio, Speech, and Music Processing
language:en
Short-container-title:J AUDIO SPEECH MUSIC PROC.

Author:

Gimeno-Gómez David, Martínez-Hinarejos Carlos-D.

Abstract

Visual speech recognition (VSR) is a challenging task that has received increasing interest during the last few decades. The current state of the art employs powerful end-to-end architectures based on deep learning, which depend on large amounts of data and high computational resources for their estimation. We address the task of VSR for data-scarcity scenarios with limited computational resources by using traditional approaches based on hidden Markov models. We present a novel learning strategy that employs information obtained from previous acoustic temporal alignments to improve the visual system performance. Furthermore, we studied multiple visual speech representations and how image resolution and frame rate affect system performance. All these experiments were conducted on the limited-data VLRF corpus, a database that offers audio-visual support for continuous speech recognition in Spanish. The results show that our approach significantly outperforms the best results achieved on the task to date.

Funder

Conselleria de Innovación, Universidades, Ciencia y Sociedad Digital, Generalitat Valenciana

Ministerio de Ciencia e Innovación

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1186/s13636-024-00345-7.pdf

Reference: 102 articles.

1. S. Dupont, J. Luettin, Audio-visual speech modeling for continuous speech recognition. IEEE Trans. Multimed. 2(3), 141–151 (2000). https://doi.org/10.1109/6046.865479

2. J. Besle, A. Fort, C. Delpuech, M.-H. Giard, Bimodal speech: early suppressive visual effects in human auditory cortex. Eur. J. Neurosci. 20(8), 2225–2234 (2004). https://doi.org/10.1111/j.1460-9568.2004.03670.x

3. H. McGurk, J. MacDonald, Hearing lips and seeing voices. Nature. 264(5588), 746–748 (1976). https://doi.org/10.1038/264746a0

4. M. Gales, Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang. 12(2), 75–98 (1998). https://doi.org/10.1006/csla.1998.0043

5. B.H. Juang, L.R. Rabiner, Hidden Markov models for speech recognition. Technometrics. 33(3), 251–272 (1991). https://doi.org/10.2307/1268779