Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish
Published: 2023-05-26
Issue: 11
Volume: 13
Page: 6521
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences
Author:
Gimeno-Gómez, David (1); Martínez-Hinarejos, Carlos-D. (1)
Affiliation:
1. Pattern Recognition and Human Language Technologies Research Center, Universitat Politècnica de València, Camino de Vera, s/n, 46022 València, Spain
Abstract
Visual speech recognition (VSR) is a challenging task that aims to interpret speech based solely on lip movements. Although remarkable results have recently been achieved, the task remains an open research problem due to several challenges, such as visual ambiguities, inter-personal variability among speakers, and the complex modeling of silence. These challenges can be alleviated when the task is approached from a speaker-dependent perspective. Our work focuses on adapting end-to-end VSR systems to a specific speaker. Hence, we propose two different adaptation methods, one based on the conventional fine-tuning technique and one based on the so-called Adapters. We conduct a comparative study in terms of performance while considering different deployment aspects, such as training time and storage cost. Results on the Spanish LIP-RTVE database show that both methods obtain recognition rates comparable to the state of the art, even when only a limited amount of training data is available. Although it incurs a deterioration in performance, the Adapters-based method presents a more scalable and efficient solution, reducing the training time and storage cost by up to 80%.
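The parameter-efficiency argument in the abstract can be illustrated with a minimal sketch. The record gives no implementation details, so the module below is a generic residual bottleneck Adapter (down-projection, ReLU, up-projection) of the kind commonly inserted into a frozen backbone layer, with all names and dimensions chosen for illustration only; it is not the authors' actual architecture.

```python
import numpy as np

class Adapter:
    """Hypothetical adapter module: a small bottleneck network added to a
    frozen backbone layer. Only these parameters would be trained per
    speaker, which is why adapter-based methods cut training time and
    per-speaker storage relative to full fine-tuning."""

    def __init__(self, d_model: int, bottleneck: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Down-projection initialized small; up-projection initialized to
        # zero so the adapter starts as an identity mapping.
        self.w_down = rng.normal(0.0, 0.02, size=(d_model, bottleneck))
        self.w_up = np.zeros((bottleneck, d_model))

    def __call__(self, h: np.ndarray) -> np.ndarray:
        # Residual connection: output = h + up(relu(down(h))).
        z = np.maximum(h @ self.w_down, 0.0)
        return h + z @ self.w_up

    def num_params(self) -> int:
        return self.w_down.size + self.w_up.size

# Illustrative sizes: a 256-dim layer with a 32-dim bottleneck.
adapter = Adapter(d_model=256, bottleneck=32)
dense_layer_params = 256 * 256  # one full dense layer of the backbone
print(adapter.num_params(), dense_layer_params)  # 16384 vs. 65536 (4x fewer)
```

Because only the adapter weights differ between speakers, each new speaker costs one small set of parameters rather than a full copy of the model, which is consistent with the storage savings the abstract reports.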
Funder
Generalitat Valenciana; Ministerio de Ciencia e Innovación; Agencia Estatal de Investigación
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science
References: 72 articles.