Enhancing Facial Expression Recognition through Light Field Cameras
Authors:
Sabrine Djedjiga Oucherif 1, Mohamad Motasem Nawaf 2, Jean-Marc Boï 2, Lionel Nicod 3, Elodie Mallor 3, Séverine Dubuisson 2, Djamal Merad 2
Affiliations:
1. Institut de Mathématiques de Marseille (IMM), CNRS, Aix-Marseille University, 13009 Marseille, France
2. Laboratoire d'Informatique et des Systèmes (LIS), CNRS, Aix-Marseille University, 13009 Marseille, France
3. Centre d'Etudes et de Recherche en Gestion d'Aix-Marseille (CERGAM), Aix-Marseille University, 13013 Marseille, France
Abstract
In this paper, we study facial expression recognition (FER) using three modalities obtained from a light field camera: sub-aperture (SA) images, depth maps, and all-in-focus (AiF) images. Our objective is to construct a more comprehensive and effective FER system by investigating multimodal fusion strategies. For this purpose, we employ EfficientNetV2-S, pre-trained on AffectNet, as our primary convolutional neural network. This model, combined with a bidirectional gated recurrent unit (BiGRU), processes the sequence of SA images. We evaluate fusion techniques at both the decision and feature levels to assess their effectiveness in improving FER accuracy. Our findings show that the model using SA images surpasses the state of the art, achieving 88.13% ± 7.42% accuracy under the subject-specific evaluation protocol and 91.88% ± 3.25% under the subject-independent evaluation protocol. These results highlight our model's potential to improve FER accuracy and robustness. Furthermore, our multimodal fusion approach, integrating SA, AiF, and depth images, yields substantial improvements over the unimodal models. Decision-level fusion, particularly with averaged weights, proved most effective, achieving 90.13% ± 4.95% accuracy under the subject-specific protocol and 93.33% ± 4.92% under the subject-independent protocol. This strategy leverages the complementary strengths of each modality, resulting in a more comprehensive and accurate FER system.
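To make the described pipeline concrete, the following is a minimal PyTorch sketch of the two ideas named in the abstract: a CNN backbone (EfficientNetV2-S) extracting per-view features from the SA image sequence, fed to a BiGRU, and decision-level fusion by averaging per-modality predictions. This is an illustration, not the authors' implementation: torchvision only ships ImageNet weights for EfficientNetV2-S (the AffectNet pre-training would require separate weights), and the class name `SABranch`, the helper `average_fusion`, the hidden size, and the seven-class output are all assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_v2_s, EfficientNet_V2_S_Weights

NUM_CLASSES = 7  # assumed number of expression classes


class SABranch(nn.Module):
    """Hypothetical SA-image branch: CNN features per view, then a BiGRU."""

    def __init__(self, hidden: int = 256, num_classes: int = NUM_CLASSES):
        super().__init__()
        # ImageNet weights as a stand-in for the AffectNet pre-training
        backbone = efficientnet_v2_s(weights=EfficientNet_V2_S_Weights.DEFAULT)
        backbone.classifier = nn.Identity()  # keep the 1280-d pooled features
        self.backbone = backbone
        self.bigru = nn.GRU(1280, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (B, T, 3, H, W), a sequence of T sub-aperture images
        b, t = views.shape[:2]
        feats = self.backbone(views.flatten(0, 1))  # (B*T, 1280)
        feats = feats.view(b, t, -1)                # (B, T, 1280)
        out, _ = self.bigru(feats)                  # (B, T, 2*hidden)
        return self.head(out[:, -1])                # logits from the last step


def average_fusion(logit_list: list[torch.Tensor]) -> torch.Tensor:
    """Decision-level fusion: average class probabilities across modalities."""
    probs = [torch.softmax(logits, dim=-1) for logits in logit_list]
    return torch.stack(probs).mean(dim=0)
```

Given per-modality logits from the SA, AiF, and depth branches, a fused prediction would then be `average_fusion([sa_logits, aif_logits, depth_logits]).argmax(-1)`; averaging probabilities rather than raw logits keeps each branch's contribution on a comparable scale.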