27 Apr 2022 |
Research article |
Intelligent and Autonomous Systems
Emotion Recognition using Cross-Attentional AudioVisual Fusion
Multimodal analysis has recently drawn much interest in affective computing, since it can improve the overall accuracy of emotion recognition (ER) over isolated unimodal approaches. In this work, we explored the fusion of audio (A) and visual (V) modalities in a complementary fashion to extract robust multimodal feature representations. The objective was to address the problem of continuous emotion recognition, where the aim is to estimate the wide range of human emotions on a continuous scale of valence and arousal. Specifically, we introduced a cross-attentional fusion approach that extracts the salient features across AV modalities, allowing for an accurate prediction of continuous values of valence and arousal. Our cross-attentional AV fusion model efficiently leverages the intermodal AV relationships. In particular, it computes cross-attention weights to focus on the more relevant features across the individual modalities and thereby combines the contributive feature representations, which are then fed to prediction layers to predict valence and arousal. This work has potential in real-world applications such as pain intensity and depression level estimation in health care, and driver fatigue detection in driver assistance systems.
Emotion Recognition: A Challenging Task
Automatic recognition and analysis of human emotions have drawn much attention over the past few decades. They have a wide range of applications in various fields, such as health care (anger, fatigue, depression and pain assessment), robotics (human-machine interaction) and driver assistance (driver condition assessment). Emotion recognition (ER) is a challenging problem since the expressions linked to human emotions are extremely diverse across individuals and cultures.
The Valence-Arousal Space
Recently, real-world applications have brought about a shift in affective computing research from laboratory-controlled environments to more realistic natural settings. This shift has led to the analysis of a wide range of subtle, continuous emotional states elicited in real-world settings, as in pain intensity estimation and depression level estimation. Continuous ER is normally formulated as a dimensional ER problem, where complex human emotions are represented as points in a dimensional space. Figure 2 illustrates the two-dimensional space representing emotional states, where valence and arousal are employed as dimensional axes. Valence reflects the wide range of emotions in the dimension of pleasantness, from negative (sad) to positive (happy), whereas arousal spans the range of intensities from passive (sleepiness) to active (high excitement).
Using Multimodal Systems
Human emotions can be conveyed through various modalities like face, voice, text and biosignals (electroencephalogram, electrocardiogram, etc.), each typically carrying diverse information. Among these, the vocal and facial modalities are the predominant contact-free channels in videos, and they carry complementary information. In this work, we investigated the prospect of efficiently leveraging the complementary nature of AV relationships captured in videos to improve the performance of multimodal systems over unimodal ones. For instance, when the facial modality is missing due to pose, blur, low illumination, etc., we can still leverage the audio modality to estimate the emotional state, and vice versa.
Given a set of video sequences, we extracted the audio and visual streams separately. The visual stream was preprocessed to obtain cropped and aligned face images, and the audio stream was processed to obtain spectrograms of the corresponding visual clips. These were then fed to visual and audio backbones to extract the corresponding visual and audio features, which were in turn fed to the cross-attentional model. In the fusion (cross-attentional) model, we obtained the attention weights for each modality based on a correlation measure across the audio and visual features. A higher correlation measure indicates that the corresponding audio and visual features are highly related to each other and carry relevant information. The final attended features were then obtained using the attention weights and fed to the prediction layer to estimate valence and arousal.
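The fusion step described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names, the feature dimensions, the randomly initialized `weight` and `w_out` matrices (which would be learned in practice), the residual connection and the tanh output layer are all hypothetical choices made to keep the example short. See the conference paper cited below for the actual formulation.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention_fusion(audio, visual, weight):
    """Fuse audio (L, d_a) and visual (L, d_v) features of L time steps.

    weight (d_a, d_v) stands in for a learned cross-correlation matrix
    (assumption: random here, trained in a real model).
    """
    # Correlation between every audio step i and every visual step j: (L, L).
    corr = audio @ weight @ visual.T

    # Attention weights per modality; each column sums to 1.
    attn_a = softmax(corr, axis=0)    # weights over audio steps
    attn_v = softmax(corr.T, axis=0)  # weights over visual steps

    # Attended features: correlation-weighted sums added back to the
    # original features (the residual connection is an assumption).
    audio_att = audio + attn_a.T @ audio      # (L, d_a)
    visual_att = visual + attn_v.T @ visual   # (L, d_v)

    # Concatenate both modalities for the prediction layer.
    fused = np.concatenate([audio_att, visual_att], axis=1)
    return fused, attn_a, attn_v


def predict_valence_arousal(fused, w_out):
    """Hypothetical linear prediction layer; tanh bounds outputs to [-1, 1]."""
    return np.tanh(fused @ w_out)  # (L, 2): valence and arousal per step


# Example with random stand-ins for backbone features.
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(8, 16))    # 8 time steps, 16-dim audio
visual_feats = rng.normal(size=(8, 32))   # 8 time steps, 32-dim visual
corr_weight = 0.1 * rng.normal(size=(16, 32))
out_weight = 0.1 * rng.normal(size=(48, 2))

fused_feats, _, _ = cross_attention_fusion(audio_feats, visual_feats, corr_weight)
predictions = predict_valence_arousal(fused_feats, out_weight)
print(predictions.shape)
```

In this sketch, a large entry in `corr` means that an audio step and a visual step are strongly related, so the softmax gives that pair a larger share of the attention when the attended features are built.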
For more information on that research, please read the following conference paper:
 R. G. Praveen, E. Granger and P. Cardinal, “Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition,” 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), 2021, pp. 1-8.
Gnana Praveen Rajasekar
Gnana Praveen Rajasekar is working on audiovisual fusion for emotion recognition for his PhD. He received his M.Tech degree in Signal Processing from the Indian Institute of Technology Guwahati.
Program : Information Technology Engineering
Research laboratories : LIVIA – Imaging, Vision and Artificial Intelligence Laboratory
Eric Granger is a professor in the Systems Engineering Department at ÉTS. His research focuses on machine learning, pattern recognition, computer vision, information fusion, and adaptive and intelligent systems.
Program : Automated Manufacturing Engineering
Patrick Cardinal is a professor and director of the Software Engineering and Information Technologies Department at ÉTS. His research focuses on speech recognition, language identification, emotion detection, and parallel processing.