Multimodal Emotional Understanding analyzes and integrates information from multiple modalities, such as text, audio, and visual cues, to interpret human emotions and intentions accurately. This capability enhances human-computer interaction and enables more responsive systems in applications such as customer service, mental health monitoring, and social robotics, improving the user experience through more nuanced, context-aware responses.
We develop benchmarks for evaluating emotional understanding in Multimodal Large Language Models (MLLMs). We also design frameworks that integrate cues from different modalities to improve emotion and intention recognition, as sketched below.
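To make the idea of integrating cues concrete, the following is a minimal sketch of one common design, late fusion, in which pre-extracted text, audio, and visual features are projected into a shared space, concatenated, and classified into emotion categories. The module names, feature dimensions, and emotion label set are illustrative assumptions, not the specific framework or benchmarks described here.

```python
# Minimal late-fusion sketch for multimodal emotion recognition.
# All dimensions and labels are illustrative assumptions, not the
# actual framework referred to in the text.
import torch
import torch.nn as nn

EMOTIONS = ["happy", "sad", "angry", "neutral"]  # hypothetical label set


class LateFusionEmotionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, visual_dim=512, hidden=256):
        super().__init__()
        # Project each modality's pre-extracted features into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        # Fuse by concatenation, then classify into emotion categories.
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(EMOTIONS)),
        )

    def forward(self, text_feat, audio_feat, visual_feat):
        fused = torch.cat(
            [
                torch.relu(self.text_proj(text_feat)),
                torch.relu(self.audio_proj(audio_feat)),
                torch.relu(self.visual_proj(visual_feat)),
            ],
            dim=-1,
        )
        return self.classifier(fused)  # logits over emotion labels


if __name__ == "__main__":
    model = LateFusionEmotionClassifier()
    # Dummy batch of 2 samples with pre-extracted per-modality features.
    logits = model(torch.randn(2, 768), torch.randn(2, 128), torch.randn(2, 512))
    print(logits.shape)  # torch.Size([2, 4])
```

Late fusion is only one option; attention-based or MLLM-prompt-based integration would replace the concatenation step while keeping the same overall structure of per-modality encoding followed by a joint decision.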