One is not enough: Multimodal learning for richer information.

Unimodal deep learning focuses on processing, analyzing, and generating data from a single modality. For example, this could involve training neural networks on images, text, or audio alone to classify, regress, or generate within that single modality. Multimodal deep learning, on the other hand, integrates and jointly processes data from multiple modalities, combining information from images, text, or audio to perform tasks that require understanding across modalities.

Humans acquire knowledge from multiple modalities, leveraging this rich information to learn more effectively. For example, a person can recognize another person's emotion more accurately when they can both see that person's face and listen to their speech.

Similar to human-human emotion recognition, in human-robot interaction a robot needs to recognize emotion by fusing multiple modalities, which can include vision (images), audio (speech), text (verbal content), or even touch. However, most existing work on emotion recognition relies on a single modality. To interact fluidly in a social setting, a robot needs to understand emotion from different modalities.

To leverage different modalities for emotion recognition in human-robot interaction, we need to understand a few components:

  • Fusion of different modalities: As we leverage different modalities, we need to understand how to combine them optimally so that rich information can be extracted from each one. In early fusion, the raw inputs from the different modalities are combined before being fed into a deep model. In intermediate fusion, features learned separately from each modality are fused. Finally, in late fusion, the predictions made on each modality are combined (see the fusion sketch after this list).
  • Extraction of features: Extracting informative features from each modality using modality-specific techniques or models is crucial. For example, we can use facial expression models to extract facial landmarks and expressions from images, extract audio features such as pitch, intensity, and spectral descriptors from speech signals, and analyze textual content with natural language processing techniques (a small extraction example follows this list).
  • Learning with cross-modalities: Exploring architectures that learn representations capturing cross-modal relations between different modalities is crucial. Deep architectures such as Siamese networks, cross-modal retrieval models, and variational autoencoders can be used to learn joint representations from multiple modalities (see the joint-embedding sketch below).
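
The following is a minimal PyTorch sketch of the three fusion strategies described above. The feature dimensions, layer sizes, and the number of emotion classes are illustrative assumptions, not values from the text.

```python
# Minimal sketch of early, intermediate, and late fusion.
# All dimensions below are assumed for illustration.
import torch
import torch.nn as nn

IMG_DIM, AUD_DIM, TXT_DIM, NUM_EMOTIONS = 512, 128, 300, 7  # assumed feature sizes


class EarlyFusion(nn.Module):
    """Concatenate modality inputs before any modality-specific processing."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + AUD_DIM + TXT_DIM, 256), nn.ReLU(),
            nn.Linear(256, NUM_EMOTIONS),
        )

    def forward(self, img, aud, txt):
        return self.net(torch.cat([img, aud, txt], dim=-1))


class IntermediateFusion(nn.Module):
    """Encode each modality separately, then fuse the learned features."""
    def __init__(self):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Linear(IMG_DIM, 64), nn.ReLU())
        self.aud_enc = nn.Sequential(nn.Linear(AUD_DIM, 64), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(TXT_DIM, 64), nn.ReLU())
        self.head = nn.Linear(64 * 3, NUM_EMOTIONS)

    def forward(self, img, aud, txt):
        feats = torch.cat(
            [self.img_enc(img), self.aud_enc(aud), self.txt_enc(txt)], dim=-1
        )
        return self.head(feats)


class LateFusion(nn.Module):
    """Run a classifier per modality and average their predictions."""
    def __init__(self):
        super().__init__()
        self.img_clf = nn.Linear(IMG_DIM, NUM_EMOTIONS)
        self.aud_clf = nn.Linear(AUD_DIM, NUM_EMOTIONS)
        self.txt_clf = nn.Linear(TXT_DIM, NUM_EMOTIONS)

    def forward(self, img, aud, txt):
        logits = torch.stack(
            [self.img_clf(img), self.aud_clf(aud), self.txt_clf(txt)]
        )
        return logits.mean(dim=0)
```

The trade-off is roughly this: early fusion lets the model see all raw information at once but ignores modality-specific structure, intermediate fusion keeps specialized encoders while still learning a joint representation, and late fusion is simplest and most robust to a missing modality but cannot model interactions between modalities.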
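For feature extraction, here is a small example for the audio and text modalities. The choice of libraries (librosa and scikit-learn) and the summary statistics are my assumptions; facial-landmark extraction would use a separate vision model and is omitted here.

```python
# Illustrative modality-specific feature extraction (audio and text only).
import librosa
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def audio_features(wav_path: str) -> np.ndarray:
    """Pitch, intensity, and spectral descriptors from a speech recording."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)            # pitch contour
    rms = librosa.feature.rms(y=y)                           # intensity proxy
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr) # spectral centroid
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)       # spectral features
    # Summarize each descriptor with its mean to get a fixed-length vector.
    return np.concatenate(
        [[np.nanmean(f0)], [rms.mean()], [centroid.mean()], mfcc.mean(axis=1)]
    )


def text_features(utterances: list[str]) -> np.ndarray:
    """Simple bag-of-words (TF-IDF) representation of the verbal content."""
    vectorizer = TfidfVectorizer(max_features=300)
    return vectorizer.fit_transform(utterances).toarray()
```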
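Finally, a sketch of cross-modal representation learning in the spirit of the Siamese and cross-modal retrieval models mentioned above: two projection heads map face and speech features into a shared embedding space, trained with a contrastive (InfoNCE-style) loss so that matching pairs end up close together. The encoder sizes, temperature, and the choice of this particular loss are illustrative assumptions.

```python
# Minimal joint-embedding sketch for two modalities (face and speech features).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalEmbedding(nn.Module):
    """Project face and speech features into a shared embedding space."""
    def __init__(self, img_dim=512, aud_dim=128, emb_dim=64):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)
        self.aud_proj = nn.Linear(aud_dim, emb_dim)

    def forward(self, img, aud):
        # L2-normalize so that dot products act as cosine similarities.
        return (
            F.normalize(self.img_proj(img), dim=-1),
            F.normalize(self.aud_proj(aud), dim=-1),
        )


def contrastive_loss(img_emb, aud_emb, temperature=0.1):
    """InfoNCE-style loss: matching image/audio pairs should be most similar."""
    logits = img_emb @ aud_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return F.cross_entropy(logits, targets)


# Usage with random stand-in features for a batch of 8 paired samples.
model = CrossModalEmbedding()
img_emb, aud_emb = model(torch.randn(8, 512), torch.randn(8, 128))
loss = contrastive_loss(img_emb, aud_emb)
```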