Deep Multimodal Embedding
Obtaining a good common representation between different modalities is challenging for two main reasons. First, each modality might intrinsically have very different statistical properties – for example, most trajectory representations are inherently dense, while a bag-of-words representation of language is by nature sparse. This makes it challenging to apply algorithms designed for unimodal data, as one modality might overpower the others. Second, even with expert knowledge, it is extremely challenging to design joint features between such disparate modalities. Humans are able to map similar concepts from different sensory systems to the same concept using a common representation between different modalities. For example, we are able to correlate the appearance of a banana with its feel, or a language instruction with a real-world action. This ability to fuse information from different input modalities and map them to actions is extremely useful to a household robot.
We introduce an algorithm that learns to pull semantically similar environment/language pairs and their corresponding trajectories together into the same regions of a shared embedding space, and to push environment/language pairs away from irrelevant trajectories, by an amount that depends on how irrelevant those trajectories are. Our algorithm also allows for efficient inference: given a new instruction and point-cloud, we only need to find the trajectory nearest to the projection of this pair in the learned embedding space, using fast nearest-neighbor algorithms.
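The training objective and the inference step above can be sketched as follows. This is a minimal illustration, not the paper's actual formulation: the projection `embed`, the hinge form of the loss, and the per-trajectory `relevance` scores are assumptions chosen to show the idea of a margin that grows with how irrelevant a trajectory is.

```python
import numpy as np

def embed(x, W):
    # Hypothetical one-layer projection into the shared embedding
    # space, L2-normalized so dot products are cosine similarities.
    h = np.tanh(W @ x)
    return h / np.linalg.norm(h)

def margin_loss(pl_emb, traj_embs, relevance, match_idx, base_margin=0.1):
    """Hinge loss that pulls the matching trajectory toward the
    environment/language (point-cloud/language) embedding and pushes
    the others away by a margin scaled by their irrelevance.

    pl_emb    : unit-norm embedding of the environment/language pair
    traj_embs : (n, d) unit-norm trajectory embeddings
    relevance : relevance score in [0, 1] for each trajectory
                (1 = fully relevant); assumed given here
    match_idx : index of the ground-truth trajectory
    """
    sims = traj_embs @ pl_emb          # cosine similarity to each trajectory
    pos = sims[match_idx]
    loss = 0.0
    for j, s in enumerate(sims):
        if j == match_idx:
            continue
        # More irrelevant trajectories must be pushed further away.
        margin = base_margin * (1.0 - relevance[j])
        loss += max(0.0, margin + s - pos)
    return loss

def nearest_trajectory(pl_emb, traj_embs):
    # At inference time, return the trajectory whose embedding is
    # closest to the projected instruction/point-cloud pair. Brute
    # force here; an approximate nearest-neighbor index would make
    # this sublinear in the number of candidate trajectories.
    return int(np.argmax(traj_embs @ pl_emb))
```

Once trained, inference reduces to one projection plus a nearest-neighbor lookup, which is what makes the approach fast at test time.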
* Please refer to the journal version of the work for the explanation of this figure.