While many researchers study computer vision, natural language processing, or robotics in isolation, we are mainly interested in the intersection of these three domains, specifically Vision-and-Language Navigation (VLN). To perform VLN, an agent or robot perceives a 360-degree view of its environment and is given a human instruction such as “Take a right, going past the kitchen into the hallway”. The agent’s goal is to follow the instruction and navigate a previously unknown environment.
Research in VLN has been growing recently. Some recent work focuses on generating higher-quality instructions, which could lead to better conversational agents. Moreover, using a combination of synthetic and human instructions could also improve navigation performance. In my first project, we propose an architecture inspired by the Generative Pre-trained Transformer (GPT). The model consists of a transformer decoder that, given a sequence of images from the environment, generates a synthetic instruction describing the path the agent has traversed up to a particular point in time.
While the original GPT model was developed for NLP applications such as text generation and text summarization, the model we propose describes the actions the agent takes in an environment to reach the target location, given the sequence of images of the traversed path. The overall approach is as follows. First, we take the images of the environment along the path the agent has traversed. These images are fed into a pretrained CLIP vision encoder to extract image features, which are then fed into a GPT-2 decoder along with the initial Begin of String (BoS) token. The GPT-2 decoder predicts the subsequent language tokens and thereby generates a complete language instruction describing the actions the agent took to reach the final location shown in the last image. Unlike GPT, where the input to the model is only the text up to the previous time step, our model takes both images and text as input. Inspired by BERT, we use segment embeddings and position embeddings in addition to token embeddings to effectively separate image and textual information.
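The input construction and decoding loop described above can be illustrated with a minimal sketch. It is not the project's implementation: the embedding tables are random stand-ins for learned parameters, the image features are assumed to already match the token embedding width (any projection layer is omitted), positions are assumed to run continuously across the image and text slots, and the end-of-sequence token, the segment ids, and all function names (make_table, build_decoder_inputs, greedy_decode, next_token_fn) are hypothetical.

```python
import random

EMBED_DIM = 8  # hypothetical width; real CLIP and GPT-2 features are far larger
IMAGE_SEGMENT, TEXT_SEGMENT = 0, 1  # assumed segment ids


def make_table(rows, dim, rng):
    """Random stand-in for a learned embedding table (rows x dim)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def build_decoder_inputs(image_features, token_ids,
                         token_table, segment_table, position_table):
    """Sum content, segment, and position embeddings for every sequence slot.

    Image features come first, followed by the text tokens (BoS included).
    """
    content = list(image_features) + [token_table[t] for t in token_ids]
    segments = ([IMAGE_SEGMENT] * len(image_features)
                + [TEXT_SEGMENT] * len(token_ids))
    inputs = []
    for position, (vector, segment) in enumerate(zip(content, segments)):
        inputs.append([
            c + s + p
            for c, s, p in zip(vector, segment_table[segment],
                               position_table[position])
        ])
    return inputs, segments


def greedy_decode(image_features, next_token_fn, bos_id, eos_id, max_len=20):
    """Start from BoS and append predicted tokens until EoS or max_len."""
    tokens = [bos_id]
    while len(tokens) < max_len:
        # next_token_fn stands in for the decoder's forward pass plus argmax.
        next_id = next_token_fn(image_features, tokens)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    token_table = make_table(5, EMBED_DIM, rng)      # toy vocabulary of 5 tokens
    segment_table = make_table(2, EMBED_DIM, rng)    # image vs. text
    position_table = make_table(10, EMBED_DIM, rng)  # up to 10 sequence slots
    image_features = make_table(3, EMBED_DIM, rng)   # stand-in for CLIP features
    inputs, segments = build_decoder_inputs(
        image_features, [1], token_table, segment_table, position_table)
    print(len(inputs), segments)
```

In this sketch the segment embedding is what tells the decoder whether a slot holds an image feature or a text token, since both are summed into vectors of the same shape.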
The proposed model generates high-quality synthetic instructions that could potentially improve an agent’s ability to navigate in a previously unknown environment. The model could also be extended to other applications. For example, when a human asks the agent where an object is, the agent can provide instructions on how to reach it. This would enable humans and robots to collaborate and interact with each other in the same environment. The robot could also help a human navigate an indoor environment when the human is unable to find the way. Conversely, the human could guide the robot through continuous interaction if the robot gets stuck in a particular location.
An example of a generated instruction for the given images of the path:
Ground truth instruction:
Exit bedroom and head straight toward living area, and wait at entrance.
Generated instruction:
exit the bathroom and walk through the bedroom . then stand near the entrance of the front door .