Paper note - [Week 7]

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments

Motivation

  • Previous approaches to natural language command of robots have often neglected the visual information processing aspect of the problem. Using rendered, rather than real, images, for example, constrains the set of visible objects to the set of hand-crafted models available to the renderer. This turns the robot’s challenging open-set problem of relating real language to real imagery into a far simpler closed-set classification problem. The natural extension of this process is the approach adopted in works where the images are replaced altogether by a set of labels.
  • Limiting the variation in the imagery inevitably limits the variation in the navigation instructions as well. What distinguishes the VLN challenge is that the agent is required to interpret a previously unseen natural-language navigation command in light of images generated by a previously unseen real environment. The task thus more closely models the distinctly open-set nature of the underlying problem.

Contribution

  • The research problem in this paper is embodied AI, specifically the task of Vision-and-Language Navigation (VLN). This is a practical problem in robotics, where language-empowered intelligent agents must adapt to the physical environment. Despite recent successes on vision and language tasks individually, their combination had not been systematically studied, owing to the challenge of linking the two in an unstructured, unseen environment.

  • This work pioneered research on visually-grounded natural language navigation and inspired more recent work to push the boundary forward. The main contributions of this paper are the Matterport3D Simulator, a large-scale interactive reinforcement learning environment; the Room-to-Room (R2R) benchmark dataset; and an attention-based sequence-to-sequence model that establishes a baseline for the VLN task.

Method

In this work, the authors first introduced the novel Matterport3D Simulator and the Room-to-Room (R2R) task and dataset, and then investigated the difficulty of the task by proposing several baseline models on this dataset.

First, the Matterport3D Simulator is built on 10,800 densely-sampled panoramic RGB-D images of real environments. The key point is the use of real-world images rather than readily available synthesized datasets, because no synthetic dataset can match real images in the richness of their visual context. Then, based on this simulator, the R2R dataset is prepared to support the R2R task, in which an embodied agent follows language instructions to navigate from a starting pose to a goal location.
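
To make the task setup concrete, here is a minimal sketch of what one R2R navigation episode looks like from the agent's point of view. The `Simulator`/`Agent` interface and its method names (`new_episode`, `get_observation`, `step`) are hypothetical placeholders for illustration, not the actual Matterport3D Simulator API.

```python
# Minimal sketch of an R2R-style episode loop.
# `sim` and `agent` (and their methods) are hypothetical placeholders,
# not the actual Matterport3D Simulator API.

def run_episode(sim, agent, instruction, start_pose, max_steps=30):
    """Roll out one navigation episode given a natural-language instruction."""
    sim.new_episode(start_pose)          # place the agent at the starting pose
    agent.reset(instruction)             # encode the instruction once per episode

    for _ in range(max_steps):
        obs = sim.get_observation()      # panoramic RGB view at the current pose
        action = agent.act(obs)          # e.g. turn left/right, move forward, or stop
        if action == "stop":
            break
        sim.step(action)                 # move to an adjacent viewpoint / rotate camera

    return sim.current_pose()            # success is judged by distance to the goal
```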

In the simulator, the embodied agent takes advantage of the panoramic views to virtually “move” through the scene; in contrast to previous benchmarks, R2R thus lets the agent both move and control the camera. To build R2R, three navigation instructions per trajectory were collected via Amazon Mechanical Turk in a time-consuming process. Instructions average 29 words (much longer than typical VQA questions), and trajectories average roughly 10 m in length. Finally, a sequence-to-sequence model was proposed, similar to models used for VQA, combining ResNet-152 image features, LSTMs, and an attention mechanism in which the decoder attends over the encoded instruction. The LSTM encoder encodes the instruction tokens, and the LSTM decoder produces a sequence of actions to take in the environment while keeping track of the agent’s traversal history. At every time step, the model receives a new visual observation.
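
Below is a rough PyTorch-style sketch of how such an attention-based sequence-to-sequence agent can be wired together. The layer sizes, the dot-product attention form, and the way image features are fused with the previous action are simplifying assumptions, not the paper's exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified sketch of an attention-based seq2seq navigation agent.
# Dimensions and the exact attention form are assumptions, not the paper's values.

class Seq2SeqAgent(nn.Module):
    def __init__(self, vocab_size, n_actions, hidden=512, img_feat=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 256)
        self.encoder = nn.LSTM(256, hidden, batch_first=True)   # encodes the instruction
        self.action_embed = nn.Embedding(n_actions, 32)
        self.decoder = nn.LSTMCell(img_feat + 32, hidden)        # one step per action
        self.attn = nn.Linear(hidden, hidden)                    # dot-product attention projection
        self.policy = nn.Linear(hidden * 2, n_actions)           # action logits

    def encode(self, instr_tokens):
        ctx, (h, c) = self.encoder(self.embed(instr_tokens))
        return ctx, h.squeeze(0), c.squeeze(0)

    def step(self, img_feat, prev_action, h, c, ctx):
        # One decoding step: fuse the current visual feature with the previous action,
        # update the decoder state, then attend over the encoded instruction.
        x = torch.cat([img_feat, self.action_embed(prev_action)], dim=-1)
        h, c = self.decoder(x, (h, c))
        scores = torch.bmm(ctx, self.attn(h).unsqueeze(2)).squeeze(2)
        weights = F.softmax(scores, dim=-1)
        attended = torch.bmm(weights.unsqueeze(1), ctx).squeeze(1)
        logits = self.policy(torch.cat([h, attended], dim=-1))
        return logits, h, c
```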

During training, the model is supervised to predict the action that the shortest path would take from the current state. In addition, the authors experimented with “teacher-forcing”, where the ground-truth (shortest-path) action is passed as the next input to the decoder, and “student-forcing”, where the next action is sampled from the decoder’s predicted probability distribution and executed.
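
The practical difference between the two regimes is simply which action gets executed and fed back into the decoder at each step. A minimal sketch of that choice follows; the function name and interface are mine, not the authors' code.

```python
import torch
import torch.nn.functional as F

def choose_next_action(logits, gt_action, mode):
    """Pick the action to execute and feed back at the next decoding step.

    logits    : (batch, n_actions) unnormalized scores from the decoder
    gt_action : (batch,) shortest-path ("teacher") action for the current state
    mode      : "teacher" or "student" forcing
    """
    if mode == "teacher":
        # Teacher-forcing: always follow the ground-truth shortest-path action.
        return gt_action
    # Student-forcing: sample from the model's own predicted distribution,
    # so the agent is exposed to (and learns to recover from) its own mistakes.
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(1)

# In both regimes the loss is the cross-entropy against the shortest-path action:
# loss = F.cross_entropy(logits, gt_action)
```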

Conclusion

One limitation the paper mentions stems from its choice of the Matterport3D dataset, which comprises clean, tidy scenes of mostly luxurious interiors with hardly any moving objects such as humans or animals.

The simulator could be extended to incorporate depth information so that the agent can learn a semantic depth map of the environment. Still, the dataset is very commendable: it offers real-world imagery with rich visual context, which is important for preventing overfitting. Another implicit limitation is that the language model currently supports only English instructions, which is inconvenient for non-English speakers. I expect future work to incorporate more powerful language models into the VLN task to address this.

Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments

Motivation

Focusing our discussion on Vision-and-Language Navigation (VLN), the existence and common usage of the nav-graph imply the following assumptions:

  • Known topology: the connectivity graph of navigable viewpoints is assumed to be given in advance.
  • Oracle navigation: moving between adjacent viewpoints is an error-free, instantaneous hop handled by an oracle.
  • Perfect localization: the agent always knows exactly which node of the graph it occupies.

=> Taken together, these assumptions make current settings poor reflections of the real world both in terms of control (ignoring actuation, navigation, and localization error) and visual stimuli (lacking the poor framing and long observation sequences agents will encounter). In essence, the problem is reduced to that of visually-guided graph search. As such, closing the loop by transferring these trained agents to physical robotic platforms has not been examined.

=> Vision-and-Language Navigation in Continuous Environments. In this work, we focus on the Vision-and-Language Navigation (VLN) task and lift these implicit assumptions by instantiating it in continuous 3D environments.

Contribution

  • Lift the VLN task to continuous 3D environments – removing many unrealistic assumptions imposed by the nav-graph-based representation.
  • Develop model architectures for the VLN-CE task and evaluate a suite of single-input ablations to assess the biases and baselines of the setting.
  • Investigate how a number of popular techniques in VLN transfer to this more challenging long-horizon setting – identifying significant gaps in performance.

Method

VLN in Continuous Environments (VLN-CE)

Given a natural language navigation instruction, an agent must navigate from a start position to the described goal in a continuous 3D environment by executing a sequence of low-level actions based on egocentric perception alone.
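
Concretely, the low-level action space is a small, fixed set of parameterized motions (the paper uses a 0.25 m forward step, 15° turns, and a stop action). A small sketch of how this could be encoded; the enum and helper below are illustrative, not the simulator's actual API.

```python
from enum import Enum

# Sketch of the VLN-CE low-level action space: a 0.25 m forward step,
# 15-degree turns, and an explicit stop action that ends the episode.
# The enum and helper are illustrative, not the simulator's actual API.

class Action(Enum):
    STOP = 0
    MOVE_FORWARD = 1   # translate 0.25 m along the current heading
    TURN_LEFT = 2      # rotate 15 degrees counter-clockwise
    TURN_RIGHT = 3     # rotate 15 degrees clockwise

FORWARD_STEP_M = 0.25
TURN_ANGLE_DEG = 15.0

def is_episode_over(action: Action) -> bool:
    """The episode ends only when the agent itself predicts STOP."""
    return action is Action.STOP
```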

VLN-CE Dataset

In total, the VLN-CE dataset consists of 4475 trajectories converted from the R2R train and validation splits. For each trajectory, we provide the multiple R2R instructions and a pre-computed shortest path that follows the waypoints via low-level actions.
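
The pre-computed low-level paths can be thought of as the output of a simple follower that repeatedly rotates toward the next waypoint and steps forward until it is reached. The sketch below is a simplified 2D illustration of that idea (ignoring height changes and collisions), not the actual dataset-generation code.

```python
import math

# Simplified 2D sketch of turning a waypoint path into low-level actions:
# rotate toward the next waypoint in 15-degree increments, then step forward
# in 0.25 m increments until it is reached. Height changes and collisions
# are ignored; this is not the actual dataset-generation code.

def waypoints_to_actions(waypoints, start_heading=0.0, step=0.25, turn=15.0):
    """Greedily convert a list of (x, y) waypoints into discrete low-level actions."""
    actions, heading = [], start_heading
    x, y = waypoints[0]
    for gx, gy in waypoints[1:]:
        while math.hypot(gx - x, gy - y) > step:
            bearing = math.degrees(math.atan2(gy - y, gx - x))
            diff = (bearing - heading + 180.0) % 360.0 - 180.0
            if abs(diff) > turn / 2:
                # Rotate toward the waypoint in fixed 15-degree increments.
                actions.append("TURN_LEFT" if diff > 0 else "TURN_RIGHT")
                heading += turn if diff > 0 else -turn
            else:
                # Step forward 0.25 m along the current heading.
                actions.append("MOVE_FORWARD")
                x += step * math.cos(math.radians(heading))
                y += step * math.sin(math.radians(heading))
    actions.append("STOP")
    return actions
```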

Instruction-guided Navigation Models in VLN-CE


Experiments

Conclusion

In this work, we explore the problem of following navigation instructions in continuous environments with low-level actions – lifting many of the unrealistic assumptions in prior nav-graph-based settings. Our work lays the groundwork for future research into reducing the gap between simulation and reality for VLN agents. Crucially, setting our VLN-CE task in continuous environments (rather than a nav-graph) provides the community a testbed where integrative experiments studying the interface of high- and low-level control are possible.

ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments

Motivation

Contribution

Method

Experiments

Conclusion
