Creating a Navigation Agent for the Visually Impaired

Step 1: Build a Sidewalk Simulation Environment

As part of the NAVI project, we have released our dataset and environment, SEVN (Sidewalk Simulation Environment for Visual Navigation). We hope that this work will enable future research in visual navigation and, in the long term, help the blind and visually impaired gain autonomy when commuting.

For a moment, imagine you are about to lose your sight to a degenerative illness. You must find a rehabilitation center and get training to use a cane, learn to love Siri, and buy an Alexa. You’re going to have to learn how to be independent all over again. Unfortunately, there are some compromises you’ll have to make. Take this example.

You’re at home on a Friday evening when you get a text inviting you to a party. But venturing to a new location is tough without the help of a sighted person. Sure, you can get out of the house, and Uber can take you most of the way, but then what? There are a lot of tools out there that can help, but most of them rely on GPS and can’t actually get you all the way to the venue. The perfect end-to-end solution does not yet exist.

That’s why we’re building a tool called NAVI, the Navigation Agent for the Visually Impaired. NAVI uses GPS and vision – even reading scene text like house numbers and street signs – to dynamically navigate!

There are three primary processes that make up the NAVI system: sensing, perception, and reasoning.

These are the processes that take place, usually subconsciously, for people who are not visually impaired. Our final goal is to recreate these processes in the NAVI technology. But before we can successfully develop this solution, there are three key components that have yet to be tackled:

  1. Create a navigation environment built from real-world data that simulates sensor signals
  2. Train an RL agent to navigate that environment
  3. Transfer that agent to an edge device for real-world use

We are committed to developing these systems as free, open-source software. Any future deployed NAVI system will not upload your location to a server or track your behaviour in any way.

In our work, SEVN: A Sidewalk Simulation Environment for Visual Navigation, we present an RL environment for training agents and set a baseline on the navigation task using the Proximal Policy Optimization (PPO) algorithm.
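
To give a sense of what that baseline looks like in practice, here is a minimal training sketch that uses Stable-Baselines3 as an illustrative stand-in for the PPO implementation we used in the paper; the policy choice and output path are assumptions, and a dict-style observation would require "MultiInputPolicy" instead of "CnnPolicy".

import gym
import SEVN_gym  # registers the SEVN environments with Gym
from stable_baselines3 import PPO  # illustrative stand-in, not the paper's training code

env = gym.make("SEVN-Mini-All-Shaped-v1")
model = PPO("CnnPolicy", env, verbose=1)  # "MultiInputPolicy" if the observation is a dict
model.learn(total_timesteps=1_000_000)    # train for one million environment steps
model.save("sevn_ppo_baseline")           # output path is a placeholder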

A Sidewalk Simulation Environment for Visual Navigation

Real World Data

Our dataset contains over 5,000 panoramic images taken in Montreal’s Little Italy. They cover 7 blocks, totaling 6.3 km of sidewalk. We annotated doors, house numbers, and street signs, and provided ground truth text labels for each annotation.
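
To make the annotation format concrete, a single record could be represented roughly as follows; the field names below are hypothetical and illustrative only, not the exact schema of the released files.

# Hypothetical annotation record (field names are illustrative, not the released schema).
annotation = {
    "frame_id": 1042,                # panorama the annotation appears in
    "obj_type": "house_number",      # one of: door, house_number, street_sign
    "label": "6789",                 # ground truth text for the annotation
    "bbox": [1510, 820, 1585, 870],  # pixel coordinates in the equirectangular image
}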

An OpenAI Gym environment

Our Sidewalk Simulation Environment for Visual Navigation (SEVN) is a realistic representation of what it is like to travel along the sidewalks of a typical urban neighborhood. The goal is to use this simulated environment to train Reinforcement Learning agents that will eventually be able to solve this task and transfer to the real world.

The simulator allows the agent to observe part of a specific panorama based on its position and orientation, and to navigate the environment by rotating or transitioning to neighboring panoramas. Several tasks are available; they use different sensor modalities and present challenges of varying difficulty.
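
Conceptually, each observation is a crop of the current panorama centred on the agent’s heading. The sketch below illustrates that idea only; it is not the simulator’s actual implementation, and the 135 degree field of view is an example parameter.

import numpy as np

def crop_panorama(pano, heading_deg, fov_deg=135):
    # pano: H x W x 3 equirectangular image; heading_deg: agent orientation in degrees.
    height, width, _ = pano.shape
    center = int((heading_deg % 360) / 360.0 * width)       # column facing the heading
    half = int(fov_deg / 360.0 * width) // 2                 # half the field of view, in columns
    cols = np.arange(center - half, center + half) % width   # wrap around the 360 degree seam
    return pano[:, cols, :]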

We chose Montreal’s Little Italy because it features a wide variety of zones and other characteristics that make it a challenging setting for navigation. In a relatively small area, there are many different types of zones, including residential, commercial, and industrial areas.

How we built it

To build our environment, we collected raw video with a 360 degree Vuze+ camera mounted on a monopod and held slightly above eye height. The Vuze+ is equipped with four synchronized stereo cameras, each composed of two image sensors with fisheye lenses. We captured 360 degree high-definition video, from which we extracted 30 frames per second.
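
For the frame extraction step, a straightforward approach with OpenCV looks like the sketch below; the file paths are placeholders, not the ones we used.

import cv2

cap = cv2.VideoCapture("raw_vuze_footage.mp4")  # placeholder path to the raw footage
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:  # stop when the video ends
        break
    cv2.imwrite(f"frames/{frame_idx:06d}.png", frame)  # placeholder output directory
    frame_idx += 1
cap.release()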

To get an accurate localisation for each of our frames, we used ORB-SLAM2, a Simultaneous Localisation and Mapping pipeline. This allowed us to triangulate the position of the camera at every timestep, relying only on visual features from our footage.
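
ORB-SLAM2 can export the estimated camera trajectory in the TUM format, with one "timestamp tx ty tz qx qy qz qw" line per pose. A minimal sketch for loading those positions, assuming that export format and a placeholder file name, might look like this:

def load_tum_trajectory(path):
    # Parse a TUM-format trajectory file: "timestamp tx ty tz qx qy qz qw" per line.
    poses = {}
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            values = line.split()
            timestamp = float(values[0])
            tx, ty, tz = map(float, values[1:4])  # keep only the translation here
            poses[timestamp] = (tx, ty, tz)
    return poses

poses = load_tum_trajectory("CameraTrajectory.txt")  # file name is an example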

In parallel, we used the VuzeVR Studio software to stitch together the raw footage from each camera and obtain a stabilized 360 degree video, from which we extracted 3840 x 2160 pixel equirectangular projections.

Finally, we used the coordinates of each image to geo-localise the 360 degree equirectangular panoramas. We then constructed a graph from these panoramas, in which an agent can navigate by transitioning between neighboring images. This results in a Google Street View–like environment. For a more detailed explanation of the process, take a look at our paper and our code.
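
The graph construction itself is conceptually simple: connect panoramas whose camera positions are close enough that an agent could step between them. The sketch below uses networkx and an illustrative distance threshold; it is a simplification of what the environment actually does.

import networkx as nx
import numpy as np

def build_pano_graph(positions, max_dist=2.0):
    # positions: dict mapping panorama id -> (x, y) coordinate in metres.
    # max_dist is an illustrative threshold, not the one used to build SEVN.
    graph = nx.Graph()
    graph.add_nodes_from(positions)
    ids = list(positions)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if np.linalg.norm(np.subtract(positions[a], positions[b])) <= max_dist:
                graph.add_edge(a, b)  # the agent can transition between a and b
    return graph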

How to use it

Our environment, SEVN, is compatible with OpenAI Gym and quite easy to use. Here is an example:

import gym
import SEVN_gym

env = gym.make("SEVN-Mini-All-Shaped-v1")
obs = env.reset() # See obs 1

In SEVN-Mini-All-Shaped, the agent has access to all of the available modalities, which are provided through its observation.

action = env.Actions['FORWARD']
obs, rew, done, info = env.step(action) # See obs 2

This environment features a shaped reward: the agent gets a positive reward every time it gets closer to its goal, and a negative reward when it moves away from its goal.
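
To make the shaping concrete, the idea is that the reward compares the remaining distance to the goal before and after a move. This is a minimal sketch of that idea, not the environment’s exact reward function.

def shaped_reward(dist_before, dist_after, scale=1.0):
    # Positive when the agent moved closer to the goal, negative when it moved away.
    # Illustrative sketch only; the real environment computes distances over the panorama graph.
    return scale * (dist_before - dist_after)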

action = env.Actions['BIG_RIGHT']
obs, rew, done, info = env.step(action) # See obs 2
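
Putting the pieces together, a full episode can be run with the usual Gym loop. This sketch just samples random actions until the episode ends, which is a quick way to exercise the environment:

obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()        # random policy, just to exercise the loop
    obs, rew, done, info = env.step(action)
    total_reward += rew
print("episode return:", total_reward)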

We hope that our environment will contribute to progress on state-of-the-art visual navigation models and serve as a stepping stone toward a system that solves this problem for the blind and visually impaired.