Humans estimate depth in a scene with ease, but what about machines? Robots and self-driving cars typically use LiDAR sensors to measure the depth of a scene. However, LiDAR is an expensive sensor that is out of reach for many personal vehicles. Its cost may be justifiable for robo-taxi business models that provide service across a city, but not for personal vehicles. As a result, some companies use camera-only approaches that infer depth information from monocular images. In this post, I will discuss some of the state-of-the-art approaches to monocular depth estimation.

Depth estimation approaches generally fall into three groups:

Geometry-based methods: Geometric constraints are used to recover 3D structure from images. In 3D reconstruction and Simultaneous Localization and Mapping (SLAM), structure from motion (SfM) is an effective method for estimating 3D structure from a sequence of 2D images. Its depth accuracy depends heavily on exact feature matching and high-quality image sequences, and it also suffers from monocular scale ambiguity. Stereo vision matching can likewise recover the 3D structure of a scene, in this case from two viewpoints. It mimics the way human eyes work by using two cameras, and a cost function is used to compute a disparity map between the two images. Because the transformation between the two cameras is calibrated, the scale factor is incorporated into the depth estimate during stereo matching.
Sensor-based methods: These approaches measure depth directly with sensors such as RGB-D cameras and LiDAR. They have several disadvantages, including cost, power consumption, and sensor size.
Deep learning-based methods: A deep network can recover a pixel-level depth map from a single image in an end-to-end manner. Training can be supervised, semi-supervised, or self-supervised.
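To make the stereo point in the geometry-based item concrete: for a rectified stereo pair, depth follows from disparity as depth = focal length × baseline / disparity, which is why a calibrated baseline fixes the scale. The snippet below is a minimal sketch of that conversion, not code from any particular library. The function name, the parameter names, and the example focal length and baseline are assumptions chosen for illustration.

```python
import numpy as np


def disparity_to_depth(disparity, focal_length_px, baseline_m, min_disparity=1e-6):
    """Convert a disparity map (pixels) to depth (meters) for a rectified stereo pair.

    Hypothetical helper for illustration: depth = f * B / d.
    Pixels with (near-)zero disparity are treated as infinitely far away.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > min_disparity
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth


# Illustrative values only: a 720 px focal length and a 0.5 m baseline.
disparity = np.array([[64.0, 32.0],
                      [16.0, 0.0]])
depth = disparity_to_depth(disparity, focal_length_px=720.0, baseline_m=0.5)
print(depth)
```

Halving the disparity doubles the depth, so small disparity errors on distant objects translate into large depth errors.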

In this post, we consider only self-supervised methods, in which the geometric constraints between frames serve as the supervisory signal during training. Self-supervised depth estimation from images comes in several flavors, such as training on stereo pairs and training on monocular videos. Methods based on monocular videos present their own challenges. Along with depth, the model must also estimate the ego-motion between pairs of temporally adjacent frames during training. This involves training a pose estimation network, which takes a finite sequence of frames as input and outputs the corresponding camera transformations. Stereo data, in contrast, reduces camera-pose estimation to a one-time offline calibration, but may introduce occlusion and texture-copy artifacts.
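The geometric constraint that acts as the supervisory signal can be sketched as view synthesis: each pixel of the target frame is back-projected with the predicted depth, moved into the source camera with the predicted pose, and projected back into the source image, and the two images are then compared at the matched locations. The code below is a simplified NumPy sketch under stated assumptions, not the implementation of any specific paper: the function names are hypothetical, the pose is assumed to be a 4x4 matrix mapping target-camera points into the source camera, the intrinsics K are assumed known, and nearest-neighbor sampling stands in for the differentiable bilinear sampling that a trainable model would need.

```python
import numpy as np


def reproject(depth, K, T_target_to_source):
    """Return source-image pixel coordinates (2, H, W) for every target pixel."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Homogeneous pixel coordinates, one column per pixel: shape (3, H*W).
    pixels = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)
    # Back-project to 3D points in the target camera frame.
    points = (np.linalg.inv(K) @ pixels) * depth.reshape(1, -1)
    points_h = np.vstack([points, np.ones((1, points.shape[1]))])
    # Move the points into the source camera frame and project them.
    points_src = (T_target_to_source @ points_h)[:3]
    proj = K @ points_src
    uv = proj[:2] / proj[2:3]
    return uv.reshape(2, h, w)


def photometric_l1(target, source, uv):
    """Mean absolute difference between target and the source sampled at uv."""
    h, w = target.shape
    u = np.rint(uv[0]).astype(int)
    v = np.rint(uv[1]).astype(int)
    # Ignore target pixels that land outside the source image.
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    if not inside.any():
        return 0.0
    return float(np.abs(source[v[inside], u[inside]] - target[inside]).mean())


# Toy example: a flat scene 2 m away and made-up intrinsics.
K = np.array([[100.0, 0.0, 3.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
depth = np.full((4, 6), 2.0)
image = np.arange(24, dtype=np.float64).reshape(4, 6)

uv_identity = reproject(depth, K, np.eye(4))
loss_identity = photometric_l1(image, image, uv_identity)

T_shift = np.eye(4)
T_shift[0, 3] = 0.1  # points move 0.1 m along x in the source camera frame
uv_shift = reproject(depth, K, T_shift)
```

With the identity pose every pixel maps onto itself and the loss is zero. With the 0.1 m shift, each pixel moves by f * t_x / Z = 100 * 0.1 / 2 = 5 pixels along u, which shows how depth and pose jointly determine the warp that the photometric loss evaluates.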

One interesting use case for depth estimation is as an auxiliary task for end-to-end policy learning. It can lead to better representation learning and help the policy learn about the geometry and depth of the scene. Other tasks, such as optical flow, semantic segmentation, object detection, and motion prediction, can also be used to improve representation learning. For example, the following image shows a model from Wayve.ai, a self-driving car company in the UK working on end-to-end autonomous driving, which uses multi-task learning to improve both representation learning and driving policy learning.

Unsupervised Learning of Depth and Ego-Motion from Video

Digging Into Self-Supervised Monocular Depth Estimation