Depth map
Flat goes vertical

⌛  ~1 h 🤓  Intermediate
17.01.2024
#90

🎓 106/167

This post is a part of the Computer vision educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page, while in Research they can appear in arbitrary order.

I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of completely different quality, with more theoretical depth and niche focus, and will feature challenging projects, quizzes, exercises, video lectures and supplementary materials. Stay tuned!


I want to begin our in-depth discussion of depth maps by first emphasizing that, in modern computer vision, robotics, augmented reality, and more, a depth map is far more than a mere 2D image. It can be viewed as a critical piece of geometric information: instead of storing brightness or color at each pixel, it encodes the distance from the camera (or sensor) to the corresponding point in the scene. Put differently, a depth map is an image whose pixel intensities encode the relative (and sometimes absolute) depth values with respect to a viewpoint.

This concept of representing spatial information has steadily gained traction in recent years. Thanks to advances in sensor technology, hardware accelerators, as well as algorithmic and architectural improvements in machine learning (including deep neural networks), depth maps today are widely used in a variety of applications ranging from environment understanding in robotics to 3D scene reconstruction in augmented and virtual reality.

One might wonder how depth maps differ from the familiar notion of a 2D image. The fundamental difference is that a standard 2D image provides primarily color or intensity values for each pixel. A depth map, instead, makes each pixel represent some measure of distance between the scene point and the camera or sensor plane. This additional dimension of geometric insight provides the basis for reconstructing full or partial 3D representations, performing advanced occlusion reasoning, and merging multiple viewpoints for immersive experiences.

In some contexts, you will see other common terms: "disparity map", "distance map", or even "range image". While these are not precisely identical in all scenarios — for instance, a disparity map typically refers to the offset between pixels in left-right images used for stereo-based 3D reconstruction — they share the central idea that each pixel or location encodes spatial distance. Sometimes, these terms are used somewhat interchangeably in everyday computer vision parlance.

The rest of this article aims to provide a thorough, theoretically grounded tour of depth maps. I will explore the historical evolution of the concept, discuss different ways of generating depth maps, review both hardware-based and purely algorithmic approaches, and consider best practices and applications in data science and machine learning. Along the way, I plan to highlight relevant research trends, significant references, and occasionally show code snippets for hands-on illustration. Let's begin by pinning down a precise definition and a bit of background.


Definition and background

What is a depth map?

By definition, a depth map is a 2D array (often with the same dimensional layout as a standard RGB image) in which each element corresponds to the distance between the camera plane (or sensor plane) and the surface of the object or environment being observed. If we denote the depth map by $D$, then $D(u, v)$ is the depth value at pixel coordinates $u$ (horizontal index) and $v$ (vertical index). Often, these pixel coordinates correspond in a direct or near-direct way to those of an RGB image taken at the same viewpoint.

In many practical scenarios, you may encounter normalized versions of these maps, where the raw distance is scaled, clipped, or reprojected into a smaller integer or floating-point range (for instance, from 0 to 255 for 8-bit channels or 0.0 to 1.0 for floating-point channels). Other contexts require the full range of distances: for example, in some LiDAR-based systems, distances might go up to several hundred meters, necessitating specialized data types or encodings.
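As a minimal sketch, assuming a floating-point depth map in meters and an arbitrarily chosen working range, the rescaling described above could look like this:

import numpy as np

def normalize_depth(depth_m, min_m=0.3, max_m=10.0):
    # Clip raw metric depth (meters) to a working range, then rescale
    # to [0.0, 1.0] floats and to 8-bit integers for visualization/storage.
    clipped = np.clip(depth_m, min_m, max_m)
    depth_01 = (clipped - min_m) / (max_m - min_m)    # 0.0 .. 1.0
    depth_8bit = (depth_01 * 255).astype(np.uint8)    # 0 .. 255
    return depth_01, depth_8bit

# Example with synthetic data:
# depth_m = np.random.uniform(0.5, 8.0, size=(480, 640)).astype(np.float32)
# depth_01, depth_8bit = normalize_depth(depth_m)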

Whether measured by hardware (e.g. time-of-flight sensors, structured light sensors, stereo rigs) or estimated by algorithms (such as deep learning approaches trained to predict depth from a single image), the underlying concept remains: to produce a spatial representation capturing the distance for each pixel. Historically, depth maps were often stored as what is known as a "range image," especially in fields like robotics and 3D scanning. This range image is effectively identical in concept to a depth map: an image whose per-pixel intensity encodes radial distance.

Historical context and evolution

Depth sensing and 3D imaging can trace their roots back to early photogrammetry, which used pairs of images (stereo photography) to measure 3D structure. With the surge in computer vision research beginning in the 1970s and 1980s, researchers started formalizing methods such as epipolar geometry, triangulation, and structured light to automate the process of extracting 3D information. The essential idea was already known: you can infer distance if you have enough constraints (e.g. two slightly offset views of the same object).

For a long time, only specialized industrial or scientific applications used these methods, because they required careful calibration, specialized hardware, or algorithms too computationally intensive to run in real time. In the early 2000s, interest in consumer 3D sensors and cameras was stirred by the possibility of real-time structured light scanning. With the advent of devices like the Microsoft Kinect — which used an infrared projector and IR camera to estimate depth — such sensors became widely available to the general public. This democratization of hardware triggered a wave of new research in gesture recognition, augmented reality, gaming, and robotic perception.

Around the mid-2010s, progress in deep learning began to significantly impact depth map estimation, particularly from single images. Techniques based on convolutional neural networks (CNNs) showed early promise by effectively learning disparity or depth from large annotated datasets. Soon, even more sophisticated architectures such as generative adversarial networks (GANs), capsule networks, and specialized 3D convolutional networks were developed. Researchers also found ways to perform unsupervised or self-supervised learning of depth from monocular video, harnessing epipolar constraints or image reconstruction losses without explicit ground truth depth.

Principles of 3D perception

Whether the system is hardware-based (e.g. LiDAR, time-of-flight, stereo cameras) or algorithmic (structure-from-motion, deep learning from single images), all of these approaches revolve around a few fundamental principles of 3D perception:

  1. Geometry and Triangulation
    By measuring angles, parallax, or shifts in pixel locations as you move or as you compare multiple viewpoints, it is possible to use geometric rules of projection to infer distance. Stereo-based methods, structure-from-motion, and some deep neural approaches rely on these geometry-based cues.

  2. Active Sensing
    By actively illuminating a scene with known patterns of light (structured light) or measuring the travel time of emitted signals (time-of-flight, LiDAR), you can directly obtain an estimate of distance. These methods typically do not rely on texture or color features in the same way that purely passive approaches do.

  3. Appearance-Based or Learned Approaches
    By training a model on large amounts of data for which the ground truth depth (or disparity) is known, the model can learn to predict depth purely from intensity or color cues, leveraging prior knowledge learned from examples.

All these approaches are valid ways to generate or refine depth maps. In practice, you might see hybrid techniques that combine active and passive sensing or that fuse multiple hardware streams.


Types

Although there are many ways to categorize depth maps and their creation methods, I find it most intuitive to group them based on the underlying sensor or conceptual approach. Let's outline the main categories:

Stereo-based depth maps

A stereo camera setup consists of at least two cameras or lenses separated by a known baseline. By capturing two images simultaneously, one can exploit the slight difference in viewpoints — known as parallax — to determine how the same 3D point projects onto each camera.

  1. Rectification
    Often, stereo images need to be preprocessed (rectified) so that corresponding points lie on horizontal lines, known as epipolar lines. This simplifies the matching problem: for each pixel $(x_0, y_0)$ in the left image, you only need to search along the same row in the right image for the best match and compute the disparity $d$. The depth $Z$ at that pixel is inversely related to the disparity:

    $$Z = \frac{f \cdot B}{d},$$

    where $f$ is the focal length of the cameras, $B$ is the baseline distance (i.e., the separation between the two camera centers), and $d$ is the measured disparity. A small numeric sketch of this relation appears at the end of this subsection.

  2. Matching Methods
    Disparity estimation can be done by block matching, correlation windows, or more advanced feature-based or deep learning–based approaches. Then, the disparity map is typically converted to a depth map.

This method is considered a passive approach, as it relies on ambient illumination or naturally available light to reveal the scene texture. However, it can fail in regions of uniform texture or in the presence of repeated patterns that cause ambiguous matching.
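To make the relation $Z = \frac{f \cdot B}{d}$ concrete, here is a tiny numeric sketch; the focal length, baseline, and disparities below are made-up values purely for illustration:

import numpy as np

# Hypothetical stereo parameters (not from a real calibration)
f_px = 700.0   # focal length in pixels
B_m = 0.12     # baseline in meters

disparities = np.array([70.0, 35.0, 7.0])   # disparities in pixels
depths = f_px * B_m / disparities           # Z = f * B / d

print(depths)  # [ 1.2  2.4 12. ] meters: smaller disparity -> larger depth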

Time-of-flight (ToF) sensors

A time-of-flight camera emits (often infrared) light, then measures how long it takes for the reflected signal to return. By associating this round-trip time with a distance, the device builds a dense depth map in real time.

Many consumer ToF sensors incorporate multiple sampling phases to handle issues such as exposure, ambient lighting, and correlation. The second-generation Kinect (Kinect for Xbox One) is a well-known example of a ToF-based sensor, as are many modern smartphone depth sensors used for face unlocking and AR features.

Unlike stereo approaches, a ToF camera does not require complex matching computations. Instead, it must handle issues such as multipath interference, reflectivity differences, and timing accuracy.

Structured light depth sensors

Structured light sensors project known patterns — stripes, grids, pseudorandom dots, etc. — onto the scene. A camera offset from the projector observes how the pattern deforms across the surfaces. By analyzing these deformations, the system can compute per-point distance. The original Microsoft Kinect used this approach (with an IR projector shining a known dot pattern).

Structured light methods can produce very accurate indoor depth maps, but they are sensitive to bright ambient light (like sunlight), because the projected pattern can become washed out. They also require active projection hardware that has its own constraints and calibration requirements.

LiDAR-based depth maps

LiDAR (Light Detection and Ranging) technology is widely used in autonomous vehicles, surveying, and robotics. A LiDAR device emits pulses of laser light in a scanning pattern across the environment, measuring the reflection times to reconstruct a 3D map. While a LiDAR point cloud is typically the main data product, it can be converted or reprojected to a depth map.

LiDAR point clouds are extremely precise at longer ranges, but generating dense 2D maps from them can be nontrivial (since LiDAR points are often sparser than a typical 2D image's pixel resolution). However, fusing LiDAR with camera data is a popular approach for robust depth estimation, especially in self-driving applications.

Depth from motion (structure-from-motion)

Structure-from-motion (SfM) is a purely passive computational approach that uses multiple images or frames from a moving camera. If you know or can estimate the camera's motion (or have partial constraints on it), you can triangulate scene points. This approach is often called multiview stereo or structure-from-motion:

  • Egomotion Estimation
    The system first determines how the camera moved between consecutive frames. This can be done with classical methods (such as feature matching combined with RANSAC-based estimation of a homography or fundamental matrix) or with deep networks that estimate camera pose.

  • Triangulation
    Once the relative pose is known, you can match features across frames. Using epipolar geometry and triangulation, you infer the 3D position of those matched points. The final step might be to compute a dense depth map by applying dense multiview stereo techniques or by training networks that generate depth and camera motion simultaneously in an unsupervised manner.

Structure-from-motion methods are particularly popular in robotics (e.g. visual odometry and SLAM) where no additional hardware is needed beyond a camera or multiple cameras.
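To make the triangulation step concrete, here is a minimal sketch using OpenCV's cv2.triangulatePoints; the projection matrices and matched points referenced in the usage comments are placeholders you would obtain from calibration, pose estimation, and feature matching:

import cv2
import numpy as np

def triangulate(P1, P2, pts1, pts2):
    # P1, P2: 3x4 projection matrices of the two views (K @ [R | t])
    # pts1, pts2: 2xN float32 arrays of matched pixel coordinates
    points_h = cv2.triangulatePoints(P1, P2, pts1, pts2)  # 4xN homogeneous
    points_3d = (points_h[:3] / points_h[3]).T            # Nx3 Euclidean
    return points_3d

# Example usage (placeholder inputs):
# P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
# P2 = K @ np.hstack([R, t])   # relative pose from egomotion estimation
# points = triangulate(P1.astype(np.float32), P2.astype(np.float32), pts_left, pts_right)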


Generation techniques

Depth maps can be generated by a wide variety of approaches, broadly grouped into active methods (where the system sends out signals or patterns) and passive methods (which rely on the scene's natural illumination or other external means). Then, you can further refine or compute them with specialized algorithms.

Active methods vs. passive methods

  • Active methods: Time-of-flight sensors, structured light sensors, LiDAR. They work in low-texture environments and can produce robust depth data even in the dark (since they actively emit light). However, they usually have higher power consumption and may suffer from interference if multiple devices operate nearby.

  • Passive methods: Stereo-based depth from color images, structure-from-motion using purely visual cues, or deep neural networks that learn depth from single images. These typically rely on scene texture or robust machine learning models. They can operate with standard cameras, which are simpler and cheaper than specialized sensors, but they may fail in textureless or low-light scenes.

Hardware considerations

Before adopting a particular approach to generate depth maps, it's crucial to consider:

  1. Range and Resolution
    Time-of-flight cameras might have excellent resolution for near-range tasks but can degrade in accuracy beyond a few meters. LiDAR is more appropriate for medium to long ranges (tens or hundreds of meters). Meanwhile, stereo cameras can scale across different distances by adjusting focal length and baseline.

  2. Environmental Constraints
    Active sensors (structured light, ToF) can be overwhelmed by strong ambient light. Outdoor usage often demands LiDAR or specialized high-power structured light or IR illuminators.

  3. Computational Requirements
    Some methods produce depth in real time (e.g. hardware-based ToF or stereo with specialized accelerators), while others (e.g. certain offline structure-from-motion pipelines) can be more computationally expensive.

Common algorithms and workflows

Stereo Matching
A typical stereo matching pipeline includes camera calibration, rectification, block matching or correlation-based matching to find disparities, and then filtering of the resulting disparity map.

Deep Learning–Based Single-Image Depth Estimation
Many modern approaches rely on convolutional neural networks to predict depth directly from a single RGB image. A well-cited example is the work by Eigen et al. (NeurIPS 2014) using a multi-scale CNN architecture. They introduced a loss function that penalizes differences between predicted log-depth and ground truth log-depth. Another example is the approach by Godard et al. (CVPR 2017) that leverages stereo pairs in a self-supervised manner.
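As a rough illustration of such a log-depth loss, here is a small NumPy sketch in the spirit of Eigen et al.'s scale-invariant objective; the balancing weight lam and the epsilon are illustrative choices, not values quoted from the paper:

import numpy as np

def scale_invariant_log_loss(pred, gt, lam=0.5, eps=1e-6):
    # Compare log-depths on valid pixels only (zeros mark missing ground truth)
    valid = gt > 0
    d = np.log(pred[valid] + eps) - np.log(gt[valid] + eps)
    # Penalize log-depth errors while partially forgiving a global scale shift
    return np.mean(d ** 2) - lam * np.mean(d) ** 2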

Depth from Video
In unsupervised or self-supervised settings, you can feed pairs (or sequences) of images from a monocular camera into a network that simultaneously learns to predict depth and camera motion. Zhou et al. (CVPR 2017) introduced a method that uses an image reconstruction loss derived from projecting one image to another viewpoint using the predicted depth and pose.

Key steps in data preprocessing

Regardless of the method, there are critical data preprocessing steps:

  1. Calibration
    For stereo and structure-from-motion approaches, calibrating intrinsic parameters (focal length, principal point) and extrinsic parameters (relative rotation, translation between cameras) is essential.

  2. Rectification
    Stereo imagery typically requires rectification so that epipolar lines align horizontally. This drastically simplifies the matching search problem.

  3. Normalization and Rescaling
    If your system outputs raw distances (like a LiDAR or ToF sensor), you might want to scale them to a standard range or correct for lens distortions, sensor noise, or offset biases.

  4. Outlier Removal
    Depth sensing can be noisy — removing outliers or replacing them with reasonable estimates might be necessary (see the sketch after this list).

  5. Fusing Multiple Data Sources
    Many real-world setups fuse multiple data sources: e.g., LiDAR + cameras or multiple camera viewpoints. This can involve point cloud registration, reprojection to a common reference, and merging partial depth maps into a global representation.
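As referenced in the list above, here is a rough preprocessing sketch that invalidates out-of-range returns and suppresses isolated outliers with a local median; the range limits and threshold are example values, not sensor specifications:

import numpy as np
from scipy.ndimage import median_filter

def clean_depth(depth_m, min_m=0.2, max_m=50.0, outlier_thresh=0.5):
    depth = depth_m.astype(np.float32)
    # 1. Invalidate returns outside a plausible working range
    depth[(depth < min_m) | (depth > max_m)] = np.nan
    # 2. Invalidate pixels that deviate strongly from their local median
    local_median = median_filter(np.nan_to_num(depth), size=5)
    outliers = np.abs(depth - local_median) > outlier_thresh
    depth[outliers] = np.nan
    return depth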

Integration with other sensor data

Depth maps are frequently integrated with other sensor data such as:

  • RGB images
    In robotics or autonomous driving, an RGB-D camera provides color data plus an aligned depth map. This helps with tasks like semantic segmentation in 3D or object detection with distance estimates.

  • Inertial measurements (IMU)
    For improved pose estimation in structure-from-motion or SLAM, combining inertial data can help refine or stabilize depth estimates.

  • GPS
    Large-scale mapping or surveying might incorporate GPS data to anchor depth maps or point clouds to global coordinates, especially important in geospatial analytics.


Depth map processing

Once a depth map is generated, it often needs to be refined or combined with other data. Depth signals can be noisy or incomplete, leading to holes and artifacts. This section highlights several common depth map processing tasks.

Noise reduction and filtering

Many sensors produce depth measurements with inherent noise. This might be due to hardware limitations (e.g. phase noise in ToF cameras), motion blur, or algorithmic errors (e.g. stereo matching outliers). Some standard filters, illustrated with a short OpenCV sketch after this list, include:

  • Median filtering
    A simple, local approach that can remove small salt-and-pepper–type noise. However, it can blur sharp depth edges.

  • Bilateral filtering
    This filter respects edges by weighting pixels based on both spatial proximity and intensity (or in this case, depth) differences.

  • Guided filtering
    A guided filter can be used to refine the depth map using a high-resolution intensity image as guidance, preserving edges.
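Here is the short OpenCV sketch mentioned above, applying the first two filters to a float32 depth map; kernel sizes and sigmas are illustrative defaults rather than recommendations:

import cv2

def filter_depth(depth32f):
    # depth32f: single-channel float32 depth map (e.g. in meters)
    # Median filter removes isolated speckles (ksize must be 3 or 5 for float32)
    median = cv2.medianBlur(depth32f, 5)
    # Bilateral filter smooths while preserving depth discontinuities;
    # sigmaColor is expressed in depth units here
    smoothed = cv2.bilateralFilter(median, d=5, sigmaColor=0.1, sigmaSpace=5)
    return smoothed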

Post-processing and smoothing

To improve the overall smoothness and consistency of the depth map, while keeping important edges crisp, more advanced approaches might use:

  • Global energy minimization
    For instance, one can define an energy function that penalizes large discontinuities in regions expected to be smooth, subject to constraints from the original depth measurements. Methods such as Markov Random Fields (MRFs) or graph cuts can solve these global optimization problems.

  • CNN-based refinement
    Some modern pipelines feed an initial (potentially noisy) depth map into a refinement CNN that has been trained on examples of noisy vs. ground truth depth.

Depth map fusion

Depth map fusion refers to the process of merging depth information from multiple views or sensors into a unified representation. In 3D reconstruction tasks, multiple partial depth maps from different camera viewpoints can be integrated into a single volumetric or surface representation (e.g. truncated signed distance function, or TSDF). The idea is to accumulate evidence from many frames or vantage points in order to fill gaps and reduce noise.
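The core of this fusion is a per-voxel weighted running average of truncated signed distances. Below is a minimal sketch of just that update rule, assuming the signed distance observed from a new depth map has already been computed per voxel (a real pipeline such as KinectFusion would also handle projecting voxels into each depth image); the truncation distance and weight cap are arbitrary example values:

import numpy as np

def tsdf_update(tsdf, weight, sdf_obs, trunc=0.05, max_weight=50.0):
    # tsdf, weight: current per-voxel values
    # sdf_obs: signed distance observed from the new depth map
    #          (measured surface depth minus voxel depth along the camera ray)
    d = np.clip(sdf_obs / trunc, -1.0, 1.0)           # truncate and normalize
    new_tsdf = (weight * tsdf + d) / (weight + 1.0)   # weighted running average
    new_weight = np.minimum(weight + 1.0, max_weight)
    return new_tsdf, new_weight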

Handling occlusions and missing data

Occlusions are a major challenge in stereo-based or multi-view approaches: some parts of the scene are simply not visible from a given viewpoint. Depth maps often have holes or undefined values in these occluded regions. Approaches to handle or fill these occlusions include:

  1. Inpainting
    Estimating the depth values in occluded regions by extrapolating from surrounding pixels (a rough sketch follows this list).
  2. Multiple View Coverage
    With more vantage points, you can combine partial depth maps so that occlusions from one viewpoint are visible from another.
  3. Object-Based Reasoning
    Some advanced algorithms perform semantic segmentation to identify objects that cause occlusions and attempt to model them separately.
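A quick way to experiment with the inpainting idea from item 1 is OpenCV's cv2.inpaint, which operates on 8-bit images; the sketch below therefore rescales the depth map temporarily, which loses precision and is meant only as an illustration:

import cv2
import numpy as np

def inpaint_depth(depth_m, max_m=10.0):
    # Holes are assumed to be encoded as zeros or NaNs
    depth = np.nan_to_num(depth_m, nan=0.0)
    hole_mask = (depth <= 0).astype(np.uint8)
    # cv2.inpaint expects 8-bit input, so rescale temporarily
    depth_8u = np.clip(depth / max_m * 255.0, 0, 255).astype(np.uint8)
    filled_8u = cv2.inpaint(depth_8u, hole_mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
    return filled_8u.astype(np.float32) / 255.0 * max_m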

Applications in machine learning and data science

Depth maps have an incredibly wide range of uses. In many machine learning pipelines, they are either the direct input (for tasks such as 3D object detection) or an intermediate representation that helps models better understand spatial relationships.

Object detection and segmentation

In robotics, a depth map can significantly boost performance in tasks like object detection or semantic segmentation. For instance, a 2D bounding box might be projected into 3D space to understand how far away an object is. Some networks handle RGB-D inputs, in which the depth is treated as an additional input channel. This often improves accuracy in cluttered scenes, where color alone might be insufficient to disambiguate objects.

3D scene reconstruction

Depth maps are crucial building blocks for 3D reconstruction. Systems such as KinectFusion, or more advanced volumetric fusion methods, accumulate depth data from multiple viewpoints to build a consistent 3D model in real time. Architectural, engineering, or geological industries often rely on such reconstructions for surveying or inspection tasks.

Robotics and autonomous navigation

Robotic navigation systems — from simple indoor drones to fully autonomous cars — rely on depth information to avoid obstacles, localize themselves, and plan safe paths. While LiDAR is a popular sensor in self-driving, camera-based depth is also used for redundancy and cost-sensitive systems.

In many robotic pipelines, you'll see depth data integrated into SLAM (simultaneous localization and mapping) to build or update the environment map. The system can keep track of known vs. unknown areas and quickly detect potential collisions.

Augmented reality and virtual reality

AR and VR systems need to know where surfaces are in the user's environment to realistically place digital content. Depth maps aid in occlusion handling, anchor detection (to place a virtual object on a real table), and scene reconstruction for more advanced AR experiences.

On smartphones, AR apps can leverage the phone's built-in ToF sensor (when available) or compute depth from dual cameras (stereo) or from a neural network. The resulting depth map is used to render 3D illusions in real time, letting users place digital objects that look seamlessly integrated with the real world.

Code snippets for illustration purposes

Below is a brief example in Python using OpenCV to compute a stereo-based depth map. This snippet demonstrates disparity computation and reprojection to 3D; calibration and rectification are assumed to have been carried out beforehand. Please note that you would need real stereo data and camera calibration parameters (intrinsic/extrinsic) for a practical use case:


import cv2
import numpy as np

# Suppose left_img_path and right_img_path point to rectified stereo images on disk
# Suppose we have stereo calibration parameters: camera matrices, distortion
# coefficients, and the reprojection matrix Q, obtained from a calibration
# procedure that is not shown here

def compute_stereo_depth(left_img_path, right_img_path, Q_matrix):
    # Q_matrix is the reprojection matrix, typically derived from stereo calibration

    # Read images
    left_img = cv2.imread(left_img_path, cv2.IMREAD_GRAYSCALE)
    right_img = cv2.imread(right_img_path, cv2.IMREAD_GRAYSCALE)

    # Create StereoSGBM object (Semi-Global Block Matching is popular for better results)
    min_disp = 0
    num_disp = 64  # must be divisible by 16
    block_size = 5

    stereo = cv2.StereoSGBM_create(
        minDisparity = min_disp,
        numDisparities = num_disp,
        blockSize = block_size,
        uniquenessRatio = 10,
        speckleWindowSize = 100,
        speckleRange = 32,
        disp12MaxDiff = 1
    )

    disparity_map = stereo.compute(left_img, right_img)

    # Convert disparity map to float32 for reprojecting to 3D space
    # Typically, disparity is scaled by 16 or another factor inside OpenCV
    # so we need to divide to get actual disparity values in pixels
    disparity_map = np.float32(disparity_map) / 16.0

    # Reproject to 3D space using the Q reprojection matrix
    points_3D = cv2.reprojectImageTo3D(disparity_map, Q_matrix)

    # We can derive a depth map from the third coordinate in points_3D
    # Invalid (non-positive) disparities produce meaningless depth, so mask them out
    depth_map = points_3D[:, :, 2]
    depth_map[disparity_map <= 0] = 0.0

    return disparity_map, depth_map

# Example usage:
# Q = ... # from your calibration
# disp, depth = compute_stereo_depth('left_img.png', 'right_img.png', Q)
# Next steps might involve filtering and cleaning the depth map

In real-world applications, you would have to calibrate your stereo system beforehand, rectify the images, handle edge cases, etc.

Other applications (just a quick mention)

  • Face recognition
    Depth can improve reliability by ensuring the face is real and not a flat picture.
  • Photography
    Depth-based bokeh or portrait-mode effects artificially blur backgrounds.
  • Medical imaging
    Some specialized medical sensors produce volumetric or partial 3D data from which a depth-like representation can be extracted for better diagnosis or surgical planning.

Main challenges and limitations

No matter how powerful or mature the method, depth map generation and usage involve challenges that can significantly complicate real-world applications.

Sensor inaccuracies and noise

Every sensor introduces artifacts. For example, time-of-flight devices can suffer from multipath interference or difficulties with highly reflective surfaces. Stereo matching can struggle with uniform, low-texture regions or repeated patterns (leading to ambiguous matches). Structured light can be confused by external infrared sources. Understanding and mitigating these sources of noise is often an entire engineering discipline by itself.

Data storage and processing constraints

A depth map is typically the same spatial resolution as an RGB image, but it might need higher bit depths (e.g. 16-bit or 32-bit floats to capture extended range). Handling large data volumes in real time can be computationally expensive, especially in embedded or edge devices with limited compute resources. Streaming high-resolution depth data also demands substantial bandwidth, which can become a bottleneck in multi-sensor setups.
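One common practical convention (assumed here as an example rather than a universal standard) is to store depth in millimeters as a lossless 16-bit PNG, which covers roughly 0 to 65 m at millimeter resolution:

import cv2
import numpy as np

def save_depth_png(path, depth_m):
    # Convert meters to millimeters and store as a lossless 16-bit PNG
    depth_mm = np.clip(depth_m * 1000.0, 0, 65535).astype(np.uint16)
    cv2.imwrite(path, depth_mm)

def load_depth_png(path):
    depth_mm = cv2.imread(path, cv2.IMREAD_UNCHANGED)  # keeps the 16-bit data
    return depth_mm.astype(np.float32) / 1000.0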

Computational complexity

Many depth extraction algorithms — particularly those that rely on complex stereo matching or heavy deep neural networks — can be demanding in terms of CPU/GPU usage. Real-time performance might require specialized hardware accelerators, such as FPGA-based stereo modules or GPUs with optimized deep learning frameworks. For advanced tasks like dense 3D reconstruction, the complexity grows even further because you often need to handle hundreds or thousands of frames with sophisticated fusion pipelines.


Best practices and optimization strategies

While challenges exist, there are strategies and best practices that can make depth map generation and usage more robust, scalable, and efficient.

Dataset collection and annotation

When training data-hungry approaches — like a deep CNN for single-image depth estimation — a critical bottleneck is obtaining high-quality ground truth depth. Here are a few strategies:

  1. Synthetic data
    Generate photo-realistic synthetic images from 3D models where perfect ground truth depth can be rendered. However, domain adaptation may be necessary because synthetic images differ from real ones.

  2. Specialized sensors
    Use a high-quality laser scanner or structured light sensor to collect ground truth in a controlled environment, then pair it with your RGB images for training.

  3. Stereo or Multi-Camera Rigs
    Rigs with carefully measured baselines can produce approximate ground truth for training. You might have to handle outliers and incomplete data, but advanced pipelines can create fairly accurate ground truth.

Model selection and hyperparameter tuning

For neural network approaches, I recommend carefully selecting architectures based on:

  1. Task Requirements
    Real-time vs. offline, device constraints, interpretability.
  2. Data Availability
    Supervised, semi-supervised, or unsupervised approaches.
  3. Depth Range
    If you are working at large ranges (e.g. self-driving), your network might need specialized layers or loss functions to handle large disparities effectively.

Hyperparameter tuning (e.g., learning rates, batch sizes, optimizer choices) can also dramatically affect performance. Depth estimation is sensitive to subtle changes in network design and training protocols, especially if your dataset is relatively small.

Cloud-based vs. edge-based processing

If your application can afford to upload data to the cloud, you can run heavier algorithms on powerful cloud servers. This approach is popular for offline 3D reconstruction tasks. However, for real-time or privacy-sensitive tasks, edge-based processing is often mandatory. In this scenario, one must optimize code, exploit hardware acceleration (e.g. GPUs, DSPs, or NPUs on embedded devices), and use efficient network architectures such as MobileNet-like backbones for CNN-based solutions.

Performance metrics and evaluation

It's essential to evaluate the quality of depth maps with standardized metrics. Common metrics in depth estimation literature include:

  • Absolute Relative Error (Abs Rel)

    $$\text{Abs Rel} = \frac{1}{N} \sum_{i} \frac{|D_i - \hat{D}_i|}{\hat{D}_i},$$

    where $D_i$ is the predicted depth, $\hat{D}_i$ is the ground truth depth, and $N$ is the number of valid pixels.

  • Root Mean Squared Error (RMSE)

    $$\text{RMSE} = \sqrt{\frac{1}{N}\sum_i (D_i - \hat{D}_i)^2}$$

  • Log RMSE

    $$\text{Log RMSE} = \sqrt{\frac{1}{N}\sum_i (\log D_i - \log \hat{D}_i)^2}$$

  • Inlier Threshold
    A measure of the proportion of pixels for which the predicted depth is within some factor (1.25, 1.25^2, etc.) of the ground truth.
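Below is a small NumPy sketch computing the four metrics just listed; it assumes metric depth maps with ground-truth zeros marking invalid pixels and strictly positive predictions on the valid ones:

import numpy as np

def depth_metrics(pred, gt):
    valid = gt > 0
    d, g = pred[valid], gt[valid]

    abs_rel = np.mean(np.abs(d - g) / g)
    rmse = np.sqrt(np.mean((d - g) ** 2))
    log_rmse = np.sqrt(np.mean((np.log(d) - np.log(g)) ** 2))

    # Inlier thresholds: fraction of pixels with max(d/g, g/d) below 1.25**k
    ratio = np.maximum(d / g, g / d)
    deltas = {f"delta_{k}": np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)}

    return {"abs_rel": abs_rel, "rmse": rmse, "log_rmse": log_rmse, **deltas}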

Some domains, such as robotics, also emphasize the overall SLAM or odometry performance rather than just the raw depth error. If you're doing object detection, you might measure bounding box accuracy in 3D, or Intersection-over-Union (IoU) in 3D bounding volumes. Always align your evaluation methods with your end application.


(Optional extra content to deepen understanding and push theoretical coverage further.)

Additional advanced considerations

Deep learning architectures specific to depth

While a standard encoder-decoder CNN can do single-image depth estimation, specialized variants exist:

  1. Residual Networks (ResNets)
    They can serve as encoders, capturing multi-scale features in deeper networks without vanishing gradients.

  2. DenseNet
    As introduced by Huang et al. (CVPR 2017), each layer receives direct connections from all preceding layers. Some depth estimation architectures incorporate DenseNet blocks to enhance gradient flow and feature reuse.

  3. Capsule Networks
    Proposed by Sabour et al. (NeurIPS 2017) and explored for depth tasks by various authors, capsules aim to encode pose information and can, in theory, better represent geometric transformations.

  4. Transformer-based Networks
    More recently, vision transformers (ViT) have started to appear in depth estimation tasks, showing that self-attention mechanisms can capture long-range dependencies in images.

Unsupervised and self-supervised learning

Given the high cost of collecting large-scale depth annotations, unsupervised methods are popular. The network can predict both depth and camera motion by minimizing photometric reconstruction errors when warping frames from one viewpoint to another. These methods revolve around geometric principles to align images in 2D space, thereby deriving constraints on 3D structure.

Multi-task learning

Depth estimation can be learned alongside semantic segmentation, surface normals, or object detection in multi-task frameworks. The synergy between tasks often helps the network build a richer representation of the scene, improving overall performance.

Domain adaptation

If you train on synthetic data but want to deploy on real data, domain adaptation can mitigate the domain gap. Techniques might include adversarial training or style transfer from real images to synthetic (and vice versa), ensuring the network generalizes better.

Depth completion

Sometimes you have incomplete or sparse depth measurements (e.g., from a LiDAR sensor that returns scattered point data). Depth completion networks aim to transform these sparse maps into dense, high-quality depth maps. They can incorporate semantic or geometric priors for better results.


(Optional) Illustrations

Image: "Example of a Time-of-flight sensor operation". Caption: "A conceptual illustration of how a ToF camera uses infrared light and measures the return time to compute depth."

Image: "Stereo matching with epipolar geometry". Caption: "An illustration of how stereo cameras capture offset images; corresponding features can be used to compute disparity and derive depth."

Image: "Depth map from a structured light sensor". Caption: "Structured light patterns deform across surfaces; analyzing deformation yields depth values per pixel."


Final remarks

Depth maps stand at the intersection of geometry, sensing, and machine learning. From classical photogrammetry to modern neural networks, they continue to evolve, unlocking new possibilities in a variety of domains: robotics, 3D modeling, AR/VR, medical imaging, autonomous navigation, and beyond.

The inherent challenges — sensor noise, occlusions, limited data, high computational loads — are countered by the ever-increasing sophistication of specialized hardware, cunning algorithms, and the synergy between data-driven and model-driven approaches. I find that best results often come from an integrated viewpoint: combining multiple sensors (e.g., camera plus LiDAR), refining depth maps through advanced filtering or neural networks, and feeding them into well-designed pipelines.

By carefully considering the trade-offs between active and passive methods, deciding whether cloud-based or edge-based processing is more suitable, and ensuring that any ML pipeline is properly trained and evaluated, one can harness depth maps effectively for a wide variety of cutting-edge applications.

I hope this article helps you gain a deeper (no pun intended) understanding of depth maps, their generation, and their role in machine learning and data science.
