Video without depth has Depth
We can now get 3D depth data directly from videos.
Paper: Video Depth without Video Models
Project page & code: https://rollingdepth.github.io/
Researchers from ETH Zurich and Carnegie Mellon University are interested in 3D scene understanding from video.
Hmm... What's the background?
Traditionally, 3D scene models were built with Structure-from-Motion (SfM) and multi-view reconstruction. These techniques often fail on in-the-wild videos, whose camera motion and scene content (little parallax, moving objects) rarely satisfy their assumptions. Video depth estimation, which infers a 2.5D depth map for every video frame, offers a more robust alternative.
Recent advancements in single-image depth estimation, driven by large foundation models and synthetic training data, have renewed interest in video depth. However, applying single-image depth estimators frame-by-frame causes temporal inconsistencies like flickering and drift.
So what is proposed in the research paper?
The paper introduces RollingDepth, which builds on several key ideas:
RollingDepth turns a single-image latent diffusion model (Marigold) into a multi-frame depth estimator that operates on short video snippets. A modified cross-frame self-attention mechanism lets tokens attend across the frames of a snippet, so the model can capture temporal patterns (a rough sketch of this follows after these points).
To handle the varying depth ranges in videos, the model predicts inverse depth instead of affine-invariant depth, which makes it less sensitive to shifts in the near and far planes caused by camera and object motion (illustrated in the second snippet below).
The model is trained on a mix of synthetic video data (TartanAir) and photorealistic single-image data (Hypersim) to improve generalization across diverse scenes, and a depth-range augmentation is applied during training to further boost robustness (a hypothetical version is sketched in the third snippet below).
RollingDepth outperforms both single-frame and video-based methods across multiple datasets and sequence lengths, setting the state of the art in zero-shot depth estimation. Its depth maps stay coherent over long sequences, avoiding flickering, drift, and unwarranted jumps in depth values, on both the evaluation datasets and in-the-wild video clips.
What’s next?
Future directions include integrating generative video models or flow-based methods into the refinement step to improve motion reconstruction and detail enhancement, and reducing the refinement step's computational cost without sacrificing detail and quality.
We can now get 3D depth data directly from videos.
Learned something new? Consider sharing it!