Tracking Everything Everywhere All at Once

¹Cornell University   ²Google Research   ³UC Berkeley

ICCV 2023 (Oral, Best Student Paper)

OmniMotion jointly tracks all points in a video across all frames, even through occlusions.


Abstract

We present a new test-time optimization method for estimating dense and long-range motion from a video sequence. Prior optical flow or particle video tracking algorithms typically operate within limited temporal windows, struggling to track through occlusions and maintain global consistency of estimated motion trajectories. We propose a complete and globally consistent motion representation, dubbed OmniMotion, that allows for accurate, full-length motion estimation of every pixel in a video. OmniMotion represents a video using a quasi-3D canonical volume and performs pixel-wise tracking via bijections between local and canonical space. This representation allows us to ensure global consistency, track through occlusions, and model any combination of camera and object motion. Extensive evaluations on the TAP-Vid benchmark and real-world footage show that our approach outperforms prior state-of-the-art methods by a large margin both quantitatively and qualitatively.
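The idea of tracking through a shared canonical space can be illustrated with a toy sketch. Assume each frame i has an invertible map T_i from its local coordinates to canonical coordinates; a point x_i in frame i is then carried to frame j as T_j^{-1}(T_i(x_i)). The `FrameBijection` class, the `track` function, and the scale-and-shift maps below are hypothetical stand-ins chosen because their inverses are easy to write down; they are not the learned mappings used by OmniMotion.

```python
from dataclasses import dataclass
from typing import Tuple

Point3 = Tuple[float, float, float]


@dataclass(frozen=True)
class FrameBijection:
    """Toy invertible map between one frame's local space and canonical space.

    Assumption for illustration only: a per-axis scale plus a translation.
    """
    scale: Point3  # every entry must be non-zero so the map stays invertible
    shift: Point3

    def to_canonical(self, x: Point3) -> Point3:
        return tuple(s * v + t for s, v, t in zip(self.scale, x, self.shift))

    def to_local(self, u: Point3) -> Point3:
        return tuple((v - t) / s for s, v, t in zip(self.scale, u, self.shift))


def track(x_i: Point3, frame_i: FrameBijection, frame_j: FrameBijection) -> Point3:
    """Carry a local point from frame i to frame j via the canonical space."""
    u = frame_i.to_canonical(x_i)   # local (frame i) -> canonical
    return frame_j.to_local(u)      # canonical -> local (frame j)


# Three made-up frames; frame 0 happens to be the identity map.
frames = [
    FrameBijection(scale=(1.0, 1.0, 1.0), shift=(0.0, 0.0, 0.0)),
    FrameBijection(scale=(2.0, 2.0, 1.0), shift=(0.5, -0.5, 0.0)),
    FrameBijection(scale=(0.5, 4.0, 1.0), shift=(-1.0, 2.0, 0.25)),
]

x0 = (0.25, 0.75, 1.0)
x2_direct = track(x0, frames[0], frames[2])
x2_via_1 = track(track(x0, frames[0], frames[1]), frames[1], frames[2])
x0_back = track(x2_direct, frames[2], frames[0])
```

Because every frame maps into the same canonical space, going from frame 0 to frame 2 directly or through frame 1 gives the same answer, and mapping back recovers the starting point. This is the sense in which a bijective representation is globally consistent by construction.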


More Results

Trail visualization


Point visualization. Points detected as occluded are marked as "+".




Interactive Demo

Use our interactive demo to inspect the correspondences generated by our method. Simply click on any location in the query frame (left), and observe its corresponding location in the target frame (right). Use the slider to switch to a different target frame, and press the 'clear points' button to remove all points. Points that are identified as occluded are displayed as crosses '+' instead of dots '●'. Note that this demo showcases correspondences for a single query frame, but our representation captures all correspondences from any frame to any other frame in a video.
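The display rule described above can be sketched in a few lines. This is a minimal illustration rather than the demo's actual code: the `Correspondence` record, the `tracks` table, the `draw_list` helper, and all coordinate values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Correspondence:
    x: float        # horizontal pixel position in the target frame
    y: float        # vertical pixel position in the target frame
    occluded: bool  # True if the point is judged hidden in that frame


def marker_for(c: Correspondence) -> str:
    """Pick the glyph named in the demo text: '+' if occluded, else a dot."""
    return "+" if c.occluded else "●"


# Hypothetical precomputed tracks: one list per clicked query point,
# indexed by target frame number (the slider position).
tracks: Dict[str, List[Correspondence]] = {
    "point_a": [
        Correspondence(120.0, 80.0, occluded=False),
        Correspondence(123.5, 81.0, occluded=False),
        Correspondence(127.0, 82.5, occluded=True),
    ],
}


def draw_list(target_frame: int) -> List[tuple]:
    """Return (x, y, glyph) for every clicked point at the chosen frame."""
    return [
        (t[target_frame].x, t[target_frame].y, marker_for(t[target_frame]))
        for t in tracks.values()
    ]
```

In this sketch, moving the slider corresponds to calling `draw_list` with a new frame index, and the 'clear points' button corresponds to `tracks.clear()`.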




Pseudo-Depth Visualization

Since our method optimizes an underlying quasi-3D representation, we can extract a pseudo-depth visualization showing the relative ordering of different parts of the scene. Blue: near, Red: far.
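A color coding of this kind can be sketched as follows. The linear blue-to-red ramp, the convention that smaller values mean nearer, and the `pseudo_depth_to_rgb` name are all assumptions made for illustration; the colormap used for the results on this page may differ.

```python
from typing import List, Tuple


def pseudo_depth_to_rgb(depths: List[float]) -> List[Tuple[int, int, int]]:
    """Map relative depths to RGB colors: nearest is blue, farthest is red."""
    if not depths:
        return []
    near, far = min(depths), max(depths)
    span = far - near
    colors = []
    for d in depths:
        # t is 0.0 at the nearest sample and 1.0 at the farthest. A constant
        # depth list has no ordering to show, so every sample maps to "near".
        t = (d - near) / span if span > 0 else 0.0
        colors.append((round(255 * t), 0, round(255 * (1.0 - t))))
    return colors


# Made-up pseudo-depth samples, ordered from near to far.
colors = pseudo_depth_to_rgb([1.0, 2.0, 4.0, 5.0])
```

Only the relative ordering matters here: the values are normalized by the observed minimum and maximum, so the colors convey which parts of the scene are in front of others rather than any metric distance.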





Failure Cases

Like many motion estimation methods, our method struggles with rapid and highly non-rigid motion as well as thin structures. In these scenarios, the pairwise correspondence methods that our optimization relies on can fail to provide enough reliable correspondences to compute accurate global motion.



Acknowledgements

We thank Jon Barron, Richard Tucker, Vickie Ye, Zekun Hao, Xiaowei Zhou, Steve Seitz, Brian Curless, and Richard Szeliski for their helpful input and assistance. This work was supported in part by an NVIDIA academic hardware grant and by the National Science Foundation (IIS-2008313 and IIS-2211259). Qianqian Wang was supported in part by a Google PhD Fellowship.

BibTeX


@inproceedings{wang2023omnimotion,
  title     = {Tracking Everything Everywhere All at Once},
  author    = {Wang, Qianqian and Chang, Yen-Yu and Cai, Ruojin and Li, Zhengqi and Hariharan, Bharath and Holynski, Aleksander and Snavely, Noah},
  booktitle = {International Conference on Computer Vision},
  year      = {2023}
}