Text-Guided Video Amodal Completion

1University of Arkansas    2Coupang, Inc.
Preprint 2024

An overview of our text-guided video amodal completion pipeline. Given an input video, users select an object of interest in the first frame and provide a text description of the expected output. Our pipeline then generates a completed video, filling in the missing shape and texture of the object. (Top) A summary of our proposed pipeline and three key contributions. (Bottom) Zero-shot transfer results on natural videos using our method.

Abstract

Amodal perception enables humans to perceive entire objects even when parts are occluded, a remarkable cognitive skill that artificial intelligence struggles to replicate. While substantial advancements have been made in image amodal completion, video amodal completion remains underexplored despite its high potential for real-world applications in video editing and analysis. In response, we propose a video amodal completion framework to explore this promising direction. Our contributions include (i) a synthetic dataset for video amodal completion with text descriptions of the object of interest. The dataset captures a variety of object types, textures, motions, and scenarios to support zero-shot transfer to natural videos. (ii) A diffusion-based text-guided video amodal completion framework enhanced with a motion continuity module to ensure temporal consistency across frames. (iii) Zero-shot inference for long videos, inspired by temporal diffusion techniques, to effectively handle long video sequences while improving inference accuracy and maintaining coherent amodal completions. Experimental results show the efficacy of our approach in handling video amodal completion, opening up capabilities for advanced video editing and analysis with amodal completion.
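Contribution (iii) handles sequences longer than the model's temporal window by denoising overlapping windows and fusing the per-frame results. The paper does not publish this routine here, so the following is a minimal sketch of one common windowing-and-fusion scheme; the function names (`temporal_windows`, `fuse_windows`) and the default window/stride sizes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def temporal_windows(num_frames, window=16, stride=8):
    """Yield overlapping [start, end) frame windows covering the whole video."""
    window = min(window, num_frames)
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        # Append one final window so the tail frames are also covered.
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]

def fuse_windows(num_frames, window=16, stride=8):
    """Average per-frame predictions from overlapping windows (uniform weights).

    Each "prediction" here is a dummy scalar per frame; in a real diffusion
    pipeline it would be a denoised latent tensor of shape (window, C, H, W).
    """
    acc = np.zeros(num_frames)
    cnt = np.zeros(num_frames)
    for s, e in temporal_windows(num_frames, window, stride):
        acc[s:e] += 1.0  # stand-in for the window's denoised output
        cnt[s:e] += 1.0
    return acc / cnt     # every frame is covered by at least one window
```

Overlapping windows let adjacent segments share frames, which smooths transitions at window boundaries instead of producing visible seams every `window` frames.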

Training pipeline of the proposed method


Our approach follows a common two-stage training strategy: first, training a denoising UNet at the frame level to capture spatial features, and then incorporating motion training to ensure temporal coherence.
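The two-stage strategy above amounts to toggling which parameter groups are trainable in each stage. A minimal sketch of that schedule is below; the module names (`spatial_unet`, `motion_module`) are hypothetical stand-ins for the paper's denoising UNet and motion continuity module.

```python
def set_stage(param_names, stage):
    """Return which parameter groups are trainable in each training stage.

    Stage 1: train the spatial UNet frame-by-frame to capture spatial features.
    Stage 2: freeze the spatial weights and train only the motion module
             so the model learns temporal coherence across frames.
    """
    trainable = {}
    for name in param_names:
        if stage == 1:
            trainable[name] = name.startswith("spatial_unet")
        else:
            trainable[name] = name.startswith("motion_module")
    return trainable
```

In a PyTorch training loop this map would typically drive each parameter's `requires_grad` flag before building the optimizer for the stage.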

Synthesizing training data process


We selected videos with unoccluded objects and used the provided masks to isolate object pixels. These pixels were then used to synthesize occlusion videos and create the corresponding amodal completion ground truth. A vision-language model (e.g., BLIP) is used to generate a text description of the ground truth.
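The synthesis step described above can be sketched as a simple per-frame compositing operation: paste an occluder over the unoccluded object, keep the full object mask as the amodal ground truth, and derive the visible (modal) mask as the object minus the occluder. This is an illustrative sketch, not the paper's exact data pipeline; the function name `synthesize_occlusion` is assumed.

```python
import numpy as np

def synthesize_occlusion(frame, object_mask, occluder, occluder_mask):
    """Composite an occluder over an unoccluded object to make a training pair.

    frame, occluder: (H, W, 3) float arrays; object_mask, occluder_mask:
    (H, W) boolean arrays. Returns the occluded input frame, the visible
    (modal) mask, and the amodal ground-truth mask (the full object mask).
    """
    # Where the occluder mask is set, take occluder pixels; elsewhere keep the frame.
    occluded = np.where(occluder_mask[..., None], occluder, frame)
    modal_mask = object_mask & ~occluder_mask  # visible part of the object
    amodal_mask = object_mask                  # full object = ground truth
    return occluded, modal_mask, amodal_mask
```

Because the source object is fully visible before compositing, the amodal mask and texture come for free, which is what makes large-scale synthetic supervision practical.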

Qualitative Results

Zero-Shot Performance

BibTeX