REVEAL: Reconstructing Video Amodal Content via Language and Flow Guidance

Abstract

Amodal perception enables humans to perceive entire objects even when parts are occluded, a remarkable cognitive skill that artificial intelligence struggles to replicate. Recent diffusion-based methods have extended amodal completion from images to videos, yet they lack auxiliary priors to guide the reconstruction of heavily occluded content and to maintain temporal consistency over long sequences. We present REVEAL (REconstructing VidEo Amodal content via Language and flow guidance), a unified diffusion-based framework that integrates complementary motion and semantic priors for video amodal completion. A user-provided text query serves a dual role: it drives open-vocabulary video segmentation to obtain visible masks, and it provides semantic guidance for texture reconstruction. For amodal segmentation, we introduce optical flow guidance: by warping visible masks from previous frames to the current frame, the flow-warped mask propagates visible content into occluded regions and approximates the object's current shape, providing a strong shape prior even under simultaneous deformation and occlusion. For texture completion, text guidance constrains the reconstruction of occluded appearance while providing a stable semantic reference that helps maintain visual coherence throughout the reconstruction. We also contribute LAVAT, a new benchmark featuring long video sequences paired with text descriptions, enabling evaluation of text-guided video amodal completion under heavy occlusion. Extensive experiments demonstrate that REVEAL achieves state-of-the-art performance on existing benchmarks while maintaining high-quality temporal consistency over extended sequences.

How does it work?

REVEAL allows the user to provide a text query specifying the object of interest. The text query serves a dual role: it drives open-vocabulary video segmentation to obtain visible masks across frames, and it provides semantic guidance for texture reconstruction. The model then performs a two-stage prediction.

Stage 1 – Amodal Mask Segmentation: We introduce optical flow guidance as a motion prior. By warping visible masks from previous frames to the current frame, the flow-warped mask propagates visible content into occluded regions and approximates the object's current shape, providing a strong shape prior even under simultaneous deformation and occlusion.

Stage 2 – Video Amodal Texture Completion: Text guidance constrains the reconstruction of occluded appearance while providing a stable semantic reference that helps maintain visual coherence throughout the reconstruction. The text remains constant across frames, offering a consistent semantic anchor for temporal coherence.

Stage 1 pipeline: Video Amodal Mask Segmentation. The model uses optical flow as a motion prior, warping visible masks from previous frames to propagate shape information into occluded regions.
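The flow-warping step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `warp_mask_backward`, the backward-flow convention (each current-frame pixel stores the offset to its source in the previous frame), and nearest-neighbor sampling are all assumptions made for illustration. The toy example moves a 2x2 object two pixels to the right and hides one of its columns in the current frame.

```python
import numpy as np

def warp_mask_backward(prev_mask: np.ndarray, flow_cur_to_prev: np.ndarray) -> np.ndarray:
    """Warp a binary mask from the previous frame into the current frame.

    prev_mask:        (H, W) bool array, visible mask at the previous frame.
    flow_cur_to_prev: (H, W, 2) float array; for each current-frame pixel (y, x),
                      (dx, dy) is the offset to its location in the previous frame
                      (an assumed convention, not taken from the paper).
    """
    h, w = prev_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbor lookup of the source pixel in the previous frame.
    src_x = np.rint(xs + flow_cur_to_prev[..., 0]).astype(int)
    src_y = np.rint(ys + flow_cur_to_prev[..., 1]).astype(int)
    # Pixels whose source falls outside the image stay empty.
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    warped = np.zeros_like(prev_mask, dtype=bool)
    warped[inside] = prev_mask[src_y[inside], src_x[inside]]
    return warped

# Toy example: a 2x2 object that moves 2 pixels to the right.
prev_visible = np.zeros((6, 6), dtype=bool)
prev_visible[1:3, 1:3] = True

flow = np.zeros((6, 6, 2), dtype=np.float32)
flow[..., 0] = -2.0  # each current pixel came from 2 pixels to its left

warped = warp_mask_backward(prev_visible, flow)

# In the current frame, an occluder hides the object's right column (column 4).
cur_visible = np.zeros((6, 6), dtype=bool)
cur_visible[1:3, 3] = True

# Pixels the warped mask suggests belong to the object but are not visible now.
recovered = warped & ~cur_visible
```

Here `recovered` marks the occluded column that the current visible mask misses, which is the sense in which a flow-warped mask can act as a shape prior. A real system would obtain the flow from an optical flow estimator and would likely use sub-pixel interpolation rather than rounding.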

Stage 2 pipeline: Video Amodal Texture Completion. The second stage performs amodal texture completion with text guidance as a semantic prior, leveraging pretrained diffusion models to reconstruct content consistent with the object's canonical shape and properties.
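Two properties of this stage can be sketched in a few lines of NumPy: the region to synthesize is the part of the amodal mask that is not visible, and the same text conditioning is reused for every frame. This is a hedged illustration only. The helper names `occluded_region` and `per_frame_conditioning` are hypothetical, the text embedding is a placeholder vector rather than the output of a real text encoder, and how the diffusion model actually consumes the conditioning is not described here.

```python
import numpy as np

def occluded_region(amodal_masks: np.ndarray, visible_masks: np.ndarray) -> np.ndarray:
    """Pixels that belong to the object but are hidden: (T, H, W) bool arrays."""
    return amodal_masks & ~visible_masks

def per_frame_conditioning(text_embedding: np.ndarray, num_frames: int) -> np.ndarray:
    """Repeat one text embedding for every frame: (D,) -> (T, D)."""
    return np.broadcast_to(text_embedding, (num_frames, text_embedding.shape[0]))

# Toy clip: 3 frames of a 4x4 image with a 2x2 object in the center.
T, H, W = 3, 4, 4
amodal = np.zeros((T, H, W), dtype=bool)
amodal[:, 1:3, 1:3] = True

visible = np.zeros((T, H, W), dtype=bool)
visible[0, 1:3, 1:3] = True   # frame 0: fully visible
visible[1, 1:3, 1:2] = True   # frame 1: right half occluded
                              # frame 2: fully occluded

to_fill = occluded_region(amodal, visible)
pixels_to_fill = to_fill.sum(axis=(1, 2))   # occluded pixel count per frame

# Placeholder for a text-encoder output (assumption: a single D-dim vector).
text_embedding = np.arange(8, dtype=np.float32)
cond = per_frame_conditioning(text_embedding, T)
```

Because `cond` holds the identical vector in every row, each frame is reconstructed against the same semantic reference even when, as in frame 2, no visible pixels of the object remain.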

Comparison with SOTA on LAVAT

We compare REVEAL with state-of-the-art (SOTA) video amodal completion methods on the LAVAT benchmark.

[Video comparison; columns: Input, Visible, TACO, Diff-VAS, REVEAL]

Ablation Study: Flow Prior Guidance

We study the effect of using optical flow as a motion prior for amodal mask segmentation.

Ablation Study: Textual Guidance

We study the effect of using text guidance as a semantic prior for amodal texture completion.

BibTeX

@inproceedings{anonymous2026reveal,
  title={REVEAL: Reconstructing Video Amodal Content via Language and Flow Guidance},
  author={Anonymous},
  booktitle={Under Review},
  year={2026}
}