VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

1University of Arkansas 2Carnegie Mellon University 3Mohamed bin Zayed University of AI
*Indicates Equal Contribution

AAAI 2023, Oral
Overview of the VLTinT model:

VL Encoder: given a video snippet, the encoder simultaneously extracts local visual features from the main agents, global visual features from the environment, and linguistically relevant scene elements, and models the interactions among these three modalities through our M2RF module.
TinT Decoder: the canonical transformer encoder is extended with an autoregressive outer transformer that can selectively access the hidden states of previous events, which are stored in the event memory (a schematic code sketch of both components follows below).
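
The following is a minimal PyTorch sketch of how these two components could fit together. The module internals (M2RF reduced to simple cross-modal attention, a plain Python list serving as the event memory), class names, and tensor shapes are illustrative assumptions for readability, not the released implementation.

import torch
import torch.nn as nn

class M2RF(nn.Module):
    """Illustrative multi-modal fusion: lets environment, agent, and
    linguistic features attend to one another and merges them
    (assumption, not the paper's exact M2RF formulation)."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(3 * d_model, d_model)

    def forward(self, env, agents, lang):
        # env:    (B, T, D) global visual environment features
        # agents: (B, T, D) pooled local main-agent features
        # lang:   (B, T, D) linguistic scene-element features
        env2, _ = self.cross_attn(env, agents, agents)      # environment attends to agents
        lang2, _ = self.cross_attn(lang, env, env)          # language attends to vision
        fused = torch.cat([env2, agents, lang2], dim=-1)
        return self.proj(fused)                             # (B, T, D) VL feature

class TinTDecoder(nn.Module):
    """Inner transformer encodes the current event; an autoregressive outer
    attention reads hidden states of previous events from an event memory."""
    def __init__(self, d_model: int, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        inner_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.inner = nn.TransformerEncoder(inner_layer, n_layers)
        self.outer_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, vl_feats, event_memory):
        # vl_feats: (B, T, D) VL features of the current event
        h = self.inner(vl_feats)
        if event_memory:                                     # previous events, if any
            mem = torch.cat(event_memory, dim=1)             # (B, T_prev, D)
            h, _ = self.outer_attn(h, mem, mem)              # inter-event coherence
        event_memory.append(h.detach())                      # cache for the next event
        return h

# Usage sketch: process a video event by event.
B, T, D = 2, 16, 512
fusion, decoder = M2RF(D), TinTDecoder(D)
memory: list[torch.Tensor] = []
for _ in range(3):                                           # three events in the video
    env, agents, lang = (torch.randn(B, T, D) for _ in range(3))
    vl = fusion(env, agents, lang)
    hidden = decoder(vl, memory)                             # would feed a caption head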

Abstract

Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video with several temporal event locations as a coherent story. Following the human perception process, in which a scene is effectively understood by decomposing it into visual (e.g., humans, animals) and non-visual components (e.g., actions, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function that ensures the learned embedding features match the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms prior state-of-the-art methods in terms of accuracy and diversity.
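
The abstract only names the VL contrastive loss; below is a hedged sketch of one common way such an objective is implemented, a symmetric InfoNCE between pooled VL features and caption embeddings. The pooling, temperature, and in-batch pairing scheme are assumptions for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def vl_contrastive_loss(vl_emb: torch.Tensor,
                        cap_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between VL features and caption embeddings.

    vl_emb:  (B, D) pooled visual-linguistic features, one per event
    cap_emb: (B, D) sentence embeddings of the matching captions
    Matching pairs share the same batch index (illustrative assumption).
    """
    vl = F.normalize(vl_emb, dim=-1)
    cap = F.normalize(cap_emb, dim=-1)
    logits = vl @ cap.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(vl.size(0), device=vl.device)
    loss_v2t = F.cross_entropy(logits, targets)       # VL -> caption direction
    loss_t2v = F.cross_entropy(logits.t(), targets)   # caption -> VL direction
    return 0.5 * (loss_v2t + loss_t2v)

# Example: a batch of 4 events with 512-d embeddings.
loss = vl_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))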

Demo Videos

BibTeX


@article{kashu_vltint,
  title={VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning},
  volume={37},
  url={https://ojs.aaai.org/index.php/AAAI/article/view/25412},
  DOI={10.1609/aaai.v37i3.25412},
  number={3},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence},
  author={Yamazaki, Kashu and Vo, Khoa and Truong, Quang Sang and Raj, Bhiksha and Le, Ngan},
  year={2023},
  month={Jun.},
  pages={3081-3090}
}