Video Summarization With Spatiotemporal Vision Transformer

Tzu Chun Hsu, Yi Sheng Liao, Chun Rong Huang

Research output: Article, peer-reviewed

15 citations (Scopus)

Abstract

Video summarization aims to generate a compact summary of the original video for efficient video browsing. To provide video summaries that are consistent with human perception and contain important content, supervised learning-based video summarization methods have been proposed. These methods aim to learn important content based on continuous frame information of human-created summaries. However, recent methods rarely consider, at the same time, both the inter-frame correlations among non-adjacent frames and the intra-frame attention that attracts human viewers when representing frame importance. To address these issues, we propose a novel transformer-based method named spatiotemporal vision transformer (STVT) for video summarization. The STVT is composed of three main components: the embedded sequence module, the temporal inter-frame attention (TIA) encoder, and the spatial intra-frame attention (SIA) encoder. The embedded sequence module generates the embedded sequence by fusing the frame embedding, index embedding, and segment class embedding to represent the frames. The temporal inter-frame correlations among non-adjacent frames are learned by the TIA encoder with the multi-head self-attention scheme. Then, the spatial intra-frame attention of each frame is learned by the SIA encoder. Finally, a multi-frame loss is computed to drive the learning of the network in an end-to-end trainable manner. By simultaneously using both inter-frame and intra-frame information, our method outperforms state-of-the-art methods on both the SumMe and TVSum datasets. The source code of the spatiotemporal vision transformer will be available at https://github.com/nchucvml/STVT.
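
The idea of attending across non-adjacent frames can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the index embedding is a fixed sinusoidal code fused by addition, omits the segment class embedding, and uses a single attention head with identity query/key/value projections, whereas the abstract describes multi-head self-attention. The function names `add_index_embedding` and `self_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def add_index_embedding(frames):
    """Add a sinusoidal index embedding to each frame embedding.

    Assumption: fusion by element-wise addition with a fixed sinusoidal
    code; the abstract does not specify how the embeddings are fused.
    """
    dim = len(frames[0])
    out = []
    for pos, frame in enumerate(frames):
        emb = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            emb.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        out.append([f + e for f, e in zip(frame, emb)])
    return out


def self_attention(seq):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections (no learned weights), so every
    frame attends to every other frame, adjacent or not.
    """
    dim = len(seq[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in seq]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * seq[j][d] for j in range(len(seq))) for d in range(dim)]
        )
    return outputs, weights


# Toy frame embeddings: four frames, four dimensions each (made-up values).
frames = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
embedded = add_index_embedding(frames)
attended, weights = self_attention(embedded)
```

Each row of `weights` is a distribution over all frames in the sequence, which is what lets a frame draw on non-adjacent frames rather than only its neighbors.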

Original language: English
Pages (from-to): 3013-3026
Number of pages: 14
Journal: IEEE Transactions on Image Processing
Volume: 32
Publication status: Published - 2023

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Graphics and Computer-Aided Design
