Video-Based Depth Estimation Autoencoder With Weighted Temporal Feature and Spatial Edge Guided Modules

Wei Jong Yang, Wan Nung Tsung, Pau Choo Chung

Research output: Contribution to journal, Article, peer-review


Convolutional neural networks with encoder and decoder structures, generally referred to as autoencoders, are used in many pixelwise transformation, detection, segmentation, and estimation applications, such as face swapping, lane detection, semantic segmentation, and depth estimation, respectively. However, traditional autoencoders, which operate on single-frame inputs, ignore the temporal consistency between consecutive frames and may therefore produce unsatisfactory results. Accordingly, this article proposes a video-based depth estimation (VDE) autoencoder that improves the quality of depth estimation by including two weighted temporal feature (WTF) modules in the encoder and a single spatial edge guided (SEG) module in the decoder. The WTF modules, which are designed with a channel weighted block submodule, extract the temporal similarities in consecutive frames, whereas the SEG module provides spatial edge guidance on object contours. Through the collaboration of the proposed modules, the accuracy of the depth estimation is greatly improved. The experimental results confirm that the proposed VDE autoencoder achieves better monocular depth estimation performance than existing autoencoders, with only a slight increase in computational cost.
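The abstract does not specify how the channel weighted block inside a WTF module is built. As a rough illustration only, the sketch below assumes a squeeze-and-excitation style design: encoder features from two consecutive frames are stacked along the channel axis, and each channel is rescaled by a learned weight between 0 and 1. The function names, the reduction ratio `r`, and the random stand-in weights `w1` and `w2` are all hypothetical and are not taken from the article.

```python
import numpy as np


def channel_weights(features, w1, w2):
    """Assumed squeeze-and-excitation style weighting (not the authors' exact block).

    features: array of shape (C, H, W); returns one weight in (0, 1) per channel.
    """
    squeezed = features.mean(axis=(1, 2))        # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)      # bottleneck with ReLU -> (C // r,)
    return 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid -> (C,)


def fuse_temporal(prev_feat, curr_feat, w1, w2):
    """Stack features of two consecutive frames and reweight every channel."""
    stacked = np.concatenate([prev_feat, curr_feat], axis=0)  # (2C, H, W)
    weights = channel_weights(stacked, w1, w2)
    return stacked * weights[:, None, None], weights


# Hypothetical sizes: C channels per frame, H x W feature maps, reduction ratio r.
rng = np.random.default_rng(0)
C, H, W, r = 4, 8, 8, 2
prev_feat = rng.standard_normal((C, H, W))  # features of frame t-1
curr_feat = rng.standard_normal((C, H, W))  # features of frame t
w1 = rng.standard_normal((2 * C // r, 2 * C))  # stand-ins for learned weights
w2 = rng.standard_normal((2 * C, 2 * C // r))

fused, weights = fuse_temporal(prev_feat, curr_feat, w1, w2)
print(fused.shape)
```

In a trained network, `w1` and `w2` would be learned jointly with the encoder, so that channels carrying features consistent across frames receive larger weights.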

Original language: English
Pages (from-to): 613-623
Number of pages: 11
Journal: IEEE Transactions on Artificial Intelligence
Issue number: 2
Publication status: Published - 2024 Feb 1

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Artificial Intelligence

