TY - JOUR
T1 - Video-Based Depth Estimation Autoencoder With Weighted Temporal Feature and Spatial Edge Guided Modules
AU - Yang, Wei Jong
AU - Tsung, Wan Nung
AU - Chung, Pau Choo
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2024/2/1
Y1 - 2024/2/1
N2 - Convolutional neural networks with encoder and decoder structures, generally referred to as autoencoders, are used in many pixelwise transformation, detection, segmentation, and estimation applications; for example, they can be applied to face swapping, lane detection, semantic segmentation, and depth estimation, respectively. However, traditional autoencoders, which are based on single-frame inputs, ignore the temporal consistency between consecutive frames and may, hence, produce unsatisfactory results. Accordingly, in this article, a video-based depth estimation (VDE) autoencoder is proposed to improve the quality of depth estimation through the inclusion of two weighted temporal feature (WTF) modules in the encoder and a single spatial edge guided (SEG) module in the decoder. The WTF modules, designed with a channel-weighted block submodule, effectively extract the temporal similarities in consecutive frames, whereas the SEG module provides spatial edge guidance along the object contours. Through the collaboration of the proposed modules, the accuracy of the depth estimation is greatly improved. The experimental results confirm that the proposed VDE autoencoder achieves better monocular depth estimation performance than existing autoencoders, with only a slight increase in computational cost.
UR - http://www.scopus.com/inward/record.url?scp=85174828823&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85174828823&partnerID=8YFLogxK
U2 - 10.1109/TAI.2023.3324624
DO - 10.1109/TAI.2023.3324624
M3 - Article
AN - SCOPUS:85174828823
SN - 2691-4581
VL - 5
SP - 613
EP - 623
JO - IEEE Transactions on Artificial Intelligence
JF - IEEE Transactions on Artificial Intelligence
IS - 2
ER -