TY - JOUR
T1 - Semi-Supervised 3D Human Pose Estimation by Jointly Considering Temporal and Multiview Information
AU - Chu, Wei Ta
AU - Pan, Zong Wei
N1 - Funding Information:
This work was supported in part by Qualcomm Technologies, Inc., under Grant B109-K027D; and in part by the Ministry of Science and Technology, Taiwan, under Grant 108-2221-E-006-227-MY3, Grant 107-2923-E-194-003-MY3, Grant 109-2218-E-002-015, and Grant 107-2627-H-155-001.
Publisher Copyright:
© 2013 IEEE.
PY - 2020
Y1 - 2020
N2 - Three-dimensional human pose estimation is usually conducted in a supervised manner. However, because collecting labeled 3D skeletons is expensive and time-consuming, semi-supervised methods that need far less labeled 3D data are urgently needed. Some semi-supervised learning methods consider information from consecutive video frames and information from frames captured simultaneously from multiple viewpoints independently. In this article, we propose to jointly consider temporal information and multiview information in a unified adversarial learning framework. Given a 2D skeleton, a pose generator network estimates the corresponding 3D skeleton, and a camera network estimates the camera parameters. A critic network evaluates the estimated 3D skeleton to determine whether it is a plausible 3D human pose. Based on the estimated camera parameters, the estimated 3D skeleton can be re-projected into a 2D skeleton, which should be similar to the input 2D skeleton. Together, re-projection and adversarial learning enable self-supervision. We design the architectures of these networks to take 2D skeletons from multiple viewpoints in temporally consecutive frames. We verify that jointly considering the two types of information substantially improves performance.
AB - Three-dimensional human pose estimation is usually conducted in a supervised manner. However, because collecting labeled 3D skeletons is expensive and time-consuming, semi-supervised methods that need far less labeled 3D data are urgently needed. Some semi-supervised learning methods consider information from consecutive video frames and information from frames captured simultaneously from multiple viewpoints independently. In this article, we propose to jointly consider temporal information and multiview information in a unified adversarial learning framework. Given a 2D skeleton, a pose generator network estimates the corresponding 3D skeleton, and a camera network estimates the camera parameters. A critic network evaluates the estimated 3D skeleton to determine whether it is a plausible 3D human pose. Based on the estimated camera parameters, the estimated 3D skeleton can be re-projected into a 2D skeleton, which should be similar to the input 2D skeleton. Together, re-projection and adversarial learning enable self-supervision. We design the architectures of these networks to take 2D skeletons from multiple viewpoints in temporally consecutive frames. We verify that jointly considering the two types of information substantially improves performance.
UR - http://www.scopus.com/inward/record.url?scp=85098913964&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098913964&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2020.3045794
DO - 10.1109/ACCESS.2020.3045794
M3 - Article
AN - SCOPUS:85098913964
SN - 2169-3536
VL - 8
SP - 226974
EP - 226981
JO - IEEE Access
JF - IEEE Access
M1 - 9298758
ER -