TY - GEN
T1 - Efficient Multi-training Framework of Image Deep Learning on GPU Cluster
AU - Chen, Chun Fu Richard
AU - Lee, Gwo Giun Chris
AU - Xia, Yinglong
AU - Lin, W. Sabrina
AU - Suzumura, Toyotaro
AU - Lin, Ching Yung
N1 - Funding Information:
The Ministry of Science and Technology (MOST 104-2221-E-006-258-MY3) and (MOST 104-2917-I-006-005) of the Republic of China, Taiwan, and JST (Japan Science and Technology Agency) are gratefully acknowledged for partially supporting this research.
Publisher Copyright:
© 2015 IEEE.
PY - 2016/3/25
Y1 - 2016/3/25
N2 - In this paper, we develop a pipelining scheme for image deep learning on a GPU cluster to handle the heavy workload of the training procedure. In addition, it is usually necessary to train multiple models to obtain a good deep learning model, because a priori knowledge of deep neural network structures is limited. Adopting parallel and distributed computing is therefore an obvious path forward, but the mileage varies depending on how amenable a deep network is to parallelization and on the availability of rapid prototyping capabilities with a low cost of entry. In this work, we propose a framework that organizes the training procedures of multiple deep learning models into a pipeline on a GPU cluster, where each stage is handled by a particular GPU with a partition of the training dataset. Instead of frequently migrating data among disks, CPUs, and GPUs, our framework moves only partially trained models, reducing bandwidth consumption and leveraging the full computation capability of the cluster. We deploy the proposed framework on popular image recognition tasks using deep learning, and the experiments show that the proposed method reduces overall training time by up to dozens of hours compared to the baseline method.
AB - In this paper, we develop a pipelining scheme for image deep learning on a GPU cluster to handle the heavy workload of the training procedure. In addition, it is usually necessary to train multiple models to obtain a good deep learning model, because a priori knowledge of deep neural network structures is limited. Adopting parallel and distributed computing is therefore an obvious path forward, but the mileage varies depending on how amenable a deep network is to parallelization and on the availability of rapid prototyping capabilities with a low cost of entry. In this work, we propose a framework that organizes the training procedures of multiple deep learning models into a pipeline on a GPU cluster, where each stage is handled by a particular GPU with a partition of the training dataset. Instead of frequently migrating data among disks, CPUs, and GPUs, our framework moves only partially trained models, reducing bandwidth consumption and leveraging the full computation capability of the cluster. We deploy the proposed framework on popular image recognition tasks using deep learning, and the experiments show that the proposed method reduces overall training time by up to dozens of hours compared to the baseline method.
UR - http://www.scopus.com/inward/record.url?scp=84969622612&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84969622612&partnerID=8YFLogxK
U2 - 10.1109/ISM.2015.119
DO - 10.1109/ISM.2015.119
M3 - Conference contribution
AN - SCOPUS:84969622612
T3 - Proceedings - 2015 IEEE International Symposium on Multimedia, ISM 2015
SP - 489
EP - 494
BT - Proceedings - 2015 IEEE International Symposium on Multimedia, ISM 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 17th IEEE International Symposium on Multimedia, ISM 2015
Y2 - 14 December 2015 through 16 December 2015
ER -