TY - GEN
T1 - Modeling Interprocessor Communication and Performance Scalability for Distributed Deep Learning Systems
AU - Lyu, Yi-Hong
AU - Liu, Cheng-Yueh
AU - Lee, Chen-Pang
AU - Tu, Chia-Heng
AU - Hung, Shih-Hao
N1 - Funding Information:
This work was financially supported by the Ministry of Science and Technology of Taiwan under grants MOST No. 107-2218-E-002-053- and MOST No. 107-2218-E-002-003-.
Publisher Copyright:
© 2019 IEEE.
PY - 2019/7
Y1 - 2019/7
N2 - As deep learning applications become popular, designing deep learning systems that unleash the computing power of the underlying hardware is a critical task. Aside from the computing hardware, the computer network is also a key factor that affects the delivered performance. For a large and complex model, the scalability of the system depends heavily on the design of the network as well as the software behavior. In this paper, we propose a profile-data-guided performance prediction method that estimates the performance of a system equipped with the desired high-speed interconnects, based on profiling data obtained in a previous run. In particular, we leverage the open-source profiling tool SOFA to characterize the software activities of deep learning software running on a computer cluster, and the characterized information is used to build a performance model of the training process. When estimating performance, SOFA captures the performance-critical factors that the model needs to make its predictions. To evaluate the proposed method, four popular deep learning models are adopted in our experiments, ResNet50, Inception3, AlexNet, and VGG16, and a four-node computer cluster is used to profile the training of these models on TensorFlow. We ran a scalability analysis to determine the appropriate cluster size and the suitable computer networks for the models. Comparing the predicted data against measurements on the cluster, our model achieves up to 95% accuracy in most cases, with a maximum error rate of 10%.
AB - As deep learning applications become popular, designing deep learning systems that unleash the computing power of the underlying hardware is a critical task. Aside from the computing hardware, the computer network is also a key factor that affects the delivered performance. For a large and complex model, the scalability of the system depends heavily on the design of the network as well as the software behavior. In this paper, we propose a profile-data-guided performance prediction method that estimates the performance of a system equipped with the desired high-speed interconnects, based on profiling data obtained in a previous run. In particular, we leverage the open-source profiling tool SOFA to characterize the software activities of deep learning software running on a computer cluster, and the characterized information is used to build a performance model of the training process. When estimating performance, SOFA captures the performance-critical factors that the model needs to make its predictions. To evaluate the proposed method, four popular deep learning models are adopted in our experiments, ResNet50, Inception3, AlexNet, and VGG16, and a four-node computer cluster is used to profile the training of these models on TensorFlow. We ran a scalability analysis to determine the appropriate cluster size and the suitable computer networks for the models. Comparing the predicted data against measurements on the cluster, our model achieves up to 95% accuracy in most cases, with a maximum error rate of 10%.
UR - http://www.scopus.com/inward/record.url?scp=85092042686&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85092042686&partnerID=8YFLogxK
U2 - 10.1109/HPCS48598.2019.9188168
DO - 10.1109/HPCS48598.2019.9188168
M3 - Conference contribution
AN - SCOPUS:85092042686
T3 - 2019 International Conference on High Performance Computing and Simulation, HPCS 2019
SP - 169
EP - 176
BT - 2019 International Conference on High Performance Computing and Simulation, HPCS 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2019 International Conference on High Performance Computing and Simulation, HPCS 2019
Y2 - 15 July 2019 through 19 July 2019
ER -