TY - JOUR
T1 - Adaptive cache pre-forwarding policy for distributed deep learning
AU - Cheng, Sheng Tzong
AU - Hsu, Chih Wei
AU - Horng, Gwo Jiun
AU - Lin, Che Hsuan
N1 - Funding Information:
This work was supported by the "Allied Advanced Intelligent Biomedical Research Center, STUST" under Higher Education Sprout Project, Ministry of Education, Tainan, Taiwan.
Publisher Copyright:
© 2020 Elsevier Ltd
PY - 2020/3
Y1 - 2020/3
N2 - With the rapid growth of deep learning algorithms, several high-accuracy models have been developed and applied to many real-world domains. Deep learning is highly parallel and well suited to distributed computing, which can significantly improve system throughput. However, cross-machine training faces a bottleneck: network latency. Nodes frequently must wait for synchronization, and the content of each synchronization may range from several megabytes to hundreds of megabytes. Thus, network communication consumes considerable time during training, which reduces system performance. Consequently, many computing architectures have been proposed. This paper proposes a distributed computing system for deep learning. Our design reduces synchronization and network blocking times through a new cache mechanism, called cache pre-forwarding, which exploits reinforcement learning to train a pre-forwarding policy that increases the cache hit rate. Owing to the nature of reinforcement learning, the policy is adaptive and applicable to different computing environments. Finally, we experimentally demonstrate that our system is feasible.
AB - With the rapid growth of deep learning algorithms, several high-accuracy models have been developed and applied to many real-world domains. Deep learning is highly parallel and well suited to distributed computing, which can significantly improve system throughput. However, cross-machine training faces a bottleneck: network latency. Nodes frequently must wait for synchronization, and the content of each synchronization may range from several megabytes to hundreds of megabytes. Thus, network communication consumes considerable time during training, which reduces system performance. Consequently, many computing architectures have been proposed. This paper proposes a distributed computing system for deep learning. Our design reduces synchronization and network blocking times through a new cache mechanism, called cache pre-forwarding, which exploits reinforcement learning to train a pre-forwarding policy that increases the cache hit rate. Owing to the nature of reinforcement learning, the policy is adaptive and applicable to different computing environments. Finally, we experimentally demonstrate that our system is feasible.
UR - http://www.scopus.com/inward/record.url?scp=85078699789&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85078699789&partnerID=8YFLogxK
U2 - 10.1016/j.compeleceng.2020.106558
DO - 10.1016/j.compeleceng.2020.106558
M3 - Article
AN - SCOPUS:85078699789
VL - 82
JO - Computers and Electrical Engineering
JF - Computers and Electrical Engineering
SN - 0045-7906
M1 - 106558
ER -