TY - GEN
T1 - A Fast and Scalable Cluster Simulator for Network Performance Projection of HPC Applications
AU - Liu, Cheng Yueh
AU - Huang, Po Yao
AU - Tu, Chia Heng
AU - Hung, Shih Hao
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/10/29
Y1 - 2018/10/29
N2 - When a high-performance computing (HPC) cluster solves distributed computing problems with big data, its performance is usually bounded by the communication ability of each node and by the underlying network architecture of the cluster. Network simulations are commonly adopted to evaluate the impact of network configurations before such a high-cost system is deployed, and a fast, scalable network performance simulation with a profiling mechanism is crucial for exploring the design space in a timely and effective manner. This work presents Clusim, a cluster simulator for exploring the performance interplay between distributed HPC applications and network architectures. Three primary techniques are developed in Clusim: 1) full-system emulators forming a virtual cluster for collecting network communication traces, 2) window-based simulation, and 3) phase-aware accelerated network simulation. The first technique reduces the cost and effort of establishing a physical cluster system for performance estimation. The second enables cross-platform performance projection by thoroughly preserving the concurrency of the recorded packet flows during simulation. The third reduces simulation time by skipping the detailed simulation of repeated phases of network traffic. We validate Clusim against a physical cluster and show that the estimated end-to-end delays of the simulated network are within 15% of those measured on the physical counterpart. Meanwhile, thanks to the phase-aware simulation scheme, Clusim achieves up to 18x speedup compared with packet-by-packet simulation in ns-3. Our results suggest that Clusim is a promising design that can rapidly provide network performance projections for a large-scale system with an appropriate trade-off between accuracy and time.
AB - When a high-performance computing (HPC) cluster solves distributed computing problems with big data, its performance is usually bounded by the communication ability of each node and by the underlying network architecture of the cluster. Network simulations are commonly adopted to evaluate the impact of network configurations before such a high-cost system is deployed, and a fast, scalable network performance simulation with a profiling mechanism is crucial for exploring the design space in a timely and effective manner. This work presents Clusim, a cluster simulator for exploring the performance interplay between distributed HPC applications and network architectures. Three primary techniques are developed in Clusim: 1) full-system emulators forming a virtual cluster for collecting network communication traces, 2) window-based simulation, and 3) phase-aware accelerated network simulation. The first technique reduces the cost and effort of establishing a physical cluster system for performance estimation. The second enables cross-platform performance projection by thoroughly preserving the concurrency of the recorded packet flows during simulation. The third reduces simulation time by skipping the detailed simulation of repeated phases of network traffic. We validate Clusim against a physical cluster and show that the estimated end-to-end delays of the simulated network are within 15% of those measured on the physical counterpart. Meanwhile, thanks to the phase-aware simulation scheme, Clusim achieves up to 18x speedup compared with packet-by-packet simulation in ns-3. Our results suggest that Clusim is a promising design that can rapidly provide network performance projections for a large-scale system with an appropriate trade-off between accuracy and time.
UR - http://www.scopus.com/inward/record.url?scp=85057364968&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85057364968&partnerID=8YFLogxK
U2 - 10.1109/HPCS.2018.00153
DO - 10.1109/HPCS.2018.00153
M3 - Conference contribution
AN - SCOPUS:85057364968
T3 - Proceedings - 2018 International Conference on High Performance Computing and Simulation, HPCS 2018
SP - 970
EP - 977
BT - Proceedings - 2018 International Conference on High Performance Computing and Simulation, HPCS 2018
A2 - Zine-Dine, Khalid
A2 - Smari, Waleed W.
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 16th International Conference on High Performance Computing and Simulation, HPCS 2018
Y2 - 16 July 2018 through 20 July 2018
ER -