TY - GEN
T1 - Optimization of stride prefetching mechanism and dependent warp scheduling on GPGPU
AU - Tsou, Tsung Han
AU - Chen, Dun Jie
AU - Hung, Sheng Yang
AU - Wang, Yu Hsiang
AU - Chen, Chung Ho
N1 - Funding Information:
ACKNOWLEDGMENT This work is supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 109-2634-F-006-018-.
Publisher Copyright:
© 2020 IEEE
PY - 2020
Y1 - 2020
N2 - In this paper, we propose a data prefetching scheme, History-Awoken Stride (HAS) prefetching, optimized with a warp scheduler, Prefetched-Then-Executed (PTE), and evaluate its performance on a platform that we developed. Our platform is a single instruction, multiple thread (SIMT) GPGPU environment supporting the OpenCL 1.2 runtime and the TensorFlow framework via CUDA-on-CL technology. The enormous number of concurrently executing threads on a GPU demands high memory performance. HAS exploits a history table of related memory accesses, both intra-warp and inter-warp within the same workgroup as well as across workgroups, and uses address strides and warp status to monitor the prefetching progress of each executed warp. PTE precisely issues warps according to the prefetching status reported by HAS. Experimental results for LeNet-5 inference and 11 PolyBench test programs on CAS-GPU show that our mechanism achieves an average IPC improvement of 10.4% and a 7.8% reduction in data cache miss rate. The prefetch accuracy reaches 67.7%, and the proportion of prefetch requests arriving at the appropriate time reaches 48.2%.
AB - In this paper, we propose a data prefetching scheme, History-Awoken Stride (HAS) prefetching, optimized with a warp scheduler, Prefetched-Then-Executed (PTE), and evaluate its performance on a platform that we developed. Our platform is a single instruction, multiple thread (SIMT) GPGPU environment supporting the OpenCL 1.2 runtime and the TensorFlow framework via CUDA-on-CL technology. The enormous number of concurrently executing threads on a GPU demands high memory performance. HAS exploits a history table of related memory accesses, both intra-warp and inter-warp within the same workgroup as well as across workgroups, and uses address strides and warp status to monitor the prefetching progress of each executed warp. PTE precisely issues warps according to the prefetching status reported by HAS. Experimental results for LeNet-5 inference and 11 PolyBench test programs on CAS-GPU show that our mechanism achieves an average IPC improvement of 10.4% and a 7.8% reduction in data cache miss rate. The prefetch accuracy reaches 67.7%, and the proportion of prefetch requests arriving at the appropriate time reaches 48.2%.
UR - http://www.scopus.com/inward/record.url?scp=85109258369&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85109258369&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85109258369
T3 - Proceedings - IEEE International Symposium on Circuits and Systems
BT - 2020 IEEE International Symposium on Circuits and Systems, ISCAS 2020 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 52nd IEEE International Symposium on Circuits and Systems, ISCAS 2020
Y2 - 10 October 2020 through 21 October 2020
ER -