TY - JOUR
T1 - Enabling SIMT execution model on homogeneous multi-Core system
AU - Chen, Kuan Chung
AU - Chen, Chung Ho
N1 - Funding Information:
This work was supported by Ministry of Science and Technology, Taiwan, under Grant MOST 103-2221-E-006-266-MY3.
Publisher Copyright:
© 2018 ACM.
PY - 2018/3
Y1 - 2018/3
N2 - Single-instruction multiple-thread (SIMT) machine emerges as a primary computing device in high-performance computing, since the SIMT execution paradigm can exploit data-level parallelism effectively. This article explores the SIMT execution potential on homogeneous multi-core processors, which generally run in multiple-instruction multiple-data (MIMD) mode when utilizing the multi-core resources. We address three architecture issues in enabling SIMT execution model on multi-core processor, including multithreading execution model, kernel thread context placement, and thread divergence. For the SIMT execution model, we propose a fine-grained multithreading mechanism on an ARM-based multi-core system. Each of the processor cores stores the kernel thread contexts in its L1 data cache for per-cycle thread-switching requirement. For divergence-intensive kernels, an Inner Conditional Statement First (ICS-First) mechanism helps early re-convergence to occur and significantly improves the performance. The experiment results show that effectiveness in data-parallel processing reduces on average 36% dynamic instructions, and boosts the SIMT executions to achieve on average 1.52× and up to 5× speedups over the MIMD counterpart for OpenCL benchmarks for single issue in-order processor cores. By using the explicit vectorization optimization on the kernels, the SIMT model gains further benefits from the SIMD extension and achieves 1.71× speedup over the MIMD approach. The SIMT model using in-order superscalar processor cores outperforms the MIMD model that uses superscalar out-of-order processor cores by 40%. The results show that, to exploit data-level parallelism, enabling the SIMT model on homogeneous multi-core processors is important.
AB - Single-instruction multiple-thread (SIMT) machine emerges as a primary computing device in high-performance computing, since the SIMT execution paradigm can exploit data-level parallelism effectively. This article explores the SIMT execution potential on homogeneous multi-core processors, which generally run in multiple-instruction multiple-data (MIMD) mode when utilizing the multi-core resources. We address three architecture issues in enabling SIMT execution model on multi-core processor, including multithreading execution model, kernel thread context placement, and thread divergence. For the SIMT execution model, we propose a fine-grained multithreading mechanism on an ARM-based multi-core system. Each of the processor cores stores the kernel thread contexts in its L1 data cache for per-cycle thread-switching requirement. For divergence-intensive kernels, an Inner Conditional Statement First (ICS-First) mechanism helps early re-convergence to occur and significantly improves the performance. The experiment results show that effectiveness in data-parallel processing reduces on average 36% dynamic instructions, and boosts the SIMT executions to achieve on average 1.52× and up to 5× speedups over the MIMD counterpart for OpenCL benchmarks for single issue in-order processor cores. By using the explicit vectorization optimization on the kernels, the SIMT model gains further benefits from the SIMD extension and achieves 1.71× speedup over the MIMD approach. The SIMT model using in-order superscalar processor cores outperforms the MIMD model that uses superscalar out-of-order processor cores by 40%. The results show that, to exploit data-level parallelism, enabling the SIMT model on homogeneous multi-core processors is important.
UR - http://www.scopus.com/inward/record.url?scp=85045186557&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85045186557&partnerID=8YFLogxK
U2 - 10.1145/3177960
DO - 10.1145/3177960
M3 - Article
AN - SCOPUS:85045186557
SN - 1544-3566
VL - 15
JO - ACM Transactions on Architecture and Code Optimization
JF - ACM Transactions on Architecture and Code Optimization
IS - 1
M1 - a6
ER -