TY - JOUR
T1 - DOcyclical
T2 - A Latency-Resistant Cyclic Multi-Threading Approach for Automatic Program Parallelization
AU - Yu, Hairong
AU - Li, Guohui
AU - Li, Jianjun
AU - Shu, Lihchyun
N1 - Funding Information:
This research was supported by the National Natural Science Foundation of China under Grant Nos. 61572215 and 61332001, the China Postdoctoral Science Foundation under Grant No. 2013M531696, and the China Postdoctoral Science Special Foundation under Grant No. 2015T80802.
PY - 2016/8/1
Y1 - 2016/8/1
N2 - Chip multiprocessors have been proposed for many years and have become the prevalent architecture for high-performance general-purpose processors. The search for automatic parallelization techniques that can take full advantage of processor resources is still an active research area. The cyclic multi-threading (CMT) approach, a popular parallelization paradigm, is applicable to many applications and delivers good performance scalability. Even so, its performance can be quite sensitive to fluctuations in communication latencies in the absence of substantive operations that prefetch synchronization signals. To address this problem, we propose a novel CMT technique called ${\rm DO}_{\rm cyclical}$ that employs a priority-based scheme to greatly reduce the frequency of cross-core loop-carried dependences, and hence removes a considerable amount of communication latency from the critical paths of loop executions. Furthermore, the priority-based scheme keeps all processors busy most of the time while keeping the processor load balanced. To demonstrate the capabilities of ${\rm DO}_{\rm cyclical}$, we have evaluated it using the SPEC CPU2006 and StreamIt benchmarks on three real platforms. Experimental results show that our method is much less sensitive to fluctuations in communication latencies than traditional cyclic multi-threading techniques. In addition, ${\rm DO}_{\rm cyclical}$ outperforms other well-known parallelization methods, including decoupled software pipelining (DSWP), PS-DSWP and HELIX, in terms of speedup by 21-50%, 16-27% and 15-25%, respectively, on the three platforms.
AB - Chip multiprocessors have been proposed for many years and have become the prevalent architecture for high-performance general-purpose processors. The search for automatic parallelization techniques that can take full advantage of processor resources is still an active research area. The cyclic multi-threading (CMT) approach, a popular parallelization paradigm, is applicable to many applications and delivers good performance scalability. Even so, its performance can be quite sensitive to fluctuations in communication latencies in the absence of substantive operations that prefetch synchronization signals. To address this problem, we propose a novel CMT technique called ${\rm DO}_{\rm cyclical}$ that employs a priority-based scheme to greatly reduce the frequency of cross-core loop-carried dependences, and hence removes a considerable amount of communication latency from the critical paths of loop executions. Furthermore, the priority-based scheme keeps all processors busy most of the time while keeping the processor load balanced. To demonstrate the capabilities of ${\rm DO}_{\rm cyclical}$, we have evaluated it using the SPEC CPU2006 and StreamIt benchmarks on three real platforms. Experimental results show that our method is much less sensitive to fluctuations in communication latencies than traditional cyclic multi-threading techniques. In addition, ${\rm DO}_{\rm cyclical}$ outperforms other well-known parallelization methods, including decoupled software pipelining (DSWP), PS-DSWP and HELIX, in terms of speedup by 21-50%, 16-27% and 15-25%, respectively, on the three platforms.
UR - http://www.scopus.com/inward/record.url?scp=84992135512&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84992135512&partnerID=8YFLogxK
U2 - 10.1093/comjnl/bxv125
DO - 10.1093/comjnl/bxv125
M3 - Article
AN - SCOPUS:84992135512
SN - 0010-4620
VL - 59
SP - 1155
EP - 1173
JO - Computer Journal
JF - Computer Journal
IS - 8
ER -