TY - JOUR
T1 - Data race avoidance and replay scheme for developing and debugging parallel programs on distributed shared memory systems
AU - Chiu, Yung Chang
AU - Shieh, Ce Kuen
AU - Huang, Tzu Chi
AU - Liang, Tyng Yeu
AU - Chu, Kuo Chih
N1 - Funding Information:
We gratefully acknowledge the National Science Council of Taiwan for their support of this project under Grant No. NSC 98-2811-E-006-014. We further offer our special thanks to the editor and reviewers of the publishing journal for their valuable comments and suggestions, which materially improved the quality of this paper.
PY - 2011/1
Y1 - 2011/1
N2 - Distributed shared memory (DSM) allows parallel programs to run on distributed computers by simulating a global virtual shared memory, but data race bugs can easily occur when the threads of a multi-threaded process concurrently access the physically distributed memory. Earlier tools that help programmers locate data race bugs in non-DSM parallel programs are not easily applied to DSM systems. This study presents the data race avoidance and replay scheme (DRARS) to assist in debugging parallel programs on DSM or multi-core systems. DRARS is a novel tool that controls the consistency protocol of the target program, automatically preventing a large class of data race bugs when the parallel program is subsequently run and obviating much of the need for manual debugging. For data race bugs that cannot be avoided automatically, DRARS performs a deterministic-replay function on DSM systems, faithfully reproducing the run-time behavior of the parallel program. Because one class of data race bugs has already been eliminated, the remaining manual debugging task is greatly simplified. Unlike previous debugging methods, DRARS does not require that the parallel program be written in a specific style or programming language. Moreover, DRARS can be implemented in most consistency protocols. In this paper, DRARS is implemented and verified experimentally with various applications on a DSM system using the eager release consistency protocol.
AB - Distributed shared memory (DSM) allows parallel programs to run on distributed computers by simulating a global virtual shared memory, but data race bugs can easily occur when the threads of a multi-threaded process concurrently access the physically distributed memory. Earlier tools that help programmers locate data race bugs in non-DSM parallel programs are not easily applied to DSM systems. This study presents the data race avoidance and replay scheme (DRARS) to assist in debugging parallel programs on DSM or multi-core systems. DRARS is a novel tool that controls the consistency protocol of the target program, automatically preventing a large class of data race bugs when the parallel program is subsequently run and obviating much of the need for manual debugging. For data race bugs that cannot be avoided automatically, DRARS performs a deterministic-replay function on DSM systems, faithfully reproducing the run-time behavior of the parallel program. Because one class of data race bugs has already been eliminated, the remaining manual debugging task is greatly simplified. Unlike previous debugging methods, DRARS does not require that the parallel program be written in a specific style or programming language. Moreover, DRARS can be implemented in most consistency protocols. In this paper, DRARS is implemented and verified experimentally with various applications on a DSM system using the eager release consistency protocol.
UR - http://www.scopus.com/inward/record.url?scp=78650512724&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78650512724&partnerID=8YFLogxK
U2 - 10.1016/j.parco.2010.09.002
DO - 10.1016/j.parco.2010.09.002
M3 - Article
AN - SCOPUS:78650512724
SN - 0167-8191
VL - 37
SP - 11
EP - 25
JO - Parallel Computing
JF - Parallel Computing
IS - 1
ER -