TY - JOUR
T1 - An object-oriented approach to develop software fault-tolerant mechanisms for parallel programming systems
AU - Shieh, Ce Kuen
AU - Mac, Su Cheong
AU - Chang, Tzu Chiang
AU - Lai, Chung Ming
PY - 1996/3
Y1 - 1996/3
N2 - Some parallel programming systems are libraries that allow programmers to write thread-based parallel programs in existing sequential languages. Parallel programs are harder to debug and much more complex than sequential programs, so design faults may remain in them. This paper designs and implements software fault-tolerant mechanisms, using an object-oriented approach, for existing parallel programming systems. With these software fault-tolerant objects, programmers can write reliable parallel programs on these systems. Three software fault-tolerant mechanisms are supported: Recovery Block, N-Version Programming, and Conversation. All of these mechanisms are implemented and grouped into a separate software layer that resides on top of the parallel programming system and is used to monitor the behavior of applications, detect software faults, and recover and restart programs. The parallel programming system is responsible for managing concurrent threads and for providing the fault-tolerant mechanisms with the necessary concurrency facilities. This layered system architecture makes the software fault-tolerant mechanisms portable, extensible, and low in overhead. We originally implemented the software fault-tolerant objects in C++ on top of Presto. The objects have also been ported to C-Thread of Mach and LWP of SUN OS.
AB - Some parallel programming systems are libraries that allow programmers to write thread-based parallel programs in existing sequential languages. Parallel programs are harder to debug and much more complex than sequential programs, so design faults may remain in them. This paper designs and implements software fault-tolerant mechanisms, using an object-oriented approach, for existing parallel programming systems. With these software fault-tolerant objects, programmers can write reliable parallel programs on these systems. Three software fault-tolerant mechanisms are supported: Recovery Block, N-Version Programming, and Conversation. All of these mechanisms are implemented and grouped into a separate software layer that resides on top of the parallel programming system and is used to monitor the behavior of applications, detect software faults, and recover and restart programs. The parallel programming system is responsible for managing concurrent threads and for providing the fault-tolerant mechanisms with the necessary concurrency facilities. This layered system architecture makes the software fault-tolerant mechanisms portable, extensible, and low in overhead. We originally implemented the software fault-tolerant objects in C++ on top of Presto. The objects have also been ported to C-Thread of Mach and LWP of SUN OS.
UR - http://www.scopus.com/inward/record.url?scp=0030105443&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0030105443&partnerID=8YFLogxK
U2 - 10.1016/0164-1212(95)00126-3
DO - 10.1016/0164-1212(95)00126-3
M3 - Article
AN - SCOPUS:0030105443
SN - 0164-1212
VL - 32
SP - 215
EP - 225
JO - Journal of Systems and Software
JF - Journal of Systems and Software
IS - 3
ER -