Task Scheduling for Maximizing Performance and Reliability Considering Fault Recovery in Heterogeneous Distributed Systems

Research output: Contribution to journal › Article › peer-review

30 Citations (Scopus)

Abstract

Machine and network failures degrade the results of applications executed on a system, so the reliability of applications is an important issue. Recovering failed machines may also increase processing time. This work studies the expected makespan of task schedules in heterogeneous distributed systems. A task may be replicated multiple times to reduce its expected execution time. A two-phase algorithm is proposed. The first phase uses a linear programming formulation and a rounding procedure to obtain a favorable allotment for minimizing the expected makespan. The second phase applies a scheduling method based on the expected execution time and the communication time. During execution, two strategies are considered. In the first strategy, no replica of a task running on a set of processors can be stopped. In the second strategy, once one replica of a task has completed, the other replicas of that task are immediately aborted. A comparison shows that the proposed algorithm significantly outperforms previously proposed algorithms in terms of schedule length ratio, reliability, and speedup.
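The difference between the two execution strategies can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the `Replica` type, the `succeeds` flag, and both helper functions are hypothetical names introduced here, and the sketch assumes that a failed replica still occupies its processor for its full execution time (recovery is not modeled).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Replica:
    processor: str    # hypothetical processor label
    exec_time: float  # time this processor needs to run the task
    succeeds: bool    # assumed outcome: True if the replica finishes without failure


def task_finish_time(replicas: List[Replica]) -> Optional[float]:
    """Earliest finish time among successful replicas, or None if all fail."""
    times = [r.exec_time for r in replicas if r.succeeds]
    return min(times) if times else None


def processor_release_times(replicas: List[Replica],
                            abort_on_first_completion: bool) -> Dict[str, float]:
    """Time at which each processor becomes free for other work.

    abort_on_first_completion=False: first strategy, every replica runs to its end.
    abort_on_first_completion=True: second strategy, remaining replicas are
    aborted as soon as one replica completes.
    """
    finish = task_finish_time(replicas)
    release = {}
    for r in replicas:
        if abort_on_first_completion and finish is not None:
            release[r.processor] = min(r.exec_time, finish)
        else:
            release[r.processor] = r.exec_time
    return release


replicas = [
    Replica("p1", 4.0, True),
    Replica("p2", 6.0, True),
    Replica("p3", 3.0, False),  # assumed to fail
]
no_abort = processor_release_times(replicas, abort_on_first_completion=False)
with_abort = processor_release_times(replicas, abort_on_first_completion=True)
```

Under these assumptions the task completes at the same time in both cases (the earliest successful replica), but with aborting, processor `p2` is released at time 4.0 instead of 6.0, which leaves more room for scheduling subsequent tasks.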

Original language: English
Article number: 7042339
Pages (from-to): 521-532
Number of pages: 12
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 27
Issue number: 2
DOIs
Publication status: Published - 2016 Feb 1

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics
