Fault-containment in cache memories for TMR redundant processor systems

Chung-Ho Chen, Arun K. Somani

Research output: Contribution to journal (Article)

17 Citations (Scopus)

Abstract

Cache data errors read by a processor may cause CPU control flow errors and force a redundant processor system to enter a CPU-cache reintegration process. The reintegration process degrades system performance and reliability. To reduce the occurrence of such events, we propose a real-time error recovery scheme that provides effective fault containment for data errors in cache memories. The scheme is based on broadcasting a dirty cache line after it is modified, and it exploits the redundancy of a fault-tolerant system through hardware voting. The scheme recovers, with full coverage, from erroneous cache data written by a processor. This recovery feature remedies a shortcoming of error-correcting codes, which cannot prevent such errors. In addition, more than 60 percent of cache lines are fully covered for recovery from errors originating in the cache itself, including unrecoverable ECC errors. The protocol can also be used to speed up the CPU-cache reintegration process for a temporarily failed processor. The performance overhead of the protocol is limited to broadcasting only 2-3 percent of the total memory references.
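
The voting step mentioned in the abstract can be illustrated with a minimal Python sketch. This is a generic bitwise 2-of-3 majority vote over three copies of a cache line, not the paper's hardware protocol, which the abstract does not specify. The names `vote_line` and `faulty_modules`, the byte-string representation of a cache line, and the sample data are all assumptions made for illustration.

```python
from typing import List, Sequence


def vote_line(copies: Sequence[bytes]) -> bytes:
    """Bitwise 2-of-3 majority vote over three equal-length line copies.

    Hypothetical helper: models a broadcast dirty line as one byte string
    per redundant module.
    """
    if len(copies) != 3:
        raise ValueError("TMR voting needs exactly three copies")
    a, b, c = copies
    if not (len(a) == len(b) == len(c)):
        raise ValueError("cache-line copies must have equal length")
    # Each output bit is 1 when at least two of the three input bits are 1.
    return bytes((x & y) | (y & z) | (x & z) for x, y, z in zip(a, b, c))


def faulty_modules(copies: Sequence[bytes], voted: bytes) -> List[int]:
    """Return the indices of modules whose copy differs from the voted line."""
    return [i for i, copy in enumerate(copies) if copy != voted]


# Assumed sample data: module 1 holds a line with a single flipped bit.
good = bytes([0xDE, 0xAD, 0xBE, 0xEF])
bad = bytes([0xDE, 0xAD, 0xBE, 0xEE])
voted = vote_line([good, bad, good])
suspects = faulty_modules([good, bad, good], voted)
```

In this sketch the voted line equals the majority copy, and the module holding the corrupted copy is identified so that its line could be overwritten with the voted value.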

Original language: English
Pages (from-to): 386-397
Number of pages: 12
Journal: IEEE Transactions on Computers
Volume: 48
Issue number: 4
DOI: 10.1109/12.762529
Publication status: Published - 1999 Jan 1

Fingerprint

Cache memory
Cache
Fault
Program processors
Error Recovery
Percent
Computer systems
Data Broadcasting
Network protocols
Fault-tolerant Systems
Line
Error-correcting Codes
System Reliability
Flow Control
Broadcasting
Voting
Broadcast
Redundancy
System Performance

All Science Journal Classification (ASJC) codes

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computational Theory and Mathematics

Cite this

@article{d63bd29ed5fc4b359e43b162bf7dbdf0,
title = "Fault-containment in cache memories for TMR redundant processor systems",
author = "Chen, Chung-Ho and Somani, {Arun K.}",
year = "1999",
month = "1",
day = "1",
doi = "10.1109/12.762529",
language = "English",
volume = "48",
pages = "386--397",
journal = "IEEE Transactions on Computers",
issn = "0018-9340",
publisher = "IEEE Computer Society",
number = "4",

}

Fault-containment in cache memories for TMR redundant processor systems. / Chen, Chung-Ho; Somani, Arun K.

In: IEEE Transactions on Computers, Vol. 48, No. 4, 01.01.1999, p. 386-397.

TY - JOUR

T1 - Fault-containment in cache memories for TMR redundant processor systems

AU - Chen, Chung-Ho

AU - Somani, Arun K.

PY - 1999/1/1

Y1 - 1999/1/1

UR - http://www.scopus.com/inward/record.url?scp=0032667663&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0032667663&partnerID=8YFLogxK

U2 - 10.1109/12.762529

DO - 10.1109/12.762529

M3 - Article

VL - 48

SP - 386

EP - 397

JO - IEEE Transactions on Computers

JF - IEEE Transactions on Computers

SN - 0018-9340

IS - 4

ER -