Authors: M. E. Gómez, J. Duato, J. Flich, P. Lopez, A. Robles, N. A. Nordbotten, O. Lysne, and T. Skeie
Title: An Efficient Fault-Tolerant Routing Methodology for Meshes and Tori
Affiliation: Communication Systems, Communication Systems
Status: Published
Publication Type: Journal Article
Year of Publication: 2004
Journal: IEEE Computer Architecture Letters
Volume: 3
Issue: 1
Pagination: 3
Date Published: May
Publisher: IEEE
Abstract

In this paper, we present a methodology to design fault-tolerant routing algorithms for regular direct interconnection networks. It supports fully adaptive routing, does not degrade performance in the absence of faults, and supports a reasonably large number of faults without significantly degrading performance. The methodology is mainly based on the selection of an intermediate node (if needed) for each source-destination pair. Packets are adaptively routed to the intermediate node and, at this node, without being ejected, they are adaptively forwarded to their destinations. To allow deadlock-free minimal adaptive routing, the methodology requires only one additional virtual channel (for a total of three), even for tori. Evaluation results for a 4x4x4 torus network show that the methodology is 5-fault tolerant. Indeed, for up to 14 link failures, the percentage of fault combinations supported is higher than 99.96%. Additionally, network throughput degrades by less than 10% when injecting three random link faults without disabling any node. In contrast, a mechanism similar to the one proposed for BlueGene/L, which disables some network planes, would degrade network throughput by 79%.
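
The intermediate-node idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not give the selection rule, so everything below is an assumption made for illustration. In particular, the helper names (`minimal_links`, `fault_free`, `pick_intermediate`), the rule that a leg is usable only if no minimal path on it crosses a faulty link, the preference for the intermediate node with the fewest total hops, and the tie-break toward the positive direction when both directions around a ring are equally short are all hypothetical choices. Virtual channels and deadlock avoidance are not modeled.

```python
from itertools import product


def minimal_steps(a, b, k):
    """Per-dimension (direction, hop count) of a minimal route in a k-ary torus."""
    steps = []
    for x, y in zip(a, b):
        d = (y - x) % k
        # Assumption: when both directions are equally short (d == k/2),
        # the positive direction is used.
        steps.append((1, d) if d <= k // 2 else (-1, k - d))
    return steps


def hops(a, b, k):
    """Length of a minimal route from a to b."""
    return sum(h for _, h in minimal_steps(a, b, k))


def minimal_links(a, b, k):
    """All undirected links that lie on some minimal path from a to b."""
    steps = minimal_steps(a, b, k)
    links = set()
    for offs in product(*(range(h + 1) for _, h in steps)):
        node = tuple((x + s * t) % k for x, (s, _), t in zip(a, steps, offs))
        for dim, (s, h) in enumerate(steps):
            if offs[dim] < h:  # one more hop is still allowed in this dimension
                nxt = list(node)
                nxt[dim] = (nxt[dim] + s) % k
                links.add(frozenset((node, tuple(nxt))))
    return links


def fault_free(a, b, k, faulty):
    """True if adaptive minimal routing from a to b can never meet a faulty link."""
    return minimal_links(a, b, k).isdisjoint(faulty)


def pick_intermediate(src, dst, k, faulty):
    """Return None if no intermediate node is needed, else a suitable node."""
    if fault_free(src, dst, k, faulty):
        return None
    candidates = [n for n in product(range(k), repeat=len(src))
                  if n not in (src, dst)]
    # Assumption: prefer the candidate that keeps the total route shortest.
    candidates.sort(key=lambda n: hops(src, n, k) + hops(n, dst, k))
    for mid in candidates:
        if fault_free(src, mid, k, faulty) and fault_free(mid, dst, k, faulty):
            return mid
    raise LookupError("no single intermediate node avoids the faults")


# Example: a 4x4x4 torus with one faulty link next to the source.
faulty = {frozenset(((0, 0, 0), (1, 0, 0)))}
mid = pick_intermediate((0, 0, 0), (2, 0, 0), 4, faulty)
```

In this example the direct route from (0, 0, 0) to (2, 0, 0) would cross the faulty link, so the sketch picks (3, 0, 0), which reaches the destination the other way around the ring in the same number of hops. A pair whose minimal paths are all fault-free gets `None`, matching the "if needed" qualifier in the abstract.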

DOI: 10.1109/L-CA.2004.1
Citation Key: ND.4.Gomez.2004