Overhead of using spare nodes

Published: 2020-02-04 Issue: 2 Volume: 34 Page: 208-226
ISSN: 1094-3420
Container-title: The International Journal of High Performance Computing Applications
Language: en

Author:

Hori Atsushi¹, Yoshinaga Kazumi², Herault Thomas³, Bouteiller Aurélien³, Bosilca George³, Ishikawa Yutaka¹

Affiliation:

1. RIKEN Center for Computational Science, Kobe, Hyogo, Japan

2. Meguro-ku, Tokyo, Japan

3. Innovative Computing Laboratory, The University of Tennessee, Knoxville, TN, USA

Abstract

With the increasing fault rate on high-end supercomputers, the topic of fault tolerance has been gathering attention. To cope with this situation, various fault-tolerance techniques are under investigation; these include user-level, algorithm-based fault-tolerance techniques and parallel execution environments that enable jobs to continue following node failure. Even with these techniques, some programs with static load balancing, such as stencil computation, may underperform after a failure recovery. Even when spare nodes are present, they are not always substituted for failed nodes in an effective way. This article considers the questions of how spare nodes should be allocated, how to substitute them for faulty nodes, and how much the communication performance is affected by such a substitution. The third question stems from the modification of the rank mapping by node substitutions, which can incur additional message collisions. In a stencil computation, rank mapping is done in a straightforward way on a Cartesian network without incurring any message collisions. However, once a substitution has occurred, the optimal node-rank mapping may be destroyed. Therefore, these questions must be answered in a way that minimizes the degradation of communication performance. In this article, several spare node allocation and failed node substitution methods will be proposed, analyzed, and compared in terms of communication performance following the substitution. The proposed substitution methods are named sliding methods. The sliding methods are analyzed by using our developed simulation program and evaluated by using the K computer, Blue Gene/Q (BG/Q), and TSUBAME 2.5. It will be shown that when failures occur, the stencil communication performance on the K and BG/Q can be slowed around 10 times depending on the number of node failures. The barrier performance on the K can be cut in half. On BG/Q, barrier performance can be slowed by a factor of 10. 
Further, it will also be shown that almost no such communication performance degradation can be seen on TSUBAME 2.5. This is because TSUBAME 2.5 has an InfiniBand network connected in a fat-tree topology, while the K computer and BG/Q have dedicated Cartesian networks. Thus, the communication performance degradation depends on network characteristics.
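
The effect of a substitution on rank mapping can be illustrated with a small sketch. It is not the simulation program or the sliding methods from the article; it assumes a hypothetical 2D mesh with one extra column of spare nodes, a 5-point stencil, and row-major rank placement. All function names are made up for this illustration. Hop count (Manhattan distance between nodes) is used as a crude proxy for communication cost and does not model message collisions. Two substitutions are compared: moving the failed rank directly onto the spare in its row, and shifting every rank from the failed position onward one node toward the spare.

```python
def build_mapping(width, height):
    """Map each rank to (x, y) node coordinates, row-major.
    Column index `width` is assumed to hold one spare node per row."""
    return {y * width + x: (x, y) for y in range(height) for x in range(width)}


def stencil_neighbors(rank, width, height):
    """Ranks a 5-point stencil exchanges data with (no wrap-around)."""
    x, y = rank % width, rank // width
    out = []
    if x > 0:
        out.append(rank - 1)
    if x < width - 1:
        out.append(rank + 1)
    if y > 0:
        out.append(rank - width)
    if y < height - 1:
        out.append(rank + width)
    return out


def hops(a, b):
    """Manhattan distance between two node coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def max_and_total_hops(mapping, width, height):
    """Longest and summed hop count over all directed stencil messages."""
    dists = [hops(mapping[r], mapping[n])
             for r in mapping
             for n in stencil_neighbors(r, width, height)]
    return max(dists), sum(dists)


def substitute_direct(mapping, failed_rank, width):
    """Place the failed rank on the spare node at the end of its row."""
    new = dict(mapping)
    _, y = mapping[failed_rank]
    new[failed_rank] = (width, y)
    return new


def substitute_shift(mapping, failed_rank, width):
    """Shift the failed rank and all ranks after it in the row by one node
    toward the spare column, leaving the failed node unused."""
    new = dict(mapping)
    fx, fy = mapping[failed_rank]
    for r, (x, y) in mapping.items():
        if y == fy and x >= fx:
            new[r] = (x + 1, y)
    return new


if __name__ == "__main__":
    W, H, FAILED = 4, 4, 5  # hypothetical 4x4 mesh; the node of rank 5 fails
    base = build_mapping(W, H)
    print("no failure:", max_and_total_hops(base, W, H))
    print("direct    :", max_and_total_hops(substitute_direct(base, FAILED, W), W, H))
    print("shift     :", max_and_total_hops(substitute_shift(base, FAILED, W), W, H))
```

In this toy setting every stencil message travels one hop before the failure. Either substitution lengthens some paths, and the two choices trade a few long paths against many slightly longer ones, which is the kind of trade-off the abstract describes for Cartesian networks.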

Publisher

SAGE Publications

Subject

Hardware and Architecture, Theoretical Computer Science, Software

Link

http://journals.sagepub.com/doi/pdf/10.1177/1094342020901885

Reference: 29 articles.

1. Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers

2. Post-failure recovery of MPI communication capability

3. Hardware-Centric Analysis of Network Performance for MPI Applications

4. The IBM Blue Gene/Q interconnection network and message unit

Cited by 4 articles.

1. Proteo: a framework for the generation and evaluation of malleable MPI applications;The Journal of Supercomputing;2024-07-02

2. Task-Level Resilience: Checkpointing vs. Supervision;International Journal of Networking and Computing;2022

3. Checkpointing vs. Supervision Resilience Approaches for Dynamic Independent Tasks;2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW);2021-06

4. Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method;49th International Conference on Parallel Processing - ICPP;2020-08-17