A Container-based Lightweight Fault Tolerance Framework for High Performance Computing Workloads

Author : Mohamad Othman Sindi
Total Pages : 130
Release : 2019
OCLC : 1144931624

Book Synopsis A Container-based Lightweight Fault Tolerance Framework for High Performance Computing Workloads by : Mohamad Othman Sindi

This book was written by Mohamad Othman Sindi and released in 2019, with a total of 130 pages. Book excerpt: According to the latest list of the world's top 500 supercomputers, ~90% of the top High Performance Computing (HPC) systems are based on commodity hardware clusters, which are typically designed for performance rather than reliability. The Mean Time Between Failures (MTBF) for some current petascale systems has been reported to be several days, while studies estimate it may be less than 60 minutes for future exascale systems. One of the largest studies on HPC system failures showed that more than 50% of failures were due to hardware, and that failure rates grew with system size. Hence, running extended workloads on such systems is becoming more challenging as system sizes grow.

In this work, we design and implement a lightweight fault tolerance framework to improve the sustainability of running workloads on HPC clusters. The framework consists mainly of a fault prediction component and a remedy component. The fault prediction component is implemented using a parallel algorithm that proactively predicts hardware issues with no overhead, which allows remedial actions to be taken before failures impact workloads. The algorithm applies machine learning to supercomputer system logs. We test it on actual logs from three Sandia National Laboratories (SNL) supercomputers, comprising ~750 million log entries (~86 GB of data). The algorithm is also tested online on our test cluster. We demonstrate the algorithm's high accuracy and performance in predicting cluster nodes with potential issues.

The remedy component is implemented using Linux container technology, which has proven its success in the microservices domain. We adapt it to HPC workloads to make use of its resilience potential. By running workloads inside containers, we are able to migrate workloads from nodes predicted to have hardware issues to healthy nodes while the workloads are running. This introduces no major interruption or performance overhead to the workload and requires no application modification. We test with multiple real HPC applications that use the Message Passing Interface (MPI) standard. Tests are performed on various cluster platforms using different MPI types. Results demonstrate successful migration of HPC workloads while maintaining the integrity of the results produced.
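The synopsis notes that failure rates grow with system size. A back-of-the-envelope calculation can illustrate why. The sketch below is not taken from the thesis: it assumes independent node failures at a constant rate, in which case the system MTBF is roughly the per-node MTBF divided by the node count. The function name, the 5-year per-node MTBF, and the node counts are hypothetical values chosen only for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Estimate system MTBF, assuming independent nodes with constant failure rates.

    Under that assumption the failure rates add up, so the system-wide MTBF
    is the per-node MTBF divided by the number of nodes.
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return node_mtbf_hours / node_count


if __name__ == "__main__":
    # Hypothetical per-node MTBF of 5 years (illustrative, not from the source).
    node_mtbf = 5 * HOURS_PER_YEAR
    for nodes in (1_000, 10_000, 100_000):
        hours = system_mtbf_hours(node_mtbf, nodes)
        print(f"{nodes:>7} nodes -> system MTBF of about {hours:.2f} hours")
```

Under these assumptions, a hundredfold increase in node count shortens the system MTBF by the same factor, which is consistent with the trend from days to under an hour described above.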


A Container-based Lightweight Fault Tolerance Framework for High Performance Computing Workloads Related Books

A Container-based Lightweight Fault Tolerance Framework for High Performance Computing Workloads
Language: en
Pages: 130
Authors: Mohamad Othman Sindi
Categories:
Type: BOOK - Published: 2019 - Publisher:

A Proactive Fault Tolerance Framework for High Performance Computing (HPC) Systems in the Cloud
Language: en
Pages:
Authors: Ifeanyi Paulinus Egwutuoha
Categories: Cloud computing
Type: BOOK - Published: 2014 - Publisher:

Fault-Tolerance Techniques for High-Performance Computing
Language: en
Pages: 325
Authors: Thomas Herault
Categories: Computers
Type: BOOK - Published: 2015-07-01 - Publisher: Springer

This timely text presents a comprehensive overview of fault tolerance techniques for high-performance computing (HPC). The text opens with a detailed introducti
New Software-based Fault Tolerance Methods for High Performance Computing
Language: en
Pages: 0
Authors: Robert D. Hunt
Categories:
Type: BOOK - Published: 2015 - Publisher:

Performance Engineering of a Lightweight Fault Tolerance Framework
Language: en
Pages: 70
Authors: Hua Chai
Categories: Fault-tolerant computing
Type: BOOK - Published: 2009 - Publisher:

It is well-known that the Paxos algorithm can be used to build provably correct practical fault tolerant systems. In this thesis, a lightweight consensus framew