Resilience of Parallel Applications

dc.contributor.author Losada, Nuria
dc.contributor.author Martín, María J.
dc.contributor.author González, Patricia
dc.contributor.editor Carretero Pérez, Jesús
dc.contributor.editor García Blas, Javier
dc.contributor.editor Petcu, Dana
dc.date.accessioned 2016-04-29T07:51:47Z
dc.date.available 2016-04-29T07:51:47Z
dc.date.issued 2016-02
dc.identifier.bibliographicCitation Carretero Pérez, Jesús; (eds.). (2016). Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016). Timisoara, Romania. Universidad Carlos III de Madrid, ARCOS. Pp. 29-32.
dc.identifier.isbn 978-84-608-6309-0
dc.description Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016) Timisoara, Romania. February 8-11, 2016.
dc.description.abstract Future exascale systems are predicted to comprise millions of cores. This is a great opportunity for HPC applications, but it is also a hazard for the completion of their execution. Even if each computation node fails only once per century, a machine with 100,000 nodes will encounter a failure roughly every 9 hours. Thus, HPC applications need fault tolerance techniques to ensure that they finish their execution successfully. This PhD thesis focuses on fault tolerance solutions for generic parallel applications, specifically on checkpointing. We have extended CPPC, an MPI application-level portable checkpointing tool developed in our research group, to work with OpenMP applications and hybrid MPI-OpenMP applications. Currently, we are working on transparently obtaining resilient MPI applications, that is, applications that are able to recover from failures without stopping their execution.
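The failure-interval figure quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only: it assumes that node failures are independent and occur at a constant rate, so that the failure rate of the machine is the sum of the node rates, and it uses a 365-day year. The function name `system_mtbf_hours` is hypothetical and does not come from the paper.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def system_mtbf_hours(node_mtbf_years: float, num_nodes: int) -> float:
    """Estimate the mean time between failures (MTBF) of a whole machine.

    Assumption (not stated in the abstract): nodes fail independently at a
    constant rate, so the machine-level MTBF is the node MTBF divided by
    the number of nodes.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return node_mtbf_years * HOURS_PER_YEAR / num_nodes


if __name__ == "__main__":
    # One failure per node per century, 100,000 nodes.
    print(f"{system_mtbf_hours(100, 100_000):.2f} hours")  # prints "8.76 hours"
```

Under these assumptions the result is 8.76 hours, which is consistent with the roughly 9 hours stated in the abstract.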
dc.description.sponsorship European Cooperation in Science and Technology. COST
dc.description.sponsorship This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P, and the predoctoral grant of Nuria Losada, ref. BES-2014-068066) and by the EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS).
dc.format.extent 4
dc.format.mimetype application/pdf
dc.language.iso eng
dc.rights Attribution-NonCommercial-NoDerivs 3.0 Spain
dc.title Resilience of Parallel Applications
dc.type bookPart
dc.type conferenceObject
dc.subject.eciencia Computer Science
dc.rights.accessRights openAccess
dc.relation.projectID Government of Spain. TIN2013-42148-P
dc.type.version publishedVersion
dc.relation.eventdate February 8-11, 2016
dc.relation.eventnumber 1
dc.relation.eventplace Timisoara, Romania
dc.relation.eventtitle PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)
dc.relation.eventtype proceeding
dc.identifier.publicationfirstpage 29
dc.identifier.publicationlastpage 32
dc.identifier.publicationtitle Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)