Third International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2016)

Proceedings of the Third International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2016) Sofia, Bulgaria




TABLE OF CONTENTS

  • Elastic Cloud Services Compliance with Gustafson’s and Amdahl’s Laws, Sasko Ristov, Radu Prodan, Marjan Gusev, Dana Petcu and Jorge Barbosa

  • Exploring OpenMP Accelerator Model in a real-life scientific application using hybrid CPU-MIC platforms, Lukasz Szustak, Kamil Halbiniak, Roman Wyrzykowski and Alexey Lastovetsky

  • Geographical Competitiveness for Powering Datacenters with Renewable Energy, Simon Holmbacka, Enida Sheme, Sebastien Lafond and Neki Frasheri

  • Resource Management Optimization in Multi-Processor Platforms, Atanas Hristov, Iva Nikolova, Georgi Zapryanov, Dragi Kimovski and Vesna Kumbaroska

  • Analysis of fiber-reinforced concrete: micromechanics, parameter identification, fast solvers, Radim Blaheta, Ondrej Jakl, Jiri Stary, Ivan Georgiev, Svetozar Margenov and Roman Kohut

  • A Data-Aware Scheduling Strategy for DMCF workflows over Hercules, Fabrizio Marozzo, Francisco Rodrigo Duro, Javier Garcia Blas, Jesús Carretero, Domenico Talia and Paolo Trunfio

  • Efficient Energy Sources Scheduling in Green Powered Datacenters: A Cloudsim Implementation, Enida Sheme, Jean-Marc Pierson, Georges Da Costa, Patricia Stolf and Neki Frasheri

  • Heterogeneous computation of matrix products, Pedro Alonso, Ravi Reddy and Alexey Lastovetsky

  • An Array API for FDM, Eva Burrows, Helmer Andre Friis and Magne Haveraaen

  • Highly Tuned Small Matrix Multiplications Applied to Spectral Element Code Nek5000, Berk Hess, Jing Gong, Szilard Pall, Philipp Schlatter and Adam Peplinski

  • Automatic Cache Aware Roofline Model Building and Validation Using Topology Detection, Nicolas Denoyelle, Aleksandar Ilic, Brice Goglin, Leonel Sousa and Emmanuel Jeannot

  • Energy-efficient Assignment of Applications to Servers by Taking into Account the Influence of Processes on Each Other, Mateusz Jarus, Ariel Oleksiak, Wahi Narsisian and Hrachya Astsatryan

  • On Parallel Numerical Algorithms for Fractional Diffusion Problems, Raimondas Ciegis, Vadimas Starikovicius and Svetozar Margenov

    Recent Submissions

    • Publication
      On Parallel Numerical Algorithms for Fractional Diffusion Problems
      (2016-12) Ciegis, Raimondas; Starikovicius, Vadimas; Margenov, Svetozar; Carretero Pérez, Jesús; García Blas, Javier; Margenov, Svetozar; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      In this work, we consider the parallel numerical solution of problems depending on fractional powers of elliptic operators. Three different state-of-the-art approaches are used to transform the original non-local problem into well-known local PDE problems. Parallel numerical algorithms for all three approaches are developed and discussed. Results of their parallel performance tests are presented and analysed.
    • Publication
      A Data-Aware Scheduling Strategy for DMCF workflows over Hercules
      (2016-10-06) Marozzo, Fabrizio; Carretero Pérez, Jesús; Rodrigo Duro, Francisco José; García Blas, Javier; Talia, Domenico; Trunfio, Paolo; Carretero Pérez, Jesús; García Blas, Javier; Margenov, Svetozar; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      As data-intensive science becomes more prevalent, there is a need to simplify the development, deployment, and execution of complex data analysis applications. The Data Mining Cloud Framework (DMCF) is a service-oriented system that allows users to design and execute data analysis applications, defined as workflows, on cloud platforms, relying on cloud-provided storage services for I/O operations. Hercules is an in-memory I/O solution that can be deployed as an alternative to cloud storage services, providing additional performance and flexibility features. This work extends the DMCF-Hercules cooperation by applying novel data placement and task scheduling techniques to expose and exploit data locality in data-intensive workflows.
    • Publication
      Heterogeneous computation of matrix products
      (2016-12) Alonso, Pedro; Manumachu, Ravi Reddy; Lastovetsky, Alexey; Carretero Pérez, Jesús; García Blas, Javier; Margenov, Svetozar; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      The work presented here is an experimental study of the execution time and energy consumption of matrix multiplications on a heterogeneous server. The server features three different devices: a multicore CPU, an NVIDIA Tesla GPU, and an Intel Xeon Phi coprocessor. Matrix multiplication is one of the most used linear algebra kernels and, consequently, applications that make intensive use of this operation can greatly benefit from efficient implementations. This is the case for the evaluation of matrix polynomials, a core operation used to calculate many matrix functions, which involves a very large number of products of square matrices. Although there exist many proposals for efficient implementations of matrix multiplication in heterogeneous environments, it is still difficult to find packages providing a matrix multiplication routine that is as easy to use, efficient, and versatile as its homogeneous counterparts. Our approach here is based on a simple implementation using OpenMP sections. We have also devised a functional model for the execution time that has been successfully applied to the evaluation of matrix polynomials of large degree, so that the workload can be balanced and the runtime cost minimized.
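The idea of balancing a matrix product across devices can be illustrated with a much simpler model than the functional one mentioned in the abstract. The sketch below assumes each device has a constant speed (the paper's model is not reproduced here), and splits the rows of a matrix proportionally so that all devices would finish at roughly the same time. The function name `split_rows` and the device speeds are hypothetical.

```python
def split_rows(total_rows: int, speeds: dict[str, float]) -> dict[str, int]:
    """Split rows across devices in proportion to their (assumed constant) speeds."""
    if total_rows < 0 or not speeds or any(s <= 0 for s in speeds.values()):
        raise ValueError("need non-negative rows and positive device speeds")
    total_speed = sum(speeds.values())
    # Floor of each proportional share; a few rows may remain unassigned.
    shares = {dev: int(total_rows * s / total_speed) for dev, s in speeds.items()}
    leftover = total_rows - sum(shares.values())
    # Hand the remaining rows to the fastest devices, one row each.
    fastest_first = sorted(speeds, key=speeds.get, reverse=True)
    for dev in fastest_first[:leftover]:
        shares[dev] += 1
    return shares


# Illustrative relative speeds only, not measurements from the paper.
example = split_rows(1000, {"cpu": 1.0, "gpu": 3.0, "phi": 1.0})
```

With these made-up speeds the GPU receives three times as many rows as each of the other two devices. A real functional model would make the speed depend on the problem size, which this constant-speed sketch ignores.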
    • Publication
      Automatic Cache Aware Roofline Model Building and Validation Using Topology Detection
      (2016-12) Denoyelle, Nicolas; Ilic, Aleksandar; Goglin, Brice; Sousa, Leonel; Jeannot, Emmanuel; Carretero Pérez, Jesús; García Blas, Javier; Margenov, Svetozar; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      The ever-growing complexity of high performance computing systems makes it significantly challenging to fully exploit their computational and memory resources. Recently, the Cache-aware Roofline Model has gained popularity due to its simplicity when modeling multi-cores with complex memory hierarchies, characterizing application bottlenecks, and quantifying achieved or remaining improvements. In this short paper we use hardware locality topology detection to build the Cache Aware Roofline Model for modern processors in an open-source locality-aware tool. The proposed tool also includes a set of specific micro-benchmarks to assess the micro-architecture performance upper bounds. The experimental results show that, by relying on the proposed tool, it was possible to reach near-theoretical bounds of an Intel 3770K processor, thus proving the effectiveness of the modeling methodology.
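The basic roofline bound behind this model can be written in a few lines. The sketch below is not the tool described in the abstract: it only computes the standard bound min(peak, bandwidth × arithmetic intensity), with one roof per memory level as in the cache-aware variant. All function names and the peak and bandwidth figures are hypothetical placeholders, not measured values for any processor.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Roofline bound: limited either by compute peak or by memory traffic.

    intensity is the arithmetic intensity in flops per byte.
    """
    if intensity < 0:
        raise ValueError("arithmetic intensity must be non-negative")
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Intensity at which a kernel stops being memory bound."""
    return peak_gflops / bandwidth_gbs


def cache_aware_roofs(peak_gflops: float, level_bandwidths: dict[str, float],
                      intensity: float) -> dict[str, float]:
    """One roof per memory level, all evaluated at the same intensity."""
    return {level: attainable_gflops(peak_gflops, bw, intensity)
            for level, bw in level_bandwidths.items()}


# Placeholder numbers chosen for easy arithmetic, not real hardware data.
roofs = cache_aware_roofs(100.0, {"L1": 400.0, "L2": 200.0, "DRAM": 20.0}, 2.0)
```

In this made-up configuration a kernel with intensity 2 flops/byte is compute bound when its data fits in L1 or L2 but memory bound when it streams from DRAM.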
    • Publication
      Efficient Energy Sources Scheduling in Green Powered Datacenters: A Cloudsim Implementation
      (2016-12) Sheme, Enida; Stolf, Patricia; Da Costa, Georges; Pierson, Jean-Marc; Frasheri, Neki; Carretero Pérez, Jesús; García Blas, Javier; Margenov, Svetozar; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      In this paper we address the issue of managing the different energy sources that supply green powered datacenters. The sources are scheduled based on a priority scheme, aiming to maximize renewable energy utilization, minimize the energy used from the grid, and optimize battery usage. A dynamic power capping technique is used to put a threshold on the energy drawn from the grid. The algorithm is implemented and tested in the CloudSim simulator. Solar energy is the renewable source considered. A workload scheduling algorithm is already implemented for higher renewable energy utilization. The results show that the proposed scheme is efficient and is a promising direction for the optimization of datacenters using renewable energy.
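A priority scheme of this kind can be sketched for a single time step as follows. The abstract does not give the actual priority order or the capping rule, so the order used here (solar first, then battery, then grid up to a cap) and every name in the code are assumptions made for illustration; this is not the CloudSim implementation.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    solar: float
    battery: float
    grid: float
    unmet: float  # demand left uncovered once the grid cap is reached


def allocate_power(demand: float, solar_available: float,
                   battery_available: float, grid_cap: float) -> Allocation:
    """Cover demand using sources in an assumed priority order."""
    # 1. Use renewable (solar) energy first.
    from_solar = min(demand, solar_available)
    remaining = demand - from_solar
    # 2. Then discharge the battery.
    from_battery = min(remaining, battery_available)
    remaining -= from_battery
    # 3. Finally draw from the grid, but never above the power cap.
    from_grid = min(remaining, grid_cap)
    remaining -= from_grid
    return Allocation(from_solar, from_battery, from_grid, remaining)


# Illustrative values in arbitrary energy units.
step = allocate_power(demand=100.0, solar_available=60.0,
                      battery_available=30.0, grid_cap=5.0)
```

In this example the cap leaves 5 units of demand unmet, which is the situation where a workload scheduler would have to delay or shed load.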
    • Publication
      Resource Management Optimization in Multi-Processor Platforms
      (2016-12) Hristov, Atanas; Nikolova, Iva; Zapryanov, Georgi; Kimovski, Dragi; Kumbaroska, Vesna; Carretero Pérez, Jesús; García Blas, Javier; Margenov, Svetozar; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      Modern high-performance computing systems (HPCS) are composed of hundreds of thousands of computational nodes. Effective resource allocation in HPCS is the subject of many scientific investigations, and many programming models for effective resource allocation have been proposed. The main purpose of those models is to increase the parallel performance of the HPCS. This paper investigates the efficiency of a parallel algorithm for resource management optimization based on the Artificial Bee Colony (ABC) metaheuristic while solving a package of NP-complete problems on a multi-processor platform. In order to achieve minimal parallelization overhead in each cluster node, a multi-level hybrid programming model is proposed that combines coarse-grain and fine-grain parallelism. Coarse-grain parallelism is achieved through domain decomposition by message passing among computational nodes using the Message Passing Interface (MPI), and fine-grain parallelism is obtained by loop-level parallelism inside each computation node by compiler-based thread parallelization via Intel TBB. Parallel communications are profiled and parallel performance parameters are evaluated on the basis of experimental results.
    • Publication
      An Array API for FDM
      (2016-12) Burrows, Eva; Friis, Helmer André; Haveraaen, Magne; Carretero Pérez, Jesús; García Blas, Javier; Margenov, Svetozar; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      As we move towards ultrascale computing, computer architecture is bound to see dramatic changes. Multiple nodes, with or without shared memory, multicore and accelerators (GPUs, FPGAs) will be the norm. For many problems, such as finite difference numerical simulations, the array has traditionally represented a perfect match between the user-level code and the hardware architecture’s uniform memory access. Arrays, and to some extent multiarrays, are well supported by most programming languages. A standard compiler maps the array to uniform memory. Some programming models, such as partitioned global address space, allow mapping an array across distributed memory that is nevertheless uniform within each partition. For ultrascale architectures, the simple mapping between a user-level (multi)array and distributed, non-uniform memory will disappear. Here we propose an API for arrays, empowering software developers to implement their own array-memory layout. Application code written against the API will be independent of underlying architecture changes, and thus easily ported to new architectures as they evolve.
    • Publication
      Analysis of fiber-reinforced concrete: micromechanics, parameter identification, fast solvers
      (2016-12) Blaheta, Radim; Georgiev, Ivan; Georgiev, Krassimir; Jakl, Ondrej; Kohut, Roman; Margenov, Svetozar; Starý, Jiri; Carretero Pérez, Jesús; García Blas, Javier; Margenov, Svetozar; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      Ultrascale computing is required for many important applications in chemistry, computational fluid dynamics, etc.; see the overview in the paper Applications for Ultrascale Computing by M. Mihajlovic et al., published in the International Journal Supercomputing Frontiers and Innovations, Vol 2 (2015). In this abstract we briefly describe an application that involves many aspects described in the above paper: the multiscale material design problem. The problem of interest is the analysis of fiber-reinforced concrete, and we focus on modelling stiffness through numerical homogenization and computing local material properties by inverse analysis. Both problems require the repeated solution of large-scale finite element problems with up to 200 million degrees of freedom, and therefore the importance of HPC and ultrascale computing is evident.
    • Publication
      Geographical Competitiveness for Powering Datacenters with Renewable Energy
      (2016-12) Holmbacka, Simon; Sheme, Enida; Lafond, Sébastien; Frasheri, Neki; Carretero Pérez, Jesús; García Blas, Javier; Margenov, Svetozar; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      In this paper we analyze the feasibility of using renewable energy for powering a data center located on the 60th parallel north. We analyze the workload energy consumption and the cost-energy trade-off related to available wind and solar energy sources. A wind and solar power model is built based on real weather data for three different geographical locations, and the available monthly and annual renewable energy is analyzed for different scenarios and compared with the energy consumption of a simulated data center. We show the impact different data center sizes have on the coverage percentage of renewables, and we discuss the competitiveness of constructing datacenters in different geographical locations based on the results.
    • Publication
      Exploring OpenMP Accelerator Model in a real-life scientific application using hybrid CPU-MIC platforms
      (2016-12) Halbiniak, Kamil; Szustak, Lukasz; Lastovetsky, Alexey; Wyrzykowski, Roman; Carretero Pérez, Jesús; García Blas, Javier; Margenov, Svetozar; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      The main goal of this paper is to assess the suitability of the OpenMP Accelerator Model (OMPAM) for porting a real-life scientific application to heterogeneous platforms containing a single Intel Xeon Phi coprocessor. This OpenMP extension has been supported since version 4.0 of the standard and offers a unified directive-based programming model dedicated to massively parallel accelerators. In our study, we focus on applying the OMPAM extension together with OpenMP tasks to a parallel application which implements a numerical model of alloy solidification. To map the application efficiently onto target hybrid platforms using such constructs as omp target, omp target data and omp target update, we propose a decomposition of the main tasks belonging to the computational core of the studied application. As a consequence, the coprocessor is used to execute the major parallel workloads, while the CPUs are responsible for executing the part of the application that does not require massively parallel resources. Effective overlapping of computations with data transfers is another goal achieved in this way. The proposed approach allows us to execute the whole application 3.5 times faster than the original parallel version running on two CPUs.
    • Publication
      Elastic Cloud Services Compliance with Gustafson’s and Amdahl’s Laws
      (2016-12) Ristov, Sasko; Prodan, Radu; Gusev, Marjan; Petcu, Dana; Barbosa, Jorge; Carretero Pérez, Jesús; García Blas, Javier; Margenov, Svetozar; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      The speedup that can be achieved with parallel and distributed architectures is limited by at least two laws: Amdahl’s and Gustafson’s. The former limits the speedup to a constant value when a fixed-size problem is executed on a multiprocessor, while the latter limits the speedup to its linear value for fixed-time problems, which means that it is limited by the number of processors used. However, a superlinear speedup (greater than the number of processors used) can be achieved due to insufficient memory, while parallel and, especially, distributed systems can even slow down the execution compared to the sequential one because of communication overhead. Since cloud performance is uncertain and can be influenced by available memory and networks, in this paper we investigate whether it follows the same speedup pattern as other traditional distributed systems. The focus is to determine how elastic cloud services behave in differently scaled environments. We define several scaled systems and model the corresponding performance indicators. The analysis shows that both laws limit the speedup for a specific range of the input parameters and type of scaling. Moreover, the speedup in cloud systems follows Gustafson’s extreme cases, i.e. the insufficient-memory and communication-bound domains.
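For reference, the two laws named in this abstract have simple closed forms in their textbook statements. The sketch below evaluates them; it does not model the cloud-specific scaled systems studied in the paper, and the function and parameter names are chosen here for illustration.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Fixed-size problem: S(p) = 1 / ((1 - f) + f / p)."""
    _check(parallel_fraction, processors)
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def gustafson_speedup(parallel_fraction: float, processors: int) -> float:
    """Fixed-time (scaled) problem: S(p) = (1 - f) + f * p."""
    _check(parallel_fraction, processors)
    return (1.0 - parallel_fraction) + parallel_fraction * processors


def _check(parallel_fraction: float, processors: int) -> None:
    # f is the parallelizable share of the work, p the processor count.
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")


# With f = 0.9, Amdahl's speedup can never exceed 1 / (1 - f) = 10,
# whereas Gustafson's scaled speedup keeps growing linearly with p.
amdahl_10 = amdahl_speedup(0.9, 10)
gustafson_10 = gustafson_speedup(0.9, 10)
```

The contrast between the constant bound for fixed-size problems and the linear bound for fixed-time problems is the pattern the abstract describes.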
    • Publication
      Highly Tuned Small Matrix Multiplications Applied to Spectral Element Code Nek5000
      (2016-12) Hess, Berk; Gong, Jing; Páll, Szilárd; Schlatter, Philipp; Peplinski, Adam; Carretero Pérez, Jesús; García Blas, Javier; Margenov, Svetozar; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      Nek5000 is an open-source code for simulating incompressible flows using MPI for parallel communication. In the Nek5000 code, the tensor-product-based operator evaluation can be implemented as small dense matrix-matrix multiplications. The routines for calculating the matrix-matrix product dominate the execution time of Nek5000. In this paper, we optimize matrix-matrix multiplication using SIMD intrinsics and the LIBXSMM package. The evaluation of the computational cost and the optimization of these subroutines are applied not only to the CFD code Nek5000, but also to the NekCEM and NekLEM software, which share the same data structures with Nek5000.
    • Publication
      Energy-efficient Assignment of Applications to Servers by Taking into Account the Influence of Processes on Each Other
      (2016-12) Jarus, Mateusz; Oleksiak, Ariel; Narsisian, Wahi; Astsatryan, Hrachya; Carretero Pérez, Jesús; García Blas, Javier; Margenov, Svetozar; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      The power consumption of data centers is becoming a crucial challenge in the context of the steadily increasing demand for computation. In this regard, improving the energy efficiency of applications running in data centers is becoming increasingly important. One method to improve processor utilization is the consolidation of applications on physical servers. It is possible to run multiple jobs in parallel on the same machine, especially when their computational requirements are smaller than the maximum processor performance. This reduces the number of servers required in the data center to handle multiple requests and therefore leads to reductions in energy usage. In this paper, we introduce a realistic model of applications with deadlines executed in parallel on a server and competing for shared resources, and present an energy-aware algorithm which may be used to minimize the overall energy consumption of the servers.