First International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2014)

  • Proceedings of the First International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2014)


  • INDEX



  • Preface [Proceedings of the First International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2014)]

  • Improving the Performance of the MPI_Allreduce Collective Operation through Rank Renaming, Juan-Antonio Rico-Gallego and Juan-Carlos Díaz-Martín

  • A Workflow-oriented Language for Scalable Data Analytics, Fabrizio Marozzo, Domenico Talia and Paolo Trunfio

  • Scheduling Real-Time Jobs in Distributed Systems - Simulation and Performance Analysis, Georgios L. Stavrinides and Helen D. Karatza

  • Bi-objective Workflow Scheduling in Production Clouds: Early Simulation Results and Outlook, Juan J. Durillo, Radu Prodan, Sorina Camarasu-Pop, Tristan Glatard and Frédéric Suter

  • An approach towards high productivity computing, Zorislav Shoyat

  • Efficient Parallel Video Encoding on Heterogeneous Systems, Svetislav Momcilovic, Aleksandar Ilic, Nuno Roma and Leonel Sousa

  • On Efficiency of the OpenFOAM-based Parallel Solver for the Heat Transfer in Electrical Power Cables, Raimondas Ciegis, Vadimas Starikovicius and Andrej Bugajev

  • On the Performance of the Thread-Multiple Support Level in Thread-Based MPI, Juan-Carlos Díaz-Martín and Juan-Antonio Rico-Gallego

  • Content Delivery and Sharing in Federated Cloud Storage, José Luis González, Victor J. Sosa-Sosa, Jesús Carretero Pérez and Luis Miguel Sánchez García

  • Improvement of Heterogeneous Systems Efficiency Using Self-Configurable FPGA-based Computing, Anatoliy Melnyk and Viktor Melnyk

  • Data parallel algorithm in finding 2-D site percolation backbones, Biljana Stamatovic and Roman Trobec

  • Exploiting data locality in Swift/T workflows using Hercules, Francisco José Rodrigo Duro, Javier García Blas, Florín Daniel Isaila, Jesús Carretero Pérez, Justin M. Wozniak and Rob Ross

    Recent Submissions
    • Publication
      A Workflow-oriented Language for Scalable Data Analytics
      (2014-11) Marozzo, Fabrizio; Talia, Domenico; Trunfio, Paolo; Carretero Pérez, Jesús; García Blas, Javier; Barbosa, Jorge; Morla, Ricardo; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      Data in digital repositories are becoming ever more massive and distributed. Analyzing them therefore requires efficient data analysis techniques and scalable storage and computing platforms. Cloud computing infrastructures offer effective support for addressing both the computational and data storage needs of big data mining and parallel knowledge discovery applications. In fact, complex data mining tasks involve data- and compute-intensive algorithms that require large and efficient storage facilities together with high-performance processors to obtain results in acceptable times. In this paper we describe the Data Mining Cloud Framework (DMCF), designed for developing and executing distributed data analytics applications as workflows of services. We also describe a workflow-oriented language, called JS4Cloud, that supports the design and execution of script-based data analysis workflows on DMCF. Finally, we present a data analysis application developed with JS4Cloud, and the scalability achieved by executing it on DMCF.
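      Below is a minimal sketch of the pattern this abstract describes: a data analysis workflow declared as a script whose tasks run in parallel as soon as their input datasets are ready. It is written in Python purely for illustration; JS4Cloud itself is JavaScript-based, and the Workflow class and task names here are hypothetical, not DMCF's API.

      from concurrent.futures import ThreadPoolExecutor

      class Workflow:
          def __init__(self):
              self.tasks = []            # (function, input names, output name)

          def task(self, inputs, output):
              def register(fn):
                  self.tasks.append((fn, inputs, output))
                  return fn
              return register

          def run(self, data):
              pending = list(self.tasks)
              with ThreadPoolExecutor() as pool:
                  while pending:
                      # a task is ready when all of its input datasets exist
                      ready = [t for t in pending
                               if all(i in data for i in t[1])]
                      if not ready:
                          raise RuntimeError("workflow is stuck")
                      futures = {pool.submit(fn, *[data[i] for i in ins]): out
                                 for fn, ins, out in ready}
                      for fut, out in futures.items():
                          data[out] = fut.result()   # store the produced dataset
                      pending = [t for t in pending if t not in ready]
              return data

      wf = Workflow()

      @wf.task(inputs=["raw"], output="clean")
      def filter_rows(rows):
          return [r for r in rows if r is not None]

      @wf.task(inputs=["clean"], output="model")
      def train(rows):
          return {"n": len(rows)}        # stand-in for a mining algorithm

      print(wf.run({"raw": [1, None, 2]}))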
    • Publication
      Scheduling Real-Time Jobs in Distributed Systems - Simulation and Performance Analysis
      (2014-11) Stavrinides, Georgios L.; Karatza, Helen D.; Carretero Pérez, Jesús; García Blas, Javier; Barbosa, Jorge; Morla, Ricardo; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      One of the major challenges in ultrascale systems is the effective scheduling of complex jobs within strict timing constraints. The distributed and heterogeneous system resources constitute another critical issue that must be addressed by the employed scheduling strategy. In this paper, we investigate by simulation the performance of various policies for the scheduling of real-time directed acyclic graphs in a heterogeneous distributed environment. We apply bin packing techniques during the processor selection phase of the scheduling process, in order to utilize schedule gaps and thus enhance existing list scheduling methods. The simulation results show that the proposed policies outperform all of the other examined algorithms.
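      The following is a minimal, hedged sketch of the gap-utilization idea mentioned above: during processor selection, each processor's schedule is scanned for an idle gap large enough to hold the ready task, instead of always appending at the end. Function and variable names are illustrative, not the paper's.

      def earliest_start(schedule, ready_time, duration):
          """schedule: sorted list of (start, end) busy intervals."""
          t = ready_time
          for start, end in schedule:
              if t + duration <= start:   # the task fits in the gap before this interval
                  return t
              t = max(t, end)             # otherwise skip past the busy interval
          return t                        # or append after the last interval

      def pick_processor(schedules, ready_time, duration):
          # choose the processor giving the earliest finish time
          starts = [earliest_start(s, ready_time, duration) for s in schedules]
          return min(range(len(schedules)), key=lambda p: starts[p] + duration)

      # Processor 0 is busy on [0,4) and [9,12); processor 1 is busy on [0,6).
      schedules = [[(0, 4), (9, 12)], [(0, 6)]]
      print(pick_processor(schedules, ready_time=2, duration=3))  # -> 0, using the gap [4, 9)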
    • Publication
      Preface [Proceedings of the First International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2014)]
      (2014-11) Carretero Pérez, Jesús; Carretero Pérez, Jesús; García Blas, Javier; Barbosa, Jorge; Morla, Ricardo; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
    • Publication
      On the Performance of the Thread-Multiple Support Level in Thread-Based MPI
      (2014-11) Díaz-Martín, Juan-Carlos; Rico-Gallego, Juan-Antonio; Carretero Pérez, Jesús; García Blas, Javier; Barbosa, Jorge; Morla, Ricardo; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      Exascale systems are likely to have orders of magnitude less memory per core than current systems (though still large amounts of memory in total). As the amount of memory per core drops, moving to thread-based models may be an unavoidable step towards the exascale milestone. AzequiaMPI is a thread-based, open-source, fully conformant implementation of MPI-1.3 for shared memory. We expose the techniques introduced in AzequiaMPI that, first, simplify the implementation and, second, allow the thread-based model to significantly improve on the bandwidth of process-based implementations. The current version is also compliant with the MPI_THREAD_MULTIPLE thread-safety level, a feature of the MPI-2.0 standard. The well-known Thakur and Gropp MPI_THREAD_MULTIPLE tests show that both the latency and bandwidth figures of AzequiaMPI significantly improve on those of MPC-MPI, MPICH and Open MPI on an eight-core Intel Xeon E5620 Nehalem machine.
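      For readers unfamiliar with the MPI_THREAD_MULTIPLE level discussed here, the sketch below shows how an application requests it and then issues MPI calls from several threads concurrently. It uses mpi4py against any conformant MPI library (AzequiaMPI itself is a C implementation, so this is an illustration of the feature, not of AzequiaMPI's code).

      import mpi4py
      mpi4py.rc.threads = True
      mpi4py.rc.thread_level = "multiple"   # ask for MPI_THREAD_MULTIPLE
      from mpi4py import MPI
      import threading

      provided = MPI.Query_thread()
      assert provided == MPI.THREAD_MULTIPLE, "library granted a lower level"

      comm = MPI.COMM_WORLD

      def worker(tag):
          # With THREAD_MULTIPLE, several threads may call MPI concurrently.
          comm.sendrecv(comm.Get_rank(), dest=comm.Get_rank(), sendtag=tag,
                        source=comm.Get_rank(), recvtag=tag)

      threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
      for t in threads: t.start()
      for t in threads: t.join()
      print(f"rank {comm.Get_rank()}: 4 threads issued MPI calls concurrently")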
    • Publication
      Improving the Performance of the MPI_Allreduce Collective Operation through Rank Renaming
      (2014-11) Rico-Gallego, Juan-Antonio; Díaz-Martín, Juan-Carlos; Carretero Pérez, Jesús; García Blas, Javier; Barbosa, Jorge; Morla, Ricardo; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      Collective operations, a key issue for the global efficiency of HPC applications, are optimized in current MPI libraries by choosing at runtime among a set of algorithms, based on platform-dependent parameters established beforehand, such as the message size or the number of processes. However, with progressively more cores per node, the cost of a collective algorithm must be mainly imputed to the process-to-processor mapping, because of its decisive influence on the network traffic. The hierarchical design of collective algorithms pursues minimizing the data movement through the slowest communication channels of the multi-core cluster. Nevertheless, the hierarchical implementation of some collectives becomes inefficient, and even impracticable, due to the definition of the operation itself. This paper proposes a new approach that departs from a frequently found regular mapping, either sequential or round-robin. While keeping the mapping, the rank assignment of the processes is temporarily changed prior to the execution of the collective algorithm. The new assignment makes the communication pattern adapt to the hierarchy of communication channels. We explore this technique for the Ring algorithm when used in the well-known MPI_Allreduce collective, and discuss the performance results obtained. Extensions to other algorithms and collective operations are proposed.
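      A minimal sketch of the rank-renaming idea, using mpi4py: before running the collective, build a communicator whose ranks are reordered so that co-located processes get consecutive ranks, making a ring pattern traverse slow inter-node links less often. The renumbering function below assumes a round-robin process-to-node mapping and is illustrative, not the paper's algorithm.

      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank, size = comm.Get_rank(), comm.Get_size()

      # Assume a round-robin placement over `nodes` nodes; renumber so that
      # processes on the same node receive consecutive new ranks.
      nodes = 2
      node = rank % nodes                     # node this rank was placed on
      slot = rank // nodes                    # position within that node
      new_rank = node * ((size + nodes - 1) // nodes) + slot

      renamed = comm.Split(color=0, key=new_rank)   # same group, new ordering
      total = renamed.allreduce(rank, op=MPI.SUM)   # ring is now mostly intra-node
      renamed.Free()
      print(f"old rank {rank} -> new rank {new_rank}, sum = {total}")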
    • Publication
      Improvement of Heterogeneous Systems Efficiency Using Self-Configurable FPGA-based Computing
      (2014-11) Melnyk, Anatoliy; Melnyk, Viktor; Carretero Pérez, Jesús; García Blas, Javier; Barbosa, Jorge; Morla, Ricardo; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      Computer systems performance is being improved today using two major approaches: increasing the computing power of general-purpose computers (creation of multicore processors, multiprocessor computer systems, supercomputers), and adapting the computer hardware to the executed algorithm (or class of algorithms). The latter approach often involves the application of ASIC-based and FPGA-based hardware accelerators, also called reconfigurable accelerators, and is characterized by a better performance / power consumption ratio and lower cost compared to general-purpose computers of equivalent performance. However, such systems have typical problems. ASIC-based accelerators: 1) are effective for certain classes of algorithms only, and 2) require the adaptation of algorithms and software for effective application. FPGA-based accelerators and reconfigurable computer systems (that use FPGAs as a processing unit): 1) require, when the software is written, a special program to balance the computing tasks between the general-purpose computer and the FPGAs; 2) require designing application-specific processor soft-cores; and 3) are effective only for those classes of problems for which application-specific processor soft-cores were previously developed. In this paper, we consider an emerging type of high-performance computer system, called self-configurable FPGA-based computer systems, which is free of the above challenges. We have analyzed the background of self-configurable computer system creation, presented the current results of our research, and introduced some ongoing work. Self-configurable computer systems are being developed within the project entitled "Improvement of heterogeneous systems efficiency using self-configurable FPGA-based computing", which is part of the NESUS Action.
    • Publication
      Efficient Parallel Video Encoding on Heterogeneous Systems
      (2014-11) Momcilovic, Svetislav; Ilic, Aleksandar; Roma, Nuno; Sousa, Leonel; Carretero Pérez, Jesús; García Blas, Javier; Barbosa, Jorge; Morla, Ricardo; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      In this study we propose an efficient method for collaborative H.264/AVC inter-loop encoding on heterogeneous CPU+GPU systems. This method relies on a specifically developed, extensive library of highly optimized parallel algorithms covering both CPU and GPU architectures and all inter-loop modules. In order to minimize the overall encoding time, the method integrates adaptive load balancing for the most computationally intensive inter-prediction modules, based on dynamically built functional performance models of the heterogeneous devices and inter-loop modules. The proposed method also introduces efficient communication-aware techniques that maximize data reuse and decrease the overhead of expensive data transfers in collaborative video encoding. The experimental results show that the proposed method is able to achieve real-time video encoding for very demanding video coding parameters, i.e., full HD video format, a 64×64-pixel search area and exhaustive motion estimation.
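      A minimal sketch of load balancing via functional performance models, in the spirit described above: each device's measured throughput for a module is recorded, and the next frame's work (say, macroblock rows) is split proportionally to those speeds. All names and numbers are illustrative, not the paper's.

      def split_work(total_rows, throughput):
          """throughput: rows/sec per device, measured on previous frames."""
          speed_sum = sum(throughput.values())
          shares = {d: int(total_rows * s / speed_sum)
                    for d, s in throughput.items()}
          # hand any rounding remainder to the fastest device
          fastest = max(throughput, key=throughput.get)
          shares[fastest] += total_rows - sum(shares.values())
          return shares

      perf_model = {"cpu": 120.0, "gpu": 480.0}     # measured rows/sec
      print(split_work(68, perf_model))             # {'cpu': 13, 'gpu': 55}

      # After each frame, the model would be refreshed from observed times,
      # e.g. perf_model[dev] = rows_done / elapsed, so the split adapts.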
    • Publication
      On Efficiency of the OpenFOAM-based Parallel Solver for the Heat Transfer in Electrical Power Cables
      (2014-11) Ciegis, Raimondas; Starikovicius, Vadimas; Bugajev, Andrej; Carretero Pérez, Jesús; García Blas, Javier; Barbosa, Jorge; Morla, Ricardo; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      In this work, we study the efficiency of an OpenFOAM-based parallel solver for heat conduction in electrical power cables. A 2D benchmark problem with three cables is used for our numerical tests. We study and compare the efficiency of the conjugate gradient solver with a diagonal incomplete Cholesky (DIC) preconditioner and the generalized geometric-algebraic multigrid solver (GAMG), both available in OpenFOAM. The convergence and parallel scalability of the solvers are presented and analyzed. Parallel numerical tests are performed on a cluster of multicore computers.
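      The kind of solver comparison described here can be mimicked on a toy problem. The sketch below contrasts plain conjugate gradient with a preconditioned run using SciPy, with a simple Jacobi (diagonal) preconditioner standing in for OpenFOAM's DIC and a 1-D heat-conduction matrix standing in for the cable benchmark; it illustrates the methodology only, not the paper's setup.

      import numpy as np
      from scipy.sparse import diags
      from scipy.sparse.linalg import cg, LinearOperator

      n = 1000
      # 1-D diffusion stencil plus a strongly varying reaction/coefficient term
      A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr") \
          + diags(np.linspace(1.0, 1000.0, n))
      b = np.ones(n)

      def iterations(**kwargs):
          count = [0]
          def cb(xk): count[0] += 1        # called once per CG iteration
          x, info = cg(A, b, callback=cb, maxiter=5000, **kwargs)
          assert info == 0                 # 0 means converged
          return count[0]

      dinv = 1.0 / A.diagonal()
      M = LinearOperator((n, n), matvec=lambda v: dinv * v)   # Jacobi

      print("plain CG iterations:         ", iterations())
      print("preconditioned CG iterations:", iterations(M=M))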
    • Publication
      Data parallel algorithm in finding 2-D site percolation backbones
      (2014-11) Stamatovic, Biljana; Trobec, Roman; Carretero Pérez, Jesús; García Blas, Javier; Barbosa, Jorge; Morla, Ricardo; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      A data parallel solution approach formulated with cellular automata is proposed, with the potential to become part of future sustainable computers. It offers extreme parallelism based on data-flow principles. If a problem can be formulated with a local and iterative methodology, so that the result of each data cell always depends on neighbouring data items only, cellular automata can be an efficient solution framework. We have demonstrated experimentally, on a graph-theoretical problem, that the performance of the proposed methodology has the potential to be two orders of magnitude faster than known solutions.
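      A minimal sketch of the local, iterative cellular-automaton style of computation described above: each occupied site repeatedly takes the minimum label among itself and its occupied 4-neighbours until a fixed point, which identifies the connected clusters of a 2-D site-percolation grid. (Backbone extraction requires further passes of the same flavour; this only shows the pattern, and is not the paper's algorithm.)

      import numpy as np

      rng = np.random.default_rng(0)
      occ = rng.random((64, 64)) < 0.6               # site percolation, p = 0.6

      INF = np.iinfo(np.int64).max
      lab = np.where(occ, np.arange(occ.size).reshape(occ.shape), INF)

      def step(lab):
          # min over the 4-neighbourhood, with INF padding at the borders
          p = np.pad(lab, 1, constant_values=INF)
          m = np.minimum.reduce([p[:-2, 1:-1], p[2:, 1:-1],
                                 p[1:-1, :-2], p[1:-1, 2:]])
          new = np.minimum(lab, m)
          return np.where(occ, new, INF)             # empty sites stay inert

      while True:
          new = step(lab)
          if np.array_equal(new, lab):               # fixed point reached
              break
          lab = new

      print("clusters:", len(np.unique(lab[occ])))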
    • Publication
      Bi-objective Workflow Scheduling in Production Clouds: Early Simulation Results and Outlook
      (2014-11) Durillo, Juan J.; Prodan, Radu; Camarasu-Pop, Sorina; Glatard, Tristan; Suter, Frédéric; Carretero Pérez, Jesús; García Blas, Javier; Barbosa, Jorge; Morla, Ricardo; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      We present MOHEFT, a multi-objective list scheduling heuristic that provides the user with a set of Pareto-optimal tradeoff solutions, from which the one that best suits the user requirements can be manually selected. We demonstrate the potential of our method for multi-objective workflow scheduling on the commercial Amazon EC2 Cloud by comparing the quality of the MOHEFT tradeoff solutions with a state-of-the-art multi-objective approach called SPEA2* for three types of synthetic workflows with different parallelism and load balancing characteristics. We conclude with an outlook into future research towards closing the gap between scientific simulation and real-world experimentation.
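      A minimal sketch of the Pareto-tradeoff idea mentioned above: from a set of candidate schedules evaluated on two objectives (makespan, monetary cost), keep only the non-dominated ones and let the user pick. The candidate values are made up for illustration; this is not MOHEFT itself.

      def pareto_front(solutions):
          """solutions: list of (makespan, cost); both objectives are minimized."""
          front = []
          for s in solutions:
              dominated = any(o[0] <= s[0] and o[1] <= s[1] and o != s
                              for o in solutions)
              if not dominated:
                  front.append(s)
          return sorted(front)

      candidates = [(120, 9.5), (100, 14.0), (150, 6.0), (130, 9.0), (100, 15.0)]
      print(pareto_front(candidates))
      # -> [(100, 14.0), (120, 9.5), (130, 9.0), (150, 6.0)]; (100, 15.0) is dominated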
    • Publication
      An approach towards high productivity computing
      (2014-11) Shoyat, Zorislav; Carretero Pérez, Jesús; García Blas, Javier; Barbosa, Jorge; Morla, Ricardo; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      The notion of what exactly we mean by productivity largely depends on the active paradigms of a particular field and, on a global level, on the presently prevailing social, cultural, scientific and spiritual paradigms and environment. It follows that, in the long term, any specific definition of productivity will have to change. Unfortunately, due to historical processes, present-day human-computer communication is at an extremely low level of language complexity. Consequently, our present-day productivity in using computers, from the idea to the implementation, is very low. This is primarily due to the circulus vitiosus of interdependency between (hardware) computer architectures and popular computer programming languages, both based on the designs of the first Electronic Brains of the mid-last century. The natural, human Language is the prime Human tool for building a common model of the Universe, a huge fractal dynamic system, i.e. machine, whose sub-machines are smaller fractal machines consisting of a series that goes through dialects and sociolects down to idiolects. On the other hand, regarding strictly formal, non-adaptable "programming" languages, we see that almost all our computer linguistic efforts are oriented towards fixed expressions that are simple enough to be easily and efficiently translated onto the scalar, serial, presently prevailing computer architecture(s). Therefore a new, fresh approach is proposed, based on the idea that the lowest possible level of a computer system shall understand a natural-like communication language, one that is contextful and deals with Information, not with Data lacking Meta-Data. By significantly levelling up human-computer interaction towards the ideals of a semi-natural language, completely new approaches to High Productivity Computing can be thought out, at both the hardware and software levels; the NESUS WG1 Focus Group on High Productivity Computing has been established to define, historically, futuristically and realistically, and, based on that, to develop through partner collaboration projects such a (possible) High Productivity System based on specific hardware and software.
    • Publication
      Proceedings of the First International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2014): Porto, Portugal
      (2014-11) Carretero Pérez, Jesús; García Blas, Francisco Javier; Barbosa, Jorge; Morla, Ricardo; Carretero Pérez, Jesús; García Blas, Javier; Barbosa, Jorge; Morla, Ricardo; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
    • Publication
      Content Delivery and Sharing in Federated Cloud Storage
      (2014-11) González, José Luis; Sosa-Sosa, Victor J.; Carretero Pérez, Jesús; Sánchez García, Luis Miguel; Carretero Pérez, Jesús; García Blas, Javier; Barbosa, Jorge; Morla, Ricardo; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      Cloud-based storage is becoming a cost-effective solution for agencies, hospitals, government institutions and scientific centers to deliver and share contents to/with a set of end-users. However, reliability, privacy and lack of control are the main problems that arise when contracting content delivery services with a single cloud storage provider. This paper presents the implementation of a storage system for content delivery and sharing in federated cloud storage networks. This system virtualizes the storage resources of a set of organizations as a single federated system, which is in charge of the content storage. The architecture includes a metadata management layer to keep the content delivery control in-house, and a storage synchronization worker/monitor to track the state of storage resources in the federation and to place contents near the end-users. It also includes a redundancy layer based on a multi-threaded engine that enables the system to withstand failures in the federated network. We developed a prototype based on this scheme as a proof of concept. The experimental evaluation shows the benefits of building content delivery systems in federated cloud environments, in terms of performance, reliability and profitability of the storage space.
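      A minimal sketch of the redundancy idea described above: a multi-threaded engine writes each content item to several providers of the federation, so a read succeeds as long as at least one replica survives. In-memory dicts stand in for the real storage back ends; the class and placement policy are hypothetical, not the paper's system.

      from concurrent.futures import ThreadPoolExecutor

      class FederatedStore:
          def __init__(self, providers, replicas=2):
              self.providers = providers        # name -> dict-like back end
              self.replicas = replicas

          def put(self, key, blob):
              # pick `replicas` providers (here simply the first ones; a real
              # system would use location and free-space information)
              targets = list(self.providers)[: self.replicas]
              with ThreadPoolExecutor() as pool:
                  list(pool.map(lambda n: self.providers[n].__setitem__(key, blob),
                                targets))

          def get(self, key):
              for backend in self.providers.values():
                  if key in backend:            # skip failed or missing copies
                      return backend[key]
              raise KeyError(key)

      stores = {"site-a": {}, "site-b": {}, "site-c": {}}
      fed = FederatedStore(stores, replicas=2)
      fed.put("report.pdf", b"...")
      del stores["site-a"]["report.pdf"]        # simulate a provider failure
      print(fed.get("report.pdf"))              # served from the surviving replica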
    • Publication
      Exploiting data locality in Swift/T workflows using Hercules
      (2014-11) Rodrigo Duro, Francisco José; García Blas, Javier; Isaila, Florín Daniel; Carretero Pérez, Jesús; Wozniak, Justin M.; Ross, Rob; Carretero Pérez, Jesús; García Blas, Javier; Barbosa, Jorge; Morla, Ricardo; Universidad Carlos III de Madrid. Computer Architecture, Communications and Systems Group (ARCOS)
      The ever-increasing power of supercomputer systems is both driving and enabling the emergence of new problem-solving methods that require the efficient execution of many concurrent and interacting tasks. Swift/T, as a description language and runtime, offers the dynamic creation and execution of workflows, varying in granularity, on high-component-count platforms. Swift/T takes advantage of the Asynchronous Dynamic Load Balancing (ADLB) library to dynamically distribute the tasks among the nodes. These tasks may share data using a parallel file system, an approach that can degrade performance as a result of interference with other applications and poor exploitation of data locality. The objective of this work is to expose and exploit data locality in Swift/T through Hercules, a distributed in-memory store based on Memcached, and to explore tradeoffs between data locality and load balance in distributed workflow executions. In this paper we present our approach to enabling locality-based optimizations in Swift/T by guiding ADLB to schedule computation jobs on the nodes containing the required data. We also analyze the interaction between locality and load balance: our initial measurements, based on various raw file access patterns, show promising results, and we outline future work building on them.
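      A minimal sketch of the locality/load-balance tradeoff described above: a dispatcher prefers the node already holding a task's input (as an in-memory store such as Hercules would expose), unless that node is far more loaded than the least-loaded node. The threshold and all names are illustrative assumptions, not Swift/T's or ADLB's actual API.

      def choose_node(task_input, location_of, load, slack=2):
          home = location_of.get(task_input)      # node holding the input data
          least = min(load, key=load.get)         # pure load-balancing choice
          if home is not None and load[home] <= load[least] + slack:
              return home                         # exploit data locality
          return least                            # give up locality for balance

      location_of = {"block-17": "n2"}
      load = {"n1": 1, "n2": 2, "n3": 5}
      print(choose_node("block-17", location_of, load))   # -> n2 (local, lightly loaded)

      load["n2"] = 9
      print(choose_node("block-17", location_of, load))   # -> n1 (locality too costly)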