A data commons brings together (or co-locates) data with cloud computing infrastructure and commonly used software services, tools and applications for managing, analyzing and sharing data to create an interoperable resource for a research community. We introduce an architectural design principle for data commons called the narrow middle architecture that is broadly based upon the end-to-end argument in systems design. We also discuss important core services for data commons and the role of standards.
We present SmartShards, a new sharding algorithm for improving Byzantine tolerance and churn resistance in blockchains. Our algorithm places each peer in multiple shards to create an overlap, which simplifies cross-shard communication and shard membership management. We describe SmartShards, prove it correct, and evaluate its performance. We also propose several SmartShards extensions: defense against a slowly adaptive adversary, combining transactions into blocks, and fortification against the join/leave attack.
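The overlap idea can be illustrated with a toy membership assignment (a minimal sketch, not the paper's actual protocol): each peer joins two consecutive shards on a ring, so any two adjacent shards share members who can relay cross-shard messages directly.

```python
# Illustrative sketch only (not SmartShards itself): each peer joins two
# consecutive shards, so adjacent shards always overlap.

def assign_shards(num_peers, num_shards):
    """Map each peer to two consecutive shards, creating pairwise overlap."""
    membership = {s: set() for s in range(num_shards)}
    for peer in range(num_peers):
        s = peer % num_shards
        membership[s].add(peer)
        membership[(s + 1) % num_shards].add(peer)  # second, overlapping shard
    return membership

def relay_peers(membership, shard_a, shard_b):
    """Peers in both shards can forward cross-shard transactions directly."""
    return membership[shard_a] & membership[shard_b]

shards = assign_shards(num_peers=12, num_shards=4)
print(relay_peers(shards, 0, 1))  # non-empty: the overlap enables direct relay
```

Because every peer belongs to two shards, shard membership changes on churn touch only the two affected shards rather than a separate global directory.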
This paper describes PAT (Parallel Aggregated Trees), a new algorithm that can be used to implement all-gather and reduce-scatter operations. The algorithm works on any number of ranks, requires a logarithmic number of network transfers for small operations, minimizes long-distance communication, and needs only a logarithmic amount of internal buffer space, independent of the total operation size. It is aimed at improving the performance of the NCCL library in cases where the ring algorithm is inefficient, since the ring's linear latency performs poorly for small sizes and/or at scale.
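To see why a logarithmic transfer count matters at scale, consider the classic recursive-doubling all-gather (shown here as a simulation; this is a generic textbook scheme, not the PAT algorithm): n ranks exchange data over log2(n) rounds, versus n-1 steps for a ring.

```python
# Generic recursive-doubling all-gather simulation (not PAT), illustrating
# that log2(n) communication rounds suffice for n ranks.

def allgather_recursive_doubling(rank_data):
    """Simulate all-gather over n = 2^k ranks; each rank starts with one item."""
    n = len(rank_data)
    assert n & (n - 1) == 0, "sketch assumes a power-of-two rank count"
    # buffers[r] holds the data currently known to rank r
    buffers = [{r: rank_data[r]} for r in range(n)]
    rounds = 0
    distance = 1
    while distance < n:
        new = [dict(b) for b in buffers]
        for r in range(n):
            partner = r ^ distance           # exchange with the rank at this distance
            new[r].update(buffers[partner])  # receive partner's accumulated data
        buffers = new
        distance <<= 1
        rounds += 1
    return buffers, rounds

buffers, rounds = allgather_recursive_doubling(["a", "b", "c", "d"])
print(rounds)                       # 2 rounds for 4 ranks: logarithmic, not linear
print(sorted(buffers[0].values()))  # ['a', 'b', 'c', 'd']
```

A ring all-gather over the same 4 ranks would need 3 steps, and the gap widens linearly with rank count, which is the latency problem the abstract attributes to the ring algorithm.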
This paper summarizes the author's research on topologies of parallel computing systems and the tasks solved on them, including the corresponding modeling tools. An original topological model of such systems, based on a modified Amdahl's law, is presented. The model formalizes the dependence of the required number of processors, and of the maximal distance between information-adjacent vertices in a graph, on directive (target) values of speedup or efficiency. The dependence of these values on the system's interconnection topology and on the information graph of the parallel task is also formalized. Tools for a comparative evaluation of these dependences, topological criteria, and the scaling and fault-tolerance functions of parallel systems are based on the author's technique of projective graph description and the algorithms used in it.
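The paper's modified Amdahl's law is not reproduced here, but the classic form already shows the kind of dependence being formalized: given a parallel fraction p and a directive speedup S, one can solve for the required processor count n. The following sketch assumes the standard formula S(n) = 1 / ((1-p) + p/n) with 0 < p < 1.

```python
# Classic (unmodified) Amdahl's law: solving for the processor count needed
# to reach a directive speedup S with parallel fraction p, 0 < p < 1.
import math

def amdahl_speedup(p, n):
    """Speedup on n processors with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

def required_processors(p, target_speedup):
    """Smallest n reaching the target speedup; None if beyond the 1/(1-p) limit."""
    if target_speedup >= 1.0 / (1.0 - p):      # asymptotic Amdahl limit
        return None
    return max(1, math.ceil(p / (1.0 / target_speedup - (1.0 - p))))

print(amdahl_speedup(0.9, 10))      # ~5.26: 10 processors give only ~5.3x
print(required_processors(0.9, 5))  # 9 processors suffice for 5x
print(required_processors(0.9, 20)) # None: 20x exceeds the 10x limit for p=0.9
```

The paper's contribution is to couple such directive values additionally to the interconnection topology (via distances between information-adjacent vertices), which the plain formula above does not capture.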
A few grid-computing tools are available for public use. However, such systems are usually quite complex and require several person-months to set up. If a user wishes to set up an ad-hoc grid in a short span of time, such tools cannot be used. Moreover, the complex services they provide, such as reliable file transfer and extra layers of security, act as a performance overhead when the network is small and reliable. In this paper we describe the structure of our grid-computing framework, which can be implemented and used easily on a moderate-sized network.
In recent years, the use of heterogeneous hardware beyond small-core CPUs, such as GPUs, FPGAs, and many-core CPUs, has been increasing. However, using heterogeneous hardware demands high technical skills, such as CUDA programming. To address this, I have proposed environment-adaptive software that enables automatic conversion, configuration, and high-performance operation of once-written code according to the hardware on which it is placed. Until now, however, the source language for offloading has mainly been C/C++ applications, and there has been no research on common offloading for applications in various languages. In this paper, I study a common method for automatically offloading applications written not only in C but also in Python and Java.
MPI applications begin with a fixed number of ranks and, by default, this number remains constant throughout the application's lifetime. The developer can choose to increase the number of ranks by dynamically spawning MPI processes; however, doing this manually adds complexity to the MPI application. Making MPI applications malleable \cite{b20} would allow HPC applications to have the same elasticity as cloud applications. We propose multiple approaches to changing the number of ranks of an MPI program without modifying user code. We use checkpointing as a tool to achieve this mutability, halting execution and resuming the MPI program with a new state. In this paper, we focus on the scenario of increasing the number of ranks of an MPI program, using ExaMPI as the MPI implementation.
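The checkpoint/restart idea can be sketched without MPI at all (an illustrative toy, not the ExaMPI mechanism): serialize the global state, then "resume" under a different rank count by re-partitioning the data across the new ranks.

```python
# Toy checkpoint/restart resize (illustrative only, not ExaMPI): save state,
# then resume with more ranks by block-partitioning the data across them.
import pickle

def checkpoint(data, nranks):
    """Serialize the global state plus the current rank count."""
    return pickle.dumps({"data": data, "nranks": nranks})

def restart(blob, new_nranks):
    """Resume under a new rank count: re-partition the global data."""
    state = pickle.loads(blob)
    data = state["data"]
    per = -(-len(data) // new_nranks)   # ceiling division: items per rank
    return [data[i * per:(i + 1) * per] for i in range(new_nranks)]

blob = checkpoint(list(range(8)), nranks=2)
print(restart(blob, new_nranks=4))      # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

The appeal of this route is exactly what the abstract states: the resize happens between halting and resuming, so the user's computation code never has to reason about a changing rank count.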
Flat combining is a concurrency technique whereby one thread performs all pending operations in a batch, scanning a queue of to-be-done operations and executing them together. Flat combining makes sense as long as k operations, each taking O(n) separately, can be batched together and completed in less than O(k*n). A red-black tree is a balanced binary search tree with strict balancing guarantees. Operations on a red-black tree are hard to batch: for example, inserting nodes into two different branches affects different areas of the tree. In this paper we investigate alternatives for making a flat-combining approach work for red-black trees.
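The basic flat-combining pattern is easiest to see on a structure that batches trivially, such as a counter (a minimal sketch; the paper's challenge is precisely that red-black trees do not batch this cleanly): threads publish requests, and whichever thread wins the combiner lock applies the whole batch in one pass.

```python
# Minimal flat-combining sketch: threads publish operations to a shared list;
# the thread that wins the combiner lock applies the entire batch at once,
# so k publications cost one traversal instead of k independent updates.
import threading

class FlatCombiningCounter:
    def __init__(self):
        self.value = 0
        self.pending = []                 # published, not-yet-applied operations
        self.publish_lock = threading.Lock()
        self.combiner_lock = threading.Lock()

    def add(self, delta):
        with self.publish_lock:           # cheap: just announce the operation
            self.pending.append(delta)
        if self.combiner_lock.acquire(blocking=False):
            try:                          # this thread became the combiner
                with self.publish_lock:
                    batch, self.pending = self.pending, []
                for d in batch:           # apply the whole batch in one pass
                    self.value += d
            finally:
                self.combiner_lock.release()

counter = FlatCombiningCounter()
threads = [threading.Thread(target=counter.add, args=(1,)) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
counter.add(0)                            # a final combine drains any leftovers
print(counter.value)                      # 100
```

For a counter, the batch collapses to one pass regardless of operation order; for a red-black tree, operations landing in different branches resist exactly this kind of collapsing, which is the problem the paper addresses.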
Non-volatile memory is expected to co-exist with or replace DRAM in upcoming architectures. Durable concurrent data structures for non-volatile memory are essential building blocks for constructing adequate software for these architectures. In this paper, we propose a new approach to durable concurrent sets and use this approach to build the most efficient durable hash tables available today. Evaluation shows a performance improvement factor of up to 3.3x over existing technology.
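What "durable" demands can be illustrated with a toy set (this is a generic illustration of the durability contract, not the paper's construction): every update is persisted, standing in here for flushed NVM cache lines, before the operation returns, so the structure can be rebuilt after a crash.

```python
# Toy durability illustration (not the paper's algorithm): each update is
# recorded in a persistence "log" (a stand-in for flushed NVM lines) before
# the operation returns, so the set is recoverable after a crash.

class DurableSet:
    def __init__(self, log=None):
        self.log = log if log is not None else []
        self.items = set()
        for op, x in self.log:              # recovery: replay persisted updates
            (self.items.add if op == "add" else self.items.discard)(x)

    def add(self, x):
        self.log.append(("add", x))         # persist first (flush + fence on NVM)
        self.items.add(x)

    def remove(self, x):
        self.log.append(("remove", x))
        self.items.discard(x)

s = DurableSet()
s.add(1); s.add(2); s.remove(1)
recovered = DurableSet(log=s.log)           # simulate crash + recovery
print(sorted(recovered.items))              # [2]
```

The engineering difficulty, and the source of the paper's speedups, lies in minimizing how many such flushes and fences each concurrent operation must issue.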
Building a library of concurrent data structures is an essential way to simplify the difficult task of developing concurrent software. Lock-free data structures, in which processes can help one another to complete operations, offer the following progress guarantee: If processes take infinitely many steps, then infinitely many operations are performed. Handcrafted lock-free data structures can be very efficient, but are notoriously difficult to implement. We introduce numerous tools that support the development of efficient lock-free data structures, and especially trees.
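The core pattern such tools build upon is the compare-and-swap retry loop, shown here on a Treiber stack (a standard textbook structure, not one of the paper's trees). Python has no hardware CAS, so the sketch simulates one with a lock; real lock-free code would use an atomic instruction.

```python
# Treiber stack sketch showing the canonical lock-free CAS retry loop.
# Python lacks hardware compare-and-swap, so `_cas_top` simulates it with a
# lock; a real implementation would use an atomic CAS instruction instead.
import threading

class Node:
    __slots__ = ("value", "next")
    def __init__(self, value, nxt):
        self.value, self.next = value, nxt

class TreiberStack:
    def __init__(self):
        self.top = None
        self._cas_lock = threading.Lock()   # stands in for hardware CAS only

    def _cas_top(self, expected, new):
        """Atomically set top to `new` iff it still equals `expected`."""
        with self._cas_lock:
            if self.top is expected:
                self.top = new
                return True
            return False

    def push(self, value):
        while True:                         # classic retry loop: re-read, retry on conflict
            old = self.top
            if self._cas_top(old, Node(value, old)):
                return

    def pop(self):
        while True:
            old = self.top
            if old is None:
                return None
            if self._cas_top(old, old.next):
                return old.value

s = TreiberStack()
for v in (1, 2, 3):
    s.push(v)
print(s.pop(), s.pop(), s.pop(), s.pop())   # 3 2 1 None
```

A failed CAS means another process made progress, which is exactly the guarantee quoted in the abstract: some operation completes even when an individual process keeps retrying.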
The use of virtualization technologies in different contexts, such as cloud environments, the Internet of Things (IoT), and Software-Defined Networking (SDN), has increased rapidly in recent years. Among these technologies, container-based solutions offer characteristics well suited to deploying distributed and lightweight applications. This paper presents a performance evaluation of container technologies on constrained devices, in this case the Raspberry Pi. The study shows that, overall, the overhead added by containers is negligible.
The design of a parallel computing system using several thousand or even up to a million processors calls for processing units that are simple, and thus small in area, so that as many processing units as possible fit on a single die. The design presented here is far from optimised; it is not meant to compete with industrial high-performance devices. Its main purpose is to allow a prototypical implementation of a dynamic software system as a proof of concept.
I introduce a new distributed system for effectively training and regularizing large-scale neural networks on distributed computing architectures. The experiments demonstrate the effectiveness of flexible model partitioning and parallelization strategies based on a neuron-centric computation model, with an implementation of collective and parallel dropout neural network training. Experiments are performed on MNIST handwritten digit classification, and results are reported.
Solving the software dependency problem in HPC environments has always been a difficult task for both computing system administrators and application scientists. This work tackles the issue by introducing modern container technology, specifically Docker. By integrating the auto-scaling feature of service discovery with this lightweight virtualization tool, we attempt to construct a virtual cluster on top of physical cluster hardware. Through isolation of the computing environment, a remedy for the software dependency problem of HPC environments becomes possible.
Matlab is one of the most widely used mathematical computing environments in technical computing. Its interactive environment provides high-performance computing (HPC) procedures and is easy to use. Parallel computing with Matlab has been an area of interest for parallel computing researchers for a number of years, and there have been many attempts to parallelize Matlab. In this paper, we survey most of the past and present parallel Matlab efforts, such as MatlabMPI, bcMPI, pMatlab, Star-P and the PCT, and finally discuss expected future directions.