Distributed computing frameworks such as MapReduce have become essential for large-scale data processing by decomposing tasks across multiple nodes. The multi-access distributed computing (MADC) model further advances this paradigm by decoupling mapper and reducer roles: dedicated mapper nodes store data and compute intermediate values, while reducer nodes are connected to multiple mappers and aggregate results to compute final outputs. This separation reduces communication bottlenecks without requiring file replication. In this paper, we introduce privacy constraints into MADC and develop private coded schemes for two specific connectivity models. We construct new families of extended placement delivery arrays and derive corresponding coding schemes that guarantee privacy of each reducer's assigned function.
Portable GPU frameworks such as Kokkos and RAJA reduce the burden of cross-architecture development but typically incur measurable overhead on fundamental parallel primitives relative to vendor-optimized libraries. We present KernelForge.jl, a Julia library that implements scan, mapreduce, and matrix-vector primitives through a two-layer portable architecture: KernelIntrinsics.jl provides backend-agnostic abstractions for warp-level shuffles, memory fences, and vectorized memory access, while KernelForge.jl builds high-performance algorithms exclusively on top of these interfaces. Evaluated on an NVIDIA A40 and an AMD MI300X, KernelForge.jl matches or exceeds CUB kernel execution time on scan and mapreduce on the A40, and matches cuBLAS throughput on matrix-vector operations across most tested configurations, demonstrating as a proof of concept that portable JIT-compiled abstractions can achieve vendor-level throughput without sacrificing generality.
NBI-Slurm is a Perl package that provides a simplified, user-friendly interface for submitting and managing jobs on SLURM high-performance computing (HPC) clusters. It offers both a library of Perl modules for programmatic job management and a suite of command-line tools designed to reduce the cognitive overhead of SLURM's native interface. Distinctive features of NBI-Slurm are (a) TUI applications to view and cancel jobs, (b) the ability to generate wrappers for specific (bioinformatics) tools, and (c) an energy-aware scheduling mode -- "eco mode" -- that automatically defers flexible jobs to off-peak periods, helping research institutions reduce their computational carbon footprint without requiring users to manually plan submission times.
We consider large-scale systems influenced by burnout variables: state variables that start active, shape the dynamics, and irreversibly deactivate once certain conditions are met. Simulating what-if scenarios in such systems is computationally demanding, as alternative trajectories often require sequential processing, which scales poorly. This challenge arises in settings like online advertising, where campaign budgets act as burnout variables, complicating counterfactual analysis despite rich data availability. We introduce a new class of algorithms based on what we refer to as uncertainty relaxation, which enables efficient parallel computation and significantly improves the scalability of counterfactual estimation in systems with burnout variables.
Matrix-vector multiplication forms the basis of many iterative solution algorithms and is therefore an important operation for hierarchical matrices, which represent dense data in an optimized form by applying low-rank compression. However, due to its low computational intensity, the performance of matrix-vector multiplication is typically limited by the available memory bandwidth on parallel systems. Floating-point compression can reduce the memory footprint, which relieves pressure on the memory subsystem and thereby increases performance. We examine the compression of different hierarchical matrix formats and how it can be used to speed up the corresponding matrix-vector multiplication.
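As an illustrative sketch (not the paper's implementation), a single low-rank block A ≈ U·Vᵀ of a hierarchical matrix can be applied to a vector as U·(Vᵀ·x), and storing the factors at reduced precision (here float32 instead of float64) roughly halves the memory traffic that bounds performance:

```python
import numpy as np

def lowrank_matvec(U, V, x):
    """y = (U @ V.T) @ x computed as U @ (V.T @ x): O((m+n)k) work instead of O(mn)."""
    return U @ (V.T @ x)

rng = np.random.default_rng(0)
m, n, k = 200, 150, 5
U = rng.standard_normal((m, k))       # rank-k factors of one H-matrix block
V = rng.standard_normal((n, k))
x = rng.standard_normal(n)

# "Compressed" storage: keep the factors in float32 to halve the footprint.
Uc, Vc = U.astype(np.float32), V.astype(np.float32)

y_full = lowrank_matvec(U, V, x)
y_comp = lowrank_matvec(Uc, Vc, x.astype(np.float32))

# The compressed matvec agrees with the full-precision one to float32 accuracy.
print(np.linalg.norm(y_full - y_comp) / np.linalg.norm(y_full))
```

The factored evaluation order is what makes the bandwidth saving matter: the dense block is never materialized, so only the (compressed) factors are streamed from memory.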
This paper describes a simple and efficient asynchronous binary Byzantine fault-tolerant consensus algorithm. Non-faulty nodes perform an initial broadcast followed by a series of rounds, each consisting of a single message broadcast plus the computation of a global random coin using threshold signatures. Each message is accompanied by a cryptographic proof of its validity. Up to one third of the nodes can be faulty, and termination is expected in a constant number of rounds. An optimization is described that allows the round message and the coin message to be combined, reducing each round to a single message delay. Geo-distributed experiments are run on replicas in ten data center regions, showing average latencies as low as 400 milliseconds.
Merim Dzaferagic, Neal McBride, Ryan Thomas
et al.
We present a novel way of considering in-network computing (INC), using ideas from statistical physics. We define degeneracy for INC as the multiplicity of possible options available within the network to perform the same function with a given macroscopic property (e.g. delay). We present an efficient algorithm to determine all these alternatives. Our results show that by exploiting the set of possible degenerate alternatives, we can significantly improve the successful computation rate of a symmetric function, while still being able to satisfy requirements such as delay or energy consumption.
Optimization of applications for supercomputers of the highest performance class requires parallelization at multiple levels using different techniques. In this contribution we focus on parallelizing particle physics simulations through vector instructions. With the advent of the Scalable Vector Extension (SVE) ISA, future ARM-based processors are expected to provide a significant degree of parallelism at this level.
In this paper, we discuss a voltage control method for a buck converter operating in continuous conduction mode (CCM) using an analog feedback system. The aim of this work is to regulate the output voltage of a buck converter under variations in load current. This is achieved using analog feedback built with an operational amplifier (op-amp). The same technique can be applied to other DC-DC converters (e.g., boost, buck-boost, Ćuk) operating in CCM, but for the purpose of analysis the buck converter is chosen as an example.
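For context, the ideal steady-state relation of a CCM buck converter is Vout = D·Vin, so the feedback loop effectively adjusts the duty cycle D toward Vout/Vin as conditions change. A minimal numeric sketch (component values illustrative, not from the paper):

```python
def required_duty(v_in, v_out):
    """Ideal CCM buck steady-state: Vout = D * Vin, hence D = Vout / Vin."""
    assert 0 < v_out <= v_in, "a buck converter can only step the voltage down"
    return v_out / v_in

# Regulating a 5 V output as the input varies: the loop must shift D accordingly.
for v_in in (9.0, 12.0, 15.0):
    print(f"Vin={v_in:4.1f} V -> D={required_duty(v_in, 5.0):.3f}")
```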
The timing system synchronizes and coordinates each subsystem according to the trigger signals. A new prototype timing slave node based on the Precision Time Protocol has been developed on the ARM-based STM32 platform. The proposed slave timing module was tested, and results show that the synchronization accuracy between slave nodes is in the sub-microsecond range.
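The core of a Precision Time Protocol (IEEE 1588) slave is the standard offset/delay estimate computed from the four timestamps of a sync exchange; a small sketch with illustrative values:

```python
def ptp_offset_delay(t1, t2, t3, t4):
    """Standard PTP estimate from one sync exchange.
    t1: master send, t2: slave receive, t3: slave send, t4: master receive.
    Assumes a symmetric path delay."""
    offset = ((t2 - t1) - (t4 - t3)) / 2   # slave clock minus master clock
    delay  = ((t2 - t1) + (t4 - t3)) / 2   # one-way path delay
    return offset, delay

# Slave clock 250 ns ahead of the master, one-way path delay 1000 ns:
t1 = 0
t2 = t1 + 1000 + 250   # master -> slave: delay + offset
t3 = t2 + 5000         # slave processing time (irrelevant to the estimate)
t4 = t3 + 1000 - 250   # slave -> master: delay - offset
print(ptp_offset_delay(t1, t2, t3, t4))  # (250.0, 1000.0)
```

The slave then steers its clock by the estimated offset; sub-microsecond accuracy hinges on how symmetric the path delay really is and on timestamping close to the hardware.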
Swapneel Mehta, Prasanth Kothuri, Daniel Lanza Garcia
We propose an architecture for analysing database connection logs across different database instances within an intranet comprising over 10,000 users and associated devices. Our system uses Flume agents to send notifications to a Hadoop Distributed File System for long-term storage and to ElasticSearch and Kibana for short-term visualisation, effectively creating a data lake for the extraction of log data. We adopt an ensemble of machine learning approaches to filter and process the indicators within the data, aiming to predict anomalies or outliers using feature vectors built from this log data.
Selective bulk analyses, such as statistical learning on temporal/spatial data, are fundamental to a wide range of contemporary data analysis. However, with increasingly large datasets, such as weather data and marketing transactions, data organization and access become more challenging for selective bulk processing in current big data frameworks such as Spark or key-value stores. In this paper, we propose a method, referred to as Oseba, to optimize selective bulk analysis in big data processing. Oseba maintains a super index for data organization in memory to support fast lookup, targeting only the data involved in each selective analysis program. Oseba saves both memory and computation compared to the default data processing frameworks.
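The idea of an in-memory index that lets a selective analysis touch only the relevant records can be sketched as follows (a hypothetical illustration; field names and the index layout are not from the paper):

```python
from collections import defaultdict

# Toy dataset: each record carries a selection key ("region") and a payload.
records = [
    {"region": "eu", "value": 3},
    {"region": "us", "value": 5},
    {"region": "eu", "value": 7},
    {"region": "us", "value": 1},
]

# Build the index once: selection key -> offsets of the matching records.
index = defaultdict(list)
for offset, rec in enumerate(records):
    index[rec["region"]].append(offset)

def selective_sum(region):
    # Read only the indexed offsets instead of scanning every record.
    return sum(records[i]["value"] for i in index[region])

print(selective_sum("eu"))  # 10
```

A full scan would touch all records per query; the index trades a small amount of memory for lookups proportional to the selection size.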
The matricized-tensor times Khatri-Rao product computation is the typical bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop high performance sequential and parallel algorithms, we establish communication lower bounds that identify how much data movement is required for this computation in the case of dense tensors. We also present sequential and parallel algorithms that attain the lower bounds and are therefore communication optimal. In particular, we show that the structure of the computation allows for less communication than the straightforward approach of casting the computation as a matrix multiplication operation.
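As an illustrative sketch (not the paper's communication-optimal algorithm), the MTTKRP for a 3-way tensor in mode 1 is M[i,r] = Σ_{j,k} X[i,j,k]·B[j,r]·C[k,r]. Contracting one mode at a time avoids materializing the Khatri-Rao product B ⊙ C, which is precisely the structure that allows less data movement than a plain matrix-multiplication formulation:

```python
import numpy as np

def mttkrp_mode1(X, B, C):
    """Mode-1 MTTKRP without forming the Khatri-Rao product explicitly."""
    T = np.einsum('ijk,kr->ijr', X, C)   # contract mode 3 with C
    return np.einsum('ijr,jr->ir', T, B) # then contract mode 2 with B

rng = np.random.default_rng(1)
I, J, K, R = 6, 5, 4, 3
X = rng.standard_normal((I, J, K))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

M = mttkrp_mode1(X, B, C)

# Reference: unfold X and multiply by the explicit Khatri-Rao product B ⊙ C.
KR = np.einsum('jr,kr->jkr', B, C).reshape(J * K, R)
M_ref = X.reshape(I, J * K) @ KR
print(np.allclose(M, M_ref))  # True
```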
Julian Andres Cuellar, Camila Sanchez Sandoval, Sergio Alfonso Huertas
The aim of this article is to reflect on the epistemic postulates of the work carried out by (Cuellar Argote, 2007) and (Leyva & Ramirez, 2015), with the goal of updating the premises and contributions these authors brought to the discussion of Political Science in Colombia and its process of consolidation. The article employs a documentary-analysis methodology, and its development reveals shared interests among the authors concerning the teaching and training of Political Science in Colombia.
Johanne Cohen, Jonas Lefèvre, Khaled Maâmra
et al.
We propose a self-stabilizing algorithm for computing a maximal matching in an anonymous network. Its complexity is $O(n^3)$ moves with high probability under the adversarial distributed daemon. The algorithm requires each node to determine whether one of its neighbors points to it or to another node, which appears to contradict the anonymity assumption. To resolve this, we provide, under the classical link-register model, a self-stabilizing algorithm that assigns each link a unique name shared by both of its endpoints.
MPSoCs are gaining popularity because of their potential to solve computationally expensive applications. A multi-core processor combines two or more independent cores (normally CPUs) into a single package built on a single integrated circuit (chip). However, as the number of components on a single chip and their performance continue to increase, a shift from computation-based to communication-based design becomes mandatory. As a result, the communication architecture plays a major role in the area, performance, and energy consumption of the overall system. In this paper, multiple soft cores (IPs), such as MicroBlaze, in an FPGA are used to study the effect of different connection topologies on the performance of a parallel program.
In multi-cloud environments, task scheduling has attracted a lot of attention due to the NP-complete nature of the problem. It is particularly challenging because cloud resources are heterogeneous, with varying capacities and functionalities; minimizing the makespan of a task schedule is therefore a difficult problem. In this paper, we propose a genetic algorithm (GA) based approach to the task scheduling problem, featuring a novel fitness-function derivation and mutation scheme. The proposed algorithm is rigorously tested on various benchmark datasets, and its performance is evaluated in terms of total makespan.
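A generic GA for makespan-based scheduling can be sketched as follows; this is a hypothetical illustration of the idea (task lengths, VM speeds, and the mutation rate are made up, and the paper's actual fitness derivation is not reproduced here):

```python
import random

random.seed(0)  # deterministic run for this sketch

def makespan(assignment, task_len, vm_speed):
    """assignment[i] = VM index of task i; makespan = latest VM finish time."""
    finish = [0.0] * len(vm_speed)
    for task, vm in enumerate(assignment):
        finish[vm] += task_len[task] / vm_speed[vm]
    return max(finish)

def mutate(assignment, n_vms, rate=0.1):
    """Reassign each task to a random VM with probability `rate`."""
    return [random.randrange(n_vms) if random.random() < rate else vm
            for vm in assignment]

task_len = [10, 20, 30, 40]   # abstract task lengths
vm_speed = [1.0, 2.0]         # heterogeneous VM processing speeds
pop = [[random.randrange(2) for _ in task_len] for _ in range(20)]

for _ in range(50):           # elitist selection + mutation; fitness = 1/makespan
    pop.sort(key=lambda a: makespan(a, task_len, vm_speed))
    pop = pop[:10] + [mutate(a, 2) for a in pop[:10]]

best = min(pop, key=lambda a: makespan(a, task_len, vm_speed))
print(makespan(best, task_len, vm_speed))
```

Elitism guarantees the best schedule never worsens across generations; real GA schedulers add crossover and more informed fitness terms on top of this skeleton.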
Cloud computing is flourishing and will continue to develop for as long as computers and the internet exist. It raises a number of security and traffic-related issues, and load balancing is one answer to them; role-based access control (RBAC) provides such an answer. The proposed technique is a hybrid of first-come-first-served (FCFS) scheduling with RBAC: RBAC assigns roles to clients, and only clients holding a particular role can access the corresponding documents. Hence identity management and access management are fully implemented using this technique.
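The FCFS-plus-RBAC combination described above can be sketched as follows (a hypothetical illustration; the role names, document names, and permission table are invented for the example):

```python
from collections import deque

# RBAC policy: which documents each role may access.
ROLE_PERMISSIONS = {
    "admin": {"doc_public", "doc_internal", "doc_secret"},
    "staff": {"doc_public", "doc_internal"},
    "guest": {"doc_public"},
}

def process_requests(requests):
    """requests: iterable of (client, role, document) tuples.
    Serves requests in FCFS order; each is granted only if RBAC allows it."""
    queue = deque(requests)
    log = []
    while queue:
        client, role, doc = queue.popleft()                  # FCFS dequeue
        allowed = doc in ROLE_PERMISSIONS.get(role, set())   # RBAC check
        log.append((client, doc, "granted" if allowed else "denied"))
    return log

log = process_requests([
    ("alice", "admin", "doc_secret"),
    ("bob",   "guest", "doc_internal"),
    ("carol", "staff", "doc_internal"),
])
for entry in log:
    print(entry)
```

Scheduling order and access control stay orthogonal: the queue decides *when* a request is served, the role table decides *whether* it is served.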