This work proposes a novel computing performance unit grounded in information theory. Modern computing systems are increasingly diverse, supporting low-precision formats, hardware specialization, and emerging paradigms such as analog, quantum, and reversible logic. Traditional metrics like floating-point operations (flops) no longer accurately capture this complexity. We frame computing as the transformation of information through a channel and define performance in terms of the mutual information between a system's inputs and outputs. This approach measures not just the quantity of data processed, but the amount of meaningful information encoded, manipulated, and retained through computation. Our framework provides a principled, implementation-agnostic foundation for evaluating performance.
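The channel view can be made concrete with a toy histogram estimator of mutual information (an illustrative sketch; the `mutual_information` estimator, bin count, and quantized "computation" below are assumptions for demonstration, not the paper's definitions):

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Estimate I(X;Y) in bits from paired samples via 2D histograms."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of X, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, bins)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

# A lossy "computation": clip and round the input to a few levels.
y_lossy = np.round(np.clip(x, -2, 2) * 2) / 2

print(mutual_information(x, y_lossy))             # bits retained by the computation
print(mutual_information(x, rng.permutation(x)))  # near zero: no information carried
```

A low-precision pipeline retains fewer bits per sample than a high-precision one, which is exactly the distinction a flops count cannot express.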
As deep learning recommendation models (DLRMs) grow, they must be partitioned across multiple GPUs or nodes of GPUs because the total HBM capacity that can be packaged with a single GPU is limited. This partitioning adds the communication and synchronization overhead of sending and receiving data across GPUs. We use the NCCL and NVSHMEM libraries to measure the performance of an Embedding Bag kernel implemented on H100 GPUs. We compare its performance across different batch sizes, numbers of tables, table sizes, pooling factors, and embedding dimensions. For a large embedding table that spans multiple GPUs, we project the performance slowdown from distributing the table across multiple GPUs.
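The core of an embedding-bag lookup (gather rows, then pool) can be sketched in NumPy with assumed shapes; the H100 kernel and its NCCL/NVSHMEM communication paths are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(42)
num_rows, dim = 1000, 64        # one embedding table (sizes assumed for illustration)
table = rng.normal(size=(num_rows, dim)).astype(np.float32)

batch, pooling_factor = 8, 4    # each sample gathers and sums 4 rows
indices = rng.integers(0, num_rows, size=(batch, pooling_factor))

# EmbeddingBag with sum pooling: gather the indexed rows, reduce over
# the pooling axis to one vector per sample.
pooled = table[indices].sum(axis=1)   # shape (batch, dim)
print(pooled.shape)
```

On a partitioned table, the gather crosses GPU boundaries, which is where the communication overhead studied in the abstract arises.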
Thomas K. Sears, Wenxia Wang, Michael Drumm, et al.
Background/Objectives: The gene F3, encoding Tissue Factor (TF), is expressed in many cancers and contributes to their malignancy. Among adult-type diffuse gliomas, IDH1/2 wild-type (IDHwt) glioblastomas (GBM) express more TF than IDH1/2 mutant (IDHmut) gliomas. Tisotumab vedotin (TisVed), an anti-TF antibody conjugated to monomethyl auristatin E, is a therapeutic designed to target cells expressing TF. We therefore sought to determine the therapeutic potential of TisVed in IDHwt vs. IDHmut gliomas. Methods: We treated IDHwt and IDHmut patient-derived glioma cells with control IgG, unconjugated tisotumab (Tis), or TisVed in vitro, followed by cell viability assays and the assessment of TF signaling. We tested Tis and TisVed in mice intracranially engrafted with patient-derived IDHwt and IDHmut gliomas and mice flank-engrafted with IDHwt GBM. Results: TisVed was more active against cultured IDHwt GBM cells than IDHmut glioma cells. This activity was increased by the daily washout of soluble TF secreted by IDHwt GBM cells. Unconjugated Tis had less effect than TisVed, and TF signaling was minimally inhibited. TisVed extended the survival of mice intracranially engrafted with IDHwt GBM (p = 0.006), but not mice with IDHmut glioma (p = 0.88). TisVed also reduced the growth of IDHwt GBM flank xenografts. Tis alone had no antitumor effect in either setting. Notably, both TisVed and Tis were associated with hemorrhage in flank tumors. Conclusions: TisVed targets high-TF-expressing IDHwt GBM, but not low-TF-expressing IDHmut glioma. This is predominantly through the vedotin conjugate rather than inhibition of TF signaling. Though the effect size is modest, TisVed shows anticancer effects against IDHwt GBM. However, there could be complications related to hemostasis and hemorrhage.
We identify two major issues in the SoftDist paper (Xia et al.): (1) the failure to run all steps of different baselines on the same hardware environment, and (2) the use of inconsistent time measurements when comparing to other baselines. These issues lead to flawed conclusions. When all steps are executed in the same hardware environment, the primary claim made in SoftDist is no longer supported.
Optimal transport (OT) and unbalanced optimal transport (UOT) are central in many machine learning, statistics and engineering applications. 1D OT is easily solved, with complexity O(n log n), but no efficient algorithm was known for 1D UOT. We present a new approach that leverages the successive shortest path algorithm for the corresponding network flow problem. By employing a suitable representation, we bundle together multiple steps that do not change the cost of the shortest path. We prove that our algorithm solves 1D UOT in O(n log n), closing the gap.
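For intuition, the easily solved balanced 1D OT case that the abstract mentions reduces to sorting (a sketch of this classical result; the paper's O(n log n) unbalanced algorithm is more involved and is not reproduced here):

```python
import numpy as np

def ot_1d(x, y, p=2):
    # Balanced 1D OT between two equal-size, uniformly weighted samples:
    # the optimal plan pairs order statistics, so sort both sides and
    # sum |x_(i) - y_(i)|^p. Cost is O(n log n) from the two sorts.
    xs, ys = np.sort(x), np.sort(y)
    return float((np.abs(xs - ys) ** p).sum())

print(ot_1d(np.array([0.0, 1.0, 2.0]), np.array([2.0, 3.0, 4.0])))  # 12.0
```

Unbalanced OT additionally allows mass creation and destruction at a penalty, which is why the sorting argument alone no longer suffices.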
Efficient use of energy is essential for today's supercomputing systems, as energy cost is generally a major component of their operational cost. Research into "green computing" is needed to reduce the environmental impact of running these systems. As such, several scientific communities are evaluating the trade-off between time-to-solution and energy-to-solution. While the runtime of an application is typically easy to measure, power consumption is not. Therefore, we present the Power Measurement Toolkit (PMT), a high-level software library capable of collecting power consumption measurements on various hardware. The library provides a standard interface to easily measure the energy use of devices such as CPUs and GPUs in critical application sections.
The development of the A64FX processor by Fujitsu was a major innovation in vector processing and led to Fugaku, currently the world's fastest supercomputer. We use a variety of tools to analyze the behavior and performance of several OpenMP applications built with different compilers, and how these applications scale on the A64FX-based clusters at Stony Brook University and RIKEN.
We consider a distributed storage system that stores several hot (popular) and cold (less popular) data files across multiple nodes or servers. Hot files are stored using repetition codes, while cold files are stored using erasure codes. The nodes are prone to failure, so at any given time we assume that only a fraction of the nodes are available. Using a cavity-process-based mean-field framework, we analyze the download time for users accessing hot or cold data in the presence of failed nodes. Our work also illustrates the impact of the choice of storage code on the download-time performance of users in the system.
In this paper we extend the traditional transactor, which focuses on correct content transfer, to a new timing-coherent transactor that also accurately aligns the timing of each transaction boundary. This allows designers to perform precise concurrent system behavior analysis in the mixed-abstraction-level simulations that are essential to increasingly complex system designs. To streamline the process, we also developed an automatic approach for timing-coherent transactor generation. Our approach has been applied in mixed-level simulations, and the results show that it achieves 100% timing accuracy, whereas the conventional approach produces error rates of 25% to 44%.
Liang Yang, Yin Yang, Daniel Benevides da Costa, et al.
In this work, the performance of reconfigurable intelligent surface (RIS)-aided communication systems corrupted by co-channel interference (CCI) at the destination is investigated. Assuming Rayleigh fading and equal-power CCI, we present an analysis of the outage probability (OP), average bit error rate (BER), and ergodic capacity. In addition, an asymptotic outage analysis is carried out to obtain further insights. Our analysis shows that the number of reflecting elements, as well as the number of interferers, has a great impact on overall system performance.
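The qualitative effect of the element and interferer counts can be checked with a quick Monte Carlo sketch (an illustrative toy model with assumed parameters; the paper derives closed-form expressions, which are not reproduced here):

```python
import numpy as np

def outage_prob(n_elements, n_interferers, snr_db=10.0, threshold=1.0,
                trials=200_000, seed=1):
    # Monte Carlo outage probability of a phase-aligned RIS link under
    # Rayleigh fading with equal-power co-channel interference:
    #   SINR = snr * (sum_i |h_i||g_i|)^2 / (1 + snr * sum_k |f_k|^2)
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(2.0)          # |CN(0,1)| is Rayleigh(1/sqrt(2))
    snr = 10.0 ** (snr_db / 10.0)
    h = rng.rayleigh(scale, (trials, n_elements))
    g = rng.rayleigh(scale, (trials, n_elements))
    signal = snr * (h * g).sum(axis=1) ** 2
    cci = snr * rng.exponential(1.0, (trials, n_interferers)).sum(axis=1)
    return float(np.mean(signal / (1.0 + cci) < threshold))

print(outage_prob(4, 1), outage_prob(8, 1), outage_prob(4, 4))
```

More reflecting elements lower the outage probability and more interferers raise it, matching the abstract's qualitative conclusion.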
Hermes Senger, Jaime F. de Souza, Edson S. Gomi, et al.
We evaluate the performance of Devito, a domain-specific language (DSL) for finite differences, on Arm ThunderX2 processors. Experiments with two common seismic computational kernels demonstrate that Arm processors can deliver performance competitive with Intel Xeon processors.
Geraint I. Palmer, Vincent A. Knight, Paul R. Harper, et al.
This paper introduces Ciw, an open source library for conducting discrete event simulations that has been developed in Python. The strengths of the library are illustrated in terms of best practice and reproducibility for computational research. An analysis of Ciw's performance and comparison to several alternative discrete event simulation frameworks is presented.
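The event-scheduling pattern underlying such simulators can be sketched in a few lines of pure Python (a toy M/M/1 queue using a future-event heap; this is a sketch of the general technique, not Ciw's actual API):

```python
import heapq
import random

def mm1_mean_wait(arrival_rate, service_rate, horizon, seed=0):
    # Minimal discrete event simulation of an M/M/1 FIFO queue: pop the
    # next event from a time-ordered heap, update state, schedule successors.
    rng = random.Random(seed)
    events = [(rng.expovariate(arrival_rate), "arrival")]
    busy_until, waits = 0.0, []
    while events:
        t, kind = heapq.heappop(events)
        if t > horizon:
            break
        if kind == "arrival":
            start = max(t, busy_until)      # wait in queue if server is busy
            waits.append(start - t)
            busy_until = start + rng.expovariate(service_rate)
            heapq.heappush(events, (t + rng.expovariate(arrival_rate), "arrival"))
    return sum(waits) / len(waits)

# M/M/1 theory predicts a mean wait of rho / (mu - lambda) = 1.0 here.
print(mm1_mean_wait(0.5, 1.0, horizon=100_000))
```

Checking a simulator against such closed-form queueing results is one of the reproducibility practices the paper advocates.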
Asymptotic properties of Markov processes, such as steady-state probabilities or hazard rates for absorbing states, can be efficiently calculated by means of linear algebra, even for large-scale problems. This paper discusses methods for adjusting the parameters of Markov models to account for non-constant transition rates. In particular, transitions with fixed delays are considered, along with transitions that follow Weibull and lognormal distributions. Procedures for both steady-state solutions in the absence of an absorbing state and hazard rates to an absorbing state are provided and demonstrated on several examples.
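With constant rates, the steady-state computation such methods build on reduces to a single linear solve (shown here for a toy 3-state generator with assumed rates, not one of the paper's examples):

```python
import numpy as np

# Generator matrix Q of a 3-state continuous-time Markov chain.
# Rows sum to zero; off-diagonal entries are transition rates (assumed).
Q = np.array([[-3.0,  2.0,  1.0],
              [ 1.0, -2.0,  1.0],
              [ 2.0,  2.0, -4.0]])

# Steady state solves pi Q = 0 with sum(pi) = 1. Since the rows of Q are
# linearly dependent, replace one equation by the normalization constraint.
A = Q.T.copy()
A[-1, :] = 1.0
b = np.zeros(3)
b[-1] = 1.0
pi = np.linalg.solve(A, b)
print(pi, pi @ Q)   # pi @ Q is numerically zero
```

Non-constant rates (fixed delays, Weibull, lognormal) break the constant-generator assumption, which is the gap the paper's parameter-adjustment procedures address.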
Markus Wittmann, Thomas Zeiser, Georg Hager, et al.
Simple floating-point operations like addition or multiplication on normalized floating-point values can be computed by current AMD and Intel processors in three to five cycles. This is different for denormalized numbers, which appear when an underflow occurs and the value can no longer be represented as a normalized floating-point value. Here the cost is about two orders of magnitude higher.
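A short snippet shows where denormalized (subnormal) values come from; the cycle costs themselves are hardware-dependent and are not demonstrated here:

```python
import sys

smallest_normal = sys.float_info.min   # 2**-1022 for an IEEE 754 double
subnormal = smallest_normal / 2.0      # underflows to a denormalized value

print(subnormal != 0.0)                    # True: gradual underflow keeps it nonzero
print(subnormal * 2.0 == smallest_normal)  # True: this subnormal is still exact
print(5e-324 / 2.0 == 0.0)                 # True: below the smallest subnormal, flush to zero
```

Without subnormals, any result smaller than `smallest_normal` would flush straight to zero; hardware pays for this gradual underflow with the slow path described above.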
Scientific software applications are increasingly developed by large interdisciplinary teams operating on functional modules organized around a common software framework that can integrate new functional capabilities without modifying its core. In such an environment, software correctness and modularity take precedence at the expense of code performance, which is an important concern during execution on supercomputing facilities, where the allocation of core-hours is a valuable resource. To alleviate these performance problems, we propose automated performance tuning (autotuning) of software to extract the maximum performance on a given hardware platform and to enable performance portability across heterogeneous hardware platforms. The resulting code remains generic, without committing to a particular software stack, yet is compile-time specializable for maximal sustained performance.
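The idea of empirical autotuning can be sketched with a toy parameter search (the `autotune` helper and blocked-sum kernel are hypothetical illustrations; the paper's system is far more elaborate):

```python
import time

def autotune(kernel, candidates, *args):
    # Empirical autotuning in miniature: time each candidate parameter
    # once and keep the fastest. Real autotuners average repeated runs
    # and search much larger spaces.
    best, best_t = None, float("inf")
    for c in candidates:
        t0 = time.perf_counter()
        kernel(c, *args)
        elapsed = time.perf_counter() - t0
        if elapsed < best_t:
            best, best_t = c, elapsed
    return best

def blocked_sum(block, data):
    # The tunable kernel: sum the data in chunks of a given block size.
    return sum(sum(data[i:i + block]) for i in range(0, len(data), block))

data = list(range(100_000))
best_block = autotune(blocked_sum, [64, 256, 1024, 4096], data)
print(best_block)
```

Every candidate computes the same result; only the performance differs, which is what lets the tuned code stay generic while being specialized per platform.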
This technical report covers a set of experiments on the 64-core SPARC T3-4 system, comparing it to two similar AMD and Intel systems. Key characteristics such as maximum integer and floating-point arithmetic throughput are measured, as well as memory throughput, showing the scalability of the SPARC T3-4 system. The performance of POSIX threads primitives, such as thread creation and mutex synchronization, is characterized and compared in detail. Scalability tests with a fine-grained multithreaded runtime are performed, revealing problems with atomic CAS operations on such physically highly parallel systems.
We consider a wireless network where each flow (instead of each link) runs its own CSMA (Carrier Sense Multiple Access) algorithm. Specifically, each flow attempts to access the radio channel after some random time and transmits a packet if the channel is sensed idle. We prove that, unlike the standard CSMA algorithm, this simple distributed access scheme is optimal in the sense that the network is stable for all traffic intensities in the capacity region of the network.
New algorithms and optimization techniques are needed to counter the accelerating trend towards bandwidth-starved multicore chips. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. For clusters of shared-memory nodes, we demonstrate how temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment.
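The basic idea of temporal blocking can be sketched with a minimal 1D Jacobi example using overlapped (halo) tiles; this illustrates the general technique only, not the paper's pipelined shared-cache scheme:

```python
import numpy as np

def step(u):
    # One 3-point Jacobi-style sweep; the two boundary cells stay fixed.
    v = u.copy()
    v[1:-1] = (u[:-2] + u[1:-1] + u[2:]) / 3.0
    return v

def naive(u, t):
    # Reference: t full-domain sweeps, streaming the whole array each step.
    for _ in range(t):
        u = step(u)
    return u

def temporally_blocked(u, t, block=64):
    # Overlapped temporal blocking: widen each block by a halo of t cells
    # per side, advance the tile t time steps locally, keep only the core.
    # Redundant halo work is traded for fewer passes over main memory.
    n = len(u)
    out = np.empty_like(u)
    for start in range(0, n, block):
        end = min(start + block, n)
        lo, hi = max(0, start - t), min(n, end + t)
        tile = u[lo:hi].copy()
        for _ in range(t):
            tile = step(tile)
        out[start:end] = tile[start - lo:end - lo]
    return out

u0 = np.random.default_rng(0).random(300)
print(np.allclose(naive(u0, 5), temporally_blocked(u0, 5)))  # True
```

Because errors from a tile's artificial boundary propagate one cell per step, a halo of width t keeps the tile core exact over t steps; the pipelined approach in the paper avoids this redundant halo work by synchronizing through shared caches.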