Results for "cs.PF"

Showing 20 of ~90546 results · from DOAJ, arXiv, CrossRef

arXiv Open Access 2025
Back to Bits: Extending Shannon's communication performance framework to computing

Max Hawkins, Richard Vuduc

This work proposes a novel computing performance unit grounded in information theory. Modern computing systems are increasingly diverse, supporting low-precision formats, hardware specialization, and emerging paradigms such as analog, quantum, and reversible logic. Traditional metrics like floating-point operations (flops) no longer accurately capture this complexity. We frame computing as the transformation of information through a channel and define performance in terms of the mutual information between a system's inputs and outputs. This approach measures not just the quantity of data processed, but the amount of meaningful information encoded, manipulated, and retained through computation. Our framework provides a principled, implementation-agnostic foundation for evaluating performance.

en cs.PF
arXiv Open Access 2025
Dissecting Embedding Bag Performance in DLRM Inference

Chandrish Ambati, Jing Ding, Trung Diep

As the size of DLRMs gets larger, the models must be partitioned across multiple GPUs or nodes of GPUs due to the size limitation of total HBM memory that can be packaged in a GPU. This partitioning adds communication and synchronization overhead of sending and receiving data across GPUs. We use the NCCL and NVSHMEM libraries to measure the performance of an Embedding Bag kernel implemented on H100 GPUs. We compare its performance across different batch sizes, number of tables, table sizes, pooling factors, and embedding dimensions. For a large embedding table that spans multiple GPUs, we project the performance slowdown from distributing an embedding table across multiple GPUs.

en cs.PF
CrossRef Open Access 2025
F3 Expression Drives Sensitivity to the Antibody-Drug Conjugate Tisotumab Vedotin in Glioblastoma

Thomas K. Sears, Wenxia Wang, Michael Drumm et al.

Background/Objectives: The gene F3, encoding Tissue Factor (TF), is expressed in many cancers and contributes to their malignancy. Among adult-type diffuse gliomas, IDH1/2 wild-type (IDHwt) glioblastomas (GBM) express more TF than IDH1/2 mutant (IDHmut) gliomas. Tisotumab vedotin (TisVed), an anti-TF antibody conjugated to monomethyl auristatin E, is a therapeutic designed to target cells expressing TF. We therefore sought to determine the therapeutic potential of TisVed in IDHwt vs. IDHmut gliomas. Methods: We treated IDHwt and IDHmut patient-derived glioma cells with control IgG, unconjugated tisotumab (Tis), or TisVed in vitro, followed by cell viability assays and the assessment of TF signaling. We tested Tis and TisVed in mice intracranially engrafted with patient-derived IDHwt and IDHmut gliomas and mice flank engrafted with IDHwt GBM. Results: TisVed was more active against cultured IDHwt GBM cells than IDHmut glioma cells. This activity was increased by the daily washout of soluble TF secreted by IDHwt GBM cells. Unconjugated Tis had less effect than TisVed, and TF signaling was minimally inhibited. TisVed extended the survival of mice intracranially engrafted with IDHwt GBM (p = 0.006), but not mice with IDHmut glioma (p = 0.88). TisVed also reduced the growth of IDHwt GBM flank xenografts. Tis alone had no antitumor effect in either setting. Notably, both TisVed and Tis were associated with hemorrhage in flank tumors. Conclusions: TisVed targets high-TF-expressing IDHwt GBM, but not low-TF-expressing IDHmut glioma. This is predominately through the vedotin conjugate rather than inhibition of TF signaling. Though the effect size is modest, TisVed shows anticancer effects against IDHwt GBM. However, there could be complications related to hemostasis and hemorrhage.

arXiv Open Access 2024
Comment on paper: Position: Rethinking Post-Hoc Search-Based Neural Approaches for Solving Large-Scale Traveling Salesman Problems

Yimeng Min

We identify two major issues in the SoftDist paper (Xia et al.): (1) the failure to run all steps of different baselines on the same hardware environment, and (2) the use of inconsistent time measurements when comparing to other baselines. These issues lead to flawed conclusions. When all steps are executed in the same hardware environment, the primary claim made in SoftDist is no longer supported.

en cs.PF, cs.AI
arXiv Open Access 2023
An Efficient Algorithm for Unbalanced 1D Transportation

Gabriel Gouvine

Optimal transport (OT) and unbalanced optimal transport (UOT) are central in many machine learning, statistics and engineering applications. 1D OT is easily solved, with complexity O(n log n), but no efficient algorithm was known for 1D UOT. We present a new approach that leverages the successive shortest path algorithm for the corresponding network flow problem. By employing a suitable representation, we bundle together multiple steps that do not change the cost of the shortest path. We prove that our algorithm solves 1D UOT in O(n log n), closing the gap.

en cs.PF, cs.CC
arXiv Open Access 2022
PMT: Power Measurement Toolkit

Stefano Corda, Bram Veenboer, Emma Tolley

Efficient use of energy is essential for today's supercomputing systems, as energy cost is generally a major component of their operational cost. Research into "green computing" is needed to reduce the environmental impact of running these systems. As such, several scientific communities are evaluating the trade-off between time-to-solution and energy-to-solution. While the runtime of an application is typically easy to measure, power consumption is not. Therefore, we present the Power Measurement Toolkit (PMT), a high-level software library capable of collecting power consumption measurements on various hardware. The library provides a standard interface to easily measure the energy use of devices such as CPUs and GPUs in critical application sections.

en cs.PF
arXiv Open Access 2021
Comparing the behavior of OpenMP Implementations with various Applications on two different Fujitsu A64FX platforms

Benjamin Michalowicz, Eric Raut, Yan Kang et al.

The development of the A64FX processor by Fujitsu has been a massive innovation in vectorized processors and led to Fugaku: the current world's fastest supercomputer. We use a variety of tools to analyze the behavior and performance of several OpenMP applications with different compilers, and how these applications scale on the different A64FX processors on clusters at Stony Brook University and RIKEN.

arXiv Open Access 2021
Download time analysis for distributed storage systems with node failures

Tim Hellemans, Arti Yardi, Tejas Bodas

We consider a distributed storage system which stores several hot (popular) and cold (less popular) data files across multiple nodes or servers. Hot files are stored using repetition codes while cold files are stored using erasure codes. The nodes are prone to failure and hence at any given time, we assume that only a fraction of the nodes are available. Using a cavity process based mean field framework, we analyze the download time for users accessing hot or cold data in the presence of failed nodes. Our work also illustrates the impact of the choice of the storage code on the download time performance of users in the system.

en cs.PF
arXiv Open Access 2021
Automatic Timing-Coherent Transactor Generation for Mixed-level Simulations

Li-Chun Chen, Hsin-I Wu, Ren-Song Tsay

In this paper we extend the concept of the traditional transactor, which focuses on correct content transfer, to a new timing-coherent transactor that also accurately aligns the timing of each transaction boundary so that designers can perform precise concurrent system behavior analysis in mixed-abstraction-level system simulations which are essential to increasingly complex system designs. To streamline the process, we also developed an automatic approach for timing-coherent transactor generation. Our approach is actually applied in mixed-level simulations and the results show that it achieves 100% timing accuracy while the conventional approach produces results of 25% to 44% error rate.

en cs.PF
arXiv Open Access 2020
Performance Analysis of an Interference-Limited RIS-Aided Network

Liang Yang, Yin Yang, Daniel Benevides da Costa et al.

In this work, the performance of reconfigurable intelligent surface (RIS)-aided communication systems corrupted by the co-channel interference (CCI) at the destination is investigated. Assuming Rayleigh fading and equal-power CCI, we present the analysis for the outage probability (OP), average bit error rate (BER), and ergodic capacity. In addition, an asymptotic outage analysis is carried out in order to obtain further insights. Our analysis shows that the number of reflecting elements as well as the number of interferers have a great impact on the overall system performance.

en cs.PF, cs.NI
arXiv Open Access 2019
Performance of Devito on HPC-Optimised ARM Processors

Hermes Senger, Jaime F. de Souza, Edson S. Gomi et al.

We evaluate the performance of Devito, a domain specific language (DSL) for finite differences on Arm ThunderX2 processors. Experiments with two common seismic computational kernels demonstrate that Arm processors can deliver competitive performance compared to other Intel Xeon processors.

en cs.PF
arXiv Open Access 2017
Ciw: An open source discrete event simulation library

Geraint I. Palmer, Vincent A. Knight, Paul R. Harper et al.

This paper introduces Ciw, an open source library for conducting discrete event simulations that has been developed in Python. The strengths of the library are illustrated in terms of best practice and reproducibility for computational research. An analysis of Ciw's performance and comparison to several alternative discrete event simulation frameworks is presented.

en cs.PF
arXiv Open Access 2017
Correcting for Non-Markovian Asymptotic Effects using Markovian Representation

Vitali Volovoi

Asymptotic properties of Markov Processes, such as steady state probabilities or hazard rate for absorbing states can be efficiently calculated by means of linear algebra even for large-scale problems. This paper discusses the methods for adjusting parameters of the Markov models to account for non-constant transition rates. In particular, transitions with fixed delays are considered along with the transitions that follow Weibull and lognormal distributions. Procedures for both steady-state solutions in the absence of an absorbing state, and for hazard rates to an absorbing state are provided and demonstrated on several examples.

en cs.PF
arXiv Open Access 2015
Short Note on Costs of Floating Point Operations on current x86-64 Architectures: Denormals, Overflow, Underflow, and Division by Zero

Markus Wittmann, Thomas Zeiser, Georg Hager et al.

Simple floating point operations like addition or multiplication on normalized floating point values can be computed by current AMD and Intel processors in three to five cycles. This is different for denormalized numbers, which appear when an underflow occurs and the value can no longer be represented as a normalized floating-point value. Here the costs are about two magnitudes higher.

en cs.PF
arXiv Open Access 2013
Software Autotuning for Sustainable Performance Portability

Azamat Mametjanov, Boyana Norris

Scientific software applications are increasingly developed by large interdisciplinary teams operating on functional modules organized around a common software framework, which is capable of integrating new functional capabilities without modifying the core of the framework. In such an environment, software correctness and modularity take precedence at the expense of code performance, which is an important concern during execution on supercomputing facilities, where the allocation of core-hours is a valuable resource. To alleviate the performance problems, we propose automated performance tuning (autotuning) of software to extract the maximum performance on a given hardware platform and to enable performance portability across heterogeneous hardware platforms. The resulting code remains generic without committing to a particular software stack and yet is compile-time specializable for maximal sustained performance.

en cs.PF
arXiv Open Access 2011
A Characterization of the SPARC T3-4 System

Michiel W. van Tol

This technical report covers a set of experiments on the 64-core SPARC T3-4 system, comparing it to two similar AMD and Intel systems. Key characteristics such as maximum integer and floating point arithmetic throughput are measured as well as memory throughput, showing the scalability of the SPARC T3-4 system. The performance of POSIX threads primitives is characterized and compared in detail, such as thread creation and mutex synchronization. Scalability tests with a fine grained multithreaded runtime are performed, showing problems with atomic CAS operations on such physically highly parallel systems.

en cs.PF, cs.DC
arXiv Open Access 2010
On the stability of flow-aware CSMA

T. Bonald, M. Feuillet

We consider a wireless network where each flow (instead of each link) runs its own CSMA (Carrier Sense Multiple Access) algorithm. Specifically, each flow attempts to access the radio channel after some random time and transmits a packet if the channel is sensed idle. We prove that, unlike the standard CSMA algorithm, this simple distributed access scheme is optimal in the sense that the network is stable for all traffic intensities in the capacity region of the network.

en cs.PF, cs.NI
arXiv Open Access 2009
Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory

Markus Wittmann, Georg Hager, Gerhard Wellein

New algorithms and optimization techniques are needed to balance the accelerating trend towards bandwidth-starved multicore chips. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. For clusters of shared-memory nodes we demonstrate how temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment.

en cs.PF, cs.DC

Page 5 of 4528