Results for "cs.PF"

Showing 20 of ~90589 results · from DOAJ, CrossRef, arXiv

arXiv Open Access 2026
Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10

Yifan Zhu, Yekai Pan, Chen Ding

High-performance attention kernels are essential for Large Language Models. This paper presents an analysis of CuTile-based Flash Attention memory behavior and a technique to improve its cache performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 cache misses. Leveraging this insight, we introduce a new programming technique called Sawtooth Wavefront Reordering that reduces L2 misses. We validate it in both CUDA and CuTile, observing a 50% or greater reduction in L2 misses and up to a 60% increase in throughput on GB10.

en cs.PF, cs.AI
arXiv Open Access 2026
From 8 Seconds to 370ms: Kernel-Fused SAR Imaging on Apple Silicon via Single-Dispatch FFT Pipelines

Mohamed Amine Bergach

We present the first kernel-fused SAR Range Doppler pipeline on any GPU platform. By fusing FFT, matched-filter multiply, and IFFT into a single Metal compute dispatch, keeping all intermediate data in 32 KiB on-chip memory, we process a 4096×4096 complex SAR scene in 370 ms on an Apple M1 GPU, a 22× speedup over the multi-dispatch baseline (8.16 s). We further report the first FFT to exploit Apple's simdgroup_matrix 8×8 hardware MMA, enabled by an in-place Cooley-Tukey decimation-in-frequency formulation that halves the memory footprint versus Stockham. Radar image quality is preserved: all five point targets show 0.0 dB SNR deviation from the unfused FP32 reference.

en cs.PF
CrossRef Open Access 2025
Thin Epitaxial Ionic Fluoride Films for Electronics Applications

Giulia Giovanelli, Mauro Borghi, Alessandro Lodi et al.

The realization of novel electronic devices based on 2D materials, i.e., field-effect transistors, has recently stimulated a renewed interest regarding ultrathin fluoride epitaxial films. Thanks to their chemical and dielectric properties, ionic fluorides could have the potential to be used as insulators in many applications that require high processing control down to the nanoscale. Here we provide a review of some of the principal results that have been achieved in the past decades regarding the controlled growth of epitaxial fluorides on different types of materials relevant for electronics. The aim is to provide a concise summary of the growth modes, crystallinity, film morphologies, and chemical interactions of different types of fluorides on different types of substrates, highlighting the possibilities of applications and the future perspectives.

arXiv Open Access 2025
Dissecting RISC-V Performance: Practical PMU Profiling and Hardware-Agnostic Roofline Analysis on Emerging Platforms

Alexander Batashev

As RISC-V architectures proliferate across embedded and high-performance domains, developers face persistent challenges in performance optimization due to fragmented tooling, immature hardware features, and platform-specific defects. This paper delivers a pragmatic methodology for extracting actionable performance insights on RISC-V systems, even under constrained or unreliable hardware conditions. We present a workaround to circumvent hardware bugs in one of the popular RISC-V implementations, enabling robust event sampling. For memory-compute bottleneck analysis, we introduce compiler-driven Roofline tooling that operates without hardware PMU dependencies, leveraging LLVM-based instrumentation to derive operational intensity and throughput metrics directly from application IR. Our open source toolchain automates these workarounds, unifying PMU data correction and compiler-guided Roofline construction into a single workflow.

en cs.PF
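
The Roofline analysis named in the abstract above rests on a simple bound: attainable throughput is the smaller of peak compute and memory bandwidth times operational intensity. The sketch below shows that textbook bound only. The function names and the numbers in the usage note are illustrative assumptions and are not taken from the paper or its toolchain.

```python
def roofline_bound(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Attainable GFLOP/s under the classic Roofline model.

    intensity is the operational intensity in FLOPs per byte moved.
    All names here are hypothetical; this is not the paper's tooling.
    """
    if min(peak_gflops, bandwidth_gbs, intensity) < 0:
        raise ValueError("inputs must be non-negative")
    # Memory-bound region: bandwidth * intensity. Compute-bound region: peak.
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Operational intensity at which a kernel stops being memory-bound."""
    if bandwidth_gbs <= 0:
        raise ValueError("bandwidth must be positive")
    return peak_gflops / bandwidth_gbs
```

For a hypothetical machine with 100 GFLOP/s peak and 10 GB/s bandwidth, a kernel at 2 FLOPs/byte is bounded by 20 GFLOP/s, and the ridge point sits at 10 FLOPs/byte.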
arXiv Open Access 2025
XTC, A Research Platform for Optimizing AI Workload Operators

Pompougnac Hugo, Guillon Christophe, Noiry Sylvain et al.

Achieving high efficiency on AI operators demands precise control over computation and data movement. However, existing scheduling languages are locked into specific compiler ecosystems, preventing fair comparison, reuse, and evaluation across frameworks. No unified interface currently decouples scheduling specification from code generation and measurement. We introduce XTC, a platform that unifies scheduling and performance evaluation across compilers. With its common API and reproducible measurement framework, XTC enables portable experimentation and accelerates research on optimization strategies.

en cs.PF, cs.AI
arXiv Open Access 2024
Communication efficient application of sequences of planar rotations to a matrix

Thijs Steel, Julien Langou

We present an efficient algorithm for the application of sequences of planar rotations to a matrix. Applying such sequences efficiently is important in many numerical linear algebra algorithms for eigenvalues. Our algorithm is novel in three main ways. First, we introduce a new kernel that is optimized for register reuse in a novel way. Second, we introduce a blocking and packing scheme that improves the cache efficiency of the algorithm. Finally, we thoroughly analyze the memory operations of the algorithm which leads to important theoretical insights and makes it easier to select good parameters. Numerical experiments show that our algorithm outperforms the state-of-the-art and achieves a flop rate close to the theoretical peak on modern hardware.

en cs.PF, cs.DS
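
For readers unfamiliar with the operation the abstract above optimizes, the following is a naive reference loop that applies a sequence of planar (Givens) rotations to pairs of matrix rows. It is a sketch of the basic operation only, assuming a row-wise convention and a list-of-lists matrix. It does not reproduce the paper's register-reuse kernel or its blocking and packing scheme, and the function names are hypothetical.

```python
import math


def apply_rotation(a, i, j, c, s):
    """Apply the planar rotation [[c, s], [-s, c]] to rows i and j of a, in place."""
    for k in range(len(a[i])):
        x, y = a[i][k], a[j][k]
        a[i][k] = c * x + s * y
        a[j][k] = -s * x + c * y


def apply_sequence(a, rotations):
    """Apply rotations given as (i, j, c, s) tuples, one after another."""
    for i, j, c, s in rotations:
        apply_rotation(a, i, j, c, s)
    return a


def column_norm(a, k):
    """Euclidean norm of column k, used to check that rotations preserve norms."""
    return math.sqrt(sum(row[k] ** 2 for row in a))
```

Each rotation touches two full rows, so a long sequence streams the matrix through memory repeatedly. That traffic is the cost the abstract's blocking scheme targets.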
CrossRef Open Access 2023
Latanoprost PF vs. Bimatoprost PF: Which Treats the Ocular Surface Better?

Georgios S. Dimtsas, Anastasia Tsiogka, Marilita M. Moschos

(1) Background: The current study aimed to compare two of the most frequently prescribed preservative-free (PF) antiglaucoma drops (Latanoprost PF vs. Bimatoprost PF) in promoting OSD in patients with POAG. (2) Methods: In this prospective study, 44 eyes from 44 participants were included. In the control group we enrolled 24 eyes, 11 eyes treated only with Latanoprost PF were enrolled in the Latanoprost PF group, and 9 eyes treated only with Bimatoprost PF in the Bimatoprost PF group. In all eyes, we evaluated the ocular levels of MMP-9 using the InflammaDry kit. We also performed Schirmer’s test and the TBUT test. (3) Results: We found elevated ocular levels of MMP-9 (>40 ng/mL) in the Bimatoprost PF group (88.89% of the participants) compared to the control (8.33%) and the Latanoprost PF group (27.27%), and the difference was statistically significant (p < 0.001). The Schirmer’s test values were statistically significantly lower in the Bimatoprost PF group compared to the other two groups. Additionally, the TBUT values were lower in the Bimatoprost PF group compared to the control group, and the difference was statistically significant. (4) Conclusions: Latanoprost PF eye drops treat the ocular surface better and they do not induce overexpression of MMP-9, a molecule that is related to OSD.

arXiv Open Access 2022
A Probabilistic Bound for Peak Age of Information Guarantee

Ailing Zhong, Zhidu Li, Tong Tang et al.

This paper considers the distribution of a general peak age of information (AoI) model and develops a general analysis approach for probabilistic performance guarantee from the time-domain perspective. Firstly, a general relationship between the peak AoI and the inter-arrival and service times of packets is revealed. With the help of martingale theory, a probabilistic bound on the peak AoI is then derived for the general case of endogenous independently and identically distributed increments in information generation and transmission processes. Thereafter, the application of the obtained bound is illustrated with the M/M/1 and D/M/1 queuing models. The validity of the proposed bound is finally examined with numerical results.

en cs.PF
arXiv Open Access 2021
An Analysis into the Performance and Memory Usage of MATLAB Strings

Travis Near

MATLAB is a mathematical computing environment used by many engineers, mathematicians, and students to process and understand their data. Important to all data science is the managing of textual data. MATLAB supports two textual data containers: (1) cell arrays of characters and (2) string arrays. This research showcases the strengths of string arrays over cell arrays by quantifying their performance, memory contiguity, syntax readability, interface fluidity, and autocomplete capabilities. These results demonstrate that string arrays often run 2x to 40x faster than cell arrays for common string benchmarks, are optimized for data locality by reducing metadata overhead, and offer a more expressive syntax due to their automatic data type conversions and vectorized methods.

en cs.PF
arXiv Open Access 2021
Age of Information for Small Buffer Systems

George Kesidis, Takis Konstantopoulos, Michael Zazanis

Consider a message processing system whose objective is to produce the most current information as measured by the quantity known as "age of information". We have argued in previous papers that if we are allowed to design the message processing policy ad libitum, we should keep a small buffer and operate according to a LIFO policy. In this small note we provide an analysis for the AoI of the P_m system which uses a buffer of size m, a single server, operating without service preemption and in a LIFO manner for stored messages. Analytical expressions for the mean (or even distribution) of the AoI in steady-state are possible but with the aid of computer algebra. We explain the analysis for m=3.

en cs.PF, math.PR
arXiv Open Access 2020
A Prompt Report on the Performance of Intel Optane DC Persistent Memory Module

Takahiro Hirofuchi, Ryousei Takano

In this prompt report, we present the basic performance evaluation of the Intel Optane Data Center Persistent Memory Module (Optane DCPMM), the first commercially available, byte-addressable non-volatile memory module, released in April 2019. Since at the moment of writing only a few reports on its performance were published, this letter is intended to complement other performance studies. Through experiments using our own measurement tools, we found that the latency of random read-only access was approximately 374 ns. That of random writeback-involving access was 391 ns. The bandwidths of read-only and writeback-involving access for interleaved memory modules were approximately 38 GB/s and 3 GB/s, respectively.

arXiv Open Access 2019
Real-Time Prediction of Delay Distribution in Service Systems using Mixture Density Networks

Majid Raeis, Ali Tizghadam, Alberto Leon-Garcia

Motivated by interest in providing more efficient services in customer service systems, we use statistical learning methods and delay history information to predict the conditional distribution of the customers' waiting times in queueing systems. From the predicted distributions, descriptive statistics of the system such as the mean, variance and percentiles of the waiting times can be obtained, which can be used for delay announcements, SLA conformance and better system management. We model the conditional distributions by mixtures of Gaussians, parameters of which can be estimated using Mixture Density Networks. The evaluations show that exploiting more delay history information can result in much more accurate predictions under realistic time-varying arrival assumptions.

en cs.PF
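
The abstract above notes that the mean, variance, and percentiles of the waiting time can be read off a predicted Gaussian mixture. The sketch below shows how those statistics follow from mixture parameters. It assumes the weights, means, and standard deviations are already given (for example, as the output of a trained network). The function names and the bisection-based percentile search are illustrative choices, not the authors' implementation.

```python
import math


def mixture_mean_var(weights, means, stds):
    """Mean and variance of a Gaussian mixture from its parameters."""
    if not math.isclose(sum(weights), 1.0, rel_tol=1e-9):
        raise ValueError("weights must sum to 1")
    mean = sum(w * m for w, m in zip(weights, means))
    # E[X^2] of each component is sigma^2 + mu^2.
    second = sum(w * (s * s + m * m) for w, m, s in zip(weights, means, stds))
    return mean, second - mean * mean


def mixture_cdf(x, weights, means, stds):
    """CDF of the mixture: weighted sum of the component normal CDFs."""
    return sum(
        w * 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))
        for w, m, s in zip(weights, means, stds)
    )


def mixture_percentile(q, weights, means, stds, iterations=200):
    """Value x with CDF(x) = q, found by bisection (0 < q < 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    lo = min(m - 10.0 * s for m, s in zip(means, stds))
    hi = max(m + 10.0 * s for m, s in zip(means, stds))
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, weights, means, stds) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A delay announcement could, for instance, quote `mixture_percentile(0.9, ...)` as a conservative waiting-time estimate. That use is a suggestion here, not a procedure stated in the abstract.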
arXiv Open Access 2019
Computing System Congestion Management Using Exponential Smoothing Forecasting

James F Brady

An overloaded computer must finish what it starts and not start what will fail or hang. A congestion management algorithm the author developed, and Siemens Corporation patented for telecom products, effectively manages traffic overload with its unique formulation of Exponential Smoothing forecasting. Siemens filed for exclusive rights to this technique in 2003 and obtained US patent US7301903B2 in 2007 with this author, an employee at the time of the filing, the sole inventor. A computer program, written in C language, which exercises the methodology is listed at the end of this document and available on GitHub.

en cs.PF
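
As background for the abstract above, the following sketch shows textbook simple exponential smoothing, where each forecast is `alpha * observation + (1 - alpha) * previous_forecast`. It is explicitly not the patented formulation the abstract refers to, whose details are not given here. The `admit` rule and all names are hypothetical, added only to illustrate how a smoothed load forecast might gate new work.

```python
def exponential_smoothing(observations, alpha):
    """Return the running simple-exponential-smoothing forecast.

    The first forecast is seeded with the first observation.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    forecasts = []
    forecast = None
    for x in observations:
        forecast = x if forecast is None else alpha * x + (1.0 - alpha) * forecast
        forecasts.append(forecast)
    return forecasts


def admit(forecast_load, capacity):
    """Hypothetical admission rule: accept new work only below capacity."""
    return forecast_load < capacity
```

With `alpha = 0.5`, the load samples 10, 20, 30 give forecasts 10, 15, 22.5. A larger `alpha` reacts faster to bursts, and a smaller one filters noise.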
arXiv Open Access 2019
Reliable Access to Massive Restricted Texts: Experience-based Evaluation

Zong Peng, Beth Plale

Libraries are seeing growing numbers of digitized textual corpora that frequently come with restrictions on their content. Large corpora for computational analysis, while of interest to scholars, can be cumbersome because of the combination of size, granularity of access, and access restrictions. Efficient management of such a collection for general access, especially under failures, depends on the primary storage system. In this paper, we identify the requirements of managing a massive text corpus for computational analysis and use them as a basis to evaluate candidate storage solutions. The study is based on the 5.9 billion page collection of the HathiTrust digital library. Our findings led to the choice of Cassandra 3.x for the primary back end store, which is currently in deployment in the HathiTrust Research Center.

en cs.PF
CrossRef Open Access 2017
Synthesis of PF-06928215 and analogues v1

Les Dakin

Discovery of PF-06928215 as a high affinity inhibitor of cGAS enabled by a novel fluorescence polarization assay. Cyclic GMP-AMP synthase (cGAS) initiates the innate immune system in response to cytosolic dsDNA. After binding and activation from dsDNA, cGAS uses ATP and GTP to synthesize 2ʹ,3ʹ-cGAMP (cGAMP), a cyclic dinucleotide second messenger with mixed 2ʹ-5ʹ and 3ʹ-5ʹ phosphodiester bonds. Inappropriate stimulation of cGAS has been implicated in autoimmune disease such as systemic lupus erythematosus, thus inhibition of cGAS may be of therapeutic benefit in some diseases; however, the size and polarity of the cGAS active site makes it a challenging target for the development of conventional substrate-competitive inhibitors. We report here the development of a high affinity (KD = 200 nM) inhibitor from a low affinity fragment hit with supporting biochemical and structural data showing these molecules bind to the cGAS active site. We also report a new high throughput cGAS fluorescence polarization (FP)-based assay to enable the rapid identification and optimization of cGAS inhibitors. This FP assay uses Cy5-labelled cGAMP in combination with a novel high affinity monoclonal antibody that specifically recognizes cGAMP with no cross reactivity to cAMP, cGMP, ATP, or GTP. Given its role in the innate immune response, cGAS is a promising therapeutic target for autoinflammatory disease. Our results demonstrate its druggability, provide a high affinity tool compound, and establish a high throughput assay for the identification of next generation cGAS inhibitors.

arXiv Open Access 2017
High-Performance Code Generation though Fusion and Vectorization

Jason Sewall, Simon J. Pennycook

We present a technique for automatically transforming kernel-based computations in disparate, nested loops into a fused, vectorized form that can reduce intermediate storage needs and lead to improved performance on contemporary hardware. We introduce representations for the abstract relationships and data dependencies of kernels in loop nests and algorithms for manipulating them into more efficient form; we similarly introduce techniques for determining data access patterns for stencil-like array accesses and show how this can be used to elide storage and improve vectorization. We discuss our prototype implementation of these ideas, named HFAV, and its use of a declarative, inference-based front-end to drive transformations, and we present results for some prominent codes in HPC.

en cs.PF, cs.DC
arXiv Open Access 2016
Breakdown of a Benchmark Score Without Internal Analysis of Benchmarking Program

Naoki Matagawa, Kazuyuki Shudo

A breakdown of a benchmark score shows how much each aspect of system performance affects the score. Existing methods require internal analysis of the benchmarking program and therefore have the following problems: (1) they require a certain amount of labor for code analysis, profiling, simulation, and so on, and (2) they require the benchmarking program itself. In this paper, we present a method for breaking down a benchmark score without internal analysis of the benchmarking program. The method utilizes regression analysis of benchmark scores on a number of systems. Experimental results with 3 benchmarks on 15 Android smartphones showed that our method could break down those benchmark scores, even though there is room for improvement in accuracy.

en cs.PF
arXiv Open Access 2016
A Unified, Hardware-Fitted, Cross-GPU Performance Model

James Stevens, Andreas Klöckner

We present a mechanism to symbolically gather performance-relevant operation counts from numerically-oriented subprograms ('kernels') expressed in the Loopy programming system, and apply these counts in a simple, linear model of kernel run time. We use a series of 'performance-instructive' kernels to fit the parameters of a unified model to the performance characteristics of GPU hardware from multiple hardware generations and vendors. We evaluate the predictive power of the model on a broad array of computational kernels relevant to scientific computing. In terms of the geometric mean, our simple, vendor- and GPU-type-independent model achieves relative accuracy comparable to that of previously published work using hardware-specific models.

en cs.PF, cs.DC
arXiv Open Access 2012
Performance Evaluation: Ball-Tree and KD-Tree in the Context of MST

Hazarath Munaga, Venkata Jarugumalli

Nowadays many algorithms have been or are being invented to solve the Euclidean Minimum Spanning Tree (EMST) problem, as its applicability is growing across a wide range of fields involving spatial or spatio-temporal data, such as astronomy, which involves millions of spatial data points. To solve this problem, we present a technique that adopts the dual-tree algorithm for efficiently finding the EMST, and we experiment on a variety of real-time and synthetic datasets. This paper presents the experimental observations and the efficiency of the dual-tree framework, in the context of kd-tree and ball tree, on spatial datasets of different dimensions.

Page 7 of 4530