Results for "cs.PF"

Showing 20 of ~90655 results · from DOAJ, CrossRef, arXiv

arXiv Open Access 2026
Colored Markov Modulated Fluid Queues

Benny Van Houdt

Markov-modulated fluid queues (MMFQs) are a powerful modeling framework for analyzing the performance of computer and communication systems. Their distinguishing feature is that the underlying Markov process evolves on a continuous state space, making them well suited to capture the dynamics of workloads, energy levels, and other performance-related quantities. Although classical MMFQs do not permit jumps in the fluid level, they can still be applied to analyze a wide range of jump processes. In this paper, we generalize the MMFQ framework in a new direction by introducing colored MMFQs and colored MMFQs with fluid jumps. This enriched framework provides an additional form of memory: the color of incoming fluid can be used to keep track of the fluid level when certain events took place. This capability greatly enhances modeling flexibility and enables the analysis of queueing systems that would otherwise be intractable due to the curse of dimensionality or state-space explosion.

en cs.PF
arXiv Open Access 2026
Numerical Kernels on a Spatial Accelerator: A Study of Tenstorrent Wormhole

Maya Taylor, Carl Pearson, Luc Berger-Vergiat et al.

As AI accelerators gain prominence, their potential for traditional scientific computing workloads remains unclear. This paper explores Tenstorrent's Wormhole architecture, a spatial computing platform designed for neural network acceleration, by implementing three numerical kernels and composing them into a conjugate gradient solver. We present architecture-specific optimizations for sparse numerical algorithms, evaluate their performance against Nvidia GPUs, and expose both challenges and opportunities in porting numerical methods to spatial architectures. Our results demonstrate that AI accelerators merit consideration for workloads traditionally dominated by CPUs and GPUs, and more work should be invested in understanding the capabilities of these architectures and making them accessible to the scientific computing community.

en cs.PF
arXiv Open Access 2025
Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling

Youpeng Zhao, Jinpeng LV, Di Wu et al.

Test-time scaling (TTS) has recently emerged as a promising direction to exploit the hidden reasoning capabilities of pre-trained large language models (LLMs). However, existing scaling methods narrowly focus on the compute-optimal Pareto-frontier, ignoring the simple fact that compute-optimal is not always system-optimal. In this work, we propose a system-driven perspective on TTS, analyzing how reasoning models scale against practical metrics, such as latency and cost-per-token. By evaluating the impact of popular optimizations such as tensor parallelism and speculative decoding, our preliminary analysis reveals the limitations of current methods and calls for a paradigm shift toward holistic, system-aware evaluations that capture the true essence of scaling laws at inference time.

en cs.PF, cs.AI
arXiv Open Access 2025
DarwinGame: Playing Tournaments for Tuning Applications in Noisy Cloud Environments

Rohan Basu Roy, Vijay Gadepally, Devesh Tiwari

This work introduces a new subarea of performance tuning -- performance tuning in a shared interference-prone computing environment. We demonstrate that existing tuners are significantly suboptimal by design because of their inability to account for interference during tuning. Our solution, DarwinGame, employs a tournament-based design to systematically compare application executions with different tunable parameter configurations, enabling it to identify the relative performance of different tunable parameter configurations in a noisy environment. Compared to existing solutions, DarwinGame achieves more than 27% reduction in execution time, with less than 0.5% performance variability. DarwinGame is the first performance tuner that will help developers tune their applications in shared, interference-prone, cloud environments.

en cs.PF
arXiv Open Access 2023
Model-Based Performance Analysis of the HyTeG Finite Element Framework

Dominik Thönnes, Ulrich Rüde

In this work, we present how code generation techniques significantly improve the performance of the computational kernels in the HyTeG software framework. This HPC framework combines the performance and memory advantages of matrix-free multigrid solvers with the flexibility of unstructured meshes. The pystencils code generation toolbox is used to replace the original abstract C++ kernels with highly optimized loop nests. The performance of one of those kernels (the matrix-vector multiplication) is thoroughly analyzed using the Execution-Cache-Memory (ECM) performance model. We validate these predictions by measurements on the SuperMUC-NG supercomputer. The experiments show that the performance mostly matches the predictions. In cases where the prediction does not match, we discuss the discrepancies. Additionally, we conduct a node-level scaling study which shows the expected behavior for a memory-bound compute kernel.

CrossRef Open Access 2022
Visual Cues for Turning in Parkinson’s Disease

Julia Das, Rodrigo Vitorio, Allissa Butterfield et al.

Turning is a common impairment of mobility in people with Parkinson’s disease (PD), which increases freezing of gait (FoG) episodes and has implications for falls risk. Visual cues have been shown to improve general gait characteristics in PD. However, the effects of visual cues on turning deficits in PD remain unclear. We aimed to (i) compare the response of turning performance while walking (180° and 360° turns) to visual cues in people with PD with and without FoG; and (ii) examine the relationship between FoG severity and response to visual cues during turning. This exploratory interventional study measured turning while walking in 43 participants with PD (22 with self-reported FoG) and 20 controls using an inertial sensor placed at the region of the fifth lumbar vertebra. Participants walked straight and performed 180° and 360° turns midway through a 10 m walk, which was done with and without visual cues (starred pattern). The turn duration and velocity response to visual cues were assessed using linear mixed effects models. People with FoG turned more slowly and took longer to turn than people with PD without FoG and controls (group effect: p < 0.001). Visual cues reduced the velocity of turning 180° across all groups and reduced the velocity of turning 360° in people with PD without FoG and controls. FoG severity was not significantly associated with response to visual cues during turning. Findings suggest that visual cueing can modify turning during walking in PD, with response influenced by FoG status and turn amplitude. Slower turning in response to visual cueing may indicate a more cautious and/or attention-driven turning pattern. This study contributes to our understanding of the influence that cues can have on turning performance in PD, particularly in freezers, and will aid in their therapeutic application.

arXiv Open Access 2022
TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s

Felix Chern, Blake Hechtman, Andy Davis et al.

This paper presents a novel nearest neighbor search algorithm achieving TPU (Google Tensor Processing Unit) peak performance, outperforming state-of-the-art GPU algorithms with a similar level of recall. The design of the proposed algorithm is motivated by an accurate accelerator performance model that takes into account both the memory and instruction bottlenecks. Our algorithm comes with an analytical guarantee of recall in expectation and does not require maintaining a sophisticated index data structure or tuning, making it suitable for applications with frequent updates. Our work is available in the open-source package of JAX and TensorFlow on TPU.

en cs.PF, cs.LG
arXiv Open Access 2022
Markovian Analysis of Coordination Strategies in Tandem Polling Queues with Setups

Ravi Suman, Ananth Krishnamurthy

We analyze a network of tandem polling queues with two stations operating under synchronized polling (SP) and out-of-sync polling (OP) strategies, and with nonzero setups. We conduct an exact analysis using a decomposition approach to compare performance in terms of throughput and mean waiting times, and to investigate when one strategy might be preferred over the other. We also numerically investigate the condition for stability of the network under the two strategies and show that the polling network is unstable when there is a bottleneck at downstream stations. We find that the SP strategy outperforms the OP strategy in product- and station-symmetric networks, while under certain settings of product and station asymmetry the OP strategy outperforms the SP strategy.

en cs.PF
arXiv Open Access 2020
Erlang Redux: An Ansatz Method for Solving the M/M/m Queue

Neil J. Gunther

This exposition presents a novel approach to solving an M/M/m queue for the waiting time and the residence time. The motivation comes from an algebraic solution for the residence time of the M/M/1 queue. The key idea is the introduction of an ansatz transformation, defined in terms of the Erlang B function, that avoids the more opaque derivation based on applied probability theory. The only prerequisite is an elementary knowledge of the Poisson distribution, which is already necessary for understanding the M/M/1 queue. The approach described here supersedes our earlier approximate morphing transformation.
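For orientation, the quantities named in this abstract (Erlang B, waiting time, residence time) can be computed with the standard textbook relations. The sketch below is not the paper's ansatz method: it uses the well-known Erlang B recursion and the usual conversion to Erlang C, and all function and parameter names are illustrative assumptions.

```python
def erlang_b(m, a):
    """Erlang B blocking probability for m servers and offered load a = lambda/mu.

    Uses the standard recursion B(0) = 1, B(k) = a*B(k-1) / (k + a*B(k-1)).
    """
    b = 1.0
    for k in range(1, m + 1):
        b = a * b / (k + a * b)
    return b


def erlang_c(m, a):
    """Erlang C probability that an arrival has to wait in an M/M/m queue."""
    rho = a / m  # per-server utilization
    if rho >= 1.0:
        raise ValueError("queue is unstable: need lambda < m * mu")
    b = erlang_b(m, a)
    return b / (1.0 - rho * (1.0 - b))


def mmm_times(m, arrival_rate, service_rate):
    """Return (mean waiting time, mean residence time) for an M/M/m queue."""
    a = arrival_rate / service_rate
    wait = erlang_c(m, a) / (m * service_rate - arrival_rate)
    return wait, wait + 1.0 / service_rate
```

With `m = 1` the residence time reduces to the familiar M/M/1 result `1 / (mu - lambda)`, which is a convenient sanity check.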

en cs.PF, cs.DC
arXiv Open Access 2018
A Comparative Evaluation of Log-Based Process Performance Analysis Techniques

Fredrik Milani, Fabrizio M. Maggi

Process mining has gained traction over the past decade and an impressive body of research has resulted in the introduction of a variety of process mining approaches measuring process performance. Having this set of techniques available, organizations might find it difficult to identify which approach is best suited considering context, performance indicator, and data availability. In light of this challenge, this paper aims at introducing a framework for categorizing and selecting performance analysis approaches based on existing research. We start from a systematic literature review for identifying the existing works discussing how to measure process performance based on information retrieved from event logs. Then, the proposed framework is built starting from the information retrieved from these studies taking into consideration different aspects of performance analysis.

en cs.PF, cs.SE
arXiv Open Access 2018
Performance analysis and optimization of the JOREK code for many-core CPUs

T. B. Fehér, M. Hölzl, G. Latu et al.

This report investigates the performance of the JOREK code on the Intel Knights Landing and Skylake processor architectures. The OpenMP scaling of the matrix construction part of the code was analyzed and improved synchronization methods were implemented. A new switch was implemented to control the number of threads used for the linear equation solver independently from other parts of the code. The matrix construction subroutine was vectorized, and the data locality was also improved. These steps led to a factor of two speedup for the matrix construction.

en cs.PF
arXiv Open Access 2018
An infinite-server queueing model MMAPkGk in semi-Markov random environment with marked MAP arrival and subject to catastrophes

K. Kerobyan, R. Covington, R. Kerobyan et al.

In the present paper, the infinite-server MMAPkGk queueing model with a random resource vector of customers, marked MAP arrivals and semi-Markov (SM) arrivals of catastrophes is considered. The joint probability generating functions (PGF) of the transient and stationary distributions of the number of busy servers and the numbers of served customers of different types, as well as the Laplace transforms (LT) of the joint distributions of total accumulated resources in the model at moment and total accumulated resources of served customers during time interval, are found. The basic differential and renewal equations for the transient and stationary PGF of queue sizes of customers are found.

en cs.PF, math.PR
arXiv Open Access 2017
Computation of gray-level co-occurrence matrix based on CUDA and its optimization

Huichao Hong, Lixin Zheng, Shuwan Pan

In fields such as scientific research and industrial applications, optimizing computation time is a task of increasing importance. Because of its highly parallel architecture, the graphics processing unit (GPU) is regarded as a powerful engine for application programs that demand fairly high computation capabilities. Based on this, an algorithm was introduced in this paper to optimize the method used to compute the gray-level co-occurrence matrix (GLCM) of an image, and strategies (e.g., "copying", "image partitioning", etc.) were proposed to optimize the parallel algorithm. Results indicate that, without losing computational accuracy, the GLCM computation of images with different resolutions on the GPU using CUDA was 50 times faster than the GLCM computation on the CPU, a significant improvement in performance.
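As a point of reference for what is being accelerated, a GLCM simply counts how often a gray level `i` occurs at a fixed pixel offset from a gray level `j`. The sketch below is a plain serial CPU version in Python, not the paper's CUDA algorithm; the function name, the `dx`/`dy` offset parameters, and the list-of-lists image representation are assumptions made for illustration.

```python
def glcm(image, levels, dx=1, dy=0):
    """Count co-occurrences of gray levels at offset (dx, dy).

    image  : list of rows, each a list of integer gray levels in [0, levels)
    levels : number of distinct gray levels
    Returns a levels x levels matrix where entry [i][j] is the number of
    pixel pairs with value i at (r, c) and value j at (r + dy, c + dx).
    """
    rows, cols = len(image), len(image[0])
    matrix = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            # Skip pairs whose second pixel falls outside the image.
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix
```

Every pixel pair contributes one independent increment, which is why the computation lends itself to parallel execution; a parallel version has to handle concurrent updates to the same matrix entry.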

en cs.PF
arXiv Open Access 2017
Theoretical Performance Analysis of Vehicular Broadcast Communications at Intersection and their Optimization

Tatsuaki Kimura, Hiroshi Saito

In this paper, we propose an optimization method for the broadcast rate in vehicle-to-vehicle (V2V) broadcast communications at an intersection on the basis of theoretical analysis. We consider a model in which locations of vehicles are modeled separately as queuing and running segments and derive key performance metrics of V2V broadcast communications via a stochastic geometry approach. Since these theoretical expressions are mathematically intractable, we developed closed-form approximate formulae for them. Using them, we optimize the broadcast rate such that the mean number of successful receivers per unit time is maximized. Because of the closed-form approximation, the optimal rate can be used as a guideline for a real-time control method, which cannot be achieved through time-consuming simulations. We evaluated our method through numerical examples and demonstrated its effectiveness.

en cs.PF
arXiv Open Access 2014
Modeling LRU caches with Shot Noise request processes

Emilio Leonardi, Giovanni Luca Torrisi

In this paper we analyze Least Recently Used (LRU) caches operating under the Shot Noise requests Model (SNM). The SNM was recently proposed to better capture the main characteristics of today's Video on Demand (VoD) traffic. We investigate the validity of Che's approximation through an asymptotic analysis of the cache eviction time. In particular, we provide a large deviation principle, a law of large numbers and a central limit theorem for the cache eviction time, as the cache size grows large. Finally, we derive upper and lower bounds for the "hit" probability in tandem networks of caches under Che's approximation.
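Che's approximation, which the abstract refers to, is easiest to see in its classical setting. The sketch below assumes independent Poisson requests with fixed per-item rates (the independent reference model), not the shot noise model studied in the paper: the characteristic (eviction) time `T` solves `sum_i (1 - exp(-rate_i * T)) = cache_size`, and item `i` is then hit with probability `1 - exp(-rate_i * T)`. Function names and the bisection tolerance are illustrative assumptions.

```python
import math


def che_characteristic_time(rates, cache_size, tol=1e-10):
    """Solve sum_i (1 - exp(-rate_i * T)) = cache_size for T by bisection."""
    if any(r <= 0 for r in rates):
        raise ValueError("all request rates must be positive")
    if not 0 < cache_size < len(rates):
        raise ValueError("cache_size must lie strictly between 0 and the number of items")

    def expected_occupancy(t):
        return sum(1.0 - math.exp(-r * t) for r in rates)

    lo, hi = 0.0, 1.0
    while expected_occupancy(hi) < cache_size:  # grow the bracket until it contains T
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = (lo + hi) / 2.0
        if expected_occupancy(mid) < cache_size:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def che_hit_probabilities(rates, cache_size):
    """Per-item hit probabilities under Che's approximation."""
    t = che_characteristic_time(rates, cache_size)
    return [1.0 - math.exp(-r * t) for r in rates]
```

By construction the per-item hit probabilities sum to the cache size; the overall hit ratio is their average weighted by request rate.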

en cs.PF
arXiv Open Access 2013
Stochastic Service Curve and Delay Bound Analysis: A Single Node Case

Yuming Jiang

A packet-switched network node with constant capacity (in bps) is considered, where packets within each flow are served in the first in first out (FIFO) manner. While this single node system is perhaps the simplest computer communication system, its stochastic service curve characterization and independent case analysis in the context of stochastic network calculus (snetcal) are still basic and many crucial questions surprisingly remain open. Specifically, when the input is a single flow, what stochastic service curve and delay bound does the node provide? When the considered flow shares the node with another flow, what stochastic service curve and delay bound does the node provide to the considered flow, and if the two flows are independent, can this independence be made use of and how? The aim of this paper is to provide answers to these fundamental questions.

en cs.PF, cs.NI
arXiv Open Access 2012
Best practices for HPM-assisted performance engineering on modern multicore processors

Jan Treibig, Georg Hager, Gerhard Wellein

Many tools and libraries employ hardware performance monitoring (HPM) on modern processors, and using this data for performance assessment and as a starting point for code optimizations is very popular. However, such data is only useful if it is interpreted with care, and if the right metrics are chosen for the right purpose. We demonstrate the sensible use of hardware performance counters in the context of a structured performance engineering approach for applications in computational science. Typical performance patterns and their respective metric signatures are defined, and some of them are illustrated using case studies. Although these generic concepts do not depend on specific tools or environments, we restrict ourselves to modern x86-based multicore processors and use the likwid-perfctr tool under the Linux OS.

en cs.PF, cs.DC
arXiv Open Access 2012
Reengineering multi tiered enterprise business applications for performance enhancement and reciprocal or rectangular hyperbolic relation of variation of data transportation time with row pre-fetch size of relational database drivers

Sridhar Sowmiyanarayanan

en cs.PF, cs.SE
