Abstract The Epstein-Barr virus (EBV) oncogene latent membrane protein 1 (LMP1) is essential for B-cell transformation into continuously growing lymphoblastoid cell lines. LMP1 traffics to the plasma membrane and intracellular signaling sites to mimic aspects of signaling by the B-cell co-receptor CD40. LMP1 is expressed in many EBV-associated cancers, including post-transplant lymphoma, Hodgkin lymphoma, T/NK lymphoma, and nasopharyngeal carcinoma, where it activates key growth and survival pathways. LMP1 signaling is also implicated in multiple sclerosis pathogenesis. To identify host dependency factors that support LMP1 trafficking and signaling, we performed a human genome-wide CRISPR-Cas9 screen in B cells. The screen identified both known and previously uncharacterized mediators of LMP1 signaling. The ER-resident protein STEEP1, implicated in trafficking and signaling of the DNA sensor STING, was a top screen hit. Importantly, STEEP1 did not score in our prior B-cell CRISPR screen for factors that support CD40 signaling, suggesting specificity. STEEP1 depletion strongly impaired LMP1 signaling, including activation of the NF-κB and MAP kinase pathways. Mechanistically, STEEP1 associated with LMP1 in a manner dependent on the LMP1 N-terminal cytoplasmic tail and supported LMP1 egress from the ER to signaling sites in both B and epithelial cells. Collectively, these findings reveal STEEP1 as a key host factor that supports trafficking of newly synthesized LMP1 molecules to intracellular signaling sites and highlight the LMP1/STEEP1 interaction as a novel therapeutic target. Importance Epstein-Barr virus (EBV) infects most people worldwide. While infection is often benign, it causes infectious mononucleosis, is associated with a range of lymphomas as well as nasopharyngeal and gastric carcinoma, and is a major trigger for autoimmune disease, including multiple sclerosis. 
The EBV-encoded oncogene LMP1 is a key driver of EBV pathogenesis, and its signaling is necessary for viral immortalization of B lymphocytes into continuously growing lymphoblastoid cell lines (LCLs). Here, we performed a CRISPR genetic screen to identify host factors that support continuous, ligand-independent signaling by LMP1. This analysis identified an ER-resident protein, STEEP1, previously implicated in supporting trafficking of the DNA sensor STING, as a key LMP1 partner. We found that STEEP1 associates with LMP1 and supports LMP1 trafficking out of the endoplasmic reticulum to cellular signaling sites. As STEEP1 knockout impaired LMP1 function and LCL survival, our study identifies the STEEP1/LMP1 complex as a therapeutic target.
We present a lightweight tool for analyzing and tuning application data placement in systems with heterogeneous memory pools. The tool non-intrusively identifies, analyzes, and controls the placement of individual allocations of the application. We use the tool to analyze a set of benchmarks running on the Intel Sapphire Rapids platform with both HBM and DDR memory. The paper also contains an analysis of the performance of both memory subsystems in terms of read/write bandwidth and latency. A key part of the analysis focuses on performance when both subsystems are used together. We show that only about 60% to 75% of the data must be placed in HBM memory to achieve 90% of the potential performance of the platform on those benchmarks.
Matrix multiplication is the foundation of much of the success of high-performance technologies such as deep learning, scientific simulation, and video graphics. High-level programming languages like Python and R rely on highly optimized low-level libraries, such as the Basic Linear Algebra Subprograms (BLAS), for core linear algebra operations like matrix multiplication. This paper compares the performance of five different matrix multiplication implementations using cuBLAS, CUDA, BLAS, OpenMP, and C++ Threads. We find statistical significance, with a p-value below 5e-12, supporting the hypothesis that for square $N \times N$ matrices with $N$ at least 10,000, the performance ranking as measured in floating-point operations per second (FLOPS) is, in decreasing order: cuBLAS, CUDA, BLAS, OpenMP, and C++ Threads.
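As a small illustration of the metric used in comparisons like this one, the sketch below times a dense multiply and converts elapsed time into FLOPS using the standard 2N^3 operation count. It deliberately uses a naive pure-Python loop rather than any of the five implementations the paper benchmarks, so the absolute numbers it prints are only illustrative.

```python
import time

def matmul_gflops(n: int, trials: int = 3) -> float:
    """Time a naive n x n matrix multiply and report GFLOPS.

    A dense n x n multiply performs about 2*n**3 floating-point
    operations (n multiplies and ~n adds per output element), so
    throughput is 2*n**3 / elapsed_seconds.
    """
    a = [[1.0] * n for _ in range(n)]
    b = [[1.0] * n for _ in range(n)]
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        c = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        best = min(best, time.perf_counter() - t0)
    assert c[0][0] == float(n)  # sanity: row of ones times column of ones
    return 2 * n**3 / best / 1e9

print(f"{matmul_gflops(64):.4f} GFLOPS")
```

Swapping the naive loop for an optimized BLAS call (and GPU timing for the CUDA variants) is what turns this toy into the kind of measurement the paper's comparison is built on.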
This book, by Molero, Juiz, and Rodeno, titled Performance Evaluation and Modeling of Computer Systems, presents a comprehensive summary of simple quantitative techniques that help answer the above questions. Its approach is not one of theory for theory's sake; rather, in each chapter, after a brief theoretical review, it delves deeply into numerous problems grouped into three categories: those with complete worked solutions, those for which only the final answer is given, and, finally, those whose resolution is left to the reader. Although some of the solved problems may be considered purely academic in terms of complexity, they should not be underestimated, as they reveal, on a reduced scale, the process that must be followed, with the help of appropriate tools, to solve equivalent real-world problems of industrial scale.
As the volume of data being produced grows exponentially and must be processed quickly, the data needs to be available close to the compute devices to reduce transfer latency. Because of this need, local filesystems are receiving close attention to understand their inner workings, performance, and, more importantly, their limitations. This study analyzes several popular Linux filesystems (EXT4, XFS, Btrfs, ZFS, and F2FS) by creating, storing, and then reading back one billion files from the local filesystem. The study also captured and analyzed read/write throughput, storage block usage, disk space utilization and overheads, and other metrics useful for system designers and integrators. Furthermore, the study explored side effects such as filesystem performance degradation during and after the creation of this large number of files and folders.
AI models are increasing in size, and recent advances in the community have shown that, unlike HPC applications where double-precision datatypes are required, lower-precision datatypes such as fp8 or int4 are sufficient to achieve the same model quality for both training and inference. Following these trends, GPU vendors such as NVIDIA and AMD have added hardware support for fp16, fp8, and int8 GEMM operations with exceptional performance via Tensor Cores. However, this paper proposes a new algorithm called msGeMM which shows that AI models with low-precision datatypes can run with ~2.5x fewer multiply and add instructions. An efficient implementation of this algorithm requires special CUDA cores with the ability to add elements from a small look-up table at the rate of Tensor Cores.
Having just made the cover of Science (PMID 34726479) and being a strong candidate for drug of the year (2021 or 2022?), the SARS-CoV-2 antiviral main-protease inhibitor PF-07321332 needs no introduction. It does, however, present a very topical name-to-structure (n2s) example to track through various sources. I have included different profiles from which we can distill aspects of interest.
A FORTRAN program to simulate the operation of infinite-server queues is presented in this work. Poisson arrival processes are considered, but not exclusively. For many parameters of interest in the study or application of queueing systems, either no theoretical results exist or, where they do, they are mathematically intractable, which makes their utility doubtful. In such cases, a possible approach is to use simulation methods to obtain more useful results. Indeed, using simulation, experiments may be performed and their results used to form conjectures about quantities of interest in certain queueing systems. In this paper this procedure is followed to learn more about quantities of interest for infinite-server queue systems, in particular busy-period parameters and probability distributions.
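The busy-period quantities discussed above are straightforward to estimate by simulation. The sketch below (a minimal Python stand-in, not the paper's FORTRAN program) handles the M/M/∞ case: since every arrival gets its own server, the system's busy periods are simply the maximal merged intervals of the union of per-customer occupancy intervals. The rates and horizon are arbitrary illustration values; for the M/G/∞ queue the mean busy period is known to be $(e^{\rho}-1)/\lambda$ with $\rho=\lambda E[S]$, which the simulation can be checked against.

```python
import math
import random

def busy_periods(lam, mean_service, horizon, seed=1):
    """Simulate an M/M/inf queue and return its busy-period lengths.

    Every arrival gets its own server, so the system is busy exactly
    on the union of the per-customer intervals [arrival, departure);
    busy periods are the maximal merged intervals of that union.
    """
    rng = random.Random(seed)
    t, intervals = 0.0, []
    while True:
        t += rng.expovariate(lam)          # Poisson arrivals
        if t >= horizon:
            break
        intervals.append([t, t + rng.expovariate(1.0 / mean_service)])
    merged = []
    for start, end in intervals:           # starts arrive already sorted
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [end - start for start, end in merged]

periods = busy_periods(lam=1.0, mean_service=0.5, horizon=10_000)
rho = 1.0 * 0.5
print(f"simulated mean busy period: {sum(periods) / len(periods):.3f}")
print(f"M/G/inf theory (exp(rho)-1)/lam: {math.exp(rho) - 1:.3f}")
```

With a long enough horizon the simulated mean should land close to the theoretical value of about 0.649 for these parameters, which is exactly the kind of conjecture-then-check workflow the abstract describes.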
We investigate the performance of two size-based routing policies: Size Interval Task Assignment (SITA) and Task Assignment based on Guessing Size (TAGS). We consider a system with two servers and Bounded Pareto distributed job sizes with tail parameter 1, where the difference between the size of the largest and the smallest job is finite. We show that the ratio of the mean waiting time of TAGS to the mean waiting time of SITA is unbounded when the largest job size is large and the arrival rate times the largest job size is less than one. We provide numerical experiments showing that our theoretical findings extend to Bounded Pareto distributed job sizes with tail parameter different from 1.
We present a parallel profiling tool, GAPP, that identifies serialization bottlenecks in parallel Linux applications arising from load imbalance or contention for shared resources. It works by tracing kernel context switch events using kernel probes managed by the extended Berkeley Packet Filter (eBPF) framework. The overhead is thus extremely low (an average 4% run-time overhead for the applications explored), the tool requires no program instrumentation and works for a variety of serialization bottlenecks. We evaluate GAPP using the Parsec3.0 benchmark suite and two large open-source projects: MySQL and Nektar++ (a spectral/hp element framework). We show that GAPP is able to reveal a wide range of bottleneck-related performance issues, for example arising from synchronization primitives, busy-wait loops, memory operations, thread imbalance and resource contention.
We consider the problem of scheduling serverless-computing instances such as Amazon Lambda functions, or scheduling microservices within (privately held) virtual machines (VMs). Instead of a fixed quota per tenant/customer, we assume demand for Lambda functions is modulated by token-bucket mechanisms per tenant. Such quotas are due to, e.g., limited resources (as in a fog/edge-cloud context) or to prevent excessive unauthorized invocation of numerous instances by malware. Based on an upper bound on the stationary number of active "Lambda servers" that accounts for the execution-time distribution of Lambda functions, we describe an approach that the cloud could use to overbook Lambda functions for improved utilization of IT resources. An earlier bound for a single service tier is extended to multiple service tiers. In the context of scheduling microservices in a private setting, the framework could be used to determine the required VM resources for a token-bucket constrained workload stream. Finally, we note that the looser Markov inequality may be useful in settings where the job service times are dependent.
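The closing remark about the Markov inequality can be illustrated with a tiny generic sketch (not the paper's actual bound): for a nonnegative number of active instances $N$, $P(N \ge t) \le E[N]/t$ holds regardless of the service-time distribution or dependence, which also yields a crude capacity-sizing rule.

```python
def markov_tail_bound(mean_active: float, threshold: float) -> float:
    """Markov's inequality: P(N >= t) <= E[N] / t for nonnegative N."""
    return min(1.0, mean_active / threshold)

def capacity_for(mean_active: float, overflow_prob: float) -> float:
    """Smallest capacity t for which the Markov bound E[N]/t <= overflow_prob."""
    return mean_active / overflow_prob

# With a long-run average of 40 concurrently active instances,
# the chance of seeing 80 or more at once is at most 50%...
print(markov_tail_bound(40, 80))   # -> 0.5
# ...and provisioning 800 slots drives the bound below 5%.
print(capacity_for(40, 0.05))      # -> 800.0
```

The bound is loose by design, which is precisely why it survives dependent service times; distribution-aware bounds like the ones the paper develops can overbook far more aggressively.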
Efficiently exploiting the resources of data centers is a complex task that requires efficient and reliable load balancing and resource allocation algorithms. The former are in charge of assigning jobs to servers upon their arrival in the system, while the latter are responsible for sharing server resources between their assigned jobs. These algorithms should take account of various constraints, such as data locality, that restrict the feasible job assignments. In this paper, we propose a token-based mechanism that efficiently balances load between servers without requiring any knowledge of job arrival rates and server capacities. Assuming a balanced fair sharing of the server resources, we show that the resulting dynamic load balancing is insensitive to the job size distribution. Its performance is compared to that obtained under the best static load balancing and in an ideal system that would constantly optimize the resource utilization.
Miss-ratio curve (MRC), or equivalently hit-ratio curve (HRC), construction techniques have recently attracted the attention of many researchers. Recent advancements allow these curves to be approximated in constant time, enabling online working-set-size (WSS) measurement. Techniques span the algorithm-design spectrum from classic dynamic programming to artificial-intelligence-inspired methods. Our survey produces a broad classification of the current techniques based primarily on \emph{what} locality metric is recorded and \emph{how} that metric is stored for processing. Applications of these curves range from dynamic cache partitioning in the processor to improving block allocation at the operating-system level. Our survey gives an overview of the historical, exact MRC construction methods and compares them with the state-of-the-art methods in today's literature. In addition, we show where open areas of research remain and look forward to what this domain can produce with a strong theoretical background.
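As a concrete reference point for the "historical, exact" methods such a survey covers, the sketch below computes an exact LRU miss-ratio curve from reuse (stack) distances in the classic Mattson style; the O(NM) linear stack scan stands in for the balanced trees or sampling that practical tools use to make this fast.

```python
def miss_ratio_curve(trace, max_cache):
    """Exact LRU miss-ratio curve via reuse (stack) distances.

    The stack distance of an access is the number of distinct blocks
    touched since the previous access to the same block; an LRU cache
    of size c hits exactly when that distance is less than c.
    """
    stack = []                       # LRU stack, most recent first
    hist = [0] * (max_cache + 1)     # last bin: cold or too-deep misses
    for block in trace:
        if block in stack:
            d = stack.index(block)   # O(M) scan; real tools use trees
            hist[min(d, max_cache)] += 1
            stack.remove(block)
        else:
            hist[max_cache] += 1     # cold miss (infinite distance)
        stack.insert(0, block)
    n, hits, mrc = len(trace), 0, []
    for c in range(1, max_cache + 1):
        hits += hist[c - 1]          # accesses with distance < c hit
        mrc.append((n - hits) / n)
    return mrc

# Trace "abcabc": the second round hits only once the cache holds 3 blocks.
print(miss_ratio_curve("abcabc", 3))   # -> [1.0, 1.0, 0.5]
```

The distance histogram is the "what" and "how" of the survey's classification in miniature: record one locality metric per access, bucket it, and read the whole curve off the cumulative sum.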
Tilman Hüneke, Oliver-Alex Aderhold, Jannik Bounin et al.
Abstract. We report on a novel six-channel optical spectrometer (hereafter called the mini-DOAS instrument) for aircraft-borne nadir and limb measurements of atmospheric trace gases, liquid and solid water, and spectral radiances in the UV/vis and near-IR spectral ranges. The spectrometer was developed for measurements aboard the HALO (http://www.halo.dlr.de/) research aircraft during dedicated research missions. Here we report on the relevant instrumental details and the novel scaling method used to infer the mixing ratios of UV/vis-absorbing trace gases from their absorption measured in limb geometry. The uncertainties of the scaling method are assessed for NO2 and BrO measurements. First results are reported, along with complementary measurements and comparisons with model predictions, for a selected HALO research flight from Cape Town to Antarctica performed during the ESMVal research mission on 13 September 2012.
We present a comparative analysis of the maximum performance achieved by the Linpack benchmark on compute intensive hardware publicly available from multiple cloud providers. We study both performance within a single compute node, and speedup for distributed memory calculations with up to 32 nodes or at least 512 computing cores. We distinguish between hyper-threaded and non-hyper-threaded scenarios and estimate the performance per single computing core. We also compare results with a traditional supercomputing system for reference. Our findings provide a way to rank the cloud providers and demonstrate the viability of the cloud for high performance computing applications.
In this paper we develop a novel technique to analyze both isolated and interconnected caches operating under different caching strategies and realistic traffic conditions. The main strength of our approach is the ability to consider dynamic contents which are constantly added into the system catalogue, and whose popularity evolves over time according to desired profiles. We do so while preserving the simplicity and computational efficiency of models developed under stationary popularity conditions, which are needed to analyze several caching strategies. Our main achievement is to show that the impact of content popularity dynamics on cache performance can be effectively captured into an analytical model based on a fixed content catalogue (i.e., a catalogue whose size and objects' popularity do not change over time).
As a follow-up to a recent paper considering two symmetric queues, the \textit{Shortest Queue First} service discipline is presently analysed for two general asymmetric queues. Using the results previously established and assuming exponentially distributed service times, the bivariate Laplace transform of workloads in each queue is shown to depend on the solution $\mathbf{M}$ to a two-dimensional functional equation $$ \mathbf{M} = Q_1 \cdot \mathbf{M}\circ h_1 + Q_2 \cdot \mathbf{M}\circ h_2 + \mathbf{L} $$ with given matrices $Q_1$, $Q_2$ and vector $\mathbf{L}$, and where functions $h_1$ and $h_2$ are each defined on some rational curve; the solution $\mathbf{M}$ can then be represented by a series expansion involving the semi-group $\langle h_1, h_2 \rangle$ generated by these two functions. The empty-queue probabilities along with the tail behaviour of the workload distribution at each queue are characterised.
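One formal way to write the series expansion mentioned above, obtained simply by iterating the functional equation (a sketch from the stated fixed-point equation, not the paper's exact statement; convergence is ignored and the matrix products are ordered left to right), is

$$ \mathbf{M} \;=\; \sum_{k \ge 0} \; \sum_{(i_1,\dots,i_k) \in \{1,2\}^k} \Bigl( \prod_{j=1}^{k} Q_{i_j} \circ \bigl( h_{i_{j-1}} \circ \cdots \circ h_{i_1} \bigr) \Bigr) \, \mathbf{L} \circ \bigl( h_{i_k} \circ \cdots \circ h_{i_1} \bigr), $$

where the $k = 0$ term is $\mathbf{L}$ itself and the empty composition is the identity; each summand is indexed by a word in the semi-group $\langle h_1, h_2 \rangle$.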
We formulate the loop-free, binary superoptimization task as a stochastic search problem. The competing constraints of transformation correctness and performance improvement are encoded as terms in a cost function, and a Markov Chain Monte Carlo sampler is used to rapidly explore the space of all possible programs to find one that is an optimization of a given target program. Although our method sacrifices completeness, the scope of programs we are able to reason about, and the quality of the programs we produce, far exceed those of existing superoptimizers. Beginning from binaries compiled by llvm -O0 for 64-bit x86, our prototype implementation, STOKE, is able to produce programs which either match or outperform the code sequences produced by gcc with full optimizations enabled, and, in some cases, expert handwritten assembly.
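The search loop described above can be sketched generically (a Metropolis-style cost minimizer over a toy search space; this is not STOKE's actual program rewriter, and its cost function stands in for the correctness-plus-performance terms the abstract mentions):

```python
import math
import random

def mcmc_minimize(cost, propose, x0, steps=10_000, beta=1.0, seed=0):
    """Metropolis-style stochastic search: always accept a proposal
    that lowers the cost, and accept an uphill move with probability
    exp(-beta * delta), which lets the walk escape local minima."""
    rng = random.Random(seed)
    x, best = x0, x0
    for _ in range(steps):
        y = propose(x, rng)
        delta = cost(y) - cost(x)
        if delta <= 0 or rng.random() < math.exp(-beta * delta):
            x = y                    # accept the rewrite
            if cost(x) < cost(best):
                best = x             # keep the best candidate seen
    return best

# Toy stand-in for "program" search: integers with a quadratic cost,
# where a proposal is a single small random rewrite (step of +/-1).
best = mcmc_minimize(lambda v: (v - 42) ** 2,
                     lambda v, rng: v + rng.choice([-1, 1]),
                     x0=0)
print(best)
```

In the superoptimization setting, `propose` mutates an instruction sequence (swap an opcode, change an operand, delete an instruction) and `cost` combines a correctness term from testing against the target with a performance estimate, but the accept/reject skeleton is the same.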