In this work we have enhanced gem5's GPU model support to add Matrix Core Engines (MCEs). Specifically, on the AMD MI200 and MI300 GPUs that gem5 supports, these MCEs perform Matrix Fused Multiply Add (MFMA) instructions for a variety of precisions. By adding this support, our changes enable running state-of-the-art ML workloads in gem5, as well as examining how MCE optimizations impact the behavior of future systems.
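As a rough illustration of what a Matrix Fused Multiply Add computes, the sketch below evaluates D = A x B + C on small tiles in plain Python. The function name `mfma` and the list-of-lists tile layout are assumptions made for this example; they do not reflect gem5's implementation or the actual MI200/MI300 instruction encodings, tile shapes, or precisions.

```python
def mfma(a, b, c):
    """Return D = A x B + C for small dense tiles given as lists of rows.

    Hypothetical helper for illustration only; real MFMA instructions
    operate on fixed tile shapes and precisions chosen by the opcode.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("A's column count must match B's row count")
    if len(c) != rows or any(len(row) != cols for row in c):
        raise ValueError("C must have the shape of A x B")
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = c[i][j]  # start from the accumulator tile
            for k in range(inner):
                acc += a[i][k] * b[k][j]
            d[i][j] = acc
    return d
```

For example, `mfma([[1, 2], [3, 4]], [[5, 6], [7, 8]], [[1, 1], [1, 1]])` yields `[[20, 23], [44, 51]]`.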
This paper presents an innovative approach to ensuring functional stability of neural networks through hardware redundancy at the individual neuron level. Unlike the classical Dropout method, which is used during training for regularization purposes, the proposed system ensures resilience to hardware failures during network operation. Each neuron is implemented on a separate microcomputer (ESP32), allowing the system to continue functioning even when individual computational nodes fail.
This paper presents a new solution to the challenge of increasing memory usage in high-performance computing simulations based on Lattice-Boltzmann or Finite-Volume schemes. Our approach utilises a lossy compression scheme based on the Discrete Wavelet Transform (DWT) to achieve high compression ratios while preserving the accuracy of the simulation. Our evaluation on two different FV/LBM schemes demonstrates that the approach can reduce memory usage by several orders of magnitude.
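The idea of DWT-based lossy compression can be sketched in a few lines. The example below assumes a one-dimensional Haar wavelet and simple hard thresholding of small coefficients; the abstract does not say which wavelet, dimensionality, or coefficient-selection rule the authors use, so `haar_forward`, `haar_inverse`, and `compress` are illustrative names only.

```python
def haar_forward(signal):
    """Multi-level (unnormalised) Haar DWT; len(signal) must be a power of two."""
    out = list(signal)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        diff = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = avg + diff  # averages first, then detail coefficients
        n = half
    return out


def haar_inverse(coeffs):
    """Invert haar_forward."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged += [a + d, a - d]
        out[:2 * n] = merged
        n *= 2
    return out


def compress(signal, threshold):
    """Zero every coefficient whose magnitude is below the threshold (lossy)."""
    return [c if abs(c) >= threshold else 0.0 for c in haar_forward(signal)]
```

Smooth fields produce many near-zero detail coefficients, so only the few surviving values need to be stored. For the piecewise-constant input `[4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0]`, only two of the eight coefficients are non-zero and the reconstruction is exact.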
In this paper, we examine a ready-to-use, robust, and computationally fast fixed-size memory pool manager with no loops and no memory overhead that is well suited to time-critical systems such as games. The algorithm achieves this by exploiting the unused memory slots for bookkeeping in combination with a trouble-free indexing scheme. We explain how it works using straightforward step-by-step examples. Furthermore, we measure how much faster the memory pool manager is than a system allocator (e.g., malloc) over a range of allocations and sizes.
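A minimal sketch of the bookkeeping idea follows, assuming a pool backed by a single byte buffer in which every free block stores the index of the next free block in its first four bytes. The class name `FixedPool`, the 4-byte little-endian index, and the lazy initialisation counter are assumptions for illustration, not the paper's code; allocation returns a block index rather than a pointer.

```python
import struct


class FixedPool:
    """Fixed-size block pool whose free blocks hold the free-list links."""

    def __init__(self, block_size, num_blocks):
        if block_size < 4:
            raise ValueError("block_size must hold a 4-byte index")
        self.block_size = block_size
        self.num_blocks = num_blocks
        self.mem = bytearray(block_size * num_blocks)
        self.num_free = num_blocks
        self.num_initialized = 0  # blocks threaded into the free list so far
        self.next_free = 0        # index of the head of the free list

    def _write_index(self, block, value):
        struct.pack_into("<I", self.mem, block * self.block_size, value)

    def _read_index(self, block):
        return struct.unpack_from("<I", self.mem, block * self.block_size)[0]

    def allocate(self):
        """Return a free block index, or None if the pool is exhausted."""
        # Thread at most one new block per call, so no setup loop is needed.
        if self.num_initialized < self.num_blocks:
            self._write_index(self.num_initialized, self.num_initialized + 1)
            self.num_initialized += 1
        if self.num_free == 0:
            return None
        block = self.next_free
        self.num_free -= 1
        # num_blocks acts as an "end of list" sentinel when nothing is left.
        self.next_free = self._read_index(block) if self.num_free else self.num_blocks
        return block

    def free(self, block):
        """Push a block back on the free list; its own bytes store the link."""
        self._write_index(block, self.next_free)
        self.next_free = block
        self.num_free += 1
```

Both operations run in constant time, and the only storage beyond the pool itself is a handful of counters, because the links live inside blocks that are not currently in use.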
Yuze Chi, Weikang Qiao, Atefeh Sohrabizadeh, et al.
In the past few years, domain-specific accelerators (DSAs), such as Google's Tensor Processing Units, have been shown to offer significant performance and energy-efficiency gains over general-purpose CPUs. An important question is whether typical software developers can design and implement their own customized DSAs, affordably and efficiently, to accelerate their applications. This article presents our answer to this question.
The first total synthesis of a naturally occurring sesquiterpenoid, (+)-ar-macrocarpene, has been achieved via a key [3,3]-sigmatropic rearrangement effecting reductive transposition through allylic diazene rearrangement (ADR).
This work studies the influence of temperature on the performance and scalability of 3D Chip Multiprocessors (CMPs) from the perspective of Amdahl's law. We find that a 3D CMP may reach its thermal limit before reaching its maximum power. We show that a high level of parallelism may lead to high peak temperatures even in small-scale 3D CMPs, thus limiting 3D CMP scalability and calling for different, in-memory computing architectures.
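For reference, the classical form of Amdahl's law on which such an analysis builds can be written as a short function. The thermal model used in the study is not given in the abstract, so the sketch below covers only the standard speedup formula; the name `amdahl_speedup` and its parameters are chosen for this example.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Classical Amdahl speedup: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)
```

With 95% of the work parallelisable, 64 cores give a speedup of only about 15.4, which shows how quickly the serial fraction dominates even before any thermal constraint is considered.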
This paper describes the design of a 1024-core processor chip in 16nm FinFET technology. The chip ("Epiphany-V") contains an array of 1024 64-bit RISC processors, 64MB of on-chip SRAM, three 136-bit wide mesh Networks-On-Chip, and 1024 programmable IO pins. The chip has taped out and is being manufactured by TSMC. This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
The Baugh-Wooley algorithm is a well-known iterative algorithm for performing multiplication in digital signal processing applications. Decomposition logic is used with the Baugh-Wooley algorithm to increase speed and reduce the critical path delay. In this paper, a high-speed multiplier is designed and implemented using decomposition logic and the Baugh-Wooley algorithm. The result is compared with a Booth multiplier. An FPGA-based architecture is presented, and the design has been implemented using a Xilinx 12.3 device.
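The arithmetic behind Baugh-Wooley multiplication can be checked with a small software model. The sketch below assumes the commonly used modified form for n-bit two's-complement operands, in which the partial products involving a sign bit are complemented and two constant bits (at positions n and 2n-1) are added. It models only the partial-product arithmetic; the decomposition logic and the FPGA architecture from the paper are not represented, and the function name is hypothetical.

```python
def baugh_wooley_multiply(a, b, n):
    """Multiply two n-bit two's-complement integers via Baugh-Wooley terms."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1))
    if not (lo <= a < hi and lo <= b < hi):
        raise ValueError("operands must fit in n-bit two's complement")
    mask = (1 << n) - 1
    a_bits = [((a & mask) >> i) & 1 for i in range(n)]
    b_bits = [((b & mask) >> i) & 1 for i in range(n)]

    total = 0
    # Magnitude-bit partial products are added unchanged.
    for i in range(n - 1):
        for j in range(n - 1):
            total += (a_bits[i] & b_bits[j]) << (i + j)
    # Sign bit times sign bit carries a positive weight.
    total += (a_bits[n - 1] & b_bits[n - 1]) << (2 * n - 2)
    # Cross terms with exactly one sign bit are complemented.
    for i in range(n - 1):
        total += (1 - (a_bits[i] & b_bits[n - 1])) << (i + n - 1)
        total += (1 - (a_bits[n - 1] & b_bits[i])) << (i + n - 1)
    # Constant correction bits, then wrap to 2n bits.
    total += (1 << n) + (1 << (2 * n - 1))
    total &= (1 << (2 * n)) - 1
    # Reinterpret the 2n-bit result as a signed value.
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total
```

Because every term is added with a positive weight, a hardware array built this way needs only adders and no subtractors, which is the usual motivation for this formulation.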
This volume contains the papers accepted at the Second International Workshop on FPGAs for Software Programmers (FSP 2015), held in London, United Kingdom, September 1st, 2015. FSP 2015 was co-located with the International Conference on Field Programmable Logic and Applications (FPL).
In the past, efforts were made to improve the performance of a processor via frequency scaling. However, industry has reached the limits of increasing the frequency, and therefore concurrent execution of instructions on multiple cores seems the only possible option. It is not enough for the hardware to provide concurrent execution; software must also introduce concurrency in order to exploit the parallelism.
Fluid Stochastic Petri Nets are used to capture the dynamic behavior of an ILP processor, and discrete-event simulation is applied to assess the performance potential of predictions and speculative execution in boosting the performance of ILP processors that fetch, issue, execute and commit a large number of instructions per cycle.
In this paper, we present an SRAM-PCM hybrid cache design, along with a cache replacement policy named dead fast block (DFB) to manage the hybrid cache. This design aims to leverage the best features of both SRAM and PCM devices. Compared to a PCM-only cache, the hybrid cache with the DFB policy provides superior results on all relevant evaluation metrics, viz. cache lifetime, performance and energy efficiency. Also, using the DFB policy to manage the hybrid cache provides better results than the LRU replacement policy on all the evaluation metrics.
Microprocessors have revolutionized the world we live in, and continuous efforts are being made to manufacture not only faster chips but also smarter ones. A number of techniques, such as data-level parallelism, instruction-level parallelism and hyper-threading (Intel's HT), already exist and have dramatically improved the performance of microprocessor cores. This paper briefly reviews the evolution of multi-core processors and then introduces the technology and its advantages in today's world. The paper concludes by detailing the challenges currently faced by multi-core processors and how the industry is trying to address these issues.
An approach that applies an associative processor to decision-making problems is proposed. It focuses on hardware implementations of fuzzy processing systems, with associativity as an effective management basis for the fuzzy processor. The structural approach being developed results in a quite simple and compact parallel associative memory unit (PAMU). A comparison of the memory cost and speed of processors with rigid and soft-variable structures is given. An example of PAMU flashing is also considered.
Many existing thermal management techniques focus on reducing the overall power consumption of the chip and do not address location-specific temperature problems, referred to as hotspots. We propose the use of dynamic runtime reconfiguration to shift the hotspot-inducing computation periodically and make the thermal profile more uniform. Our analysis shows that dynamic reconfiguration is an effective technique for reducing hotspots in NoCs.