Modular SAIL: dream or reality?
Petr Kourzanov, Anmol
To truly benefit from RISC-V ISA modularity, the community has to address the issue of compositionality, going beyond modules at the specification level to cover larger subsets of the RISC-V development flow, including emulation, simulation and verification. In this paper we introduce modular SAIL, an experiment to inject compositionality into the SAIL-RISCV golden model. We show that it is, in principle, not difficult to adapt the SAIL-RISCV flow (and ideally the SAIL compiler itself) to support modules at the emulator level. We back our findings with a comparative study of the resulting pluggable emulator's performance using both static and dynamic binding, both of which exhibit the same functional behavior as the original monolithic emulator (aka RISC-V ISS).
FoldedHexaTorus: An Inter-Chiplet Interconnect Topology for Chiplet-based Systems using Organic and Glass Substrates
Patrick Iff, Maciej Besta, Torsten Hoefler
Chiplet-based systems are rapidly gaining traction in the market. Two packaging options for such systems are the established organic substrates and the emerging glass substrates. These substrates are used to implement the inter-chiplet interconnect (ICI), which is crucial for overall system performance. To guide the development of ICIs, we introduce three design principles for ICI network topologies on organic and glass substrates. Based on our design principles, we propose the novel FoldedHexaTorus network topology. Our evaluation shows that the FoldedHexaTorus achieves significantly higher throughput than state-of-the-art topologies while maintaining low latency.
Regular mixed-radix DFT matrix factorization for in-place FFT accelerators
Sergey Salishev
We consider a generic vector-memory-based accelerator that supports DIT and DIF FFT with a fixed datapath. We propose a regular mixed-radix factorization of the DFT matrix that is coherent with the accelerator architecture and present a proof of its correctness. The factorization gives a better understanding of the architecture requirements and simplifies developing, and proving the correctness of, more complicated algorithms and conflict-free addressing schemes.
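As a rough illustration of what a mixed-radix factorization computes, the sketch below performs one textbook Cooley-Tukey step for N = n1 * n2 and can be checked against a direct DFT. It is not the regular factorization proposed in the paper, and it says nothing about in-place or conflict-free addressing; the function names `dft` and `mixed_radix_dft` are hypothetical.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used here as the reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def mixed_radix_dft(x, n1, n2):
    """One Cooley-Tukey step for N = n1 * n2.

    Index maps (assumed for this sketch):
      input  j = j1 + n1 * j2,  output k = n2 * k1 + k2,
      with j1, k1 in [0, n1) and j2, k2 in [0, n2).
    """
    n = n1 * n2
    assert len(x) == n
    # Step 1: n1 DFTs of length n2 over the strided sub-sequences x[j1::n1].
    inner = [dft(x[j1::n1]) for j1 in range(n1)]
    # Step 2: twiddle factors W_N^(j1 * k2).
    for j1 in range(n1):
        for k2 in range(n2):
            inner[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)
    # Step 3: n2 DFTs of length n1 across j1, written to k = n2 * k1 + k2.
    out = [0j] * n
    for k2 in range(n2):
        column = dft([inner[j1][k2] for j1 in range(n1)])
        for k1 in range(n1):
            out[n2 * k1 + k2] = column[k1]
    return out
```

The three steps correspond to a product of a block-diagonal DFT stage, a diagonal twiddle matrix, and a second DFT stage combined with a permutation, which is the general shape of DFT matrix factorizations of this kind.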
Accelerating Computer Architecture Simulation through Machine Learning
Wajid Ali, Ayaz Akram
This paper presents our approach to accelerate computer architecture simulation by leveraging machine learning techniques. Traditional computer architecture simulations are time-consuming, making it challenging to explore different design choices efficiently. Our proposed model utilizes a combination of application features and micro-architectural features to predict the performance of an application. These features are derived from simulations of a small portion of the application. We demonstrate the effectiveness of our approach by building and evaluating a machine learning model that offers significant speedup in architectural exploration. This model demonstrates the ability to predict IPC values for the testing data with a root mean square error of less than 0.1.
PIMSIM-NN: An ISA-based Simulation Framework for Processing-in-Memory Accelerators
Xinyu Wang, Xiaotian Sun, Yinhe Han
et al.
Processing-in-memory (PIM) has shown extraordinary potential in accelerating neural networks. To evaluate the performance of PIM accelerators, we present an ISA-based simulation framework including a dedicated ISA targeting neural networks running on PIM architectures, a compiler, and a cycle-accurate configurable simulator. Compared with prior works, this work decouples software algorithms and hardware architectures through the proposed ISA, providing a more convenient way to evaluate the effectiveness of software/hardware optimizations. The simulator adopts an event-driven simulation approach and has better support for hardware parallelism. The framework is open-sourced at https://github.com/wangxy-2000/pimsim-nn.
Optimized Implementation of Neuromorphic HATS Algorithm on FPGA
Khushal Sethi, Manan Suri
In this paper, we present the first optimized hardware implementation on an FPGA of the Histogram of Averaged Time Surfaces (HATS) algorithm, a state-of-the-art neuromorphic approach to event-based object classification for asynchronous time-based image sensors (ATIS). Our implementation achieves a latency of 3.3 ms for the N-CARS dataset samples and is capable of processing 2.94 Mevts/s. The speed-up is achieved by exploiting parallelism in the design, and multiple Processing Elements can be added. A Zynq-7000 SoC from Xilinx is used as the development platform. The tradeoff between Average Absolute Error and Resource Utilization for the fixed-precision implementation is analyzed and presented. The proposed FPGA implementation is $\sim$32x more power efficient than a software implementation.
Two New CNTFET Quaternary Full Adders for Carry-Propagate Adders
Daniel Etiemble
In Carry Propagate Adders, carry propagation is the critical delay. For the 1-digit adders that they use, the most efficient scheme is to generate two intermediate carries, $C_{out0}$ ($C_{in}$=0) and $C_{out1}$ ($C_{in}$=1), and then multiplex them to produce the correct output according to $C_{in}$. For any radix, the carry output always has a logical value of 0 or 1. We show that using 0 and $V_{dd}$ levels for input and output carries instead of 0 and $V_{dd}$/3 in quaternary full adders significantly reduces the carry propagation delay. We compare such a quaternary full adder with binary full adders to implement N-digit carry propagate adders.
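The two-carry scheme can be sketched behaviorally as follows. This is only a functional model under stated assumptions: it says nothing about the CNTFET circuits or voltage levels, and the names `RADIX`, `digit_add_carry_select` and `carry_propagate_add` are hypothetical.

```python
RADIX = 4  # quaternary digits 0..3


def digit_add_carry_select(a, b, c_in):
    """1-digit adder: precompute both outcomes, then select by c_in."""
    # Candidate results for c_in = 0 and c_in = 1. They do not depend on
    # c_in, so in hardware they can be ready before the carry arrives.
    s0, c_out0 = (a + b) % RADIX, (a + b) // RADIX
    s1, c_out1 = (a + b + 1) % RADIX, (a + b + 1) // RADIX
    # The carry is 0 or 1 in any radix: a + b + 1 <= 2 * (RADIX - 1) + 1.
    return (s1, c_out1) if c_in else (s0, c_out0)


def carry_propagate_add(xs, ys, c_in=0):
    """Add two equal-length little-endian digit lists; returns (digits, carry)."""
    digits, carry = [], c_in
    for a, b in zip(xs, ys):
        s, carry = digit_add_carry_select(a, b, carry)
        digits.append(s)
    return digits, carry


# 333 (base 4) + 1 ripples a carry through every digit.
print(carry_propagate_add([3, 3, 3], [1, 0, 0]))  # prints ([0, 0, 0], 1)
```

In this model the only work on the carry chain of each digit is the final selection, which is why the delay of that multiplexing step dominates the propagation time.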
Holistic Hardware Security Assessment Framework: A Microarchitectural Perspective
Tochukwu Idika, Ismail Akturk
Our goal is to enable holistic hardware security evaluation from the microarchitectural point of view. To achieve this, we propose a framework that categorizes threat models based on the microarchitectural components being targeted, and provides a generic security metric that can be used to assess the vulnerability of components, as well as the system as a whole.
Optimization Techniques to Improve Inference Performance of a Forward Propagating Neural Network on an FPGA
Matthew Joseph Adiletta, Brian Flanagan
This paper describes an optimized implementation of a previously trained Forward Propagating Classification Neural Network. The implementation highlights a novel means of using Python scripts to generate a Verilog hardware implementation. Its characteristics include optimizations to scale input data, the use of selected addends instead of multiplication functions, hardware-friendly activation functions, and simplified output selection. An inference performance comparison of a 28x28 pixel 'hand-written' recognition NN between a software implementation on an Intel i7 and a Xilinx FPGA is detailed.
Appearances of the Birthday Paradox in High Performance Computing
Victor Eijkhout, Margaret Myers, John McCalpin
We give an elementary statistical analysis of two High Performance Computing issues, processor cache mapping and network port mapping. In both cases we find that, as in the birthday paradox, random assignment leads to more frequent coincidences than one expects a priori. Since these correspond to contention for limited resources, this phenomenon has important consequences for performance.
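The birthday-paradox effect is easy to reproduce numerically. The sketch below computes the probability that at least two of `n_items` uniformly random assignments land in the same one of `n_slots` slots; the function name and the reading of slots as cache sets or network ports are illustrative assumptions, not the authors' analysis.

```python
def collision_probability(n_items, n_slots):
    """P(at least two of n_items uniform random picks share a slot)."""
    if n_items > n_slots:
        return 1.0  # pigeonhole: a coincidence is certain
    p_all_distinct = 1.0
    for i in range(n_items):
        # The (i+1)-th item must avoid the i slots already taken.
        p_all_distinct *= (n_slots - i) / n_slots
    return 1.0 - p_all_distinct
```

For the classic case of 23 people and 365 days the result exceeds one half. The same function applied to, say, a handful of messages mapped at random onto a few dozen ports shows that contention is far from rare.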
Sparse Matrix Multiplication on CAM Based Accelerator
Leonid Yavits, Ran Ginosar
Sparse matrix multiplication is an important component of linear algebra computations. In this paper, an architecture based on Content Addressable Memory (CAM) and Resistive Content Addressable Memory (ReCAM) is proposed for accelerating sparse matrix by sparse vector and matrix multiplication in CSR format. Using functional simulation, we show that the proposed ReCAM-based accelerator exhibits two orders of magnitude higher power efficiency as compared to existing sparse matrix-vector multiplication implementations.
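For readers unfamiliar with the CSR format, a plain software reference for sparse matrix by vector multiplication is sketched below. It is not a model of the CAM or ReCAM architecture; it uses a dense input vector for brevity, and the array names `values`, `col_idx` and `row_ptr` are one common naming convention, assumed here.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row r occupies values[row_ptr[r]:row_ptr[r + 1]]
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for p in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[p] * x[col_idx[p]]
        y.append(acc)
    return y


# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]
values = [1, 2, 3, 4, 5]
col_idx = [0, 2, 2, 0, 1]
row_ptr = [0, 2, 3, 5]
print(csr_matvec(values, col_idx, row_ptr, [1, 2, 3]))  # prints [7, 9, 14]
```

The inner loop looks up `x[col_idx[p]]` for each stored nonzero. That index-matching step is the kind of operation an associative memory can perform in parallel.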
A Many-Core Overlay for High-Performance Embedded Computing on FPGAs
Mário Véstias, Horácio Neto
In this work, we propose a configurable many-core overlay for high-performance embedded computing. The size of internal memory, supported operations and number of ports can be configured independently for each core of the overlay. The overlay was evaluated with matrix multiplication, LU decomposition and Fast-Fourier Transform (FFT) on a ZYNQ-7020 FPGA platform. The results show that using a system-level many-core overlay avoids complex hardware design and still provides good performance results.
Floorplanning and Topology Generation for Application-Specific Network-on-Chip
Bei Yu, Sheqin Dong, Song Chen
et al.
Network-on-chip (NoC) architectures have been proposed as a promising alternative to classical bus-based communication architectures. In this paper, we propose a two-phase framework to solve the application-specific NoC topology generation problem. In the floorplanning phase, we carry out partition-driven floorplanning. In the post-floorplanning phase, a heuristic method and a min-cost max-flow algorithm are used to insert switches and network interfaces. Finally, we allocate paths to minimize power consumption. The experimental results show that our algorithm is effective in saving power.
Application Specific Cache Simulation Analysis for Application Specific Instruction set Processor
Ravi Khatwal, Manoj Kumar Jain
Efficient simulation of application-specific instruction-set processors (ASIPs) is a challenging task in VLSI design. This paper explores the use of ASIP simulators for ASIP simulation. The study enables simulation of cache memory designs with ASIP simulators such as SimpleScalar and VEX. We configure the memory according to the target application. The simulators report cache-related results such as cache name, number of sets, associativity, block size, and replacement policy for the specific application.
Multi-Amdahl: Optimal Resource Sharing with Multiple Program Execution Segments
Tsahee Zidenberg, Isaac Keslassy, Uri Weiser
This paper presents Multi-Amdahl, a resource allocation analytical tool for heterogeneous systems. Our model includes multiple program execution segments, where each one is accelerated by a specific hardware unit. The acceleration speedup of the specific hardware unit is a function of a limited resource, such as the unit area, power, or energy. Using the Lagrange theorem we discover the optimal resource distribution between all specific units. We then illustrate this general Multi-Amdahl technique using several examples of area and power allocation among several cores and accelerators.
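To make the Lagrange argument concrete, the sketch below works through one assumed special case: each segment i takes time t_i / sqrt(a_i), where a_i is the area given to its unit and the areas sum to a fixed budget A. The square-root speedup is an illustrative assumption, not the paper's general model. Setting the derivative of sum_i t_i / sqrt(a_i) + lambda * (sum_i a_i - A) to zero gives a_i proportional to t_i^(2/3).

```python
import math


def total_time(t, a):
    """Execution time when segment i runs at speedup sqrt(a[i]) (assumed)."""
    return sum(ti / math.sqrt(ai) for ti, ai in zip(t, a))


def optimal_areas(t, total_area):
    """Closed-form optimum from the Lagrange condition.

    d/da_i [t_i * a_i**-0.5] = -lambda  =>  a_i is proportional to t_i**(2/3),
    then scale so that the areas sum to total_area.
    """
    weights = [ti ** (2.0 / 3.0) for ti in t]
    scale = total_area / sum(weights)
    return [w * scale for w in weights]


t = [8.0, 1.0]                    # hypothetical segment times
best = optimal_areas(t, 10.0)     # approximately [8.0, 2.0]
print(total_time(t, best) < total_time(t, [5.0, 5.0]))  # prints True
```

Because the objective is convex in the areas, the stationary point found this way is the global minimum, so the segment with the larger share of execution time receives more area, but less than proportionally.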
Noise Limited Computational Speed
Luca Gammaitoni
In modern transistor-based logic gates, the impact of noise on computation has become increasingly relevant, since voltage scaling, aimed at decreasing the dissipated power, has increased the probability of error due to the reduced switching threshold voltages. In this paper we discuss the role of noise in a two-state model that mimics the dynamics of standard logic gates and show that the presence of noise sets a fundamental limit to the computing speed. We derive an optimal idle time interval that minimizes the error probability.
SOC Testing Methodology and Practice
Cheng-Wen Wu
We apply a novel SOC test integration platform to a commercial digital still camera (DSC) controller chip, solving real problems in test scheduling, test IO reduction, timing of functional test, scan IO sharing, embedded memory built-in self-test (BIST), etc. The chip has been fabricated and tested successfully using our approach. Test results show that low test integration cost, short test time, and small area overhead can be achieved. To support SOC testing, a memory BIST compiler and an SOC testing integration system have been developed.
A Memory Aware High Level Synthesis Tool
Gwenolé Corre, Nathalie Julien, Eric Senn
et al.
We introduce a new approach to take into account the memory architecture and the memory mapping in High-Level Synthesis for data-intensive applications. We formalize the memory mapping as a set of constraints for the synthesis, and define a Memory Constraint Graph and an accessibility criterion to be used in the scheduling step. We use a memory mapping file to include those memory constraints in our HLS tool GAUT. With the help of GAUT, it is possible to explore a wide range of solutions and to reach a good tradeoff between time, power consumption, and area.
Scalability Terminology: Farms, Clones, Partitions, Packs, RACS and RAPS
Bill Devlin, Jim Gray, Bill Laing
et al.
Defines a vocabulary for scalable systems (Geoplexes, Farms, Clones, RACS, RAPS, partitions, and packs) and discusses the design tradeoffs of using clones, partitions, and packs.
On the Information Engine of Circuit Design
Denis V. Popel, Nawar Al-Hakeem
This paper presents a new approach to finding a spectrum of information measures for the process of digital circuit synthesis. We consider the problem from the information engine point of view. The circuit synthesis as a whole and the different steps of the design process (an example of a decision diagram is given) are described via such measures as entropy, logical work and information vitality. We also introduce new information measures to provide better estimates of synthesis criteria. We show that the basic properties of the information engine, such as the conservation law of information flow and the equilibrium law of information, can be formulated.