Processing in Memory (PIM) and similar terms such as Compute In Memory (CIM), Logic in Memory (LIM), In Memory Computing (IMC), and Near Memory Computing (NMC) have gained attention recently as a potentially ``revolutionary new'' technique. The truth, however, is that many examples of the technology go back over 60 years. This document provides an annotated bibliography of PIM technology that attempts to cover the whole time frame, and is organized to augment a forthcoming article.
Gerardo Bandera, Javier Salamero, Miquel Moreto
et al.
HUB format is an emerging technique that reduces the hardware cost and delay of operations that require round-to-nearest. RISC-V, in turn, is an open-source ISA that many companies currently use in their designs. This paper presents a tailored floating-point HUB adder implemented in the Sargantana RISC-V processor.
Côme Allart, Jean-Roch Coulon, André Sintzoff
et al.
A performance model of the CVA6 RISC-V processor is built to evaluate performance-related modifications before implementing them in RTL. Its accuracy is 99.2% on CoreMark. This model is used to evaluate a superscalar feature for CVA6. During the design phase, the model helped detect and fix performance bugs. The superscalar feature resulted in a CVA6 performance improvement of 40% on CoreMark.
This volume contains the proceedings of the 3rd Workshop on Open-Source Design Automation (OSDA) 2023, co-hosted with the Design, Automation, and Test in Europe (DATE) conference in Antwerp, Belgium, April 17, 2023.
Parametric measurements of pulsed output energy from the four-level exciplex pumped alkali laser (XPAL) for Cs-Ar, Cs-Kr, and Cs-Xe as a function of input pump energy and temperature show a strong dependence on temperature. All three Cs-rare gas mixtures show a $D_2$ line laser performance increase with temperature towards a peak efficiency, followed by a decrease as temperature is increased beyond a peak performance point temperature. Prior simulations of Cs-Ar XPAL measurements indicated that energy pooling from the $6\,^2P_{3/2}$ state of Cs was significant at higher temperature and it was hypothesized that the addition of temperature-dependent reaction rates may be important. This paper presents new BLAZE Multiphysics™ simulations using temperature-dependent energy pooling reaction rates baselined to available experimental rate data. Also included are photoionization and Penning ionization reactions. These new calculations for Cs-Ar and Cs-Xe (Cs-Kr not yet simulated) show that the inclusion of temperature-dependent energy pooling rates and the subsequent onset of significant ionization can explain the rise and fall of XPAL performance with temperature with reasonable accuracy. Further, while Cs-Xe has a much stronger absorption characteristic than Cs-Ar, simulations show that the energy well present in the Cs-Xe $B\,^2\Sigma_{1/2}^{+}$ state increases the fraction of the Cs-Xe B-state relative to the Cs-Ar B-state, thereby resulting in energy output levels of Cs-Xe similar to that of Cs-Ar.
In the literature, computer architectures are frequently claimed to be highly flexible, typically implying that trade-offs exist between flexibility and performance or energy efficiency. Processor flexibility, however, is not sharply defined, so these claims cannot be validated, nor can such hypothetical relations be fully understood and exploited in the design of computing systems. This paper is an attempt to introduce scientific rigour to the notion of flexibility in computing systems.
The MUX implementation of ternary half adders and full adders using predecessor and successor functions leads to the most efficient implementation, using the smallest transistor count. These designs are compared with the binary implementation of the corresponding half adders and full adders using the MUX technique or the typical complementary CMOS circuit style. The transistor count ratio between ternary and binary implementations is always greater than the information ratio ($\log_2(3)/\log_2(2) \approx 1.585$) between ternary and binary wires.
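As a rough behavioral illustration of the idea in the preceding abstract, the sketch below models a ternary half adder as a 3-to-1 multiplexer selected by one operand, with the other operand passed through unchanged, through a successor function, or through a predecessor function. It assumes unbalanced ternary digits (0, 1, 2); the names `mux3` and `ternary_half_adder` and the carry logic are illustrative assumptions, not the circuits from the paper.

```python
import math


def successor(x):
    """Ternary successor: 0 -> 1, 1 -> 2, 2 -> 0."""
    return (x + 1) % 3


def predecessor(x):
    """Ternary predecessor: 0 -> 2, 1 -> 0, 2 -> 1."""
    return (x - 1) % 3


def mux3(select, in0, in1, in2):
    """3-to-1 multiplexer controlled by a ternary select digit."""
    return (in0, in1, in2)[select]


def ternary_half_adder(a, b):
    """Behavioral half adder on unbalanced ternary digits (assumed encoding).

    Sum:   a = 0 passes b, a = 1 selects successor(b),
           a = 2 selects predecessor(b), since b + 2 = b - 1 (mod 3).
    Carry: 1 exactly when a + b >= 3.
    """
    total = mux3(a, b, successor(b), predecessor(b))
    carry = mux3(a, 0, 1 if b == 2 else 0, 1 if b >= 1 else 0)
    return total, carry


# Information carried by one ternary wire relative to one binary wire.
INFORMATION_RATIO = math.log2(3) / math.log2(2)

if __name__ == "__main__":
    for a in range(3):
        for b in range(3):
            print(a, b, ternary_half_adder(a, b))
    print(round(INFORMATION_RATIO, 3))
```

Selecting on one operand means the sum needs only the successor and predecessor of the other operand, which is the property the abstract credits for the low transistor count.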
In this work, we provide an overview of the hardware architecture of wireless power transfer (WPT), RFID, and wireless information and power transfer (WIPT) systems. The historical milestones of WPT, RFID, and WIPT and the structural differences among them are introduced.
The computing paradigm invented for processing a small amount of data on a single segregated processor cannot meet the challenges set by the present-day computing demands. The paper proposes a new computing paradigm (extending the old one to use several processors explicitly) and discusses some questions of its possible implementation. Some advantages of the implemented approach, illustrated with the results of a loosely-timed simulator, are presented.
The appearance of coprocessors, their disappearance through integration into the CPU, and their success or failure are examined by summarizing their characteristics, starting from the mainframes of the 1960s. The coprocessors reviewed in particular are the IBM 360 and CDC-6600 I/O processors, the Intel 8087 math coprocessor, the Cell processor, the Intel Xeon Phi coprocessors, the GPUs, the FPGAs, and the coprocessors of the manycores SW26010 and Pezy SC-2 used in high-ranked supercomputers in the TOP500 or Green500. The conditions for a coprocessor to be viable in the medium or long term are defined.
This paper presents an overview of the main trends in processor architecture. It starts with an analysis of the past evolution of processors and the main driving forces behind it, and then it focuses on a description of the main architectural features of current processors. Finally, it presents a discussion on some promising directions for future evolution of processor architectures.
Hameedah Sultan, Shashank Varshney, Smruti R Sarangi
In this work, we present a study of the leakage power modeling techniques commonly used in the architecture community. We further provide an analysis of the error in leakage power estimation using the various modeling techniques. We strongly believe that this study will help researchers determine an appropriate leakage model to use in their work, based on the desired modeling accuracy and speed.
GRVI is an FPGA-efficient RISC-V RV32I soft processor. Phalanx is a parallel processor and accelerator array framework. Groups of processors and accelerators form shared memory clusters. Clusters are interconnected with each other and with extreme bandwidth I/O and memory devices by a 300-bit-wide Hoplite NOC. An example Kintex UltraScale KU040 system has 400 RISC-V cores, peak throughput of 100,000 MIPS, peak shared memory bandwidth of 600 GB/s, NOC bisection bandwidth of 700 Gbps, and uses 13 W.
This paper presents the FPGA hardware design of a turbo decoder for the cdma2000 standard. The work includes a study and mathematical analysis of the turbo decoding process, based on the MAX-Log-MAP algorithm. Results of decoding for a packet size of two hundred fifty bits are presented, as well as an analysis of area versus performance, and the key variables for hardware design in turbo decoding.
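For readers unfamiliar with the MAX-Log-MAP algorithm named above, the following minimal sketch shows the approximation it rests on: the exact Log-MAP combining operation is $\ln(e^a + e^b) = \max(a, b) + \ln(1 + e^{-|a-b|})$, and MAX-Log-MAP drops the correction term, leaving a plain maximum that is cheap in hardware. The function names are assumptions for illustration; this is not the decoder described in the paper.

```python
import math


def max_star(a, b):
    """Exact Log-MAP combining: ln(exp(a) + exp(b)) via the Jacobian logarithm."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_log(a, b):
    """MAX-Log-MAP approximation: the correction term is dropped."""
    return max(a, b)


if __name__ == "__main__":
    # The approximation error is largest when the two metrics are equal
    # (ln 2) and shrinks as they move apart.
    for a, b in [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]:
        print(a, b, max_star(a, b) - max_log(a, b))
```

The error never exceeds $\ln 2$, which is one reason the simplification is attractive where area matters more than the last fraction of a decibel in decoding performance.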
Svetlana Chumachenko, Wajeb Gharibi, Anna Hahanova
et al.
A novel approach to evaluating hardware and software testability, represented in the form of a register transfer graph, is proposed. Examples of constructing software graph models for subsequent testing and diagnosis are shown.
This article exhibits a particular encoding of logic circuits into a sheaf formalism. The central result of this article is that there exists strictly more information available to a circuit designer in this setting than exists in static truth tables, but less than exists in event-level simulation. This information is related to the timing behavior of the logic circuits, and thereby provides a ``bridge'' between static logic analysis and detailed simulation.
This article describes the architecture of a lossless wavelet filter bank implemented in reprogrammable logic. It is based on second-generation wavelets with a reduced number of operations. A new basic structure for a parallel architecture, together with modules for the forward and backward integer discrete wavelet transform, is proposed.
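Second-generation wavelets are usually built with the lifting scheme, which makes an integer-to-integer, exactly invertible transform straightforward. The sketch below assumes the simplest case, one level of an integer Haar (S-transform) lifting step; the abstract does not say which wavelet the article uses, so the filter choice and the function names are illustrative assumptions.

```python
def forward_lifting(signal):
    """One level of an integer Haar lifting transform (assumed wavelet).

    Predict step: detail = odd - even
    Update step:  approx = even + floor(detail / 2)
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    even = signal[0::2]
    odd = signal[1::2]
    detail = [o - e for e, o in zip(even, odd)]
    approx = [e + (d // 2) for e, d in zip(even, detail)]
    return approx, detail


def inverse_lifting(approx, detail):
    """Undo the lifting steps in reverse order; reconstruction is exact."""
    even = [s - (d // 2) for s, d in zip(approx, detail)]
    odd = [d + e for e, d in zip(even, detail)]
    signal = []
    for e, o in zip(even, odd):
        signal.extend((e, o))
    return signal


if __name__ == "__main__":
    samples = [5, 2, -3, 7, 0, 0, 9, 1]
    approx, detail = forward_lifting(samples)
    print(approx, detail)
    print(inverse_lifting(approx, detail) == samples)
```

Because the inverse subtracts exactly the same rounded quantity that the forward transform added, no information is lost to rounding, which is what makes the filter bank lossless while using only additions, subtractions, and shifts.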