Sonu Kumar, Mukul Lokhande, Santosh Kumar Vishvakarma
Spiking Neural Networks (SNNs) offer a promising solution for energy-efficient edge intelligence; however, their hardware deployment is constrained by memory overhead, inefficient scaling operations, and limited parallelism. This work proposes L-SPINE, a low-precision SIMD-enabled spiking neural compute engine for efficient edge inference. The architecture features a unified multi-precision datapath supporting 2-bit, 4-bit, and 8-bit operations, leveraging a multiplier-less shift-add model for neuron dynamics and synaptic accumulation. Implemented on an AMD VC707 FPGA, the proposed neuron requires only 459 LUTs and 408 FFs, achieving a critical delay of 0.39 ns and 4.2 mW power. At the system level, L-SPINE achieves 46.37K LUTs, 30.4K FFs, 2.38 ms latency, and 0.54 W power. Compared to CPU and GPU platforms, it reduces inference latency from seconds to milliseconds, achieving up to a three-orders-of-magnitude improvement in energy efficiency. Quantisation analysis shows that INT2/INT4 configurations significantly reduce memory footprint with minimal accuracy loss. These results establish L-SPINE as a scalable and efficient solution for real-time edge SNN deployment.
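The multiplier-less shift-add neuron model described above can be illustrated with a minimal sketch. The integer membrane representation, the power-of-two leak (implemented as an arithmetic right shift), and the parameter values below are illustrative assumptions, not L-SPINE's actual datapath:

```python
def lif_step(v, spikes_in, weights, leak_shift=4, threshold=256):
    """One multiplier-less LIF update: integer adds for synaptic
    accumulation and a right shift for the leak (assumed parameters)."""
    # Synaptic accumulation: add the weight of each presynaptic spike (no multiplies)
    for s, w in zip(spikes_in, weights):
        if s:
            v += w
    # Leak as an arithmetic right shift: v -= v / 2^leak_shift
    v -= v >> leak_shift
    # Fire and reset when the membrane potential crosses the threshold
    if v >= threshold:
        return 0, 1
    return v, 0
```

Because both the accumulation and the leak reduce to adds and shifts, a hardware neuron of this form needs no multiplier, which is what keeps the per-neuron LUT/FF cost low.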
Large Language Models (LLMs) are powerful but incur high memory and computation costs. Quantization is an effective solution, with INT weights and FP activations being widely adopted to preserve accuracy. Prior works further reduce FP overhead by using block floating point (BFP) activations in linear layers, but fail to extend BFP to attention layers due to severe accuracy degradation, limiting overall efficiency. To address this challenge, we propose Harmonia, an algorithm-hardware co-design framework that enables all-layer BFP activations with a configurable hardware architecture. First, we systematically explore BFP configurations to achieve a better trade-off between accuracy and activation compression across all layers. Second, to reduce KV-cache storage and computation in attention layers, we introduce an asymmetric bit-allocation strategy combined with a hybrid offline-online outlier smoothing technique. This allows aggressive KV-cache compression from FP16 to 4-bit-mantissa BFP with only 0.3% average accuracy loss. Third, to fully exploit all-layer BFP activations, we design dedicated hardware components, including a reconfigurable PE supporting mixed data formats (BFP-INT and BFP-BFP), a real-time FP16-to-BFP converter, and a tiling-aware dataflow to reduce memory traffic. We evaluate Harmonia on GEMM operations in both linear and attention layers across eight widely used LLMs. Compared with prior works, Harmonia achieves 3.84x (up to 5.05x) higher area efficiency, 2.03x (up to 3.90x) better energy efficiency, and 3.08x (up to 4.62x) speedup on average.
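The core idea of block floating point mentioned above is to give a whole block of values one shared exponent plus a low-bit signed mantissa per element. A minimal sketch, assuming a simple max-magnitude exponent choice and round-to-nearest mantissas (not Harmonia's actual converter):

```python
import math

def to_bfp(block, mantissa_bits=4):
    """Quantize a block of FP values to block floating point (BFP):
    one shared exponent plus a low-bit signed mantissa per element."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0:
        return 0, [0] * len(block)
    # Shared exponent chosen so the largest value fits in the mantissa range
    shared_exp = math.floor(math.log2(max_abs)) - (mantissa_bits - 2)
    scale = 2.0 ** shared_exp
    qmax = 2 ** (mantissa_bits - 1) - 1
    mantissas = [max(-qmax - 1, min(qmax, round(x / scale))) for x in block]
    return shared_exp, mantissas

def from_bfp(shared_exp, mantissas):
    """Dequantize: every element shares the same power-of-two scale."""
    return [m * 2.0 ** shared_exp for m in mantissas]
```

Since the scale is a single power of two per block, BFP dot products reduce to integer multiply-accumulates plus one exponent add, which is what makes BFP-INT and BFP-BFP PEs cheap in hardware.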
Recent advancements in computer vision, driven by artificial intelligence, have significantly enhanced monitoring systems. One notable application is traffic monitoring, which leverages computer vision alongside deep learning-based object detection and counting. We present an offline, real-time traffic monitoring system that couples a pre-trained YOLOv11 detector with BoT-SORT/ByteTrack for multi-object tracking, implemented in PyTorch/OpenCV and wrapped in a Qt-based desktop UI. The CNN pipeline enables efficient vehicle detection and counting from video streams without cloud dependencies. Across diverse scenes, the system achieves 66.67-95.83% counting accuracy. Class-wise detection yields high precision (cars: 0.97-1.00; trucks: 1.00) with strong recall (cars: 0.82-1.00; trucks: 0.70-1.00), resulting in F1 scores of 0.90-1.00 for cars and 0.82-1.00 for trucks. While adverse weather conditions may negatively impact this performance, results remain robust in typical conditions. By integrating lightweight models with an accessible, cloud-independent interface, this paper contributes to the modernization and development of future smart cities by showing the capacity of AI-driven traffic monitoring systems.
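A common way to turn per-frame tracker output into vehicle counts is virtual-line crossing: a track is counted once when its centroid moves across a counting line between frames. This is a generic sketch of that logic, not the paper's implementation; the function and parameter names are illustrative:

```python
def update_counts(prev_y, counted, line_y, detections):
    """Count tracks whose centroid crosses a virtual horizontal line.
    `detections` maps track_id -> centroid y in the current frame;
    `prev_y` and `counted` carry state across frames."""
    count = 0
    for tid, y in detections.items():
        if tid in prev_y and tid not in counted:
            # Crossing: previous centroid above the line, current at/below it
            if prev_y[tid] < line_y <= y:
                counted.add(tid)
                count += 1
        prev_y[tid] = y
    return count
```

Keying the count on stable track IDs (as provided by BoT-SORT/ByteTrack) rather than raw detections is what prevents the same vehicle from being counted in every frame.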
Hardware-software co-design systems abound in diverse modern application areas such as automobile control, telecommunications, big data processing and cloud computing. Reliability of a hardware-software co-design system relates to why, when and how system hardware or software failures occur. Existing works on reliability modeling of such co-design systems have mostly assumed hardware and software subsystems behave independently of each other. However, these two subsystems may have significant interactions in practice. In this thesis, we develop Markov- and integration-based approaches to analyze reliability of hardware-software co-design systems considering interactions between hardware and software during the system performance degradation and failure process. The proposed integration approach is applicable to arbitrary types of time-to-failure distributions. Case studies on systems without and with warm standby sparing are performed to illustrate the proposed methodologies. Effects of different component parameters on system performance are also investigated through examples.
Enhancing resilience and robustness in multi-agent systems (MAS) under complex attack patterns and dynamic environments remains a formidable challenge. This paper proposes the Graph Diffusion Reinforcement Recovery (GDRR) framework, which integrates multi-scale diffusion (MSD) embedding with dependency-aware hierarchical proximal policy optimization (DAPPO) to optimize resilience under multi-dimensional perturbations. MSD captures both local and global structural dependencies by performing multi-hop information diffusion over heterogeneous graph topologies, thereby improving feature robustness and fault tolerance. DAPPO dynamically prunes the action space by evaluating the importance of collaboration chains, enabling the agent to prioritize critical recovery operations. Experimental evaluations on cooperative swarm systems, such as simulated UAV swarm networks, demonstrate that GDRR outperforms existing methods by 1.54% to 2.91% in resilience under 50% attack intensity across four distinct adversarial scenarios. These results highlight that the collaborative integration of MSD and DAPPO enables GDRR to achieve enhanced resilience and rapid recovery in real-world MAS deployments.
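Multi-hop information diffusion of the kind MSD performs can be sketched as aggregating node features over successive powers of a normalized transition matrix, with a decaying weight per hop. The normalization, hop count, and decay schedule below are illustrative assumptions, not GDRR's exact formulation:

```python
import numpy as np

def multi_scale_diffusion(adj, features, hops=3, decay=0.5):
    """Aggregate node features over multiple diffusion scales:
    out = sum_k decay^k * (D^-1 A)^k X, mixing local and global structure."""
    deg = adj.sum(axis=1, keepdims=True)
    p = adj / np.maximum(deg, 1)          # row-normalized transition matrix
    h = features.copy().astype(float)
    out = h.copy()
    for k in range(1, hops + 1):
        h = p @ h                          # one more hop of diffusion
        out += (decay ** k) * h
    return out
```

Low-order terms dominate the local neighborhood structure while higher powers mix in global topology, which is what gives the embedding its multi-scale character.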
Computer engineering. Computer hardware, Information technology
In current deep learning-based smart contract vulnerability detection solutions, directly using bytecode or source code as a textual sequence feature representation fails to capture program semantics comprehensively. Smart contract vulnerability detection based on Abstract Syntax Tree (AST) embedding fully considers the syntactic and semantic features needed for contract vectorization and an appropriate processing granularity, enabling more accurate capture of smart contract vulnerability features. The proposed approach first employs Solidity syntax tree parsing to design a smart-contract vectorization method based on AST embedding. It partitions node types recursively at the statement level to generate sequences of statement trees. Subsequently, a recursive neural network is employed to encode each statement tree from the bottom up, transforming the intricate AST structure into statement-level feature vectors. Building on this foundation, a Bidirectional Gated Recurrent neural network model with an Attention mechanism (BiGRU-ATT) is constructed. This facilitates the learning of features from the sequences of statement trees and accomplishes the detection and categorization of five typical vulnerabilities: re-entrancy, unchecked return values, timestamp dependency, access control, and denial-of-service attacks. Experimental results demonstrate that the proposed method improves the micro-F1 and macro-F1 metrics by 13 and 10 percentage points, respectively, compared to direct vectorization of source code as a text sequence. In the timestamp dependency, access control, and denial-of-service attack vulnerability classification tasks, the BiGRU-ATT model with the attention mechanism achieves an F1 value of over 88%.
Alberto Avila-Jimenez, David Barreda, Sarah-Laurie Evans
et al.
We study the computational power of the Full-Tilt model of motion planning, where slidable polyominos are moved maximally around a board by way of a sequence of directional ``tilts.'' We focus on the deterministic scenario in which the tilts constitute a repeated clockwise rotation. We show that general-purpose computation is possible within this framework by providing a direct and efficient simulation of space-bounded Turing machines in which one computational step of the machine is simulated per $O(1)$ rotations. We further show that the initial tape of the machine can be programmed by an initial tilt-sequence preceding the rotations. This result immediately implies new PSPACE-completeness results for the well-studied problems of \emph{occupancy} (deciding if a given board location can be occupied by a tile), \emph{vacancy} (deciding if a location can be emptied), \emph{relocation} (deciding if a tile can be moved from one location to another), and \emph{reconfiguration} (can a given board configuration be reconfigured into a second given configuration) that hold even for deterministically repeating tilt cycles such as rotations. All of our PSPACE-completeness results hold even when there is only a single domino in the system beyond singleton tiles. Following, we show that these results work in the Single-Step tilt model for larger constant cycles. We then investigate computational efficiency by showing a modification to implement a two-tape Turing machine in the Full-Tilt model and Systolic Arrays in the Single-Step model. Finally, we show a cyclic implementation for tilt-efficient Threshold Circuits.
Hans Rosenberger, Rodrigo Fischer, Johanna S. Fröhlich
et al.
As state of the art neural networks (NNs) continue to grow in size, their resource-efficient implementation becomes ever more important. In this paper, we introduce a compression scheme that reduces the number of computations required for NN inference on reconfigurable hardware such as FPGAs. This is achieved by combining pruning via regularized training, weight sharing and linear computation coding (LCC). Contrary to common NN compression techniques, where the objective is to reduce the memory used for storing the weights of the NNs, our approach is optimized to reduce the number of additions required for inference in a hardware-friendly manner. The proposed scheme achieves competitive performance for simple multilayer perceptrons, as well as for large scale deep NNs such as ResNet-34.
Cédric Goemaere, Gaspard Oliviers, Rafal Bogacz
et al.
Predictive Coding (PC) offers a brain-inspired alternative to backpropagation for neural network training, described as a physical system minimizing its internal energy. However, in practice, PC is predominantly digitally simulated, requiring excessive amounts of compute while struggling to scale to deeper architectures. This paper reformulates PC to overcome this hardware-algorithm mismatch. First, we uncover how the canonical state-based formulation of PC (sPC) is, by design, deeply inefficient in digital simulation, inevitably resulting in exponential signal decay that stalls the entire minimization process. Then, to overcome this fundamental limitation, we introduce error-based PC (ePC), a novel reparameterization of PC which does not suffer from signal decay. Though no longer biologically plausible, ePC numerically computes exact PC weights gradients and runs orders of magnitude faster than sPC. Experiments across multiple architectures and datasets demonstrate that ePC matches backpropagation's performance even for deeper models where sPC struggles. Besides practical improvements, our work provides theoretical insight into PC dynamics and establishes a foundation for scaling PC-based learning to deeper architectures on digital hardware and beyond.
Research in quantum machine learning has recently proliferated due to the potential of quantum computing to accelerate machine learning. An area of machine learning that has not yet been explored is neural ordinary differential equation (neural ODE) based residual neural networks (ResNets), which aim to improve the effectiveness of neural networks using the principles of ordinary differential equations. In this work, we present our insights about why analog Rydberg atom quantum computers are especially well-suited for ResNets. We also introduce ResQ, a novel framework to optimize the dynamics of Rydberg atom quantum computers to solve classification problems in machine learning using analog quantum neural ODEs.
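The connection between neural ODEs and ResNets that motivates the work above is that an Euler discretization of dx/dt = f(x, t) produces exactly the residual form x + F(x) of a ResNet block. A minimal sketch (the step count and test dynamics are illustrative, not ResQ's quantum formulation):

```python
def ode_residual_forward(x, f, t0=0.0, t1=1.0, steps=4):
    """Euler-discretize dx/dt = f(x, t): each step x += h*f(x, t)
    has the same form as a ResNet residual block x + F(x)."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        x = x + h * f(x, t)   # one residual block per Euler step
        t += h
    return x
```

In the neural-ODE view, a deep residual network is the many-step limit of this loop, which is why continuous-time dynamics (such as those of an analog Rydberg system) can stand in for a stack of residual blocks.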
Video face detection and recognition in public places at the edge is required in several applications, such as security reinforcement and contactless access to authorized venues. This paper aims to maximize the simultaneous usage of the hardware engines available in today's edge GPUs by leveraging the concurrency and pipelining of the tasks required for face detection and recognition. This also includes the video decoding task, which is required in most face monitoring applications as the video streams are usually carried over a Gbps Ethernet network. This constitutes an improvement over previous works, where the tasks are usually allocated to a single engine due to the lack of a unified and automated framework that simultaneously explores all hardware engines. In addition, previous works usually embedded the input faces in still images or raw video streams, overlooking the burst delay caused by the decoding stage. The results on real-life video streams show that, by simultaneously using all the hardware engines available in the recent NVIDIA edge Orin GPU, higher throughput and a slight power saving of around 300 mW (about 5%) are achieved while satisfying the real-time performance constraint. The performance gets even higher when several video streams are considered simultaneously. Further performance improvement could have been obtained if the TensorRT framework had created fewer shuffle layers for the face recognition task. Thus, the paper suggests some hardware improvements to existing edge GPU processors to further enhance their performance.
To enhance prediction accuracy for nonlinear time series, this paper puts forward an improved deep Echo State Network based on reservoir state reconstruction, driven by a Self-Normalizing Activation (SNA) function that replaces the traditional hyperbolic tangent activation function to reduce the model's sensitivity to hyper-parameters. The strategy is implemented as a two-stage reconstruction process. First, the time series data are input to the model; as the data pass through the reservoirs and are activated by the SNA function, a new reservoir state is created, which is fed to the next layer and saved by the state concatenation module. Pairs of states are then selected from the activated multi-layer reservoirs and input into the state reconstruction module, where the input states are transformed and finally saved to the state concatenation module. Two evaluation metrics were used to benchmark the model against three other ESNs with SNA activation functions, demonstrating better prediction accuracy.
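A single reservoir update with a self-normalizing activation in place of tanh can be sketched as below. The leaky-integration form, the SELU constants, and all parameter names are illustrative assumptions, not the paper's exact model:

```python
import numpy as np

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    """Self-normalizing activation (SELU form with its standard constants)."""
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1))

def esn_step(state, u, w_in, w_res, leak=0.3):
    """One leaky reservoir update with a self-normalizing activation
    replacing tanh: x' = (1 - a) * x + a * selu(W_in u + W_res x)."""
    pre = w_in @ u + w_res @ state
    return (1 - leak) * state + leak * selu(pre)
```

Unlike tanh, SELU does not saturate for positive inputs and pushes activations toward zero mean and unit variance, which is the usual motivation for using it to reduce hyper-parameter sensitivity.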
This study investigates the effect of the black level and color gamut on the overall display brightness using a psychophysical experiment. Two stimuli with different color gamuts were matched until the overall brightness appeared the same by controlling the peak display luminance. Three experimental sessions were conducted with varying black levels and color gamuts. The results indicate that the effect of the color gamut on perceived brightness was significant, unlike that of the black level. The matched luminance increased by 6% to 8% as the color gamut widened from sRGB to DCI-P3. Equivalent-luminance prediction models were tested using the average stimulus luminance, but exhibited poor performance.
Theodoros Trochatos, Chuanqi Xu, Sanjay Deshpande
et al.
The cloud-based environments in which today's and future quantum computers will operate raise concerns about the security and privacy of users' intellectual property. Quantum circuits submitted to cloud-based quantum computer providers represent sensitive or proprietary algorithms developed by users that need protection. Further, input data is hard-coded into the circuits, and leakage of the circuits can expose users' data. To help protect users' circuits and data from possibly malicious quantum computer cloud providers, this work presents the first hardware architecture for a trusted execution environment for quantum computers. To protect the user's circuits and data, the quantum computer control pulses are obfuscated with decoy control pulses. While digital data can be encrypted, analog control pulses cannot, and this paper proposes a novel decoy pulse approach to obfuscate the analog control pulses. The proposed decoy pulses can easily be added in software by users. Meanwhile, the hardware components of the proposed architecture eliminate, i.e., attenuate, the decoy pulses inside the superconducting quantum computer's dilution refrigerator before they reach the qubits. The hardware architecture also contains tamper-resistant features to protect the trusted hardware and users' information. The work leverages a new metric of variational distance to analyze the impact and scalability of hardware protection. The variational distance of the circuits protected with our scheme, compared to unprotected circuits, is in the range of only $0.16$ to $0.26$. This work demonstrates that protection from possibly malicious cloud providers is feasible and that all the hardware components needed for the proposed architecture are available today.
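Assuming the variational distance above follows the standard total-variation definition over measurement outcome distributions (the paper may define it differently), it can be computed as half the L1 distance between the two distributions:

```python
def variational_distance(p, q):
    """Total variation distance between two outcome distributions
    (dicts mapping bitstring -> probability): 0.5 * sum |p(x) - q(x)|.
    Ranges from 0 (identical) to 1 (disjoint support)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

Under this definition, a value of 0.16 to 0.26 means the protected circuit's output distribution stays close to the unprotected one while the decoy pulses obscure the control stream.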
Lei Zhao, Aishwarya Natarajan, Luca Buonanno
et al.
Transformer models represent the cutting edge of Deep Neural Networks (DNNs) and excel in a wide range of machine learning tasks. However, processing these models demands significant computational resources and results in a substantial memory footprint. While In-memory Computing (IMC) offers promise for accelerating Vector-Matrix Multiplications (VMMs) with high computational parallelism and minimal data movement, employing it for other crucial DNN operators remains a formidable task. This challenge is exacerbated by the extensive use of complex activation functions, Softmax, and data-dependent matrix multiplications (DMMuls) within Transformer models. To address this challenge, we introduce a Reconfigurable Analog Computing Engine (RACE) by enhancing Analog Content Addressable Memories (ACAMs) to support broader operations. Based on the RACE, we propose the RACE-IT accelerator (meaning RACE for In-memory Transformers) to enable efficient analog-domain execution of all core operations of Transformer models. Given the flexibility of our proposed RACE in supporting arbitrary computations, RACE-IT is well-suited for adapting to emerging and non-traditional DNN architectures without requiring hardware modifications. We compare RACE-IT with various accelerators. Results show that RACE-IT increases performance by 453x and 15x, and reduces energy by 354x and 122x over the state-of-the-art GPUs and existing Transformer-specific IMC accelerators, respectively.
To enhance the accuracy and generalization of the network intrusion detection model, this study proposes a network intrusion detection model based on the Gravitational Search Algorithm (GSA) and Differential Evolution (DE) algorithm to optimize the hybrid kernel Extreme Learning Machine (ELM). Aiming to improve the weak generalization and poor learning capabilities of ELM models with a single kernel function, this model combines the advantages of a polynomial kernel function and a radial basis function to construct the so-called Hybrid Kernel ELM (HKELM) model. Furthermore, GSA and DE are combined to optimize the parameters of HKELM, which improves its global and local optimization ability in anomaly detection. Then, the Kernel Principal Component Analysis (KPCA) algorithm is used for data dimensionality reduction and feature extraction from intrusion detection data. Finally, the proposed approach constructs a network intrusion detection model based on GSA- and DE-optimized hybrid kernel ELM (KPCA-GSADE-HKELM). Experimental results on the KDD99 dataset demonstrate that the KPCA-GSADE-HKELM model achieves a higher detection accuracy and faster detection speed compared with the KDDwinner, CSVAC, CPSO-SVM, and Dendron models.
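The hybrid kernel described above is typically a convex combination of a polynomial kernel (good global fitting) and an RBF kernel (good local fitting). A minimal sketch, with illustrative mixing weight and kernel hyper-parameters (the ones GSA/DE would optimize in the paper's method):

```python
import numpy as np

def hybrid_kernel(x, y, w=0.5, degree=2, coef0=1.0, gamma=0.1):
    """Weighted mix of a polynomial kernel and an RBF kernel:
    K(x, y) = w * (x.y + c)^d + (1 - w) * exp(-gamma * ||x - y||^2)."""
    poly = (np.dot(x, y) + coef0) ** degree
    rbf = np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))
    return w * poly + (1 - w) * rbf
```

A convex combination of valid kernels is itself a valid kernel, so the mixture can be dropped into the ELM kernel matrix directly; the optimizers then search over w, degree, coef0, and gamma.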
WANG Lijuan, ZHANG Lin, YIN Ming, HAO Zhifeng, CAI Ruichu, WEN Wen
The consistency of multiview data is important for multiview clustering. To achieve multiview data with better consistency, this paper proposes a new multiview clustering algorithm, OMTSC. The OMTSC algorithm simultaneously learns the cluster assignment matrix and feature embedding of each view. Each cluster assignment matrix can be decomposed into a shared orthogonal basis matrix and a cluster coding matrix. The orthogonal basis matrix can capture and store consistent multiview data and form latent cluster centers. A weighted multiview cluster coding matrix can effectively balance the quality differences of different views. Meanwhile, bipartite graph co-clustering is introduced to realize knowledge transfer among the cluster coding, feature embedding, and orthogonal basis. This improves multiview data consistency and diversity learning, and allows the OMTSC algorithm to leverage the diversity of feature embedding to maximize multiview consistency and learn the optimal latent cluster centers, thus further improving the performance of multiview clustering. In addition, the feature embedding based on group sparse constraints is robust to noise in view data. Experimental results on the WikipediaArticles, COIL20, and ORL datasets show that the OMTSC algorithm is superior to SC-Best, Co-Reg, and other advanced multiview clustering algorithms, yielding the highest score in all three evaluation indexes (ACC, NMI, and ARI); on the COIL20 and ORL datasets, the NMI of the OMTSC algorithm exceeds 0.9.
Yasuki Kansha, Satoko Horikoshi, Hikaru Kiyomoto
et al.
It has been reported that about 50% of the total fuel in an oil refinery plant is consumed in crude oil and vacuum distillation. Thus, the heat exchanger networks of crude oil distillation units significantly affect the overall energy efficiency and CO2 emissions of refinery plants. Crude oil fouling in heat exchanger networks is one of the most troublesome problems in crude oil refineries. It reduces heat transfer or blocks the flow in tubes, requiring additional fuel for the furnace downstream of the heat exchanger networks. Therefore, many cleaning methods have been developed. Mechanical cleaning of heat exchangers is the most effective method to mitigate fouling. However, the heat exchangers must be opened for cleaning, so the timing of mechanical cleaning is limited because normal refinery operation must be stopped. Therefore, to maintain good energy and economic performance of the refinery, it is necessary to predict the appropriate maintenance timing and to conduct suitable cleaning. To find a suitable cleaning schedule or timing, in this research the authors propose a method for predicting the near-future fouling resistance of crude oil fouling from actual online plant data. The prediction results are in good agreement with the measured results, demonstrating that the proposed method is effective for predicting cleaning timing in industry.
Chemical engineering, Computer engineering. Computer hardware