A. Fiat, Adi Shamir
Results for "Architecture"
Showing 20 of ~2412458 results · from arXiv, DOAJ, Semantic Scholar
Pooyan Jamshidi, C. Pahl, N. Mendonça et al.
Sung‐Choon Kang, Scott A. Snell
Aojie Jiang, Kang Zhu, Zhiheng Zhang et al.
In-network computing techniques, exemplified by NVLink SHARP (NVLS), offer a promising approach to addressing the communication bottlenecks in LLM inference by offloading collective operations such as All-Reduce to switches. However, the accelerator-centric architecture of NVLS suffers from two fundamental limitations: 1) it relies on GPU load instructions to trigger in-switch reduction, which means that the data reduced in the switch must be transferred back to the initiating GPU rather than being broadcast directly, thereby introducing unnecessary communication overhead; 2) due to its architectural constraints, NVLS cannot offload operators that are not decomposable into memory-semantic instructions, such as the in-network quantization (INQ) proposed in this work. As a result, All-Reduce in NVLS during inference still operates at 16-bit precision, leading to substantial bandwidth waste. To address these limitations, we propose SCIN, the first switch-centric in-network architecture for multi-accelerator shared-memory networks, enabling both low-latency and high-bandwidth All-Reduce. Specifically, we introduce an in-switch accelerator (ISA) capable of directly accessing the memory regions in attached accelerators for in-network processing, together with a co-designed communication fabric that enables such access with negligible protocol overhead. SCIN delivers lower All-Reduce latency than NVLS by eliminating redundant data movement. Moreover, SCIN enables INQ for All-Reduce, reducing its precision to 8 bits and nearly doubling bandwidth with negligible accuracy loss. We also present a multi-FPGA prototype of SCIN to validate its feasibility and effectiveness. Simulation results for an 8-GPU system show that our design accelerates All-Reduce by up to 8.7x for small messages and 3.8x for large messages, yielding up to 1.74x TTFT speedup and 1.34x TPOT speedup on LLaMA-2 models.
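The bandwidth argument behind INQ can be illustrated with a minimal, self-contained sketch (a hypothetical symmetric-quantization scheme, not the paper's actual INQ design): each rank quantizes its tensor to 8-bit integers under a shared scale, the switch sums the integer payloads, and each rank dequantizes the broadcast result. The int8 payload is half the size of a 16-bit one, which is where the near-doubling of effective bandwidth comes from.

```python
# Illustrative quantized All-Reduce (hypothetical scheme, not the
# paper's INQ). Ranks quantize to int8 with a shared scale, the
# "switch" sums integer partials, and ranks dequantize.

def quantize(vec, scale):
    # Symmetric int8 quantization: round(x / scale), clamped to [-127, 127].
    return [max(-127, min(127, round(x / scale))) for x in vec]

def dequantize(vec, scale):
    return [x * scale for x in vec]

def allreduce_int8(rank_tensors):
    # Shared scale from the global max magnitude (assumed known to all ranks).
    amax = max(abs(x) for t in rank_tensors for x in t)
    scale = amax / 127 if amax else 1.0
    quantized = [quantize(t, scale) for t in rank_tensors]
    # In-switch reduction: elementwise sum of int8 payloads in wider precision.
    reduced = [sum(col) for col in zip(*quantized)]
    return dequantize(reduced, scale)

ranks = [[0.5, -1.0, 2.0], [1.5, 1.0, -2.0]]
result = allreduce_int8(ranks)
exact = [sum(col) for col in zip(*ranks)]
# result approximates exact; each payload crossing the switch is 8-bit,
# half the wire traffic of a 16-bit All-Reduce.
```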
Jinendra Malekar, Peyton Chandarana, Md Hasibul Amin et al.
In this paper, we propose PIM-LLM, a hybrid architecture developed to accelerate 1-bit large language models (LLMs). PIM-LLM leverages analog processing-in-memory (PIM) architectures and digital systolic arrays to accelerate low-precision matrix multiplication (MatMul) operations in projection layers and high-precision MatMul operations in attention heads of 1-bit LLMs, respectively. Our design achieves up to roughly 80x improvement in tokens per second and a 70% increase in tokens per joule compared to conventional hardware accelerators. Additionally, PIM-LLM outperforms previous PIM-based LLM accelerators, setting a new benchmark with at least 2x and 5x improvement in GOPS and GOPS/W, respectively.
Yilun Zhao, Kangding Zhao, Peng Zhou et al.
The design of a scalable Quantum Control Architecture (QCA) faces two primary challenges. First, the continuous growth in qubit counts has rendered distributed QCA inevitable, yet the nondeterministic latencies inherent in feedback loops demand cycle-accurate synchronization across multiple controllers. Existing synchronization strategies -- whether lock-step or demand-driven -- introduce significant performance penalties. Second, existing quantum instruction set architectures are polarized, being either too abstract or too granular. This lack of a unifying design necessitates recurrent hardware customization for each new control requirement, which limits the system's reconfigurability and impedes the path toward a scalable and unified digital microarchitecture. Addressing these challenges, we propose Distributed-HISQ, featuring: (i) HISQ, a universal instruction set that redefines quantum control with a hardware-agnostic design. By decoupling from quantum operation semantics, HISQ provides a unified language for control sequences, enabling a single microarchitecture to support various control methods and enhancing system reconfigurability. (ii) BISP, a booking-based synchronization protocol that can potentially achieve zero-cycle synchronization overhead. The feasibility and adaptability of Distributed-HISQ are validated through its implementation on a commercial quantum control system targeting superconducting qubits. We performed a comprehensive evaluation using a customized quantum software stack. Our results show that BISP effectively synchronizes multiple control boards, leading to a 22.8% reduction in average program execution time and a ~5x reduction in infidelity when compared to an existing lock-step synchronization scheme.
Alexandre de Limas Santana, Adrià Armejach, Francesc Martinez et al.
GEneral Matrix Multiplications (GEMMs) are recurrent in high-performance computing and deep learning workloads. Typically, high-end CPUs accelerate GEMM workloads with Single-Instruction Multiple Data (SIMD) or vector Instruction Set Architectures (ISAs). Since these ISAs face significant issues when running GEMM workloads, particularly when dealing with small, tall, or skinny matrices, matrix ISAs have been proposed and implemented by major hardware vendors in recent years. Although these matrix ISAs deliver larger throughput when running GEMMs than their SIMD/vector counterparts, they are rigid solutions unable to dynamically adapt themselves to application-specific aspects like the data format. This paper demonstrates that the state-of-the-art matrix ISAs deliver suboptimal performance when running the most commonly used convolution and transformer models. This paper proposes the Matrix Tile Extension (MTE), the first matrix ISA that completely decouples the instruction set architecture from the microarchitecture and seamlessly interacts with existing vector ISAs. MTE incurs minimal implementation overhead since it only requires a few additional instructions and a 64-bit Control Status Register (CSR) to keep its state. Specifically, MTE can i) vectorize GEMMs across the three dimensions M, N, and K; ii) leverage the capacity of the existing vector register file; and iii) decouple the tile shape from the underlying microarchitecture. MTE achieves speed-ups of 1.35x over the best state-of-the-art matrix ISA.
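The tile-shape decoupling MTE describes can be sketched as a GEMM tiled independently along M, N, and K, with the tile shape passed as a runtime parameter rather than fixed by the hardware (an illustrative model, not the actual MTE instruction semantics):

```python
# Sketch of a GEMM tiled independently along M, N, and K. The tile
# shape is a runtime parameter, mirroring MTE's decoupling of tile
# shape from the microarchitecture (illustrative, not the real ISA).

def gemm_tiled(A, B, tile_m, tile_n, tile_k):
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, tile_m):
        for n0 in range(0, N, tile_n):
            for k0 in range(0, K, tile_k):
                # One "tile instruction": a small, shape-agnostic GEMM
                # that accumulates into the C tile.
                for m in range(m0, min(m0 + tile_m, M)):
                    for n in range(n0, min(n0 + tile_n, N)):
                        acc = 0.0
                        for k in range(k0, min(k0 + tile_k, K)):
                            acc += A[m][k] * B[k][n]
                        C[m][n] += acc
    return C

A = [[1, 2, 3], [4, 5, 6]]          # 2x3 (tall/skinny shapes work too)
B = [[7, 8], [9, 10], [11, 12]]     # 3x2
C = gemm_tiled(A, B, tile_m=1, tile_n=2, tile_k=2)
# The result is identical for any tile shape; only the traversal
# (and hence data reuse) changes.
```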
Chiyue Wei, Cong Guo, Junyao Zhang et al.
Vision-Language Models (VLMs) have demonstrated strong performance on tasks such as video captioning and visual question answering. However, their growing scale and video-level inputs lead to significant computational and memory overhead, posing challenges for real-time deployment on hardware accelerators. While prior work attempts to reduce redundancy via token pruning or merging, these methods typically operate at coarse granularity and incur high runtime overhead due to global token-level operations. In this study, we propose Focus, a Streaming Concentration Architecture that efficiently accelerates VLM inference through progressive, fine-grained redundancy elimination. Focus introduces a multilevel concentration paradigm that hierarchically compresses vision-language inputs at three levels: (1) semantic-guided token pruning based on textual prompts, (2) spatial-temporal block-level concentration using localized comparisons, and (3) vector-level redundancy removal via motion-aware matching. All concentration steps are tightly co-designed with the architecture to support streaming-friendly, on-chip execution. Focus leverages GEMM tiling, convolution-style layout, and cross-modal attention to minimize off-chip access while enabling high throughput. Implemented as a modular unit within a systolic-array accelerator, Focus achieves a 2.4x speedup and 3.3x reduction in energy, significantly outperforming state-of-the-art accelerators in both performance and energy efficiency. Full-stack implementation of Focus is open-sourced at https://github.com/dubcyfor3/Focus.
Wenlong Yan, Haoran Cheng, Meng Zhang et al.
This study investigates the performance of alkali-activated mortar incorporating slag, fly ash, and desert sand, with a focus on flowability, mechanical properties, sulfate resistance, and microstructural characteristics. A four-factor, three-level orthogonal experimental design was used to analyze the effects of the fly ash substitution rate, alkali content (Na₂O/b), activator modulus, and desert sand replacement rate for natural sand. The results indicate that increased slag and desert sand contents reduce mortar flowability. Despite this, the mortar exhibits excellent mechanical strength, with compressive strength reaching 77.7 MPa at 28 days and increasing to 89.34 MPa under sulfate exposure. However, after 120 days of sulfate erosion, a decline in strength is observed due to the formation of expansive products such as gypsum and caliche, leading to cracking. Microstructural analyses (XRD, SEM/EDS, MIP) reveal partial dissolution of desert sand under alkali activation, enhancing gel formation and reducing cumulative porosity. The pore structure predominantly consists of harmless pores. These findings demonstrate the potential of slag–fly ash–desert sand alkali-activated mortar as a durable and sustainable material for structural and construction engineering applications, especially in sulfate-rich environments or arid regions where desert sand is abundant.
J. M. Wolstenholme, F. Cooper, R. E. Thomas et al.
Abstract Hedgerows are a key component of the UK landscape that form boundaries, borders and limits of land whilst providing vital landscape‐scale ecological connectivity for a range of organisms. They are diverse habitats in the agricultural landscape providing a range of ecosystem services. Poorly managed hedgerows often present with gaps, reducing their ecological connectivity, resulting in fragmented habitats. However, hedgerow gap frequency and spatial distributions are often unquantified at the landscape‐scale. Here we present a novel methodology based on deep learning (DL) that is coupled with high‐resolution aerial imagery. We demonstrate how this provides a route towards a rapid, adaptable, accurate assessment of hedgerow and gap abundance at such scales, with minimal training data. We present the training and development of a DL model using the U‐Net architecture to automatically identify hedgerows across the East Riding of Yorkshire (ERY) in the UK and demonstrate the ability of the model to estimate hedgerow gap types, lengths and their locations. Our method was both time efficient and accurate, processing an area of 2479 km² in 32 h with an overall accuracy of 92.4%. The substantive results allow us to estimate that in the ERY alone, there were 3982 ± 302 km of hedgerows and 2865 ± 217 km of hedgerow gaps (with 339 km classified as for access). Our approach and study show that hedgerows and gaps can be extracted from true colour aerial imagery without the requirement of elevation data and can produce meaningful results that lead to the identification of prioritisation areas for hedgerow gap infilling, replanting and restoration. Such replanting could significantly contribute towards national tree planting goals and meeting net zero targets in a changing climate.
Perla Zambrano Prado, Tanya Chávez-Camarena
This study analyzes the spatial needs of families living in homes affected by ground subsidence in Jocotepec. The research involved residents from 17 households, using a mixed-methods approach that combined surveys and participatory dynamics as data collection tools. The results show that children represent 16% of the population and people with disabilities 3%. The most valued housing features include green areas (59%), spacious rooms (35%), good lighting (29%), and natural ventilation (24%). Other important aspects mentioned are accessibility, comfort, bedrooms with bathrooms, functional layout, and spaces for pets (6%). Additionally, the most requested extra spaces are gardens and bathrooms, both mentioned by 59% of the participants. Finally, the majority of residents expressed a preference for detached single-family housing typology.
Neelay Fruitwala, Gang Huang, Yilun Xu et al.
Quantum circuits utilizing real time feedback techniques (such as active reset and mid-circuit measurement) are a powerful tool for NISQ-era quantum computing. Such techniques are crucial for implementing error correction protocols, and can reduce the resource requirements of certain quantum algorithms. Realizing these capabilities requires flexible, low-latency classical control. We have developed a custom FPGA-based processor architecture for QubiC, an open source platform for superconducting qubit control. Our architecture is distributed in nature, and consists of a bank of lightweight cores, each configured to control a small number (1-3) of signal generator channels. Each core is capable of executing parameterized control and readout pulses, as well as performing arbitrary control flow based on mid-circuit measurement results. We have also developed a modular compiler stack and domain-specific intermediate representation for programming the processor. Our representation allows users to specify circuits using both gate and pulse-level abstractions, and includes high-level control flow constructs (e.g. if-else blocks and loops). The compiler stack is designed to integrate with quantum software tools and programming languages, such as TrueQ, pyGSTi, and OpenQASM3. In this work, we detail the design of both the processor and compiler stack, and demonstrate their capabilities with a quantum state teleportation experiment using transmon qubits at the LBNL Advanced Quantum Testbed.
Mohammed Elbtity, Peyton Chandarana, Ramtin Zand
Tensor processing units (TPUs) are one of the most well-known machine learning (ML) accelerators utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several improvements and advantages over conventional ML accelerators, such as graphics processing units (GPUs), being designed specifically to perform the multiply-accumulate (MAC) operations required in the matrix-matrix and matrix-vector multiplies extensively present throughout the execution of deep neural networks (DNNs). Such improvements include maximizing data reuse and minimizing data transfer by leveraging the temporal dataflow paradigms provided by the systolic array architecture. While this design provides a significant performance benefit, the current implementations are restricted to a single dataflow consisting of either input, output, or weight stationary architectures. This can limit the achievable performance of DNN inference and reduce the utilization of compute units. Therefore, the work herein consists of developing a reconfigurable dataflow TPU, called the Flex-TPU, which can dynamically change the dataflow per layer during run-time. Our experiments thoroughly test the viability of the Flex-TPU, comparing it to conventional TPU designs across multiple well-known ML workloads. The results show that our Flex-TPU design achieves a significant performance increase of up to 2.75x compared to conventional TPUs, with only minor area and power overheads.
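A per-layer dataflow choice of the kind Flex-TPU makes at run-time can be sketched with a simple reuse heuristic. The selection policy below is hypothetical (the paper does not specify it here); the idea is to keep stationary whichever operand is reused most for the layer's shape.

```python
# Hypothetical per-layer dataflow selection in the spirit of Flex-TPU.
# For a (batch x in_features) @ (in_features x out_features) layer,
# estimate how often each candidate stationary operand is reused and
# pick the one with the most reuse.

def pick_dataflow(batch, in_features, out_features):
    weight_reuse = batch          # each weight is reused once per input row
    input_reuse = out_features    # each input is reused once per output column
    output_reuse = in_features    # each partial sum is reused once per k-step
    best = max(("weight_stationary", weight_reuse),
               ("input_stationary", input_reuse),
               ("output_stationary", output_reuse),
               key=lambda kv: kv[1])
    return best[0]

# Large batch favors keeping weights resident in the systolic array:
print(pick_dataflow(batch=1024, in_features=256, out_features=256))
# Single-token inference with a wide output favors keeping inputs resident:
print(pick_dataflow(batch=1, in_features=64, out_features=4096))
```

Changing the dataflow per layer instead of fixing one for the whole model is what lets a reconfigurable design keep compute-unit utilization high across layers with very different shapes.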
Andrew Boutros, Aman Arora, Vaughn Betz
Deep learning (DL) is becoming the cornerstone of numerous applications both in datacenters and at the edge. Specialized hardware is often necessary to meet the performance requirements of state-of-the-art DL models, but the rapid pace of change in DL models and the wide variety of systems integrating DL make it impossible to create custom computer chips for all but the largest markets. Field-programmable gate arrays (FPGAs) present a unique blend of reprogrammability and direct hardware execution that make them suitable for accelerating DL inference. They offer the ability to customize processing pipelines and memory hierarchies to achieve lower latency and higher energy efficiency compared to general-purpose CPUs and GPUs, at a fraction of the development time and cost of custom chips. Their diverse high-speed IOs also enable directly interfacing the FPGA to the network and/or a variety of external sensors, making them suitable for both datacenter and edge use cases. As DL has become an ever more important workload, FPGA architectures are evolving to enable higher DL performance. In this article, we survey both academic and industrial FPGA architecture enhancements for DL. First, we give a brief introduction on the basics of FPGA architecture and how its components lead to strengths and weaknesses for DL applications. Next, we discuss different styles of DL inference accelerators on FPGA, ranging from model-specific dataflow styles to software-programmable overlay styles. We survey DL-specific enhancements to traditional FPGA building blocks such as logic blocks, arithmetic circuitry, and on-chip memories, as well as new in-fabric DL-specialized blocks for accelerating tensor computations. Finally, we discuss hybrid devices that combine processors and coarse-grained accelerator blocks with FPGA-like interconnect and networks-on-chip, and highlight promising future research directions.
Pablo Fondo-Ferreiro, Felipe Gil-Castiñeira, Francisco Javier González-Castaño et al.
Mobile core networks handle critical control functions for delivering services in modern cellular networks. Traditional point-to-point architectures, where network functions are directly connected through standardized interfaces, are being substituted by service-based architectures (SBAs), where core functionalities are finer-grained microservices decoupled from the underlying infrastructure. In this way, network functions and services can be distributed, with scaling and fail-over mechanisms, and can be dynamically deployed, updated, or removed to support slicing. A myriad of network functions can be deployed or removed according to traffic flows, thereby increasing the complexity of connection management. In this context, 3GPP Release 16 defines the service communication proxy (SCP) as a unified communication interface for a set of network functions. In this paper, we propose a novel software-defined networking (SDN)-based solution with the same role for a service mesh architecture where network functions can be deployed anywhere in the infrastructure. We demonstrated its efficiency in comparison with alternative architectures.
Huiling Zhou, Kaixuan Tang, Longfang Huang et al.
Cultivating tourists’ environmental responsibility behavior is an effective way to relieve the pressure on the ecological environment of tourist destinations. Based on the value-attitude-behavior theory, this paper constructs a relationship model of values, ecotourism attitude, social responsibility awareness and environmental responsibility behavior, and explores the mechanism by which tourists’ values influence environmental responsibility behavior. Taking three scenic areas (Mao’er Mountain, Yulong River and Xingping Ancient Town) in the Lijiang River Basin of Guilin, a world-famous tourist destination in China, as a case study, the structural equation model is used to test the theoretical hypotheses of tourists’ environmental responsibility behavior. The results across the three sites show that values have a significant positive impact on tourists’ environmental responsibility behavior; ecotourism attitude cannot directly affect tourists’ environmental responsibility behavior, but values can indirectly affect it through the chain mediation of ecotourism attitude and social responsibility awareness. The findings can effectively guide tourists’ environmental responsibility behavior, which has far-reaching significance for the sustainable development of tourist destinations.
Yan-Jen Huang, Hsin-Lung Wu, Ching-Chen
In this paper, we introduce block-wise separable convolutions (BlkSConv) to replace the standard convolutions for compressing deep CNN models. First, BlkSConv expresses the standard convolutional kernel as an ordered set of block vectors, each of which is a linear combination of fixed basis block vectors. Then it eliminates most basis block vectors and their corresponding coefficients to obtain an approximated convolutional kernel. Moreover, the proposed BlkSConv operation can be efficiently realized via a combination of pointwise and group-wise convolutions. Thus the constructed networks have smaller model size and fewer multiply-add operations while keeping comparable prediction accuracy. We also develop a hyperparameter search framework based on principal component analysis (PCA) to determine a qualified hyperparameter setting of the block depth and number of basis block vectors. With this search framework, we construct networks that achieve strong prediction performance while simultaneously satisfying the constraints of model size and model efficiency. Our code, data, and models are available at https://github.com/yanjenhuang/blksconv.
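The model-size benefit of replacing a standard convolution with a pointwise plus group-wise pair can be seen from a back-of-envelope parameter count. This is a simplified accounting (the actual BlkSConv factorization additionally involves basis block vectors and their coefficients), with the channel counts and group number below chosen purely for illustration:

```python
# Parameter counts: standard k x k convolution vs. a pointwise (1x1)
# convolution followed by a group-wise k x k convolution. Simplified
# illustration of the decomposition BlkSConv builds on.

def standard_conv_params(c_in, c_out, k):
    # One k x k filter per (output channel, input channel) pair.
    return c_out * c_in * k * k

def pw_gw_params(c_in, c_mid, c_out, k, groups):
    pointwise = c_mid * c_in                        # 1x1 conv mixing channels
    groupwise = c_out * (c_mid // groups) * k * k   # k x k conv within groups
    return pointwise + groupwise

std = standard_conv_params(c_in=256, c_out=256, k=3)
factored = pw_gw_params(c_in=256, c_mid=256, c_out=256, k=3, groups=8)
# std      = 256*256*9          = 589,824 parameters
# factored = 256*256 + 256*32*9 = 139,264 parameters (~4.2x fewer)
```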
Matthew Chen, Charana Sonnadara, Sahil Shah
Abstract The floating‐gate transistor is commonly employed as a non‐volatile memory device, leveraging a floating node at its gate to store electrical charge over extended periods. This stored charge effectively alters the threshold voltage of the transistor. Utilizing standard CMOS technologies, floating‐gate transistors can be designed and fabricated using conventional CMOS processes. This study focuses on characterizing the performance of a PMOS‐based floating‐gate transistor, specifically fabricated using the open‐source Skywater 130 nm process. The modulation of charge on the floating node is explored through both hot‐electron injection and Fowler–Nordheim tunnelling, providing insight into the resolution of these programming mechanisms. Additionally, the study includes a preliminary analysis of the retention time of the programmed charge in these devices. This work contributes to the open‐source electronics community by detailing the design and programming techniques of floating‐gate transistors developed with an open‐source process design kit, and makes the corresponding FG cell designs available for public use.
Maali Alabdulhafith, Abduljabbar S. Ba Mahel, Nagwan Abdel Samee et al.
Chronic wounds greatly affect quality of life and require more intensive care than acute wounds, including scheduled follow-up appointments to track healing. Good wound treatment promotes healing and reduces complications. Wound care requires precise and reliable wound measurement to optimize patient treatment and outcomes according to evidence-based best practices. Images are used to objectively assess wound state by quantifying key healing parameters. Nevertheless, robust segmentation of wound images is complex because of the high diversity of wound types and imaging conditions. This study proposes and evaluates a novel hybrid model developed for wound segmentation in medical images. The model combines advanced deep learning techniques with traditional image processing methods to improve the accuracy and reliability of wound segmentation. The main objective is to overcome the limitations of existing segmentation methods (UNet) by leveraging the combined advantages of both paradigms. In our investigation, we introduced a hybrid model architecture, wherein a ResNet34 is utilized as the encoder and a UNet is employed as the decoder. The combination of ResNet34’s deep representation learning and UNet’s efficient feature extraction yields notable benefits. The architectural design successfully integrates high-level and low-level features, enabling the generation of segmentation maps with high precision and accuracy. Applying our model to the actual data, we obtained an Intersection over Union (IoU) of 0.973, a Dice score of 0.986, and an accuracy of 0.9736. According to these results, the proposed method is more precise and accurate than the current state-of-the-art.
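The reported IoU (0.973) and Dice (0.986) follow the standard overlap definitions, and are mutually consistent since Dice = 2·IoU/(1 + IoU). A minimal sketch of both metrics on flat binary masks:

```python
# Standard segmentation metrics on binary masks, here flattened to
# 0/1 lists. IoU = |P ∩ T| / |P ∪ T|; Dice = 2|P ∩ T| / (|P| + |T|).

def iou(pred, target):
    inter = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    return inter / union if union else 1.0

def dice(pred, target):
    inter = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2 * inter / total if total else 1.0

pred   = [1, 1, 1, 0, 0, 1]
target = [1, 1, 0, 0, 1, 1]
# intersection = 3, union = 5, |pred| + |target| = 8
# iou  = 3/5 = 0.6; dice = 6/8 = 0.75 = 2*iou/(1 + iou)
```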
Yoshitaka Nagashima, Yoshio Araki, Kazuki Nishida et al.
Abstract Background The surgical techniques for treatment of chronic subdural hematoma (CSDH), a common neurosurgical condition, have been discussed extensively in the clinical literature. However, the recurrence proportion after CSDH surgery remains high, ranging from 10 to 20%. The standard surgical procedure for CSDH involves a craniostomy to evacuate the hematoma, but whether to irrigate the hematoma cavity during the procedure remains debatable. The authors hypothesized that the choice of irrigation fluid might be a key factor affecting the outcomes of surgery. This multicenter randomized controlled trial aims to investigate whether intraoperative irrigation using artificial cerebrospinal fluid (ACF) followed by the placement of a subdural drain would yield superior results compared to the placement of a subdural drain alone for CSDH. Methods The study will be conducted across 19 neurosurgical departments in Japan. The 1186 eligible patients will be randomly allocated to two groups: irrigation using ACF or not. In either group, a subdural drain is to be placed for at least 12 h postoperatively. Similar to what was done in previous studies, we set the proportion of patients that meet the criteria for ipsilateral reoperation at 7% in the irrigation group and 12% in the non-irrigation group. The primary endpoint is the proportion of patients who meet the criteria for ipsilateral reoperation within 6 months of surgery (clinical worsening of symptoms and increased hematoma on imaging compared with the postoperative state). The secondary endpoints are the proportion of reoperations within 6 months, the proportion stratified by preoperative hematoma architecture on computed tomography (CT) scan, neurological symptoms, patient condition, mortality at 6 months, complications associated with surgery, length of hospital stay from surgery to discharge, and time of the surgical procedure.
Discussion We present the study protocol for a multicenter randomized controlled trial to investigate our hypothesis that intraoperative irrigation with ACF reduces the recurrence proportion after the removal of chronic subdural hematomas compared with no irrigation. Trial registration ClinicalTrials.gov jRCT1041220124. Registered on January 13, 2023.
Page 39 of 120623