Results for "Speculative philosophy"

Showing 20 of ~1,428,746 results · from arXiv, DOAJ, Semantic Scholar, CrossRef

arXiv Open Access 2026
TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs

Minjae Lee, Wonjun Kang, Byeongkeun Ahn et al.

Speculative decoding (SD) has proven effective for accelerating LLM inference by quickly generating draft tokens and verifying them in parallel. However, SD remains largely unexplored for Large Vision-Language Models (LVLMs), which extend LLMs to process both image and text prompts. To address this gap, we benchmark existing inference methods with small draft models on 11 datasets across diverse input scenarios and observe scenario-specific performance fluctuations. Motivated by these findings, we propose Test-time Adaptive Batched Ensemble Drafting (TABED), which dynamically ensembles multiple drafts obtained via batch inference by leveraging deviations from past ground truths available in the SD setting. The dynamic ensemble method achieves an average robust walltime speedup of 1.74x over autoregressive decoding and a 5% improvement over single drafting methods, while remaining training-free and keeping ensembling costs negligible through parameter sharing. With its plug-and-play compatibility, we further enhance TABED by integrating advanced verification and alternative drafting methods. Code and custom-trained models are available at https://github.com/furiosa-ai/TABED.
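The draft-then-verify loop that speculative decoding (and TABED's ensemble drafting on top of it) relies on can be sketched in a few lines. This is a minimal greedy sketch, not the paper's method: `draft_step` and `target_step` are hypothetical single-token callables standing in for real models, and exact token-matching replaces probabilistic verification.

```python
def speculative_decode(target_step, draft_step, prompt, k=4, max_new=32):
    """Greedy draft-and-verify loop: the draft model proposes k tokens,
    which the target model then checks in one (conceptually parallel) pass."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # Draft phase: the cheap model proposes k candidate tokens.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_step(ctx)
            draft.append(t)
            ctx.append(t)
        # Verify phase: accept the longest prefix on which both models
        # agree, then emit one token from the target itself.
        accepted, ctx = [], list(tokens)
        for t in draft:
            target_t = target_step(ctx)
            if target_t == t:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(target_t)  # target's token replaces the miss
                break
        else:
            accepted.append(target_step(ctx))  # bonus token: all drafts passed
        tokens.extend(accepted)
    return tokens[len(prompt):][:max_new]
```

Because every emitted token is either confirmed or produced by the target model, the output matches plain autoregressive decoding regardless of draft quality; the draft model only changes how much work is accepted per round.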

en cs.LG, cs.CL
arXiv Open Access 2026
Hypothesize-Then-Verify: Speculative Root Cause Analysis for Microservices with Pathwise Parallelism

Lingzhe Zhang, Tong Jia, Yunpeng Zhai et al.

Microservice systems have become the backbone of cloud-native enterprise applications due to their resource elasticity, loosely coupled architecture, and lightweight deployment. Yet, the intrinsic complexity and dynamic runtime interactions of such systems inevitably give rise to anomalies. Ensuring system reliability therefore hinges on effective root cause analysis (RCA), which entails not only localizing the source of anomalies but also characterizing the underlying failures in a timely and interpretable manner. Recent advances in intelligent RCA techniques, particularly those powered by large language models (LLMs), have demonstrated promising capabilities, as LLMs reduce reliance on handcrafted features while offering cross-platform adaptability, task generalization, and flexibility. However, existing LLM-based methods still suffer from two critical limitations: (a) limited exploration diversity, which undermines accuracy, and (b) heavy dependence on large-scale LLMs, which results in slow inference. To overcome these challenges, we propose SpecRCA, a speculative root cause analysis framework for microservices that adopts a "hypothesize-then-verify" paradigm. SpecRCA first leverages a hypothesis drafting module to rapidly generate candidate root causes, and then employs a parallel root cause verifier to efficiently validate them. Preliminary experiments on the AIOps 2022 dataset demonstrate that SpecRCA achieves superior accuracy and efficiency compared to existing approaches, highlighting its potential as a practical solution for scalable and interpretable RCA in complex microservice environments.

en cs.SE, cs.AI
arXiv Open Access 2026
SJD-PV: Speculative Jacobi Decoding with Phrase Verification for Autoregressive Image Generation

Zhehao Yu, Baoquan Zhang, Bingqi Shan et al.

Autoregressive (AR) image models have recently demonstrated remarkable generative capability, but their sequential nature results in significant inference latency. Existing training-free acceleration methods typically verify tokens independently, overlooking the strong co-occurrence patterns between adjacent visual tokens. This independence assumption often leads to contextual inconsistency and limits decoding efficiency. In this work, we introduce a novel training-free acceleration framework that performs phrase-level speculative verification, enabling the model to jointly validate multiple correlated tokens within each decoding window. To construct such phrase units, we analyze token co-occurrence statistics from the training corpus and group frequently co-occurring tokens into semantically coherent visual phrases. During inference, the proposed phrase-level verification evaluates aggregated likelihood ratios over each phrase, allowing simultaneous acceptance of multiple tokens while preserving generation quality. Extensive experiments on autoregressive text-to-image generation show that our method significantly reduces the number of function evaluations (NFE) and achieves up to 30% faster decoding without compromising visual fidelity. Our findings reveal that modeling short-range token co-occurrence provides an effective and general principle for accelerating autoregressive inference.

en cs.CV
arXiv Open Access 2026
S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

Ligong Han, Hao Wang, Han Gao et al.

Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to 4.7x speedup over autoregressive decoding, and up to 1.57x over a tuned dynamic decoding baseline while improving accuracy by up to 4.5 points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is 4.4x faster than the static baseline with slightly higher accuracy.

en cs.CL
arXiv Open Access 2026
Dynamic Delayed Tree Expansion For Improved Multi-Path Speculative Decoding

Rahul Thomas, Teo Kitanovski, Micah Goldblum et al.

Multi-path speculative decoding accelerates lossless sampling from a target model by using a cheaper draft model to generate a draft tree of tokens, and then applies a verification algorithm that accepts a subset of these. While prior work has proposed various verification algorithms for i.i.d. rollouts, their relative performance under matched settings remains unclear. In this work, we first present a systematic evaluation of verification strategies across model families, tasks, and sampling regimes, and find that Traversal Verification dominates consistently, with OT-based methods lagging far behind. Our analysis uncovers that this occurs because OT-based methods achieve high multi-token acceptance near the root of the draft tree, while multi-token gains are most impactful deeper in the draft tree, where draft and target distributions diverge. Based on this insight, we propose delayed tree expansion, which drafts a partial single path, delaying the i.i.d. branching point. We show that delayed tree expansion preserves the target distribution and improves on root-node i.i.d. rollouts. Further, we develop a dynamic neural selector that estimates the expected block efficiency of optimal-transport-based verification methods from draft and target features, enabling context-dependent expansion decisions. Our neural selector allows OT-based methods like SpecInfer to outperform Traversal Verification for the first time, achieving 5% higher average throughput across a wide range of models, datasets, and sampling settings.

en cs.LG
DOAJ Open Access 2026
Synthetic data for training Artificial Intelligence:

Ananda Fernanda de Jesus, Wesley Algarve, José Eduardo Santarem Segundo

While real data are collected from different real-world contexts, such as the observation of diverse phenomena or people's interactions with systems and machines, synthetic data are generated by applying different techniques, with the goal of expanding, balancing, or diversifying a given dataset. The use of synthetic data is not new: it has long been employed in statistical analyses and, frequently, in the field of Artificial Intelligence. However, the current technological context, marked by the proliferation of Large Language Models and the popularization of Generative Artificial Intelligence, has highlighted the need for massive volumes of data for the training and continuous evolution of these models, which has significantly expanded the adoption of synthetic data in the training of different types of Artificial Intelligence. Recognizing the importance of synthetic data in this scenario, this research seeks to present and discuss concepts related to this type of data, as well as to identify its possible interfaces with the areas of Information and Knowledge Organization and Representation. To this end, it starts from the identification and discussion of the relevant concepts, through a focused terminology study and an exploratory analysis based on a research protocol. As a result, we expect to trace relations between the context of synthetic data and information representation and knowledge organization, identifying both the challenges these data pose to representation and the potential contributions that representation and organization can offer to this universe.

Information resources (General), Epistemology. Theory of knowledge
arXiv Open Access 2025
Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding

Pengfei Hu, Meng Cao, Yingyao Wang et al.

Long video understanding is essential for human-like intelligence, enabling coherent perception and reasoning over extended temporal contexts. While the emerging thinking-with-frames paradigm, which alternates between global temporal reasoning and local frame examination, has advanced the reasoning capabilities of video multi-modal large language models (MLLMs), it suffers from a significant efficiency bottleneck due to the progressively growing and redundant multi-modal context. To address this, we propose SpecTemp, a reinforcement learning-based Speculative Temporal reasoning framework that decouples temporal perception from reasoning via a cooperative dual-model design. In SpecTemp, a lightweight draft MLLM rapidly explores and proposes salient frames from densely sampled temporal regions, while a powerful target MLLM focuses on temporal reasoning and verifies the draft's proposals, iteratively refining its attention until convergence. This design mirrors the collaborative pathways of the human brain, balancing efficiency with accuracy. To support training, we construct the SpecTemp-80K dataset, featuring synchronized dual-level annotations for coarse evidence spans and fine-grained frame-level evidence. Experiments across multiple video understanding benchmarks demonstrate that SpecTemp not only maintains competitive accuracy but also significantly accelerates inference compared with existing thinking-with-frames methods.

en cs.CV
arXiv Open Access 2025
Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding

Gabe Guo, Stefano Ermon

In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution. With discrete diffusion models, the more tokens they generate in parallel, the less their predicted distributions adhere to the originally learned data distribution, as they rely on a conditional independence assumption that only works with infinitesimally small timesteps. We find that a different class of models, any-subset autoregressive models (AS-ARMs), holds the solution. As implied by the name, AS-ARMs can generate tokens in any order, and in parallel. Moreover, AS-ARMs support parallelized joint probability density estimation, allowing them to correct their own parallel-generated token distributions, via our Any-Subset Speculative Decoding (ASSD) algorithm. ASSD provably enables generation of tokens from the correct joint distribution, with the number of neural network calls upper bounded by the number of tokens predicted. We empirically verify that ASSD speeds up language generation, without sacrificing quality. Furthermore, we provide a mathematically justified scheme for training AS-ARMs for generation, and show that AS-ARMs achieve state-of-the-art performance among sub-200M parameter models on infilling benchmark tasks, and nearly match the performance of models 50X larger on code generation. Our theoretical and empirical results indicate that the once-forgotten AS-ARMs are a promising direction of language modeling.

en cs.LG, cs.CL
arXiv Open Access 2025
Formalism 25

Mikhail G. Katz, Karl Kuhlemann, Sam Sanders et al.

Abraham Robinson's philosophical stance has been the subject of several recent studies. Erhardt following Gaifman claims that Robinson was a finitist, and that there is a tension between his philosophical position and his actual mathematical output. We present evidence in Robinson's writing that he is more accurately described as adhering to the philosophical approach of Formalism. Furthermore, we show that Robinson explicitly argued against certain finitist positions in his philosophical writings. There is no tension between Robinson's mathematical work and his philosophy because mathematics and metamathematics are distinct fields: Robinson advocates finitism for metamathematics but no such restriction for mathematics. We show that Erhardt's analysis is marred by historical errors, by routine conflation of the generic and the technical meaning of several key terms, and by a philosophical parti pris. Robinson's Formalism remains a viable alternative to mathematical Platonism.

arXiv Open Access 2025
Efficient Control Flow Attestation by Speculating on Control Flow Path Representations

Liam Tyler, Adam Caulfield, Ivan De Oliveira Nunes

Control Flow Attestation (CFA) allows remote verification of run-time software integrity in embedded systems. However, CFA is limited by the storage/transmission costs of generated control flow logs (CFlog). Recent work has proposed application-specific optimizations by speculating on likely sub-paths in CFlog and replacing them with reserved symbols at runtime. Albeit effective, prior approaches do not consider the representation of addresses in a control flow path for speculation. This work proposes RESPEC-CFA, an architectural extension for CFA allowing for speculation on (1) the locality of control flows and (2) their Huffman encoding. Alone, RESPEC-CFA reduces CFlog sizes by up to 90.1%. Combined with prior methods, RESPEC-CFA yields reductions of up to 99.7%, representing a significant step toward practical CFA.
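The Huffman component of the abstract can be illustrated with a textbook Huffman coder over a log of branch-target addresses: frequently taken targets get short codewords, which is what shrinks CFlog. This is a generic software sketch, not RESPEC-CFA's architectural mechanism, and the example addresses are made up.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code (symbol -> bitstring) from symbol frequencies.
    Frequently logged branch targets receive the shortest codewords."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate single-symbol log
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tiebreak id, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (fa + fb, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# A toy control-flow log of (hypothetical) branch-target addresses.
log = [0x4006, 0x4006, 0x4006, 0x4006, 0x4010, 0x4010, 0x40FC]
code = huffman_code(log)
encoded_bits = sum(len(code[addr]) for addr in log)
```

Here the hot target `0x4006` gets a 1-bit code, so the 7-entry log encodes in 10 bits instead of 7 full-width addresses.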

en cs.CR
arXiv Open Access 2025
MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE

Zongle Huang, Lei Zhu, Zongyuan Zhan et al.

Large Language Models (LLMs) have achieved remarkable success across many applications, with Mixture of Experts (MoE) models demonstrating great potential. Compared to traditional dense models, MoEs achieve better performance with less computation. Speculative decoding (SD) is a widely used technique to accelerate LLM inference without accuracy loss, but it has been considered efficient only for dense models. In this work, we first demonstrate that, under medium batch sizes, MoE surprisingly benefits more from SD than dense models. Furthermore, as MoE becomes sparser -- the prevailing trend in MoE designs -- the batch size range where SD acceleration is expected to be effective becomes broader. To quantitatively understand tradeoffs involved in SD, we develop a reliable performance model based on theoretical analysis. While current SD research primarily focuses on improving acceptance rates of algorithms, changes in workload and model architecture can still lead to degraded SD acceleration even with high acceptance rates. To address this limitation, we introduce a new metric 'target efficiency' that characterizes these effects, thus helping researchers identify system bottlenecks and understand SD acceleration more comprehensively. For scenarios like private serving, this work unveils a new perspective to speed up MoE inference, where existing solutions struggle. Experiments on different GPUs show up to 2.29x speedup for Qwen2-57B-A14B at medium batch sizes and validate our theoretical predictions.
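The kind of tradeoff this abstract models can be illustrated with a common first-order estimate of speculative-decoding speedup (in the spirit of standard SD analyses, not necessarily the paper's own formulation): with per-token acceptance rate `alpha`, draft length `gamma`, and draft cost `c` relative to one target forward pass, the expected speedup is the expected tokens per round divided by the round's relative cost.

```python
def expected_speedup(alpha, gamma, c):
    """First-order speculative-decoding speedup estimate:
    E[tokens emitted per round] / (relative cost of one round).
    alpha: per-token acceptance rate, gamma: draft length,
    c: draft-model cost relative to one target forward pass."""
    cost = gamma * c + 1  # gamma draft steps plus one target verification pass
    if alpha >= 1.0:      # every draft accepted: gamma + 1 tokens per round
        return (gamma + 1) / cost
    accepted = (1 - alpha ** (gamma + 1)) / (1 - alpha)  # E[tokens per round]
    return accepted / cost
```

With a cheap drafter (`c = 0.05`) and high acceptance (`alpha = 0.8`), a draft length of 4 gives roughly 2.8x; with `alpha = 0`, the same setup is slower than plain decoding, which is the kind of workload-dependent degradation the abstract's 'target efficiency' metric is meant to expose.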

en cs.LG, cs.AI
arXiv Open Access 2025
Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding

Sukmin Cho, Sangjin Choi, Taeho Hwang et al.

Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as they have been widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for improving inference speed by drafting and verifying tokens, thereby generating multiple tokens in a single forward pass. However, current drafting strategies usually require significant fine-tuning or have inconsistent performance across tasks. To address these challenges, we propose Hierarchy Drafting (HD), a novel lossless drafting approach that organizes various token sources into multiple databases in a hierarchical framework based on temporal locality. In the drafting step, HD sequentially accesses multiple databases to obtain draft tokens from the highest to the lowest locality, ensuring consistent acceleration across diverse tasks and minimizing drafting latency. Our experiments on Spec-Bench using LLMs with 7B and 13B parameters demonstrate that HD outperforms existing database drafting methods, achieving robust inference speedups across model sizes, tasks, and temperatures.

en cs.CL
arXiv Open Access 2025
Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits

Yeshwanth Venkatesha, Souvik Kundu, Priyadarshini Panda

Large Language Models (LLMs) enable various applications on edge devices such as smartphones, wearables, and embodied robots. However, their deployment often depends on expensive cloud-based APIs, creating high operational costs, which limit access for smaller organizations and raise sustainability concerns. Certain LLMs can be deployed on-device, offering a cost-effective solution with reduced latency and improved privacy. Yet, limited computing resources constrain the size and accuracy of models that can be deployed, necessitating a collaborative design between edge and cloud. We propose a fast and cost-effective speculative edge-cloud decoding framework with a large target model on the server and a small draft model on the device. By introducing early exits in the target model, tokens are generated mid-verification, allowing the client to preemptively draft subsequent tokens before final verification, thus utilizing idle time and enhancing parallelism between edge and cloud. Using an NVIDIA Jetson Nano (client) and an A100 GPU (server) with Vicuna-68M (draft) and Llama2-7B (target) models, our method achieves up to a 35% reduction in latency compared to cloud-based autoregressive decoding, with an additional 11% improvement from preemptive drafting. To demonstrate real-world applicability, we deploy our method on the Unitree Go2 quadruped robot using Vision-Language Model (VLM) based control, achieving a 21% speedup over traditional cloud-based autoregressive decoding. These results demonstrate the potential of our framework for real-time LLM and VLM applications on resource-constrained edge devices.

en cs.RO, cs.AI
arXiv Open Access 2025
MC-SJD : Maximal Coupling Speculative Jacobi Decoding for Autoregressive Visual Generation Acceleration

Junhyuk So, Hyunho Kook, Chaeyeon Jang et al.

While autoregressive (AR) modeling has recently emerged as a new paradigm in visual generation, its practical adoption is severely constrained by the slow inference speed of per-token generation, which often requires thousands of steps to produce a single sample. To address this challenge, we propose MC-SJD, a training-free, lossless parallel decoding framework designed to accelerate AR visual generation by extending the recently introduced Speculative Jacobi Decoding (SJD). Although SJD shows strong potential for accelerating AR generation, we demonstrate that token instability across iterations significantly reduces the acceptance rate, a limitation that primarily arises from the independent sampling process used during draft token generation. To overcome this, we introduce MC-SJD, an information-theoretic approach based on coupling, which substantially accelerates standard SJD by maximizing the probability of sampling identical draft tokens across consecutive iterations, all while preserving its lossless property. Remarkably, this method requires only a single-line modification to the existing algorithm, yet achieves substantial performance gains, delivering up to a ~4.2x acceleration in image generation and ~13.3x acceleration in video generation compared to standard AR decoding, without any degradation in output quality.
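The coupling idea behind MC-SJD can be illustrated with the textbook maximal coupling of two discrete distributions, which draws a pair of samples that agree with the largest possible probability, the overlap sum of min(p_i, q_i). This is a generic sketch of maximal coupling, not MC-SJD's exact procedure.

```python
import random

def _sample(dist, rng):
    """Inverse-CDF sampling from a dict distribution {symbol: prob}."""
    u, acc = rng.random(), 0.0
    for k, v in sorted(dist.items()):
        acc += v
        if u <= acc:
            return k
    return max(dist)  # fallback for floating-point slack

def maximal_coupling(p, q, rng):
    """Draw (X, Y) with X ~ p and Y ~ q, maximizing P(X == Y).
    The agreement probability equals sum_i min(p_i, q_i)."""
    overlap = {k: min(p.get(k, 0.0), q.get(k, 0.0)) for k in set(p) | set(q)}
    w = sum(overlap.values())
    if rng.random() < w:
        x = _sample({k: v / w for k, v in overlap.items()}, rng)
        return x, x  # coupled: both draws emit the same token
    # Otherwise, draw each side from its (normalized) residual distribution.
    rp = {k: (p[k] - overlap.get(k, 0.0)) / (1 - w) for k in p}
    rq = {k: (q[k] - overlap.get(k, 0.0)) / (1 - w) for k in q}
    return _sample(rp, rng), _sample(rq, rng)
```

In the speculative-Jacobi setting, coupling consecutive iterations' draws this way keeps the marginals (and hence losslessness) intact while making identical draft tokens as likely as possible, which is what stabilizes acceptance.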

en cs.CV
DOAJ Open Access 2025
PHILOSOPHICAL DIMENSIONS OF DEMOCRACY AND FREEDOM IN MODERN SOCIETY

Ярослав СЕГЕНЬ

The article explores the impact of informational fragmentation within the contemporary media environment on the formation of citizens’ political identity and its implications for democratic processes. Particular attention is given to the philosophical dimensions of this issue: the relationship between freedom of access to information, the search for truth, and the possibilities of authentic communication in the public sphere. The aim of the study is to identify the mechanisms of interaction between information flows and political socialization, to analyze cognitive polarization, and to assess the risks for public discourse understood as a space of meaning-making and democratic coexistence. To achieve this goal, the study employs methods of analyzing contemporary theoretical research in media ecology, political science, sociology, and the philosophy of communication, as well as a comparative analysis of national and international sources. The findings demonstrate that informational fragmentation generates isolated informational segments that restrict citizens’ access to diverse perspectives and reinforce the effect of “echo chambers.” This leads to cognitive and political polarization, instability of political identity, and the deepening of socio-political division. At the same time, fragmentation creates new opportunities for individualized political participation, personalized access to information, and the expansion of the space of freedom of choice. The practical significance of the study lies in the development of approaches to integrating diverse information flows into a unified public sphere, enhancing media literacy, and fostering mechanisms of intergroup dialogue. Special emphasis is placed on the necessity of combining technological and social strategies with a philosophical reflection on the values of democracy, truth, and communication. 
The results of the study may be applied to the optimization of state media policy, the design of educational programs, and the development of strategies to support democratic public discourse. The conclusions highlight the dual character of the influence of informational fragmentation: it simultaneously stimulates democratic engagement while deepening citizens’ isolation, thus necessitating the elaboration of comprehensive strategies for integrating information flows and strengthening democratic unity on the basis of the philosophical principles of freedom, truth, and responsibility.

Epistemology. Theory of knowledge
S2 Open Access 2024
Speculative Practicescapes of Learning Design and Dreaming

Eamon Costello, Jason McDonald, Felicitas Macgilchrist et al.

This article addresses a serious issue that besets learning design: its over-reliance on frameworks that promise particular outcomes for individual learners that accord with pre-defined metrics. This is partly a function of the nature of learning design and development itself which is commonly seen as outcome-oriented activity that should benefit individual learners in specific ways. An alternative approach is adopted here which calls attention to other happenings at the heart of education, including positive emotions we experience that are made known through less measurable and more fleeting points of reference. Hence, we draw on sources such as poems and personal reflections in order not just to design learning but to dream it. The concept of a practicescape is invoked which serves not just to situate learning but to remind the learner that their learning experience only happens within the context of their finite lifetime. Seven practicescapes are presented and reflected on by the authors as a conversation framework for interrogating ideas of learning that owe more to dreams, poems, and possibilities than aims, objectives, or outcomes. Drawing on early Buddhist philosophy, the practicescapes attempt to honour particular affective states and conjure a heart-centred framework on which to hang speculative questions and provocations for learning design that are focused on cultivating and sustaining the most positive forms of human experience. These practicescapes are offered as a speculative learning design climbing frame that could take us from dreams of possibility to enlivened and embodied presents.

9 citations · en
arXiv Open Access 2024
Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference

Libo Zhang, Zhaoning Zhang, Baizhou Xu et al.

With the continuous advancement in the performance of large language models (LLMs), their demand for computational resources and memory has significantly increased, which poses major challenges for efficient inference on consumer-grade devices and legacy servers. These devices typically feature relatively weaker GPUs and stronger CPUs. Although techniques such as parameter offloading and partial offloading can alleviate GPU memory pressure to some extent, their effectiveness is limited due to communication latency and suboptimal hardware resource utilization. To address this issue, we propose Dovetail, a lossless inference acceleration method that leverages the complementary characteristics of heterogeneous devices and the advantages of speculative decoding. Dovetail deploys a draft model on the GPU to perform preliminary predictions, while a target model running on the CPU validates these outputs. By reducing the granularity of data transfer, Dovetail significantly minimizes communication overhead. To further improve efficiency, we optimize the draft model specifically for heterogeneous hardware environments by reducing the number of draft tokens to lower parallel verification latency, increasing model depth to enhance predictive capabilities, and introducing a Dynamic Gating Fusion (DGF) mechanism to improve the integration of feature and embedding information. We conduct comprehensive evaluations of Dovetail across various consumer-grade GPUs, covering multiple tasks and mainstream models. Experimental results on 13B models demonstrate that Dovetail achieves inference speedups ranging from 1.79x to 10.1x across different devices, while maintaining consistency and stability in the distribution of generated texts.

en cs.CL
arXiv Open Access 2024
PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption

Yifan Tan, Cheng Tan, Zeyu Mi et al.

Confidential computing on GPUs, like NVIDIA H100, mitigates the security risks of outsourced Large Language Models (LLMs) by implementing strong isolation and data encryption. Nonetheless, this encryption incurs a significant performance overhead, reaching up to 52.8 percent and 88.2 percent throughput drop when serving OPT-30B and OPT-66B, respectively. To address this challenge, we introduce PipeLLM, a user-transparent runtime system. PipeLLM removes the overhead by overlapping the encryption and GPU computation through pipelining - an idea inspired by the CPU instruction pipelining - thereby effectively concealing the latency increase caused by encryption. The primary technical challenge is that, unlike CPUs, the encryption module lacks prior knowledge of the specific data needing encryption until it is requested by the GPUs. To this end, we propose speculative pipelined encryption to predict the data requiring encryption by analyzing the serving patterns of LLMs. Further, we have developed an efficient, low-cost pipeline relinquishing approach for instances of incorrect predictions. Our experiments on NVIDIA H100 GPU show that compared with vanilla systems without confidential computing (e.g., vLLM, PEFT, and FlexGen), PipeLLM incurs modest overhead (less than 19.6 percent in throughput) across various LLM sizes, from 13B to 175B.

en cs.CR, cs.DC
arXiv Open Access 2024
Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism

Jiahao Liu, Qifan Wang, Jingang Wang et al.

The recent advancements in large language models (LLMs) have been extraordinary, yet the escalating inference costs associated with them present challenges in real-world applications. To address these challenges, we propose a novel approach called Early-exiting Speculative Decoding (EESD) with lossless acceleration. Specifically, EESD utilizes a segment of the LLM to generate draft tokens, incorporating Early-exiting structures after the first N layers. To enhance the quality of draft tokens, a self-distillation method is integrated. This early-exiting design not only reduces deployment and training costs but also significantly accelerates the token generation speed. Moreover, we introduce a novel sampling mechanism that leverages Thompson Sampling to regulate the generation processes, automatically determining the quantity of draft tokens in each round. The original LLM is then employed to validate these draft tokens through a single forward pass, and thus guarantees that the final output text maintains a distribution consistent with vanilla auto-regressive decoding. The experimental results on both 13B and 70B models demonstrate that our approach decodes tokens at a markedly accelerated rate compared to prior methods, showing the effectiveness of our approach.
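The Thompson Sampling control idea, deciding each round how many draft tokens to generate, can be sketched with a Beta-Bernoulli posterior over the draft-token acceptance rate. The class below and its rate-to-length rule are illustrative assumptions, not EESD's exact mechanism.

```python
import math
import random

class DraftLengthController:
    """Thompson Sampling over the per-token draft acceptance rate.
    Maintains a Beta(a, b) posterior; each round samples a plausible
    rate and drafts more tokens when acceptance looks high."""

    def __init__(self, max_len=8, rng=None):
        self.a, self.b = 1.0, 1.0  # uniform Beta prior
        self.max_len = max_len
        self.rng = rng or random.Random(0)

    def choose_len(self):
        rate = self.rng.betavariate(self.a, self.b)  # posterior sample
        if not 0.0 < rate < 1.0:
            return self.max_len if rate >= 1.0 else 1
        # Draft until the chance that the whole block survives drops below 50%.
        k = int(math.log(0.5) / math.log(rate))
        return max(1, min(self.max_len, k))

    def update(self, accepted, drafted):
        self.a += accepted            # accepted draft tokens count as successes
        self.b += drafted - accepted  # rejected ones count as failures
```

After a run of fully accepted rounds the sampled rate concentrates near 1 and the controller drafts `max_len` tokens; after consistent rejections it falls back to drafting a single token, mirroring the automatic draft-quantity regulation the abstract describes.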

en cs.CL
arXiv Open Access 2024
The Mamba in the Llama: Distilling and Accelerating Hybrid Models

Junxiong Wang, Daniele Paliotta, Avner May et al.

Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks and outperforms open-source hybrid Mamba models trained from scratch with trillions of tokens in both chat benchmarks and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models. Overall we show how, with limited computation resources, we can remove many of the original attention layers and generate from the resulting model more efficiently. Our top-performing model, distilled from Llama3-8B-Instruct, achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and 7.35 on MT-Bench, surpassing the best 8B scale instruction-tuned linear RNN model. We also find that the distilled model has natural length extrapolation, showing almost perfect accuracy in the needle-in-a-haystack test at 20x the distillation length. Code and pre-trained checkpoints are open-sourced at https://github.com/jxiw/MambaInLlama and https://github.com/itsdaniele/speculative_mamba.

en cs.LG, cs.AI

Page 40 of 71438