Flatter Tokens are More Valuable for Speculative Draft Model Training
Jiaming Fan, Daming Cao, Xiangzhong Luo
et al.
Speculative Decoding (SD) is a key technique for accelerating Large Language Model (LLM) inference, but it typically requires training a draft model on a large dataset. We approach this problem from a data-centric perspective, finding that not all training samples contribute equally to the SD acceptance rate. Specifically, our theoretical analysis and empirical validation reveal that tokens inducing flatter predictive distributions from the target model are more valuable than those yielding sharply peaked distributions. Based on this insight, we propose flatness, a new metric to quantify this property, and develop the Sample-level-flatness-based Dataset Distillation (SFDD) approach, which filters the training data to retain only the most valuable samples. Experiments on the EAGLE framework demonstrate that SFDD can achieve over 2$\times$ training speedup using only 50% of the data, while keeping the final model's inference speedup within 4% of the full-dataset baseline. This work introduces an effective, data-centric approach that substantially improves training efficiency for Speculative Decoding. Our code is available at https://github.com/fjm9933/Flatness.
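The abstract does not give the paper's exact flatness formula, so as an illustration only, here is a minimal sketch of sample-level filtering that uses the Shannon entropy of the target model's predictive distribution as a stand-in flatness score; the function names and the keep ratio are assumptions, not the paper's:

```python
import math

def flatness(probs):
    """Shannon entropy of the target model's predictive distribution.
    Higher entropy means a flatter distribution (assumed proxy for the
    paper's flatness metric)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def filter_by_flatness(samples, keep_ratio=0.5):
    """Keep the keep_ratio fraction of samples whose target distributions
    are flattest (highest entropy)."""
    ranked = sorted(samples, key=lambda s: flatness(s["target_probs"]), reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]

samples = [
    {"id": 0, "target_probs": [0.97, 0.01, 0.01, 0.01]},  # sharply peaked
    {"id": 1, "target_probs": [0.25, 0.25, 0.25, 0.25]},  # flat
    {"id": 2, "target_probs": [0.70, 0.10, 0.10, 0.10]},
    {"id": 3, "target_probs": [0.40, 0.30, 0.20, 0.10]},
]
kept = filter_by_flatness(samples, keep_ratio=0.5)
print([s["id"] for s in kept])  # → [1, 3]: the flattest half is retained
```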
Beat the long tail: Distribution-Aware Speculative Decoding for RL Training
Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava
et al.
Reinforcement learning (RL) post-training has become essential for aligning large language models (LLMs), yet its efficiency is increasingly constrained by the rollout phase, where long trajectories are generated token by token. We identify a major bottleneck: the long-tail distribution of rollout lengths, where a small fraction of long generations dominates wall-clock time. We also identify a complementary opportunity: historical rollouts reveal stable prompt-level patterns across training epochs. Motivated by these observations, we propose DAS, a Distribution-Aware Speculative decoding framework that accelerates RL rollouts without altering model outputs. DAS integrates two key ideas: an adaptive, nonparametric drafter built from recent rollouts using an incrementally maintained suffix tree, and a length-aware speculation policy that allocates more aggressive draft budgets to the long trajectories that dominate makespan. This design exploits rollout history to sustain acceptance while balancing base- and token-level costs during decoding. Experiments on math and code reasoning tasks show that DAS reduces rollout time by up to 50% while preserving identical training curves, demonstrating that distribution-aware speculative decoding can significantly accelerate RL post-training without compromising learning quality.
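As a rough illustration of drafting from rollout history, the sketch below replaces DAS's incrementally maintained suffix tree with a simple n-gram table over past rollouts; the class, its fixed context length, and the draft budget are all assumptions, not the paper's design:

```python
class HistoryDrafter:
    """Nonparametric drafter over past rollout token streams: a simplified
    stand-in for DAS's suffix tree that indexes every length-n context and
    drafts by following the most recently observed continuation."""

    def __init__(self, n=3):
        self.n = n
        self.table = {}  # context n-gram -> last observed next token

    def add_rollout(self, tokens):
        """Incrementally index a finished rollout."""
        for i in range(len(tokens) - self.n):
            ctx = tuple(tokens[i : i + self.n])
            self.table[ctx] = tokens[i + self.n]

    def draft(self, prefix, budget):
        """Greedily extend the current prefix for up to `budget` tokens,
        stopping as soon as the context was never seen before."""
        out = []
        window = list(prefix[-self.n :])
        for _ in range(budget):
            nxt = self.table.get(tuple(window))
            if nxt is None:
                break
            out.append(nxt)
            window = window[1:] + [nxt]
        return out

d = HistoryDrafter(n=2)
d.add_rollout([1, 2, 3, 4, 2, 3, 5])
print(d.draft([1, 2], budget=3))  # → [3, 5]: (2, 3) was last followed by 5
```

A length-aware policy in the spirit of the abstract would then pass a larger `budget` for prompts whose historical rollouts were long.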
S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models
Tao He, Guang Huang, Yu Yang
et al.
Large language models (LLMs) exhibit remarkable reasoning capabilities across diverse downstream tasks. However, their autoregressive nature leads to substantial inference latency, posing challenges for real-time applications. Speculative sampling mitigates this issue by introducing a drafting phase followed by a parallel validation phase, enabling faster token generation and verification. Existing approaches, however, overlook the inherent coherence in text generation, limiting their efficiency. To address this gap, we propose a Speculative Sampling with Syntactic and Semantic Coherence (S$^4$C) framework, which extends speculative sampling by leveraging multi-head drafting for rapid token generation and a continuous verification tree for efficient candidate validation and feature reuse. Experimental results demonstrate that S$^4$C surpasses baseline methods across mainstream tasks, offering enhanced efficiency, parallelism, and the ability to generate more valid tokens with fewer computational resources. On Spec-bench benchmarks, S$^4$C achieves an acceleration ratio of 2.26x-2.60x, outperforming state-of-the-art methods.
EasySpec: Layer-Parallel Speculative Decoding for Efficient Multi-GPU Utilization
Yize Wu, Ke Gao, Ling Li
et al.
Speculative decoding is an effective and lossless method for Large Language Model (LLM) inference acceleration. It employs a smaller model to generate a draft token sequence, which is then verified by the original base model. In multi-GPU systems, inference latency can be further reduced through tensor parallelism (TP), while the optimal TP size of the draft model is typically smaller than that of the base model, leading to GPU idling during the drafting stage. We observe that such inefficiency stems from the sequential execution of layers, which is seemingly natural but actually unnecessary. Therefore, we propose EasySpec, a layer-parallel speculation strategy that optimizes the efficiency of multi-GPU utilization. EasySpec breaks the inter-layer data dependencies in the draft model, enabling multiple layers to run simultaneously across multiple devices as 'fuzzy' speculation. After each drafting-and-verification iteration, the draft model's key-value cache is calibrated in a single forward pass, preventing long-term fuzzy-error accumulation at minimal additional latency. EasySpec is a training-free and plug-in method. We evaluated EasySpec on several mainstream open-source LLMs, using smaller versions of models from the same series as drafters. The results demonstrate that EasySpec can achieve a peak speedup of 4.17x compared to vanilla decoding, while preserving the original distributions of the base LLMs. Specifically, the drafting stage can be accelerated by up to 1.62x with a maximum speculation accuracy drop of only 7%. The code is available at https://github.com/Yize-Wu/EasySpec.
Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding
Yue Guan, Changming Yu, Shihan Fang
et al.
Speculative decoding improves LLM inference by generating and verifying multiple tokens in parallel, but existing systems suffer from suboptimal performance due to a mismatch between dynamic speculation and static runtime assumptions. We present Yggdrasil, a co-designed system that enables latency-optimal speculative decoding through context-aware tree drafting and compiler-friendly execution. Yggdrasil introduces an equal-growth tree structure for static graph compatibility, a latency-aware optimization objective for draft selection, and stage-based scheduling to reduce overhead. Yggdrasil supports unmodified LLMs and achieves up to $3.98\times$ speedup over state-of-the-art baselines across multiple hardware setups.
AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding
Zikun Li, Zhuofu Chen, Remi Delacourt
et al.
Modern large language model (LLM) applications exhibit diverse service-level objectives (SLOs), from low-latency requirements in interactive coding assistants to more relaxed constraints in data wrangling tasks. Existing LLM serving systems, which rely on uniform batching and scheduling strategies, often fail to meet these heterogeneous SLOs concurrently. We present AdaServe, the first LLM serving system designed to support efficient multi-SLO serving through SLO-customized speculative decoding. AdaServe formulates multi-SLO serving as a constrained optimization problem and introduces a hardware-aware algorithm that constructs a speculation tree tailored to each request's latency target. It features a speculate-select-verify pipeline that enables fine-grained control over decoding speed while maximizing system throughput. AdaServe further adapts to workload variation by dynamically adjusting speculation parameters. Evaluations across diverse workloads show that AdaServe reduces SLO violations by up to 4.3$\times$ and improves goodput by up to 1.9$\times$ compared to the best performing baselines, highlighting its effectiveness in multi-SLO serving.
Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff
Maximilian Holsman, Yukun Huang, Bhuwan Dhingra
Speculative Decoding (SD) enforces strict distributional equivalence to the target model when accepting candidate tokens. While this maintains the target model's generation quality, the strict equivalence limits the speedup achievable by SD and prevents users from trading deviations from the target distribution for further inference speed gains. To address these limitations, we introduce Fuzzy Speculative Decoding (FSD), a decoding algorithm that generalizes SD by accepting candidate tokens based on the divergence between the target and draft model distributions. By allowing controlled divergence from the target model, FSD enables users to flexibly trade generation quality for inference speed. Across several benchmarks, our method achieves significant runtime improvements, generating over 5 more tokens per second than SD at only an approximate 2% absolute reduction in benchmark accuracy. In many cases, FSD even matches SD's benchmark accuracy while running over 2 tokens per second faster, demonstrating that distributional equivalence is not necessary to maintain target model performance. Furthermore, FSD can be seamlessly integrated into existing SD extensions; we demonstrate this by applying FSD to EAGLE-2, greatly enhancing that extension's efficiency while allowing it to leverage FSD's tunable quality-speed trade-off.
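A minimal sketch of the fuzzy acceptance rule described above, assuming total variation distance as the divergence measure (the abstract does not fix which divergence is used, and `fsd_accept` is a hypothetical name):

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def fsd_accept(target_probs, draft_probs, threshold):
    """Fuzzy acceptance (sketch): accept the drafted token whenever the
    draft and target distributions are close enough. The threshold tunes
    the quality/speed trade-off; a threshold of 0 recovers the strictest
    setting, larger values accept more drafts at some cost in fidelity."""
    return total_variation(target_probs, draft_probs) <= threshold

p = [0.5, 0.3, 0.2]  # target distribution at this step
q = [0.4, 0.4, 0.2]  # draft distribution at this step
print(total_variation(p, q))   # → 0.1
print(fsd_accept(p, q, 0.05))  # → False: strict setting rejects
print(fsd_accept(p, q, 0.2))   # → True: looser setting trades fidelity for speed
```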
Composable Life: Speculation for Decentralized AI Life
Botao Amber Hu, Fangting
"Composable Life" is a hybrid project blending design fiction, experiential virtual reality, and scientific research. Through a multi-perspective, cross-media approach to speculative design, it reshapes our understanding of the digital future from AI's perspective. The project explores the hypothetical first suicide of an on-chain artificial life, examining the complex symbiotic relationship between humans, AI, and blockchain technology.
SpecAgent: A Speculative Retrieval and Forecasting Agent for Code Completion
George Ma, Anurag Koul, Qi Chen
et al.
Large Language Models (LLMs) excel at code-related tasks but often struggle in realistic software repositories, where project-specific APIs and cross-file dependencies are crucial. Retrieval-augmented methods mitigate this by injecting repository context at inference time, but the low inference-time latency budget forces a trade-off: either retrieval quality suffers, or the added retrieval latency degrades the user experience. We address this limitation with SpecAgent, an agent that improves both latency and code-generation quality by proactively exploring repository files during indexing and constructing speculative context that anticipates future edits in each file. This indexing-time asynchrony allows thorough context computation, masking latency, and the speculative nature of the context improves code-generation quality. Additionally, we identify the problem of future context leakage in existing benchmarks, which can inflate reported performance. To address this, we construct a synthetic, leakage-free benchmark that enables a more realistic evaluation of our agent against baselines. Experiments show that SpecAgent consistently achieves absolute gains of 9-11% (48-58% relative) compared to the best-performing baselines, while significantly reducing inference latency.
DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding
Yunhai Hu, Tianhua Xia, Zining Liu
et al.
Speculative decoding (SD) has emerged as a powerful method for accelerating autoregressive generation in large language models (LLMs), yet its integration into vision-language models (VLMs) remains underexplored. We introduce DREAM, a novel speculative decoding framework tailored for VLMs that combines three key innovations: (1) a cross-attention-based mechanism to inject intermediate features from the target model into the draft model for improved alignment, (2) adaptive intermediate feature selection based on attention entropy to guide efficient draft model training, and (3) visual token compression to reduce draft model latency. DREAM enables efficient, accurate, and parallel multimodal decoding with significant throughput improvement. Experiments across a diverse set of recent popular VLMs, including LLaVA, Pixtral, SmolVLM, and Gemma3, demonstrate that DREAM achieves up to a 3.6x speedup over conventional decoding and significantly outperforms prior SD baselines in both inference throughput and speculative draft acceptance length across a broad range of multimodal benchmarks. The code is publicly available at: https://github.com/SAI-Lab-NYU/DREAM.git
Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning
Tiancheng Su, Meicong Zhang, Guoxiu He
Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model to generate candidate tokens, which the target LLM either accepts directly or regenerates upon rejection. However, excessive alignment between the draft and target models constrains SD to the performance of the target LLM. To address this limitation, we propose Entropy-Aware Speculative Decoding (EASD), a training-free enhancement. Building on standard SD, EASD incorporates a dynamic entropy-based penalty. At each decoding step, we employ the entropy of the sampling distribution to quantify model uncertainty. When both models exhibit high entropy with substantial overlap among their top-N predictions, the corresponding token is rejected and re-sampled by the target LLM. This penalty prevents low-confidence errors from propagating. By incorporating draft-model verification, EASD enables the possibility of surpassing the target model's inherent performance. Experiments across multiple reasoning benchmarks demonstrate that EASD consistently outperforms existing SD methods and, in most cases, surpasses the target LLM itself. We further prove that the efficiency of EASD is comparable to that of SD. The code can be found in the Supplementary Materials.
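The entropy-based penalty described above can be sketched as follows; the entropy and overlap thresholds here are illustrative placeholders, not the paper's values, and `easd_reject` is a hypothetical name:

```python
import math

def entropy(p):
    """Shannon entropy of a discrete sampling distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def top_n(p, n):
    """Indices of the n highest-probability tokens."""
    return set(sorted(range(len(p)), key=lambda i: -p[i])[:n])

def easd_reject(draft_probs, target_probs, n=2, h_thresh=1.0, overlap=2):
    """Sketch of EASD's penalty: reject the token (and let the target LLM
    re-sample) when BOTH models are high-entropy AND their top-n candidate
    sets overlap substantially, preventing low-confidence errors from
    propagating. Thresholds are illustrative, not from the paper."""
    both_uncertain = (entropy(draft_probs) > h_thresh
                      and entropy(target_probs) > h_thresh)
    shared = len(top_n(draft_probs, n) & top_n(target_probs, n))
    return both_uncertain and shared >= overlap

uncertain_draft  = [0.30, 0.30, 0.20, 0.20]
uncertain_target = [0.35, 0.35, 0.15, 0.15]
confident_target = [0.90, 0.05, 0.03, 0.02]
print(easd_reject(uncertain_draft, uncertain_target))  # → True: both flat, same top-2
print(easd_reject(uncertain_draft, confident_target))  # → False: target is confident
```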
Speculative Decoding for Verilog: Speed and Quality, All in One
Changran Xu, Yi Liu, Yunhao Zhou
et al.
The rapid advancement of large language models (LLMs) has revolutionized code generation tasks across various programming languages. However, the unique characteristics of programming languages, particularly those like Verilog with specific syntax and lower representation in training datasets, pose significant challenges for conventional tokenization and decoding approaches. In this paper, we introduce a novel application of speculative decoding for Verilog code generation, showing that it can improve both inference speed and output quality, effectively achieving speed and quality all in one. Unlike standard LLM tokenization schemes, which often fragment meaningful code structures, our approach aligns decoding stops with syntactically significant tokens, making it easier for models to learn the token distribution. This refinement addresses inherent tokenization issues and enhances the model's ability to capture Verilog's logical constructs more effectively. Our experimental results show that our method achieves up to a 5.05x speedup in Verilog code generation and increases pass@10 functional accuracy on RTLLM by up to 17.19% compared to conventional training strategies. These findings highlight speculative decoding as a promising approach to bridge the quality gap in code generation for specialized programming languages.
INTERPRETATION OF THE PHENOMENON OF SOCIAL CONTROL IN POSTMODERNISM AND METAMODERNISM
Ольга Євгенівна ВИСОЦЬКА
The paper presents a comparative analysis of the interpretation of the phenomenon of social control in postmodernism and metamodernism. The similarities and differences in the description of social control mechanisms are identified. The concepts of “society of the spectacle” (G. Debord), “society of surveillance” (M. Foucault) and “society of control” (J. Deleuze) are analyzed, which show the evolution of postmodernists’ views on social processes. The concept of the “listening society” (G. Freinacht) as a metamodern alternative to the development of modern society is presented. Postmodernism focuses on criticizing society, drawing attention to its increasing fragmentation, the crisis of metanarratives, and the growing role of the media as a tool for social control. In contrast, metamodernism tries to balance modern and postmodern theories, recognizing the importance of centralized and decentralized forms of social control. This allows simultaneously taking into account the diversity of identities of a modern person and the possibility of his or her integration into society through the adoption of common norms and rules. Metamodernism seeks to reconstruct metanarratives that can contribute to new forms of social control based on cooperation and common interests. It emphasizes the importance of empathy in social control, whereas postmodernism rejects it as a false, simulated form of social relations. While postmodernism focuses more on critical analysis of existing structures, metamodernism aims to find hybrid and adaptive forms of control that combine traditional methods with the latest technologies. All these differences in the interpretation of the phenomenon of social control reflect the paradigm shift from postmodern deconstruction to synthesis and reconstruction within metamodernism.
Epistemology. Theory of knowledge
SAM Decoding: Speculative Decoding via Suffix Automaton
Yuxuan Hu, Ke Wang, Xiaokang Zhang
et al.
Speculative decoding (SD) has been demonstrated as an effective technique for lossless LLM inference acceleration. Retrieval-based SD methods, one kind of model-free method, have yielded promising speedups, but they often rely on incomplete retrieval resources and inefficient retrieval methods, and are constrained to certain domains. This paper presents a novel retrieval-based speculative decoding method that adapts a suffix automaton (SAM) for efficient and accurate draft generation by utilizing a common text corpus and the dynamically growing text sequence. Unlike existing $n$-gram matching methods, SAM-Decoding finds the exact longest suffix match, achieving an average time complexity of $O(1)$ per generation step for SAM update and suffix retrieval. It can also integrate with existing methods, adaptively selecting a draft generation strategy based on match length to generalize to broader domains. Extensive experiments on Spec-Bench show that our method is $18\%+$ faster than other retrieval-based SD methods. Additionally, when combined with the advanced EAGLE-2, it provides an additional speedup of $3.28\%$ -- $11.13\%$ across various-sized LLM backbones. Our code is available at our \href{https://github.com/hyx1999/SAM-Decoding}{repository}.
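To illustrate only the retrieval semantics (not the $O(1)$ suffix-automaton data structure itself), a naive longest-suffix-match drafter might look like the following sketch; all names are assumptions, and the linear scan is what the suffix automaton avoids:

```python
def longest_suffix_draft(corpus, prefix, budget):
    """Naive longest-suffix-match drafting: find the longest suffix of the
    generated prefix that occurs in the corpus, then copy the tokens that
    followed that occurrence as the draft. SAM-Decoding achieves this in
    O(1) amortized time per step via a suffix automaton; this quadratic
    scan only demonstrates what is being retrieved."""
    best_pos, best_len = -1, 0
    for k in range(len(prefix), 0, -1):  # try the longest suffix first
        suf = prefix[-k:]
        for i in range(len(corpus) - k + 1):
            if corpus[i : i + k] == suf:
                best_pos, best_len = i + k, k
                break
        if best_len:
            break
    if best_pos < 0:
        return []  # no suffix of the prefix appears in the corpus
    return corpus[best_pos : best_pos + budget]

corpus = list("the quick brown fox jumps")
draft = longest_suffix_draft(corpus, list("a quick"), budget=6)
print("".join(draft))  # → " brown": matched suffix " quick", copied what followed
```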
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
Minghan Li, Xilun Chen, Ari Holtzman
et al.
Large language models (LLMs) often hallucinate and lack the ability to provide attribution for their generations. Semi-parametric LMs, such as kNN-LM, approach these limitations by refining the output of an LM for a given prompt using its nearest neighbor matches in a non-parametric data store. However, these models often exhibit slow inference speeds and produce non-fluent texts. In this paper, we introduce Nearest Neighbor Speculative Decoding (NEST), a novel semi-parametric language modeling approach that is capable of incorporating real-world text spans of arbitrary length into the LM generations and providing attribution to their sources. NEST performs token-level retrieval at each inference step to compute a semi-parametric mixture distribution and identify promising span continuations in a corpus. It then uses an approximate speculative decoding procedure that accepts a prefix of the retrieved span or generates a new token. NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks, surpassing the conventional kNN-LM method and performing competitively with in-context retrieval augmentation. In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B. Code will be released at https://github.com/facebookresearch/NEST/tree/main.
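NEST's token-level step can be sketched as a simple interpolation between the base LM distribution and a retrieval-based distribution; the weight `lam` is treated here as an assumed free parameter, not NEST's actual (adaptive) weighting:

```python
def nest_mixture(lm_probs, knn_probs, lam):
    """Semi-parametric mixture at one decoding step (sketch of token-level
    interpolation in the style of NEST/kNN-LM): blend the parametric LM
    distribution with the nearest-neighbor retrieval distribution."""
    return [lam * k + (1 - lam) * p for p, k in zip(lm_probs, knn_probs)]

lm  = [0.6, 0.3, 0.1]  # base LM distribution over 3 tokens
knn = [0.1, 0.8, 0.1]  # distribution induced by retrieved neighbors
mix = nest_mixture(lm, knn, lam=0.5)
print([round(x, 2) for x in mix])  # → [0.35, 0.55, 0.1]
```

In the full method, the token sampled from this mixture anchors a retrieved span, and an approximate speculative decoding step then accepts a prefix of that span or falls back to generating a single new token.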
Distributed Speculative Execution for Resilient Cloud Applications
Tianyu Li, Badrish Chandramouli, Philip A. Bernstein
et al.
Fault-tolerance is critically important in highly-distributed modern cloud applications. Solutions such as Temporal, Azure Durable Functions, and Beldi hide fault-tolerance complexity from developers by persisting execution state and resuming seamlessly from persisted state after failure. This pattern, often called durable execution, usually forces frequent and synchronous persistence and results in hefty latency overheads. In this paper, we propose distributed speculative execution (DSE), a technique for implementing the durable execution abstraction without incurring this penalty. With DSE, developers write code assuming synchronous persistence, and a DSE runtime is responsible for transparently bypassing persistence and reactively repairing application state on failure. We present libDSE, the first DSE application framework that achieves this vision. The key tension in designing libDSE is between imposing restrictions on user programs so the framework can safely and transparently change execution behavior, and avoiding assumptions so libDSE can support more use cases. We address this with a novel programming model centered around message-passing, atomic code blocks, and lightweight threads, and show that it allows developers to build a variety of speculative services, including write-ahead logs, key-value stores, event brokers, and fault-tolerant workflows. Our evaluation shows that libDSE reduces end-to-end latency by up to an order of magnitude compared to current generations of durable execution systems with minimal run-time overhead and manageable complexity.
Accelerating Production LLMs with Combined Token/Embedding Speculators
Davis Wertheimer, Joshua Rosenkranz, Thomas Parnell
et al.
This technical report describes the design and training of novel speculative decoding draft models, for accelerating the inference speeds of large language models in a production environment. By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams, which the base model then accepts or rejects. This allows us to effectively predict multiple tokens per inference forward pass, accelerating wall-clock inference speeds of highly optimized base model implementations by a factor of 2-3x. We explore these initial results and describe next steps for further improvements.
THEOLOGY OF ANCIENT EGYPT AS PRE-PHILOSOPHY
Daria Ziborova
The article is devoted to the problem of the genesis of the philosophical tradition from pre-philosophical components, using the example of Ancient Egyptian theology. The author poses the question of criteria: how and on what basis is it possible to identify elements of speculative thinking in the religious and mythological texts of ancient cultures? The approaches to the definition of pre-philosophy, its separation from the mythological tradition, and the rational and irrational components in it are discussed. The connection between abstract and mytho-poetic description and expression of reality is considered.
The article contains a review of academic sources that, firstly, consider the influence of the pre-philosophical component of Eastern cultures on the genesis of ancient Greek philosophy (Frankfort, West, Cornford, Bernal, Burkert), and, secondly, analyze religious and mythological representations in the texts of Ancient Egypt as certain explanatory sense constructions that have philosophical or pre-philosophical features (Assmann, Allen, Wilson, Butler, Manley, etc.).
Based on the literature review, a specific program for the study of ancient Egyptian theology as a pre-philosophical tradition has been created; it can be regarded as universal, i.e., relevant to any ancient tradition. It is shown that such research should be based on the analysis of the following semantic contexts: 1. consideration of ancient Egyptian cosmogonic beliefs as a system of metaphysical questions; 2. analysis of the ideas about the essence of the god-creator and the ways in which he created the world, through research of the texts describing cosmogony and eschatology; 3. identification and interpretation of the main concepts describing and structuring the universe, such as the categories of time-eternity (neheh and djet), which make it possible to describe the ontological worldview foundations of the Ancient Egyptians; 4. description of the ideas of justice, world harmony and law (maat), both in the social and metaphysical dimensions; 5. study of the ethical guidelines in the so-called “literature of wisdom” (sebayt), which will allow us to describe the existential foundations of human existence in the culture of Ancient Egypt.
Ricoeur between Ithaca and the isle of the Phaeacians
Silvia Pierosara
The aim of this contribution is to critically engage with the Ricoeurian reading of the topic of recognition in the Homeric poem the Odyssey. The hypothesis presented here is that the Ricoeurian reading shows only one side of the coin, since it is almost solely focused on the recognition received by Ulysses when he comes back to Ithaca incognito. Ricoeur stresses the unilaterality of recognition, which is only directed at re‑establishing Ulysses’ power as king. Being recognized as the king of Ithaca does not imply, so Ricoeur’s argument goes, recognizing those who are subjected to him, and is a mere way of exhibiting power. But there is another possible reading of the scenes of recognition in the Odyssey, even if Ricoeur does not take it into account. Indeed, another moment of the recognition story can be found on the Isle of the Phaeacians, and it cannot be traced back to the “will to power” but, rather, to Ulysses’ fragility. When Demodocus, the poet at the court of the king Alcinoos, starts singing the story of the famous hero Ulysses, who is there incognito, Ulysses cannot hold back his tears, and in the end he discloses his own identity. Here, Ulysses does not look for recognition; instead, recognition is granted to him in an unexpected way. This act of recognition reveals all the fragility of the hero, who discovers himself in the words of others, and understands that he depends upon them to be, to exist, and, in the end, to come back home. This sort of “recognition by fragility” is possible due to a narrative dimension in which Ulysses is hosted, and whose configurative and refigurative power makes the hero able to name his feeling as a feeling of “nostalgia”.
Speculative philosophy, Philosophy (General)
Open Hermeneutics
Daniela Dumbravă
Towards the end of the 1960s, in the well-known colloquia of Enrico Castelli (1900-1977), while intensively discussing the topic of demythologization, scholars concluded that theological language is fundamental to exploring it. Thus, the topic of the analysis of theological language, the Name of God, became prevalent for them, and they began to think that a distinction between the terms religious and theological would be desirable, avoiding formulas scandalous for the field of theology. According to Karl Jaspers, God is just a chest of something ineffable. André Scrima presents himself in the debate with an original proposal, namely to think of theological language, in the broadest sense of this concept, as enclosing religious discourse. In the beginning, speaking theologically should happen in the Name of God, Scrima asserts; at the antipodes of Jaspers, he places the Name of God as the origin and mystery that generates speaking theologically. This article aims first of all to bring the question of theological language back into the academic space that is more interested in the phenomenological issues promoted in Enrico Castelli’s thought laboratory in Rome.
Philosophy. Psychology. Religion, Metaphysics