EAGLE-Pangu: Accelerator-Safe Tree Speculative Decoding on Ascend NPUs
Chang Han, Yijie Hu, Jingling Liu
Autoregressive decoding remains a primary bottleneck in large language model (LLM) serving, motivating speculative decoding methods that reduce expensive teacher-model invocations by verifying multiple candidate tokens per step. Tree-structured speculation further increases parallelism, but is often brittle when ported across heterogeneous backends and accelerator stacks, where attention masking, KV-cache layouts, and indexing semantics are not interchangeable. We present EAGLE-Pangu, a reproducible system that ports EAGLE-3-style tree speculative decoding to a Pangu teacher backend on Ascend NPUs. EAGLE-Pangu contributes (i) an explicit branch/commit cache manager built on the Cache API, (ii) accelerator-safe tree tensorization that removes undefined negative indices by construction and validates structural invariants, and (iii) a fused-kernel-compatible teacher verification path with a debuggable eager fallback. On 240 turns from MT-Bench and HumanEval-style prompts, EAGLE-Pangu improves end-to-end decoding throughput by 1.27x on average, up to 2.46x at p99, over teacher-only greedy decoding in the fused-kernel performance path. We also provide a fused-kernel-free reference path with structured traces and invariant checks to support reproducible debugging and ablation across execution modes and tree budgets.
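The abstract's "accelerator-safe tree tensorization that removes undefined negative indices by construction" can be illustrated with a minimal sketch. The function name, sentinel convention (roots point to themselves instead of -1), and invariant checks below are assumptions for illustration, not the paper's actual API:

```python
import numpy as np

def tensorize_tree(parents, num_nodes):
    """Build a boolean tree-attention mask where each node attends to
    itself and its ancestors, using only non-negative indices so the
    gather is safe on backends that treat negative indices as undefined.
    Roots carry parent == own index (a self-loop sentinel) instead of -1."""
    parents = np.asarray(parents)
    # Structural invariants: correct shape, indices in range,
    # and parents appear before children (topological order).
    assert parents.shape == (num_nodes,)
    assert ((parents >= 0) & (parents < num_nodes)).all()
    assert (parents <= np.arange(num_nodes)).all()
    mask = np.eye(num_nodes, dtype=bool)
    for i in range(num_nodes):
        if parents[i] != i:                 # not a root
            mask[i] |= mask[parents[i]]     # inherit ancestor visibility
    return mask
```

Because parents precede children, a single forward pass suffices to propagate ancestor visibility, and every index used in the gather is non-negative by construction.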
Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution
Yifan Sui, Han Zhao, Rui Ma
et al.
LLM-powered agents are emerging as a dominant paradigm for autonomous task solving. Unlike standard inference workloads, agents operate in a strictly serial "LLM-tool" loop, where the LLM must wait for external tool execution at every step. This execution model introduces severe latency bottlenecks. To address this problem, we propose PASTE, a Pattern-Aware Speculative Tool Execution method designed to hide tool latency through speculation. PASTE is based on the insight that although agent requests are semantically diverse, they exhibit stable application-level control flows (recurring tool-call sequences) and predictable data dependencies (parameter passing between tools). By exploiting these properties, PASTE improves agent serving performance through speculative tool execution. Experimental results against state-of-the-art baselines show that PASTE reduces average task completion time by 48.5% and improves tool execution throughput by 1.8x.
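The "recurring tool-call sequences" insight can be sketched as a frequency cache over observed tool transitions; given the last tool called, the agent runtime speculatively launches the most frequent successor. This is a hypothetical simplification (class and method names are invented), not PASTE's actual design:

```python
from collections import defaultdict, Counter

class ToolPatternCache:
    """Record observed tool-call sequences and predict the most likely
    next tool, so it can be executed speculatively while the LLM is
    still generating its next action."""
    def __init__(self):
        self.next_counts = defaultdict(Counter)

    def record(self, tool_sequence):
        # Count each observed (previous tool -> next tool) transition.
        for prev, nxt in zip(tool_sequence, tool_sequence[1:]):
            self.next_counts[prev][nxt] += 1

    def speculate_next(self, last_tool):
        counts = self.next_counts.get(last_tool)
        if not counts:
            return None  # no pattern observed; fall back to serial execution
        return counts.most_common(1)[0][0]
```

In a real system the speculative call would be issued asynchronously and discarded if the LLM's actual next action differs, so correctness never depends on the prediction.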
FlowCast: Trajectory Forecasting for Scalable Zero-Cost Speculative Flow Matching
Divya Jyoti Bajpai, Shubham Agarwal, Apoorv Saxena
et al.
Flow Matching (FM) has recently emerged as a powerful approach for high-quality visual generation. However, FM models' prohibitively slow inference, caused by the large number of denoising steps, limits their potential use in real-time or interactive applications. Existing acceleration methods, such as distillation, truncation, or consistency training, either degrade quality, incur costly retraining, or lack generalization. We propose FlowCast, a training-free speculative generation framework that accelerates inference by exploiting the fact that FM models are trained to preserve constant velocity. FlowCast speculates future velocity by extrapolating current velocity without incurring additional time cost, and accepts it if it is within a mean-squared error threshold. This constant-velocity forecasting allows redundant steps in stable regions to be aggressively skipped while retaining precision in complex ones. FlowCast is a plug-and-play framework that integrates seamlessly with any FM model and requires no auxiliary networks. We also present a theoretical analysis and bound the worst-case deviation between speculative and full FM trajectories. Empirical evaluations demonstrate that FlowCast achieves $>2.5\times$ speedup in image generation, video generation, and editing tasks, outperforming existing baselines with no quality loss as compared to standard full generation.
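The constant-velocity acceptance test described above admits a compact sketch: if the current velocity agrees with the previous one within an MSE threshold, the integrator speculates that the velocity stays constant and takes a double-length step, saving a model evaluation. Function name, step rule, and threshold are illustrative assumptions, not FlowCast's implementation:

```python
import numpy as np

def flowcast_step(x, v_prev, v_curr, t, dt, mse_threshold=1e-4):
    """One speculative Euler step for a flow-matching sampler.
    Returns (new_x, new_t, speculated)."""
    mse = float(np.mean((v_curr - v_prev) ** 2))
    if mse < mse_threshold:
        # Velocity is locally stable: assume it stays constant and
        # advance by 2*dt at once, skipping one denoising evaluation.
        return x + 2 * dt * v_curr, t + 2 * dt, True
    # Velocity is changing: fall back to a single standard step.
    return x + dt * v_curr, t + dt, False
```

Stable (near-linear) trajectory regions thus trigger skips, while regions with curvature keep the full step count, matching the abstract's "aggressively skip redundant steps in stable regions" behavior.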
DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving
Fengze Yu, Leshu Li, Brad McDanel
et al.
Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative decoding (SD) techniques accelerate token generation but remain confined to single-node execution. We propose DSD, a distributed speculative decoding framework that extends SD to multi-device deployments through coordinated draft-target execution. Given the lack of prior work on simulating this paradigm, we first introduce DSD-Sim, a discrete-event simulator that captures network, batching, and scheduling dynamics. Building on insights from DSD-Sim, we further design an Adaptive Window Control (AWC) policy that dynamically adjusts speculation window size to optimize throughput. Experiments across diverse workloads show that DSD achieves up to 1.1x speedup and 9.7% higher throughput over existing SD baselines, enabling agile and scalable LLM serving across edge and cloud.
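The Adaptive Window Control (AWC) policy, which "dynamically adjusts speculation window size to optimize throughput," can be sketched as multiplicative growth/shrink driven by the observed acceptance rate. The thresholds, bounds, and update rule below are assumptions for illustration, not DSD's published policy:

```python
def adapt_window(window, accepted, proposed, lo=0.5, hi=0.8,
                 min_w=1, max_w=16):
    """Grow the speculation window when acceptance is high (drafting is
    usually right, so speculate more), shrink it when acceptance is low
    (drafts are being wasted)."""
    rate = accepted / max(proposed, 1)
    if rate > hi:
        window = min(window * 2, max_w)
    elif rate < lo:
        window = max(window // 2, min_w)
    return window  # acceptance rate in [lo, hi]: keep window unchanged
```

In a distributed setting, the acceptance statistics would be aggregated per draft-target pair, so each edge device converges to a window size matched to its own network and batching conditions.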
Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
Jinhui Wei, Ye Huang, Yuhui Zhou
et al.
In-situ LLM inference on end-user devices has gained significant interest due to its privacy benefits and reduced dependency on external infrastructure. However, as the decoding process is memory-bandwidth-bound, the diverse processing units in modern end-user devices cannot be fully exploited, resulting in slow LLM inference. This paper presents Ghidorah, a LLM inference system for end-user devices with the unified memory architecture. The key idea of Ghidorah can be summarized in two steps: 1) leveraging speculative decoding approaches to enhance parallelism, and 2) ingeniously distributing workloads across multiple heterogeneous processing units to maximize computing power utilization. Ghidorah includes the hetero-core model parallelism (HCMP) architecture and the architecture-aware profiling (ARCA) approach. The HCMP architecture guides partitioning by leveraging the unified memory design of end-user devices and adapting to the hybrid computational demands of speculative decoding. The ARCA approach is used to determine the optimal speculative strategy and partitioning strategy, balancing acceptance rate with parallel capability to maximize the speedup. Additionally, we optimize sparse computation on ARM CPUs. Experimental results show that Ghidorah can achieve up to 7.6x speedup in the dominant LLM decoding phase compared to the sequential decoding approach in NVIDIA Jetson NX.
Draft, Verify, and Improve: Toward Training-Aware Speculative Decoding
Shrenik Bhansali, Larry Heck
Autoregressive (AR) decoding is a major latency bottleneck for large language models. Speculative decoding (SD) accelerates AR by letting a drafter propose multi-token blocks that a verifier accepts or rejects. However, many SD systems require heavy offline training or extra components. These choices raise data/compute cost and can yield brittle drafters under distribution drift. We introduce \emph{Draft, Verify, \& Improve (DVI)}, a training-aware self-speculative framework that combines inference with continual online learning. We partition an LLM into a drafter and a verifier, and during generation, verifier accept/reject decisions are converted into supervision signals and used to update the drafter head. A simple \emph{KL$\rightarrow$RL} schedule bootstraps calibration via online distillation and then adds reward-masked cross-entropy with an on-policy policy-gradient term, preserving lossless, single-model deployment. On Spec-Bench, DVI achieves a $2.16\times$ wall-time speedup, on par with SoTA approaches like EAGLE-2, while requiring orders of magnitude less training data, and ablations show that DVI outperforms KL-only online distillation. DVI demonstrates that \emph{training-aware} self-speculation can deliver state-of-the-art, lossless speedups with minimal training overhead.
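The "reward-masked cross-entropy" signal, in which verifier accept/reject decisions become free supervision for the drafter, can be sketched as ordinary cross-entropy restricted to accepted positions. This is an assumed form for illustration; DVI's actual loss also includes a policy-gradient term and a KL distillation phase:

```python
import numpy as np

def reward_masked_ce(logits, targets, accepted):
    """Mean cross-entropy over only the positions the verifier accepted.
    logits: (T, V) drafter scores; targets: (T,) verified token ids;
    accepted: (T,) 0/1 mask from the verifier's accept/reject decisions."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable softmax over the vocabulary axis.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    nll = -np.log(probs[np.arange(len(targets)), targets] + 1e-12)
    mask = np.asarray(accepted, dtype=float)
    return float((nll * mask).sum() / max(mask.sum(), 1.0))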
Speculative Design of Equitable Robotics: Queer Fictions and Futures
Minja Axelsson
This paper examines the speculative topic of equitable robots through an exploratory essay format. It focuses specifically on robots by and for LGBTQ+ populations. It aims to provoke thought and conversations in the field about what aspirational queer robotics futures may look like, both in the arts and sciences. First, it briefly reviews the state-of-the-art of queer robotics in fiction and science, drawing together threads from each. Then, it discusses queering robots through three speculative design proposals for queer robot roles: 1) reflecting the queerness of their ''in-group'' queer users, building and celebrating ''in-group'' identity, 2) a new kind of queer activism by implementing queer robot identity performance to interact with ''out-group'' users, with a goal of reducing bigotry through familiarisation, and 3) a network of queer-owned robots, through which the community could reach each other, and distribute and access important resources. The paper then questions whether robots should be queered, and what ethical implications this raises. Finally, the paper makes suggestions for what aspirational queer robotics futures may look like, and what would be required to get there.
Principled Coarse-Grained Acceptance for Speculative Decoding in Speech
Moran Yanuka, Paul Dixon, Eyal Finkelshtein
et al.
Speculative decoding accelerates autoregressive speech generation by letting a fast draft model propose tokens that a larger target model verifies. However, for speech LLMs that generate acoustic tokens, exact token matching is overly restrictive: many discrete tokens are acoustically or semantically interchangeable, reducing acceptance rates and limiting speedups. We introduce Principled Coarse-Graining (PCG), which verifies proposals at the level of Acoustic Similarity Groups (ASGs) derived from the target model's embedding space. By splitting each token's probability mass across the overlapping groups that contain it, we define an overlap-aware coarse-grained distribution and perform rejection sampling on the resulting group variable. This yields an exactness guarantee at the group level while allowing the accepted draft token to stand in for any member of the group in practice. On LibriTTS, PCG increases acceptance and throughput relative to standard speculative decoding and prior speech-specific relaxations while maintaining intelligibility and speaker similarity. These results suggest acoustically aware, group-level acceptance as a simple and general way to accelerate speech token generation while maintaining speech quality.
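The overlap-aware coarse-grained distribution and group-level acceptance test can be sketched as follows: each token's probability mass is split evenly across the groups that contain it, and the standard min(1, P/Q) rejection test is applied to the group variable rather than the token. Function names and the even-split rule are illustrative assumptions about the paper's construction:

```python
import random
from collections import defaultdict

def group_dist(token_probs, groups):
    """Coarse-grain a token distribution onto (possibly overlapping)
    acoustic similarity groups: each token's mass is split evenly
    across the groups that contain it."""
    membership = defaultdict(int)
    for g in groups:
        for tok in g:
            membership[tok] += 1
    return [sum(token_probs.get(tok, 0.0) / membership[tok] for tok in g)
            for g in groups]

def accept_group(draft_tok, groups, p_draft, p_target, rng=random.random):
    """Group-level rejection sampling: accept the draft token if the
    group it belongs to survives the min(1, P_target/P_draft) test,
    letting the token stand in for any member of its group."""
    q = group_dist(p_draft, groups)
    p = group_dist(p_target, groups)
    g_idx = next(i for i, g in enumerate(groups) if draft_tok in g)
    return rng() < min(1.0, p[g_idx] / max(q[g_idx], 1e-12))
```

Because the test operates on group mass, acoustically interchangeable tokens no longer cause rejections, which is the mechanism behind the higher acceptance rates reported.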
RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding
Guanzheng Chen, Qilong Feng, Jinjie Ni
et al.
The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference presents significant efficiency challenges. While Speculative Decoding (SD) traditionally accelerates inference using smaller draft models, its effectiveness diminishes substantially in long-context scenarios due to memory-bound KV cache operations. We introduce Retrieval-Augmented Speculative Decoding (RAPID), which leverages RAG for both accelerating and enhancing generation quality in long-context inference. RAPID introduces the RAG drafter, a draft LLM operating on shortened retrieval contexts, to speculate on the generation of long-context target LLMs. Our approach enables a new paradigm where same-scale or even larger LLMs can serve as RAG drafters while maintaining computational efficiency. To fully leverage the potentially superior capabilities of stronger RAG drafters, we develop an inference-time knowledge transfer that enriches the target distribution via RAG. Extensive experiments on the LLaMA-3.1 and Qwen2.5 backbones demonstrate that RAPID effectively integrates the strengths of both RAG and long-context LLMs, achieving significant performance improvements (e.g., from 39.33 to 42.83 on InfiniteBench for LLaMA-3.1-8B) with more than 2x speedups for long-context inference. Our analyses also reveal the robustness of RAPID across various context lengths and retrieval quality.
Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency
Ruixiao Li, Fahao Chen, Peng Li
Speculative decoding accelerates Large Language Model (LLM) inference by employing a small speculative model (SSM) to generate multiple candidate tokens and verify them using the LLM in parallel. This technique has been widely integrated into LLM inference serving systems. However, inference requests typically exhibit uncertain execution time, which poses a significant challenge for efficient request scheduling in these systems. Existing work estimates execution time based solely on predicted output length, which can be inaccurate because execution time depends on both output length and the token acceptance rate of verification by the LLM. In this paper, we propose a semi-clairvoyant request scheduling algorithm called Least-Attained/Perceived-Service for Speculative Decoding (LAPS-SD). Given a number of inference requests, LAPS-SD can effectively minimize average inference latency by adaptively scheduling requests according to their features during decoding. When the token acceptance rate is dynamic and execution time is difficult to estimate, LAPS-SD maintains multiple priority queues and allows request execution preemption across different queues. Once the token acceptance rate becomes stable, LAPS-SD can accurately estimate the execution time and schedule requests accordingly. Extensive experiments show that LAPS-SD reduces inference latency by approximately 39\% compared to state-of-the-art scheduling methods.
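The multi-queue, least-attained-service phase of the scheduler can be sketched as a multilevel feedback queue: requests start at the highest priority and are demoted as their attained service exceeds each level's quantum, so short requests finish first without needing accurate execution-time estimates. Class names and quantum values are hypothetical simplifications, not LAPS-SD's actual parameters:

```python
from collections import deque

class LapsScheduler:
    """Minimal least-attained-service scheduler with demotion across
    priority queues; preemption happens naturally because a demoted
    request waits behind fresher, higher-priority requests."""
    def __init__(self, quanta=(1, 4, 16)):
        self.quanta = quanta
        self.queues = [deque() for _ in quanta]
        self.attained = {}

    def submit(self, req_id):
        self.attained[req_id] = 0
        self.queues[0].append(req_id)        # new requests enter at top

    def next_request(self):
        for q in self.queues:                # serve highest non-empty queue
            if q:
                return q.popleft()
        return None

    def report_service(self, req_id, amount, finished=False):
        if finished:
            del self.attained[req_id]
            return
        self.attained[req_id] += amount
        # Demote to the first level whose quantum is not yet exhausted.
        level = sum(self.attained[req_id] >= q for q in self.quanta)
        self.queues[min(level, len(self.queues) - 1)].append(req_id)
```

Once acceptance rates stabilize and execution times become predictable, the paper's full algorithm switches from this estimate-free policy to explicit shortest-remaining-time-style ordering.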
Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling
Shengyin Sun, Yiming Li, Xing Li
et al.
Test-time scaling has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs) by allocating additional computational resources during inference. However, this paradigm is inherently inefficient due to the generation of redundant and repetitive reasoning traces, leading to significant computational overhead. Speculative decoding offers a promising avenue for mitigating this inefficiency, yet its efficacy in the structured, repetition-rich context of test-time scaling remains largely unexplored. To bridge this gap, we introduce the first comprehensive benchmark designed to evaluate speculative decoding methods for accelerating LLM test-time scaling. Our benchmark provides consistent experimental protocols across representative test-time scaling paradigms (e.g., Best-of-N sampling and multi-round thinking), enabling a fair comparison of three major categories of speculative decoding: model-based, training-based, and n-gram-based methods. Extensive experiments reveal that simple n-gram-based methods effectively capture repetitive patterns, demonstrating unique potential in accelerating test-time scaling. This finding suggests the value of integrating n-gram-based methods with model-based or training-based approaches to balance acceleration for both repetitive and diverse reasoning in test-time scaling. We hope this benchmark spurs further research on speculative decoding for test-time scaling, enabling faster and more practical reasoning in LLMs through better handling of repetitive and diverse reasoning paths.
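Why n-gram drafting captures repetitive reasoning traces is easy to see in a toy sketch: look up the last n-1 tokens in the earlier context and, on a match, propose the tokens that followed. This generic n-gram drafter (names and parameters assumed) is not any specific benchmarked method:

```python
def ngram_draft(context, n=3, k=4):
    """Propose up to k draft tokens by matching the trailing (n-1)-gram
    of `context` against earlier positions, most recent match first.
    Repetitive traces make such matches frequent, so drafts are cheap
    and often accepted."""
    if len(context) < n:
        return []
    key = tuple(context[-(n - 1):])
    for i in range(len(context) - n, -1, -1):
        if tuple(context[i:i + n - 1]) == key:
            # Return the tokens that historically followed this n-gram.
            return context[i + n - 1:i + n - 1 + k]
    return []
```

Because the drafter is just a lookup over the request's own context, it adds no model calls, which is the "zero training, zero extra model" appeal the benchmark's results highlight for repetition-heavy workloads.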
FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference
Xing Liu, Lizhuo Luo, Ming Tang
et al.
Distributed inference serves as a promising approach to enabling the inference of large language models (LLMs) at the network edge. It distributes the inference process to multiple devices to ensure that the LLMs can fit into the device memory. Recent pipeline-based approaches have the potential to parallelize communication and computation, which helps reduce inference latency. However, the benefit diminishes when inference requests at the network edge are sparse, where the pipeline typically runs at low utilization. To enable efficient distributed LLM inference at the edge, we propose \textbf{FlowSpec}, a pipeline-parallel tree-based speculative decoding framework. FlowSpec incorporates three key mechanisms to improve decoding efficiency: 1) score-based step-wise verification prioritizes more important draft tokens to bring accepted tokens forward earlier; 2) efficient draft management prunes invalid tokens while maintaining correct causal relationships during verification; 3) dynamic draft expansion strategies supply high-quality speculative inputs. These techniques work in concert to enhance both pipeline utilization and speculative efficiency. We evaluate FlowSpec on a real-world testbed against other baselines. Experimental results demonstrate that our proposed framework significantly improves inference speed across diverse models and configurations, achieving speedup ratios of 1.37$\times$-1.73$\times$ compared to baselines. Our code is publicly available at \href{https://github.com/Leosang-lx/FlowSpec#}{https://github.com/Leosang-lx/FlowSpec\#}.
Grouped Speculative Decoding for Autoregressive Image Generation
Junhyuk So, Juncheol Shin, Hyunho Kook
et al.
Recently, autoregressive (AR) image models have demonstrated remarkable generative capabilities, positioning themselves as a compelling alternative to diffusion models. However, their sequential nature leads to long inference times, limiting their practical scalability. In this work, we introduce Grouped Speculative Decoding (GSD), a novel, training-free acceleration method for AR image models. While recent studies have explored Speculative Decoding (SD) as a means to speed up AR image generation, existing approaches either provide only modest acceleration or require additional training. Our in-depth analysis reveals a fundamental difference between language and image tokens: image tokens exhibit inherent redundancy and diversity, meaning multiple tokens can convey valid semantics. However, traditional SD methods are designed to accept only a single most-likely token, which fails to leverage this difference, leading to excessive false-negative rejections. To address this, we propose a new SD strategy that evaluates clusters of visually valid tokens rather than relying on a single target token. Additionally, we observe that static clustering based on embedding distance is ineffective, which motivates our dynamic GSD approach. Extensive experiments show that GSD accelerates AR image models by an average of 3.7x while preserving image quality, all without requiring any additional training. The source code is available at https://github.com/junhyukso/GSD
THE ARCHETYPE OF THE MOTHER: THE PSYCHOANALYTICAL BASIS OF FAMILY AND COMMUNITY SOCIAL WORK
Oksana Osetrova
Periods of crisis and social cataclysms are accompanied by an exacerbation of social problems, which are, in particular, of an existential and psychoanalytic nature (the search for the meaning of life, overcoming despair, understanding thanatological issues, the essence of the struggle between Eros and Thanatos, etc.). In this context, the problem of motherhood is also relevant; its secrets are rooted, in particular, in the archetypes of the Mother, the Great Mother, and the 'double mother'. In the situation of the Russian-Ukrainian war, the loss of a family in general, or of one parent in particular, is felt especially acutely by the modern person. Given that, on the one hand, children are the future of the country, and on the other, the heroes and heroines are the children of Ukraine, the role of the Mother and her influence on the formation and development of the Child's personality is becoming ever more important to every person. Accordingly, the research attention of this article is focused on the essence of the Mother archetype as one of the innate universal structures, the elucidation of which is the purpose of this article. Highlighting the key points of C. G. Jung's teachings on the archetype and the mother complex helps to reveal the psychoanalytic foundations of family studies, which are important for both recipients (individuals/families) and providers (social work professionals) of social services. It is important to note that C. G. Jung emphasised the elusiveness of the studied archetype, moving from specific examples to generalisations and from direct understanding to figurative meaning. The scientist, while recognising the universality of the folk psychological image of the mother, at the same time stressed the importance of the figure of one's own mother for each individual. The analysis of C. G. Jung's views on the essence of the Mother archetype and the mother complex derived from it, which affects sons and daughters differently, allows us to conclude that it is ambivalent in nature, encompassing both positive and negative aspects. Taking this entire range of diametrically opposed aspects into account, by both recipients of social services (individuals/families, children/parents) and their providers (social work professionals), will contribute to progress in the development of the institutions of family and motherhood in the changing realities of today.
From S-matrix theory to strings: Scattering data and the commitment to non-arbitrariness
Robert van Leeuwen
The early history of string theory is marked by a shift from strong interaction physics to quantum gravity. The first string models and associated theoretical framework were formulated in the late 1960s and early 1970s in the context of the S-matrix program for the strong interactions. In the mid-1970s, the models were reinterpreted as a potential theory unifying the four fundamental forces. This paper provides a historical analysis of how string theory was developed out of S-matrix physics, aiming to clarify how modern string theory, as a theory detached from experimental data, grew out of an S-matrix program that was strongly dependent upon observable quantities. Surprisingly, the theoretical practice of physicists already turned away from experiment before string theory was recast as a potential unified quantum gravity theory. With the formulation of dual resonance models (the "hadronic string theory"), physicists were able to determine almost all of the models' parameters on the basis of theoretical reasoning. It was this commitment to "non-arbitrariness", i.e., a lack of free parameters in the theory, that initially drove string theorists away from experimental input, and not the practical inaccessibility of experimental data in the context of quantum gravity physics. This is an important observation when assessing the role of experimental data in string theory.
SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding
Zhenglin Wang, Jialong Wu, Yilong Lai
et al.
Large Language Models (LLMs) demonstrate remarkable emergent abilities across various tasks, yet fall short on complex reasoning and planning tasks. Tree-search-based reasoning methods address this by surpassing the capabilities of chain-of-thought prompting, encouraging exploration of intermediate steps. However, such methods introduce significant inference latency due to the systematic exploration and evaluation of multiple thought paths. This paper introduces SeeD, a novel and efficient inference framework to optimize runtime speed and GPU memory management concurrently. By employing scheduled speculative execution, SeeD efficiently handles multiple iterations of thought generation and state evaluation, leveraging a rounds-scheduled strategy to manage draft model dispatching. Extensive experimental evaluations on three reasoning datasets demonstrate the superior speedup performance of SeeD, providing a viable path for batched inference in training-free speculative decoding.
Noisy Positive-Unlabeled Learning with Self-Training for Speculative Knowledge Graph Reasoning
Ruijie Wang, Baoyu Li, Yichen Lu
et al.
This paper studies the speculative reasoning task on real-world knowledge graphs (KG) that contain both the \textit{false negative issue} (i.e., potential true facts being excluded) and the \textit{false positive issue} (i.e., unreliable or outdated facts being included). State-of-the-art methods fall short in speculative reasoning ability, as they assume the correctness of a fact is solely determined by its presence in the KG, making them vulnerable to false negative/positive issues. The new reasoning task is formulated as a noisy Positive-Unlabeled learning problem. We propose a variational framework, namely nPUGraph, that jointly estimates the correctness of both collected and uncollected facts (which we call the \textit{label posterior}) and updates model parameters during training. The label posterior estimation facilitates speculative reasoning from two perspectives. First, it improves the robustness of a label posterior-aware graph encoder against false positive links. Second, it identifies missing facts to provide high-quality grounds for reasoning. They are unified in a simple yet effective self-training procedure. Empirically, extensive experiments on three benchmark KGs and one Twitter dataset with various degrees of false negative/positive cases demonstrate the effectiveness of nPUGraph.
Methodological Reflections on the MOND/Dark Matter Debate
Patrick M. Duerr, William J. Wolf
The paper re-examines the principal methodological questions, arising in the debate over the cosmological standard model's postulate of Dark Matter vs. rivalling proposals that modify standard (Newtonian and general-relativistic) gravitational theory, the so-called Modified Newtonian Dynamics (MOND) and its subsequent extensions. What to make of such seemingly radical challenges of cosmological orthodoxy? In the first part of our paper, we assess MONDian theories through the lens of key ideas of major 20th century philosophers of science (Popper, Kuhn, Lakatos, and Laudan), thereby rectifying widespread misconceptions and misapplications of these ideas common in the pertinent MOND-related literature. None of these classical methodological frameworks, which render precise and systematise the more intuitive judgements prevalent in the scientific community, yields a favourable verdict on MOND and its successors -- contrary to claims in the MOND-related literature by some of these theories' advocates; the respective theory appraisals are largely damning. Drawing on these insights, the paper's second part zooms in on the most common complaint about MONDian theories, their ad-hocness. We demonstrate how the recent coherentist model of ad-hocness captures, and fleshes out, the underlying -- but too often insufficiently articulated -- hunches underlying this critique. MONDian theories indeed come out as severely ad hoc: they do not cohere well with either theoretical or empirical-factual background knowledge. In fact, as our complementary comparison with the cosmological standard model's Dark Matter postulate shows, with respect to ad-hocness, MONDian theories fare worse than the cosmological standard model.
Grace de Laguna as Continental Philosopher?
Marguerite La Caze
ABSTRACT Joel Katzav’s article describes and explains the realist perspectivist views of Grace de Laguna, showing the distinctiveness of her positions in a number of fields. My focus will be on her views of the self and persons, and how they are embedded in their communities, experience emotions, and develop morality. As Katzav outlines, de Laguna’s position can be characterized as a form of speculative philosophy that develops an ontology of modes of being. Katzav sees speculative philosophy and naturalism intertwining in her work as de Laguna extends and criticizes science (the speculative aspect) as well as drawing on it (the naturalistic aspect). De Laguna’s project is to create a philosophical system that can be communicated to others. This system, according to Katzav, uses imagination, metaphor, and paradoxical formulations, such as ‘qualitative uniqueness’ to describe individuality, in a way that is common in continental philosophy. My commentary examines de Laguna’s views as they are set out by Katzav and suggests how his article opens up lines of philosophical inquiry concerning the self, community and morality. De Laguna’s views on these issues can also usefully be compared to those of Hannah Arendt to see how her articulation of these themes intersects with and differs from Arendt’s ideas.
Grace Andrus de Laguna: A Perspective from the History of Linguistics
Brigitte Nerlich
ABSTRACT Grace de Laguna was a philosopher working in the first part of the twentieth century on analytic and speculative philosophy, as well as on the psychology and philosophy of language, especially the social function of language. Joel Katzav’s lead essay focuses mainly on the former part of her work, while my commentary focuses mostly on the latter. Katzav shows how her work played a role in the development of analytic philosophy, I try to show how her work played a role in the development of pragmatic linguistics, whose development overlapped with some aspects of analytic philosophy. I situate it in the context of an emerging ‘pragmatic Weltanschauung’ that was embraced by many working in linguistics, philology, philosophy, psychology, anthropology and what would later be called linguistic pragmatics.