Results for "Speculative philosophy"
Showing 20 of ~1,427,575 results · from DOAJ, arXiv, CrossRef, Semantic Scholar
Shenggui Li, Chao Wang, Yikai Zhu et al.
Large language models incur high inference latency due to sequential autoregressive decoding. Speculative decoding alleviates this bottleneck by using a lightweight draft model to propose multiple tokens for batched verification. However, its adoption has been limited by the lack of high-quality draft models and scalable training infrastructure. We introduce SpecForge, an open-source, production-oriented framework for training speculative decoding models with full support for EAGLE-3. SpecForge incorporates target-draft decoupling, hybrid parallelism, optimized training kernels, and integration with production-grade inference engines, enabling up to 9.9x faster EAGLE-3 training for Qwen3-235B-A22B. In addition, we release SpecBundle, a suite of production-grade EAGLE-3 draft models trained with SpecForge for mainstream open-source LLMs. Through a systematic study of speculative decoding training recipes, SpecBundle addresses the scarcity of high-quality drafts in the community, and our draft models achieve up to 4.48x end-to-end inference speedup on SGLang, establishing SpecForge as a practical foundation for real-world speculative decoding deployment.
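The draft-propose/verify loop that SpecForge trains models for can be sketched with toy deterministic "models" (the transition tables and function names below are our own illustration, not SpecForge or EAGLE-3 code):

```python
# Toy sketch of the draft-then-verify loop behind speculative decoding.
# Both "models" are deterministic next-token functions over a tiny vocab;
# real systems use a small neural draft model and a large target LLM.

def draft_next(ctx):            # cheap draft model (hypothetical)
    return {"a": "b", "b": "c", "c": "a"}.get(ctx[-1], "a")

def target_next(ctx):           # expensive target model (hypothetical)
    return {"a": "b", "b": "c", "c": "d"}.get(ctx[-1], "a")

def speculative_step(ctx, k=3):
    """Draft k tokens, then accept the longest prefix the target agrees with,
    plus one corrected token from the target (greedy verification)."""
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(ctx + proposal))
    accepted = []
    for tok in proposal:                 # verified in one parallel pass in practice
        if target_next(ctx + accepted) == tok:
            accepted.append(tok)
        else:
            break
    accepted.append(target_next(ctx + accepted))  # target's correction token
    return accepted

out = speculative_step(["a"])   # ["b", "c", "d"]: two draft tokens accepted
```

Every iteration emits at least one token (the target's correction), so the loop is never slower than vanilla decoding in token count; the speedup comes from accepting several draft tokens per target pass.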
Yongchang Hao, Lili Mou
Speculative sampling (SpS) has been successful in accelerating the decoding throughput of auto-regressive large language models by leveraging smaller draft models. SpS strictly enforces the generated distribution to match that of the verifier LLM. This is unnecessarily restrictive, as slight variations of the verifier's distribution, such as sampling with top-$k$ or temperature, would also be acceptable. Typical acceptance sampling (TAS) alleviates this issue by accepting more tokens using entropy-based heuristics. However, this approach distorts the verifier distribution, potentially degrading output quality when the verifier encodes critical information. In this work, we formalize the speculative sampling algorithm through the lens of constrained optimization. Based on this formulation, we propose Cactus (constrained acceptance speculative sampling), a method that guarantees controlled divergence from the verifier distribution while increasing acceptance rates. Empirical results across a wide range of benchmarks confirm the effectiveness of our approach.
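The baseline SpS acceptance rule that Cactus relaxes is the standard one: accept draft token $x$ with probability $\min(1, p(x)/q(x))$, and on rejection resample from the normalized residual $\max(p - q, 0)$. A minimal sketch with dict-based distributions (this is the classic rule, not the Cactus algorithm itself):

```python
import random

def accept_or_resample(x, p, q, rng=random.random):
    """Standard speculative-sampling acceptance test: accept draft token x
    with prob min(1, p[x]/q[x]); otherwise resample from norm(max(p - q, 0)).
    p and q map token -> probability for the verifier and draft model."""
    if rng() < min(1.0, p.get(x, 0.0) / q[x]):
        return x, True
    # Rejected: sample from the residual distribution max(p - q, 0), renormalized.
    residual = {t: max(p.get(t, 0.0) - q.get(t, 0.0), 0.0) for t in p}
    z = sum(residual.values())
    r, acc = rng() * z, 0.0
    for t, w in residual.items():
        acc += w
        if r <= acc:
            return t, False
    return max(p, key=p.get), False  # numerical fallback

p = {"a": 0.7, "b": 0.3}   # verifier distribution (toy)
q = {"a": 0.5, "b": 0.5}   # draft distribution (toy)
```

This construction is what makes vanilla SpS exactly lossless: the marginal distribution of the emitted token is precisely $p$, which is also the rigidity Cactus trades away for higher acceptance.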
Rui Pan, Zhuofu Chen, Hongyi Liu et al.
Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that dLLM's speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It "fails fast" by spending minimal compute in hard-to-speculate regions to shrink speculation latency and "wins big" by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9$\times$ speedup over vanilla decoding, 1.7$\times$ over the best naive dLLM drafter, and 1.7$\times$ over EAGLE-3 across diverse models and workloads. We open-source FailFast at https://github.com/ruipeterpan/failfast.
Mingyu Yang, Jae-Young Choi, Kihyo Moon et al.
Speculative decoding accelerates large language model inference, but its reliance on a fixed speculation length is suboptimal in large-batch serving environments with diverse requests. This paper explores a new direction for dynamic adaptation by investigating a novel class of post-hoc, diagnostic signals. We propose Dynamic Speculative Decoding Engine (DSDE), a training-free framework built on two primary components: (1) a predictive signal based on the variance of the Kullback-Leibler divergence (KLD), which diagnoses the generation's regional stability, and (2) an adaptive speculation length cap to mitigate the straggler problem in per-sequence decoding. Experiments demonstrate the potential of using KLD-based stability signals for dynamic adaptation. An algorithm guided by these signals achieves end-to-end latency competitive with leading baselines and exhibits superior robustness across diverse workloads. This robustness is particularly valuable in challenging low-acceptance-rate regimes, where the proposed signal maintains its diagnostic utility. Collectively, these findings validate post-hoc signals as a valuable component for building more robust and intelligent LLM inference systems, and highlight a promising direction for future research on dynamic speculation length adaptation.
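The KLD-variance stability signal and the adaptive length cap can be illustrated roughly as follows; the window size, threshold, and cap values here are invented for the sketch and are not DSDE's:

```python
import math
from statistics import pvariance

def kl(p, q):
    """KL(p || q) for two discrete distributions given as aligned lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def speculation_cap(kl_history, window=4, base_len=8, min_len=2, thresh=0.05):
    """Illustrative DSDE-style rule (constants are our own): shrink the
    speculation-length cap when the variance of recent per-position KLD
    values signals an unstable region of the generation."""
    recent = kl_history[-window:]
    if len(recent) < 2:
        return base_len          # not enough history; use the default cap
    return base_len if pvariance(recent) < thresh else min_len
```

The intuition: a steady draft-to-target divergence (low variance, even if nonzero) means acceptance behavior is predictable and long speculation is cheap, while a spiky divergence signal means rejections are likely and short speculation wastes less work.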
Kuan-Wei Lu, Ding-Yong Hong, Pangfeng Liu
Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft model to predict candidate tokens, which are then verified by a larger target model. However, existing approaches often require additional training, extensive hyperparameter tuning, or prior analysis of models and tasks before deployment. In this paper, we propose Adaptive Speculative Decoding (AdaSD), a hyperparameter-free decoding scheme that dynamically adjusts generation length and acceptance criteria during inference. AdaSD introduces two adaptive thresholds: one to determine when to stop candidate token generation and another to decide token acceptance, both updated in real time based on token entropy and Jensen-Shannon distance. This approach eliminates the need for pre-analysis or fine-tuning and is compatible with off-the-shelf models. Experiments on benchmark datasets demonstrate that AdaSD achieves up to 49\% speedup over standard speculative decoding while limiting accuracy degradation to under 2\%, making it a practical solution for efficient and adaptive LLM inference.
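The two quantities AdaSD thresholds on, token entropy and Jensen-Shannon distance, are standard and compute as below (a generic sketch of the quantities themselves, not AdaSD's real-time update rule):

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def js_distance(p, q):
    """Jensen-Shannon distance between aligned discrete distributions:
    sqrt of the JS divergence, JSD = 0.5*KL(p||m) + 0.5*KL(q||m), m = (p+q)/2.
    Unlike KL it is symmetric, bounded, and defined even with disjoint support."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))
```

High entropy at a draft position suggests the draft model is unsure (a natural place to stop drafting), while a large JS distance between draft and target distributions suggests the token is unlikely to be accepted; thresholding both gives a training-free stop/accept rule of the kind the abstract describes.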
Sergey Pankratov, Dan Alistarh
Speculative generation has emerged as a promising technique to accelerate inference in large language models (LLMs) by leveraging parallelism to verify multiple draft tokens simultaneously. However, the fundamental limits on the achievable speedup remain poorly understood. In this work, we establish the first "tight" lower bounds on the runtime of any deterministic speculative generation algorithm. This is achieved by drawing a parallel between the token generation process and branching random walks, which allows us to analyze the optimal draft tree selection problem. We prove, under basic assumptions, that the expected number of tokens successfully predicted per speculative iteration is bounded as $\mathbb{E}[X] \leq (\mu + \mu_{(2)})\log(P)/\mu^2 + O(1)$, where $P$ is the verifier's capacity, $\mu$ is the expected entropy of the verifier's output distribution, and $\mu_{(2)}$ is the expected second log-moment. This result provides new insights into the limits of parallel token generation, and could guide the design of future speculative decoding systems. Empirical evaluations on Llama models validate our theoretical predictions, confirming the tightness of our bounds in practical settings.
Hongyi Liu, Jiaji Huang, Zhen Jia et al.
Speculative decoding is widely used in accelerating large language model (LLM) inference. In this work, we focus on the online draft model selection problem in speculative decoding. We design an algorithm that provably competes with the best draft model in hindsight for each query in terms of either the token acceptance probability or expected acceptance length. In particular, we show that we can accurately evaluate all draft models, instead of only the chosen model without incurring additional queries to the target model, which allows us to improve exponentially over the existing bandit-based approach as the number of draft models increases. Our approach is generically applicable with any speculative decoding methods (single draft, multi-drafts and draft-trees). Moreover, we design system-efficient versions of online learners and demonstrate that the overhead in computation and latency can be substantially reduced. We conduct extensive experiments on open-source LLMs and diverse datasets, demonstrating that our methods substantially outperform the state-of-the-art EAGLE3 and the BanditSpec baseline in a variety of domains where specialized domain-expert drafters are available, especially when long reasoning chains are required.
Monishwaran Maheswaran, Rishabh Tiwari, Yuezhou Hu et al.
Modern Large Language Models achieve impressive reasoning capabilities with long Chain of Thoughts, but they incur substantial computational cost during inference, and this motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim2\times$ at matched accuracy.
Linfeng Zhong, Songqiang Xu, Huifeng Wen et al.
The growing demand for efficient long-sequence modeling on edge devices has propelled widespread adoption of State Space Models (SSMs) like Mamba, due to their superior computational efficiency and scalability. As its autoregressive generation process remains memory-bound, speculative decoding has been proposed that incorporates draft model generation and target model verification. However, directly applying speculative decoding to SSMs faces three key challenges: (1) hidden state backtracking difficulties, (2) tree-based parallel verification incompatibility, and (3) hardware workload mismatch. To address these challenges, we propose SpecMamba, the first FPGA-based accelerator for Mamba with speculative decoding, which features system, algorithm, and hardware co-design. At the system level, we present a memory-aware hybrid backtracking strategy to coordinate both models. At the algorithm level, we propose first-in-first-out (FIFO)-based tree verification with tiling to minimize memory access. At the hardware level, we customize a dataflow that computes linear layers in parallel and SSM layers in series to enable maximal overlapping. Implemented on AMD FPGA platforms (VHK158 and VCK190), SpecMamba achieves a 2.27x speedup over GPU baselines and a 2.85x improvement compared to prior FPGA solutions, while demonstrating 5.41x and 1.26x higher energy efficiency, respectively.
Fenglu Hong, Ravi Raju, Jonathan Lingjie Li et al.
Speculative decoding is an effective method for accelerating inference of large language models (LLMs) by employing a small draft model to predict the output of a target model. However, when adapting speculative decoding to domain-specific target models, the acceptance rate of the generic draft model drops significantly due to domain shift. In this work, we systematically investigate knowledge distillation techniques for training domain draft models to improve their speculation accuracy. We compare white-box and black-box distillation approaches and explore their effectiveness in various data accessibility scenarios, including historical user queries, curated domain data, and synthetically generated alignment data. Our experiments across Function Calling, Biology, and Chinese domains show that offline distillation consistently outperforms online distillation by 11% to 25%, white-box distillation surpasses black-box distillation by 2% to 10%, and data scaling trends hold across domains. Additionally, we find that synthetic data can effectively align draft models and achieve 80% to 93% of the performance of training on historical user queries. These findings provide practical guidelines for training domain-specific draft models to improve speculative decoding efficiency.
Siqi Wang, Hailong Yang, Xuezhu Wang et al.
Large language models (LLMs) have recently attracted surging interest due to their outstanding capabilities across various domains. However, efficient LLM inference is challenging because autoregressive decoding generates tokens only one at a time. Although research works apply pruning or quantization to speed up LLM inference, they typically require fine-tuning the LLM, incurring significant time and economic costs. Meanwhile, speculative decoding has been proposed to use small speculative models (SSMs) to accelerate LLM inference. However, the low acceptance rate of SSMs and the high verification cost of the LLM prohibit further performance improvement. In this paper, we propose Minions, an LLM inference system that accelerates inference with collective and adaptive speculative generation. Specifically, Minions proposes a majority-voted mechanism that leverages multiple SSMs to jointly speculate the outputs of the LLM, improving inference performance without introducing prohibitive computation costs. To better trade off the number of tokens speculated by the SSMs against the verification cost of the LLM, Minions proposes an adaptive mechanism to dynamically determine the optimal speculation length, achieving better inference performance across different models, datasets, and hyper-parameters. In addition, Minions efficiently decouples SSM decoding from LLM verification and adopts a pipelined execution mechanism to further improve inference performance. Comparisons with state-of-the-art LLM inference systems demonstrate that Minions achieves higher inference throughput and lower inference time.
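The majority-vote idea can be sketched as position-wise voting over the token sequences proposed by several SSMs; the strict-majority stopping rule below is our own simplification of Minions' mechanism, not its actual criterion:

```python
from collections import Counter

def majority_vote(draft_outputs):
    """Toy merge of multiple draft-model proposals: at each position, keep
    the token most drafts agree on; stop at the first position where no
    token reaches a strict majority (disagreement = low confidence)."""
    merged = []
    for column in zip(*draft_outputs):      # position-wise across drafts
        tok, votes = Counter(column).most_common(1)[0]
        if votes * 2 <= len(column):        # no strict majority: stop drafting
            break
        merged.append(tok)
    return merged
```

Disagreement among drafts acts as a free confidence signal: positions where the SSMs diverge are exactly where the target model is most likely to reject, so truncating there keeps the verification batch short and cheap.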
Gabriele Oliaro, Zhihao Jia, Daniel Campos et al.
Speculative decoding is widely adopted to reduce latency in large language model (LLM) inference by leveraging smaller draft models capable of handling diverse user tasks. However, emerging AI applications, such as LLM-based agents, present unique workload characteristics: instead of diverse independent requests, agentic frameworks typically submit repetitive inference requests, such as multi-agent pipelines performing similar subtasks or self-refinement loops iteratively enhancing outputs. These workloads result in long and highly predictable sequences, which current speculative decoding methods do not effectively exploit. To address this gap, we introduce \emph{SuffixDecoding}, a novel method that utilizes efficient suffix trees to cache long token sequences from prompts and previous outputs. By adaptively speculating more tokens when acceptance likelihood is high and fewer when it is low, SuffixDecoding effectively exploits opportunities for longer speculations while conserving computation when those opportunities are limited. Evaluations on agentic benchmarks, including SWE-Bench and Text-to-SQL, demonstrate that SuffixDecoding achieves speedups of up to 5.3$\times$, outperforming state-of-the-art methods -- 2.8$\times$ faster than model-based approaches like EAGLE-2/3 and 1.9$\times$ faster than model-free approaches such as Token Recycling. SuffixDecoding is open-sourced at https://github.com/snowflakedb/ArcticInference
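SuffixDecoding's cache is a suffix tree; the same propose-from-longest-matching-suffix behavior can be approximated with a plain n-gram index (our simplification for illustration, not the paper's implementation):

```python
class SuffixSpeculator:
    """Toy stand-in for SuffixDecoding's suffix-tree cache: index every
    n-gram of previously seen sequences and propose the cached continuation
    of the longest n-gram matching the current context's suffix."""

    def __init__(self, max_n=4):
        self.max_n = max_n
        self.cont = {}                      # n-gram tuple -> continuation list

    def add(self, tokens):
        """Index all n-grams of a past prompt/output with their continuations."""
        for n in range(1, self.max_n + 1):
            for i in range(len(tokens) - n):
                self.cont.setdefault(tuple(tokens[i:i + n]),
                                     tokens[i + n:i + n + 8])

    def propose(self, ctx):
        """Speculate by matching the longest cached suffix of the context."""
        for n in range(self.max_n, 0, -1):  # longest matching suffix first
            hit = self.cont.get(tuple(ctx[-n:]))
            if hit:
                return hit
        return []                           # no match: fall back to no speculation

spec = SuffixSpeculator()
spec.add(["SELECT", "*", "FROM", "t", "WHERE", "x"])
```

Because repetitive agentic workloads revisit near-identical sequences, a cache hit can justify very long speculations, while a miss cheaply returns nothing, which is the adaptive more-when-likely, less-when-not behavior the abstract describes.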
Weilin Zhao, Yuxiang Huang, Xu Han et al.
Speculative decoding is a widely used method that accelerates the generation process of large language models (LLMs) with no compromise in model performance. It achieves this goal by using an existing smaller model for drafting and then employing the target LLM to verify the draft in a low-cost parallel manner. Under such a drafting-verification framework, drafting efficiency has become a bottleneck in the final speedup of speculative decoding. Therefore, generating longer drafts at less cost can lead to better decoding speedup. To achieve this, we introduce Ouroboros, which can generate draft phrases to parallelize the drafting process and meanwhile lengthen drafts in a training-free manner. The experimental results on various typical text generation tasks show that Ouroboros can achieve speedups of up to $2.8\times$ over speculative decoding and $3.9\times$ over vanilla decoding, without fine-tuning draft and target models. The source code of Ouroboros is available at https://github.com/thunlp/Ouroboros.
G. Maspero
In this volume, Giulio Maspero explores both the ontology and the epistemology of the Cappadocians from historical and speculative points of view. He shows how the Cappadocians developed a real Trinitarian Ontology through their reshaping of the Aristotelian category of relation, which they rescued from the accidental dimension and inserted into the immanence of the one divine and eternal substance. This perspective made possible a new conception of individuation. No longer exclusively linked to substantial difference, as in classical Greek philosophy, the concept was instead founded on the mutual relation of the divine Persons. The Cappadocians' metaphysical reshaping was also closely linked to a new epistemological conception based on apophaticism, which shattered the logical closure of their opponents and anticipated results that modern research has subsequently highlighted. Bridging late antique philosophy with Patristics, Maspero's study allows us to find the relational traces of the Trinity in the world and in history.
Jean-Marie Schaeffer
This is a sweeping and provocative work of aesthetic theory: a trenchant critique of the philosophy of art as it developed from the eighteenth century to the early twentieth century, combined with a carefully reasoned plea for a new and more flexible approach to art. Jean-Marie Schaeffer, one of France's leading aestheticians, explores the writings of Kant, Schlegel, Novalis, Hegel, Schopenhauer, Nietzsche, and Heidegger to show that these diverse thinkers shared a common approach to art, which he calls the "speculative theory." According to this theory, art offers a special kind of intuitive, quasi-mystical knowledge, radically different from the rational knowledge acquired by science. This view encouraged theorists to consider artistic geniuses the high-priests of humanity, creators of works that reveal the invisible essence of the world. Philosophers came to regard inexpressibility as the aim of art, refused to consider second-tier creations genuine art, and helped to create conditions in which the genius was expected to shock, puzzle, and mystify the public. Schaeffer shows that this speculative theory helped give birth to romanticism, modernism, and the avant-garde, and paved the way for an unfortunate divorce between art and enjoyment, between "high art" and popular art, and between artists and their public. Rejecting the speculative approach, Schaeffer concludes by defending a more tolerant theory of art that gives pleasure its due, includes popular art, tolerates less successful works, and accounts for personal tastes.
Silvia Ruzanka
Can plants play games? Considering what play might mean for "the more-than-human" can provide an alternate perspective on what play means for ourselves as humans. It can also open a way for rethinking relations between ourselves and other entities that share our world. Kara Stone argues for the importance of considering play for non-human players, drawing on feminist and posthumanist philosophy. In this paper I take a similar approach, speculatively imagining plants as players and our relation to plants and to plant-being, and present an example of a videogame artwork designed for a plant to play.
E. S. St. Pierre, Elliott Kuecker
Abstract This introduction to the special issue titled, “Reading: The Long Preparation for Inquiry,” asks why reading has not been as prominent as writing in educational and social science research. The authors suggest reading may not seem as empirical as writing and so has been assigned fairly limited roles in the research process—reading for the literature review at the beginning of a research project and reading to review research reports at the end. Reading, especially reading philosophy, which the authors encourage, can be considered speculative and ephemeral and bound more to the rational than the empirical side of the rational/empirical binary that structures empirical research’s epistemological goals. Also, it can be difficult to control reading—who knows what a reader might read? The authors recommend that researchers commit to reading as a necessary, lengthy preparation for scholarship and research.
E. P. Kharchenko
The article discusses the properties of the S protein of the pandemic strain XBB.1.5 in comparison with the H1 hemagglutinin of the 1918 and 2009 pandemic strains. The XBB.1.5 S protein already contains more than 40 mutations, realized by substituting different amino acids through single- and dinucleotide substitutions, deletions, and predominantly transversions. The variability of H1N1 influenza virus hemagglutinin is associated with single nucleotide substitutions at a constant length. Conditional extrapolation of influenza virus hemagglutinin variability data to the size of the coronavirus S protein suggests that new pandemic strains will emerge in the next 2-3 years, evading the immune defense formed by vaccination against the strains preceding them. The inability of the adaptive immune system to form long-term immunity to pandemic coronaviruses, as well as to other respiratory viruses with a short incubation cycle, puts on the agenda the need for new vaccine designs that combine long-term adaptive and trained immunity. The difficulty in the search for such vaccines is associated with regulating the activity of the innate immune system and ensuring the stability of trained immunity.
Gisela Fabbian, Maximiliano Crespi
This paper constitutes a first step in a study of the work of the German writer, art historian, critic, and anarchist Carl Einstein. It proposes an analysis of the theoretical-methodological matrix on which he founds his singular historiographical conception of twentieth-century art, focusing on the period from the emergence of German Expressionism to the avant-garde consolidation of Cubism. The hypothesis advanced holds that Einstein finds in Cubism (especially in the works of Picasso, Braque, and Gris) not an "object" of investigation but a "method" for his model of critical historicization of twentieth-century art.
Page 15 of 71379