Results for "Speculative philosophy"

Showing 20 of ~1427569 results · from arXiv, DOAJ, CrossRef, Semantic Scholar

arXiv Open Access 2025
TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding

Zhaoxuan Wu, Zijian Zhou, Arun Verma et al.

We propose TETRIS, a novel method that optimizes the total throughput of batch speculative decoding in multi-request settings. Unlike existing methods that optimize for a single request or a group of requests as a whole, TETRIS actively selects the most promising draft tokens (for every request in a batch) to be accepted when verified in parallel, resulting in fewer rejected tokens and hence less wasted computing resources. Such an effective resource utilization to achieve fast inference in large language models (LLMs) is especially important to service providers with limited inference capacity. Compared to baseline speculative decoding, TETRIS yields a consistently higher acceptance rate and more effective utilization of the limited inference capacity. We show theoretically and empirically that TETRIS outperforms baseline speculative decoding and existing methods that dynamically select draft tokens, leading to a more efficient batch inference in LLMs.

en cs.CL, cs.AI
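TETRIS's batch selection builds on the standard speculative accept/reject rule. As a point of reference, a minimal single-request sketch of that standard rule (toy probabilities; `speculative_verify` is an illustrative name, not TETRIS's API):

```python
import random

def speculative_verify(draft_tokens, draft_probs, target_probs, rng):
    """Standard speculative-decoding verification: accept draft token i
    with probability min(1, p_target_i / p_draft_i) and stop at the
    first rejection. Returns the accepted prefix of draft_tokens."""
    accepted = []
    for tok, p_d, p_t in zip(draft_tokens, draft_probs, target_probs):
        if rng.random() < min(1.0, p_t / p_d):
            accepted.append(tok)
        else:
            break
    return accepted
```

Rejected tokens are the wasted verification compute that TETRIS tries to minimize across a whole batch.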
arXiv Open Access 2025
SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences

Jungyoub Cha, Hyunjong Kim, Sungzoon Cho

Speculative decoding is a widely used technique for accelerating inference in large language models (LLMs), but its performance degrades as input length grows, with significant drops even at moderate lengths. Yet, this early degradation has remained largely underexplored. We introduce SpecExtend, a drop-in enhancement that improves speculative decoding on long sequences without additional training. SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention to accelerate prefill and verification steps. To improve both draft accuracy and speed on long inputs without retraining, we propose Cross-model Retrieval, a novel KV cache eviction strategy that leverages the target model's attention scores to dynamically select relevant context for the smaller draft model. Extensive evaluations show that SpecExtend accelerates speculative decoding by up to 2.84x on 16K-token long document summarization and up to 3.86x on long-form reasoning, while preserving the short-input performance of state-of-the-art frameworks. Our code is available at https://github.com/jycha98/SpecExtend .

en cs.CL, cs.AI
arXiv Open Access 2025
GoodSpeed: Optimizing Fair Goodput with Adaptive Speculative Decoding in Distributed Edge Inference

Phuong Tran, Tzu-Hao Liu, Long Tan Le et al.

Large language models (LLMs) have revolutionized natural language processing, yet their high computational demands pose significant challenges for real-time inference, especially in multi-user serving and resource-constrained environments. Speculative decoding has emerged as a promising technique to accelerate LLM inference by using lightweight draft models to generate candidate tokens, which are subsequently verified by a larger, more accurate model. However, ensuring both high goodput (the effective rate of accepted tokens) and fairness across multiple draft servers cooperating with a central verification server remains an open challenge. This paper introduces GOODSPEED, a novel distributed inference framework that optimizes goodput through adaptive speculative decoding. GOODSPEED employs a central verification server that coordinates a set of heterogeneous draft servers, each running a small language model to generate speculative tokens. To manage resource allocation effectively, GOODSPEED incorporates a gradient scheduling algorithm that dynamically assigns token verification tasks, maximizing a logarithmic utility function to ensure proportional fairness across servers. By processing speculative outputs from all draft servers in parallel, the framework enables efficient collaboration between the verification server and distributed draft generators, streamlining both latency and throughput. Through rigorous fluid sample path analysis, we show that GOODSPEED converges to the optimal goodput allocation in steady-state conditions and maintains near-optimal performance with provably bounded error under dynamic workloads. These results demonstrate that GOODSPEED provides a scalable, fair and efficient solution for multi-server speculative decoding in distributed LLM inference systems.

en cs.DC
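The paper's gradient scheduler is not reproduced here; as a rough illustration of what proportional fairness under a logarithmic utility means, the classic ratio rule from network scheduling can be sketched (the function name and the smoothing scheme are assumptions, not GOODSPEED's actual algorithm):

```python
def pf_pick(instant_rates, avg_goodput, eps=1e-9):
    """Proportional-fair server selection: give the next verification
    slot to the draft server with the highest ratio of instantaneous
    token rate to its smoothed achieved goodput. Greedily maximizing
    a sum of log-goodputs reduces to this ratio rule."""
    scores = [r / (g + eps) for r, g in zip(instant_rates, avg_goodput)]
    return scores.index(max(scores))
```

A server that has so far received little goodput gets a small denominator and therefore priority, which is the fairness behavior the log utility encodes.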
arXiv Open Access 2025
Automatic Task Detection and Heterogeneous LLM Speculative Decoding

Danying Ge, Jianhua Gao, Qizhi Jiang et al.

Speculative decoding, which combines a draft model with a target model, has emerged as an effective approach to accelerate large language model (LLM) inference. However, existing methods often face a trade-off between the acceptance rate and decoding speed in downstream tasks due to the limited capacity of the draft model, making it difficult to ensure efficiency across diverse tasks. To address this problem, we propose a speculative decoding algorithm tailored for downstream task optimization. It includes an automatic task partitioning and assigning method, which automatically categorizes downstream tasks into different sub-tasks and assigns them to a set of heterogeneous draft models. Each draft model is aligned with the target model using task-specific data, thereby enhancing the consistency of inference results. In addition, our proposed method incorporates an online lightweight prompt classifier to dynamically route prompts to the appropriate draft model. Experimental results demonstrate that the proposed method improves draft accuracy by 6% to 50% over vanilla speculative decoding, while achieving a speedup of 1.10x to 2.64x in LLM inference.

en cs.CL
arXiv Open Access 2025
Guided Speculative Inference for Efficient Test-Time Alignment of LLMs

Jonathan Geuter, Youssef Mroueh, David Alvarez-Melis

We propose Guided Speculative Inference (GSI), a novel algorithm for efficient reward-guided decoding in large language models. GSI combines soft best-of-$n$ test-time scaling with a reward model $r(x,y)$ and speculative samples from a small auxiliary model $\pi_S(y\mid x)$. We provably approximate both the optimal tilted policy $\pi_{\beta,B}(y\mid x) \propto \pi_B(y\mid x)\exp(\beta\,r(x,y))$ of soft best-of-$n$ under the base model $\pi_B$ and the expected reward under the optimal policy. In experiments on reasoning benchmarks (MATH500, OlympiadBench, Minerva Math, MMLU-STEM, GSM8K), our method achieves higher accuracy than standard soft best-of-$n$ with $\pi_S$ and reward-guided speculative decoding (Liao et al., 2025), and in certain settings even outperforms soft best-of-$n$ with $\pi_B$. The code is available at https://github.com/j-geuter/GSI .

en cs.LG, stat.ML
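The exponential tilt at the core of soft best-of-n can be illustrated with a generic reward-weighted resampler (this is the standard tilted resample only, not GSI's full draft-and-verify pipeline; the function name is hypothetical):

```python
import math
import random

def soft_best_of_n(candidates, reward, beta, rng):
    """Soft best-of-n: resample one candidate with probability
    proportional to exp(beta * reward(y)), i.e. an exponential tilt
    of whatever sampler produced the candidates."""
    weights = [math.exp(beta * reward(y)) for y in candidates]
    r = rng.random() * sum(weights)
    acc = 0.0
    for y, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return y
    return candidates[-1]  # guard against floating-point underflow
```

As beta grows the tilt concentrates on the highest-reward candidate (hard best-of-n); beta = 0 recovers uniform resampling.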
arXiv Open Access 2025
AdaSpec: Adaptive Speculative Decoding for Fast, SLO-Aware Large Language Model Serving

Kaiyu Huang, Hao Wu, Zhubo Shi et al.

Cloud-based Large Language Model (LLM) services often face challenges in achieving low inference latency and meeting Service Level Objectives (SLOs) under dynamic request patterns. Speculative decoding, which exploits lightweight models for drafting and LLMs for verification, has emerged as a compelling technique to accelerate LLM inference. However, existing speculative decoding solutions often fail to adapt to fluctuating workloads and dynamic system environments, resulting in impaired performance and SLO violations. In this paper, we introduce AdaSpec, an efficient LLM inference system that dynamically adjusts speculative strategies according to real-time request loads and system configurations. AdaSpec proposes a theoretical model to analyze and predict the efficiency of speculative strategies across diverse scenarios. Additionally, it implements intelligent drafting and verification algorithms to maximize performance while ensuring high SLO attainment. Experimental results on real-world LLM service traces demonstrate that AdaSpec consistently meets SLOs and achieves substantial performance improvements, delivering up to 66% speedup compared to state-of-the-art speculative inference systems. The source code is publicly available at https://github.com/cerebellumking/AdaSpec

arXiv Open Access 2025
Fast LLM Post-training via Decoupled and Fastest-of-N Speculation

Rongxin Cheng, Kai Zhou, Xingda Wei et al.

Rollout dominates the training time in large language model (LLM) post-training, where the trained model is used to generate tokens given a batch of prompts. This work, SpecActor, achieves fast rollout with speculative decoding that deploys a fast draft path to accelerate the unparallelizable generation, while the correctness is guaranteed by fast parallel verification of the outputs with the original model. SpecActor addresses two foundational challenges that hinder speculation efficiency: (1) a Decoupled speculation method that overcomes the computation inefficiency issue when executing speculative decoding with relatively large per-worker batch size -- a common configuration in training but unfriendly to speculation, and (2) a Fastest-of-N speculation method that selects and combines different draft methods according to the rollout progress to approximate the optimal draft method even when the best one is unknown a priori. Extensive evaluations on production traces show that SpecActor accelerates mean rollout speed by 2.0--2.4x, with up to 2.7x speedup, over common post-training baselines. The results are consistent across both dense and MoE models and across different RL algorithms. Notably, SpecActor is 1.1--2.6x faster compared to vanilla speculative rollout in different traces. The accelerated rollout achieves 1.4--2.3x faster end-to-end training time.

en cs.DC, cs.AI
arXiv Open Access 2025
Hierarchical Verification of Speculative Beams for Accelerating LLM Inference

Jaydip Sen, Harshitha Puvvala, Subhasis Dasgupta

Large language models (LLMs) have achieved remarkable success across diverse natural language processing tasks but face persistent challenges in inference efficiency due to their autoregressive nature. While speculative decoding and beam sampling offer notable improvements, traditional methods verify draft sequences sequentially without prioritization, leading to unnecessary computational overhead. This work proposes the Hierarchical Verification Tree (HVT), a novel framework that restructures speculative beam decoding by prioritizing high-likelihood drafts and enabling early pruning of suboptimal candidates. Theoretical foundations and a formal verification-pruning algorithm are developed to ensure correctness and efficiency. Integration with standard LLM inference pipelines is achieved without requiring retraining or architecture modification. Experimental evaluations across multiple datasets and models demonstrate that HVT consistently outperforms existing speculative decoding schemes, achieving substantial reductions in inference time and energy consumption while maintaining or enhancing output quality. The findings highlight the potential of hierarchical verification strategies as a new direction for accelerating large language model inference.

en cs.CL
arXiv Open Access 2025
Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios

Luohe Shi, Zuchao Li, Lefei Zhang et al.

Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing power, then generate a complex and massive draft tree using a small autoregressive language model to improve overall prediction accuracy. However, methods like batching have been widely applied in mainstream model inference systems as a superior alternative to speculative decoding, as they compress the available idle computing power. Therefore, performing speculative decoding with low verification resources and low scheduling costs has become an important research problem. We believe that more capable models that allow for parallel generation on draft sequences are what we truly need. Recognizing the fundamental nature of draft models to only generate sequences of limited length, we propose SpecFormer, a novel architecture that integrates unidirectional and bidirectional attention mechanisms. SpecFormer combines the autoregressive model's ability to extract information from the entire input sequence with the parallel generation benefits of non-autoregressive models. This design eliminates the reliance on large prefix trees and achieves consistent acceleration, even in large-batch scenarios. Through lossless speculative decoding experiments across models of various scales, we demonstrate that SpecFormer sets a new standard for scaling LLM inference with lower training demands and reduced computational costs.

en cs.CL
DOAJ Open Access 2025
Helen’s argumentative coherence and the didactic element of Gorgias’ rhetoric

Maicon Reus-Engler

I argue in this paper that Gorgias’ Helen is a coherent epideictic speech with a strong didactic element. This didactic element refers to the fact that, by using antilogic and making the weaker argument the stronger, Gorgias conducts the audience’s opinion from one perspective to another. The coherence of the speech comes from the fact that Gorgias employs a commonsensical pattern of argumentation in the first two arguments to prepare the reader for the digressions on logos and love. Even though he never states it explicitly, Gorgias holds the endoxic idea that no one is responsible for an action committed under coercion. I argue that the reasoning structure of the digressions depends on the two previous arguments, i.e., that Gorgias transforms logos and love into a sort of violence. To conclude, I show that Helen is both a coherent and a didactic speech that imparts an antilogical education to the audience.

Speculative philosophy, Philosophy (General)
arXiv Open Access 2024
Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference

Nadav Timor, Jonathan Mamou, Daniel Korat et al.

This paper introduces distributed speculative inference (DSI), a novel inference algorithm that is provably faster than speculative inference (SI) [leviathan2023, chen2023, miao2024, sun2025, timor2025] and standard autoregressive inference (non-SI). Like other SI algorithms, DSI operates on frozen language models (LMs), requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups over non-SI--but rely on sufficiently fast and accurate drafters, which are often unavailable in practice. We identify a gap where SI can be slower than non-SI if drafters are too slow or inaccurate. We close this gap by proving that DSI is faster than both SI and non-SI--given any drafters. DSI is therefore not only faster than SI, but also unlocks the acceleration of LMs for which SI fails. DSI leverages speculation parallelism (SP), a novel type of task parallelism, to orchestrate target and drafter instances that overlap in time, establishing a new foundational tradeoff between computational resources and latency. Our simulations show that DSI is 1.29-1.92x faster than SI in single-node setups for various off-the-shelf LMs and tasks. We open-source all our code.

en cs.DC, cs.AI
arXiv Open Access 2024
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths

Kaixuan Huang, Xudong Guo, Mengdi Wang

Speculative decoding reduces the inference latency of a target large language model by utilizing a smaller and faster draft model. Its performance depends on a hyperparameter K -- the candidate length, i.e., the number of candidate tokens for the target model to verify in each round. However, previous methods often use simple heuristics to choose K, which may result in sub-optimal performance. We study the choice of the candidate length K and formulate it as a Markov Decision Process. We theoretically show that the optimal policy of this Markov decision process takes the form of a threshold policy, i.e., the current speculation should stop and be verified when the probability of getting a rejection exceeds a threshold value. Motivated by this theory, we propose SpecDec++, an enhanced version of speculative decoding that adaptively determines the candidate length on the fly. We augment the draft model with a trained acceptance prediction head to predict the conditional acceptance probability of the candidate tokens. SpecDec++ will stop the current speculation when the predicted probability that at least one token gets rejected exceeds a threshold. We implement SpecDec++ and apply it to the llama-2-chat 7B & 70B model pair. Our adaptive method achieves a 2.04x speedup on the Alpaca dataset (7.2% improvement over the baseline speculative decoding). On the GSM8K and HumanEval datasets, our method achieves a 2.26x speedup (9.4% improvement) and a 2.23x speedup (11.1% improvement), respectively. The code of this paper is available at https://github.com/Kaffaljidhmah2/SpecDec_pp.

en cs.CL, cs.AI
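The threshold stopping rule described in the abstract is simple to state. A minimal sketch, assuming the acceptance head has already produced per-token conditional acceptance probabilities (the function name is hypothetical, not the paper's code):

```python
def draft_length(accept_probs, threshold):
    """Threshold stopping rule: keep drafting while the predicted
    probability that at least one drafted token is rejected stays at
    or below `threshold`. accept_probs[i] is the predicted conditional
    acceptance probability of draft token i; returns how many tokens
    to submit for verification."""
    p_all_accepted = 1.0
    for k, p in enumerate(accept_probs, start=1):
        p_all_accepted *= p
        if 1.0 - p_all_accepted > threshold:
            return k - 1  # stop before the token that crosses the threshold
    return len(accept_probs)
```

Intuitively, the rejection probability 1 - ∏ p_i only grows as the draft lengthens, so the first crossing of the threshold is the optimal stopping point under the paper's threshold-policy result.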
arXiv Open Access 2024
Constrained Decoding with Speculative Lookaheads

Nishanth Nakshatri, Shamik Roy, Rajarshi Das et al.

Constrained decoding with lookahead heuristics (CDLH) is a highly effective method for aligning LLM generations to human preferences. However, the extensive lookahead roll-out operations for each generated token make CDLH prohibitively expensive, resulting in low adoption in practice. In contrast, common decoding strategies such as greedy decoding are extremely efficient, but achieve very low constraint satisfaction. We propose constrained decoding with speculative lookaheads (CDSL), a technique that significantly improves upon the inference efficiency of CDLH without experiencing the drastic performance reduction seen with greedy decoding. CDSL is motivated by the recently proposed idea of speculative decoding that uses a much smaller draft LLM for generation and a larger target LLM for verification. In CDSL, the draft model is used to generate lookaheads which are verified by a combination of the target LLM and task-specific reward functions. This process accelerates decoding by reducing the computational burden while maintaining strong performance. We evaluate CDSL on two constrained decoding tasks with three LLM families and achieve a 2.2x to 12.15x speedup over CDLH without significant performance reduction.

en cs.CL, cs.AI
DOAJ Open Access 2024
The ‘brown envelope syndrome’: Culture of bribery and ethics at the crossroads

Kelebogile T. Resane

This article unravels the destructive practice of bribery, which is ingrained in South African culture. Through scenarios in the public transport industry and a literature review, the article reveals that bribery is indeed a reality to reckon with. It is observed in all tiers of society, such as politics, businesses and religious circles. Bribery, explained as a ‘brown envelope syndrome’, is rife in the South African socio-political landscape. The objective of this article is to enlighten readers that bribery, from socio-theological perspectives, is ethically unacceptable. It promotes corrupt political accountability, economic inequality, social disrespect and religious hypocrisy. The article starts with personal scenarios and then delves into the definition of bribery as payment in the form of money, favours or gifts, to influence a decision, to receive favours in turn, facilitate fast delivery of goods, gain an unfair advantage, avoid legal prosecution or to supply substandard products or services. The second part of the article discusses how bribery clashes with Christian ethics as pointed out by Jesus in Mark 7:21–22, as bribery lies buried in human hearts and manifests itself in different sins. There is a clear biblical condemnation of bribery as detestable, and that by all means possible, it should be proscribed. Bribery in political circles is known as the ‘brown envelope’, meaning the currency that changes hands between the corrupt and the corrupted. The solution suggested is that citizens in their national democracies should break out of the shell and become vocal against bribery. The church should emerge from behind the trees, come forward with prophetic discernment and wisdom, and be vocal against bribery. Personally, those exposed to bribery situations must maintain wisdom and learn how to refuse offers without endangering themselves.
This research unravels bribery and its devastating effects of promoting corrupt political accountability, economic inequality, social disrespect and religious hypocrisy. Contribution: This article calls for communities and churches to be vocal against bribery and encourages personal accountability when confronted with bribery offers.

Epistemology. Theory of knowledge
S2 Open Access 2023
Global Salvation Inc.: Sir Michael Barber’s education for the apocalypse and the church of Deliverology®

E. Auld, P. Morris

Drawing on insights from philosophy and theology, we explore the relationship between religion, data and global education policy through an analysis of the career of Sir Michael Barber, widely regarded as an authority on the reform of public services and an influential policy entrepreneur. The analysis provides a novel perspective which illuminates how secularised salvation narratives and apocalyptic symbolism have become more prominent in his work as he rose to become a global actor. The story is entwined with the turn towards New Public Management (NPM) in education, the rise of international large-scale assessments (ILSAs), and the transition from speculative faith to salvation through the ‘science of delivery’. We follow Barber's role in formalising the faith as a secular political theology, and promoting it globally through Delivery Units and the management doctrine of Deliverology®. The analysis closes by reflecting on the practical implications of the movement, which is now woven into the institutional ideologies and reform strategies of major international organisations, agencies and corporations as part of a wider shift towards New Global Management (NGM) in global education governance.

12 citations en
arXiv Open Access 2023
REST: Retrieval-Based Speculative Decoding

Zhenyu He, Zexuan Zhong, Tianle Cai et al.

We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation. The key insight driving the development of REST is the observation that the process of text generation often includes certain common phases and patterns. Unlike previous methods that rely on a draft language model for speculative decoding, REST harnesses the power of retrieval to generate draft tokens. This method draws from the reservoir of existing knowledge, retrieving and employing relevant tokens based on the current context. Its plug-and-play nature allows for seamless integration and acceleration of any language models, all without necessitating additional training. When benchmarked on 7B and 13B language models in a single-batch setting, REST achieves a significant speedup of 1.62X to 2.36X on code or text generation. The code of REST is available at https://github.com/FasterDecoding/REST.

en cs.CL, cs.AI
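REST's retrieval step can be caricatured with a linear-scan suffix match (the real system uses an indexed datastore and tree-structured drafts; this toy sketch only shows the copy-what-follows idea, and the function name is hypothetical):

```python
def retrieve_draft(datastore, context, max_suffix, draft_len):
    """Retrieval-based drafting: find the longest suffix of `context`
    (up to `max_suffix` tokens) that occurs in `datastore`, and copy
    the tokens that follow the first match as the draft."""
    for n in range(min(max_suffix, len(context)), 0, -1):
        suffix = context[-n:]
        for i in range(len(datastore) - n + 1):
            if datastore[i:i + n] == suffix:
                return datastore[i + n:i + n + draft_len]
    return []  # no match: fall back to plain autoregressive decoding
```

The drafts still pass through the target model's usual verification step, so correctness does not depend on retrieval quality; a bad match just yields fewer accepted tokens.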
DOAJ Open Access 2023
Features of Being and Freedom under the Mood of Langeweile in the Philosophy of M. Heidegger

Yurii MARINCHUK

The article examines the influence of the phenomenon of boredom in A. Schopenhauer's work on the work of M. Heidegger. Ukrainian and international researchers note that Schopenhauer is a founder of existentialism, although in the history of philosophy he occupies the niche of the philosophy of life. To unfold the concepts of world, will and boredom, he relies on the laws of logic, sufficient rational grounds, the foundations of beings and higher ideas. It cannot be said that Heidegger neglects this approach, if only because his leading method for studying the tradition, destruction (Destruktion), places his own teaching within it. The article analyzes the following oppositions: the metaphysical doctrine of the will as a thing-in-itself versus freedom, which alone is capable of putting questions to itself (for Schopenhauer, a person is guided by motives toward an act of will; for Heidegger, a person is the bearer of the quality of freedom); and boredom as an a priori mood of cognition versus boredom as an existential mood of Dasein. The article's tasks are: 1) to study the doctrine of boredom in order to reveal aspects of freedom within its structure; 2) to study the parallels, connections and oppositions in the concept of the human being toward which the worldviews of Schopenhauer and Heidegger are oriented, concerning the possibility of knowing being. It is concluded that in one period of Heidegger's work, boredom is the fundamental mood of Dasein, expressed in the spirit of the post-industrial working society of his time. In a later version, boredom is a particular mode of being-here, characterized by a total approach to disclosing one's authenticity to oneself. Heidegger thus projects far ahead the relevance of the problem of boredom in its broad sense, which echoes Schopenhauer's: routine, everydayness, the circle of desires and sufferings, living a life that is not one's own, das Man. The result presented in the article is the division of Heidegger's intuitions about boredom into two distinct positions. The first remains in the Schopenhauerian spirit, in which the doctrine of the senses of being, whose chief mood is boredom, is taken as a premise for the conduct of life of people of the types Dasein and das Man.
The second represents a break from existential discipline toward fundamental ontology and the hermeneutics of facticity, in which deep boredom explains totality (the constant availability of possibilities of being through thinking).

Epistemology. Theory of knowledge
S2 Open Access 2021
What is ‘critical’ in critical studies of edtech? Three responses

Felicitas Macgilchrist

We describe our focus in Learning, Media and Technology as the ‘critical analysis of the social, cultural and political aspects of digital media production, consumption, technology and culture in educational contexts’. But what work is the word ‘critical’ doing in current research on education and technology? The meaning of ‘critical’ has never been stable. It varies across the fields that contribute to the conversation on edtech – from education to sociology, media studies, cultural studies, history, philosophy and others – and it changes over time. Some scholars wonder if we are facing a ‘fatigue of critique’ or suggest we move towards ‘post-critique’ (Editorial Team 2020; Hodgson, Vlieghe, and Zamojski 2018). Yet at the same time, I have been intrigued recently by the emergence of generative, speculative and utopian approaches to criticality in the most mainstream of places: funded research projects. Given this ‘mainstreaming’ of what was previously marginal in the field of education and technology, now seems like an opportune moment to take stock of the priorities in current critical research on learning, media and technology, and to highlight areas for the future of this conversation.

54 citations en
S2 Open Access 2021
Software, Sovereignty and the Post-Neoliberal Politics of Exit

Harrison Smith, R. Burrows

This paper examines the impact of neoreactionary (NRx) thinking – that of Curtis Yarvin, Nick Land, Peter Thiel and Patri Friedman in particular – on contemporary political debates manifest in ‘architectures of exit’. We specifically focus on Urbit, as an NRx digital architecture that captures how post-neoliberal politics imagines notions of freedom and sovereignty through a micro-fracturing of nation-states into ‘gov-corps’. We trace the development of NRx philosophy – and situate this within contemporary political and technological change to theorize the significance of exit manifest within the notion of ‘dynamic geographies’. While technological programmes such as Urbit may never ultimately succeed, we argue that these, and other speculative investments such as ‘seasteading’, reflect broader post-neoliberal NRx imaginaries that were, perhaps, prefigured a quarter of a century ago in The Sovereign Individual.

50 citations en Political Science

Page 13 of 71379