Results for "Speculative philosophy"

Showing 20 of ~1,427,515 results · from CrossRef, arXiv, DOAJ, Semantic Scholar

arXiv Open Access 2026
StarSD: One-for-Many Speculative Decoding

Junhao He, Feiran You, Hongyang Du

Speculative decoding accelerates autoregressive generation by separating token proposal from verification, but most existing approaches are designed for single-node execution and do not scale well to multi-accelerator clusters used for serving modern Large Language Models (LLMs). We present StarSD, a one-for-many speculative decoding framework that uses a single draft model to serve multiple target models across distributed nodes via a star topology. StarSD decouples drafting and verification, enabling effective sharing of draft computation, and preventing distributed accelerators from remaining idle under bursty workloads. We provide a system-level analysis that characterizes when and why a single draft model can remain fully utilized by multiple verifiers, yielding predictable latency and utilization gains. Extensive experiments in real-world distributed inference settings demonstrate that StarSD simplifies deployment and supports flexible resource allocation across heterogeneous accelerators, while maintaining output quality. These results indicate that StarSD is a practical and scalable framework for bringing speculative decoding to modern cloud and edge inference infrastructures.

en eess.SY
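The draft-then-verify split that StarSD (and the other speculative decoding papers in these results) builds on can be sketched as a toy greedy loop. The callables and the greedy acceptance rule below are illustrative stand-ins, not StarSD's actual distributed, star-topology implementation:

```python
def speculative_generate(target_next, draft_next, prompt, n_tokens, k=4):
    """Generic draft-then-verify loop (greedy toy version).

    target_next / draft_next: callables mapping a token list to the
    next token. k: number of tokens drafted per verification round.
    """
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft phase: the small model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # Verify phase: keep the longest prefix matching the target's
        # greedy choices, plus one corrected (or bonus) target token.
        accepted = []
        for tok in draft:
            t = target_next(out + accepted)
            if t == tok:
                accepted.append(tok)
            else:
                accepted.append(t)  # target's token replaces the mismatch
                break
        else:
            accepted.append(target_next(out + accepted))  # bonus token
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_tokens]
```

With identical draft and target models every draft token is accepted, so the loop reduces to greedy target decoding while covering roughly k+1 tokens per verification round.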
arXiv Open Access 2025
SpecPipe: Accelerating Pipeline Parallelism-based LLM Inference with Speculative Decoding

Haofei Yin, Mengbai Xiao, Tinghong Li et al.

The demand for large language model inference is rapidly increasing. Pipeline parallelism offers a cost-effective deployment strategy for distributed inference but suffers from high service latency. While incorporating speculative decoding into pipeline parallelism improves performance, it still faces the challenges of low hardware utilization and a narrow speculative window. Inspired by branch prediction in instruction pipelining, we introduce SpecPipe, which fills the pipeline with speculative tokens of a request step by step. By maximizing hardware utilization, SpecPipe ideally decodes one token per pipeline step. Specifically, SpecPipe comprises a dynamic speculative token tree and a pipelined inference framework. The tree dynamically accepts tokens from a speculative token source and feeds them to the inference pipeline. Since the speculative window is relaxed in our framework, a high-accuracy draft model can be integrated without fine-tuning. The pipelined inference framework proceeds through node-wise computation, pruning propagation, and inter-node communication stages. We implement SpecPipe and a variant, SpecPipe-DB, with dynamic batching for single- and multi-request inference, respectively. On an 8-stage pipeline, SpecPipe improves time between tokens on diverse single-request workloads by $4.19\times$-$5.53\times$ over standard pipeline parallelism and by $2.08\times$-$2.38\times$ over prior tree-based speculative decoding methods. For multi-request workloads, SpecPipe-DB achieves $1.64\times$-$2.08\times$ higher throughput and $1.61\times$-$2.06\times$ lower time between tokens than vLLM.

en cs.LG
arXiv Open Access 2025
SPIRe: Boosting LLM Inference Throughput with Speculative Decoding

Sanjit Neelam, Daniel Heinlein, Vaclav Cvicek et al.

Speculative decoding (SD) has been shown to reduce the latency of autoregressive decoding (AD) by 2-3x for small batch sizes. However, increasing throughput and therefore reducing the cost per token requires decoding with large batch sizes. Recent work shows that SD can accelerate decoding with large batch sizes too if the context is sufficiently long and the draft model's KV cache is sparse. We introduce SPIRe, a draft model that combines static sparse attention, pruned initialization, and feedback memory to increase the modeled throughput of speculative decoding by over 100% compared to speculation with a much smaller draft model and by over 35% compared to the strong baseline of sparse self-speculation. Our approach is particularly effective when context lengths vary significantly across requests.

en cs.LG
arXiv Open Access 2025
STree: Speculative Tree Decoding for Hybrid State-Space Models

Yangchao Wu, Zongyue Qin, Alex Wong et al.

Speculative decoding is a technique to leverage hardware concurrency in order to enable multiple steps of token generation in a single forward pass, thus improving the efficiency of large-scale autoregressive (AR) Transformer models. State-space models (SSMs) are already more efficient than AR Transformers, since their state summarizes all past data with no need to cache or re-process tokens in a sliding-window context. However, their state can also comprise thousands of tokens, so speculative decoding has recently been extended to SSMs. Existing approaches, however, do not leverage tree-based verification methods, since current SSMs lack the means to compute a token tree efficiently. We propose the first scalable algorithm for tree-based speculative decoding in state-space models (SSMs) and hybrid architectures of SSMs and Transformer layers. We exploit the structure of accumulated state transition matrices to facilitate tree-based speculative decoding with minimal overhead relative to current SSM implementations. Along with the algorithm, we describe a hardware-aware implementation that improves on the naive application of AR Transformer tree-based speculative decoding methods to SSMs. Furthermore, we outperform vanilla speculative decoding with SSMs even with a baseline drafting model and tree structure on three different benchmarks, opening up opportunities for further speedups in SSM and hybrid model inference. Code can be found at: https://github.com/wyc1997/stree.

en cs.LG, cs.AI
arXiv Open Access 2025
FSLH: Flexible Mechanized Speculative Load Hardening

Jonathan Baumann, Roberto Blanco, Léon Ducruet et al.

The Spectre speculative side-channel attacks pose formidable threats to security. Research has shown that code following the cryptographic constant-time discipline can be efficiently protected against Spectre v1 using a selective variant of Speculative Load Hardening (SLH). SLH was, however, not strong enough to protect non-cryptographic code, leading to the introduction of Ultimate SLH, which protects arbitrary programs but has too large an overhead for general use, since it conservatively assumes that all data is secret. In this paper we introduce a flexible SLH notion that achieves the best of both worlds by generalizing both Selective and Ultimate SLH. We give a suitable security definition for such transformations protecting arbitrary programs: any transformed program running with speculation should not leak more than what the source program leaks sequentially. We formally prove, using the Rocq prover, that two flexible SLH variants enforce this relative security guarantee. As easy corollaries we also obtain that, in our setting, Ultimate SLH enforces our relative security notion, and two selective SLH variants enforce speculative constant-time security.

en cs.CR, cs.PL
arXiv Open Access 2025
ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts

Evangelos Georganas, Dhiraj Kalamkar, Alexander Kozlov et al.

Speculative decoding (SD) has emerged as a method to accelerate LLM inference without sacrificing accuracy relative to 16-bit model inference. In a typical SD setup, the idea is to use a full-precision, small, fast model as a "draft" to generate the next few tokens and use the large "target" model to verify the draft-generated tokens. The efficacy of this method heavily relies on the acceptance ratio of the draft-generated tokens and the relative token throughput of the draft versus the target model. Nevertheless, an efficient SD pipeline requires pre-training and aligning the draft model to the target model, making it impractical for LLM inference in a plug-and-play fashion. In this work, we propose using MXFP4 models as drafts in a plug-and-play fashion, since MXFP4 Weight-Only-Quantization (WOQ) merely direct-casts the BF16 target model weights to MXFP4. In practice, our plug-and-play solution gives speedups of up to 2x over the BF16 baseline. We then pursue an opportunity for further acceleration: the MXFP4 draft token generation itself can be accelerated via speculative decoding by using yet another, smaller draft. We call our method ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts, since it recursively applies speculation to accelerate draft-token generation. Combining Multi-Level Speculative Decoding with MXFP4 Quantized Drafts, we outperform state-of-the-art speculative decoding, yielding speedups of up to 2.72x over the BF16 baseline.

en cs.CL, cs.AI
arXiv Open Access 2024
A Theoretical Perspective for Speculative Decoding Algorithm

Ming Yin, Minshuo Chen, Kaixuan Huang et al.

Transformer-based autoregressive sampling has been the major bottleneck slowing down large language model inference. One effective way to accelerate inference is \emph{Speculative Decoding}, which employs a small model to sample a sequence of draft tokens and a large model to validate them. Given its empirical effectiveness, the theoretical understanding of Speculative Decoding lags behind. This paper tackles this gap by conceptualizing the decoding problem via a Markov chain abstraction and studying its key properties, \emph{output quality and inference acceleration}, from a theoretical perspective. Our analysis covers the theoretical limits of speculative decoding, batch algorithms, and output quality-inference acceleration tradeoffs. Our results reveal the fundamental connections between different components of LLMs via total variation distances and show how they jointly affect the efficiency of decoding algorithms.

en cs.LG, cs.AI
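The standard speculative sampling rule analyzed in work like this — accept a draft token $x \sim q$ with probability $\min(1, p(x)/q(x))$, otherwise resample from the normalized residual $\max(p-q, 0)$ — preserves the target distribution exactly, which can be checked analytically. A minimal sketch (the function and variable names are mine):

```python
def speculative_token_distribution(p, q):
    """Exact one-token output distribution of speculative sampling.

    A draft token x ~ q is accepted with probability min(1, p[x]/q[x]);
    on rejection, a token is resampled from the normalized residual
    max(p - q, 0). Returns the induced marginal distribution.
    """
    # Pr[draw x and accept] = q[x] * min(1, p[x]/q[x]) = min(q[x], p[x])
    accept = [min(p[i], q[i]) for i in range(len(p))]
    reject_mass = 1.0 - sum(accept)
    residual = [max(p[i] - q[i], 0.0) for i in range(len(p))]
    z = sum(residual)
    return [accept[i] + (reject_mass * residual[i] / z if z > 0 else 0.0)
            for i in range(len(p))]
```

The induced distribution equals the target p for any draft q, which is why speculation trades only speed, never output quality.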
arXiv Open Access 2024
Graph-Structured Speculative Decoding

Zhuocheng Gong, Jiahao Liu, Ziyue Wang et al.

Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models (LLMs) by employing a small language model to draft a hypothesis sequence, which is then validated by the LLM. The effectiveness of this approach heavily relies on the balance between performance and efficiency of the draft model. In our research, we focus on enhancing the proportion of draft tokens that are accepted into the final output by generating multiple hypotheses instead of just one. This gives the LLM more options to choose from, letting it select the longest sequence that meets its standards. Our analysis reveals that hypotheses produced by the draft model share many common token sequences, suggesting a potential for optimizing computation. Leveraging this observation, we introduce an approach that uses a directed acyclic graph (DAG) to manage the drafted hypotheses. This structure enables us to efficiently predict and merge recurring token sequences, vastly reducing the computational demands of the draft model. We term this approach Graph-structured Speculative Decoding (GSD). We apply GSD across a range of LLMs, including a 70-billion parameter LLaMA-2 model, and observe a remarkable speedup of 1.73$\times$ to 1.96$\times$, significantly surpassing standard speculative decoding.

en cs.CL
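The shared-prefix merging that GSD exploits can be illustrated with a plain trie. The paper's DAG is more general — it also merges recurring subsequences, not just prefixes — so this sketch shows only the prefix case:

```python
def build_prefix_trie(hypotheses):
    """Merge draft hypotheses that share prefixes into a trie.

    Returns (trie, node_count). Shared prefixes are stored (and would
    be computed) once, so node_count <= total tokens across hypotheses.
    """
    trie = {}
    count = 0
    for hyp in hypotheses:
        node = trie
        for tok in hyp:
            if tok not in node:
                node[tok] = {}   # new node only for an unseen extension
                count += 1
            node = node[tok]     # shared prefix: reuse the existing node
    return trie, count
```

For example, the hypotheses `[1,2,3]`, `[1,2,4]`, `[1,5]` contain 8 tokens but only 5 trie nodes, so 3 of the 8 draft-model evaluations could be shared.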
arXiv Open Access 2024
Multi-Candidate Speculative Decoding

Sen Yang, Shujian Huang, Xinyu Dai et al.

Large language models have shown impressive capabilities across a variety of NLP tasks, yet their autoregressive text generation is time-consuming. One way to speed them up is speculative decoding, which generates candidate segments (sequences of tokens) from a fast draft model that are then verified in parallel by the target model. However, the acceptance rate of candidate tokens is limited by several factors, such as the model, the dataset, and the decoding setup. This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification. We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model. Our approach shows significant improvements in acceptance rates on multiple datasets and models, consistently outperforming standard speculative decoding.

en cs.CL
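The benefit of multiple candidates — more chances for a long accepted prefix — can be seen in a greedy toy verifier. The paper's actual algorithms verify the batch while provably preserving the target model's sampling distribution; this sketch uses greedy matching only:

```python
def verify_multi_candidate(target_next, prefix, candidates):
    """Greedy toy verifier: among several draft candidates, return the
    longest prefix consistent with the target model's greedy choices.

    target_next: callable mapping a token list to the next token.
    """
    best = []
    for cand in candidates:
        accepted = []
        for tok in cand:
            if target_next(prefix + accepted) == tok:
                accepted.append(tok)
            else:
                break  # first mismatch ends this candidate
        if len(accepted) > len(best):
            best = accepted
    return best
```

With k candidates, the accepted prefix is the maximum over k tries, which is exactly why acceptance rates rise with the candidate count.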
arXiv Open Access 2024
Optimized Speculative Sampling for GPU Hardware Accelerators

Dominik Wagner, Seanie Lee, Ilja Baumann et al.

In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks. This results in profiling time improvements ranging from 6% to 13% relative to the baseline implementation, without compromising accuracy. To further accelerate speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid. This approximation approach results in significantly greater relative improvements in profiling time, ranging from 37% to 94%, with a minor decline in accuracy. We conduct extensive experiments on both automatic speech recognition and summarization tasks to validate the effectiveness of our optimization methods.

en cs.LG, cs.CL
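The softmax-to-sigmoid substitution described above removes the cross-element normalization sum, so each probability can be computed independently per GPU thread. A minimal numeric sketch of the idea (the paper's exact parameterization and kernels are not reproduced here):

```python
import math

def softmax(logits):
    # Numerically stable softmax: needs a max and a sum over all
    # elements, coupling every output to every input.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid_scores(logits):
    # Element-wise sigmoid: each score depends on its own logit only,
    # so entries can be computed fully in parallel. This is an
    # approximation -- scores are monotone in the logits but no
    # longer sum to one.
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]
```

Because sigmoid is monotone, rankings (and hence argmax/top-k choices) agree with softmax; only the calibrated probabilities differ, which matches the reported speed-for-accuracy trade.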
arXiv Open Access 2024
PEARL: Parallel Speculative Decoding with Adaptive Draft Length

Tianyu Liu, Yun Li, Qitan Lv et al.

Speculative decoding (SD), where an extra draft model first provides multiple draft tokens and the original target model then verifies these tokens in parallel, has shown great promise for LLM inference acceleration. However, existing SD methods suffer from the mutual waiting problem, i.e., the target model gets stuck while the draft model is guessing tokens, and vice versa. This problem is directly caused by the asynchronous execution of the draft model and the target model, and is exacerbated by the fixed draft length in speculative decoding. To address these challenges, we propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL). Specifically, PEARL proposes pre-verify, which verifies the first draft token in advance during the drafting phase, and post-verify, which generates more draft tokens during the verification phase. By applying these two strategies, PEARL parallelizes the drafting and verification phases and achieves an adaptive draft length for different scenarios, effectively alleviating the mutual waiting problem. Experiments on various text generation benchmarks demonstrate the effectiveness of PEARL, with speedups of up to 4.43$\times$ and 1.50$\times$ over auto-regressive decoding and vanilla speculative decoding, respectively. Our code is available at https://github.com/smart-lty/ParallelSpeculativeDecoding.

en cs.CL
arXiv Open Access 2024
Problems in AI, their roots in philosophy, and implications for science and society

Max Velthoven, Eric Marcus

Artificial Intelligence (AI) is one of today's most relevant emergent technologies. In view of this, this paper proposes that more attention should be paid to the philosophical aspects of AI technology and its use. It is argued that this deficit generally goes hand in hand with philosophical misconceptions about the growth of knowledge. To identify these misconceptions, reference is made to the ideas of the philosopher of science Karl Popper and the physicist David Deutsch. The works of both thinkers argue against mistaken theories of knowledge, such as inductivism, empiricism, and instrumentalism. This paper shows that these theories bear similarities to how current AI technology operates. It also shows that these theories are very much alive in the (public) discourse on AI, often under the label of Bayesianism. In line with Popper and Deutsch, it is proposed that all these theories are based on mistaken philosophies of knowledge. This includes an analysis of the implications of these mistaken philosophies for the use of AI in science and society, including some of the likely problem situations that will arise. The paper finally provides a realistic outlook on Artificial General Intelligence (AGI) and three propositions on A(G)I and philosophy (i.e., epistemology).

en cs.AI, cs.ET
arXiv Open Access 2023
SafeBet: Secure, Simple, and Fast Speculative Execution

Conor Green, Cole Nelson, Mithuna Thottethodi et al.

Spectre attacks exploit microprocessor speculative execution to read and transmit forbidden data outside the attacker's trust domain and sandbox. Recent hardware schemes allow potentially-unsafe speculative accesses but prevent the secret's transmission by delaying most access-dependent instructions even in the predominantly common no-attack case, which incurs performance loss and hardware complexity. Instead, we propose SafeBet, which allows only, and does not delay most, safe accesses, achieving both security and high performance. SafeBet is based on the key observation that speculatively accessing a destination location is safe if an access to that location by the same static trust domain has previously been committed, and potentially unsafe otherwise. We extend this observation to handle inter-trust-domain code and data interactions. SafeBet employs the Speculative Memory Access Control Table (SMACT) to track non-speculative trust-domain code region-destination pairs. Disallowed accesses wait until commit to trigger a well-known replay, with virtually no change to the pipeline. Software simulations using SPEC CPU benchmarks show that SafeBet, using an 8.3-KB SMACT per core, performs within 6% on average (63% at worst) of the unsafe baseline, behind which NDA-restrictive, a previous scheme of security and hardware complexity comparable to SafeBet's, lags by 83% on average.

en cs.AR, cs.CR
arXiv Open Access 2023
The Synergy of Speculative Decoding and Batching in Serving Large Language Models

Qidong Su, Christina Giannoula, Gennady Pekhimenko

Large Language Models (LLMs) like GPT are state-of-the-art text generation models that provide significant assistance in daily routines. However, LLM execution is inherently sequential, since these models produce only one token at a time, thus incurring low hardware utilization on modern GPUs. Batching and speculative decoding are two techniques for improving GPU hardware utilization in LLM inference. To study their synergy, we build a prototype and perform an extensive characterization analysis across various LLM models and GPU architectures. We observe that the optimal speculation length depends on the batch size used. We analyze this key observation and build a quantitative model to explain it. Based on our analysis, we propose a new adaptive speculative decoding strategy that chooses the optimal speculation length for different batch sizes. Our evaluations show that the proposed method can match or exceed the performance of state-of-the-art speculative decoding schemes with fixed speculation length.

en cs.LG, cs.DC
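The paper's key observation — that the best speculation length shrinks as batch size grows — falls out of even a crude cost model. The constants and cost shape below are illustrative assumptions, not the authors' quantitative model:

```python
def best_speculation_length(alpha, batch, max_k=16,
                            verify_cost=1.0, draft_cost=0.05):
    """Pick the speculation length k maximizing modeled throughput.

    Illustrative model: with per-token acceptance rate alpha, a verify
    step accepts (1 - alpha**(k+1)) / (1 - alpha) tokens per sequence
    on average, while its cost grows with both k and batch size
    (larger batches leave less idle compute for free verification).
    """
    def throughput(k):
        tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
        cost = k * draft_cost + verify_cost * (1 + 0.1 * k * batch)
        return batch * tokens / cost
    return max(range(1, max_k + 1), key=throughput)
```

With alpha = 0.8 this toy model prefers long speculation for a single request but k = 1 at batch size 32, where verification cost already dominates — the qualitative trend the paper's adaptive strategy exploits.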
DOAJ Open Access 2023
Travel narratives, natural history, and the circulation of knowledge in the early modern period

Moreno Stedile

This article presents a case study of the work Etiópia Oriental by Friar João dos Santos, considering its singularity in the production of knowledge about east-central Africa. We examine the meanings attributed to knowledge derived from experience and the forms of epistemic validation in the early modern period. The production of knowledge linking metropolitan centers to the frontiers of the maritime empires reveals the dynamics of the contact zones and regimes of circulation characteristic of these imperial networks. Taking records of African animals as our object, we highlight the forms of agency and local networks, the modes of mediation and translation, and the negotiated dimension of this textual production.

Epistemology. Theory of knowledge, Science
DOAJ Open Access 2022
Fear of the oppressed: the case of All of us are dead

Pablo Esteban Romero Medina

What's worse than having to survive a zombie apocalypse? Doing so and still being ostracised by society if you fail the university entrance exams. At least that's what the teenagers of All of Us Are Dead, a South Korean series released on Netflix, fear. In this article, drawing on Mark Fisher's thesis of capitalist realism, we reflect on various social conflicts that the series conveys and that are usually exacerbated in the zombie genre. Through them we also consider whether this genre belongs to the dystopian, and what the current situation of dystopia is in the 21st century. The series allows us to reflect on the strange position of mainstream dystopia, which does not seem to register the many political transformations of recent years and which seems trapped in the first decade of the century.

Speculative philosophy, Sociology (General)
DOAJ Open Access 2022
From Lost to Yellowjackets: Or on how the dystopian drift does (not) change when the subjectivities are Others

Delicia Aguado Peláez, Patricia Martínez García

This article offers a content analysis, from an intersectional perspective, of two television series of the present century: Lost (Abrams, Lieber, and Lindelof, 2004-2010) and Yellowjackets (Lyle and Nickerson, 2021-present). They are chosen because they start from similar premises: after a plane crash, and with no expectation of rescue, a group has to organize itself to survive in a hostile environment. In this sense, we are interested in how the new society is imagined: are the systems of domination of the original societies repeated, or are new formulas created? Furthermore, the two narratives differ considerably in context, production, and casting, which leads us to ask: where is the focus placed when the protagonists are (not) part of the normative political subject? Are those marked as Others represented through other ways of doing? In conclusion, Yellowjackets introduces non-hegemonic gazes that make room for characters who shun stereotypes and fixed roles, as well as feminized themes that had been rendered invisible. However, this change does not overcome the individual and collective melancholic paralysis that runs through both productions, which repeat the structures of power and domination and thereby squander an opportunity to move toward a utopian society.

Speculative philosophy, Sociology (General)
arXiv Open Access 2021
Cats vs. Spectre: An Axiomatic Approach to Modeling Speculative Execution Attacks

Hernán Ponce-de-León, Johannes Kinder

The Spectre family of speculative execution attacks has required a rethinking of formal methods for security. Approaches based on operational speculative semantics have made initial inroads towards finding vulnerable code and validating defenses. However, with each new attack grows the amount of microarchitectural detail that has to be integrated into the underlying semantics. We propose an alternative, lightweight and axiomatic approach to specifying speculative semantics that relies on insights from memory models for concurrency. We use the CAT modeling language for memory consistency to specify execution models that capture speculative control flow, store-to-load forwarding, predictive store forwarding, and memory-ordering machine clears. We present a bounded model checking framework parametrized by our speculative CAT models and evaluate its implementation against the state of the art. Thanks to the axiomatic approach, our models can be rapidly extended, allowing our framework to detect new types of attacks and validate defenses against them.

en cs.CR, cs.PL

Page 9 of 71376