PACER: Blockwise Pre-verification for Speculative Decoding with Adaptive Length
Situo Zhang, Yifan Zhang, Zichen Zhu
et al.
Speculative decoding (SD) is a powerful technique for accelerating the inference process of large language models (LLMs) without sacrificing accuracy. Typically, SD employs a small draft model to generate a fixed number of draft tokens, which are then verified in parallel by the target model. However, our experiments reveal that the optimal draft length varies significantly across different decoding steps. This variation suggests that using a fixed draft length limits the potential for further improvements in decoding speed. To address this challenge, we propose Pacer, a novel approach that dynamically controls draft length using a lightweight, trainable pre-verification layer. This layer pre-verifies draft tokens blockwise before they are sent to the target model, allowing the draft model to stop token generation if the blockwise pre-verification fails. We implement Pacer on multiple SD model pairs and evaluate its performance across various benchmarks. Our results demonstrate that Pacer achieves up to 2.66x speedup over autoregressive decoding and consistently outperforms standard speculative decoding. Furthermore, when integrated with Ouroboros, Pacer attains up to 3.09x speedup.
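The blockwise pre-verification idea can be illustrated with a toy sketch. Everything here is hypothetical: the scoring rule, the threshold, and the function names stand in for PACER's actual trained pre-verification layer, which the abstract does not specify.

```python
def preverify_block(block_scores, threshold=0.5):
    """Toy pre-verifier: accept a draft block only if the mean
    confidence score of its tokens clears a threshold. Both the
    scoring rule and the threshold value are illustrative."""
    return sum(block_scores) / len(block_scores) >= threshold

def draft_with_preverification(score_blocks, max_blocks=4):
    """Emit draft tokens block by block, stopping early when a block
    fails pre-verification -- mimicking an adaptive draft length.
    Returns the number of draft tokens actually proposed."""
    accepted = 0
    for block in score_blocks[:max_blocks]:
        if not preverify_block(block):
            break  # stop drafting; hand what we have to the target model
        accepted += len(block)
    return accepted
```

A confident prefix followed by an uncertain block yields a short draft, so the target model wastes less parallel verification work on tokens that would be rejected anyway.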
Blaise Pascal: Politics as the Différance of the Divine Will
Stephane Vinolo
Pascal's work presents a philosophical paradox. On the one hand, it contains very few explicitly political texts; on the other, it displays a constant political concern. To propose a solution to this paradox, the author shows that Pascal compels us to rethink the very definition of politics, as well as its scope and its limits. Conceived in terms of a dual human nature marked by the Fall, violence in the state of nature is an ontological problem arising from the need to preserve an infinite object for human love. For this reason, the political solution to it lies in an affective self-regulation that makes politics an exclusively behavioral field which, while it may demand certain actions, cannot enter the intimacy of individuals, which belongs to another order. Thus, Pascalian politics does nothing more than serve the very gesture of the divine creation of the human being and the divine will to keep humans alive so that they may offer praise; Pascalian politics is, then, an indirect path for the realization of the divine will: its différance.
Speculative philosophy, Philosophy (General)
A Journey Through the Treatises of Father Alexandre de Gusmão (SJ, 1629-1724)
Isabel Scremin da Silva
This article proposes to recover the printed work of Alexandre de Gusmão, considering him not only as the founder of the Seminário de Belém da Cachoeira, in Bahia, but also as a lettered Jesuit of his time. The author of various spiritual writings addressed to a broad audience on both sides of the Atlantic, Gusmão classified six of his printed works as treatises, namely: Escola de Bethlem (1678); Arte de crear bem os Filhos (1685); Rosa de Nazareth (1715); Eleyçam entre o Bem, & Mal Eterno (1720); O Corvo, e a Pomba da Arca de Noé (1734); and Arvore da Vida, Jesus Crucificado (1734). With the aim of identifying Gusmão's dialogues with ancient and contemporary authorities, aspects of invention, disposition, and elocution are highlighted, with emphasis on the variety that a single genre, the treatise, assumes across the different species of Gusmão's production.
Epistemology. Theory of knowledge, History (General)
Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff
Maximilian Holsman, Yukun Huang, Bhuwan Dhingra
Speculative Decoding (SD) enforces strict distributional equivalence to the target model when accepting candidate tokens. While it maintains the target model's generation quality, this strict equivalence limits the speedup achievable by SD and prevents users from trading deviations from the target distribution in exchange for further inference speed gains. To address these limitations, we introduce Fuzzy Speculative Decoding (FSD), a decoding algorithm that generalizes SD by accepting candidate tokens based on the divergences between the target and draft model distributions. By allowing controlled divergence from the target model, FSD enables users to flexibly trade generation quality for inference speed. Across several benchmarks, our method achieves significant runtime improvements, running over 5 tokens per second faster than SD at only an approximate 2% absolute reduction in benchmark accuracy. In many cases, FSD even matches SD benchmark accuracy while running over 2 tokens per second faster, demonstrating that distributional equivalence is not necessary to maintain target model performance. Furthermore, FSD can be seamlessly integrated into existing SD extensions; we demonstrate this by applying FSD to EAGLE-2, greatly enhancing this existing extension's efficiency while allowing it to leverage FSD's tunable quality-speed trade-off.
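A minimal sketch of divergence-based acceptance, under assumptions: the abstract does not say which divergence FSD uses, so total variation distance is chosen here purely for illustration, and the function names and threshold are hypothetical.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions
    given as aligned lists of probabilities."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def fuzzy_accept(target_dist, draft_dist, threshold):
    """Toy fuzzy acceptance: keep the draft token whenever the two
    next-token distributions are close enough. A larger threshold
    accepts more draft tokens (faster) at the cost of drifting
    further from the target distribution (lower quality)."""
    return total_variation(target_dist, draft_dist) <= threshold
```

Setting `threshold=0` recovers a strict closeness test, while raising it exposes the tunable accuracy-runtime tradeoff the abstract describes.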
Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning
Tiancheng Su, Meicong Zhang, Guoxiu He
Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model to generate candidate tokens, which the target LLM either accepts directly or regenerates upon rejection. However, excessive alignment between the draft and target models constrains SD to the performance of the target LLM. To address this limitation, we propose Entropy-Aware Speculative Decoding (EASD), a training-free enhancement. Building on standard SD, EASD incorporates a dynamic entropy-based penalty. At each decoding step, we employ the entropy of the sampling distribution to quantify model uncertainty. When both models exhibit high entropy with substantial overlap among their top-N predictions, the corresponding token is rejected and re-sampled by the target LLM. This penalty prevents low-confidence errors from propagating. By incorporating draft-model verification, EASD enables the possibility of surpassing the target model's inherent performance. Experiments across multiple reasoning benchmarks demonstrate that EASD consistently outperforms existing SD methods and, in most cases, surpasses the target LLM itself. We further prove that the efficiency of EASD is comparable to that of SD. The code can be found in the Supplementary Materials.
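The entropy-aware rejection rule can be sketched as follows. The thresholds, the value of N, and the overlap criterion below are illustrative stand-ins, not the paper's actual settings.

```python
import math

def entropy(dist):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def top_n(dist, n):
    """Indices of the n most probable tokens."""
    return set(sorted(range(len(dist)), key=lambda i: -dist[i])[:n])

def easd_reject(target_dist, draft_dist, n=3, h_min=1.0, overlap_min=2):
    """Toy entropy-aware penalty: reject (and re-sample from the
    target) when BOTH models are uncertain and their top-N sets
    overlap substantially -- the low-confidence case the abstract
    says should not be allowed to propagate."""
    both_uncertain = entropy(target_dist) > h_min and entropy(draft_dist) > h_min
    overlap = len(top_n(target_dist, n) & top_n(draft_dist, n))
    return both_uncertain and overlap >= overlap_min
```

When the target model is confident (low entropy), the penalty never fires and decoding reduces to standard SD behavior.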
Speculative Evolution Through 3D Cellular Automata
Amir Hossein Khazaei
This project explores speculative evolution through a 3D implementation of Conway's Game of Life, using procedural simulation to generate unfamiliar extraterrestrial organic forms. By applying an optimized volumetric workflow, the raw cellular structures are smoothed into unified, bone-like geometries that resemble hypothetical non-terrestrial morphologies. The resulting forms, strange yet organic, are 3D printed as fossil-like artifacts, presenting a tangible representation of generative structures. This process situates the work at the intersection of artificial life, evolutionary modeling, and digital fabrication, illustrating how simple rules can simulate complex biological emergence and challenge conventional notions of organic form.
Speculative Decoding for Verilog: Speed and Quality, All in One
Changran Xu, Yi Liu, Yunhao Zhou
et al.
The rapid advancement of large language models (LLMs) has revolutionized code generation tasks across various programming languages. However, the unique characteristics of programming languages, particularly those like Verilog with specific syntax and lower representation in training datasets, pose significant challenges for conventional tokenization and decoding approaches. In this paper, we introduce a novel application of speculative decoding for Verilog code generation, showing that it can improve both inference speed and output quality, effectively achieving speed and quality all in one. Unlike standard LLM tokenization schemes, which often fragment meaningful code structures, our approach aligns decoding stops with syntactically significant tokens, making it easier for models to learn the token distribution. This refinement addresses inherent tokenization issues and enhances the model's ability to capture Verilog's logical constructs more effectively. Our experimental results show that our method achieves up to a 5.05x speedup in Verilog code generation and increases pass@10 functional accuracy on RTLLM by up to 17.19% compared to conventional training strategies. These findings highlight speculative decoding as a promising approach to bridge the quality gap in code generation for specialized programming languages.
SpecASR: Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding
Linye Wei, Shuzhang Zhong, Songqiang Xu
et al.
Large language model (LLM)-based automatic speech recognition (ASR) has recently attracted a lot of attention due to its high recognition accuracy and enhanced multi-dialect support. However, the high decoding latency of LLMs challenges the real-time ASR requirements. Although speculative decoding has been explored for better decoding efficiency, existing methods usually ignore the key characteristics of the ASR task and achieve limited speedup. To further reduce the real-time ASR latency, in this paper, we propose a novel speculative decoding framework specialized for ASR, dubbed SpecASR. SpecASR is developed based on our core observation that ASR decoding is audio-conditioned, which results in high output alignment between small and large ASR models, even given output mismatches in intermediate decoding steps. Therefore, SpecASR features an adaptive draft sequence generation process that dynamically modifies the draft sequence length to maximize the token acceptance length. SpecASR further proposes a draft sequence recycling strategy that reuses the previously generated draft sequence to reduce the draft ASR model latency. Moreover, a two-pass sparse token tree generation algorithm is also proposed to balance the latency of the draft and target ASR models. With extensive experimental results, we demonstrate that SpecASR achieves 3.04x-3.79x and 1.25x-1.84x speedup over the baseline autoregressive decoding and speculative decoding, respectively, without any loss in recognition accuracy.
SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification
Zhendong Tan, Xingjun Zhang, Chaoyi Hu
et al.
Growing demands from tasks like code generation, deep reasoning, and long-document understanding have made long-context generation a crucial capability for large language models (LLMs). Speculative decoding is one of the most direct and effective approaches for accelerating generation. It follows a draft-verify paradigm, where a lightweight draft model proposes several candidate tokens and the target model verifies them. However, we find that as the context length grows, verification becomes the dominant bottleneck. To further accelerate speculative decoding in long-context generation, we introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value states (KV) and periodically applies full verification to eliminate accumulated errors. We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct and Qwen3-series. Experimental results show that SpecPV achieves up to 6x decoding speedup over standard autoregressive decoding with minor degradation.
Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation
Lujun Gui, Bin Xiao, Lei Su
et al.
Lossless speculative decoding accelerates target large language model (LLM) inference by employing a lightweight draft model for generating tree-structured candidates, which are subsequently verified in parallel by the target LLM. Currently, effective approaches leverage feature-level rather than token-level autoregression within the draft model to facilitate more straightforward predictions and enhanced knowledge distillation. In this paper, we reassess these approaches and propose FSPAD (Feature Sampling and Partial Alignment Distillation for Lossless Speculative Decoding), which introduces two straightforward and effective components within the existing framework to boost lossless speculative decoding. Firstly, FSPAD utilizes token embeddings to sample features of the target LLM in high-dimensional space before feeding them into the draft model, due to the inherent uncertainty of the features preventing the draft model from obtaining the specific token output by the target LLM. Secondly, FSPAD introduces partial alignment distillation to weaken the draft model's connection between features and logits, aiming to reduce the conflict between feature alignment and logit confidence during training. Our experiments include both greedy and non-greedy decoding on the largest and smallest models from the Vicuna and LLaMA3-Instruct series, as well as tasks in multi-turn conversation, translation, summarization, question answering, mathematical reasoning, and retrieval-augmented generation. The results show that FSPAD outperforms the state-of-the-art method across all the aforementioned tasks and target LLMs.
Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs
Raghavv Goel, Mukul Gagrani, Wonseok Jeon
et al.
Text generation with Large Language Models (LLMs) is known to be memory bound due to the combination of their auto-regressive nature, huge parameter counts, and limited memory bandwidths, often resulting in low token rates. Speculative decoding has been proposed as a solution for LLM inference acceleration. However, since draft models are often unavailable in the modern open-source LLM families, e.g., for Llama 2 7B, training a high-quality draft model is required to enable inference acceleration via speculative decoding. In this paper, we propose a simple draft model training framework for direct alignment to chat-capable target models. With the proposed framework, we train Llama 2 Chat Drafter 115M, a draft model for Llama 2 Chat 7B or larger, with only 1.64% of the original size. Our training framework consists only of pretraining, distillation dataset generation, and finetuning with knowledge distillation, with no additional alignment procedure. For the finetuning step, we use instruction-response pairs generated by the target model for distillation within a plausible data distribution, and propose a new Total Variation Distance++ (TVD++) loss that incorporates variance reduction techniques inspired by the policy gradient method in reinforcement learning. Our empirical results show that Llama 2 Chat Drafter 115M with speculative decoding achieves up to 2.3 block efficiency and 2.4x speed-up relative to autoregressive decoding on various tasks with no further task-specific fine-tuning.
Preregistration does not improve the transparent evaluation of severity in Popper's philosophy of science or when deviations are allowed
Mark Rubin
One justification for preregistering research hypotheses, methods, and analyses is that it improves the transparent evaluation of the severity of hypothesis tests. In this article, I consider two cases in which preregistration does not improve this evaluation. First, I argue that, although preregistration may facilitate the transparent evaluation of severity in Mayo's error statistical philosophy of science, it does not facilitate this evaluation in Popper's theory-centric approach. To illustrate, I show that associated concerns about Type I error rate inflation are only relevant in the error statistical approach and not in a theory-centric approach. Second, I argue that a test procedure that is preregistered but that also allows deviations in its implementation (i.e., "a plan, not a prison") does not provide a more transparent evaluation of Mayoian severity than a non-preregistered procedure. In particular, I argue that sample-based validity-enhancing deviations cause an unknown inflation of the test procedure's Type I error rate and, consequently, an unknown reduction in its capability to license inferences severely. I conclude that preregistration does not improve the transparent evaluation of severity (a) in Popper's philosophy of science or (b) in Mayo's approach when deviations are allowed.
TikTag: Breaking ARM's Memory Tagging Extension with Speculative Execution
Juhee Kim, Jinbum Park, Sihyeon Roh
et al.
ARM Memory Tagging Extension (MTE) is a new hardware feature introduced in ARMv8.5-A architecture, aiming to detect memory corruption vulnerabilities. The low overhead of MTE makes it an attractive solution to mitigate memory corruption attacks in modern software systems and is considered the most promising path forward for improving C/C++ software security. This paper explores the potential security risks posed by speculative execution attacks against MTE. Specifically, this paper identifies new TikTag gadgets capable of leaking the MTE tags from arbitrary memory addresses through speculative execution. With TikTag gadgets, attackers can bypass the probabilistic defense of MTE, increasing the attack success rate by close to 100%. We demonstrate that TikTag gadgets can be used to bypass MTE-based mitigations in real-world systems, Google Chrome and the Linux kernel. Experimental results show that TikTag gadgets can successfully leak an MTE tag with a success rate higher than 95% in less than 4 seconds. We further propose new defense mechanisms to mitigate the security risks posed by TikTag gadgets.
Draft Model Knows When to Stop: Self-Verification Speculative Decoding for Long-Form Generation
Ziyin Zhang, Jiahao Xu, Tian Liang
et al.
Conventional speculative decoding (SD) methods utilize a predefined length policy for proposing drafts, which implies the premise that the target model smoothly accepts the proposed draft tokens. However, reality deviates from this assumption: the oracle draft length varies significantly, and the fixed-length policy hardly satisfies such a requirement. Moreover, such discrepancy is further exacerbated in scenarios involving complex reasoning and long-form generation, particularly under test-time scaling for reasoning-specialized models. Through both theoretical and empirical estimation, we establish that the discrepancy between the draft and target models can be approximated by the draft model's prediction entropy: a high entropy indicates a low acceptance rate of draft tokens, and vice versa. Based on this insight, we propose SVIP: Self-Verification Length Policy for Long-Context Speculative Decoding, which is a training-free dynamic length policy for speculative decoding systems that adaptively determines the lengths of draft sequences by referring to the draft entropy. Experimental results on mainstream SD benchmarks as well as reasoning-heavy benchmarks demonstrate the superior performance of SVIP, achieving up to 17% speedup on MT-Bench at 8K context compared with fixed draft lengths, and 22% speedup for QwQ in long-form reasoning.
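An entropy-based length policy like the one this abstract describes can be sketched as follows. The per-step rule and the threshold value are illustrative assumptions; SVIP derives its actual criterion from the theoretical relationship between draft entropy and acceptance rate.

```python
import math

def token_entropy(dist):
    """Shannon entropy (nats) of the draft model's next-token
    distribution at one decoding step."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def adaptive_draft_length(step_dists, h_stop=1.5, max_len=8):
    """Toy dynamic length policy: keep drafting while the draft model
    is confident, and stop at the first step whose entropy exceeds a
    threshold (high entropy ~ low expected acceptance rate).
    `step_dists` holds the draft distribution at each step; returns
    the number of draft tokens to propose this round."""
    length = 0
    for dist in step_dists[:max_len]:
        if token_entropy(dist) > h_stop:
            break  # uncertain step: submit the draft so far for verification
        length += 1
    return length
```

Because the policy only reads quantities the draft model already computes, it adds essentially no overhead and requires no training, matching the training-free claim in the abstract.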
Reckoning with the wicked problems of nuclear technology: Philosophy, design, and pedagogical method underlying a course on Nuclear Technology, Policy, and Society
Aditi Verma
This paper describes the underlying philosophy, design, and implementation of a course on "Nuclear Technology, Policy, and Society" taught in the Department of Nuclear Engineering and Radiological Sciences at the University of Michigan. The course explores some of nuclear technology's most pressing challenges, its 'wicked problems'. Students explore the origins of these problems, be they social or technical; they are offered conceptual and methodological tools to make sense of them; and they are guided through a semester-long exploration of how scientists and engineers can work toward their resolution, and of the degree to which these problems can be solved through institutional transformation or through a transformation of our own practices and norms as a field. The underlying pedagogical philosophy, implementation, and response to the course are described here for other instructors who might wish to create a similar course, or for non-academic nuclear scientists and engineers who might, in these pages, find a vocabulary for articulating and reflecting on the nature of these problems as encountered in their praxis.
Fundamental Challenges in Cybersecurity and a Philosophy of Vulnerability-Guided Hardening
Marcel Böhme
Research in cybersecurity may seem reactive, specific, ephemeral, and indeed ineffective. Despite decades of innovation in defense, even the most critical software systems turn out to be vulnerable to attacks. Time and again. Offense and defense forever on repeat. Even provable security, meant to provide an indubitable guarantee of security, does not stop attackers from finding security flaws. As we reflect on our achievements, we are left wondering: Can security be solved once and for all? In this paper, we take a philosophical perspective and develop the first theory of cybersecurity that explains what precisely and *fundamentally* prevents us from making reliable statements about the security of a software system. We substantiate each argument by demonstrating how the corresponding challenge is routinely exploited to attack a system despite credible assurances about the absence of security flaws. To make meaningful progress in the presence of these challenges, we introduce a philosophy of cybersecurity.
Recurrent Stochastic Fluctuations with Financial Speculation
Tomohiro Hirano
Throughout history, many countries have repeatedly experienced large swings in asset prices, which are usually accompanied by large fluctuations in macroeconomic activity. One of the characteristics of the period before major economic fluctuations is the emergence of new financial products; the situation prior to the 2008 financial crisis is a prominent example of this. During that period, a variety of structured bonds, including securitized products, appeared. Because of the high returns on such financial products, many economic agents were involved in them for speculative purposes, even if they were riskier, producing macro-scale effects. With this motivation, we present a simple macroeconomic model with financial speculation. Our model illustrates two points. First, stochastic fluctuations in asset prices and macroeconomic activity are driven by the repeated appearance and disappearance of risky financial assets, rather than expansions and contractions in credit availability. Second, in an economy with sufficient borrowing and lending, the appearance of risky financial assets leads to decreased productive capital, while in an economy with severely limited borrowing and lending, it leads to increased productive capital.
Mitigating Speculation-based Attacks through Configurable Hardware/Software Co-design
Ali Hajiabadi, Archit Agarwal, Andreas Diavastos
et al.
New speculation-based attacks that affect large numbers of modern systems are disclosed regularly. Currently, CPU vendors often fall back to heavy-handed mitigations such as barriers or strict programming guidelines, resulting in significant performance overhead. What is missing is a solution that allows for efficient mitigation and is flexible enough to address both current and future speculation vulnerabilities, without additional hardware changes. In this work, we present SpecControl, a novel hardware/software co-design that enables new levels of security while reducing the performance overhead demonstrated by state-of-the-art methodologies. SpecControl introduces a communication interface that allows compilers and application developers to inform the hardware about true branch dependencies, confidential control-flow instructions, and fine-grained instruction constraints in order to apply restrictions only when necessary. We evaluate SpecControl against known speculative execution attacks and, in addition, present a new speculative fetch attack variant on the Pattern History Table (PHT) in branch predictors, showing how similar previously reported vulnerabilities are more dangerous than believed by enabling unprivileged attacks, especially with state-of-the-art branch predictors. SpecControl provides stronger security guarantees compared to the existing defenses while reducing the performance overhead of two state-of-the-art defenses from 51% and 43% to just 23%.
Being Adolescent Inmates in Juvenile Penal Institutions During the Pandemic
Maria Rita Mancaniello
During the pandemic, one of the life contexts that passed under collective silence was the penitentiary. Time in detention is a time devoted to re-education, to the possibility of developing knowledge and skills that can offer a new perspective from which to view one's role in the world. In adolescence, prison is an extreme measure that aims precisely at providing an educational and transformative response through cultural, artistic, and school activities. Contact with the world of education in a broad sense, and with school more specifically, allows incarcerated adolescents to engage with significant adults and to experience valuable teaching-learning processes, above all through the educational relationship and the emotional-affective competencies that characterize it. During the pandemic, this delicate re-educational path came to a halt, and with it all the various relational dimensions, leaving adolescents in a state of deep emotional deprivation and painful isolation. While what happened cannot be changed, it can prompt reflection on the value of affectivity in relationships, and serve as a starting point for defining a new way of building educational dynamics.
Keywords: pandemic and isolation in prison, affectivity in the educational relationship, teaching-learning processes in juvenile prison
Speculative philosophy, Ethics
Hybrid Animism: The Sensing Surfaces of Planetary Computation
B. Marenko
This article proposes to examine animism through the perspective provided by a notion of immanent matter drawn from process philosophy (Spinoza, Deleuze and Guattari) and quantum physics (Bohm, Rovelli). It then deploys this perspective to illuminate how planetary computation (the impact of digital media technologies on a planetary scale) is rewiring the cognitive, affective, and perceptual capacities of the human. The article puts forward the notion of hybrid animism as a speculative and imaginative philosophical fiction ('philoso-fiction') to grasp planetary computation as a sensorial pan-affective event, and to account for the hybrid techno-digital ecologies humans already inhabit, characterised by ongoing modulation, sensorial intensification, and pervasive distribution of computational matter across a plethora of screens, surfaces, and surroundings. The value of this proposition, the article explains, is to eschew dominant techno-deterministic narratives: not only techno-euphoria and techno-dystopia, but also the notion of technology as enchantment, with its in-built mystification. By deploying the philoso-fiction of hybrid animism and the un-mediated intuitive sensorial grasp it fosters, planetary computation can begin to be immediately perceived as the expression of new modes of co-habitation and co-evolution of the human and the nonhuman. Finally, the article brings together the nonhuman mutating surfaces of digital matter with cephalopods' skins to vividly and speculatively illustrate hybrid animism as a thought experiment of sorts.