Results for "Speculative philosophy"

Showing 20 of ~1,427,566 results · from CrossRef, DOAJ, arXiv, Semantic Scholar

arXiv Open Access 2025
BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms

Yunlong Hou, Fengzhuo Zhang, Cunxiao Du et al.

Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
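
As a concrete illustration of the bandit formulation, a minimal UCB loop over candidate speculation lengths can be sketched as follows; the candidate lengths, the reward model (tokens accepted per verify step), and the `noisy_reward` function are illustrative assumptions, not the paper's UCBSpec implementation:

```python
import math
import random

def ucb_select(counts, means, t, c=2.0):
    """Pick the arm (here: a speculation length) with the highest
    upper confidence bound; play any unplayed arm first."""
    for arm, n in counts.items():
        if n == 0:
            return arm
    return max(counts, key=lambda a: means[a] + math.sqrt(c * math.log(t) / counts[a]))

def run_bandit(reward_fn, arms, steps):
    counts = {a: 0 for a in arms}
    means = {a: 0.0 for a in arms}
    for t in range(1, steps + 1):
        arm = ucb_select(counts, means, t)
        r = reward_fn(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
    return counts

random.seed(0)
# Hypothetical mean reward (accepted tokens per verify step) per speculation length.
true_mean = {2: 0.5, 4: 0.7, 8: 0.55}
def noisy_reward(arm):
    return true_mean[arm] + random.uniform(-0.1, 0.1)

counts = run_bandit(noisy_reward, arms=[2, 4, 8], steps=2000)
```

Under these assumed reward gaps, the loop concentrates its pulls on the length with the highest mean acceptance, mirroring how BanditSpec treats each hyperparameter configuration as an arm.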

en cs.LG, cs.AI
arXiv Open Access 2025
Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions

Bangsheng Tang, Carl Chengyan Fu, Fei Kou et al.

Speculative decoding is a standard method for accelerating the inference speed of large language models. However, scaling it for production environments poses several engineering challenges, including efficiently implementing different operations (e.g., tree attention and multi-round speculative decoding) on GPU. In this paper, we detail the training and inference optimization techniques that we have implemented to enable EAGLE-based speculative decoding at a production scale for Llama models. With these changes, we achieve a new state-of-the-art inference latency for Llama models. For example, Llama4 Maverick decodes at a speed of about 4 ms per token (with a batch size of one) on 8 NVIDIA H100 GPUs, which is 10% faster than the previously best known method. Furthermore, for EAGLE-based speculative decoding, our optimizations enable us to achieve a speed-up for large batch sizes between 1.4x and 2.0x at production scale.

en cs.CL
arXiv Open Access 2025
Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding

Aayush Gautam, Susav Shrestha, Narasimha Reddy

Speculative decoding accelerates large language model (LLM) inference by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, selecting an optimal speculation length is critical for maximizing speedup while minimizing wasted computation. We introduce \textit{GammaTune} and \textit{GammaTune+}, training-free adaptive algorithms that dynamically adjust speculation length based on token acceptance rates using a heuristic-based switching mechanism. Evaluated on SpecBench across multiple tasks and model pairs, our method outperforms other heuristic-based approaches and fixed-length speculative decoding, achieving an average speedup of 15\% ($\pm$5\%) with \textit{GammaTune} and 16\% ($\pm$3\%) with \textit{GammaTune+}, while reducing performance variance. This makes \textit{GammaTune} a robust and efficient solution for real-world deployment.
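
The heuristic switching idea can be sketched as a simple rule that grows or shrinks the draft window based on the observed acceptance rate; the thresholds and the doubling/halving policy here are illustrative assumptions, not GammaTune's calibrated mechanism:

```python
def adjust_speculation_length(current_len, accepted, drafted,
                              low=0.5, high=0.8, min_len=1, max_len=16):
    """Grow the draft window when most drafted tokens are accepted,
    shrink it when many are rejected, otherwise keep it unchanged."""
    rate = accepted / drafted if drafted else 0.0
    if rate >= high:
        return min(current_len * 2, max_len)
    if rate <= low:
        return max(current_len // 2, min_len)
    return current_len
```

Called once per verification round, this keeps speculation long on easy stretches of text and short where the target model keeps rejecting drafts.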

en cs.CL, cs.AI
arXiv Open Access 2025
Speculative Decoding Reimagined for Multimodal Large Language Models

Luxi Lin, Zhihang Lin, Zhanpeng Zeng et al.

This paper introduces Multimodal Speculative Decoding (MSD) to accelerate Multimodal Large Language Models (MLLMs) inference. Speculative decoding has been shown to accelerate Large Language Models (LLMs) without sacrificing accuracy. However, current speculative decoding methods for MLLMs fail to achieve the same speedup as they do for LLMs. To address this, we reimagine speculative decoding specifically for MLLMs. Our analysis of MLLM characteristics reveals two key design principles for MSD: (1) Text and visual tokens have fundamentally different characteristics and need to be processed separately during drafting. (2) Both language modeling ability and visual perception capability are crucial for the draft model. For the first principle, MSD decouples text and visual tokens in the draft model, allowing each to be handled based on its own characteristics. For the second principle, MSD uses a two-stage training strategy: In stage one, the draft model is trained on text-only instruction-tuning datasets to improve its language modeling ability. In stage two, MSD gradually introduces multimodal data to enhance the visual perception capability of the draft model. Experiments show that MSD boosts inference speed by up to $2.29\times$ for LLaVA-1.5-7B and up to $2.46\times$ for LLaVA-1.5-13B on multimodal benchmarks, demonstrating its effectiveness. Our code is available at https://github.com/Lyn-Lucy/MSD.

en cs.CV, cs.AI
arXiv Open Access 2025
Confidence-Modulated Speculative Decoding for Large Language Models

Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela

Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter's output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.
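
The entropy-based modulation can be illustrated with a small sketch mapping the drafter's normalized entropy to a per-iteration draft budget; the linear mapping and the budget bounds are illustrative assumptions, not the paper's exact rule:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def draft_budget(probs, min_tokens=1, max_tokens=8):
    """Confident (low-entropy) drafter output -> draft more tokens;
    uncertain (high-entropy) output -> draft fewer."""
    h_max = math.log(len(probs))  # entropy of the uniform distribution
    confidence = (1.0 - entropy(probs) / h_max) if h_max > 0 else 1.0
    return max(min_tokens, round(min_tokens + confidence * (max_tokens - min_tokens)))

peaked  = [0.97, 0.01, 0.01, 0.01]   # drafter is confident
uniform = [0.25, 0.25, 0.25, 0.25]   # drafter is maximally uncertain
```

A margin-based measure (top-1 minus top-2 probability) could be substituted for entropy in the same scaffold.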

en cs.CL, cs.AI
S2 Open Access 2024
Matter, affect, life: A Whiteheadian intervention into ‘more-than-human’ geographies

Tom Roberts

Geographic theorisations of the ‘non-’ or ‘more-than-human’ continue to play a significant role in disrupting anthropocentrism within the humanities and social sciences. This article explores how Alfred North Whitehead's philosophy can contribute to geography's more-than-human aspirations, focussing on his radically non-anthropocentric theory of experience. Situating his work within geography's recent speculative turn, I unpack the implications of Whitehead's philosophy in relation to three key areas of concern in more-than-human geographies, namely new materialism, affect theory, and (neo-)vitalism. In doing so, I show how geographical critiques of anthropocentric thinking stand to gain from a deeper engagement with Whitehead's work.

6 citations en
S2 Open Access 2024
Challenges and Opportunities Surrounding Catholic Education

John Haldane

Catholic education faces a number of serious challenges including cultural and political disrespect for, and hostility towards religion in general and Catholicism in particular, and lack of knowledge of, and commitment to, Catholic beliefs and values among Catholic educational administrators, school managers, teachers, and other staff, as well as the diminishing percentage of even nominally Catholic staff. I set these matters within the context of broader challenges surrounding Catholic education, deriving from three cultural movements: the reformation, the emergence of liberalism, and the scientific revolution, which undermined the synthesis of scripture, theology, and speculative and practical philosophy achieved in the high middle-ages. I propose in response a creative critique showing that what is of authentic value in modernity can be accommodated within the traditional synthesis. I also connect that tradition with strands of eastern philosophy suggesting that the movement of people, ideas, and traditions from Eastern cultures into historically Western societies provides an opportunity for further synthesis of a wisdom-based approach to education.

4 citations en
DOAJ Open Access 2024
O conceito de arquétipo na Imunologia

Selma Giorgio

The concept of archetype was introduced into immunology in 2019. Given the broad and complex functions of the immune system, the collection of diverse cell types and cytokines, the programs of gene expression, and cellular plasticity, the application of the archetype concept may play an organizing and guiding role in generating new immunological practices and theories. In this article, I analyze the concept of the immune archetype and make a brief excursion into the origin of the term and its meanings in some non-biological contexts and in significant biological contexts, such as morphoanatomy, neurobiology, and ecology. In light of these considerations, I propose an immune archetype, the granuloma, a complex and evolutionarily conserved cellular structure. I conclude by revisiting the meanings of archetype in the different contexts and comparing them with the one proposed for immunological phenomena, thus contributing to reflection on topics of interest to the philosophy of immunology.

Biology (General), Epistemology. Theory of knowledge
DOAJ Open Access 2024
Seroprevalence to Hepatitis B Virus among Prisoners Taking into Account Age, HIV Status, and Injection Drug Use

M. V. Piterskiy, A. A. Storozhev, Yu. A. Zakharova et al.

Relevance. Prisoners have a high risk of contracting hemocontact viral infections (including HIV, viral hepatitis B and C, etc.), which creates an additional infectious burden on the entire population living in the territory. Aims. To study the level of immune protection against viral hepatitis B in risk groups (by age, HIV status, and injection drug use) of persons held in places of detention, in order to identify those in need of vaccine prophylaxis. Materials & Methods. 343 blood serum samples obtained in 2021 from males with negative HBsAg status were studied. Anti-HBs antibodies to HBsAg were determined using the "VectoHBsAg-antibodies" reagent kit (Vector-Best, Russia). Results and discussion. A protective titer of anti-HBs antibodies was detected in 44.0% (n = 151) of cases and was absent in 56.0% of the subjects. At the same time, anti-HBs was significantly more often detected in people living with HIV/AIDS (p = 0.038), injecting drug users (p = 0.002), and young people born after 1984 (p = 0.019). Conclusion. The lack of a significant level of collective immunity among prisoners, primarily in the older age group born before 1984, together with their risky behaviors (sexual and injection-related), indicates the need for active identification of seronegative persons serving sentences in places of detention and for specific immunoprophylaxis.

Epistemology. Theory of knowledge
arXiv Open Access 2024
SSSD: Simply-Scalable Speculative Decoding

Michele Marzollo, Jiawei Zhuang, Niklas Roemer et al.

Speculative Decoding has emerged as a popular technique for accelerating inference in Large Language Models. However, most existing approaches yield only modest improvements in production serving systems. Methods that achieve substantial speedups typically rely on an additional trained draft model or auxiliary model components, increasing deployment and maintenance complexity. This added complexity reduces flexibility, particularly when serving workloads shift to tasks, domains, or languages that are not well represented in the draft model's training data. We introduce Simply-Scalable Speculative Decoding (SSSD), a training-free method that combines lightweight n-gram matching with hardware-aware speculation. Relative to standard autoregressive decoding, SSSD reduces latency by up to 2.9x. It achieves performance on par with leading training-based approaches across a broad range of benchmarks, while requiring substantially lower adoption effort--no data preparation, training or tuning are needed--and exhibiting superior robustness under language and domain shift, as well as in long-context settings.
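
The n-gram matching component can be sketched as a lookup that proposes continuations from earlier occurrences of the current suffix in the context; the toy tokenization and the "last occurrence wins" policy are illustrative assumptions, not SSSD's hardware-aware implementation:

```python
def build_ngram_index(tokens, n):
    """Map each n-gram in the sequence to the token that followed it
    (later occurrences overwrite earlier ones)."""
    index = {}
    for i in range(len(tokens) - n):
        index[tuple(tokens[i:i + n])] = tokens[i + n]
    return index

def draft_from_context(context, n=2, max_draft=4):
    """Propose up to max_draft tokens by repeatedly matching the current
    suffix n-gram against the context; stop on the first miss."""
    index = build_ngram_index(context, n)
    seq, draft = list(context), []
    while len(draft) < max_draft:
        key = tuple(seq[-n:])
        if key not in index:
            break
        draft.append(index[key])
        seq.append(index[key])
    return draft

# A repetitive context: the suffix ("a", "b") has been seen before, so the
# tokens that followed it can be speculated without any draft model.
ctx = ["a", "b", "c", "d", "a", "b"]
```

Because drafting is a dictionary lookup, the method needs no training; the target model still verifies every proposed token.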

en cs.CL, cs.AI
arXiv Open Access 2024
When Speculation Spills Secrets: Side Channels via Speculative Decoding In LLMs

Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang et al.

Deployed large language models (LLMs) often rely on speculative decoding, a technique that generates and verifies multiple candidate tokens in parallel, to improve throughput and latency. In this work, we reveal a new side-channel whereby input-dependent patterns of correct and incorrect speculations can be inferred by monitoring per-iteration token counts or packet sizes. In evaluations using research prototypes and production-grade vLLM serving frameworks, we show that an adversary monitoring these patterns can fingerprint user queries (from a set of 50 prompts) with over 75% accuracy across four speculative-decoding schemes at temperature 0.3: REST (100%), LADE (91.6%), BiLD (95.2%), and EAGLE (77.6%). Even at temperature 1.0, accuracy remains far above the 2% random baseline - REST (99.6%), LADE (61.2%), BiLD (63.6%), and EAGLE (24%). We also show the capability of the attacker to leak confidential datastore contents used for prediction at rates exceeding 25 tokens/sec. To defend against these, we propose and evaluate a suite of mitigations, including packet padding and iteration-wise token aggregation.
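
The packet-padding mitigation can be sketched as bucketing each iteration's emitted token count so that correct and incorrect speculations become indistinguishable on the wire; the bucket size and pad marker are illustrative assumptions, not the paper's evaluated configuration:

```python
def pad_iteration(tokens, bucket=8, pad="<pad>"):
    """Pad an iteration's emitted tokens up to the next multiple of
    `bucket`, so the observable per-iteration count no longer reveals
    how many speculated tokens were accepted."""
    return tokens + [pad] * (-len(tokens) % bucket)
```

After padding, an observer sees only multiples of the bucket size, collapsing the acceptance-dependent signal that the fingerprinting attack exploits, at the cost of extra bytes per response.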

en cs.CL, cs.AI
arXiv Open Access 2024
AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration

Bradley McDanel

Large language models typically generate tokens autoregressively, using each token as input for the next. Recent work on Speculative Decoding has sought to accelerate this process by employing a smaller, faster draft model to more quickly generate candidate tokens. These candidates are then verified in parallel by the larger (original) verify model, resulting in overall speedup compared to using the larger model by itself in an autoregressive fashion. In this work, we introduce AMUSD (Asynchronous Multi-device Speculative Decoding), a system that further accelerates generation by decoupling the draft and verify phases into a continuous, asynchronous approach. Unlike conventional speculative decoding, where only one model (draft or verify) performs token generation at a time, AMUSD enables both models to perform predictions independently on separate devices (e.g., GPUs). We evaluate our approach over multiple datasets and show that AMUSD achieves an average 29% improvement over speculative decoding and up to 1.96$\times$ speedup over conventional autoregressive decoding, while achieving identical output quality. Our system is open-source and available at https://github.com/BradMcDanel/AMUSD/.

en cs.CL, cs.DC
arXiv Open Access 2024
Modeling Speculative Trading Patterns in Token Markets: An Agent-Based Analysis with TokenLab

Mengjue Wang, Stylianos Kampakis

This paper presents the application of TokenLab, an agent-based modeling framework designed to analyze price dynamics and speculative behavior within token-based economies. By decomposing complex token systems into discrete agent interactions governed by fundamental behavioral rules, TokenLab simplifies the simulation of otherwise intricate market scenarios. Its core innovation lies in its ability to model a range of speculative strategies and assess their collective influence on token price evolution. Through a novel controller mechanism, TokenLab facilitates the simulation of multiple speculator archetypes and their interactions, thereby providing valuable insights into market sentiment and price formation. This method enables a systematic exploration of how varying degrees of speculative activity and evolving strategies across different market stages shape token price trajectories. Our findings enhance the understanding of speculation in token markets and present a quantitative framework for measuring and interpreting market heat indicators.

en cs.MA
arXiv Open Access 2024
Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding

Hyun Ryu, Eric Kim

Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token generation process. Speculative decoding addresses this bottleneck by introducing a two-stage framework: drafting and verification. A smaller, efficient model generates a preliminary draft, which is then refined by a larger, more sophisticated model. This paper provides a comprehensive survey of speculative decoding methods, categorizing them into draft-centric and model-centric approaches. We discuss key ideas associated with each method, highlighting their potential for scaling LLM inference. This survey aims to guide future research in optimizing speculative decoding and its integration into real-world LLM applications.
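
The two-stage framework described in the survey can be sketched as a greedy draft-then-verify loop; the toy token-generator interface (`draft_next`/`target_next` as functions from a sequence to the next token) is an illustrative assumption standing in for real models:

```python
def speculative_decode(draft_next, target_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding: the draft model proposes k tokens;
    the target accepts the longest prefix matching its own greedy choices
    and contributes one token itself, so the output is identical to
    running the target model alone."""
    seq = list(prompt)
    while len(seq) - len(prompt) < num_tokens:
        # Drafting stage: propose k candidate tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # Verification stage: accept the matching prefix, then one target token.
        for tok in draft:
            expected = target_next(seq)
            seq.append(expected)
            if expected != tok or len(seq) - len(prompt) >= num_tokens:
                break
        else:
            seq.append(target_next(seq))  # all k accepted: free bonus token
    return seq[len(prompt):len(prompt) + num_tokens]

# Toy deterministic "models": the next token is a function of sequence length.
target = lambda seq: len(seq) % 3
good_draft = target                 # perfect drafter: everything accepted
bad_draft = lambda seq: 99          # hopeless drafter: everything rejected
```

Both drafters yield the same output as target-only autoregressive decoding; a better drafter only reduces how many sequential target steps are needed, which is the source of the speedup.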

en cs.CL, cs.AI
arXiv Open Access 2024
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge

Bin Xiao, Chunan Shi, Xiaonan Nie et al.

Large language models (LLMs) suffer from low efficiency due to the mismatch between the requirements of auto-regressive decoding and the design of most contemporary GPUs. Specifically, billions to trillions of parameters must be loaded to the GPU cache through its limited memory bandwidth for computation, but only a small batch of tokens is actually computed. Consequently, the GPU spends most of its time on memory transfer instead of computation. Recently, parallel decoding, a type of speculative decoding algorithm, has become more popular and has demonstrated impressive efficiency improvements in generation. It introduces extra decoding heads to large models, enabling them to predict multiple subsequent tokens simultaneously and verify these candidate continuations in a single decoding step. However, this approach deviates from the training objective of next-token prediction used during pre-training, resulting in a low hit rate for candidate tokens. In this paper, we propose a new speculative decoding algorithm, Clover, which integrates sequential knowledge into the parallel decoding process. This enhancement improves the hit rate of speculators and thus boosts overall efficiency. Clover transmits the sequential knowledge from pre-speculated tokens via the Regressive Connection, then employs an Attention Decoder to integrate these speculated tokens. Additionally, Clover incorporates an Augmenting Block that modifies the hidden states to better align with the purpose of speculative generation rather than next-token prediction. The experiment results demonstrate that Clover outperforms the baseline by up to 91% on Baichuan-Small and 146% on Baichuan-Large, respectively, and exceeds the performance of the previously top-performing method, Medusa, by up to 37% on Baichuan-Small and 57% on Baichuan-Large, respectively.

en cs.CL, cs.AI
DOAJ Open Access 2023
P.I. Ogarkov

Editorial

On 25 February 2023, after a severe and prolonged illness, Pavel Ivanovich OGARKOV passed away in his 68th year: Honored Worker of the Higher School of the Russian Federation, Doctor of Medical Sciences, Professor, retired Colonel of the Medical Service, respected by us all.

Epistemology. Theory of knowledge
arXiv Open Access 2023
SoK: Hardware Defenses Against Speculative Execution Attacks

Guangyuan Hu, Zecheng He, Ruby Lee

Speculative execution attacks leverage the speculative and out-of-order execution features in modern computer processors to access secret data or execute code that should not be executed. Secret information can then be leaked through a covert channel. While software patches can be installed for mitigation on existing hardware, these solutions can incur big performance overhead. Hardware mitigation is being studied extensively by the computer architecture community. It has the benefit of preserving software compatibility and the potential for much smaller performance overhead than software solutions. This paper presents a systematization of the hardware defenses against speculative execution attacks that have been proposed. We show that speculative execution attacks consist of 6 critical attack steps. We propose defense strategies, each of which prevents a critical attack step from happening, thus preventing the attack from succeeding. We then summarize 20 hardware defenses and overhead-reducing features that have been proposed. We show that each defense proposed can be classified under one of our defense strategies, which also explains why it can thwart the attack from succeeding. We discuss the scope of the defenses, their performance overhead, and the security-performance trade-offs that can be made.

en cs.CR, cs.AR
arXiv Open Access 2023
A Classical Model of Speculative Asset Price Dynamics

Sabiou Inoua, Vernon Smith

In retrospect, the experimental findings on competitive market behavior called for a revival of the old, classical, view of competition as a collective higgling and bargaining process (as opposed to price-taking behaviors) founded on reservation prices (in place of the utility function). In this paper, we specialize the classical methodology to deal with speculation, an important impediment to price stability. The model involves typical features of a field or lab asset market setup and lends itself to an experimental test of its specific predictions; here we use the model to explain three general stylized facts, well established both empirically and experimentally: the excess, fat-tailed, and clustered volatility of speculative asset prices. The fat tails emerge in the model from the amplifying nature of speculation, leading to a random-coefficient autoregressive return process (and power-law tails); the volatility clustering is due to the traders' long memory of news; bubbles are a persistent phenomenon in the model, and, assuming the standard lab present value pattern, the bubble size increases with the proportion of speculators and decreases with the trading horizon.
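
The random-coefficient autoregressive return process mentioned above can be simulated in a few lines; the uniform distribution for the amplification coefficient and all parameter values are illustrative assumptions, chosen only to make the fat-tail effect visible:

```python
import random

def simulate_returns(steps=20000, seed=1):
    """r_t = a_t * r_{t-1} + e_t with a random amplification coefficient a_t.
    Occasional draws of a_t > 1 amplify past returns, producing the
    power-law (fat) tails discussed in the paper."""
    rng = random.Random(seed)
    r, out = 0.0, []
    for _ in range(steps):
        a = rng.uniform(0.0, 1.9)   # contracting on average, sometimes amplifying
        r = a * r + rng.gauss(0.0, 1.0)
        out.append(r)
    return out

rets = simulate_returns()
# Excess kurtosis as a crude fat-tail indicator (a Gaussian would give ~0).
n = len(rets)
mean = sum(rets) / n
var = sum((x - mean) ** 2 for x in rets) / n
excess_kurtosis = sum((x - mean) ** 4 for x in rets) / n / var ** 2 - 3.0
```

With a Gaussian shock and a fixed coefficient below 1, the excess kurtosis would stay near zero; it is the randomness of the coefficient that generates the heavy tails.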

en q-fin.GN, econ.TH
arXiv Open Access 2023
An Attack on The Speculative Vectorization: Leakage from Higher Dimensional Speculation

Sayinath Karuppanan, Samira Mirbagher Ajorpaz

This paper argues and shows that speculative vectorization, where a loop with rare or unknown memory dependencies is still vectorized, is fundamentally vulnerable and cannot be mitigated by existing defenses. We implement a simple proof of concept and demonstrate the leakage on the Apple M2 SoC. We describe the source of the leakage using Microarchitectural Leakage Descriptors (MLDs) and additionally describe principles for extending MLDs to other optimizations. As part of the implementation, we also reverse-engineer the M2 cache size and use a threaded timer to differentiate between cache hits and misses.

en cs.CR

Page 11 of 71379