Real-time spoken dialogue systems face a fundamental tension between latency and response quality. End-to-end speech-to-speech (S2S) models respond immediately and naturally handle turn-taking, backchanneling, and interruption, but produce semantically weaker outputs. Cascaded pipelines (ASR -> LLM) deliver stronger responses at the cost of latency that grows with model size. We present RelayS2S, a hybrid architecture that runs two paths in parallel upon turn detection. The fast path -- a duplex S2S model -- speculatively drafts a short response prefix that is streamed immediately to TTS for low-latency audio onset, while continuing to monitor live audio events. The slow path -- a cascaded ASR -> LLM pipeline -- generates a higher-quality continuation conditioned on the committed prefix, producing a seamless utterance. A lightweight learned verifier gates the handoff, committing the prefix when appropriate or falling back gracefully to the slow path alone. Experiments show that RelayS2S achieves P90 onset latency comparable to the S2S model while retaining 99% cascaded response quality in average score, with benefits growing as the slow-path model scales. Because the prefix handoff requires no architectural modification to either component, RelayS2S serves as a lightweight, drop-in addition to existing cascaded pipelines. Our code and data are publicly available at: https://github.com/mailong25/relays2s
Video Large Language Models (Video-LLMs) excel in video understanding but suffer from high inference latency during autoregressive generation. Speculative Decoding (SD) mitigates this by applying a draft-and-verify paradigm, yet existing methods are constrained by rigid exact-match rules, severely limiting the acceleration potential. To bridge this gap, we propose LVSpec, the first training-free loosely SD framework tailored for Video-LLMs. Grounded in the insight that generation is governed by sparse visual-relevant anchors (mandating strictness) amidst abundant visual-irrelevant fillers (permitting loose verification), LVSpec employs a lightweight visual-relevant token identification scheme to accurately pinpoint the former. To further maximize acceptance, we augment this with a position-shift tolerant mechanism that effectively salvages positionally mismatched but semantically equivalent tokens. Experiments demonstrate that LVSpec achieves high fidelity and speed: it preserves >99.8 of target performance while accelerating Qwen2.5-VL-32B by 2.70x and LLaVA-OneVision-72B by 2.94x. Notably, it boosts the mean accepted length and speedup ratio by 136% and 35% compared to SOTA training-free SD methods for Video-LLMs.
Although speculative decoding is widely used to accelerate Vision-Language Models (VLMs) inference, it faces severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically falls into the trap of attention dilution and negative visual gain due to key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs, indicating that critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework, which first utilizes visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model, and leverages intermediate-layer visual state bridging to train the draft model with semantic-rich intermediate states, thereby filtering out low-level visual noise. Additionally, a multi-token prediction strategy is introduced to bridge the training-inference distribution shift. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation in long sequences and offering a practical solution for real-time long video tasks.
Raja Gond, Aditya K Kamath, Ramachandran Ramjee
et al.
In LLM inference, the same prompt may yield different outputs across different runs. At the system level, this non-determinism arises from floating-point non-associativity combined with dynamic batching and GPU kernels whose reduction orders vary with batch size. A straightforward way to eliminate non-determinism is to disable dynamic batching during inference, but doing so severely degrades throughput. Another approach is to make kernels batch-invariant; however, this tightly couples determinism to kernel design, requiring new implementations. This coupling also imposes fixed runtime overheads, regardless of how much of the workload actually requires determinism. Inspired by ideas from speculative decoding, we present LLM-42, a scheduling-based approach to enable determinism in LLM inference. Our key observation is that if a sequence is in a consistent state, the next emitted token is likely to be consistent even with dynamic batching. Moreover, most GPU kernels use shape-consistent reductions. Leveraging these insights, LLM-42 decodes tokens using a non-deterministic fast path and enforces determinism via a lightweight verify-rollback loop. The verifier replays candidate tokens under a fixed-shape reduction schedule, commits those that are guaranteed to be consistent across runs, and rolls back those violating determinism. LLM-42 mostly re-uses existing kernels unchanged and incurs overhead only in proportion to the traffic that requires determinism.
From 2000 to 2015, the UN's Millennium Development Goals guided global priorities. The subsequent Sustainable Development Goals (SDGs) adopted a more dynamic approach, with annual indicator updates. As 2030 nears and progress lags, innovative acceleration strategies are critical. This study develops an AI-powered knowledge graph system to analyze SDG interconnections, discover potential new goals, and visualize them online. Using official SDG texts, Elsevier's keyword dataset, and 1,127 TED Talk transcripts (2020.01-2024.04), a pilot on 269 talks from 2023 applies AI-speculative design, large language models, and retrieval-augmented generation. Key findings include: (1) Heatmap analysis reveals strong associations between Goal 10 and Goal 16, and minimal coverage of Goal 6. (2) In the knowledge graph, simulated dialogue over time reveals new central nodes, showing how richer data supports divergent thinking and goal clarity. (3) Six potential new goals are proposed, centered on equity, resilience, and technology-driven inclusion. This speculative-AI framework offers fresh insights for policymakers and lays groundwork for future multimodal and cross-system SDG applications.
Drawing on infrastructure studies in HCI and CSCW, this paper introduces Protocol Futuring, a methodological framework that extends design futuring by foregrounding protocols -- rules, standards, and coordination mechanisms -- as the primary material of speculative inquiry. Rather than imagining discrete future artifacts, Protocol Futuring examines how protocol rules accumulate drift, jam, and other second-order effects over long temporal horizons. We demonstrate the method through a case study of Knowledge Futurama, a multi-team participatory workshop exploring millennial-scale knowledge preservation. Using a relay format in which teams inherited and reinterpreted partially formed designs, the workshop revealed how ambiguous handovers, adversarial reinterpretations, shifting cultural norms, and crisis dynamics transform protocols as they move across communities and epochs. The case shows how Protocol Futuring makes infrastructural politics and long-run consequences analytically visible. We discuss the method's strengths, limitations, and implications for researchers investigating emergent sociotechnical systems whose impacts unfold over extended timescales.
This paper explores the use of large language models (LLMs) as research tools in the history, philosophy, and sociology of science (HPSS). LLMs are remarkably effective at processing unstructured text and inferring meaning from context, offering new affordances that challenge long-standing divides between computational and interpretive methods. This raises both opportunities and challenges for HPSS, which emphasizes interpretive methodologies and understands meaning as context-dependent, ambiguous, and historically situated. We argue that HPSS is uniquely positioned not only to benefit from LLMs' capabilities but also to interrogate their epistemic assumptions and infrastructural implications. To this end, we first offer a concise primer on LLM architectures and training paradigms tailored to non-technical readers. We frame LLMs not as neutral tools but as epistemic infrastructures that encode assumptions about meaning, context, and similarity, conditioned by their training data, architecture, and patterns of use. We then examine how computational techniques enhanced by LLMs, such as structuring data, detecting patterns, and modeling dynamic processes, can be applied to support interpretive research in HPSS. Our analysis compares full-context and generative models, outlines strategies for domain and task adaptation (e.g., continued pretraining, fine-tuning, and retrieval-augmented generation), and evaluates their respective strengths and limitations for interpretive inquiry in HPSS. We conclude with four lessons for integrating LLMs into HPSS: (1) model selection involves interpretive trade-offs; (2) LLM literacy is foundational; (3) HPSS must define its own benchmarks and corpora; and (4) LLMs should enhance, not replace, interpretive methods.
Xingmeng Zhao, Tongnian Wang, Dan Schumacher
et al.
Artificial intelligence (AI) is rapidly transforming healthcare, enabling fast development of tools like stress monitors, wellness trackers, and mental health chatbots. However, rapid and low-barrier development can introduce risks of bias, privacy violations, and unequal access, especially when systems ignore real-world contexts and diverse user needs. Many recent methods use AI to detect risks automatically, but this can reduce human engagement in understanding how harms arise and who they affect. We present a human-centered framework that generates user stories and supports multi-agent discussions to help people think creatively about potential benefits and harms before deployment. In a user study, participants who read stories recognized a broader range of harms, distributing their responses more evenly across all 17 harm types. In contrast, those who did not read stories focused primarily on privacy and well-being (79.1%). Our findings show that storytelling helped participants speculate about a broader range of harms and benefits and think more creatively about AI's impact on users.
Transformer-based large language models (LLMs) have already achieved remarkable results on long-text tasks, but the limited GPU memory (VRAM) resources struggle to accommodate the linearly growing demand for key-value (KV) cache as the sequence length increases, which has become a bottleneck for the application of LLMs on long sequences. Existing KV cache compression methods include eviction, merging, or quantization of the KV cache to reduce its size. However, compression results in irreversible information forgetting, potentially affecting the accuracy of subsequent decoding. In this paper, we propose SpeCache, which takes full advantage of the large and easily expandable CPU memory to offload the complete KV cache, and dynamically fetches KV pairs back in each decoding step based on their importance measured by low-bit KV cache copy in VRAM. To avoid inference latency caused by CPU-GPU communication, SpeCache speculatively predicts the KV pairs that the next token might attend to, allowing us to prefetch them before the next decoding step which enables parallelization of prefetching and computation. Experiments on LongBench and Needle-in-a-Haystack benchmarks verify that SpeCache effectively reduces VRAM usage while avoiding information forgetting for long sequences without re-training, even with a 10x high KV cache compression ratio.
Raffi Khatchadourian, Tatiana Castro Vélez, Mehdi Bagherzadeh
et al.
Efficiency is essential to support ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code -- supporting symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development is error-prone, non-intuitive, and difficult to debug. Consequently, more natural, imperative DL frameworks encouraging eager execution have emerged but at the expense of run-time performance. Though hybrid approaches aim for the "best of both worlds," using them effectively requires subtle considerations. Our key insight is that, while DL programs typically execute sequentially, hybridizing imperative DL code resembles parallelizing sequential code in traditional systems. Inspired by this, we present an automated refactoring approach that assists developers in determining which otherwise eagerly-executed imperative DL functions could be effectively and efficiently executed as graphs. The approach features novel static imperative tensor and side-effect analyses for Python. Due to its inherent dynamism, analyzing Python may be unsound; however, the conservative approach leverages a speculative (keyword-based) analysis for resolving difficult cases that informs developers of any assumptions made. The approach is: (i) implemented as a plug-in to the PyDev Eclipse IDE that integrates the WALA Ariadne analysis framework and (ii) evaluated on nineteen DL projects consisting of 132 KLOC. The results show that 326 of 766 candidate functions (42.56%) were refactorable, and an average relative speedup of 2.16x on performance tests was observed with negligible differences in model accuracy. The results indicate that the approach is useful in optimizing imperative DL code to its full potential.
Tokens are basic elements in the datasets for LLM training. It is well-known that many tokens representing Chinese phrases in the vocabulary of GPT (4o/4o-mini/o1/o3/4.5/4.1/o4-mini) are indicating contents like pornography or online gambling. Based on this observation, our goal is to locate Polluted Chinese (PoC) tokens in LLMs and study the relationship between PoC tokens' existence and training data. (1) We give a formal definition and taxonomy of PoC tokens based on the GPT's vocabulary. (2) We build a PoC token detector via fine-tuning an LLM to label PoC tokens in vocabularies by considering each token's both semantics and related contents from the search engines. (3) We study the speculation on the training data pollution via PoC tokens' appearances (token ID). Experiments on GPT and other 23 LLMs indicate that tokens widely exist while GPT's vocabulary behaves the worst: more than 23% long Chinese tokens (i.e., a token with more than two Chinese characters) are either porn or online gambling. We validate the accuracy of our speculation method on famous pre-training datasets like C4 and Pile. Then, considering GPT-4o, we speculate that the ratio of "Yui Hatano" related webpages in GPT-4o's training data is around 0.5%.
Bell's inequality is derived from three assumptions: measurement independence, outcome independence, and parameter independence. Among these, measurement independence, often taken for granted, holds that hidden variables are statistically uncorrelated with measurement settings. Under this assumption, the violation of Bell's inequality implies that either outcome independence or parameter independence fails to hold, meaning that local hidden variables do not exist. In this paper, we refer to this interpretive stance as the nonfactorizable position. In contrast, superdeterminism represents the view that measurement independence does not hold. Despite its foundational role, this assumption has received relatively little philosophical scrutiny. This paper offers a philosophical reassessment of measurement independence through three major frameworks in the philosophy of science: de Regt's contextual theory of scientific understanding, Kuhn's criteria for theory choice, and Lakatos's methodology of scientific research programmes. Using these lenses, we evaluate the two major responses to the violation of Bell's inequality, the nonfactorizable position and superdeterminism, and argue that the nonfactorizable position currently fares better across all three criteria. Beyond this binary, we introduce a spectrum of intermediate positions that allow for partial violations of measurement independence, modeled via mutual information. These positions modify the ``positive heuristic'' of superdeterminism, a crucial component in Lakatos's definition of research programmes, offering avenues for progressive research. This analysis reframes the debate surrounding Bell's inequality and illustrates how methodological tools can effectively guide theory evaluation in physics.
The article analyzes the notion of philosophical education in Étienne Gilson’s essay History of Philosophy and Philosophical Education. The French thinker presents philosophy not as a collection of doctrines but as a life devoted to the pursuit of wisdom, requiring a personal act of knowledge and inner formation. The teaching of philosophy makes sense only insofar as it leads to authentic philosophizing rather than merely the transmission of knowledge. A key role in this process is played by the history of philosophy, which makes it possible to encounter great masters and enter into dialogue with their works. The master–disciple relationship, also through the reading of the classics, constitutes the foundation of philosophical formation. Gilson emphasizes the communal dimension of philosophy, understood as participation in the community of philosophers built over the centuries.
En el parágrafo §7 de Ser y Tiempo Heidegger distingue entre «vulgar», «formal» y «fenomenológico» a propósito del concepto de «fenómeno». El objetivo del presente artículo es mostrar, en primer lugar, que esta diferencia puede entenderse únicamente desde una previa aclaración de los motivos metodológicos del término. Se sostendrá, en segundo lugar, que todo tratamiento temático es al mismo tiempo un tratamiento metodológico, esto es, que la fenomenología hermenéutica de la existencia humana desarrolla y oculta un particular discurso del método.
Augmented reality (AR) technologies function to ‘augment’ normal perception by superimposing virtual objects onto an agent’s visual field. The philosophy of augmented reality is a small but growing subfield within the philosophy of technology. Existing work in this subfield includes research on the phenomenology of augmented experiences, the metaphysics of virtual objects, and different ethical issues associated with AR systems, including (but not limited to) issues of privacy, property rights, ownership, trust, and informed consent. This paper addresses some epistemological issues posed by AR systems. I focus on a near-future version of AR technology called the Real-World Web, which promises to radically transform the nature of our relationship to digital information by mixing the virtual with the physical. I argue that the Real-World Web (RWW) threatens to exacerbate three existing epistemic problems in the digital age: the problem of digital distraction, the problem of digital deception, and the problem of digital divergence. The RWW is poised to present new versions of these problems in the form of what I call the augmented attention economy, augmented skepticism, and the problem of other augmented minds. The paper draws on a range of empirical research on AR and offers a phenomenological analysis of virtual objects as perceptual affordances to help ground and guide the speculative nature of the discussion. It also considers a few policy-based and designed-based proposals to mitigate the epistemic threats posed by AR technology.
Solving the AI alignment problem requires having clear, defensible values towards which AI systems can align. Currently, targets for alignment remain underspecified and do not seem to be built from a philosophically robust structure. We begin the discussion of this problem by presenting five core, foundational values, drawn from moral philosophy and built on the requisites for human existence: survival, sustainable intergenerational existence, society, education, and truth. We show that these values not only provide a clearer direction for technical alignment work, but also serve as a framework to highlight threats and opportunities from AI systems to both obtain and sustain these values.
This article takes as its starting point the notion of the "new time of the world," proposed by the philosopher Paulo Arantes, around which I articulate the material and subjective transformations of post-covid forms of life. I work with the proposition that these changes are inseparable, a consequence of the perfect fit between platform capitalism and the changes brought about by the covid -19 pandemic. Finally, drawing on the philosophy of Judith Butler, I discuss the difficulty of finding a path for mourning work in relation to the collective losses that are, indeed, the loss of a certain form of human existence.
It is necessary to quantitatively describe or illustrate the characteristics of abnormal stock price fluctuations in order to prevent and control financial risks. This paper studies the fractal structure of China’s stock market by calculating the fractal dimension and scaling behavior on the timeline of its eight big slumps, the results show that the slumps have multifractal characteristics, which are correlated with the policy intervention, institutional arrangements, and investors’ rationality. The empirical findings are a perfect match with the anomalous features of the stock prices. The fractal dimensions of the eight stock collapses are between 0.84 and 0.98. The fractal dimension distribution of the slumps is sensitive to market conditions and the active degree of speculative trading. The more mature market conditions and the more risk-averse investors correspond to the higher fractal dimension and the fall which is less deep. Therefore, the fractal characteristics could reflect the evolution characteristics of the stock market and investment philosophy. The parameter set calculated in this paper could be used as an effective tool to foresee the slumps on the horizon.