Offline Reinforcement Learning with Implicit Q-Learning
Ilya Kostrikov, Ashvin Nair, S. Levine
Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This trade-off is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values. We propose an offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state-conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function. Then, we extract the policy via advantage-weighted behavioral cloning. We dub our method implicit Q-learning (IQL). IQL demonstrates state-of-the-art performance on D4RL, a standard benchmark for offline reinforcement learning. We also demonstrate that IQL achieves strong performance when fine-tuned with online interaction after offline initialization.
1341 citations
Computer Science
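A minimal sketch of the expectile regression step the IQL abstract describes, assuming PyTorch and hypothetical q_net/v_net networks; this illustrates the loss, and is not the authors' released implementation:

```python
# Sketch of IQL's asymmetric (expectile) regression, per the abstract above.
# tau > 0.5 biases the value function toward an upper expectile of Q(s, a)
# evaluated only at dataset actions, so no out-of-distribution action is queried.
import torch

def expectile_loss(q_values, v_values, tau=0.7):
    # Weight |tau - 1(u < 0)| * u^2: overestimation errors count tau,
    # underestimation errors count (1 - tau).
    diff = q_values - v_values
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

# Alternation described in the abstract (q_net, v_net are hypothetical):
# v_loss = expectile_loss(q_net(s, a).detach(), v_net(s))
# q_target = r + gamma * (1 - done) * v_net(s_next).detach()
# q_loss = (q_net(s, a) - q_target).pow(2).mean()
# Policy extraction is advantage-weighted behavioral cloning with
# weights exp(beta * (q_net(s, a) - v_net(s))).
```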
Science Requirements and Detector Concepts for the Electron-Ion Collider
R. A. Khalek, A. Accardi, J. Adam
et al.
This report describes the physics case, the resulting detector requirements, and the evolving detector concepts for the experimental program at the Electron-Ion Collider (EIC). The EIC will be a powerful new high-luminosity facility in the United States with the capability to collide high-energy electron beams with high-energy proton and ion beams, providing access to those regions in the nucleon and nuclei where their structure is dominated by gluons. Moreover, polarized beams in the EIC will give unprecedented access to the spatial and spin structure of the proton, neutron, and light ions. The studies leading to this document were commissioned and organized by the EIC User Group with the objective of advancing the state and detail of the physics program and developing detector concepts that meet the emerging requirements in preparation for the realization of the EIC. The effort aims to provide the basis for further development of concepts for experimental equipment best suited for the science needs, including the importance of two complementary detectors and interaction regions. This report consists of three volumes. Volume I is an executive summary of our findings and developed concepts. In Volume II we describe studies of a wide range of physics measurements and the emerging requirements on detector acceptance and performance. Volume III discusses general-purpose detector concepts and the underlying technologies to meet the physics requirements. These considerations will form the basis for a world-class experimental program that aims to increase our understanding of the fundamental structure of all visible matter.
1033 citations
Physics, Computer Science
Software for the frontiers of quantum chemistry: An overview of developments in the Q-Chem 5 package
E. Epifanovsky, A. Gilbert, Xintian Feng
et al.
This article summarizes technical advances contained in the fifth major release of the Q-Chem quantum chemistry program package, covering developments since 2015. A comprehensive library of exchange–correlation functionals, along with a suite of correlated many-body methods, continues to be a hallmark of the Q-Chem software. The many-body methods include novel variants of both coupled-cluster and configuration-interaction approaches along with methods based on the algebraic diagrammatic construction and variational reduced density-matrix methods. Methods highlighted in Q-Chem 5 include a suite of tools for modeling core-level spectroscopy, methods for describing metastable resonances, methods for computing vibronic spectra, the nuclear–electronic orbital method, and several different energy decomposition analysis techniques. High-performance capabilities including multithreaded parallelism and support for calculations on graphics processing units are described. Q-Chem boasts a community of well over 100 active academic developers, and the continuing evolution of the software is supported by an “open teamware” model and an increasingly modular design.
Event generator tunes obtained from underlying event and multiparton scattering measurements
V. Khachatryan, A. Sirunyan, A. Tumasyan
et al.
New sets of parameters ("tunes") for the underlying-event (UE) modeling of the PYTHIA8, PYTHIA6 and HERWIG++ Monte Carlo event generators are constructed using different parton distribution functions. Combined fits to CMS UE proton-proton (pp) data at sqrt(s) = 7 TeV and to UE proton-antiproton (p p-bar) data from the CDF experiment at lower sqrt(s) are used to study the UE models and constrain their parameters, thereby providing improved predictions for proton-proton collisions at 13 TeV. In addition, it is investigated whether the values of the parameters obtained from fits to UE observables are consistent with the values determined from fitting observables sensitive to double-parton scattering processes. Finally, comparisons of the UE tunes to "minimum bias" (MB) events, multijet, and Drell-Yan (q q-bar to Z / gamma* to lepton-antilepton + jets) observables at 7 and 8 TeV are presented, as well as predictions for MB and UE observables at 13 TeV.
Rapid blood-pressure lowering in patients with acute intracerebral hemorrhage.
C. S. Anderson, E. Heeley, Y. Huang
et al.
Jet energy scale and resolution in the CMS experiment in pp collisions at 8 TeV
V. Khachatryan, A. Sirunyan, A. Tumasyan
et al.
Improved jet energy scale corrections, based on a data sample corresponding to an integrated luminosity of 19.7 inverse-femtobarns collected by the CMS experiment in proton-proton collisions at a center-of-mass energy of 8 TeV, are presented. The corrections as a function of pseudorapidity eta and transverse momentum pT are extracted from data and simulated events combining several channels and methods. They account successively for the effects of pileup, uniformity of the detector response, and residual data-simulation jet energy scale differences. Further corrections, depending on the jet flavor and distance parameter (jet size) R, are also presented. The jet energy resolution is measured in data and simulated events and is studied as a function of pileup, jet size, and jet flavor. Typical jet energy resolutions at the central rapidities are 15-20% at 30 GeV, about 10% at 100 GeV, and 5% at 1 TeV. The studies exploit events with dijet topology, as well as photon+jet, Z+jet and multijet events. Several new techniques are used to account for the various sources of jet energy scale corrections, and a full set of uncertainties, and their correlations, are provided. The final uncertainties on the jet energy scale are below 3% across the phase space considered by most analyses (pT > 30 GeV and abs(eta) < 5.0). In the barrel region (abs(eta) < 1.3), an uncertainty below 1% for pT > 30 GeV is reached when excluding the jet flavor uncertainties, which are provided separately for different jet flavors. A new benchmark for jet energy scale determination at hadron colliders is achieved with 0.32% uncertainty for jets with pT of the order of 165-330 GeV, and abs(eta) < 0.8.
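The abstract describes corrections applied successively (pileup, detector response, residual data/simulation differences). A schematic sketch of how such factorized, multiplicative correction levels compose; the factors below are illustrative placeholders, not CMS values:

```python
# Factorized jet energy corrections applied in sequence; each level is a
# function of the pt corrected by the previous levels. Factors are made up.
def correct_jet_pt(raw_pt, levels):
    pt = raw_pt
    for level in levels:
        pt *= level(pt)
    return pt

levels = [
    lambda pt: 0.95,                        # pileup offset (illustrative)
    lambda pt: 1.08 if pt < 50 else 1.03,   # response vs. pt (illustrative)
    lambda pt: 1.002,                       # residual data/MC (illustrative)
]
print(correct_jet_pt(30.0, levels))  # corrected pt in GeV
```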
The CMS experiment at the CERN LHC
S. Chatrchyan, G. Hmayakyan, V. Khachatryan
et al.
Ubiquity and dominance of oxygenated species in organic aerosols in anthropogenically‐influenced Northern Hemisphere midlatitudes
Q. Zhang, J. Jimenez, M. Canagaratna
et al.
2027 citations
Biology, Geology
The ALICE Collaboration
M. Aggarwal, Z. Ahammed, J. Amonett
et al.
Validation of the NPI-Q, a brief clinical form of the Neuropsychiatric Inventory.
Daniel I. Kaufer, Jeffrey L. Cummings
et al.
1837 citations
Psychology, Medicine
The CMS trigger system
U. Bhawandeep, V. Khachatryan, A. Sirunyan
et al.
The CMS trigger system must reduce an input data rate from the LHC bunch-crossing frequency of 40 MHz to a rate which will be written to permanent storage. A detailed study has recently been made of the performance of this system. This paper presents key elements of the results obtained and gives details of a draft “trigger table” for the Level-1 Trigger and the High-Level Trigger selection at a “start-up” luminosity of 2 × 10^33 cm^-2 s^-1. High efficiencies for most physics objects are attainable with a selection that remains inclusive and avoids detailed topological or other requirements on the event.
1196 citations
Physics, Computer Science
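Back-of-envelope arithmetic for the rate reduction the trigger abstract describes; only the 40 MHz input rate comes from the text, and the intermediate and storage rates below are illustrative assumptions rather than the paper's trigger table:

```python
# Two-stage rate reduction: Level-1 Trigger followed by the High-Level Trigger.
input_rate_hz = 40e6    # LHC bunch-crossing frequency (from the abstract)
l1_output_hz = 100e3    # assumed Level-1 accept rate (illustrative)
storage_hz = 150.0      # assumed rate written to permanent storage (illustrative)

print(f"L1 rejection:    {input_rate_hz / l1_output_hz:,.0f}x")
print(f"HLT rejection:   {l1_output_hz / storage_hz:,.0f}x")
print(f"Total rejection: {input_rate_hz / storage_hz:,.0f}x")
```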
Ultrasensitive terahertz sensing with high-Q Fano resonances in metasurfaces
Ranjan Singh, W. Cao, I. Al-Naib
et al.
High quality factor resonances are extremely promising for designing ultra-sensitive refractive index label-free sensors, since they allow intense interaction between electromagnetic waves and the analyte material. Metamaterial and plasmonic sensing have recently attracted a lot of attention due to subwavelength confinement of electromagnetic fields in the resonant structures. However, the excitation of high quality factor resonances in these systems has been a challenge. We excite an order of magnitude higher quality factor resonances in planar terahertz metamaterials that we exploit for ultrasensitive sensing. The low-loss quadrupole and Fano resonances with extremely narrow linewidths enable us to measure the minute spectral shift caused by the smallest change in the refractive index of the surrounding media. We achieve sensitivity levels of 7.75 × 10^3 nm/refractive index unit (RIU) with the quadrupole and 5.7 × 10^4 nm/RIU with the Fano resonances, which could be further enhanced by using thinner substrates. These findings would facilitate the design of ultrasensitive real-time chemical and biomolecular sensors in the fingerprint region of the terahertz regime.
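A worked example of how the reported sensitivities translate into measurable spectral shifts (shift = sensitivity × refractive-index change); the Δn value is an illustrative choice:

```python
# Spectral shift per refractive-index change, using the abstract's numbers.
quad_sensitivity = 7.75e3   # nm per RIU, quadrupole resonance
fano_sensitivity = 5.7e4    # nm per RIU, Fano resonance

delta_n = 1e-3  # illustrative change in the surrounding medium's index
print(quad_sensitivity * delta_n)  # ~7.8 nm shift
print(fano_sensitivity * delta_n)  # ~57 nm shift
```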
From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function
Rafael Rafailov, Joey Hejna, Ryan Park
et al.
Reinforcement Learning From Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models. In response to the complex nature of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches. Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the whole response of the model is treated as a single arm. In this work we rectify this difference. We theoretically show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation. Using our theoretical results, we provide three concrete empirical insights. First, we show that because of its token level interpretation, DPO is able to perform some type of credit assignment. Next, we prove that under the token level formulation, classical search-based algorithms, such as MCTS, which have recently been applied to the language generation space, are equivalent to likelihood-based search on a DPO policy. Empirically we show that a simple beam search yields meaningful improvement over the base DPO policy. Finally, we show how the choice of reference policy causes implicit rewards to decline during training. We conclude by discussing applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.
246 citations
Computer Science
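A minimal sketch of the per-token implicit reward that underlies the paper's token-level reading of DPO, namely beta * (log pi - log pi_ref) per token; the log-probability tensors below are hypothetical placeholders, not outputs of the authors' models:

```python
import torch

def token_implicit_rewards(logp_policy, logp_ref, beta=0.1):
    # Positive where the tuned model upweights a token relative to the
    # reference model, negative where it downweights it; this is the
    # quantity behind the credit-assignment claim in the abstract.
    return beta * (logp_policy - logp_ref)

logp_policy = torch.tensor([-1.2, -0.4, -2.0])  # placeholder token log-probs
logp_ref = torch.tensor([-1.5, -0.9, -1.1])
print(token_implicit_rewards(logp_policy, logp_ref))
```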
IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies
Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner
et al.
Effective offline RL methods require properly handling out-of-distribution actions. Implicit Q-learning (IQL) addresses this by training a Q-function using only dataset actions through a modified Bellman backup. However, it is unclear which policy actually attains the values represented by this implicitly trained Q-function. In this paper, we reinterpret IQL as an actor-critic method by generalizing the critic objective and connecting it to a behavior-regularized implicit actor. This generalization shows how the induced actor balances reward maximization and divergence from the behavior policy, with the specific loss choice determining the nature of this tradeoff. Notably, this actor can exhibit complex and multimodal characteristics, suggesting issues with the conditional Gaussian actor fit with advantage weighted regression (AWR) used in prior methods. Instead, we propose using samples from a diffusion-parameterized behavior policy, with weights computed from the critic, to importance sample our intended policy. We introduce Implicit Diffusion Q-learning (IDQL), combining our general IQL critic with this policy extraction method. IDQL maintains the ease of implementation of IQL while outperforming prior offline RL methods and demonstrating robustness to hyperparameters. Code is available at https://github.com/philippe-eecs/IDQL.
253 citations
Computer Science
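A minimal sketch of the extraction step the IDQL abstract describes: draw many candidate actions from the (diffusion) behavior policy, score them with the critic, and resample. sample_behavior and critic are hypothetical callables, not the released code:

```python
import torch

def idql_style_action(state, sample_behavior, critic, n=64, temperature=1.0):
    actions = sample_behavior(state, n)            # (n, action_dim) candidates
    scores = critic(state, actions)                # (n,) critic values
    weights = torch.softmax(scores / temperature, dim=0)
    idx = torch.multinomial(weights, num_samples=1).item()
    return actions[idx]                            # critic-reweighted sample
```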
Development and Validation of the Camouflaging Autistic Traits Questionnaire (CAT-Q)
Laura Hull, W. Mandy, Meng-Chuan Lai
et al.
There currently exist no self-report measures of social camouflaging behaviours (strategies used to compensate for or mask autistic characteristics during social interactions). The Camouflaging Autistic Traits Questionnaire (CAT-Q) was developed from autistic adults’ experiences of camouflaging, and was administered online to 354 autistic and 478 non-autistic adults. Exploratory factor analysis suggested three factors, comprising 25 items in total. Good model fit was demonstrated through confirmatory factor analysis, with measurement invariance analyses demonstrating equivalent factor structures across gender and diagnostic group. Internal consistency (α = 0.94) and preliminary test–retest reliability (r = 0.77) were acceptable. Convergent validity was demonstrated through comparison with measures of autistic traits, wellbeing, anxiety, and depression. The present study provides robust psychometric support for the CAT-Q.
381 citations
Medicine, Psychology
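The reliability figure quoted above (α = 0.94) is Cronbach's alpha; a standard computation of it, sketched on made-up data with 25 items to mirror the questionnaire:

```python
import numpy as np

def cronbach_alpha(scores):
    # scores: (n_respondents, n_items)
    # alpha = k/(k-1) * (1 - sum(item variances) / variance of total score)
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))                 # shared trait (synthetic)
items = latent + 0.5 * rng.normal(size=(200, 25))  # 25 noisy items
print(round(cronbach_alpha(items), 2))             # high alpha, as expected
```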
Cosmology in f(Q) geometry
J. Jiménez, Lavinia Heisenberg, T. Koivisto
et al.
The universal character of the gravitational interaction provided by the equivalence principle motivates a geometrical description of gravity. The standard formulation of General Relativity a la Einstein attributes gravity to the spacetime curvature, to which we have grown accustomed. However, this perception has masked the fact that two alternative, though equivalent, formulations of General Relativity in flat spacetimes exist, where gravity can be fully ascribed either to torsion or to non-metricity. The latter allows a simpler geometrical formulation of General Relativity that is oblivious to the affine spacetime structure. Generalisations along this line permit the generation of teleparallel and symmetric teleparallel theories of gravity with exceptional properties. In this work we explore modified gravity theories based on non-linear extensions of the non-metricity scalar. After presenting some general properties and briefly studying some interesting background cosmologies (including accelerating solutions with relevance for inflation and dark energy), we analyse the behaviour of the cosmological perturbations. Tensor perturbations feature a re-scaling of the corresponding Newton's constant, while vector perturbations do not contribute in the absence of vector sources. In the scalar sector we find two additional propagating modes, hinting that $f(Q)$ theories introduce, at least, two additional degrees of freedom. These scalar modes disappear around maximally symmetric backgrounds because of the appearance of an accidental residual gauge symmetry corresponding to a restricted diffeomorphism. We finally discuss the potential strong coupling problems of these maximally symmetric backgrounds caused by the discontinuity in the number of propagating modes.
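For reference, a schematic form of the action class under discussion; sign and normalization conventions vary across the literature, and General Relativity (in its symmetric teleparallel form) is recovered for f(Q) = Q:

```latex
% Schematic f(Q) action; conventions (signs, factors of kappa) vary by paper.
S = \int \mathrm{d}^4x \, \sqrt{-g}\,
    \left[ \frac{1}{2\kappa}\, f(Q) + \mathcal{L}_m \right],
\qquad
Q = -\,g^{\mu\nu}\left( L^{\alpha}{}_{\beta\mu}\, L^{\beta}{}_{\nu\alpha}
    - L^{\alpha}{}_{\beta\alpha}\, L^{\beta}{}_{\mu\nu} \right),
% where L^{\alpha}{}_{\mu\nu} is the disformation tensor built from the
% non-metricity Q_{\alpha\mu\nu} = \nabla_{\alpha} g_{\mu\nu}.
```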
Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions
Yevgen Chebotar, Q. Vuong, A. Irpan
et al.
In this work, we present a scalable reinforcement learning method for training multi-task policies from large offline datasets that can leverage both human demonstrations and autonomously collected data. Our method uses a Transformer to provide a scalable representation for Q-functions trained via offline temporal difference backups. We therefore refer to the method as Q-Transformer. By discretizing each action dimension and representing the Q-value of each action dimension as separate tokens, we can apply effective high-capacity sequence modeling techniques for Q-learning. We present several design decisions that enable good performance with offline RL training, and show that Q-Transformer outperforms prior offline RL algorithms and imitation learning techniques on a large diverse real-world robotic manipulation task suite. The project's website and videos can be found at https://qtransformer.github.io
144 citations
Computer Science
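A minimal sketch of the per-dimension tokenization the Q-Transformer abstract describes: each continuous action dimension is clipped to its range and mapped to one of a fixed number of bins, yielding one token per dimension for autoregressive Q-prediction. Bin count and ranges are illustrative:

```python
import numpy as np

def discretize_action(action, low, high, n_bins=256):
    # One integer token in [0, n_bins - 1] per action dimension.
    action = np.clip(action, low, high)
    frac = (action - low) / (high - low)
    return np.minimum((frac * n_bins).astype(int), n_bins - 1)

low = np.array([-1.0, -1.0])
high = np.array([1.0, 1.0])
print(discretize_action(np.array([0.25, -0.8]), low, high))
# Tokens are fed to the Transformer in a fixed dimension order, so the
# Q-value of each dimension can condition on the previously chosen ones.
```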
Risk-Sensitive Option Market Making with Arbitrage-Free eSSVI Surfaces: A Constrained RL and Stochastic Control Bridge
Jian'an Zhang
We formulate option market making as a constrained, risk-sensitive control problem that unifies execution, hedging, and arbitrage-free implied-volatility surfaces inside a single learning loop. A fully differentiable eSSVI layer enforces static no-arbitrage conditions (butterfly and calendar) while the policy controls half-spreads, hedge intensity, and structured surface deformations (state-dependent rho-shift and psi-scale). Executions are intensity-driven and respond monotonically to spreads and relative mispricing; tail risk is shaped with a differentiable CVaR objective via the Rockafellar-Uryasev program. We provide theory for (i) grid-consistency and rates for butterfly/calendar surrogates, (ii) a primal-dual grounding of a learnable dual action acting as a state-dependent Lagrange multiplier, (iii) differentiable CVaR estimators with mixed pathwise and likelihood-ratio gradients and epi-convergence to the nonsmooth objective, (iv) an eSSVI wing-growth bound aligned with Lee's moment constraints, and (v) policy-gradient validity under smooth surrogates. In simulation (Heston fallback; ABIDES-ready), the agent attains positive adjusted P&L on most intraday segments while keeping calendar violations at numerical zero and butterfly violations at the numerical floor; ex-post tails remain realistic and can be tuned through the CVaR weight. The five control heads admit clear economic semantics and analytic sensitivities, yielding a white-box learner that unifies pricing consistency and execution control in a reproducible pipeline.
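A minimal SSVI-style total-variance slice and a calendar-spread check, sketched to make the abstract's no-arbitrage constraints concrete; the parameter names (theta, rho, psi) follow the common eSSVI convention and are not necessarily the paper's exact differentiable layer:

```python
import numpy as np

def ssvi_total_variance(k, theta, rho, psi):
    # Total implied variance at log-moneyness k for one maturity slice.
    return 0.5 * theta * (1 + rho * psi * k
                          + np.sqrt((psi * k + rho) ** 2 + 1 - rho ** 2))

k = np.linspace(-1.5, 1.5, 61)
w_near = ssvi_total_variance(k, theta=0.04, rho=-0.3, psi=0.5)
w_far = ssvi_total_variance(k, theta=0.09, rho=-0.3, psi=0.4)

# Calendar no-arbitrage: total variance must be non-decreasing in maturity
# at every log-moneyness (the butterfly condition constrains each slice).
print(bool(np.all(w_far >= w_near)))
```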
Tail-Safe Hedging: Explainable Risk-Sensitive Reinforcement Learning with a White-Box CBF-QP Safety Layer in Arbitrage-Free Markets
Jian'an Zhang
We introduce Tail-Safe, a deployability-oriented framework for derivatives hedging that unifies distributional, risk-sensitive reinforcement learning with a white-box control-barrier-function (CBF) quadratic-program (QP) safety layer tailored to financial constraints. The learning component combines an IQN-based distributional critic with a CVaR objective (IQN-CVaR-PPO) and a Tail-Coverage Controller that regulates quantile sampling through temperature tilting and tail boosting to stabilize small-$\alpha$ estimation. The safety component enforces discrete-time CBF inequalities together with domain-specific constraints (ellipsoidal no-trade bands, box and rate limits, and a sign-consistency gate), solved as a convex QP whose telemetry (active sets, tightness, rate utilization, gate scores, slack, and solver status) forms an auditable trail for governance. We provide guarantees of robust forward invariance of the safe set under bounded model mismatch, a minimal-deviation projection interpretation of the QP, a KL-to-DRO upper bound linking per-state KL regularization to worst-case CVaR, concentration and sample-complexity results for the temperature-tilted CVaR estimator, and a CVaR trust-region improvement inequality under KL limits, together with feasibility persistence under expiry-aware tightening. Empirically, in arbitrage-free, microstructure-aware synthetic markets (SSVI $\to$ Dupire $\to$ VIX with ABIDES/MockLOB execution), Tail-Safe improves left-tail risk without degrading central performance and yields zero hard-constraint violations whenever the QP is feasible with zero slack. Telemetry is mapped to governance dashboards and incident workflows to support explainability and auditability. Limitations include reliance on synthetic data and simplified execution to isolate methodological contributions.
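The Rockafellar-Uryasev program this abstract (and the previous one) invokes has a compact sample form: the CVaR of a loss X at confidence level alpha is the minimum over t of t + E[(X - t)_+]/(1 - alpha), with the minimizing t equal to the alpha-quantile (VaR). A NumPy sketch; note the paper's "small-$\alpha$" convention parameterizes the tail probability instead:

```python
import numpy as np

def cvar(losses, alpha=0.95):
    # Rockafellar-Uryasev: t + E[(X - t)_+] / (1 - alpha), minimized at t = VaR.
    losses = np.asarray(losses, dtype=float)
    t = np.quantile(losses, alpha)
    return t + np.maximum(losses - t, 0.0).mean() / (1.0 - alpha)

rng = np.random.default_rng(0)
print(cvar(rng.standard_normal(100_000)))  # ~2.06 for a standard normal loss
```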
Limit Order Book Dynamics in Matching Markets: Microstructure, Spread, and Execution Slippage
Yao Wu
Conventional models of matching markets assume that monetary transfers can clear markets by compensating for utility differentials. However, empirical patterns show that such transfers often fail to close structural preference gaps. This paper introduces a market microstructure framework that models matching decisions as a limit order book system with rigid bid-ask spreads. Individual preferences are represented by a latent preference state matrix, where the spread between an agent's internal ask price (the unconditional maximum) and the market's best bid (the reachable maximum) creates a structural liquidity constraint. We establish a Threshold Impossibility Theorem showing that linear compensation cannot close these spreads unless it induces a categorical identity shift. A dynamic discrete-choice execution model further demonstrates that matches occur only when the market-to-book ratio crosses a time-decaying liquidity threshold, analogous to order execution under inventory pressure. Numerical experiments validate persistent slippage, regional invariance of preference orderings, and high-tier zero-spread executions. The model provides a unified microstructure explanation for matching failures, compensation inefficiency, and post-match regret in illiquid order-driven environments.
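A minimal sketch of the execution rule the abstract describes: a match executes only once the market-to-book ratio crosses a time-decaying liquidity threshold. The exponential threshold form and the parameter values are illustrative, not the paper's specification:

```python
import math

def executes(market_to_book, t, threshold0=1.5, decay=0.1):
    # The threshold relaxes over time, mirroring the abstract's analogy to
    # order execution under inventory pressure.
    return market_to_book >= threshold0 * math.exp(-decay * t)

for t in (0, 3, 6, 9):
    print(t, executes(market_to_book=1.2, t=t))  # flips to True as t grows
```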