Results for "q-fin.RM"

Showing 20 of ~123603 results · from arXiv, CrossRef, Semantic Scholar

S2 Open Access 2025
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

DeepSeek-AI, Daya Guo, Dejian Yang et al.

General reasoning represents a long-standing and formidable challenge in artificial intelligence (AI). Recent breakthroughs, exemplified by large language models (LLMs) [1,2] and chain-of-thought (CoT) prompting [3], have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent on extensive human-annotated demonstrations and the capabilities of models are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labelled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions and STEM fields, surpassing its counterparts trained through conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically used to guide and enhance the reasoning capabilities of smaller models. A new artificial intelligence model, DeepSeek-R1, is introduced, demonstrating that the reasoning abilities of large language models can be incentivized through pure reinforcement learning, removing the need for human-annotated demonstrations.

5346 citations · en · Medicine, Computer Science
S2 Open Access 2021
Offline Reinforcement Learning with Implicit Q-Learning

Ilya Kostrikov, Ashvin Nair, S. Levine

Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This trade-off is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values. We propose an offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function. Then, we extract the policy via advantage-weighted behavioral cloning. We dub our method implicit Q-learning (IQL). IQL demonstrates the state-of-the-art performance on D4RL, a standard benchmark for offline reinforcement learning. We also demonstrate that IQL achieves strong performance fine-tuning using online interaction after offline initialization.

1343 citations · en · Computer Science
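
To make the mechanism in the abstract above concrete, here is a minimal PyTorch sketch of the two IQL-style losses it describes: expectile regression of the state value toward Q over dataset actions, and advantage-weighted behavioral cloning for policy extraction. The q_net, v_net, and policy_log_prob callables and the hyperparameter values are illustrative assumptions, not the paper's exact implementation.

import torch

def expectile_loss(diff, tau=0.7):
    # Asymmetric squared loss: weight tau when diff > 0 and (1 - tau) otherwise,
    # so tau > 0.5 pulls the estimate toward an upper expectile of the target.
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def iql_value_loss(q_net, v_net, states, actions, tau=0.7):
    # V(s) regresses toward Q(s, a) using only dataset actions, so no
    # out-of-dataset action is ever queried.
    with torch.no_grad():
        target_q = q_net(states, actions)
    return expectile_loss(target_q - v_net(states), tau)

def awr_policy_loss(policy_log_prob, q_net, v_net, states, actions, beta=3.0):
    # Policy extraction by advantage-weighted behavioral cloning.
    with torch.no_grad():
        adv = q_net(states, actions) - v_net(states)
        weights = torch.clamp(torch.exp(adv / beta), max=100.0)
    return -(weights * policy_log_prob(states, actions)).mean()
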
S2 Open Access 2020
Conservative Q-Learning for Offline Reinforcement Learning

Aviral Kumar, Aurick Zhou, G. Tucker et al.

Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.

2400 citations · en · Computer Science, Mathematics
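
As a companion to the abstract above, a small sketch of the kind of conservative regularizer it describes, shown for discrete actions: push down a log-sum-exp over all actions while pushing up the dataset action, added on top of the usual Bellman error. The q_net interface and the weight alpha are assumptions for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def cql_loss(q_net, states, actions, td_target, alpha=1.0):
    # q_net(states) -> [batch, num_actions] (hypothetical interface).
    q_all = q_net(states)
    q_data = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    bellman_error = F.mse_loss(q_data, td_target)
    # Conservative term: push down a soft maximum over all actions while
    # pushing up the dataset action, so the learned Q tends to lower-bound the true value.
    conservative_term = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return bellman_error + alpha * conservative_term
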
S2 Open Access 2019
Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Aviral Kumar, Justin Fu, G. Tucker et al.

Off-policy reinforcement learning aims to leverage experience collected from prior policies for sample-efficient learning. However, in practice, commonly used off-policy approximate dynamic programming methods based on Q-learning and actor-critic methods are highly sensitive to the data distribution, and can make only limited progress without collecting additional on-policy data. As a step towards more robust off-policy algorithms, we study the setting where the off-policy experience is fixed and there is no further interaction with the environment. We identify bootstrapping error as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator. We theoretically analyze bootstrapping error, and demonstrate how carefully constraining action selection in the backup can mitigate it. Based on our analysis, we propose a practical algorithm, bootstrapping error accumulation reduction (BEAR). We demonstrate that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.

1241 citations · en · Computer Science, Mathematics
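
The abstract attributes instability to bootstrapping from actions outside the training distribution and proposes constraining action selection in the backup. One common way to express such a support constraint is a kernel MMD between policy samples and dataset actions; the sketch below shows only that ingredient (kernel width and usage are assumptions), not the full BEAR algorithm.

import torch

def gaussian_kernel(x, y, sigma=0.2):
    # x: [n, d], y: [m, d] -> pairwise Gaussian kernel matrix [n, m].
    diff = x.unsqueeze(1) - y.unsqueeze(0)
    return torch.exp(-diff.pow(2).sum(-1) / (2.0 * sigma ** 2))

def mmd_squared(policy_actions, data_actions, sigma=0.2):
    # Empirical squared MMD between actions proposed by the learned policy and
    # actions observed in the dataset; keeping it small keeps the Bellman backup
    # bootstrapping from (approximately) in-support actions.
    k_pp = gaussian_kernel(policy_actions, policy_actions, sigma).mean()
    k_dd = gaussian_kernel(data_actions, data_actions, sigma).mean()
    k_pd = gaussian_kernel(policy_actions, data_actions, sigma).mean()
    return k_pp + k_dd - 2.0 * k_pd
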
S2 Open Access 2019
Multi-messenger Observations of a Binary Neutron Star Merger

B. Abbott, R. Abbott, T. Abbott et al.

On 2017 August 17 a binary neutron star coalescence candidate (later designated GW170817) with merger time 12:41:04 UTC was observed through gravitational waves by the Advanced LIGO and Advanced Virgo detectors. The Fermi Gamma-ray Burst Monitor independently detected a gamma-ray burst (GRB 170817A) with a time delay of ∼1.7 s with respect to the merger time. From the gravitational-wave signal, the source was initially localized to a sky region of 31 deg² at a luminosity distance of 40 +8/−8 Mpc and with component masses consistent with neutron stars. The component masses were later measured to be in the range 0.86 to 2.26 M☉. An extensive observing campaign was launched across the electromagnetic spectrum leading to the discovery of a bright optical transient (SSS17a, now with the IAU identification of AT 2017gfo) in NGC 4993 (at ∼40 Mpc) less than 11 hours after the merger by the One-Meter, Two Hemisphere (1M2H) team using the 1 m Swope Telescope. The optical transient was independently detected by multiple teams within an hour. Subsequent observations targeted the object and its environment. Early ultraviolet observations revealed a blue transient that faded within 48 hours. Optical and infrared observations showed a redward evolution over ∼10 days. Following early non-detections, X-ray and radio emission were discovered at the transient's position ∼9 and ∼16 days, respectively, after the merger. Both the X-ray and radio emission likely arise from a physical process that is distinct from the one that generates the UV/optical/near-infrared emission. No ultra-high-energy gamma-rays and no neutrino candidates consistent with the source were found in follow-up searches. These observations support the hypothesis that GW170817 was produced by the merger of two neutron stars in NGC 4993 followed by a short gamma-ray burst (GRB 170817A) and a kilonova/macronova powered by the radioactive decay of r-process nuclei synthesized in the ejecta.

3078 citations · en · Physics, Political Science
S2 Open Access 2019
Drivers of improved PM2.5 air quality in China from 2013 to 2017

Qiang Zhang, Yixuan Zheng, D. Tong et al.

Significance: The high frequency of haze pollution in China has attracted broad attention and triggered, in 2013, the promulgation of the toughest-ever clean air policy in the country. In this study, we quantified the air quality and health benefits from specific clean air actions by combining a chemical transport model with a detailed emission inventory. As tremendous efforts and resources are needed for mitigating emissions from various sources, evaluation of the effectiveness of these measures can provide crucial information for developing air quality policies in China as well as in other developing and highly polluting countries. Based on measure-specific analysis, our results bear out several important implications for designing future clean air policies. From 2013 to 2017, with the implementation of the toughest-ever clean air policy in China, significant declines in fine particle (PM2.5) concentrations occurred nationwide. Here we estimate the drivers of the improved PM2.5 air quality and the associated health benefits in China from 2013 to 2017 based on a measure-specific integrated evaluation approach, which combines a bottom-up emission inventory, a chemical transport model, and epidemiological exposure-response functions. The estimated national population-weighted annual mean PM2.5 concentrations decreased from 61.8 (95% CI: 53.3–70.0) to 42.0 µg/m³ (95% CI: 35.7–48.6) in 5 y, with dominant contributions from anthropogenic emission abatements. Although interannual meteorological variations could significantly alter PM2.5 concentrations, the corresponding effects on the 5-y trends were relatively small. The measure-by-measure evaluation indicated that strengthening industrial emission standards (power plants and emission-intensive industrial sectors), upgrades on industrial boilers, phasing out outdated industrial capacities, and promoting clean fuels in the residential sector were major effective measures in reducing PM2.5 pollution and health burdens. These measures were estimated to contribute to 6.6- (95% CI: 5.9–7.1), 4.4- (95% CI: 3.8–4.9), 2.8- (95% CI: 2.5–3.0), and 2.2- (95% CI: 2.0–2.5) µg/m³ declines in the national PM2.5 concentration in 2017, respectively, and further reduced PM2.5-attributable excess deaths by 0.37 million (95% CI: 0.35–0.39), or 92% of the total avoided deaths. Our study confirms the effectiveness of China's recent clean air actions, and the measure-by-measure evaluation provides insights into future clean air policy making in China and in other developing and polluting countries.

1686 citations · en · Environmental Science, Medicine
S2 Open Access 2017
Deep Q-learning From Demonstrations

Todd Hester, Matej Vecerík, O. Pietquin et al.

Deep reinforcement learning (RL) has achieved several high profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance. In fact, their performance during learning can be extremely poor. This may be acceptable for a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages small sets of demonstration data to massively accelerate the learning process even from relatively small amounts of demonstration data and is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism. DQfD works by combining temporal difference updates with supervised classification of the demonstrator’s actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN) as it starts with better scores on the first million steps on 41 of 42 games and on average it takes PDD DQN 83 million steps to catch up to DQfD’s performance. DQfD learns to out-perform the best demonstration given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results for 11 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN.

1237 citations · en · Computer Science
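
A minimal sketch of the loss combination the abstract describes: a TD term plus a supervised large-margin term that keeps the demonstrator's action highest-valued. Tensor shapes, the margin value, and the weighting below are illustrative assumptions; the full method also uses n-step returns, L2 regularization, and prioritized replay.

import torch
import torch.nn.functional as F

def large_margin_loss(q_values, expert_actions, margin=0.8):
    # max_a [Q(s, a) + l(a, a_E)] - Q(s, a_E), with a margin of 0 on the
    # expert action and a positive margin elsewhere.
    margins = torch.full_like(q_values, margin)
    margins.scatter_(1, expert_actions.unsqueeze(1), 0.0)
    q_expert = q_values.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    return ((q_values + margins).max(dim=1).values - q_expert).mean()

def demo_loss(q_values, expert_actions, td_target, lambda_sup=1.0):
    # One-step TD error on demonstration transitions plus the supervised term.
    q_expert = q_values.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_expert, td_target) + lambda_sup * large_margin_loss(q_values, expert_actions)
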
S2 Open Access 2016
SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation

Wei Shen, Shuai Le, Yan Li et al.

FASTA and FASTQ are basic and ubiquitous formats for storing nucleotide and protein sequences. Common manipulations of FASTA/Q files include converting, searching, filtering, deduplication, splitting, shuffling, and sampling. Existing tools only implement some of these manipulations, and not particularly efficiently, and some are only available for certain operating systems. Furthermore, the complicated installation process of required packages and running environments can render these programs less user friendly. This paper describes a cross-platform ultrafast comprehensive toolkit for FASTA/Q processing. SeqKit provides executable binary files for all major operating systems, including Windows, Linux, and Mac OSX, and can be directly used without any dependencies or pre-configurations. SeqKit demonstrates competitive performance in execution time and memory usage compared to similar tools. The efficiency and usability of SeqKit enable researchers to rapidly accomplish common FASTA/Q file manipulations. SeqKit is open source and available on GitHub at https://github.com/shenwei356/seqkit.

2598 citations · en · Computer Science, Medicine
S2 Open Access 2016
Continuous Deep Q-Learning with Model-based Acceleration

S. Gu, T. Lillicrap, I. Sutskever et al.

Model-free reinforcement learning has been successfully applied to a range of challenging problems, and has recently been extended to handle large neural network policies and value functions. However, the sample complexity of model-free algorithms, particularly when using high-dimensional function approximators, tends to limit their applicability to physical systems. In this paper, we explore algorithms and representations to reduce the sample complexity of deep reinforcement learning for continuous control tasks. We propose two complementary techniques for improving the efficiency of such algorithms. First, we derive a continuous variant of the Q-learning algorithm, which we call normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods. NAF representation allows us to apply Q-learning with experience replay to continuous tasks, and substantially improves performance on a set of simulated robotic control tasks. To further improve the efficiency of our approach, we explore the use of learned models for accelerating model-free reinforcement learning. We show that iteratively refitted local linear models are especially effective for this, and demonstrate substantially faster learning on domains where such models are applicable.

1063 citations · en · Computer Science
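
The continuous Q-learning variant described above hinges on restricting the advantage to a quadratic in the action, so the greedy action is simply the network's mean output. A compact sketch under that assumption (the shapes and the Cholesky-style parameterization of P are stated here as assumptions, not the paper's exact network layout):

import torch

def naf_q_value(v, mu, L, action):
    # Q(s, a) = V(s) - 0.5 * (a - mu(s))^T P(s) (a - mu(s)), with P = L L^T
    # positive semi-definite, so argmax_a Q(s, a) = mu(s) in closed form.
    # v: [batch], mu: [batch, d], L: [batch, d, d] lower-triangular, action: [batch, d].
    diff = (action - mu).unsqueeze(-1)
    P = L @ L.transpose(-1, -2)
    advantage = -0.5 * (diff.transpose(-1, -2) @ P @ diff).squeeze(-1).squeeze(-1)
    return v + advantage
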
S2 Open Access 2018
Is Q-learning Provably Efficient?

Chi Jin, Zeyuan Allen-Zhu, Sébastien Bubeck et al.

Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that model-free algorithms may require more samples to learn [Deisenroth and Rasmussen 2011, Schulman et al. 2015]. The theoretical question of "whether model-free algorithms can be made sample efficient" is one of the most fundamental questions in RL, and remains unsolved even in the basic scenario with finitely many states and actions. We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret $\tilde{O}(\sqrt{H^3 SAT})$, where $S$ and $A$ are the numbers of states and actions, $H$ is the number of steps per episode, and $T$ is the total number of steps. This sample efficiency matches the optimal regret that can be achieved by any model-based approach, up to a single $\sqrt{H}$ factor. To the best of our knowledge, this is the first analysis in the model-free setting that establishes $\sqrt{T}$ regret without requiring access to a "simulator."

895 citations · en · Computer Science, Mathematics
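
A rough tabular sketch of the update style the abstract analyzes: ordinary episodic Q-learning with a step size of order (H+1)/(H+t) and an optimism bonus shrinking like sqrt(H^3/t). The constant c and the log factor iota below are illustrative placeholders, not the paper's exact choices.

import numpy as np

def ucb_q_update(Q, V, N, h, s, a, r, s_next, H, c=1.0, iota=1.0):
    # Q, V: arrays of shape [H + 1, S, A] and [H + 1, S]; N: visit counts [H, S, A].
    N[h, s, a] += 1
    t = N[h, s, a]
    alpha = (H + 1) / (H + t)                 # step size used in episodic analyses
    bonus = c * np.sqrt(H ** 3 * iota / t)    # optimism bonus shrinking like sqrt(1 / t)
    target = r + V[h + 1, s_next] + bonus
    Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
    V[h, s] = min(H, Q[h, s].max())           # values stay clipped at H
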
S2 Open Access 2019
A Theoretical Analysis of Deep Q-Learning

Zhuoran Yang, Yuchen Xie, Zhaoran Wang

Despite the great empirical success of deep reinforcement learning, its theoretical foundation is less well understood. In this work, we make the first attempt to theoretically understand the deep Q-network (DQN) algorithm (Mnih et al., 2015) from both algorithmic and statistical perspectives. In specific, we focus on a slight simplification of DQN that fully captures its key features. Under mild assumptions, we establish the algorithmic and statistical rates of convergence for the action-value functions of the iterative policy sequence obtained by DQN. In particular, the statistical error characterizes the bias and variance that arise from approximating the action-value function using deep neural network, while the algorithmic error converges to zero at a geometric rate. As a byproduct, our analysis provides justifications for the techniques of experience replay and target network, which are crucial to the empirical success of DQN. Furthermore, as a simple extension of DQN, we propose the Minimax-DQN algorithm for zero-sum Markov game with two players. Borrowing the analysis of DQN, we also quantify the difference between the policies obtained by Minimax-DQN and the Nash equilibrium of the Markov game in terms of both the algorithmic and statistical rates of convergence.

736 citations · en · Computer Science, Mathematics
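
Since the analysis above centers on the roles of experience replay and the target network, here is the standard frozen-target Bellman backup those techniques stabilize, as a generic DQN-style sketch rather than the paper's simplified variant:

import torch

def dqn_target(target_net, rewards, next_states, dones, gamma=0.99):
    # Bellman target computed with a frozen copy of the Q-network; decoupling
    # the target from the online network is one of the two stabilizers the
    # analysis examines (the other being experience replay).
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q

# Periodically refresh the frozen network: copy the online network's weights
# into target_net (hard update) or use a Polyak average (soft update).
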
S2 Open Access 2019
Quantization

Yun Q. Shi, Huifang Sun

Let $\{f_i\}_{i=1}^N$ be a set of equi-contractive similitudes on $\mathbb{R}^1$ satisfying the finite-type condition. We study the asymptotic quantization error for the self-similar measures $\mu$ associated with $\{f_i\}_{i=1}^N$ and a positive probability vector. With a verifiable assumption, we prove that the upper and lower quantization coefficients for $\mu$ are both bounded away from zero and infinity. This can be regarded as an extension of Graf and Luschgy's result on self-similar measures with the open set condition. Our result is applicable to a significant class of self-similar measures with overlaps, including the Erdős measure, the 3-fold convolution of the classical Cantor measure, and the self-similar measures on some λ-Cantor sets.

S2 Open Access 2019
Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

Sheng Shen, Zhen Dong, Jiayu Ye et al.

Transformer based architectures have become de-facto models used for a range of Natural Language Processing tasks. In particular, the BERT based models achieved significant accuracy gain for GLUE tasks, CoNLL-03 and SQuAD. However, BERT based models have a prohibitive memory footprint and latency. As a result, deploying BERT based models in resource constrained environments has become a challenging task. In this work, we perform an extensive analysis of fine-tuned BERT models using second order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra low precision. In particular, we propose a new group-wise quantization scheme, and we use a Hessian-based mixed-precision method to compress the model further. We extensively test our proposed method on BERT downstream tasks of SST-2, MNLI, CoNLL-03, and SQuAD. We can achieve comparable performance to baseline with at most 2.3% performance degradation, even with ultra-low precision quantization down to 2 bits, corresponding to up to 13× compression of the model parameters, and up to 4× compression of the embedding table as well as activations. Among all tasks, we observed the highest performance loss for BERT fine-tuned on SQuAD. By probing into the Hessian based analysis as well as visualization, we show that this is related to the fact that the current training/fine-tuning strategy of BERT does not converge for SQuAD.

649 citations · en · Computer Science
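
A toy sketch of group-wise symmetric uniform quantization, the scheme family the abstract builds on: each group of weights gets its own scale so a few large entries do not dominate the rounding error. The group count, bit width, and reshape layout are illustrative assumptions rather than the paper's configuration.

import torch

def groupwise_quantize(weight, bits=2, num_groups=128):
    # Split the weight matrix into groups, each with its own symmetric scale.
    # Assumes weight.numel() is divisible by num_groups.
    out_features, in_features = weight.shape
    w = weight.reshape(num_groups, -1)
    qmax = 2 ** (bits - 1) - 1
    scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp_min(1e-8)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return (q * scale).reshape(out_features, in_features), scale
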
S2 Open Access 2020
QPLEX: Duplex Dueling Multi-Agent Q-Learning

Jianhao Wang, Zhizhou Ren, Terry Liu et al.

We explore value-based multi-agent reinforcement learning (MARL) in the popular paradigm of centralized training with decentralized execution (CTDE). CTDE has an important concept, Individual-Global-Max (IGM) principle, which requires the consistency between joint and local action selections to support efficient local decision-making. However, in order to achieve scalability, existing MARL methods either limit representation expressiveness of their value function classes or relax the IGM consistency, which may suffer from instability risk or lead to poor performance. This paper presents a novel MARL approach, called duPLEX dueling multi-agent Q-learning (QPLEX), which takes a duplex dueling network architecture to factorize the joint value function. This duplex dueling structure encodes the IGM principle into the neural network architecture and thus enables efficient value function learning. Theoretical analysis shows that QPLEX achieves a complete IGM function class. Empirical experiments on StarCraft II micromanagement tasks demonstrate that QPLEX significantly outperforms state-of-the-art baselines in both online and offline data collection settings, and also reveal that QPLEX achieves high sample efficiency and can benefit from offline datasets without additional online exploration.

582 citations · en · Computer Science, Mathematics
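
The abstract's key requirement is the Individual-Global-Max (IGM) principle: the greedy local actions must jointly maximize the joint value. A tiny NumPy check of that property, with a comment on a simplified monotone factorization that satisfies it by construction (not the paper's exact duplex dueling architecture):

import numpy as np

def igm_consistent(joint_q, local_qs):
    # joint_q: array with one axis per agent; local_qs: list of 1-D arrays.
    # IGM holds when the greedy joint action equals the tuple of per-agent
    # greedy local actions.
    joint_greedy = np.unravel_index(np.argmax(joint_q), joint_q.shape)
    local_greedy = tuple(int(np.argmax(q)) for q in local_qs)
    return tuple(int(i) for i in joint_greedy) == local_greedy

# A simplified factorization that satisfies IGM by construction:
# Q_tot(a) = V_tot + sum_i lambda_i * A_i(a_i) with lambda_i > 0, so each agent
# maximizing its own advantage also maximizes the joint value.
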
S2 Open Access 2023
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

Haoning Wu, Zicheng Zhang, Weixia Zhang et al.

The explosion of visual content available online underscores the requirement for an accurate machine assessor to robustly evaluate scores across diverse types of visual contents. While recent studies have demonstrated the exceptional potentials of large multi-modality models (LMMs) on a wide range of related fields, in this work, we explore how to teach them for visual rating aligned with human opinions. Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), as well as video quality assessment (VQA) tasks under the original LMM structure. With the syllabus, we further unify the three tasks into one model, termed the OneAlign. In our experiments, we demonstrate the advantage of the discrete-level-based syllabus over direct-score-based variants for LMMs. Our code and the pre-trained weights are released at https://github.com/Q-Future/Q-Align.

465 citations · en · Computer Science
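
A minimal sketch of turning discrete text-defined levels into a scalar score, in the spirit of the abstract above: softmax the model's logits over the level tokens and take the probability-weighted mean of their numeric values. The five level names and values here are illustrative assumptions about the rating vocabulary.

import torch

LEVELS = {"bad": 1.0, "poor": 2.0, "fair": 3.0, "good": 4.0, "excellent": 5.0}

def levels_to_score(level_logits):
    # level_logits: [batch, num_levels] logits over the level tokens, ordered as
    # in LEVELS; the score is the probability-weighted mean of the level values.
    values = torch.tensor(list(LEVELS.values()), dtype=level_logits.dtype,
                          device=level_logits.device)
    probs = torch.softmax(level_logits, dim=-1)
    return probs @ values
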
S2 Open Access 2023
IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner et al.

Effective offline RL methods require properly handling out-of-distribution actions. Implicit Q-learning (IQL) addresses this by training a Q-function using only dataset actions through a modified Bellman backup. However, it is unclear which policy actually attains the values represented by this implicitly trained Q-function. In this paper, we reinterpret IQL as an actor-critic method by generalizing the critic objective and connecting it to a behavior-regularized implicit actor. This generalization shows how the induced actor balances reward maximization and divergence from the behavior policy, with the specific loss choice determining the nature of this tradeoff. Notably, this actor can exhibit complex and multimodal characteristics, suggesting issues with the conditional Gaussian actor fit with advantage weighted regression (AWR) used in prior methods. Instead, we propose using samples from a diffusion-parameterized behavior policy and weights computed from the critic to then importance-sample our intended policy. We introduce Implicit Diffusion Q-learning (IDQL), combining our general IQL critic with the policy extraction method. IDQL maintains the ease of implementation of IQL while outperforming prior offline RL methods and demonstrating robustness to hyperparameters. Code is available at https://github.com/philippe-eecs/IDQL.

253 citations · en · Computer Science
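
A hedged sketch of the policy-extraction idea in the abstract: draw candidate actions from a behavior model (e.g. a diffusion sampler), weight them by the implicit critic's advantages, and resample. The sample_behavior_actions, q_net, and v_net interfaces and the softmax weighting are assumptions for illustration, not the paper's exact scheme.

import torch

def select_action(sample_behavior_actions, q_net, v_net, state,
                  num_candidates=32, temperature=1.0):
    # Draw candidates from the behavior model, score them with the critic,
    # and resample one action with advantage-derived weights.
    actions = sample_behavior_actions(state, num_candidates)      # [num_candidates, act_dim]
    states = state.unsqueeze(0).expand(num_candidates, -1)
    with torch.no_grad():
        adv = q_net(states, actions) - v_net(states)              # [num_candidates]
        weights = torch.softmax(adv / temperature, dim=0)
    idx = torch.multinomial(weights, num_samples=1).item()
    return actions[idx]
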
S2 Open Access 2023
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision

Haoning Wu, Zicheng Zhang, Erli Zhang et al.

The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift in computer vision from specialized models to general-purpose foundation models. Nevertheless, there is still an inadequacy in assessing the abilities of MLLMs on low-level visual perception and understanding. To address this gap, we present Q-Bench, a holistic benchmark crafted to systematically evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment. a) To evaluate the low-level perception ability, we construct the LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped with a human-asked question focusing on its low-level attributes. We then measure the correctness of MLLMs on answering these questions. b) To examine the description ability of MLLMs on low-level information, we propose the LLDescribe dataset consisting of long expert-labelled golden low-level text descriptions on 499 images, and a GPT-involved comparison pipeline between outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we further measure their visual quality assessment ability to align with human opinion scores. Specifically, we design a softmax-based strategy that enables MLLMs to predict quantifiable quality scores, and evaluate them on various existing image quality assessment (IQA) datasets. Our evaluation across the three abilities confirms that MLLMs possess preliminary low-level visual skills. However, these skills are still unstable and relatively imprecise, indicating the need for specific enhancements on MLLMs towards these abilities. We hope that our benchmark can encourage the research community to delve deeper to discover and enhance these untapped potentials of MLLMs. Project Page: https://q-future.github.io/Q-Bench.

251 citations · en · Computer Science
arXiv Open Access 2024
A note on robust convex risk measures

Marcelo Righi, Fernanda Müller

In this paper, we refine and generalize closed forms for worst-case law invariant convex risk measures with uncertainty sets based on: i) closed balls under $p$-norms and Wasserstein distance; and ii) moment constraints involving mean and variance. We also characterize the argmax of the worst-case problem in both settings. From such general results, we illustrate our framework by developing explicit closed forms for concrete examples of convex risk measures. Furthermore, we use extensive numerical simulations in order to assess the impact of robustness on capital determination and portfolio optimization.

en · q-fin.RM, q-fin.MF
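
To give a feel for what a worst-case risk measure over a Wasserstein ball means in practice, here is a toy NumPy illustration, not the paper's general closed forms: for an empirical sample, spending the entire Wasserstein-1 budget on the tail points raises the empirical expected shortfall by roughly eps / (1 - alpha).

import numpy as np

def expected_shortfall(losses, alpha=0.95):
    # Empirical expected shortfall: mean of the worst (1 - alpha) fraction of losses.
    k = max(1, int(np.ceil((1 - alpha) * len(losses))))
    return float(np.sort(losses)[-k:].mean())

rng = np.random.default_rng(0)
losses = rng.normal(0.0, 1.0, size=100_000)
alpha, eps = 0.95, 0.1

# Spend the whole Wasserstein-1 budget eps on the tail: shifting the worst
# k = (1 - alpha) * n points by eps * n / k has transport cost exactly eps and
# raises the empirical ES by about eps / (1 - alpha).
k = int(np.ceil((1 - alpha) * len(losses)))
worst = np.sort(losses)
worst[-k:] += eps * len(losses) / k
print(expected_shortfall(losses, alpha), expected_shortfall(worst, alpha))
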

Page 4 of 6181