Results for "q-bio.TO"

Showing 20 of ~1352604 results · from Semantic Scholar

S2 Open Access 2025
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

DeepSeek-AI, Daya Guo, Dejian Yang et al.

General reasoning represents a long-standing and formidable challenge in artificial intelligence (AI). Recent breakthroughs, exemplified by large language models (LLMs) and chain-of-thought (CoT) prompting, have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent on extensive human-annotated demonstrations and the capabilities of models are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labelled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions and STEM fields, surpassing its counterparts trained through conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically used to guide and enhance the reasoning capabilities of smaller models. A new artificial intelligence model, DeepSeek-R1, is introduced, demonstrating that the reasoning abilities of large language models can be incentivized through pure reinforcement learning, removing the need for human-annotated demonstrations.

5358 citations en Medicine, Computer Science
S2 Open Access 2023
GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland et al.

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.

2260 citations en Computer Science
S2 Open Access 2021
Offline Reinforcement Learning with Implicit Q-Learning

Ilya Kostrikov, Ashvin Nair, S. Levine

Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This trade-off is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values. We propose an offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function. Then, we extract the policy via advantage-weighted behavioral cloning. We dub our method implicit Q-learning (IQL). IQL demonstrates the state-of-the-art performance on D4RL, a standard benchmark for offline reinforcement learning. We also demonstrate that IQL achieves strong performance fine-tuning using online interaction after offline initialization.
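The expectile trick the abstract describes can be made concrete with a minimal numpy sketch: an asymmetric squared loss whose parameter τ interpolates between the mean (τ = 0.5) and an upper expectile approaching the maximum (τ → 1). This is only an illustration of the statistical estimator on scalar samples, not the full IQL algorithm with function approximation; the function names are mine.

```python
import numpy as np

def fit_expectile(samples, tau=0.9, lr=0.05, steps=2000):
    """Fit a scalar v minimizing the asymmetric L2 (expectile) loss
    E[|tau - 1(x - v < 0)| * (x - v)^2] by gradient descent.
    tau = 0.5 recovers the mean; tau near 1 approaches the maximum,
    which is how IQL estimates the value of the best in-support action."""
    v = float(np.mean(samples))
    for _ in range(steps):
        diff = samples - v
        # gradient of the expectile loss with respect to v
        grad = -2.0 * np.mean(np.where(diff > 0, tau, 1 - tau) * diff)
        v -= lr * grad
    return v

q_samples = np.array([0.0, 1.0, 2.0, 10.0])
v_mean = fit_expectile(q_samples, tau=0.5)   # recovers the mean, 3.25
v_upper = fit_expectile(q_samples, tau=0.9)  # upper expectile, between mean and max
```

Note that the upper expectile stays below the sample maximum, which is why IQL avoids the over-optimism of a naive max over unseen actions.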

1349 citations en Computer Science
S2 Open Access 2020
Conservative Q-Learning for Offline Reinforcement Learning

Aviral Kumar, Aurick Zhou, G. Tucker et al.

Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.
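The "simple Q-value regularizer" mentioned above can be sketched for a single discrete-action state: push down a soft maximum (log-sum-exp) of Q over all actions while pushing up the Q-value of the action actually in the dataset. This is a toy, single-state illustration of the CQL(H) penalty with names of my choosing; in the full algorithm the term is scaled by a coefficient α and added to the standard Bellman error.

```python
import numpy as np

def cql_penalty(q_values, data_action):
    """CQL(H)-style regularizer for one state: log-sum-exp of Q over all
    actions minus Q of the dataset action. Minimizing it suppresses
    overestimated Q-values at out-of-distribution actions while preserving
    the value of in-distribution ones."""
    m = q_values.max()
    logsumexp = np.log(np.sum(np.exp(q_values - m))) + m  # numerically stable
    return logsumexp - q_values[data_action]

q = np.array([1.0, 5.0, 2.0])  # action 1 has the highest estimated value
penalty_if_data_action_is_0 = cql_penalty(q, 0)  # large: Q overestimates OOD actions
penalty_if_data_action_is_1 = cql_penalty(q, 1)  # small: the max is in-distribution
```

The penalty is always non-negative (log-sum-exp upper-bounds the max), and it is largest exactly when the Q-function assigns its highest values to actions absent from the data.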

2406 citations en Computer Science, Mathematics
S2 Open Access 2019
Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Aviral Kumar, Justin Fu, G. Tucker et al.

Off-policy reinforcement learning aims to leverage experience collected from prior policies for sample-efficient learning. However, in practice, commonly used off-policy approximate dynamic programming methods based on Q-learning and actor-critic methods are highly sensitive to the data distribution, and can make only limited progress without collecting additional on-policy data. As a step towards more robust off-policy algorithms, we study the setting where the off-policy experience is fixed and there is no further interaction with the environment. We identify bootstrapping error as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator. We theoretically analyze bootstrapping error, and demonstrate how carefully constraining action selection in the backup can mitigate it. Based on our analysis, we propose a practical algorithm, bootstrapping error accumulation reduction (BEAR). We demonstrate that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.
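The mechanism the abstract identifies, bootstrapping error propagating through the Bellman backup from out-of-distribution actions, can be illustrated with a toy constrained backup. This is only a cartoon of the idea: BEAR's actual algorithm constrains the learned policy to the data support via an MMD penalty rather than masking a discrete max.

```python
import numpy as np

def constrained_backup(rewards, next_q, support_mask, gamma=0.99):
    """Toy Bellman backup that maximizes next-state Q only over actions seen
    in the dataset (support_mask=True), so spuriously high Q estimates for
    unseen actions never enter the bootstrap target."""
    masked = np.where(support_mask, next_q, -np.inf)
    return rewards + gamma * masked.max(axis=-1)

# Action 1 is out-of-distribution and its value (9.9) is an overestimate.
next_q = np.array([[0.5, 9.9]])
support = np.array([[True, False]])
rewards = np.array([1.0])
target = constrained_backup(rewards, next_q, support)  # uses 0.5, not 9.9
```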

1244 citations en Computer Science, Mathematics
S2 Open Access 2019
Multi-messenger Observations of a Binary Neutron Star Merger

B. Abbott, R. Abbott, T. Abbott et al.

On 2017 August 17 a binary neutron star coalescence candidate (later designated GW170817) with merger time 12:41:04 UTC was observed through gravitational waves by the Advanced LIGO and Advanced Virgo detectors. The Fermi Gamma-ray Burst Monitor independently detected a gamma-ray burst (GRB 170817A) with a time delay of ∼1.7 s with respect to the merger time. From the gravitational-wave signal, the source was initially localized to a sky region of 31 deg² at a luminosity distance of 40 (+8/−8) Mpc and with component masses consistent with neutron stars. The component masses were later measured to be in the range 0.86 to 2.26 M☉. An extensive observing campaign was launched across the electromagnetic spectrum leading to the discovery of a bright optical transient (SSS17a, now with the IAU identification of AT 2017gfo) in NGC 4993 (at ∼40 Mpc) less than 11 hours after the merger by the One-Meter, Two Hemisphere (1M2H) team using the 1 m Swope Telescope. The optical transient was independently detected by multiple teams within an hour. Subsequent observations targeted the object and its environment. Early ultraviolet observations revealed a blue transient that faded within 48 hours. Optical and infrared observations showed a redward evolution over ∼10 days. Following early non-detections, X-ray and radio emission were discovered at the transient's position ∼9 and ∼16 days, respectively, after the merger. Both the X-ray and radio emission likely arise from a physical process that is distinct from the one that generates the UV/optical/near-infrared emission. No ultra-high-energy gamma-rays and no neutrino candidates consistent with the source were found in follow-up searches. These observations support the hypothesis that GW170817 was produced by the merger of two neutron stars in NGC 4993 followed by a short gamma-ray burst (GRB 170817A) and a kilonova/macronova powered by the radioactive decay of r-process nuclei synthesized in the ejecta.

3082 citations en Physics, Political Science
S2 Open Access 2019
Drivers of improved PM2.5 air quality in China from 2013 to 2017

Qiang Zhang, Yixuan Zheng, D. Tong et al.

Significance: The high frequency of haze pollution in China has attracted broad attention and triggered, in 2013, the promulgation of the toughest-ever clean air policy in the country. In this study, we quantified the air quality and health benefits from specific clean air actions by combining a chemical transport model with a detailed emission inventory. As tremendous efforts and resources are needed for mitigating emissions from various sources, evaluation of the effectiveness of these measures can provide crucial information for developing air quality policies in China as well as in other developing and highly polluting countries. Based on measure-specific analysis, our results bear out several important implications for designing future clean air policies. From 2013 to 2017, with the implementation of the toughest-ever clean air policy in China, significant declines in fine particle (PM2.5) concentrations occurred nationwide. Here we estimate the drivers of the improved PM2.5 air quality and the associated health benefits in China from 2013 to 2017 based on a measure-specific integrated evaluation approach, which combines a bottom-up emission inventory, a chemical transport model, and epidemiological exposure-response functions. The estimated national population-weighted annual mean PM2.5 concentrations decreased from 61.8 (95% CI: 53.3–70.0) to 42.0 µg/m³ (95% CI: 35.7–48.6) in 5 y, with dominant contributions from anthropogenic emission abatements. Although interannual meteorological variations could significantly alter PM2.5 concentrations, the corresponding effects on the 5-y trends were relatively small. The measure-by-measure evaluation indicated that strengthening industrial emission standards (power plants and emission-intensive industrial sectors), upgrades on industrial boilers, phasing out outdated industrial capacities, and promoting clean fuels in the residential sector were major effective measures in reducing PM2.5 pollution and health burdens. These measures were estimated to contribute declines of 6.6 (95% CI: 5.9–7.1), 4.4 (95% CI: 3.8–4.9), 2.8 (95% CI: 2.5–3.0), and 2.2 (95% CI: 2.0–2.5) µg/m³ in the national PM2.5 concentration in 2017, respectively, and further reduced PM2.5-attributable excess deaths by 0.37 million (95% CI: 0.35–0.39), or 92% of the total avoided deaths. Our study confirms the effectiveness of China's recent clean air actions, and the measure-by-measure evaluation provides insights into future clean air policy making in China and in other developing and polluting countries.

1688 citations en Environmental Science, Medicine
S2 Open Access 2023
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

Haoning Wu, Zicheng Zhang, Weixia Zhang et al.

The explosion of visual content available online underscores the requirement for an accurate machine assessor to robustly evaluate scores across diverse types of visual contents. While recent studies have demonstrated the exceptional potential of large multi-modality models (LMMs) on a wide range of related fields, in this work, we explore how to teach them to produce visual ratings aligned with human opinions. Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), as well as video quality assessment (VQA) tasks under the original LMM structure. With the syllabus, we further unify the three tasks into one model, termed the OneAlign. In our experiments, we demonstrate the advantage of the discrete-level-based syllabus over direct-score-based variants for LMMs. Our code and the pre-trained weights are released at https://github.com/Q-Future/Q-Align.
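One way to turn discrete text-defined levels back into a scalar score at inference time is a probability-weighted average over the level tokens. The sketch below assumes five levels mapped to 1–5; the level names and mapping are illustrative assumptions, not necessarily Q-Align's exact inference rule.

```python
import numpy as np

# Assumed mapping of text-defined levels to numeric values (an illustration,
# not taken from the abstract): bad, poor, fair, good, excellent -> 1..5.
LEVELS = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

def logits_to_score(level_logits):
    """Convert model logits over the five rating-level tokens into a scalar
    score via a softmax-weighted average of the level values."""
    p = np.exp(level_logits - level_logits.max())  # numerically stable softmax
    p /= p.sum()
    return float(np.dot(p, LEVELS))

uniform_score = logits_to_score(np.zeros(5))              # 3.0, the midpoint
high_score = logits_to_score(np.array([0, 0, 0, 0, 10.0]))  # close to 5.0
```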

467 citations en Computer Science
S2 Open Access 2019
Future Physics Programme of BESIII

M. Ablikim, M. Achasov, P. Adlarson et al.

There has recently been a dramatic renewal of interest in hadron spectroscopy and charm physics. This renaissance has been driven in part by the discovery of a plethora of charmonium-like XYZ states at BESIII and B factories, and the observation of an intriguing proton-antiproton threshold enhancement and the possibly related X(1835) meson state at BESIII, as well as the threshold measurements of charm mesons and charm baryons. We present a detailed survey of the important topics in tau-charm physics and hadron physics that can be further explored at BESIII during the remaining operation period of BEPCII. This survey will help in the optimization of the data-taking plan over the coming years, and provides physics motivation for the possible upgrade of BEPCII to higher luminosity.

474 citations en Physics
S2 Open Access 2017
Direct detection of a break in the teraelectronvolt cosmic-ray spectrum of electrons and positrons

G. Ambrosi, Q. An, R. Asfandiyarov et al.

High-energy cosmic-ray electrons and positrons (CREs), which lose energy quickly during their propagation, provide a probe of Galactic high-energy processes and may enable the observation of phenomena such as dark-matter particle annihilation or decay. The CRE spectrum has been measured directly up to approximately 2 teraelectronvolts in previous balloon- or space-borne experiments, and indirectly up to approximately 5 teraelectronvolts using ground-based Cherenkov γ-ray telescope arrays. Evidence for a spectral break in the teraelectronvolt energy range has been provided by indirect measurements, although the results were qualified by sizeable systematic uncertainties. Here we report a direct measurement of CREs in the energy range 25 gigaelectronvolts to 4.6 teraelectronvolts by the Dark Matter Particle Explorer (DAMPE) with unprecedentedly high energy resolution and low background. The largest part of the spectrum can be well fitted by a ‘smoothly broken power-law’ model rather than a single power-law model. The direct detection of a spectral break at about 0.9 teraelectronvolts confirms the evidence found by previous indirect measurements, clarifies the behaviour of the CRE spectrum at energies above 1 teraelectronvolt and sheds light on the physical origin of the sub-teraelectronvolt CREs.

516 citations en Physics, Medicine
S2 Open Access 2024
From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function

Rafael Rafailov, Joey Hejna, Ryan Park et al.

Reinforcement Learning From Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models. In response to the complex nature of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches. Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the whole response of the model is treated as a single arm. In this work we rectify this difference. We theoretically show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation. Using our theoretical results, we provide three concrete empirical insights. First, we show that because of its token level interpretation, DPO is able to perform some type of credit assignment. Next, we prove that under the token level formulation, classical search-based algorithms, such as MCTS, which have recently been applied to the language generation space, are equivalent to likelihood-based search on a DPO policy. Empirically we show that a simple beam search yields meaningful improvement over the base DPO policy. Finally, we show how the choice of reference policy causes implicit rewards to decline during training. We conclude by discussing applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.
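The implicit reward the paper analyzes is already visible in the standard DPO objective: the reward of a response is β times the log-ratio of policy to reference probabilities, and the loss is a logistic loss on the reward margin. A minimal numpy sketch, assuming sequence log-probabilities are given (the function name is mine):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair. The implicit reward of a
    response is beta * (policy log-prob - reference log-prob); the loss is
    the negative log-sigmoid of the reward margin between the preferred (w)
    and dispreferred (l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

chance_loss = dpo_loss(0.0, 0.0, 0.0, 0.0)          # no margin: log(2)
better_loss = dpo_loss(-5.0, -20.0, -10.0, -10.0)   # positive margin: lower loss
```

The paper's token-level reading decomposes these sequence log-ratios into per-token terms satisfying a Bellman equation, which is what licenses credit assignment and likelihood-based search on the DPO policy.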

247 citations en Computer Science
S2 Open Access 2023
Q-Diffusion: Quantizing Diffusion Models

Xiuyu Li, Long Lian, Yijia Liu et al.

Diffusion models have achieved great success in image synthesis through iterative noise estimation using deep neural networks. However, the slow inference, high memory consumption, and computation intensity of the noise estimation model hinder the efficient adoption of diffusion models. Although post-training quantization (PTQ) is considered a go-to compression method for other tasks, it does not work out-of-the-box on diffusion models. We propose a novel PTQ method specifically tailored towards the unique multi-timestep pipeline and model architecture of the diffusion models, which compresses the noise estimation network to accelerate the generation process. We identify the key difficulty of diffusion model quantization as the changing output distributions of noise estimation networks over multiple time steps and the bimodal activation distribution of the shortcut layers within the noise estimation network. We tackle these challenges with timestep-aware calibration and split shortcut quantization in this work. Experimental results show that our proposed method is able to quantize full-precision unconditional diffusion models into 4-bit while maintaining comparable performance (small FID change of at most 2.34 compared to >100 for traditional PTQ) in a training-free manner. Our approach can also be applied to text-guided image generation, where we can run stable diffusion in 4-bit weights with high generation quality for the first time.

271 citations en Computer Science
S2 Open Access 2011
Deterministic design of wavelength scale, ultra-high Q photonic crystal nanobeam cavities.

Q. Quan, M. Lončar

Photonic crystal nanobeam cavities are versatile platforms of interest for optical communications, optomechanics, optofluidics, cavity QED, etc. In a previous work [Appl. Phys. Lett. 96, 203102 (2010)], we proposed a deterministic method to achieve ultrahigh Q cavities. This follow-up work provides systematic analysis and verifications of the deterministic design recipe and further extends the discussion to air-mode cavities. We demonstrate designs of dielectric-mode and air-mode cavities with Q > 10⁹, as well as dielectric-mode nanobeam cavities with both ultrahigh-Q (> 10⁷) and ultrahigh on-resonance transmissions (T > 95%).

431 citations en Physics, Medicine
S2 Open Access 2010
Pressure-induced superconductivity in topological parent compound Bi2Te3

J. Zhang, S. J. Zhang, H. Weng et al.

We report a successful observation of pressure-induced superconductivity in a topological compound Bi2Te3 with Tc of ∼3 K between 3 and 6 GPa. The combined high-pressure structure investigations with synchrotron radiation indicated that the superconductivity occurred in the ambient phase without a crystal structure phase transition. Hall effect measurements indicated hole-type carriers in the pressure-induced superconducting Bi2Te3 single crystal. Consequently, first-principles calculations based on the structural data obtained by the Rietveld refinement of X-ray diffraction patterns at high pressure showed that the electronic structure under pressure remained topologically nontrivial. The results suggested that topological superconductivity can be realized in Bi2Te3 due to the proximity effect between superconducting bulk states and Dirac-type surface states. We also discuss the possibility that the bulk state could be a topological superconductor.

269 citations en Materials Science, Medicine

Page 1 of 67631