DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning
DeepSeek-AI, Daya Guo, Dejian Yang
et al.
General reasoning represents a long-standing and formidable challenge in artificial intelligence (AI). Recent breakthroughs, exemplified by large language models (LLMs)1,2 and chain-of-thought (CoT) prompting3, have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent on extensive human-annotated demonstrations and the capabilities of models are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labelled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions and STEM fields, surpassing its counterparts trained through conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically used to guide and enhance the reasoning capabilities of smaller models. A new artificial intelligence model, DeepSeek-R1, is introduced, demonstrating that the reasoning abilities of large language models can be incentivized through pure reinforcement learning, removing the need for human-annotated demonstrations.
5344 citations
en
Medicine, Computer Science
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland
et al.
We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.
2229 citations
en
Computer Science
From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function
Rafael Rafailov, Joey Hejna, Ryan Park
et al.
Reinforcement Learning From Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models. In response to the complex nature of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches. Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the whole response of the model is treated as a single arm. In this work we rectify this difference. We theoretically show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation. Using our theoretical results, we provide three concrete empirical insights. First, we show that because of its token level interpretation, DPO is able to perform some type of credit assignment. Next, we prove that under the token level formulation, classical search-based algorithms, such as MCTS, which have recently been applied to the language generation space, are equivalent to likelihood-based search on a DPO policy. Empirically we show that a simple beam search yields meaningful improvement over the base DPO policy. Finally, we show how the choice of reference policy causes implicit rewards to decline during training. We conclude by discussing applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.
247 citations
en
Computer Science
Membrane Interactions in Alzheimer's Treatment Strategies with Multitarget Molecules
Pablo Zambrano
Addressing Alzheimer's disease (AD) requires innovative strategies beyond current single-target drugs. This Letter to the Editor suggests that multitarget molecules, especially those targeting neuronal membrane protection, could offer a comprehensive approach to AD therapy, advocating for further research into their mechanisms and therapeutic potential.
Applications of Computer Vision in Analysis of the Clock-Drawing Test as a Metric of Cognitive Impairment
Luzhou Zhang
The Clock-Drawing test is a well known and widely used neuropsychological metric to assess basic cognitive function. My objective is to combine methods of machine learning in computer vision and image analysis to predict a subject's level of cognitive impairment.
Graph Theory in Brain Networks
Moo K. Chung
Recent developments in graph theoretic analysis of complex networks have led to a deeper understanding of brain networks. Many complex networks show similar macroscopic behaviors despite differences in the microscopic details. Probably the two most often observed characteristics of complex networks are the scale-free and small-world properties. In this paper, we will explore whether brain networks exhibit scale-free and small-world behavior, among other graph theory properties.
Cross-time functional connectivity analysis
Ze Wang
A large body of literature has shown the substantial inter-regional functional connectivity in the mammal brain. One important property remaining un-studied is the cross-time interareal connection. This paper serves to provide a tool to characterize the cross-time functional connectivity. The method is extended from the temporal embedding based brain temporal coherence analysis. Both synthetic data and in-vivo data were used to evaluate the various properties of the cross-time functional connectivity matrix, which is also called the cross-regional temporal coherence matrix.
How does neural activity encode spontaneous motor behavior in zebrafish larvae?
Selma Mehyaoui
The origins of spontaneous movements have been investigated in humans as well as in other vertebrates. Studies have reported an increase in neuronal activity one second before the onset of a given movement: this is known as the readiness potential. The mechanisms underlying this increase are still unclear. The zebrafish larva is an ideal animal model to study the neuronal basis of spontaneous movements. Because of its small size and transparency, this vertebrate is an ideal candidate for optical recording methods. In order to understand what neuronal activity causes the execution of a specific tail movement at a given time, we will mainly use a prediction approach.
A High Statistics Measurement of the Proton Structure Functions $F_2(x, Q^2)$ and $R$ from Deep Inelastic Muon Scattering at High $Q^2$
A. Benvenuti, D. Bollini, G. Bruni
et al.
First Result from the Alpha Magnetic Spectrometer on the International Space Station: Precision Measurement of the Positron Fraction in Primary Cosmic Rays of 0.5–350 GeV
M. Aguilar, G. Alberti, B. Alpat
et al.
Pressure-induced superconductivity in topological parent compound Bi2Te3
J. Zhang, S. J. Zhang, H. Weng
et al.
We report a successful observation of pressure-induced superconductivity in a topological compound Bi2Te3 with Tc of ∼3 K between 3 and 6 GPa. The combined high-pressure structure investigations with synchrotron radiation indicated that the superconductivity occurred at the ambient phase without crystal structure phase transition. The Hall effect measurements indicated the hole-type carrier in the pressure-induced superconducting Bi2Te3 single crystal. Consequently, the first-principles calculations based on the structural data obtained by the Rietveld refinement of X-ray diffraction patterns at high pressure showed that the electronic structure under pressure remained topologically nontrivial. The results suggested that topological superconductivity can be realized in Bi2Te3 due to the proximity effect between superconducting bulk states and Dirac-type surface states. We also discuss the possibility that the bulk state could be a topological superconductor.
269 citations
en
Materials Science, Medicine
Chronic Q Fever in the Netherlands 5 Years after the Start of the Q Fever Epidemic: Results from the Dutch Chronic Q Fever Database
L. Kampschreur, C. Delsing, R. Groenwold
et al.
Myelin and saltatory conduction
Maurizio De Pittà
Essential tutorial on myelin, oligodendrocytes and their functional relevance in the pathophysiology of the brain.
Glia
Maurizio De Pittà
Essential introduction to glial cells with emphasis on astrocytes, microglia and their interplay in reactive astrogliosis.
Epidemiologic features and clinical presentation of acute Q fever in hospitalized patients: 323 French cases.
H. Dupont, D. Raoult, P. Brouqui
et al.
How a Generation Was Misled About Natural Selection
Liane Gabora
This article explains how natural selection works and how it has been inappropriately applied to the description of cultural change. It proposes an alternative evolutionary explanation for cultural evolution that describes it in terms of communal exchange.
Significance of the Diagnostic Q Wave of Myocardial Infarction
L. Horan, N. Flowers, Jennifer C. Johnson
Elliptic flow from two- and four-particle correlations in Au + Au collisions at $\sqrt{s_{NN}} = 130$ GeV
C. Adler, Z. Ahammed, C. Allgower
et al.
Elliptic flow holds much promise for studying the early-time thermalization attained in ultrarelativistic nuclear collisions. Flow measurements also provide a means of distinguishing between hydrodynamic models and calculations which approach the low density (dilute gas) limit. Among the effects that can complicate the interpretation of elliptic flow measurements are azimuthal correlations that are unrelated to the reaction plane (non-flow correlations). Using data for Au + Au collisions at $\sqrt{s_{NN}} = 130$ GeV from the STAR TPC, it is found that four-particle correlation analyses can reliably separate flow and non-flow correlation signals. The latter account for on average about 15% of the observed second-harmonic azimuthal correlation, with the largest relative contribution for the most peripheral and the most central collisions. The results are also corrected for the effect of flow variations within centrality bins. This effect is negligible for all but the most central bin, where the correction to the elliptic flow is about a factor of two. A simple new method for two-particle flow analysis based on scalar products is described. An analysis based on the distribution of the magnitude of the flow vector is also described.
Australia's national Q fever vaccination program.
H. Gidding, Cate Wallace, G. Lawrence
et al.
Beam Energy Dependence of Moments of the Net-charge Multiplicity Distributions in Au + Au Collisions at RHIC
L. Adamczyk, J. K. Adkins, G. Agakishiev
et al.