
Wasserstein Policy Optimization

David Pfau, Ian Davies, Diana Borsa, Joao G. M. Araujo, Brendan Tracey, Hado van Hasselt

Abstract

We introduce Wasserstein Policy Optimization (WPO), an actor-critic algorithm for reinforcement learning in continuous action spaces. WPO can be derived as an approximation to Wasserstein gradient flow over the space of all policies projected into a finite-dimensional parameter space (e.g., the weights of a neural network), leading to a simple and completely general closed-form update. The resulting algorithm combines many properties of deterministic and classic policy gradient methods. Like deterministic policy gradients, it exploits knowledge of the gradient of the action-value function with respect to the action. Like classic policy gradients, it can be applied to stochastic policies with arbitrary distributions over actions -- without using the reparameterization trick. We show results on the DeepMind Control Suite and a magnetic confinement fusion task which compare favorably with state-of-the-art continuous control methods.
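The abstract names the two ingredients of the update without giving its closed form: actions are sampled from a stochastic policy with no reparameterization, and the critic enters through the gradient of Q with respect to the action. The sketch below (plain JAX, a toy 1-D Gaussian policy, and a hand-written stand-in critic q_value) is only an illustration of how those pieces could fit together; the cross-derivative form it uses, contracting the parameter-gradient of the action-score with dQ/da, is an assumption made for the sketch, not a quote of the paper's closed-form update.

import jax
import jax.numpy as jnp

LOG_STD = -1.0  # fixed exploration noise; kept constant to keep the toy stable

def q_value(state, action):
    # Hypothetical stand-in for a learned critic Q(s, a), chosen smooth so
    # that dQ/da exists; the best action at state s is tanh(s).
    return -(action - jnp.tanh(state)) ** 2

def log_pi(theta, state, action):
    # Gaussian policy over a 1-D action: mean is linear in the state.
    mean = theta[0] * state + theta[1]
    return (-0.5 * ((action - mean) / jnp.exp(LOG_STD)) ** 2
            - LOG_STD - 0.5 * jnp.log(2.0 * jnp.pi))

# d(log pi)/da: the policy's score with respect to the ACTION, not theta.
score_wrt_action = jax.grad(log_pi, argnums=2)

def per_sample_update(theta, state, action):
    # Assumed cross-derivative form for this sketch: contract the
    # parameter-gradient of the action-score with dQ/da.
    dq_da = jax.grad(q_value, argnums=1)(state, action)
    return jax.grad(lambda th: score_wrt_action(th, state, action))(theta) * dq_da

key = jax.random.PRNGKey(0)
theta = jnp.array([0.5, 0.0])  # [mean slope, mean bias]
for _ in range(100):
    key, k_s, k_a = jax.random.split(key, 3)
    states = jax.random.normal(k_s, (256,))
    # Actions are sampled outside any jax.grad transformation, so no
    # gradient flows through the sampling step: no reparameterization trick.
    actions = (theta[0] * states + theta[1]
               + jnp.exp(LOG_STD) * jax.random.normal(k_a, (256,)))
    grads = jax.vmap(per_sample_update, in_axes=(None, 0, 0))(theta, states, actions)
    theta = theta + 1e-2 * grads.mean(axis=0)  # ascend expected Q

print(theta)  # mean should approach a least-squares linear fit of tanh(state)

The design point the sketch makes concrete is the one the abstract emphasizes: like deterministic policy gradients, per-action direction information comes from dQ/da rather than from the return alone, yet the policy stays stochastic and is differentiated only through its log-density, so sampling never needs to be reparameterized.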


Authors (6)

David Pfau
Ian Davies
Diana Borsa
Joao G. M. Araujo
Brendan Tracey
Hado van Hasselt

Citation Format

Pfau, D., Davies, I., Borsa, D., Araujo, J. G. M., Tracey, B., & van Hasselt, H. (2025). Wasserstein Policy Optimization. arXiv preprint arXiv:2505.00663. https://arxiv.org/abs/2505.00663

Journal Information
Year Published: 2025
Language: en
Source Database: arXiv
Access: Open Access