arXiv Open Access 2025

Wasserstein Policy Optimization

David Pfau, Ian Davies, Diana Borsa, Joao G. M. Araujo, Brendan Tracey, +1 more

Abstract

We introduce Wasserstein Policy Optimization (WPO), an actor-critic algorithm for reinforcement learning in continuous action spaces. WPO can be derived as an approximation to Wasserstein gradient flow over the space of all policies projected into a finite-dimensional parameter space (e.g., the weights of a neural network), leading to a simple and completely general closed-form update. The resulting algorithm combines many properties of deterministic and classic policy gradient methods. Like deterministic policy gradients, it exploits knowledge of the gradient of the action-value function with respect to the action. Like classic policy gradients, it can be applied to stochastic policies with arbitrary distributions over actions -- without using the reparameterization trick. We show results on the DeepMind Control Suite and a magnetic confinement fusion task which compare favorably with state-of-the-art continuous control methods.

Topics & Keywords

cs.LG cs.AI

Authors (6)

David Pfau

Ian Davies

Diana Borsa

Joao G. M. Araujo

Brendan Tracey

Hado van Hasselt

Citation Format

Pfau, D., Davies, I., Borsa, D., Araujo, J. G. M., Tracey, B., & van Hasselt, H. (2025). Wasserstein Policy Optimization. https://arxiv.org/abs/2505.00663

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓