
TOP-ERL: Transformer-based Off-policy Episodic RL (ICLR25 Spotlight)




Episodic RL, What and Why?

Episodic Reinforcement Learning (ERL) [1, 4, 5] is a distinct RL family that emphasizes the maximization of returns over entire episodes, typically lasting several seconds, rather than optimizing the intermediate states during environment interactions. Unlike Step-based RL (SRL) [2, 3], ERL shifts the solution search from per-step actions to a parameterized trajectory space, leveraging techniques like Movement Primitives (MPs) [6, 7, 8] for generating action sequences. This approach enables a broader exploration horizon [4], captures trajectory statistics [9], and ensures smooth transitions between re-planning phases [10].

Exploration strategies comparison, SRL vs. ERL [9]: step-based RL explores per action step, while episodic RL explores consistently across the whole episode.
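
To make the contrast concrete, here is a toy sketch (not from this repository; all names and numbers are illustrative) of the two exploration styles: step-based RL injects independent noise at every control step, while episodic RL samples one parameter vector per episode and unrolls a smooth, temporally correlated trajectory from it.

    import numpy as np

    rng = np.random.default_rng(0)
    T, dim_w = 100, 10  # control steps per episode, number of trajectory parameters

    def srl_exploration(action_mean):
        # Step-based RL: independent Gaussian noise added at every action step
        return action_mean + 0.1 * rng.standard_normal(T)

    def erl_exploration(w_mean, basis):
        # Episodic RL: sample trajectory parameters once, unroll the whole episode
        w = w_mean + 0.1 * rng.standard_normal(dim_w)  # one sample per episode
        return basis @ w  # smooth, temporally correlated trajectory

    # Toy basis: Gaussian bumps spread over the episode phase
    phase = np.linspace(0, 1, T)[:, None]
    centers = np.linspace(0, 1, dim_w)[None, :]
    basis = np.exp(-0.5 * ((phase - centers) / 0.1) ** 2)

    noisy_steps = srl_exploration(np.zeros(T))       # jagged, uncorrelated in time
    smooth_traj = erl_exploration(np.zeros(dim_w), basis)  # smooth over the episode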

Use Movement Primitives for Trajectory Generation

Episodic RL often uses movement primitives (MPs) as a parameterized trajectory generator. In TOP-ERL, we use ProDMP [8] for fast computation and better enforcement of initial conditions. A simple illustration of using MPs can be seen as follows:


MP predicts a trajectory (upper curve) by adjusting the weights of the basis functions (lower curves)
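
As a rough illustration of this figure, the following ProMP-style sketch builds a trajectory as a weighted sum of normalized radial basis functions. This is not the ProDMP implementation used in TOP-ERL (ProDMP additionally enforces initial conditions through its dynamics); the basis shape, width, and weights here are illustrative assumptions.

    import numpy as np

    def rbf_basis(t, num_basis=8, width=0.01):
        """Normalized radial basis functions over the phase t in [0, 1]."""
        centers = np.linspace(0, 1, num_basis)
        phi = np.exp(-0.5 * (t[:, None] - centers[None, :]) ** 2 / width)
        return phi / phi.sum(axis=1, keepdims=True)  # each row sums to 1

    t = np.linspace(0, 1, 200)  # phase variable covering the episode
    w = np.array([0.0, 0.3, 0.9, 1.2, 0.8, 0.4, 0.1, 0.0])  # learnable MP weights
    trajectory = rbf_basis(t) @ w  # weighted sum of basis functions (upper curve)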

Use Transformer as an Action Sequence Critic

In the literature, most combinations of RL and Transformers focus on offline, model-based, or POMDP settings. Directly using a Transformer in online RL to predict the value of an action sequence remains largely unexplored. In TOP-ERL, we use a Transformer as an action sequence value predictor and train it with N-step future returns. To ensure stable critic learning, we adapt the trajectory segmentation strategy of [9] and split the long trajectory into sub-sequences of varying lengths.


TOP-ERL utilizes a Transformer critic that predicts the value of executing a sub-sequence of actions from the state at the beginning of the segment.
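
A compact PyTorch sketch of this idea follows. It is our simplification, not the repository's actual model: the class and parameter names are invented, and positional encodings, causal masking, and the handling of variable segment lengths are omitted.

    import torch
    import torch.nn as nn

    class ActionSeqCritic(nn.Module):
        """Value of executing an N-step action segment from a given start state."""
        def __init__(self, state_dim, action_dim, d_model=128, n_heads=4, n_layers=3):
            super().__init__()
            self.state_embed = nn.Linear(state_dim, d_model)
            self.action_embed = nn.Linear(action_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.value_head = nn.Linear(d_model, 1)

        def forward(self, state, actions):
            # state: (B, state_dim); actions: (B, N, action_dim), one segment
            tokens = torch.cat(
                [self.state_embed(state).unsqueeze(1), self.action_embed(actions)],
                dim=1,
            )                             # (B, N + 1, d_model)
            h = self.encoder(tokens)
            return self.value_head(h[:, 0]).squeeze(-1)  # value from the state token

    def n_step_target(rewards, v_bootstrap, gamma=0.99):
        # rewards: (B, N) rewards along the segment; v_bootstrap: (B,) value after it
        n = rewards.shape[1]
        discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
        return (rewards * discounts).sum(dim=-1) + gamma**n * v_bootstrap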

Installation Tutorial

We tested our installation with the following PC setup:

  - Ubuntu 22.04
  - RTX 2060 Super GPU
  - git installed via "sudo apt install git-all"
  - A GitHub account with SSH access

We provide a 12-minute tutorial video to guide your installation step by step. It covers the following steps:

  1. Install Mamba (a faster conda release).

  2. Activate mamba in your terminal:

        source ~/.bashrc  # if you use bash

  3. Clone the repository:

        mkdir top_erl
        cd top_erl
        git clone git@github.com:BruceGeLi/TOP_ERL_ICLR25_Code.git

  4. Install the dependencies:

        cd TOP_ERL_ICLR25_Code
        bash conda_env.sh

     Wait about 10 minutes until the installation finishes (depending on your internet speed).

  5. Activate the mamba (conda) environment:

        mamba activate top_erl_iclr25

  6. Register a wandb account and log in on your local PC:

        wandb login --relogin

  7. Replace the wandb username in the config file, such as "shared_dense.yaml" (see the config sketch after this list).

  8. Run an experiment locally, e.g. the box pushing task with dense reward:

        python seq_mp_exp_multiprocessing.py config/box_push_random_init/seq/entire/local_dense.yaml -o --nocodecopy

  9. To run experiments on a slurm-based HPC, adapt your HPC info in our slurm configs. An example of running the code on slurm:

        python seq_mp_exp_multiprocessing.py config/box_push_random_init/seq/entire/slurm_dense.yaml -o --nocodecopy

  10. We use cw2 to parse our experiment configs into sbatch commands on slurm-based HPC systems. For more technical details, we refer to the cw2 documentation. A sketch of a slurm config section follows below.
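
For step 7, the snippet below is a hedged guess at what the wandb entry in a config such as "shared_dense.yaml" may look like. The key names (project, entity) follow common wandb YAML layouts and are assumptions on our side, so check the actual file:

    # hypothetical layout; key names are assumptions, adapt to the real config file
    wandb:
      project: top_erl_iclr25      # assumed project name
      entity: your_wandb_username  # replace with your own wandb username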
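For steps 9 and 10, here is a minimal sketch of the SLURM document a cw2 config may contain. The key names follow cw2's slurm configuration, but every value (partition, resources, wall time) is a placeholder that you must adapt to your own cluster:

    ---
    # placeholder values; adapt to your HPC before submitting
    name: "SLURM"
    partition: "gpu"       # cluster partition to submit to
    job-name: "top_erl"    # sbatch job name
    num_parallel_jobs: 20  # max number of jobs cw2 keeps queued in parallel
    ntasks: 1
    cpus-per-task: 8
    time: 1440             # wall time in minutes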

 

Cite

If our work benefits your research, please consider citing our paper:

@inproceedings{li2025toperl,
  title={{TOP}-{ERL}: Transformer-based Off-Policy Episodic Reinforcement Learning},
  author={Ge Li and Dong Tian and Hongyi Zhou and Xinkai Jiang and Rudolf Lioutikov and Gerhard Neumann},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=N4NhVN30ph}
}

Our previous work on online RL with temporal correlation and movement primitives, TCE (ICLR 24), can be found here:

@inproceedings{li2024open,
  title={Open the Black Box: Step-based Policy Updates for Temporally-Correlated Episodic Reinforcement Learning},
  author={Ge Li and Hongyi Zhou and Dominik Roth and Serge Thilges and Fabian Otto and Rudolf Lioutikov and Gerhard Neumann},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=mnipav175N}
}

The ProDMP concept is described in the following paper (IEEE RA-L):

@article{li2023prodmp,
  title={ProDMP: A Unified Perspective on Dynamic and Probabilistic Movement Primitives},
  author={Li, Ge and Jin, Zeqi and Volpp, Michael and Otto, Fabian and Lioutikov, Rudolf and Neumann, Gerhard},
  journal={IEEE Robotics and Automation Letters},
  year={2023},
  publisher={IEEE}
}



References

[1] Darrell Whitley, Stephen Dominic, Rajarshi Das, and Charles W Anderson. Genetic reinforcement learning for neurocontrol problems. Machine Learning, 13:259–284, 1993.

[2] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[3] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870. PMLR, 2018.

[4] Jens Kober and Jan Peters. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems, 2008.

[5] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682–697, 2008.

[6] Stefan Schaal. Dynamic movement primitives: a framework for motor control in humans and humanoid robotics. In Adaptive Motion of Animals and Machines, pp. 261–280. Springer, 2006.

[7] Alexandros Paraschos, Christian Daniel, Jan Peters, and Gerhard Neumann. Probabilistic movement primitives. Advances in neural information processing systems, 26, 2013.

[8] Ge Li, Zeqi Jin, Michael Volpp, Fabian Otto, Rudolf Lioutikov, and Gerhard Neumann. ProDMP: A unified perspective on dynamic and probabilistic movement primitives. IEEE Robotics and Automation Letters, 2023.

[9] Ge Li, Hongyi Zhou, Dominik Roth, Serge Thilges, Fabian Otto, Rudolf Lioutikov, and Gerhard Neumann. Open the black box: Step-based policy updates for temporally-correlated episodic reinforcement learning. In International Conference on Learning Representations, 2024.

[10] Fabian Otto, Hongyi Zhou, Onur Celik, Ge Li, Rudolf Lioutikov, and Gerhard Neumann. Mp3: Movement primitive-based (re-) planning policy. arXiv preprint arXiv:2306.12729, 2023.
