
TOP-ERL: Transformer-based Off-policy Episodic RL (ICLR25 Spotlight)




Episodic RL, What and Why?

Episodic Reinforcement Learning (ERL) [1, 4, 5] is a distinct RL family that emphasizes the maximization of returns over entire episodes, typically lasting several seconds, rather than optimizing the intermediate states during environment interactions. Unlike Step-based RL (SRL) [2, 3], ERL shifts the solution search from per-step actions to a parameterized trajectory space, leveraging techniques like Movement Primitives (MPs) [6, 7, 8] for generating action sequences. This approach enables a broader exploration horizon [4], captures trajectory statistics [9], and ensures smooth transitions between re-planning phases [10].

Exploration strategies comparison, SRL vs. ERL [9]: step-based RL explores per action step, while episodic RL explores consistently across the whole episode.
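
To make the contrast concrete, here is a toy sketch (not from this repository; all names and numbers are illustrative) of the two exploration styles: step-based RL injects independent noise at every control step, while episodic RL samples one parameter vector per episode and unrolls a smooth, temporally correlated trajectory from it.

    import numpy as np

    rng = np.random.default_rng(0)
    T, dim_w = 100, 10  # control steps per episode, number of trajectory parameters

    def srl_exploration(action_mean):
        # Step-based RL: independent Gaussian noise added at every action step
        return action_mean + 0.1 * rng.standard_normal(T)

    def erl_exploration(w_mean, basis):
        # Episodic RL: sample trajectory parameters once, unroll the whole episode
        w = w_mean + 0.1 * rng.standard_normal(dim_w)  # one sample per episode
        return basis @ w  # smooth, temporally correlated trajectory

    # Toy basis: Gaussian bumps spread over the episode phase
    phase = np.linspace(0, 1, T)[:, None]
    centers = np.linspace(0, 1, dim_w)[None, :]
    basis = np.exp(-0.5 * ((phase - centers) / 0.1) ** 2)

    noisy_steps = srl_exploration(np.zeros(T))       # jagged, uncorrelated in time
    smooth_traj = erl_exploration(np.zeros(dim_w), basis)  # smooth over the episode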

Use Movement Primitives for Trajectory Generation

Episodic RL often uses movement primitives (MPs) as a parameterized trajectory generator. In TOP-ERL, we use ProDMP [8] for fast computation and better enforcement of initial conditions. A simple illustration of using MPs can be seen as follows:


MP predicts a trajectory (upper curve) by adjusting the weights of the basis functions (lower curves)
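
As a rough illustration of this figure, the following ProMP-style sketch builds a trajectory as a weighted sum of normalized radial basis functions. This is not the ProDMP implementation used in TOP-ERL (ProDMP additionally enforces initial conditions through its dynamics); the basis shape, width, and weights here are illustrative assumptions.

    import numpy as np

    def rbf_basis(t, num_basis=8, width=0.01):
        """Normalized radial basis functions over the phase t in [0, 1]."""
        centers = np.linspace(0, 1, num_basis)
        phi = np.exp(-0.5 * (t[:, None] - centers[None, :]) ** 2 / width)
        return phi / phi.sum(axis=1, keepdims=True)  # each row sums to 1

    t = np.linspace(0, 1, 200)  # phase variable covering the episode
    w = np.array([0.0, 0.3, 0.9, 1.2, 0.8, 0.4, 0.1, 0.0])  # learnable MP weights
    trajectory = rbf_basis(t) @ w  # weighted sum of basis functions (upper curve)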

Use Transformer as an Action Sequence Critic

In the literature, most combinations of RL and Transformers focus on offline, model-based, or POMDP settings. Directly using a Transformer in online RL to predict the value of an action sequence remains largely unexplored. In TOP-ERL, we use a Transformer as an action sequence value predictor and train it with N-step future returns. To ensure stable critic learning, we adapt the trajectory segmentation strategy of [9] and split the long trajectory into sub-sequences of varying lengths.


TOP-ERL utilizes a Transformer critic that predicts the value of executing a sub-sequence of actions from the state at the beginning of the segment.
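
A compact PyTorch sketch of this idea follows. It is our simplification, not the repository's actual model: the class and parameter names are invented, and positional encodings, causal masking, and the handling of variable segment lengths are omitted.

    import torch
    import torch.nn as nn

    class ActionSeqCritic(nn.Module):
        """Value of executing an N-step action segment from a given start state."""
        def __init__(self, state_dim, action_dim, d_model=128, n_heads=4, n_layers=3):
            super().__init__()
            self.state_embed = nn.Linear(state_dim, d_model)
            self.action_embed = nn.Linear(action_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.value_head = nn.Linear(d_model, 1)

        def forward(self, state, actions):
            # state: (B, state_dim); actions: (B, N, action_dim), one segment
            tokens = torch.cat(
                [self.state_embed(state).unsqueeze(1), self.action_embed(actions)],
                dim=1,
            )                             # (B, N + 1, d_model)
            h = self.encoder(tokens)
            return self.value_head(h[:, 0]).squeeze(-1)  # value from the state token

    def n_step_target(rewards, v_bootstrap, gamma=0.99):
        # rewards: (B, N) rewards along the segment; v_bootstrap: (B,) value after it
        n = rewards.shape[1]
        discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
        return (rewards * discounts).sum(dim=-1) + gamma**n * v_bootstrap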

Installation Tutorial

We tested our installation with the following PC setup:

  - Ubuntu 22.04
  - RTX 2060 Super GPU
  - git installed via "sudo apt install git-all"
  - A GitHub account with SSH access

We provide a 12-minute tutorial video to guide your installation step by step. It covers the following steps:

  1. Install Mamba (a faster conda release).

  2. Activate mamba in your terminal:

        source ~/.bashrc  # if you use bash

  3. Clone the repository:

        mkdir top_erl
        cd top_erl
        git clone git@github.com:BruceGeLi/TOP_ERL_ICLR25_Code.git

  4. Install the dependencies:

        cd TOP_ERL_ICLR25_Code
        bash conda_env.sh

     Wait about 10 minutes until the installation finishes (depending on your internet speed).

  5. Activate the mamba (conda) environment:

        mamba activate top_erl_iclr25

  6. Register a wandb account and log in on your local PC:

        wandb login --relogin

  7. Replace the wandb username in the config file, such as "shared_dense.yaml" (see the config sketch after this list).

  8. Run an experiment locally, e.g. the box pushing task with dense reward:

        python seq_mp_exp_multiprocessing.py config/box_push_random_init/seq/entire/local_dense.yaml -o --nocodecopy

  9. To run experiments on a slurm-based HPC, adapt your HPC info in our slurm configs. An example of running the code on slurm:

        python seq_mp_exp_multiprocessing.py config/box_push_random_init/seq/entire/slurm_dense.yaml -o --nocodecopy

  10. We use cw2 to parse our experiment configs into sbatch commands on slurm-based HPC systems. For more technical details, we refer to the cw2 documentation. A sketch of a slurm config section follows below.
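
For step 7, the snippet below is a hedged guess at what the wandb entry in a config such as "shared_dense.yaml" may look like. The key names (project, entity) follow common wandb YAML layouts and are assumptions on our side, so check the actual file:

    # hypothetical layout; key names are assumptions, adapt to the real config file
    wandb:
      project: top_erl_iclr25      # assumed project name
      entity: your_wandb_username  # replace with your own wandb username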
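For steps 9 and 10, here is a minimal sketch of the SLURM document a cw2 config may contain. The key names follow cw2's slurm configuration, but every value (partition, resources, wall time) is a placeholder that you must adapt to your own cluster:

    ---
    # placeholder values; adapt to your HPC before submitting
    name: "SLURM"
    partition: "gpu"       # cluster partition to submit to
    job-name: "top_erl"    # sbatch job name
    num_parallel_jobs: 20  # max number of jobs cw2 keeps queued in parallel
    ntasks: 1
    cpus-per-task: 8
    time: 1440             # wall time in minutes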

 

Cite

If our work benefits your research, please consider citing our paper:

@inproceedings{li2025toperl,
  title={{TOP}-{ERL}: Transformer-based Off-Policy Episodic Reinforcement Learning},
  author={Ge Li and Dong Tian and Hongyi Zhou and Xinkai Jiang and Rudolf Lioutikov and Gerhard Neumann},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=N4NhVN30ph}
}

Our previous work on online RL with temporal correlation and movement primitives, TCE (ICLR 24), can be found here:

@inproceedings{li2024open,
  title={Open the Black Box: Step-based Policy Updates for Temporally-Correlated Episodic Reinforcement Learning},
  author={Ge Li and Hongyi Zhou and Dominik Roth and Serge Thilges and Fabian Otto and Rudolf Lioutikov and Gerhard Neumann},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=mnipav175N}
}

The ProDMP concept is described in the following paper (IEEE RA-L):

@article{li2023prodmp,
  title={ProDMP: A Unified Perspective on Dynamic and Probabilistic Movement Primitives},
  author={Li, Ge and Jin, Zeqi and Volpp, Michael and Otto, Fabian and Lioutikov, Rudolf and Neumann, Gerhard},
  journal={IEEE Robotics and Automation Letters},
  year={2023},
  publisher={IEEE}
}



References

[1] Darrell Whitley, Stephen Dominic, Rajarshi Das, and Charles W Anderson. Genetic reinforcement learning for neurocontrol problems. Machine Learning, 13:259–284, 1993.

[2] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[3] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870. PMLR, 2018.

[4] Jens Kober and Jan Peters. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems, 2008.

[5] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682–697, 2008.

[6] Stefan Schaal. Dynamic movement primitives: a framework for motor control in humans and humanoid robotics. In Adaptive Motion of Animals and Machines, pp. 261–280. Springer, 2006.

[7] Alexandros Paraschos, Christian Daniel, Jan Peters, and Gerhard Neumann. Probabilistic movement primitives. Advances in neural information processing systems, 26, 2013.

[8] Ge Li, Zeqi Jin, Michael Volpp, Fabian Otto, Rudolf Lioutikov, and Gerhard Neumann. ProDMP: A unified perspective on dynamic and probabilistic movement primitives. IEEE Robotics and Automation Letters, 2023.

[9] Ge Li, Hongyi Zhou, Dominik Roth, Serge Thilges, Fabian Otto, Rudolf Lioutikov, and Gerhard Neumann. Open the black box: Step-based policy updates for temporally-correlated episodic reinforcement learning. In International Conference on Learning Representations, 2024.

[10] Fabian Otto, Hongyi Zhou, Onur Celik, Ge Li, Rudolf Lioutikov, and Gerhard Neumann. Mp3: Movement primitive-based (re-) planning policy. arXiv preprint arXiv:2306.12729, 2023.
