Multi-Task RL

May 30, 2025

Suppose we want to teach a single agent not just one game but an entire suite of challenges. For example: a generalist LLM assistant that can do travel booking, grocery shopping, and more; a legged robot that can walk, run, dance, and crouch; a game player that can play Flappy Bird, Pokemon, and Pinball; or a music recommender system that can personalize to many distinct users.

How would we do this? Each of these tasks may have different reward functions, different dynamics, different action spaces, and more. The key idea is to condition on the task.

The biggest challenge in RL so far has been data efficiency. Hopefully, then, there is some way to amortize the data complexity across many tasks and scenarios; that is, we want what the agent learns in one task to carry over to the others.

So instead of teaching an LLM each of these tasks, we can teach it grammar. We can teach a legged robot balance. We can teach a music recommender user similarities.

Different tasks can just be different MDPs. We define a task

$$\mathcal{T}_i := \{\mathcal{S}_i, \mathcal{A}_i, p_i(s_1), p_i(s' \mid s, a), r_i(s, a)\}$$
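
To make the pieces concrete, here is a minimal sketch of a task specification as a Python data structure; the callable interfaces for the initial-state distribution, dynamics, and reward are illustrative assumptions, not a fixed API.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Task:
    """A task T_i = {S_i, A_i, p_i(s_1), p_i(s' | s, a), r_i(s, a)}.

    The callable interfaces below are illustrative assumptions about how
    one might sample from the initial-state distribution and dynamics.
    """
    state_space: Any                               # S_i
    action_space: Any                              # A_i
    sample_initial_state: Callable[[], Any]        # s_1 ~ p_i(s_1)
    sample_next_state: Callable[[Any, Any], Any]   # s' ~ p_i(s' | s, a)
    reward: Callable[[Any, Any], float]            # r_i(s, a)
```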

Then we can treat the task as just another part of the state for our multi-task agent. That is, our states are $s := (\bar{s}, z_i)$, where $\bar{s}$ is the original state and $z_i$ is a task identifier. We can also perform goal-conditioned RL, where the task identifier is a desired goal state we want to reach: $z_i = s_g$.

This is still an MDP, so we can apply all the same RL algorithms as before! The only difference is that our states are augmented with task information, and the same goes for the reward. In the case of goal-conditioned RL, our reward is $r(s) = r(\bar{s}, s_g) = -d(\bar{s}, s_g)$, where $d$ is some distance function ($\ell_2$, sparse 0/1, ...).
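
As a concrete illustration, here is a minimal sketch of a goal-conditioned wrapper that augments the observation with the goal and replaces the reward with a negative $\ell_2$ distance. It assumes a gym-style environment with vector observations; that interface is an illustrative assumption, not a fixed API.

```python
import numpy as np

class GoalConditionedWrapper:
    """Augment observations with a goal and use r = -||s - s_g||_2.

    Assumes a gym-style environment whose reset() returns a vector
    observation and whose step(a) returns (obs, reward, done, info);
    this interface is an illustrative assumption.
    """

    def __init__(self, env, goal):
        self.env = env
        self.goal = np.asarray(goal, dtype=np.float32)

    def reset(self):
        obs = self.env.reset()
        # augmented state s = (s_bar, z_i) with z_i = s_g
        return np.concatenate([obs, self.goal])

    def step(self, action):
        obs, _, done, info = self.env.step(action)
        # replace the task reward with r(s) = -d(s_bar, s_g)
        reward = -float(np.linalg.norm(obs - self.goal))
        return np.concatenate([obs, self.goal]), reward, done, info
```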

In some cases, we can also do better than standard RL methods applied out of the box.

Multi-Task IL

Let's start with imitation learning. Recall our single-task imitation learning problem is to learn

$$\min_\theta \; -\mathbb{E}_{(s,a) \sim \mathcal{D}}\big[\log \pi_\theta(a \mid s)\big] =: \min_\theta \mathcal{L}(\theta, \mathcal{D})$$

Ok, then for $T$ tasks, why don't we just naively learn

$$\min_\theta \sum_{i=1}^{T} \mathcal{L}(\theta, \mathcal{D}_i)$$

For some low-hanging optimization fruit, we can also perform stratified sampling, constructing each minibatch with data from every task. In practice this gives us lower-variance gradients.

We can do the same thing with replay buffers, where we keep track of per-task replay buffers to perform stratified sampling.
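
Here is a minimal sketch of stratified minibatch construction and the resulting behavior-cloning loss. It assumes per-task datasets of (state, action) arrays and a policy that returns a torch distribution when conditioned on task IDs; these interfaces are assumptions for illustration.

```python
import numpy as np
import torch

def stratified_batch(datasets, batch_size):
    """Draw an equal number of (state, action, task_id) samples from each task.

    `datasets` is assumed to be a list with one (states, actions) pair of
    numpy arrays per task; this interface is an illustrative assumption.
    """
    per_task = batch_size // len(datasets)
    states, actions, task_ids = [], [], []
    for i, (s, a) in enumerate(datasets):
        idx = np.random.randint(0, len(s), size=per_task)
        states.append(s[idx])
        actions.append(a[idx])
        task_ids.append(np.full(per_task, i))
    return (np.concatenate(states),
            np.concatenate(actions),
            np.concatenate(task_ids))

def multi_task_bc_loss(policy, states, actions, task_ids):
    """Behavior-cloning loss -log pi(a | s, z_i), averaged over the stratified batch.

    `policy` is assumed to return a torch.distributions object given
    (states, task_ids); any task-conditioning scheme (one-hot, learned
    embedding) works here.
    """
    dist = policy(torch.as_tensor(states, dtype=torch.float32),
                  torch.as_tensor(task_ids, dtype=torch.long))
    return -dist.log_prob(torch.as_tensor(actions, dtype=torch.float32)).mean()
```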

Let's consider the following example. Suppose we want to train a robot to perform language-instructed tasks from video feedback, like "place bottle in the ceramic bowl". Our input channels are live video feed, language, and a human demo video for behavior cloning.

Then we can pass the demo video through a video encoder to get a vector embedding, and similarly we can pass the language through a language encoder to get a vector embedding.

We pass the live feed through a vision network that is conditioned on our vectorized language and demo embeddings. Of course, the architecture in between can be much more complicated, but the output should be some value for each of our available actions (like gripper angle, rotation, and xyz movement).

Our training objective would look something like this:

$$\min_\theta \sum_i \sum_{\substack{(s, a) \sim \mathcal{D}^i_e \\ w^i_h \sim \mathcal{D}^i_h \cup \mathcal{D}^i_e}} \underbrace{-\log \pi_\theta\big(a \mid s, z^i\big)}_{\text{behavior cloning}} \quad \text{where} \quad z^i = (z^i_h, z^i_\ell), \quad \underbrace{z^i_h \sim q(\cdot \mid w^i_h)}_{\text{video encoder}}, \; \underbrace{z^i_\ell \sim q(\cdot \mid w^i_\ell)}_{\text{language encoder}}$$

Nowadays, we can just pass the language instruction as a prompt into a fine-tuned LLM and use its output as the vectorized task representation $z_i$.
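
As a rough sketch of such a conditioned policy in PyTorch: the layer sizes, the Gaussian action head, and the assumption that the vision backbone and language/demo encoders produce fixed-size feature vectors upstream are all illustrative choices, not the specific architecture described above.

```python
import torch
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    """Map (image features, task embedding) to a distribution over actions.

    Dimensions and the Gaussian head are illustrative assumptions; the task
    embedding could come from a language model or a demo-video encoder.
    """

    def __init__(self, img_dim=512, task_dim=768, hidden=256, action_dim=7):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(img_dim + task_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # e.g. gripper angle, rotation, and xyz movement
        self.mean_head = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, img_feat, task_emb):
        h = self.trunk(torch.cat([img_feat, task_emb], dim=-1))
        return torch.distributions.Normal(self.mean_head(h), self.log_std.exp())
```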

Sharing Data Across Tasks

If we collect policy data when conditioning on $z_1$, can we reuse that data to learn something for $z_2$? This would require relabeling the data with $z_2$!

Suppose our agent is learning to play hockey. The two tasks are passing ($z_1$) and shooting goals ($z_2$), and suppose we have a bunch of experience collected while training to pass which we would like to reuse for goal shooting.

Why don't we just store the experience like usual, but also relabel it with $z_2$ and the reward for the goal-shooting task (which is likely low for most passing trajectories)? Conversely, if the agent accidentally performs a good pass while trying to shoot a goal, we can relabel that trajectory with $z_1$ and the (high) passing reward.

This is called hindsight relabeling, and the resulting algorithm is hindsight experience replay (HER). We can summarize it as follows; a code sketch of the relabeling step follows the list.

  • Collect data $\mathcal{D} = \{(s_{1:T}, a_{1:T}, z_i, r_{1:T})\}$ for task $\mathcal{T}_i$ with some policy
  • Store the data in the replay buffer: $\mathcal{R} \leftarrow \mathcal{R} \cup \mathcal{D}$
  • Perform hindsight relabeling: relabel the experience $\mathcal{D}$ for task $\mathcal{T}_j$ and store the relabeled data in the replay buffer:
    $$\mathcal{D}' = \{(s_{1:T}, a_{1:T}, z_j, r'_{1:T})\}, \qquad r'_t = r_j(s_t, a_t), \qquad \mathcal{R} \leftarrow \mathcal{R} \cup \mathcal{D}'$$
  • Update the policy using the replay buffer $\mathcal{R}$, then repeat from Step 1
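
A minimal sketch of the relabeling step in Python, assuming trajectories stored as lists of (state, action) pairs and a computable reward function $r_j$ for the new task; both interfaces are illustrative.

```python
def relabel_for_task(trajectory, z_j, reward_fn_j):
    """Relabel a trajectory collected for task i so it can be reused for task j.

    `trajectory` is assumed to be a list of (state, action) pairs and
    `reward_fn_j` a computable reward r_j(s, a) for the new task; both are
    illustrative interfaces. The relabeled transitions are then appended to
    the shared replay buffer alongside the originals.
    """
    return [(s, a, z_j, reward_fn_j(s, a)) for (s, a) in trajectory]
```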

Which task Tj\calt_j should we choose for relabeling? We can do this either randomly, or we can choose tasks in which the trajectory gets a high reward.

In order to even apply relabeling, we need a reward we can compute for the new task and dynamics that are consistent across tasks; and since the relabeled data is off-policy with respect to the new task, we also need an off-policy algorithm.

In goal-conditioned RL, our data collection $\mathcal{D}$ would always be directed toward the commanded goal $s_g$. When doing hindsight relabeling, we would relabel $\mathcal{D}$ using the last state as the goal and set $r'_t = -d(s_t, s_T)$. Alternatively, we could use any state along the trajectory as the goal when relabeling, which alleviates exploration challenges.
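
For the goal-conditioned case, here is a minimal sketch of final-state relabeling, assuming trajectories stored as arrays of states and actions and an $\ell_2$ distance; all of these are illustrative choices.

```python
import numpy as np

def relabel_with_final_state(states, actions):
    """Hindsight-relabel a goal-conditioned trajectory with its final state as the goal.

    `states` is assumed to have shape (T, state_dim) and `actions` shape
    (T, action_dim); the reward is recomputed as r'_t = -||s_t - s_T||_2.
    In practice we could instead relabel with any state along the trajectory.
    """
    goal = states[-1]
    rewards = -np.linalg.norm(states - goal, axis=-1)
    return [(s, a, goal, r) for s, a, r in zip(states, actions, rewards)]
```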

Conclusion

Overall, multi-task RL just folds task labels into our familiar MDP framework, treating each game, motion skill, or recommendation objective as part of the state. We can apply the same algorithms we already know.

By conditioning our policy on a task embedding and using simple tricks like stratified sampling for imitation learning, per-task replay buffers, and hindsight relabeling (HER), we can dramatically boost data efficiency and transfer across diverse tasks.

That way, balancing on one surface or parsing one language prompt can directly accelerate learning on another task, and off-policy methods let us relabel and reuse past trajectories toward new goals. Looking forward, richer architectures, such as task-specific adapters, hypernetworks that generate context-aware weights, and meta-learning schemes, will tighten the sharing of skills and speed up adaptation to unseen tasks.