where at each timestep t the environment is in a “true” state st∈S, the agent takes an action at∼πθ(at∣zt), the environment transitions to st+1∼P(⋅∣st,at), and the agent receives reward rt=R(st,at).
In our CARLA‐based simulation, we decompose
st=(pt,ot),
where
pt∈P is the privileged portion of the simulator state (e.g., ground‐truth positions of all vehicles, traffic‐light timings, etc.);
ot∈O is the environmental (or “observable”) information: what your sensors actually see (camera images, LiDAR, IMU, etc.).
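As a concrete illustration of this split, here is a minimal sketch of how st=(pt,ot) might be represented in code; the field names and shapes are hypothetical, not CARLA API identifiers.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class PrivilegedState:                # p_t: only available inside the simulator
    vehicle_positions: np.ndarray     # (N_vehicles, 3) ground-truth world coordinates
    traffic_light_phases: np.ndarray  # (N_lights,) current phase / remaining time
    ego_route_progress: float         # fraction of the planned route completed


@dataclass
class Observation:                    # o_t: what the on-board sensors actually return
    camera_rgb: np.ndarray            # (H, W, 3) front-camera image
    lidar_points: np.ndarray          # (N_points, 4) x, y, z, intensity
    imu: np.ndarray                   # (6,) linear accelerations and angular velocities


@dataclass
class SimState:                       # s_t = (p_t, o_t)
    privileged: PrivilegedState
    observation: Observation
```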
We have pre‐trained a teacher world model with parameters ϕ. Concretely, DreamerV3’s Recurrent State‐Space Model (RSSM) consists of
a recurrent model ht+1 = fϕ(ht, zt, at),
a (privileged) posterior zt ∼ qϕ(zt ∣ ht, pt, ot),
a prior ẑt ∼ pϕ(ẑt ∣ ht),
a decoder ôt ∼ pϕ(ôt ∣ ht, zt),
a reward head r̂t ∼ pϕ(r̂t ∣ ht, zt), and
a continue head ĉt ∼ pϕ(ĉt ∣ ht, zt).
All pϕ and qϕ distributions are implemented as Gaussians with small neural nets providing means and (diagonal) covariances. The teacher world‐model parameters ϕ are frozen once pre‐training finishes.
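For concreteness, here is a minimal PyTorch sketch of one such Gaussian head (a small MLP emitting a mean and a diagonal covariance); the layer sizes and activation choices are illustrative assumptions, not DreamerV3’s actual hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Independent, Normal


class DiagonalGaussianHead(nn.Module):
    """Small MLP that parameterizes a Gaussian with diagonal covariance."""

    def __init__(self, in_dim: int, out_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(),
                                 nn.Linear(hidden, hidden), nn.ELU())
        self.mean = nn.Linear(hidden, out_dim)
        self.pre_std = nn.Linear(hidden, out_dim)

    def forward(self, x: torch.Tensor) -> Independent:
        h = self.net(x)
        std = F.softplus(self.pre_std(h)) + 1e-4   # softplus keeps the diagonal scales positive
        return Independent(Normal(self.mean(h), std), 1)
```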
Student Encoder and Adaptor
At inference time in the real world, we will only have access to ot. We therefore introduce:
an adaptation module gψ, which learns to predict p^t from ot:
p^t ∼ gψ(⋅ ∣ ot) = N(μadapt(ot), Σadapt(ot)).
In other words, gψ outputs both a mean μadapt(ot) and a diagonal‐covariance Σadapt(ot), so that we can sample multiple p^t to capture uncertainty.
a student posterior encoder with parameters ψ, which mimics the teacher’s latent but only sees (p^t, ot):
zt ∼ qψ(zt ∣ ht, p^t, ot).
In practice, we reuse the teacher’s deterministic decoder/dynamics/reward/continue networks, but we replace the teacher posterior qϕ(zt∣ht,pt,ot) by our student posterior qψ(zt∣ht,p^t,ot); a minimal sketch of both modules is given after the note below.
Finally, the policy has parameters θ. At inference we will sample
at∼πθ(at∣zt).
Note: during the latent‐distillation training stage described below, the teacher world model ϕ and the policy θ remain frozen (apart from the optional ϕstudent heads introduced under Gradient‐step 2); later, we optionally fine‐tune θ on student latent rollouts.
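Here is a minimal sketch of the adaptor gψ and the student posterior qψ, reusing the hypothetical DiagonalGaussianHead from the previous snippet; it assumes the observation has already been encoded into a flat obs_embed vector.

```python
# Assumes the imports and the DiagonalGaussianHead sketch from the previous snippet.

class Adaptor(nn.Module):
    """g_psi: predicts a Gaussian over the privileged state p_t from the observation o_t."""

    def __init__(self, obs_dim: int, priv_dim: int):
        super().__init__()
        self.head = DiagonalGaussianHead(obs_dim, priv_dim)

    def forward(self, obs_embed: torch.Tensor) -> Independent:
        # N(mu_adapt(o_t), Sigma_adapt(o_t)); sample it (or take .mean) to obtain p_hat_t.
        return self.head(obs_embed)


class StudentPosterior(nn.Module):
    """q_psi(z_t | h_t, p_hat_t, o_t): mirrors the teacher posterior, but sees p_hat_t instead of p_t."""

    def __init__(self, h_dim: int, priv_dim: int, obs_dim: int, z_dim: int):
        super().__init__()
        self.head = DiagonalGaussianHead(h_dim + priv_dim + obs_dim, z_dim)

    def forward(self, h: torch.Tensor, p_hat: torch.Tensor, obs_embed: torch.Tensor) -> Independent:
        return self.head(torch.cat([h, p_hat, obs_embed], dim=-1))
```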
Training Objectives
Our goal is twofold:
Teach the student encoder qψ(zt∣ht,p^t,ot) to match the teacher’s “privileged” latent zt∼qϕ(zt∣ht,pt,ot).
Train (or fine‐tune) all student‐based dynamics/reward/continue/decoder heads so that using the student’s latent zt (inferred from p^t and ot) in place of the teacher’s latent still reconstructs next‐step observations, rewards, and continuation flags accurately.
We do this by alternating between two phases on each minibatch of simulator data:
(A) Latent‐Matching Phase (Distillation):
We sample (pt,ot) from replay, compute the teacher’s posterior qϕ(zt∣ht,pt,ot), and force the student’s posterior qψ(zt∣ht,p^t,ot), which depends on p^t∼gψ(⋅∣ot), to match it.
(B) Student‐World‐Model Phase (Reconstruction):
We sample p^t∼gψ(⋅∣ot), then sample zt∼qψ(zt∣ht,p^t,ot), and train the student’s deterministic dynamics/reward/continue/decoder—exactly as if zt were the “true” latent—by minimizing the usual negative‐ELBO.
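As a rough preview of phase (B), here is a minimal sketch of such a negative‐ELBO objective computed on a student latent. It assumes the reused heads return torch distributions (so log_prob is available), omits DreamerV3’s KL balancing and loss scales, and uses the imports from the earlier snippets; the precise per‐phase losses follow below.

```python
def world_model_loss(decoder, reward_head, continue_head, prior_head,
                     h, z_dist, obs_target, reward_target, continue_target):
    """Phase (B): negative-ELBO-style loss evaluated with a student latent.

    z_dist is the student posterior q_psi(z_t | h_t, p_hat_t, o_t); prior_head(h)
    gives the teacher prior p_phi(z_t | h_t). All heads return torch distributions.
    """
    z = z_dist.rsample()                        # reparameterized sample, so gradients reach psi
    feat = torch.cat([h, z], dim=-1)

    recon_nll = -decoder(feat).log_prob(obs_target).mean()          # observation reconstruction
    reward_nll = -reward_head(feat).log_prob(reward_target).mean()  # reward prediction
    cont_nll = -continue_head(feat).log_prob(continue_target).mean()  # continuation flag
    kl = torch.distributions.kl_divergence(z_dist, prior_head(h)).mean()  # posterior-to-prior KL

    return recon_nll + reward_nll + cont_nll + kl
```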
Below we give the precise losses in each phase. In latent matching, we:
Sample a minibatch {(pt,ot,ht−1,zt−1,at−1)} from the replay buffer. (Here ht−1 and zt−1 come from the frozen teacher model when the data were generated.)
Compute the recurrent state ht = fϕ(ht−1, zt−1, at−1) and the teacher posterior parameters (μtteach, Σtteach) of the frozen qϕ(zt∣ht,pt,ot).
Compute student posterior parameters
Sample K adaptor predictions p^t(i) ∼ gψ(⋅∣ot), i = 1, …, K. For each i, let
(μt(i), Σt(i)) = Encoderψ(ht, p^t(i), ot),
so that
zt(i) ∼ N(μt(i), Σt(i)).
Latent‐distillation loss
We match the student’s Gaussian N(μt(i), Σt(i)) to the teacher’s N(μtteach, Σtteach). A convenient closed form is the KL divergence between two Gaussians: for each i,
Ldistill(i) = DKL( N(μt(i), Σt(i)) ∥ N(μtteach, Σtteach) ),
and we average over the K adaptor samples, Ldistill = (1/K) Σi Ldistill(i).
Gradient‐step 1: update ψ (the adaptor and the student encoder) by descending ∇ψ Ldistill with learning rate α.
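A sketch of this distillation loss using torch.distributions, which provides the closed‐form Gaussian KL; the teacher posterior is treated as a fixed target since ϕ is frozen.

```python
def latent_distillation_loss(student_dists, teacher_dist):
    """L_distill: KL(student_i || teacher), averaged over the K adaptor samples.

    student_dists: list of Independent(Normal) posteriors, one per sampled p_hat_t^(i)
    teacher_dist:  the teacher posterior N(mu_t^teach, Sigma_t^teach); phi is frozen,
                   so it acts purely as a target.
    """
    kls = [torch.distributions.kl_divergence(q_i, teacher_dist) for q_i in student_dists]
    return torch.stack(kls).mean()
```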
Gradient‐step 2:
Update ψ (the adaptor and student encoder) and also the student‐world‐model’s decoders/predictors (which share parameters with ϕ except for the posterior). Concretely, we descend
∇ψ,ϕstudent Lworld, where Lworld is the usual negative ELBO of Phase (B) above.
Here ϕstudent denotes any parameters of the teacher model that we allow to be fine‐tuned on top of student latents; typically one fine‐tunes only the decoder and reward/continue heads, leaving the teacher’s recurrent prior fϕ frozen. (You may optionally leave everything frozen except ψ.) One way to set up this split is sketched below.
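A possible realization of this parameter split in PyTorch, as a sketch; the module and attribute names (teacher.decoder, teacher.reward_head, …) and the learning rates are assumptions.

```python
# Freeze the entire teacher, then re-enable only the heads we fine-tune on student latents.
for p in teacher.parameters():
    p.requires_grad_(False)
phi_student_modules = (teacher.decoder, teacher.reward_head, teacher.continue_head)
for m in phi_student_modules:             # phi_student: decoder + reward/continue heads
    for p in m.parameters():
        p.requires_grad_(True)
# The recurrent prior f_phi and the teacher posterior stay frozen throughout.

alpha, beta = 1e-4, 3e-4                  # illustrative learning rates, not tuned values
psi_params = list(adaptor.parameters()) + list(student_posterior.parameters())
phi_student_params = [p for m in phi_student_modules for p in m.parameters()]

distill_opt = torch.optim.Adam(psi_params, lr=alpha)                      # phase (A)
world_opt = torch.optim.Adam(psi_params + phi_student_params, lr=beta)    # phase (B)
```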
Putting it all together, for each minibatch B of timesteps t, we perform:
(A) one gradient step on the latent‐distillation loss Ldistill, updating ψ with learning rate α;
(B) one gradient step on the student world‐model loss Lworld, updating (ψ, ϕstudent) with learning rate β.
Here α,β are chosen learning rates. In practice you alternate them every gradient update (or every few gradient updates), so that the student encoder sees both the “teacher matching” signal and the “reconstruction” signal in roughly equal measure. Over many iterations, ψ converges to a point where
qψ(zt∣ht,gψ(ot),ot)≈qϕ(zt∣ht,pt,ot)
(i.e., latent distillation), while also ensuring zt can be fed into the reused decoders to reconstruct consistent next observations, rewards, and continuation signals (the “world‐model” part).
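A minimal sketch of this alternation, assuming the hypothetical modules, losses, and optimizers from the earlier snippets and a replay iterator whose batches carry the fields named in the comments.

```python
K = 4                      # number of adaptor samples p_hat_t^(i) per timestep (illustrative)

for step, batch in enumerate(replay_loader):
    # h_t = f_phi(h_{t-1}, z_{t-1}, a_{t-1}); the teacher recurrent model stays frozen.
    h = teacher.rssm(batch.h_prev, batch.z_prev, batch.a_prev)

    # Teacher posterior q_phi(z_t | h_t, p_t, o_t): phi is frozen, so it is a fixed target.
    teacher_post = teacher.posterior(h, batch.p, batch.obs_embed)

    # Student side: K adaptor samples p_hat_t^(i), each giving a student posterior.
    p_hat_dist = adaptor(batch.obs_embed)
    student_posts = [student_posterior(h, p_hat_dist.rsample(), batch.obs_embed)
                     for _ in range(K)]

    if step % 2 == 0:
        # Phase (A): latent matching, learning rate alpha via distill_opt.
        loss = latent_distillation_loss(student_posts, teacher_post)
        distill_opt.zero_grad()
        loss.backward()
        distill_opt.step()
    else:
        # Phase (B): student world model (negative ELBO on a student latent), learning rate beta.
        loss = world_model_loss(teacher.decoder, teacher.reward_head, teacher.continue_head,
                                teacher.prior, h, student_posts[0],
                                batch.obs, batch.reward, batch.cont)
        world_opt.zero_grad()
        loss.backward()
        world_opt.step()
```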
Inference (Querying the Policy with Only ot)
Once training converges:
Receive a new real‐world observation ot (no privileged pt available).
Compute [μadapt(ot),Σadapt(ot)] and sample p^t∼N(μadapt,Σadapt). In many implementations one simply takes p^t=μadapt(ot) (the mean) to reduce variance.
Compute the student‐posterior parameters (μt,Σt)=Encoderψ(ht,p^t,ot), then set zt=μt.
Advance the recurrent state:
ht+1=fϕ(ht,zt,at).
Query the student policy
at∼πθ(at∣zt).
In practice, if you wish to capture adaptation uncertainty in real time, you can sample multiple p^t and average or pick an action distribution that is robust to that uncertainty. But for many real‐world robotics applications, setting p^t equal to the adaptor‐mean μadapt(ot) suffices.
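As a sketch, one real‐time control step could then look as follows (again using the hypothetical modules from the earlier snippets; no privileged pt is touched).

```python
@torch.no_grad()
def act(h, obs_embed, adaptor, student_posterior, teacher, policy):
    """One control step that uses only the observation o_t; no privileged p_t is needed.

    `policy(z)` is assumed to return a torch distribution over actions, and
    `teacher.rssm` stands in for the frozen recurrent model f_phi.
    """
    p_hat = adaptor(obs_embed).mean                  # p_hat_t = mu_adapt(o_t); sample() instead to keep uncertainty
    z = student_posterior(h, p_hat, obs_embed).mean  # z_t = mu_t
    action = policy(z).sample()                      # a_t ~ pi_theta(a_t | z_t)
    h_next = teacher.rssm(h, z, action)              # h_{t+1} = f_phi(h_t, z_t, a_t)
    return action, h_next
```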