Sim2Real

June 3, 2025

We assume an underlying MDP

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma),$$

where at each timestep $t$ the environment is in a "true" state $s_t \in \mathcal{S}$, the agent takes $a_t \sim \pi_\theta(a_t \mid z_t)$, and then observes $s_{t+1}\sim P(\cdot \mid s_t,a_t)$ and reward $r_t=R(s_t,a_t)$.

In our CARLA‐based simulation, we decompose

$$s_t = (p_t,\,o_t),$$

where

  • $p_t\in \mathcal{P}$ is the privileged portion of the simulator state (e.g., ground-truth positions of all vehicles, traffic-light timings, etc.);
  • $o_t\in \mathcal{O}$ is the environmental (or "observable") information: what your sensors actually see (camera images, LiDAR, IMU, etc.). A small code sketch of this split follows.
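As an illustration only, the decomposition $s_t=(p_t,o_t)$ might be represented as below; the field names are hypothetical and depend on the specific CARLA setup.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SimState:
    """One simulator timestep s_t = (p_t, o_t); all field names are illustrative."""
    # privileged portion p_t (available only in simulation)
    ego_pose: np.ndarray               # ground-truth ego position and heading
    other_vehicle_poses: np.ndarray    # ground-truth poses of surrounding vehicles
    traffic_light_timings: np.ndarray  # remaining time per signal phase
    # observable portion o_t (what the real sensors provide)
    camera: np.ndarray                 # RGB image
    lidar: np.ndarray                  # point cloud
    imu: np.ndarray                    # accelerations and angular rates
```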

We have pre-trained a teacher world model with parameters $\phi$. Concretely, DreamerV3's Recurrent State-Space Model (RSSM) is:

  1. Recurrent update (hidden state)

    $$h_t = f_\phi\bigl(h_{t-1},\,z_{t-1},\,a_{t-1}\bigr)\quad\text{(RNN prior)},$$
  2. Teacher posterior (privileged)

    $$z_t \sim q_\phi\bigl(z_t \mid h_t,\;p_t,\;o_t\bigr), \qquad q_\phi(z\mid\cdot) = \mathcal{N}\bigl(\mu^\text{teach}_t,\;\Sigma^\text{teach}_t\bigr),$$
  3. Dynamics predictor (latent prior)

    $$\hat z_t \sim p_\phi\bigl(\hat z_t \mid h_t\bigr),$$
  4. Reward predictor

    $$\hat r_t \sim p_\phi\bigl(\hat r_t \mid h_t,\;z_t\bigr),$$
  5. Continue predictor

    $$\hat c_t \sim p_\phi\bigl(\hat c_t \mid h_t,\;z_t\bigr),$$
  6. Decoder

    $$\hat o_t \sim p_\phi\bigl(\hat o_t \mid h_t,\;z_t\bigr).$$

All $p_\phi$ and $q_\phi$ distributions are implemented as Gaussians with small neural nets providing means and (diagonal) covariances. The teacher world-model parameters $\phi$ are frozen once pre-training finishes.
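To make the teacher's interfaces concrete, here is a minimal PyTorch sketch of the RSSM components above; the module types, layer sizes, and the assumption that $o_t$ and $p_t$ arrive as flat feature vectors are illustrative choices, not DreamerV3's actual implementation.

```python
import torch
import torch.nn as nn

class TeacherRSSM(nn.Module):
    """Sketch of the frozen teacher world model (parameters phi); sizes are illustrative."""
    def __init__(self, h_dim=512, z_dim=32, p_dim=64, o_dim=256, a_dim=2):
        super().__init__()
        self.rnn = nn.GRUCell(z_dim + a_dim, h_dim)                    # f_phi(h_{t-1}, z_{t-1}, a_{t-1})
        self.posterior = nn.Linear(h_dim + p_dim + o_dim, 2 * z_dim)   # q_phi(z_t | h_t, p_t, o_t)
        self.prior = nn.Linear(h_dim, 2 * z_dim)                       # p_phi(z_t | h_t)
        self.reward = nn.Linear(h_dim + z_dim, 2)                      # p_phi(r_t | h_t, z_t): mean, log-var
        self.cont = nn.Linear(h_dim + z_dim, 1)                        # p_phi(c_t | h_t, z_t): continuation head
        self.decoder = nn.Linear(h_dim + z_dim, 2 * o_dim)             # p_phi(o_t | h_t, z_t)

    def step(self, h_prev, z_prev, a_prev, p_t, o_t):
        """One RSSM step: recurrent update, then the privileged teacher posterior."""
        h_t = self.rnn(torch.cat([z_prev, a_prev], -1), h_prev)
        mu_teach, logvar_teach = self.posterior(torch.cat([h_t, p_t, o_t], -1)).chunk(2, -1)
        z_t = mu_teach + logvar_teach.mul(0.5).exp() * torch.randn_like(mu_teach)
        return h_t, z_t, (mu_teach, logvar_teach)
```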

Student Encoder and Adaptor

At inference time in the real world, we will only have access to $o_t$. We therefore introduce:

  • an adaptation module $g_\psi$, which learns to predict $\hat p_t$ from $o_t$:

    $$\hat p_t \sim g_\psi\bigl(\hat p_t \mid o_t\bigr), \qquad g_\psi(\hat p\mid o) = \mathcal{N}\bigl(\mu^\text{adapt}(o_t),\,\Sigma^\text{adapt}(o_t)\bigr).$$

    In other words, $g_\psi$ outputs both a mean $\mu^\text{adapt}(o_t)$ and a diagonal covariance $\Sigma^\text{adapt}(o_t)$, so that we can sample multiple $\hat p_t$ to capture uncertainty.

  • a student posterior encoder with parameters $\psi$, which mimics the teacher's latent but only sees $(\hat p_t,\;o_t)$:

    $$\bar z_t \sim q_\psi\bigl(\bar z_t \mid h_t,\;\hat p_t,\;o_t\bigr), \qquad q_\psi(\bar z\mid\cdot) = \mathcal{N}\bigl(\mu^\text{stud}_t,\;\Sigma^\text{stud}_t\bigr).$$

    In practice, we reuse the teacher's deterministic decoder/dynamics/reward/continue networks, but we replace the teacher posterior $q_\phi(z_t\mid h_t,p_t,o_t)$ by our student posterior $q_\psi(\bar z_t\mid h_t,\hat p_t,o_t)$.

Finally, the policy has parameters $\theta$. At inference we will sample

$$a_t \sim \pi_\theta\bigl(a_t\mid \bar z_t\bigr).$$

Note: During the latent-distillation stage, the teacher world model $\phi$ and the policy $\theta$ remain frozen; later we optionally fine-tune $\theta$ on student latent rollouts. A sketch of the student-side modules is given below.
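The following is a minimal PyTorch sketch of the adaptation module and student encoder, assuming the same diagonal-Gaussian parameterization as the teacher; the layer sizes and ELU activations are placeholder choices, not the actual architecture.

```python
import torch
import torch.nn as nn

class Adaptor(nn.Module):
    """g_psi: predicts a Gaussian over the privileged state from the observation alone."""
    def __init__(self, o_dim=256, p_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(o_dim, hidden), nn.ELU(),
                                 nn.Linear(hidden, 2 * p_dim))

    def forward(self, o_t):
        mu, logvar = self.net(o_t).chunk(2, -1)    # mu_adapt(o_t) and log of the diagonal Sigma_adapt(o_t)
        return mu, logvar


class StudentEncoder(nn.Module):
    """q_psi: student posterior over the latent, conditioned on (h_t, p_hat_t, o_t)."""
    def __init__(self, h_dim=512, p_dim=64, o_dim=256, z_dim=32, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(h_dim + p_dim + o_dim, hidden), nn.ELU(),
                                 nn.Linear(hidden, 2 * z_dim))

    def forward(self, h_t, p_hat_t, o_t):
        mu, logvar = self.net(torch.cat([h_t, p_hat_t, o_t], -1)).chunk(2, -1)
        return mu, logvar                          # mu_stud_t and log of the diagonal Sigma_stud_t
```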

Training Objectives

Our goal is twofold:

  1. Teach the student encoder $q_\psi(\bar z_t \mid h_t,\hat p_t,o_t)$ to match the teacher's "privileged" latent $z_t \sim q_\phi(z_t\mid h_t,p_t,o_t)$.
  2. Train (or fine-tune) all student-based dynamics/reward/continue/decoder heads so that using $\bar z_t$ instead of $z_t$ still reconstructs next-step observations, rewards, and continuation flags accurately.

We do this by alternating between two phases on each minibatch of simulator data:

  • (A) Latent-Matching Phase (Distillation): We sample $(p_t,o_t)$ from replay, compute the teacher's posterior $z_t$, and force the student's posterior $\bar z_t$ (which depends on $\hat p_t \sim g_\psi(\cdot\mid o_t)$) to match $z_t$.
  • (B) Student-World-Model Phase (Reconstruction): We sample $\hat p_t \sim g_\psi(\cdot\mid o_t)$, then sample $\bar z_t \sim q_\psi(\bar z_t\mid h_t,\hat p_t,o_t)$, and train the student's deterministic dynamics/reward/continue/decoder, exactly as if $\bar z_t$ were the "true" latent, by minimizing the usual negative ELBO.

Below we give the precise losses for each phase. In the latent-matching phase, we:

  1. Sample a minibatch $\{(p_t,o_t,h_{t-1},z_{t-1},a_{t-1})\}$ from the replay buffer. (Here $h_{t-1}$ and $z_{t-1}$ come from the frozen teacher model when the data were generated.)

  2. Compute teacher posterior

    $$z_t \sim q_\phi\bigl(z_t \mid h_t,\,p_t,\,o_t\bigr), \quad q_\phi(z_t\mid\cdot) = \mathcal{N}\bigl(\mu_t^\text{teach},\,\Sigma_t^\text{teach}\bigr).$$

    (In practice we only need the posterior's mean $\mu_t^\text{teach}$ and covariance $\Sigma_t^\text{teach}$.)

  3. Sample $K$ adaptor samples. For $i=1,\dots,K$:

    $$\hat p_t^{(i)} \sim g_\psi\bigl(\hat p \mid o_t\bigr), \quad g_\psi(\hat p\mid o) = \mathcal{N}\bigl(\mu^\text{adapt}(o_t),\,\Sigma^\text{adapt}(o_t)\bigr).$$
  4. Compute student posterior parameters. For each $i$, let

    $$(\bar\mu_t^{(i)},\,\bar\Sigma_t^{(i)}) = \text{Encoder}_\psi\bigl(h_t,\;\hat p_t^{(i)},\;o_t\bigr),$$

    so that $\bar z_t^{(i)}\sim\mathcal{N}\bigl(\bar\mu_t^{(i)},\,\bar\Sigma_t^{(i)}\bigr)$.

  5. Latent-distillation loss. We match the student's Gaussian $\mathcal{N}(\bar\mu_t^{(i)},\bar\Sigma_t^{(i)})$ to the teacher's $\mathcal{N}(\mu_t^\text{teach},\Sigma_t^\text{teach})$. A convenient closed form is the KL divergence between two Gaussians: for each $i$,

    $$\mathrm{KL}\Bigl[\mathcal{N}(\bar\mu_t^{(i)},\,\bar\Sigma_t^{(i)}) \;\bigl\|\; \mathcal{N}(\mu_t^\text{teach},\,\Sigma_t^\text{teach})\Bigr] = \frac12\Bigl[\mathrm{Tr}\bigl((\Sigma_t^\text{teach})^{-1}\,\bar\Sigma_t^{(i)}\bigr) + (\mu_t^\text{teach}-\bar\mu_t^{(i)})^\top(\Sigma_t^\text{teach})^{-1}(\mu_t^\text{teach}-\bar\mu_t^{(i)}) - d + \ln\frac{\det \Sigma_t^\text{teach}}{\det \bar\Sigma_t^{(i)}}\Bigr],$$

    where $d=\dim(z_t)$. We also add a penalty on the adaptor's variance $\Sigma^\text{adapt}(o_t)$ so that it does not blow up arbitrarily. Concretely, one choice is

    $$\mathcal{L}_\text{distill}(\psi) = \frac1K \sum_{i=1}^K \Bigl[\mathrm{KL}\bigl(\mathcal{N}(\bar\mu_t^{(i)},\,\bar\Sigma_t^{(i)}) \,\|\, \mathcal{N}(\mu_t^\text{teach},\,\Sigma_t^\text{teach})\bigr) + \lambda_\rho\,\mathrm{Tr}\bigl(\Sigma^\text{adapt}(o_t)\bigr)\Bigr].$$

    Here $\lambda_\rho>0$ is a small weight pushing the adaptor's covariance to remain compact (so it doesn't say "I can be infinitely uncertain").

  6. Gradient step 1: Update $\psi$ by descending $\nabla_{\psi}\,\mathcal{L}_\text{distill}$. The frozen teacher $\phi$ does not receive gradients here. A code sketch of this phase follows.
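Here is a minimal PyTorch sketch of the latent-matching phase, assuming diagonal covariances parameterized by log-variances and using the `Adaptor`/`StudentEncoder` interfaces sketched earlier; the hyperparameter values are placeholders.

```python
import torch

def diag_gauss_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL[ N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ], summed over latent dims."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    return 0.5 * (var_q / var_p
                  + (mu_p - mu_q).pow(2) / var_p
                  - 1.0
                  + (logvar_p - logvar_q)).sum(-1)

def distill_loss(adaptor, student_encoder, h_t, o_t, mu_teach, logvar_teach,
                 K=4, lambda_rho=1e-3):
    """Latent-distillation loss: match the student posterior to the (frozen) teacher posterior."""
    mu_a, logvar_a = adaptor(o_t)                        # adaptor g_psi: o_t -> Gaussian over p_hat_t
    kls = []
    for _ in range(K):
        # reparameterized sample of the privileged estimate p_hat_t^{(i)}
        p_hat = mu_a + logvar_a.mul(0.5).exp() * torch.randn_like(mu_a)
        mu_s, logvar_s = student_encoder(h_t, p_hat, o_t)   # student posterior parameters
        kls.append(diag_gauss_kl(mu_s, logvar_s, mu_teach.detach(), logvar_teach.detach()))
    kl_term = torch.stack(kls).mean(0)                   # average over the K adaptor samples
    var_penalty = lambda_rho * logvar_a.exp().sum(-1)    # lambda_rho * Tr(Sigma_adapt(o_t))
    return (kl_term + var_penalty).mean()                # mean over the minibatch
```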

In the student‐world‐model phase, we:

  1. Sample the same minibatch $\{(o_t,h_{t-1},a_{t-1})\}$ (we can reuse the same $h_{t-1}$ from replay).

  2. Sample multiple adaptor outputs. Again for $i=1,\dots,K$:

    $$\hat p_t^{(i)} \sim g_\psi\bigl(\hat p \mid o_t\bigr), \quad \bar z_t^{(i)} \sim q_\psi\bigl(\bar z_t \mid h_t,\;\hat p_t^{(i)},\,o_t\bigr).$$

    We then feed $\bar z_t^{(i)}$ and $h_t$ into the (reused) teacher decoder/dynamics/reward/continue modules.

  3. Reconstruction / ELBO losses. For each sample $i$:

    • Student representation loss

      $$\mathcal{L}_{\text{srep}}(\psi) := \frac{1}{K} \sum_{i=1}^K \max\Bigl(1,\ \mathrm{KL}\bigl[\,q_\psi(\bar z_t^{(i)}\mid h_t,\hat p_t^{(i)},o_t)\;\big\|\;\mathrm{sg}\bigl(p_\phi(z_t \mid h_t)\bigr)\bigr]\Bigr).$$

      Note: $p_\phi(z_t\mid h_t)$ is the teacher's prior (also Gaussian).

    • Student dynamics loss

      $$\mathcal{L}_{\text{sdyn}}(\psi) := \frac{1}{K} \sum_{i=1}^K \max\Bigl(1,\ \mathrm{KL}\bigl[\,\mathrm{sg}\bigl(q_\psi(\bar z_t^{(i)}\mid h_t,\hat p_t^{(i)},o_t)\bigr)\;\big\|\;p_\phi(z_t \mid h_t)\bigr]\Bigr).$$

    • Student prediction loss

      $$\mathcal{L}_{\text{spred}}(\psi) := \frac{1}{K} \sum_{i=1}^K \Bigl[-\ln p_\phi(o_t \mid h_t,\bar z_t^{(i)}) - \ln p_\phi(r_t \mid h_t,\bar z_t^{(i)}) - \ln p_\phi(c_t \mid h_t,\bar z_t^{(i)})\Bigr].$$

Each of these terms already averages over the $K$ adaptor samples; weighting them and summing over the timesteps $t=1,\dots,T$ of the sampled sequence gives

$$\mathcal{L}_\text{world} := \sum_{t=1}^T \Bigl(\beta_\text{srep}\,\mathcal{L}_\text{srep}(\psi) + \beta_\text{sdyn}\,\mathcal{L}_\text{sdyn}(\psi) + \beta_\text{spred}\,\mathcal{L}_\text{spred}(\psi)\Bigr).$$
  4. Gradient step 2: Update $\psi$ (the adaptor and student encoder) and also the student world model's decoders/predictors (which share parameters with $\phi$ except for the posterior). Concretely, we descend

    $$\nabla_{\psi,\phi_{\text{student}}}\;\mathcal{L}_\text{world}.$$

    Here $\phi_{\text{student}}$ denotes any parameters of the teacher model that we allow to fine-tune on top of student latents; typically, one fine-tunes only the decoder and reward/continue heads, leaving the teacher's recurrent prior $f_\phi$ frozen. (But you may optionally leave everything frozen except $\psi$.) A code sketch of this phase follows.
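Below is a minimal PyTorch sketch of the world-model phase for a single timestep, assuming diagonal-Gaussian heads wrapped as `torch.distributions` objects; the `teacher.prior/decode/reward/cont` wrappers, the flattened observation shapes, and the $\beta$ weights are assumptions for illustration, not DreamerV3's actual interfaces.

```python
import torch

def world_model_loss(adaptor, student_encoder, teacher, h_t, o_t, r_t, c_t,
                     K=4, betas=(1.0, 0.5, 1.0)):
    """Student world-model loss for one timestep (L_srep, L_sdyn, L_spred).

    `teacher.prior`, `teacher.decode`, `teacher.reward`, `teacher.cont` are assumed to
    return torch.distributions objects for p_phi(z|h), p_phi(o|h,z), p_phi(r|h,z), p_phi(c|h,z).
    Observations are assumed flattened so that log_prob(...).sum(-1) sums over features.
    """
    beta_srep, beta_sdyn, beta_spred = betas
    prior = teacher.prior(h_t)                          # p_phi(z_t | h_t)
    srep = sdyn = spred = 0.0
    for _ in range(K):
        mu_a, logvar_a = adaptor(o_t)                   # adaptor g_psi
        p_hat = mu_a + logvar_a.mul(0.5).exp() * torch.randn_like(mu_a)
        mu_s, logvar_s = student_encoder(h_t, p_hat, o_t)
        post = torch.distributions.Normal(mu_s, logvar_s.mul(0.5).exp())
        post_sg = torch.distributions.Normal(mu_s.detach(), logvar_s.detach().mul(0.5).exp())
        z_bar = post.rsample()                          # reparameterized student latent

        # free-bits clipping max(1, KL), as in the losses above
        srep += torch.clamp(torch.distributions.kl_divergence(post, prior).sum(-1), min=1.0)
        sdyn += torch.clamp(torch.distributions.kl_divergence(post_sg, prior).sum(-1), min=1.0)
        spred += -(teacher.decode(h_t, z_bar).log_prob(o_t).sum(-1)
                   + teacher.reward(h_t, z_bar).log_prob(r_t)
                   + teacher.cont(h_t, z_bar).log_prob(c_t))
    srep, sdyn, spred = srep / K, sdyn / K, spred / K
    return (beta_srep * srep + beta_sdyn * sdyn + beta_spred * spred).mean()
```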

Putting it all together, for each minibatch $B$ of timesteps $t$, we perform:

  1. Latent-matching step

    $$\psi \leftarrow \psi - \alpha\,\nabla_{\psi}\Bigl[\tfrac{1}{|B|}\sum_{t\in B} \mathcal{L}_\text{distill}(\psi)\Bigr].$$
  2. World-model-training step:

    $$(\psi,\phi_{\text{student}}) \leftarrow (\psi,\phi_{\text{student}}) - \beta\,\nabla_{(\psi,\phi_{\text{student}})}\Bigl[\tfrac{1}{|B|}\sum_{t\in B} \mathcal{L}_\text{world}(\psi, \phi_{\text{student}})\Bigr].$$

Here $\alpha,\beta$ are chosen learning rates. In practice you alternate them every gradient update (or every few gradient updates), so that the student encoder sees both the "teacher matching" signal and the "reconstruction" signal in roughly equal measure. Over many iterations, $\psi$ converges to a point where

$$q_\psi\bigl(\bar z_t \mid h_t,\;g_\psi(o_t),\;o_t\bigr) \approx q_\phi\bigl(z_t \mid h_t,\;p_t,\;o_t\bigr)$$

(i.e., latent distillation), while also ensuring $\bar z_t$ can be fed into the reused decoders to reconstruct consistent next observations, rewards, and continuation signals (the "world-model" part).
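A minimal sketch of the alternating schedule, reusing the `distill_loss` and `world_model_loss` functions sketched above; the replay field names, the shared learning rate, the Adam optimizers, and the strict one-to-one alternation are assumptions.

```python
import torch

def train_student(adaptor, student_encoder, student_heads, replay, steps, lr=3e-4, K=4):
    """Alternate the latent-matching (A) and world-model (B) phases.

    `replay.sample()` is assumed to return a dict with the fields used below; the
    teacher posterior moments were stored when the trajectories were collected.
    `student_heads` holds the reused teacher heads we allow to fine-tune (phi_student).
    """
    opt_distill = torch.optim.Adam(
        list(adaptor.parameters()) + list(student_encoder.parameters()), lr=lr)
    opt_world = torch.optim.Adam(
        list(adaptor.parameters()) + list(student_encoder.parameters())
        + list(student_heads.parameters()), lr=lr)

    for step in range(steps):
        batch = replay.sample()
        if step % 2 == 0:   # (A) latent-matching / distillation step
            loss = distill_loss(adaptor, student_encoder,
                                batch["h"], batch["obs"],
                                batch["mu_teach"], batch["logvar_teach"], K=K)
            opt_distill.zero_grad(); loss.backward(); opt_distill.step()
        else:               # (B) student world-model / reconstruction step
            loss = world_model_loss(adaptor, student_encoder, student_heads,
                                    batch["h"], batch["obs"],
                                    batch["reward"], batch["cont"], K=K)
            opt_world.zero_grad(); loss.backward(); opt_world.step()
```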

Inference (Querying the Policy with Only $o_t$)

Once training converges:

  1. Receive a new real-world observation $o_t$ (no privileged $p_t$ available).

  2. Compute $\bigl[\mu^\text{adapt}(o_t),\,\Sigma^\text{adapt}(o_t)\bigr]$ and sample $\hat p_t\sim\mathcal{N}(\mu^\text{adapt},\,\Sigma^\text{adapt})$. In many implementations one simply takes $\hat p_t = \mu^\text{adapt}(o_t)$ (the mean) to reduce variance.

  3. Compute the student-posterior parameters $\bigl(\bar\mu_t,\;\bar\Sigma_t\bigr) = \text{Encoder}_\psi(h_t,\;\hat p_t,\;o_t)$, then set $\bar z_t = \bar\mu_t$.

  4. Query the student policy

    $$a_t \sim \pi_\theta\bigl(a_t \mid \bar z_t\bigr).$$

  5. Advance the recurrent state with the chosen action: $h_{t+1} = f_\phi\bigl(h_t,\,\bar z_t,\,a_t\bigr)$.

In practice, if you wish to capture adaptation uncertainty in real time, you can sample multiple $\hat p_t$ and average, or pick an action distribution that is robust to that uncertainty. But for many real-world robotics applications, setting $\hat p_t$ equal to the adaptor mean $\mu^\text{adapt}(o_t)$ suffices.
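For completeness, here is a minimal sketch of one inference step under the same assumed module interfaces as above; `rssm` stands in for the frozen recurrent model $f_\phi$ and `policy` for $\pi_\theta$, assumed to return a `torch.distributions` object over actions.

```python
import torch

@torch.no_grad()
def act(adaptor, student_encoder, policy, rssm, h_t, o_t, sample_adaptor=False):
    """One control step from the raw observation only; a sketch of steps 1-5 above."""
    mu_a, logvar_a = adaptor(o_t)                          # adaptor g_psi(. | o_t)
    if sample_adaptor:                                     # optionally sample to reflect uncertainty
        p_hat = mu_a + logvar_a.mul(0.5).exp() * torch.randn_like(mu_a)
    else:
        p_hat = mu_a                                       # common choice: use the adaptor mean
    mu_s, _ = student_encoder(h_t, p_hat, o_t)             # student posterior parameters
    z_bar = mu_s                                           # take the posterior mean as the latent
    a_t = policy(z_bar).sample()                           # a_t ~ pi_theta(a | z_bar)
    h_next = rssm(h_t, z_bar, a_t)                         # h_{t+1} = f_phi(h_t, z_bar, a_t)
    return a_t, h_next
```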