where at each timestep t the environment is in a “true” state st∈S, the agent takes an action at∼πθ(at∣zt), the environment transitions to st+1∼P(⋅∣st,at), and the agent receives reward rt=R(st,at).
In our CARLA‐based simulation, we decompose
st=(pt,ot),
where
pt∈P is the privileged portion of the simulator state (e.g., ground‐truth positions of all vehicles, traffic‐light timings, etc.);
ot∈O is the environmental (or “observable”) information: what your sensors actually see (camera images, LiDAR, IMU, etc.).
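As a concrete illustration of this split, here is a minimal sketch of how st=(pt,ot) might be represented in code; the field names and shapes are hypothetical, not CARLA API identifiers.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class PrivilegedState:                # p_t: only available inside the simulator
    vehicle_positions: np.ndarray     # (N_vehicles, 3) ground-truth world coordinates
    traffic_light_phases: np.ndarray  # (N_lights,) current phase / remaining time
    ego_route_progress: float         # fraction of the planned route completed


@dataclass
class Observation:                    # o_t: what the on-board sensors actually return
    camera_rgb: np.ndarray            # (H, W, 3) front-camera image
    lidar_points: np.ndarray          # (N_points, 4) x, y, z, intensity
    imu: np.ndarray                   # (6,) linear accelerations and angular velocities


@dataclass
class SimState:                       # s_t = (p_t, o_t)
    privileged: PrivilegedState
    observation: Observation
```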
We have pre‐trained a teacher world model with parameters ϕ. Concretely, DreamerV3’s Recurrent State‐Space Model (RSSM) consists of
a recurrent model ht+1 = fϕ(ht, zt, at),
a (privileged) posterior zt ∼ qϕ(zt ∣ ht, pt, ot),
a prior ẑt ∼ pϕ(ẑt ∣ ht),
a decoder ôt ∼ pϕ(ôt ∣ ht, zt),
a reward head r̂t ∼ pϕ(r̂t ∣ ht, zt), and
a continue head ĉt ∼ pϕ(ĉt ∣ ht, zt).
All pϕ and qϕ distributions are implemented as Gaussians with small neural nets providing means and (diagonal) covariances. The teacher world‐model parameters ϕ are frozen once pre‐training finishes.
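For concreteness, here is a minimal PyTorch sketch of one such Gaussian head (a small MLP emitting a mean and a diagonal covariance); the layer sizes and activation choices are illustrative assumptions, not DreamerV3’s actual hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Independent, Normal


class DiagonalGaussianHead(nn.Module):
    """Small MLP that parameterizes a Gaussian with diagonal covariance."""

    def __init__(self, in_dim: int, out_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(),
                                 nn.Linear(hidden, hidden), nn.ELU())
        self.mean = nn.Linear(hidden, out_dim)
        self.pre_std = nn.Linear(hidden, out_dim)

    def forward(self, x: torch.Tensor) -> Independent:
        h = self.net(x)
        std = F.softplus(self.pre_std(h)) + 1e-4   # softplus keeps the diagonal scales positive
        return Independent(Normal(self.mean(h), std), 1)
```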
Student Encoder and Adaptor
At inference time in the real world, we will only have access to ot. We therefore introduce:
an adaptation module gψ, which learns to predict p^t from ot:
p^t ∼ gψ(⋅ ∣ ot) = N(μadapt(ot), Σadapt(ot)).
In other words, gψ outputs both a mean μadapt(ot) and a diagonal‐covariance Σadapt(ot), so that we can sample multiple p^t to capture uncertainty.
a student posterior encoder with parameters ψ, which mimics the teacher’s latent but only sees (p^t, ot):
zt ∼ qψ(zt ∣ ht, p^t, ot).
In practice, we reuse the teacher’s deterministic decoder/dynamics/reward/continue networks, but we replace the teacher posterior qϕ(zt∣ht,pt,ot) by our student posterior qψ(zt∣ht,p^t,ot); a minimal sketch of both modules is given after the note below.
Finally, the policy has parameters θ. At inference we will sample
at∼πθ(at∣zt).
Note: during the latent‐distillation training stage described below, the teacher world model ϕ and the policy θ remain frozen (apart from the optional ϕstudent heads introduced under Gradient‐step 2); later, we optionally fine‐tune θ on student latent rollouts.
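Here is a minimal sketch of the adaptor gψ and the student posterior qψ, reusing the hypothetical DiagonalGaussianHead from the previous snippet; it assumes the observation has already been encoded into a flat obs_embed vector.

```python
# Assumes the imports and the DiagonalGaussianHead sketch from the previous snippet.

class Adaptor(nn.Module):
    """g_psi: predicts a Gaussian over the privileged state p_t from the observation o_t."""

    def __init__(self, obs_dim: int, priv_dim: int):
        super().__init__()
        self.head = DiagonalGaussianHead(obs_dim, priv_dim)

    def forward(self, obs_embed: torch.Tensor) -> Independent:
        # N(mu_adapt(o_t), Sigma_adapt(o_t)); sample it (or take .mean) to obtain p_hat_t.
        return self.head(obs_embed)


class StudentPosterior(nn.Module):
    """q_psi(z_t | h_t, p_hat_t, o_t): mirrors the teacher posterior, but sees p_hat_t instead of p_t."""

    def __init__(self, h_dim: int, priv_dim: int, obs_dim: int, z_dim: int):
        super().__init__()
        self.head = DiagonalGaussianHead(h_dim + priv_dim + obs_dim, z_dim)

    def forward(self, h: torch.Tensor, p_hat: torch.Tensor, obs_embed: torch.Tensor) -> Independent:
        return self.head(torch.cat([h, p_hat, obs_embed], dim=-1))
```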
Training Objectives
Our goal is twofold:
Teach the student encoder qψ(zt∣ht,p^t,ot) to match the teacher’s “privileged” latent zt∼qϕ(zt∣ht,pt,ot).
Train (or fine‐tune) all student‐based dynamics/reward/continue/decoder heads so that using the student’s latent zt (inferred from p^t and ot) in place of the teacher’s latent still reconstructs next‐step observations, rewards, and continuation flags accurately.
We do this by alternating between two phases on each minibatch of simulator data:
(A) Latent‐Matching Phase (Distillation):
We sample (pt,ot) from replay, compute the teacher’s posterior qϕ(zt∣ht,pt,ot), and force the student’s posterior qψ(zt∣ht,p^t,ot), which depends on p^t∼gψ(⋅∣ot), to match it.
(B) Student‐World‐Model Phase (Reconstruction):
We sample p^t∼gψ(⋅∣ot), then sample zt∼qψ(zt∣ht,p^t,ot), and train the student’s deterministic dynamics/reward/continue/decoder—exactly as if zt were the “true” latent—by minimizing the usual negative‐ELBO.
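As a rough preview of phase (B), here is a minimal sketch of such a negative‐ELBO objective computed on a student latent. It assumes the reused heads return torch distributions (so log_prob is available), omits DreamerV3’s KL balancing and loss scales, and uses the imports from the earlier snippets; the precise per‐phase losses follow below.

```python
def world_model_loss(decoder, reward_head, continue_head, prior_head,
                     h, z_dist, obs_target, reward_target, continue_target):
    """Phase (B): negative-ELBO-style loss evaluated with a student latent.

    z_dist is the student posterior q_psi(z_t | h_t, p_hat_t, o_t); prior_head(h)
    gives the teacher prior p_phi(z_t | h_t). All heads return torch distributions.
    """
    z = z_dist.rsample()                        # reparameterized sample, so gradients reach psi
    feat = torch.cat([h, z], dim=-1)

    recon_nll = -decoder(feat).log_prob(obs_target).mean()          # observation reconstruction
    reward_nll = -reward_head(feat).log_prob(reward_target).mean()  # reward prediction
    cont_nll = -continue_head(feat).log_prob(continue_target).mean()  # continuation flag
    kl = torch.distributions.kl_divergence(z_dist, prior_head(h)).mean()  # posterior-to-prior KL

    return recon_nll + reward_nll + cont_nll + kl
```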
Below we give the precise losses in each phase. In latent matching, we:
Sample a minibatch {(pt,ot,ht−1,zt−1,at−1)} from the replay buffer. (Here ht−1 and zt−1 come from the frozen teacher model when the data were generated.)
Compute the recurrent state ht = fϕ(ht−1, zt−1, at−1) and the teacher posterior parameters (μtteach, Σtteach) of the frozen qϕ(zt∣ht,pt,ot).
Compute student posterior parameters
Sample K adaptor predictions p^t(i) ∼ gψ(⋅∣ot), i = 1, …, K. For each i, let
(μt(i), Σt(i)) = Encoderψ(ht, p^t(i), ot),
so that
zt(i) ∼ N(μt(i), Σt(i)).
Latent‐distillation loss
We match the student’s Gaussian N(μt(i), Σt(i)) to the teacher’s N(μtteach, Σtteach). A convenient closed form is the KL divergence between two Gaussians: for each i,
Ldistill(i) = DKL( N(μt(i), Σt(i)) ∥ N(μtteach, Σtteach) ),
and we average over the K adaptor samples, Ldistill = (1/K) Σi Ldistill(i).
Gradient‐step 1: update ψ (the adaptor and the student encoder) by descending ∇ψ Ldistill with learning rate α.
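A sketch of this distillation loss using torch.distributions, which provides the closed‐form Gaussian KL; the teacher posterior is treated as a fixed target since ϕ is frozen.

```python
def latent_distillation_loss(student_dists, teacher_dist):
    """L_distill: KL(student_i || teacher), averaged over the K adaptor samples.

    student_dists: list of Independent(Normal) posteriors, one per sampled p_hat_t^(i)
    teacher_dist:  the teacher posterior N(mu_t^teach, Sigma_t^teach); phi is frozen,
                   so it acts purely as a target.
    """
    kls = [torch.distributions.kl_divergence(q_i, teacher_dist) for q_i in student_dists]
    return torch.stack(kls).mean()
```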
Gradient‐step 2:
Update ψ (the adaptor and student encoder) and also the student‐world‐model’s decoders/predictors (which share parameters with ϕ except for the posterior). Concretely, we descend
∇ψ,ϕstudent Lworld, where Lworld is the usual negative ELBO of Phase (B) above.
Here ϕstudent denotes any parameters of the teacher model that we allow to be fine‐tuned on top of student latents; typically one fine‐tunes only the decoder and reward/continue heads, leaving the teacher’s recurrent prior fϕ frozen. (You may optionally leave everything frozen except ψ.) One way to set up this split is sketched below.
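A possible realization of this parameter split in PyTorch, as a sketch; the module and attribute names (teacher.decoder, teacher.reward_head, …) and the learning rates are assumptions.

```python
# Freeze the entire teacher, then re-enable only the heads we fine-tune on student latents.
for p in teacher.parameters():
    p.requires_grad_(False)
phi_student_modules = (teacher.decoder, teacher.reward_head, teacher.continue_head)
for m in phi_student_modules:             # phi_student: decoder + reward/continue heads
    for p in m.parameters():
        p.requires_grad_(True)
# The recurrent prior f_phi and the teacher posterior stay frozen throughout.

alpha, beta = 1e-4, 3e-4                  # illustrative learning rates, not tuned values
psi_params = list(adaptor.parameters()) + list(student_posterior.parameters())
phi_student_params = [p for m in phi_student_modules for p in m.parameters()]

distill_opt = torch.optim.Adam(psi_params, lr=alpha)                      # phase (A)
world_opt = torch.optim.Adam(psi_params + phi_student_params, lr=beta)    # phase (B)
```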
Putting it all together, for each minibatch B of timesteps t, we perform:
(A) one gradient step on the latent‐distillation loss Ldistill, updating ψ with learning rate α;
(B) one gradient step on the student world‐model loss Lworld, updating (ψ, ϕstudent) with learning rate β.
Here α,β are chosen learning rates. In practice you alternate them every gradient update (or every few gradient updates), so that the student encoder sees both the “teacher matching” signal and the “reconstruction” signal in roughly equal measure. Over many iterations, ψ converges to a point where
qψ(zt∣ht,gψ(ot),ot)≈qϕ(zt∣ht,pt,ot)
(i.e., latent distillation), while also ensuring zt can be fed into the reused decoders to reconstruct consistent next observations, rewards, and continuation signals (the “world‐model” part).
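A minimal sketch of this alternation, assuming the hypothetical modules, losses, and optimizers from the earlier snippets and a replay iterator whose batches carry the fields named in the comments.

```python
K = 4                      # number of adaptor samples p_hat_t^(i) per timestep (illustrative)

for step, batch in enumerate(replay_loader):
    # h_t = f_phi(h_{t-1}, z_{t-1}, a_{t-1}); the teacher recurrent model stays frozen.
    h = teacher.rssm(batch.h_prev, batch.z_prev, batch.a_prev)

    # Teacher posterior q_phi(z_t | h_t, p_t, o_t): phi is frozen, so it is a fixed target.
    teacher_post = teacher.posterior(h, batch.p, batch.obs_embed)

    # Student side: K adaptor samples p_hat_t^(i), each giving a student posterior.
    p_hat_dist = adaptor(batch.obs_embed)
    student_posts = [student_posterior(h, p_hat_dist.rsample(), batch.obs_embed)
                     for _ in range(K)]

    if step % 2 == 0:
        # Phase (A): latent matching, learning rate alpha via distill_opt.
        loss = latent_distillation_loss(student_posts, teacher_post)
        distill_opt.zero_grad()
        loss.backward()
        distill_opt.step()
    else:
        # Phase (B): student world model (negative ELBO on a student latent), learning rate beta.
        loss = world_model_loss(teacher.decoder, teacher.reward_head, teacher.continue_head,
                                teacher.prior, h, student_posts[0],
                                batch.obs, batch.reward, batch.cont)
        world_opt.zero_grad()
        loss.backward()
        world_opt.step()
```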
Inference (Querying the Policy with Only ot)
Once training converges:
Receive a new real‐world observation ot (no privileged pt available).
Compute [μadapt(ot),Σadapt(ot)] and sample p^t∼N(μadapt,Σadapt). In many implementations one simply takes p^t=μadapt(ot) (the mean) to reduce variance.
Compute the student‐posterior parameters (μt,Σt)=Encoderψ(ht,p^t,ot), then set zt=μt.
Advance the recurrent state:
ht+1=fϕ(ht,zt,at).
Query the student policy
at∼πθ(at∣zt).
In practice, if you wish to capture adaptation uncertainty in real time, you can sample multiple p^t and average or pick an action distribution that is robust to that uncertainty. But for many real‐world robotics applications, setting p^t equal to the adaptor‐mean μadapt(ot) suffices.
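As a sketch, one real‐time control step could then look as follows (again using the hypothetical modules from the earlier snippets; no privileged pt is touched).

```python
@torch.no_grad()
def act(h, obs_embed, adaptor, student_posterior, teacher, policy):
    """One control step that uses only the observation o_t; no privileged p_t is needed.

    `policy(z)` is assumed to return a torch distribution over actions, and
    `teacher.rssm` stands in for the frozen recurrent model f_phi.
    """
    p_hat = adaptor(obs_embed).mean                  # p_hat_t = mu_adapt(o_t); sample() instead to keep uncertainty
    z = student_posterior(h, p_hat, obs_embed).mean  # z_t = mu_t
    action = policy(z).sample()                      # a_t ~ pi_theta(a_t | z_t)
    h_next = teacher.rssm(h, z, action)              # h_{t+1} = f_phi(h_t, z_t, a_t)
    return action, h_next
```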