Actor-Critic Methods

May 16, 2025

Recall the notion of importance weights from the previous post. One drawback I didn't discuss is that importance sampling is only suitable when the two policies are very similar to one another.

If the policies are very different, then the importance weights can blow up or vanish, giving us a high-variance (and therefore worse) estimate of the gradient of the loss function.

There is also the issue, demonstrated in the example from the last post, that policy gradients don't make efficient use of data! If a robot rewarded for forward velocity takes one small step forward and then falls backward, the poor return of the whole trajectory pushes down the likelihood of every action in it, including the good step forward.

With sparse rewards, actions that were partially correct provide no learning signal to policy gradients, since no actual reward is received.

In this post we'll discuss an alternative class of algorithms called actor-critic methods.

Value Functions

Let's revisit some useful functions related to Markov Decision Processes (MDPs).

  • The (on-policy) value function V^{\pi}(\mathbf{s}) - future expected rewards starting at \mathbf{s} and following \pi.
V^{\pi}(\mathbf{s}) = \bbe_{\tau \sim p_{\theta}(\tau)}\left[r(\tau) \mid \mathbf{s}_1 = \mathbf{s}\right]
  • The (on-policy) action-value function Q^{\pi}(\mathbf{s}, \mathbf{a}) - future expected rewards starting at \mathbf{s}, taking \mathbf{a}, then following \pi.
Q^{\pi}(\mathbf{s}, \mathbf{a}) = \bbe_{\tau \sim p_{\theta}(\tau)} \left[r(\tau) \mid \mathbf{s}_1 = \mathbf{s}, \mathbf{a}_1 = \mathbf{a}\right]

One useful relation between these is that

V^{\pi}(\mathbf{s}) = \bbe_{\mathbf{a} \sim \pi(\cdot \mid \mathbf{s})} \left[Q^{\pi}(\mathbf{s}, \mathbf{a})\right]

i.e. if we average the Q-function over actions drawn from the policy \pi, we just end up with the value of the state. There is yet a third value function.

  • The (on-policy) advantage function A^{\pi}(\mathbf{s}, \mathbf{a}) - how much better it is to take \mathbf{a} than to follow \pi at state \mathbf{s}.
A^{\pi}(\mathbf{s}, \mathbf{a}) = Q^{\pi}(\mathbf{s}, \mathbf{a}) - V^{\pi}(\mathbf{s})
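To make these relations concrete, here is a minimal numpy sketch for a single state with three discrete actions; the numbers are made up purely for illustration.

```python
import numpy as np

# Hypothetical toy example: one state with 3 discrete actions.
# q[a] is Q^pi(s, a); pi[a] is the policy's probability of action a in s.
q = np.array([1.0, 2.0, 0.5])
pi = np.array([0.2, 0.5, 0.3])

# V^pi(s) = E_{a ~ pi(.|s)}[ Q^pi(s, a) ]
v = np.dot(pi, q)

# A^pi(s, a) = Q^pi(s, a) - V^pi(s)
advantage = q - v

print(v)          # 1.35
print(advantage)  # [-0.35  0.65 -0.85]
```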

Revisiting Policy Gradient

Recall our policy gradient had the (non-final) form

\nabla_{\theta} \fanl(\theta) \approx -\frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}(\mathbf{a}_t \mid \mathbf{s}_t)\underbrace{\left(\sum_{t'=t}^T r(\mathbf{s}_{t'}, \mathbf{a}_{t'})\right)}_{\text{reward to go}}

The term on the right is the estimate of future rewards if we take action \mathbf{a}_t in state \mathbf{s}_t. Can we get a better estimate?

\sum_{t'=t}^T \bbe_{\pi_{\theta}}\left[r(\mathbf{s}_{t'}, \mathbf{a}_{t'}) \mid \mathbf{s}_t, \mathbf{a}_t\right] = Q^{\pi_{\theta}}(\mathbf{s}_t, \mathbf{a}_t)

can be seen as the true expected rewards to go. This would be way better!

\nabla_{\theta} \fanl(\theta) \approx -\frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}(\mathbf{a}_t \mid \mathbf{s}_t)Q^{\pi_{\theta}}(\mathbf{s}_{t}, \mathbf{a}_{t})

Should we use baselines like before? Our average Q-value would look like

b = \frac{1}{N} \sum_{i=1}^N Q^{\pi_{\theta}}(\mathbf{s}_{t}, \mathbf{a}_t)

This looks an awful lot like the expected value of the Q-function over actions distributed according to our policy. This intuitively suggests that a good baseline would be our value function V^{\pi_{\theta}}(\mathbf{s}_t).

Recall that A^{\pi}(\mathbf{s}, \mathbf{a}) = Q^{\pi}(\mathbf{s}, \mathbf{a}) - V^{\pi}(\mathbf{s}) is the definition of our advantage function. Thus, naturally, we should make our policy gradient

\nabla_{\theta} \fanl(\theta) \approx -\frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}(\mathbf{a}_t \mid \mathbf{s}_t)A^{\pi_{\theta}}(\mathbf{s}_{t}, \mathbf{a}_{t})
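In code, this estimator is usually implemented as a surrogate loss whose gradient matches the expression above. A minimal PyTorch-style sketch, assuming `log_probs` and `advantages` have already been computed as flat tensors over all sampled (state, action) pairs:

```python
import torch

def policy_gradient_loss(log_probs, advantages):
    """Surrogate loss whose gradient matches the advantage-weighted estimator.

    log_probs:  log pi_theta(a_t | s_t) for each sampled pair, with gradients
                flowing back into the policy parameters.
    advantages: advantage estimates A^pi(s_t, a_t), treated as constants
                (hence .detach()).
    """
    return -(log_probs * advantages.detach()).mean()
```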

The key is that better estimates of A lead to less noisy gradients! How can we estimate the advantage function?

Estimating Value

Since the advantage is a function of both the value and the action-value, one might think we have to estimate both well to estimate the advantage. However, there is actually a way to approximate all three quantities using only V.

\begin{align*} Q^{\pi}(\bfs_t, \bfa_t) &= \sum_{t'=t}^T \bbe_{\pi_{\theta}}\left[r(\bfs_{t'}, \bfa_{t'}) \mid \bfs_t, \bfa_t\right] \\[15pt] &= r(\bfs_t, \bfa_t) + \sum_{t'=t+1}^T \bbe_{\pi_{\theta}}\left[r(\bfs_{t'}, \bfa_{t'}) \mid \bfs_t, \bfa_t\right] \\[15pt] &= r(\bfs_t, \bfa_t) + \bbe_{\bfs_{t+1} \sim p(\cdot \mid \bfs_t, \bfa_t)}\left[V^{\pi}(\bfs_{t+1})\right] \\[15pt] &\approx r(\bfs_t, \bfa_t) + V^{\pi}(\bfs_{t+1}) \quad \text{(use the sampled } \bfs_{t+1} \text{)} \\[15pt] A^{\pi}(\bfs_t, \bfa_t) &\approx r(\bfs_t, \bfa_t) + V^{\pi}(\bfs_{t+1}) - V^{\pi}(\bfs_t) \end{align*}

Thus we can actually just fit V^{\pi}.
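As a sketch, once we have a fitted value network this one-step advantage estimate is a single line. Here `value_net` is an assumed network mapping a batch of states to scalar values, and the `gamma` argument anticipates the discount factor introduced later in this post.

```python
import torch

def one_step_advantage(value_net, s, r, s_next, gamma=1.0):
    """A^pi(s, a) ~= r(s, a) + gamma * V(s') - V(s), using a fitted value_net.

    gamma=1.0 matches the undiscounted derivation above; a discount factor
    is introduced further below.
    """
    with torch.no_grad():  # the advantage is a target, not something we differentiate
        return r + gamma * value_net(s_next).squeeze(-1) - value_net(s).squeeze(-1)
```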

Version 1: Monte Carlo

Our original single sample estimate is

V^{\pi}(\bfs_t) \approx r(\tau)

where \tau \sim p_{\theta}(\tau \mid \bfs_t) is a trajectory starting at \bfs_t. We can aggregate these into a labeled dataset of states \bfs_t along with their single-sample estimates

\left\{\left(\bfs^{(i)}_t, r\big(\tau^{(i)}\big)\right)\right\}

where \tau^{(i)} \sim p_{\theta}(\tau \mid \bfs^{(i)}_t) is one of the N rollouts. Then we do supervised learning to fit the estimated value function.

\fanl(\phi) = \frac{1}{2} \sum_{i=1}^N \norm{\hat{V}^{\pi_{\theta}}_{\phi}\big(\bfs^{(i)}_t\big) - r\big(\tau^{(i)}\big)}^2
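A minimal sketch of this supervised fit, assuming a `value_net` module and flat tensors of states and reward-to-go labels (the hyperparameters are arbitrary placeholders):

```python
import torch

def fit_value_mc(value_net, states, reward_to_go, epochs=50, lr=1e-3):
    """Monte Carlo value fitting: regress V_phi(s_t) onto sampled reward-to-go.

    states:        tensor of shape (num_samples, state_dim)
    reward_to_go:  tensor of shape (num_samples,), the single-sample return labels
    """
    opt = torch.optim.Adam(value_net.parameters(), lr=lr)
    for _ in range(epochs):
        pred = value_net(states).squeeze(-1)
        loss = 0.5 * ((pred - reward_to_go) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return value_net
```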

Version 2: Bootstrapping

Our Monte Carlo target is r(\tau) and our ideal target is

\begin{align*} y &= \bbe_{\tau \sim p_{\theta}(\tau \mid \bfs_t)}\left[r\big(\tau\big) \mid \bfs_t\right] \\[15pt] &= \sum_{t'=t}^T \bbe_{\tau \sim p_{\theta}(\tau \mid \bfs_t)}\left[r\big(\bfs_{t'}, \bfa_{t'}\big) \mid \bfs_t\right] \\[15pt] &\approx r\big(\bfs_t, \bfa_t\big) + \sum_{t'=t+1}^T \bbe_{\tau \sim p_{\theta}(\tau \mid \bfs_{t+1})}\left[r\big(\bfs_{t'}, \bfa_{t'}\big) \mid \bfs_{t+1}\right]\\[15pt] &\approx r\big(\bfs_t, \bfa_t\big) + V^{\pi_{\theta}}\big(\bfs_{t+1}\big) \\[15pt] &\approx r\big(\bfs_t, \bfa_t\big) + \hat{V}^{\pi_{\theta}}_{\phi}\big(\bfs_{t+1}\big) \end{align*}

where \bfs_{t+1} is a next state sampled from the world given the state \bfs_t and action \bfa_t sampled from \pi_{\theta}(\cdot \mid \bfs_t).

We can aggregate the dataset of "bootstrapped" estimates

\bigg\{\bigg(\bfs^{(i)}_t, r\big(\bfs^{(i)}_t, \bfa^{(i)}_t\big) + \hat{V}^{\pi_{\theta}}_{\phi}\big(\bfs^{(i)}_{t+1}\big)\bigg)\bigg\}

and then run supervised learning to fit the estimated value function

\fanl(\phi) = \frac{1}{2} \sum_{i=1}^N \norm{ r\big(\bfs^{(i)}_t, \bfa^{(i)}_t\big) + \hat{V}^{\pi_{\theta}}_{\phi}\big(\bfs^{(i)}_{t+1}\big) - \hat{V}^{\pi_{\theta}}_{\phi}\big(\bfs^{(i)}_t\big) }^2

This is sometimes referred to as a version of temporal difference (TD) learning. Each term

r\big(\bfs_t, \bfa_t\big) + \hat{V}^{\pi_{\theta}}_{\phi}\big(\bfs_{t+1}\big) - \hat{V}^{\pi_{\theta}}_{\phi}\big(\bfs_t\big)

is known as the TD error.
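A sketch of one round of this bootstrapped fit: the targets are computed once with the current network and held fixed, so the quantity being squared is exactly the TD error. The names and hyperparameters are assumptions, and terminal-state handling is omitted.

```python
import torch

def fit_value_bootstrap(value_net, states, rewards, next_states, epochs=50, lr=1e-3):
    """One round of bootstrapped value fitting with fixed (detached) targets."""
    opt = torch.optim.Adam(value_net.parameters(), lr=lr)
    with torch.no_grad():
        # Targets y = r(s_t, a_t) + V_phi(s_{t+1}), computed with the current network.
        targets = rewards + value_net(next_states).squeeze(-1)
    for _ in range(epochs):
        pred = value_net(states).squeeze(-1)
        td_error = targets - pred        # r + V(s') - V(s)
        loss = 0.5 * (td_error ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return value_net
```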

We can also do the same with the discounted value function, with discount factor \gamma \in (0,1) (usually 0.99). Namely

V^{\pi}(\bfs) = \bbe_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t=0}^{\infty} \gamma^t r(\bfs_t, \bfa_t) \mid \bfs_0 = \bfs\right]

and the discounted TD error becomes

r\big(\bfs_t, \bfa_t\big) + \gamma \hat{V}^{\pi_{\theta}}_{\phi}\big(\bfs_{t+1}\big) - \hat{V}^{\pi_{\theta}}_{\phi}\big(\bfs_t\big)

Then the complete algorithm is as below.

  1. Sample a batch of data \{(\bfs^{(i)}_1, \bfa^{(i)}_1, \ldots, \bfs^{(i)}_T, \bfa^{(i)}_T)\} from \pi_{\theta}.
  2. Fit \hat{V}^{\pi_{\theta}}_{\phi} to the summed rewards in the data.
  3. Evaluate \hat{A}^{\pi_{\theta}}(\bfs^{(i)}_t, \bfa^{(i)}_t) = r(\bfs^{(i)}_t, \bfa^{(i)}_t) + \gamma \hat{V}^{\pi_{\theta}}_{\phi}(\bfs^{(i)}_{t+1}) - \hat{V}^{\pi_{\theta}}_{\phi}(\bfs^{(i)}_{t}) for all t, i.
  4. Evaluate \nabla_{\theta} \fanl(\theta) \approx - \frac{1}{N}\sum_{t, i} \nabla_{\theta} \log \pi_{\theta}(\bfa^{(i)}_t \mid \bfs^{(i)}_t) \hat{A}^{\pi_{\theta}}(\bfs^{(i)}_t, \bfa^{(i)}_t).
  5. Update \theta \leftarrow \theta - \eta \nabla_{\theta} \fanl(\theta).
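Putting the five steps together, here is a minimal sketch of one iteration. Everything about the `policy`, `value_net`, optimizers, and `batch` layout is an assumption of the sketch, and step 1 (collecting rollouts) is assumed to have happened already.

```python
import torch

def actor_critic_iteration(policy, value_net, policy_opt, value_opt,
                           batch, gamma=0.99, value_epochs=50):
    """One iteration of the batch actor-critic algorithm above (a sketch).

    `batch` is assumed to hold flat tensors over all (i, t): states, actions,
    rewards, next_states, reward_to_go.  `policy(states)` is assumed to return
    a torch.distributions object whose log_prob gives per-sample values.
    """
    s, a = batch["states"], batch["actions"]
    r, s_next = batch["rewards"], batch["next_states"]

    # Step 2: fit V_phi to the sampled reward-to-go.
    for _ in range(value_epochs):
        v_loss = 0.5 * ((value_net(s).squeeze(-1) - batch["reward_to_go"]) ** 2).mean()
        value_opt.zero_grad()
        v_loss.backward()
        value_opt.step()

    # Step 3: discounted one-step advantage estimates (treated as constants).
    with torch.no_grad():
        adv = r + gamma * value_net(s_next).squeeze(-1) - value_net(s).squeeze(-1)

    # Steps 4-5: advantage-weighted policy gradient step.
    p_loss = -(policy(s).log_prob(a) * adv).mean()
    policy_opt.zero_grad()
    p_loss.backward()
    policy_opt.step()
```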

Off-Policy Actor-Critic

Currently the above algorithm is on-policy since we have to rerun the policy to generate training samples for every gradient step.

The first thing we can do is use the same idea of importance sampling as before, so that we can reuse samples collected from old policies. Step 4 above would become

\nabla_{\theta} \fanl(\theta) \approx - \frac{1}{N}\sum_{t, i} \underbrace{\frac{\pi_{\theta}(\bfa^{(i)}_t \mid \bfs^{(i)}_t)}{\pi_{\theta'}(\bfa^{(i)}_t \mid \bfs^{(i)}_t)}}_{\text{importance weights}} \nabla_{\theta} \log \pi_{\theta}(\bfa^{(i)}_t \mid \bfs^{(i)}_t) \underbrace{\hat{A}^{\pi_{\theta'}}(\bfs^{(i)}_t, \bfa^{(i)}_t)}_{\text{old policy advantage}}

Some nasty issues can arise though. In RL, the combination of off-policy learning, function approximation, and bootstrapping is referred to as the deadly triad.

Over-iterating on the same data without fresh on-policy coverage breaks the contraction guarantees your Bellman-style updates rely on. The result is classic RL divergence rather than convergence to a sensible value function.

Idea 1: Use a KL divergence constraint on the policy.

\bbe_{\bfs \sim \pi_{\theta'}}\left[D_{KL}(\pi_{\theta}(\cdot \mid \bfs) || \pi_{\theta'}(\cdot \mid \bfs))\right] \leq \delta

We see this in LLM preference optimization (RLHF).
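A sketch of how this constraint term can be estimated from states gathered under the old policy, assuming both policies return `torch.distributions` objects:

```python
import torch

def mean_kl(policy_new, policy_old, states):
    """Monte Carlo estimate of E_{s ~ pi_old}[ KL(pi_new(.|s) || pi_old(.|s)) ].

    `states` are assumed to have been collected while running the old policy.
    """
    return torch.distributions.kl_divergence(policy_new(states), policy_old(states)).mean()
```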

Idea 2: Can we bound the importance weights? This doesn't directly constrain the policy, but it removes the incentive to drift too far from the old one. This is the key idea behind proximal policy optimization (PPO).
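A sketch of that bounding, written as a loss: this is the clipped surrogate objective that PPO uses, with the tensor names and the clip range `eps` being assumptions.

```python
import torch

def clipped_surrogate(log_prob_new, log_prob_old, advantages, eps=0.2):
    """PPO-style clipped objective: bound the importance weight to [1-eps, 1+eps].

    All inputs are per-sample tensors; log_prob_old and advantages come from
    the data-collection policy and are treated as constants.
    """
    ratio = torch.exp(log_prob_new - log_prob_old.detach())
    adv = advantages.detach()
    return -torch.minimum(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()
```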

We will talk about these ideas later, but for now our solution will be replay buffers. The idea of the replay buffer is to store the recent history of transitions seen in prior timesteps. By a transition we mean a state-action-reward-next state tuple (SARS).

We then sample minibatches of transitions from the buffer and use them to estimate our value function and take gradient steps.
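A minimal sketch of such a buffer (the capacity and the tuple storage format are assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s') transitions, sampled in minibatches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```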

The overview of the broken algorithm is as follows.

  1. Collect experience \{(\bfs_i, \bfa_i)\} from \pi_{\theta}(\bfa \mid \bfs). Add this to the replay buffer \calr.
  2. Sample a minibatch \{(\bfs_i, \bfa_i, r_i, \bfs_i')\} from the buffer \calr.
  3. Update \hat{V}^{\pi}_{\phi}(\bfs) using targets y_i = r_i + \gamma \textcolor{red}{\hat{V}^{\pi}_{\phi}(\bfs_i')} for each \bfs_i.
  4. Evaluate \hat{A}^{\pi}(\bfs_i, \bfa_i) = r_i + \gamma \hat{V}^{\pi}_{\phi}(\bfs_i') - \hat{V}^{\pi}_{\phi}(\bfs_i).
  5. \nabla_{\theta}\fanl(\theta) \approx -\frac{1}{N}\sum_i \nabla_{\theta} \textcolor{red}{\log \pi_{\theta}(\bfa_i \mid \bfs_i)} \hat{A}^{\pi}(\bfs_i, \bfa_i)
  6. \theta \leftarrow \theta - \eta \nabla_{\theta}\fanl(\theta)

Steps 3 through 6 are carried out on the minibatches sampled from replay memory. Why is this algorithm broken? Note that in step 3 the TD target is not correct, and in step 5 the action whose log-probability we evaluate is not correct either!

Why? Note that we are using transitions from previous policies to fit the value function for our current policy. The actions stored in the buffer, and therefore the next states, are not necessarily the ones our current policy would have produced.

What if we fit Q(\bfs, \bfa) instead of V(\bfs)? Now we can take into account the action that was actually taken, instead of assuming actions are distributed according to our current policy.

I will spare you the math and just say that our new targets are

y_i = r_i + \gamma \textcolor{green}{\hat{Q}^{\pi_{\theta}}_{\phi}(\bfs'_i, {\bfa'}^{\pi}_i)}

where {\bfa'}^{\pi}_i \sim \pi_{\theta}(\cdot \mid \bfs'_i). This recovers our intended target in expectation, since sampling the action gives an unbiased single-sample estimate of the state value via

\hat{V}^{\pi_{\theta}}_{\phi}(\bfs_i) = \bbe_{\bfa \sim \pi_{\theta}(\cdot \mid \bfs_i)}\left[\hat{Q}^{\pi_{\theta}}_{\phi}(\bfs_i, \bfa)\right]

Similarly, in step 5 the stored action \bfa_i is not necessarily an action our current policy \pi_{\theta} would have taken. We can use the same trick, but this time at \bfs_i instead of \bfs_i', i.e. we sample \bfa_i^{\pi} \sim \pi_{\theta}(\cdot \mid \bfs_i) and set

\nabla_{\theta}\fanl(\theta) \approx -\frac{1}{N}\sum_i \nabla_{\theta}\log \pi_{\theta}(\textcolor{green}{\bfa^{\pi}_i} \mid \bfs_i)\hat{A}^{\pi}(\bfs_i, \bfa_i)
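Putting both green fixes into code, here is a sketch of the target and loss computations for a sampled minibatch. The `q_net` and `policy` interfaces are assumptions, and so is the choice to weight the resampled action's log-probability by \hat{Q}(\bfs_i, \bfa_i^{\pi}) directly rather than by a fuller advantage estimate.

```python
import torch

def off_policy_targets_and_loss(policy, q_net, batch, gamma=0.99):
    """Sketch of the two fixes above for minibatches drawn from the replay buffer.

    `q_net(s, a)` is assumed to return Q_phi(s, a); `policy(s)` is assumed to
    return a torch.distributions object with sample() and log_prob().
    """
    s, a = batch["states"], batch["actions"]
    r, s_next = batch["rewards"], batch["next_states"]

    # Fix 1: bootstrap with an action drawn from the *current* policy at s'.
    with torch.no_grad():
        a_next = policy(s_next).sample()
        q_targets = r + gamma * q_net(s_next, a_next).squeeze(-1)

    # Critic loss: regress Q_phi(s, a) onto the corrected targets.
    q_loss = 0.5 * ((q_net(s, a).squeeze(-1) - q_targets) ** 2).mean()

    # Fix 2: evaluate log pi_theta at an action resampled from the current policy.
    a_pi = policy(s).sample()
    with torch.no_grad():
        # One option (an assumption of this sketch): use Q_phi(s, a_pi) as the
        # weight in place of the advantage; a baseline could be subtracted too.
        weight = q_net(s, a_pi).squeeze(-1)
    policy_loss = -(policy(s).log_prob(a_pi) * weight).mean()

    return q_loss, policy_loss
```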

Conclusion

This was a quick overview of actor-critic methods and the bootstrapping and sampling techniques used in modern RL. It is also a doorway to more standard algorithms like PPO (proximal policy optimization) and RLHF for LLMs, which I will discuss in subsequent posts.

There are also more interesting techniques for approximating the V and Q functions that I plan to write about, popularly known as deep Q-learning or deep Q-networks (DQNs). I have a previous post on the topic, but I intend to delve deeper into the math, in line with this series of posts on RL.