We have a reinforcement learning policy π, and we want to evaluate how good that policy actually is. How do we evaluate the RL objective?
Recall we denote a trajectory by $\tau$, which is short for an ordered tuple $(s_1, a_1, r_1, s_2, a_2, \dots, a_{T-1}, s_T, r_T)$ where $s_T$ is a terminal state.
We denote our stochastic policy by $\pi_\theta(\cdot \mid s_t)$, such that $\pi_\theta(a_t \mid s_t)$ is the probability our policy chooses action $a_t$ at state $s_t$.
We can use this to define a probability distribution over our trajectories.
$$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
Here, the last factor $p(s_{t+1} \mid s_t, a_t)$ represents the probability the environment transitions to state $s_{t+1}$ from state $s_t$ given the action $a_t$.
Then our objective is to find
$$\theta^* = \arg\min_\theta L(\theta) = \arg\min_\theta \left(-\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} r(s_t, a_t)\right]\right)$$
where $L(\theta) = -\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} r(s_t, a_t)\right]$ is the expected future cost (loss) of our policy. We evaluate it approximately by sampling a batch of trajectories and averaging their total rewards. That is,
$$L(\theta) \approx -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} r(s_{i,t}, a_{i,t})$$
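As a quick sketch of this Monte Carlo approximation (the `sample_rewards` rollout helper below is hypothetical, standing in for one rollout of the current policy that returns its list of per-step rewards):

```python
import numpy as np

def estimate_loss(sample_rewards, N=100):
    # Approximate L(theta) = -E[sum_t r(s_t, a_t)] by averaging the
    # total reward of N sampled trajectories and negating.
    returns = [sum(sample_rewards()) for _ in range(N)]
    return -float(np.mean(returns))
```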
We use the shorthand $r(\tau) = \sum_{t=1}^{T} r(s_t, a_t)$. This expectation is hard to differentiate directly, so we use the identity $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)$ to obtain

$$\nabla_\theta L(\theta) = -\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right] \approx -\frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right) r(\tau_i),$$

where the initial-state and transition probabilities drop out of $\nabla_\theta \log p_\theta(\tau)$ because they do not depend on $\theta$.
Then we can formulate the REINFORCE or vanilla policy gradient algorithm.
1. Sample trajectories $\{\tau_i\}$ from $\pi_\theta(a_t \mid s_t)$.
2. Approximate $\nabla_\theta L(\theta)$ as above using the sampled trajectories.
3. Update $\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)$. Go back to (1).
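Below is a minimal sketch of this loop, assuming a toy two-state, two-action MDP and a tabular softmax policy; the MDP, parameterization, and hyperparameters are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T, N, eta = 2, 2, 10, 64, 0.1

# Toy MDP: transition probabilities P[s, a, s'] and rewards R[s, a].
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])

theta = np.zeros((n_states, n_actions))  # tabular softmax policy logits

def policy(s):
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

for it in range(200):
    grad = np.zeros_like(theta)              # estimate of -grad_theta L(theta)
    avg_return = 0.0
    for _ in range(N):                       # 1. sample N trajectories
        s, ret = 0, 0.0
        glogp = np.zeros_like(theta)         # grad_theta log p_theta(tau)
        for t in range(T):
            p = policy(s)
            a = rng.choice(n_actions, p=p)
            # grad of log pi_theta(a|s) for softmax: onehot(a) - pi(.|s)
            glogp[s] -= p
            glogp[s, a] += 1.0
            ret += R[s, a]
            s = rng.choice(n_states, p=P[s, a])
        grad += glogp * ret                  # 2. accumulate grad log p * r(tau)
        avg_return += ret
    theta += eta * grad / N                  # 3. theta <- theta - eta * grad L
    if it % 50 == 0:
        print(f"iter {it:3d}  avg return {avg_return / N:.2f}")
```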
Intuition:
Increase likelihood of actions you took in high reward trajectories.
Decrease likelihood of actions you took in negative reward trajectories.
Improving the Gradient
Suppose we have a humanoid robot learning to walk forward, and we let the reward be given by the forward velocity of the robot. In the initial training, we observe the following trajectories.
τ1: falls backwards ❌
τ2: falls forwards ✅️✅
τ3: manages to stand still ✅
τ4: one small step forward then falls backwards ❌
τ5: one large step backwards then small step forwards ❌
Q: What will the gradient encourage the policy to do?
A: It will encourage the policy to fall forward, not to take a step forward.
So what's the issue? The issue is that the policy gradient has no sense of causality: it weights every action in a trajectory by the trajectory's total reward, as if actions could influence rewards that came before them. The fix is to weight each action only by the reward to go, i.e. the reward collected from that timestep onward:

$$\nabla_\theta L(\theta) \approx -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left(\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\right)$$
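A minimal sketch of the reward-to-go computation for a single trajectory, which is just a reverse cumulative sum (the function name is illustrative):

```python
import numpy as np

def reward_to_go(rewards):
    # Credit each action only with rewards collected at or after its
    # timestep: rtg[t] = r_t + r_{t+1} + ... + r_T.
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

# e.g. reward_to_go([1.0, 0.0, 2.0]) -> array([3., 2., 2.])
```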
Now with our updated gradient, we observe the following behavior.
τ1: falls forwards ✅
τ2: slowly stumbles forward ✅️✅
τ3: steadily walks forwards ✅✅✅
τ4: runs forwards ✅✅✅✅
Q: What will the gradient encourage the policy to do?
A: The robot will be encouraged to run, but will still stumble or fall forward some of the time.
The issue is that the robot is still gaining positive reward from stumbling or falling forward. Instead, we want to encourage the robot to get more reward than the average over each batch of policy rollouts.
In our original expected loss, we had
$$\nabla_\theta L(\theta) = -\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right]$$
but what we want to do is subtract a constant $b$ from the reward of each trajectory $\tau$, given by the average reward across the $N$ rollouts:
$$\nabla_\theta L(\theta) = -\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, (r(\tau) - b)\right]$$
where
$$b = \frac{1}{N} \sum_{i=1}^{N} r(\tau_i)$$
is the baseline reward. It turns out this has no effect on the gradient of the loss in expectation, since $\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, b\right] = b \int \nabla_\theta p_\theta(\tau)\, d\tau = b\, \nabla_\theta \int p_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0$, so this is mathematically consistent.
Therefore subtracting the constant baseline is unbiased and reduces the variance of the gradient. Average reward is a pretty good baseline, but there is an optimal baseline one can derive by analyzing the variance of the gradient.
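As a sketch of the change, assuming the per-trajectory $\nabla_\theta \log p_\theta(\tau_i)$ terms and returns have already been collected (with the same tabular $\theta$ shape as the earlier sketch):

```python
import numpy as np

def baseline_gradient(glogps, returns):
    # glogps:  shape (N, n_states, n_actions), the grad log p_theta(tau_i) terms
    # returns: shape (N,), the total rewards r(tau_i)
    b = returns.mean()                              # batch-average baseline
    weights = returns - b                           # centered returns
    return (glogps * weights[:, None, None]).mean(axis=0)
```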
The reason we put so much effort into reducing variance is that the policy gradient is still noisy and high-variance, which means it works best with dense rewards and large batches.
Computing the gradient is also currently very expensive, since we have to do $N \cdot T$ backward passes. The solution is to implement a surrogate objective whose gradient is the same as $\nabla_\theta L(\theta)$.
We do this by replacing the "log probability" term in the policy gradient with something like a return-weighted cross-entropy loss for discrete-action policies, or a squared error for Gaussian policies, and letting automatic differentiation compute the gradient in a single backward pass.
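Here is a minimal sketch of such a surrogate for a discrete-action policy in PyTorch; `policy_net` and the batch tensors are hypothetical, and the return weights are treated as constants so that only the log-probability term is differentiated:

```python
import torch
import torch.nn.functional as F

def surrogate_loss(policy_net, states, actions, weights):
    # states:  (N*T, obs_dim) float tensor of visited states
    # actions: (N*T,) long tensor of actions taken
    # weights: (N*T,) float tensor, e.g. r(tau_i) - b repeated over each step
    logits = policy_net(states)
    # cross_entropy = -log pi_theta(a_t | s_t); weighting it by the detached
    # returns gives a scalar whose gradient matches grad_theta L(theta)
    # (up to a constant factor), with a single backward pass.
    neg_logp = F.cross_entropy(logits, actions, reduction="none")
    return (neg_logp * weights.detach()).mean()
```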
Importance Sampling
The obvious bottleneck in our implementation is that we need to re-collect data at every gradient step! In other words, vanilla policy gradient is on-policy rather than off-policy, which means it updates based on data from the current policy rather than previous policies (e.g. sampling from replay memory).
Can we come up with an off-policy version? Yes, using importance sampling. Currently we calculate loss via
$$L(\theta) = -\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[r(\tau)\right]$$
What if we want to use samples from $p_{\theta'}(\tau)$, the trajectory distribution of a previous policy? The idea of importance sampling is that, for any proposal distribution $q$,

$$\mathbb{E}_{x \sim p}\left[f(x)\right] = \int p(x) f(x)\, dx = \int q(x) \frac{p(x)}{q(x)} f(x)\, dx = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right].$$
If we go through the motions of computing the gradient again, we find that the importance-sampled version of our gradient, for our current policy $\pi_\theta$ using trajectories sampled from a previous policy $\pi_{\theta'}$, is

$$\nabla_\theta L(\theta) = -\mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)} \nabla_\theta \log p_\theta(\tau)\, r(\tau)\right], \qquad \frac{p_\theta(\tau)}{p_{\theta'}(\tau)} = \prod_{t=1}^{T} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)},$$

since the initial-state and transition probabilities cancel in the ratio.
Since this product of per-timestep policy ratios can become very small or very large as $T$ grows, it is numerically unstable. What if we consider the expectation over timesteps instead of over whole trajectories? In the expression below, the ratio of policies $\pi_\theta(a_t \mid s_t)$ is replaced by a ratio of the joint distributions $p_\theta(s_t, a_t)$ and $p_{\theta'}(s_t, a_t)$ over states and actions:

$$L(\theta) = -\sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_{\theta'}(s_t, a_t)}\left[\frac{p_\theta(s_t, a_t)}{p_{\theta'}(s_t, a_t)}\, r(s_t, a_t)\right]$$
These ratios are called the importance weights for our off-policy policy gradient. The full REINFORCE algorithm is the same but now we can take multiple gradient steps per batch.
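As a final sketch (again with a hypothetical `policy_net` and batch tensors), the practical off-policy surrogate weights the returns by the ratio of new to old action probabilities; differentiating it gives the per-timestep importance-weighted policy gradient, with the state-marginal ratio ignored as is common in practice:

```python
import torch
import torch.nn.functional as F

def off_policy_surrogate(policy_net, states, actions, weights, old_logp):
    # old_logp: log pi_theta'(a_t | s_t) recorded when the batch was collected.
    logits = policy_net(states)
    logp = -F.cross_entropy(logits, actions, reduction="none")  # log pi_theta(a_t|s_t)
    ratio = torch.exp(logp - old_logp.detach())                 # importance weight
    # Differentiating -(ratio * weight) gives -(pi_theta / pi_theta') *
    # grad log pi_theta * weight, the importance-sampled policy gradient,
    # so the same batch can be reused for several gradient steps.
    return -(ratio * weights.detach()).mean()
```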