OVERVIEW of methods and papers

Tags: CS 285

Behavior cloning / DAGGER

Policy Gradient Methods

These methods optimize the policy directly, using various tricks to reduce variance and keep updates stable.

REINFORCE
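A minimal sketch of the vanilla policy-gradient update, assuming a discrete-action PyTorch policy network; the function and variable names are illustrative, not from the course code.

```python
import torch
from torch.distributions import Categorical

def reinforce_loss(policy_net, states, actions, returns):
    """Surrogate loss whose gradient is the REINFORCE estimator:
    -E[log pi(a|s) * return-to-go]."""
    logits = policy_net(states)                  # (T, num_actions)
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)           # (T,)
    # Subtracting a constant baseline (here the mean return) reduces variance
    # without biasing the gradient.
    advantages = returns - returns.mean()
    return -(log_probs * advantages).mean()
```

Differentiating this loss and taking an optimizer step recovers the Monte Carlo policy gradient.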

DPG / DDPG

TD3

TRPO / PPO

General method: Actor-Critic

An actor-critic algorithm requires an explicit model for both an actor (which selects actions) and a critic (which estimates how good those actions are). This is a generic framework. The actor can be optimized with a policy gradient, replacing the Monte Carlo rollout with the critic's value estimate, or it can be optimized directly by maximizing the Q function.

The advantage (A) or Q function can be fit through Monte Carlo regression or through a Bellman backup.

Depending on which of these choices it makes, an AC algorithm can be on-policy or off-policy; in practice, however, AC methods are often on-policy.
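A minimal sketch of one actor-critic variant (a one-step advantage actor-critic with a bootstrapped critic), assuming small PyTorch networks; names, shapes, and the discrete-action assumption are all illustrative.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def actor_critic_losses(actor, critic, s, a, r, s_next, done, gamma=0.99):
    """One-step actor-critic: the critic's bootstrapped value replaces the
    Monte Carlo return in the policy gradient."""
    v = critic(s).squeeze(-1)                          # V(s)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * critic(s_next).squeeze(-1)
    critic_loss = F.mse_loss(v, target)                # fit V with a Bellman backup

    advantage = (target - v).detach()                  # A(s, a) ~ r + gamma*V(s') - V(s)
    dist = Categorical(logits=actor(s))
    actor_loss = -(dist.log_prob(a) * advantage).mean()
    return actor_loss, critic_loss
```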

Q-Prop

SAC

Essentially a Q-learning method, but it is an actor-critic too because there is an explicit representation of the policy. We add an entropy term to the objective, which has a lot of benefits: better exploration and more robust training.
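A sketch of the entropy-regularized actor update at the heart of SAC, assuming twin Q networks and a policy object with a reparameterized `rsample_with_log_prob` helper (the helper name is an assumption, not a standard API).

```python
import torch

def sac_actor_loss(policy, q1, q2, states, alpha=0.2):
    """Soft actor objective: maximize Q(s, a) + alpha * entropy, i.e.
    minimize E[alpha * log pi(a|s) - Q(s, a)] with a ~ pi(.|s)."""
    actions, log_probs = policy.rsample_with_log_prob(states)  # assumed helper: reparameterized sample + log-prob
    q = torch.min(q1(states, actions), q2(states, actions))    # clipped double Q, as in TD3/SAC
    return (alpha * log_probs - q).mean()
```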

General method: Value Methods

These methods focus on learning a value function and then implicitly deriving a policy from it. In practice, value methods and actor-critic methods can intersect.
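The implicit policy is just greedy (or epsilon-greedy) action selection against the learned Q function; a small sketch, assuming a discrete-action PyTorch Q-network with an assumed `num_actions` attribute:

```python
import random
import torch

def act(q_net, state, epsilon=0.1):
    """Epsilon-greedy policy derived implicitly from Q: argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.randrange(q_net.num_actions)   # explore
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))         # (1, num_actions)
    return int(q_values.argmax(dim=-1).item())       # exploit: greedy action
```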

SARSA & Variants

This is a Q-function method that uses a Bellman backup to fit Q^\pi:

Q^\pi(s, a) = r + \gamma Q^\pi(s', a')

This, of course, must be on-policy, since the target uses the next action a' that the behavior policy actually took. We can make the Q-function learning off-policy by replacing the target with

Q^\pi(s, a) = r + \gamma Q^\pi(s', \pi(s'))

(This is reminiscent of off-policy actor-critic methods.) Note that this becomes a "SARS" method, needing only (s, a, r, s') tuples, which makes it off-policy. You can even ditch the actor and define it implicitly with

Q(s, a) = r + \gamma \max_{a'} Q(s', a')

(This is the same as fitted Q iteration, as seen above.) In the actor-critic setup, the last two equations are pretty much the same thing. But in value function methods, the last equation lets you implicitly define a policy: just act greedily with respect to Q.
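A sketch contrasting the three bootstrapped targets above, assuming a discrete-action PyTorch Q-network, an actor that outputs action logits, and batched (s, a, r, s', a') tensors; all names are illustrative.

```python
import torch

def td_targets(q_net, actor, r, s_next, a_next, gamma=0.99):
    """The SARSA, actor-based, and max (Q-learning) targets, without gradients."""
    with torch.no_grad():
        q_next = q_net(s_next)                                   # (B, num_actions)

        # 1) SARSA: use the action a' the policy actually took (on-policy).
        sarsa = r + gamma * q_next.gather(1, a_next.unsqueeze(1)).squeeze(1)

        # 2) Actor-based: evaluate Q at the actor's chosen action in s' ("SARS", off-policy).
        a_pi = actor(s_next).argmax(dim=-1, keepdim=True)        # assumes actor outputs logits
        actor_based = r + gamma * q_next.gather(1, a_pi).squeeze(1)

        # 3) Q-learning: maximize over actions, no explicit actor needed.
        q_learning = r + gamma * q_next.max(dim=-1).values
    return sarsa, actor_based, q_learning
```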

DQN

This is fitted Q iteration, made practical with a replay buffer and a target network.
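A sketch of what DQN adds on top of fitted Q iteration, namely regression toward a target computed with a frozen target network on replay-buffer batches (PyTorch, discrete actions, illustrative names):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Fitted-Q-style regression toward a frozen target network's estimate."""
    s, a, r, s_next, done = batch                              # sampled from a replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a) for the taken actions
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return F.smooth_l1_loss(q_sa, target)                      # Huber loss, standard in DQN

# Periodically: target_net.load_state_dict(q_net.state_dict())
```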

Double DQN

Applies double Q-learning (covered later) to DQN: the online network picks the argmax action and the target network evaluates it, which reduces over-estimation bias.

Offline RL: general methods

Traditional Q-learning methods on offline data suffer from over-optimism, because the learned Q values are never checked against real experience, so errors on out-of-distribution actions go uncorrected. A whole set of special tools has been developed for this.

AWAC

CQL
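A sketch of the conservative penalty at the core of CQL (discrete-action, PyTorch, illustrative names): push Q values down on all actions relative to the actions actually present in the dataset, then add this term to the usual TD loss.

```python
import torch

def cql_penalty(q_net, s, a_data):
    """Conservative term: logsumexp_a Q(s, a) - Q(s, a_data).
    Minimizing it lowers Q on out-of-distribution actions relative to dataset actions."""
    q_all = q_net(s)                                             # (B, num_actions)
    q_data = q_all.gather(1, a_data.unsqueeze(1)).squeeze(1)     # Q at dataset actions
    return (torch.logsumexp(q_all, dim=1) - q_data).mean()

# total_loss = td_loss + cql_alpha * cql_penalty(q_net, s, a_data)
```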

IQL