RL Introduction
Tags: CS 285
RL Algorithms: A quick survey
There are three basic parts of an RL algorithm (a generic loop is sketched after the list):
- Generate samples in the environment
- Fit a model to something (like a Q function)
- Improve the policy
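As a rough picture, here is a minimal structural sketch of that loop. The `env` interface (`reset()`/`step()`) and the `agent` interface (`act()`/`fit()`/`improve()`) are assumptions for illustration, not the API of any particular library.

```python
# Structural sketch of the generic RL loop; `env` and `agent` are hypothetical interfaces.
def run_rl(env, agent, num_iterations=100, horizon=200):
    for _ in range(num_iterations):
        # 1. Generate samples: roll out the current policy in the environment.
        trajectory = []
        obs = env.reset()
        for _ in range(horizon):
            action = agent.act(obs)
            obs_next, reward, done = env.step(action)
            trajectory.append((obs, action, reward, obs_next))
            obs = obs_next
            if done:
                break

        # 2. Fit a model to something (a Q function, a value function, dynamics, ...).
        agent.fit(trajectory)

        # 3. Improve the policy using whatever was fit in step 2.
        agent.improve()
```

Different algorithm families put different amounts of work into steps 2 and 3: policy gradients barely fit anything, while model-based methods spend most of their effort there.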
Specific algorithms
- Policy gradients: directly differentiate the RL objective and do gradient ascent on it (a minimal REINFORCE sketch follows this list)
    - can be stabilized by a value function baseline (a control variate)
- Value-based: estimate a value function or Q function; the policy is implicit (e.g., act greedily with respect to the Q function)
- Actor-critic: estimate a value function or Q function, then use it to improve an explicit actor. A hybrid between policy gradients and value-based algorithms
- Model-based RL: estimate a transition model, then use it for planning, to improve a policy, or for something else
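To make the policy-gradient bullet concrete, here is a minimal REINFORCE sketch on an invented 3-armed bandit using only NumPy. The bandit, the hyperparameters, and the running-average baseline are illustrative assumptions, not the full CS 285 setup.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.9])  # assumed expected reward of each arm
theta = np.zeros(3)                     # softmax policy parameters (logits)
lr, baseline = 0.1, 0.0

for _ in range(2000):
    # Generate a sample: draw an action from the current softmax policy.
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    a = rng.choice(3, p=probs)
    r = rng.normal(true_means[a], 0.1)

    # Score function for a softmax policy: grad_theta log pi(a) = one_hot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0

    # Value baseline as a control variate: reduces variance without adding bias.
    advantage = r - baseline
    baseline += 0.05 * (r - baseline)

    # Directly differentiate the RL objective: gradient ascent on expected reward.
    theta += lr * advantage * grad_log_pi

probs = np.exp(theta - theta.max())
probs /= probs.sum()
print("learned action probabilities:", np.round(probs, 3))  # should favor the best arm
```

The same score-function trick is what the full policy gradient applies over whole trajectories; the running-average baseline here plays the role that a learned value function plays in actor-critic methods.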
Why do we have different algorithms?
There are a few tradeoffs to consider:
- sample efficiency
- stability and ease of use
- different assumptions (stochastic vs. deterministic dynamics, continuous vs. discrete states and actions, episodic vs. infinite horizon)
- different settings (simulation, real life, etc)
- is it easier to represent the policy or the model?
Efficiency
The most important question is whether the algorithm is on-policy or off-policy. On-policy learning is very sample-inefficient, because data must be re-collected from the current policy after every update, but it has its perks (for example, it is often simpler and more stable).
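The sketch below illustrates the difference in how the two regimes handle data; the `agent.fit()` interface is an assumption carried over from the earlier loop sketch.

```python
import random
from collections import deque

# Off-policy: a replay buffer lets every transition be reused many times,
# even if it was collected by an older policy.
replay_buffer = deque(maxlen=100_000)

def off_policy_step(agent, transition, batch_size=64):
    replay_buffer.append(transition)
    if len(replay_buffer) >= batch_size:
        agent.fit(random.sample(replay_buffer, k=batch_size))

# On-policy: each update needs a fresh batch of rollouts from the *current*
# policy, and that data is thrown away after the update.
def on_policy_step(agent, fresh_trajectories):
    agent.fit(fresh_trajectories)
```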
Stability
When you fit a value function through dynamic programming (e.g., fitted Q-iteration), you are not doing gradient descent on the RL objective, so with function approximation nothing is guaranteed in general. Policy gradient methods, by contrast, perform gradient ascent directly on the objective, so the usual (local) convergence properties of gradient methods apply.
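For reference, here is the objective that policy gradients ascend directly, in its standard REINFORCE form, next to the Bellman backup that value-based methods iterate (standard textbook formulas, not specific to any one lecture):

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} r(s_t, a_t)\right],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\left(\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t} r(s_t, a_t)\right)\right]
$$

$$
Q(s, a) \leftarrow r(s, a) + \gamma \, \mathbb{E}_{s'}\!\left[\max_{a'} Q(s', a')\right]
$$

The first is a true gradient of the objective; the second is a fixed-point update that, with function approximation, is not the gradient of any objective, which is where the lack of guarantees comes from.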
Assumptions
- Full observability is a common assumption: we treat the problem as an MDP rather than a POMDP (if you run an algorithm that assumes an MDP on a problem that is really a POMDP, it may fail to find a good policy)
- Episodic learning: we can reset the environment after each episode
- Continuity or smoothness: there is some continuous or smooth value function (not necessarily a continuous reward!)