RL Introduction

Tags: CS 285

RL Algorithms: A quick survey

There are three basic parts of an RL algorithm (see the sketch after this list):

  1. Generate samples in the environment
  1. Fit a model to something (like a Q function)
  1. Improve the policy
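
A rough sketch of how these three parts fit together in a loop is below. The `env`, `policy`, `fit_model`, and `improve_policy` names are hypothetical placeholders, not any particular algorithm from the course:

```python
# Hypothetical skeleton of the three-part RL loop; env, policy, fit_model, and
# improve_policy are placeholders that a concrete algorithm would supply.
def rl_loop(env, policy, fit_model, improve_policy, num_iters=100, horizon=200):
    for _ in range(num_iters):
        # 1. Generate samples: roll out the current policy in the environment
        obs = env.reset()
        trajectory = []
        for _ in range(horizon):
            action = policy.act(obs)
            next_obs, reward, done = env.step(action)  # assumed (obs, reward, done) interface
            trajectory.append((obs, action, reward, next_obs))
            obs = next_obs
            if done:
                break

        # 2. Fit a model to something (e.g., a Q-function, value function, or dynamics model)
        model = fit_model(trajectory)

        # 3. Improve the policy using whatever was fit
        policy = improve_policy(policy, model)
    return policy
```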

Specific algorithms

Why do we have different algorithms?

There are a few key tradeoffs:

Efficiency

The most important question is whether the algorithm is on-policy or off-policy. On-policy learning is sample-inefficient, since fresh data must be collected with the current policy after every update, but it has its perks: the data always matches the policy being optimized.
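
A rough sketch of the difference in how the two use data, with hypothetical `collect_rollouts` and `update` helpers standing in for the details of any specific algorithm:

```python
# Hypothetical sketch contrasting data usage; collect_rollouts and update are placeholders.
def on_policy_training(env, policy, collect_rollouts, update, num_iters=100):
    for _ in range(num_iters):
        # Fresh data is collected with the *current* policy every iteration
        # and thrown away after the update -- hence the sample inefficiency.
        batch = collect_rollouts(env, policy)
        policy = update(policy, batch)
    return policy

def off_policy_training(env, policy, collect_rollouts, update, num_iters=100):
    replay_buffer = []
    for _ in range(num_iters):
        # Old data stays in the buffer and keeps being reused, even though it
        # was generated by earlier versions of the policy.
        replay_buffer.extend(collect_rollouts(env, policy))
        policy = update(policy, replay_buffer)
    return policy
```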

Stability

When you fit a value function through dynamic programming (e.g., fitted Q-iteration with a neural network), you aren't guaranteed to converge or to improve the true objective. With policy gradients, by contrast, you are directly taking gradient steps on the RL objective itself.
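
To make the contrast concrete, here is one standard way to write the two objectives (standard notation, not copied from the lecture). Fitted Q-iteration minimizes a Bellman error against a moving target, which is not gradient descent on the true RL objective; policy gradient methods take gradient steps on the expected return directly:

```latex
% Fitted Q-iteration: regress onto a bootstrapped (moving) target y
\min_\phi \; \mathbb{E}\!\left[\Big(Q_\phi(s,a) - \big(r(s,a) + \gamma \max_{a'} Q_\phi(s',a')\big)\Big)^2\right]

% Policy gradient: directly ascend the expected return J(\theta)
J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[\sum_t r(s_t,a_t)\right],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[\Big(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_t r(s_t,a_t)\Big)\right]
```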

Assumptions

  1. Full observability is a common assumption: it means we treat the environment as an MDP rather than a POMDP (if you optimize under the MDP assumption but the environment is actually a POMDP, the problem can become unsolvable)
  1. Episodic learning: the environment can be reset after each episode
  1. Continuity or smoothness: there is some continuous (smooth) value function, though the reward itself need not be continuous!