RL Basic Theory

Tags: CS 234 · Reviewed: 2024

What is reinforcement learning?

The key objective

The key objective of RL is to find a policy that maximizes the expected sum of rewards.

This objective can be rewritten in terms of the state-action marginals (essentially the distribution over states and actions at any given time t, computed through inference).
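As a sketch of the standard way to write this (assuming a finite horizon $T$ and the trajectory distribution $p_\pi(\tau)$ induced by the policy):

$$\pi^\star = \arg\max_\pi \, \mathbb{E}_{\tau \sim p_\pi(\tau)}\left[\sum_{t=0}^{T} r(s_t, a_t)\right] = \arg\max_\pi \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\pi(s_t, a_t)}\left[r(s_t, a_t)\right]$$

where $p_\pi(s_t, a_t)$ is the state-action marginal at time $t$.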

What it involves

RL involves optimization, delayed consequences, exploration, and generalization.

Here’s a nice chart that looks at how different AI methods use these things

UL is unsupervised, SL is supervised, and IL is imitation learning.

Key vocab

Big questions

What questions do we ask?

What assumptions do we make?

For exploration problems, we typically prove worst-case performance guarantees.

For learning problems, we abstract away exploration by assuming we can draw any sample we want from the MDP. In other words, you assume access to P(s' | s, a) for all (s, a).
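As a minimal sketch of what that access looks like, here is a hypothetical tabular MDP exposing a generative-model interface (the class name and the toy dynamics are made up for illustration):

```python
import numpy as np

class TabularMDP:
    """Toy MDP with known dynamics, exposing generative-model access."""

    def __init__(self, P, R, rng=None):
        # P[s, a, s'] = transition probability, R[s, a] = expected reward
        self.P, self.R = P, R
        self.rng = rng or np.random.default_rng(0)

    def sample(self, s, a):
        """Draw (s', r) for any chosen (s, a) -- no exploration needed."""
        s_next = self.rng.choice(self.P.shape[2], p=self.P[s, a])
        return s_next, self.R[s, a]

# Tiny 2-state, 2-action example with made-up numbers.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
mdp = TabularMDP(P, R)
print(mdp.sample(s=0, a=1))
```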

It’s generally not possible to show that an algorithm converges all the time. However, we can start to understand how problem parameters can impact the output. We can use precise theory to get conclusions that we can approximately apply to the real world.

Theory often allows you to get heuristics!

Environment

There are a few attributes of the environment to keep in mind

Agent

The RL algorithm typically contains one or more of the following: a model, a policy, and a value function.

A model-based agent uses an explicit model. Whether it also has an explicit policy and/or value function depends on the algorithm.

A model-free agent has no model, but it has an explicit value function and/or policy function.

Model

A model, maintained by some agents, is the agent's representation of how the world behaves. You can have a transition model that predicts p(s' | s, a), and you can also have a reward model that predicts E[r | s, a].

You learn the model; it’s not correct to begin with. But once you have the model, you have more freedom for planning.
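As a rough sketch of what learning such a model can look like, here is a generic maximum-likelihood tabular estimate from observed transitions (the class and method names are illustrative assumptions, not anything specific from the course):

```python
import numpy as np

class LearnedModel:
    """Estimate P(s' | s, a) and E[r | s, a] from observed transitions."""

    def __init__(self, n_states, n_actions):
        self.counts = np.zeros((n_states, n_actions, n_states))
        self.reward_sum = np.zeros((n_states, n_actions))

    def update(self, s, a, r, s_next):
        """Record one observed transition (s, a, r, s')."""
        self.counts[s, a, s_next] += 1
        self.reward_sum[s, a] += r

    def transition(self, s, a):
        """Maximum-likelihood estimate of P(. | s, a); uniform if unseen."""
        n = self.counts[s, a].sum()
        n_states = self.counts.shape[2]
        return self.counts[s, a] / n if n > 0 else np.ones(n_states) / n_states

    def reward(self, s, a):
        """Empirical estimate of E[r | s, a]; zero if unseen."""
        n = self.counts[s, a].sum()
        return self.reward_sum[s, a] / n if n > 0 else 0.0
```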

Policy

A policy maps from states to actions. It can be deterministic or stochastic.

A deterministic policy can be good if the environment is easy to exploit. But a stochastic policy is good for learning, and sometimes it's important even at rollout time, especially in adversarial settings such as games.
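For concreteness, here is a sketch of a deterministic greedy policy next to a stochastic epsilon-greedy one over a table of action values (the Q-table and epsilon value are made-up assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy_policy(Q, s):
    """Deterministic: always pick the highest-value action in state s."""
    return int(np.argmax(Q[s]))

def epsilon_greedy_policy(Q, s, epsilon=0.1):
    """Stochastic: mostly greedy, but explore with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

Q = np.array([[1.0, 0.5],   # made-up action values for 2 states, 2 actions
              [0.2, 0.8]])
print(greedy_policy(Q, 0), epsilon_greedy_policy(Q, 0))
```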

Value function

A value function is the expected discounted sum of rewards from following a policy starting in a given state.

With a value function, you can quantify goodness and badness of states and actions, which helps you make decisions.

There is a discount factor γ ∈ [0, 1] that lets you adjust how much you care about future rewards.
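Written out (the standard definitions, assuming an infinite-horizon discounted setting with discount factor $\gamma$):

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right], \qquad Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$$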

Control vs evaluation

The agent can do two things: it can either control or it can evaluate.

Control means applying a policy in the world to maximize rewards, while evaluation means predicting the expected rewards from following a given policy. Sometimes you can accomplish both at the same time; other times, you may only be able to accomplish control. It depends on which approach you use.
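As a rough illustration of the distinction (assuming known tabular dynamics P[s, a, s'] and rewards R[s, a], which is really a planning setting rather than full RL), evaluation and control look like this:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, iters=500):
    """Evaluation: estimate V^pi for a fixed deterministic policy pi[s]."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(iters):
        V = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ V for s in range(n_states)])
    return V

def value_iteration(P, R, gamma=0.9, iters=500):
    """Control: find a greedy policy maximizing expected discounted reward."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R + gamma * P @ V   # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V

# Reusing the tiny made-up MDP from the earlier sketch.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
print(policy_evaluation(P, R, pi=np.array([0, 1])))
print(value_iteration(P, R))
```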