Coordinated Exploration
Tags: CS 234, Exploration, Extra
What is concurrent reinforcement learning?
Concurrent RL arises when many agents interact with an environment in parallel. The agents share a common prior and pool their information, so each benefits from the others' experience. The hard part is making exploration efficient across agents, which requires:
- Adaptivity (effective use of information as it comes in)
- Commitment (carry out long-term plans)
- Diversity (different agents do different things)
Failed Approaches
Concurrent UCRL
- agents form the same upper confidence bounds from the shared exploration data
- however, this yields no diversity in exploration: all agents behave identically, so there is no divide-and-conquer benefit
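A minimal sketch of the no-diversity failure, using a shared-data UCB bandit as a stand-in for full concurrent UCRL (the setting and constants here are illustrative assumptions): every agent computes the same bounds from the pooled counts, so every agent pulls the same arm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, n_agents = 5, 4
true_means = rng.uniform(size=n_arms)

# Statistics pooled across all agents.
counts = np.ones(n_arms)          # pull counts (init 1 to avoid div-by-zero)
sums = rng.normal(true_means)     # one initial observation per arm

for t in range(1, 100):
    ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
    # Every agent sees the same pooled data, so its argmax is identical:
    actions = [int(np.argmax(ucb)) for _ in range(n_agents)]
    assert len(set(actions)) == 1  # no diversity: all agents pick the same arm
    for a in actions:
        counts[a] += 1
        sums[a] += rng.normal(true_means[a])
```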
Posterior sampling
- sample a model for each agent and act on it
- if we resample at every timestep, there is no commitment and the agents' plans can oscillate too much
- in the bandit case commitment isn't an issue because each decision is a single step; in an MDP, however, it pays to commit to a long-horizon plan
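A hedged sketch of the per-timestep resampling issue, written for a Gaussian bandit (the prior and noise model are assumptions): each agent draws a fresh posterior sample every step, which is harmless here but, in an MDP, means the plan implied by the sample changes before it can be carried out.

```python
import numpy as np

rng = np.random.default_rng(1)
n_arms, n_agents, horizon = 3, 2, 50
true_means = rng.normal(size=n_arms)

# Gaussian posterior with unit observation noise: track sum and count per arm.
counts = np.zeros(n_arms)
sums = np.zeros(n_arms)

for t in range(horizon):
    for agent in range(n_agents):
        # Resample a model at EVERY timestep and act greedily on it.
        post_mean = sums / np.maximum(counts, 1)
        post_std = 1.0 / np.sqrt(np.maximum(counts, 1))
        sampled_means = rng.normal(post_mean, post_std)
        a = int(np.argmax(sampled_means))
        r = rng.normal(true_means[a])
        counts[a] += 1
        sums[a] += r
    # In a bandit this is fine (each decision is one step); in an MDP the
    # sampled model -- and hence the optimal policy -- changes every step,
    # so long-horizon exploration plans are never carried out.
```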
Seed Sampling (tabular)
Let the mapping from the shared observation history to a sampled MDP depend on a random seed that differs across agents.
This yields both diversity and commitment (similar in spirit to sampling Q functions).
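A minimal sketch of the seed idea on a shared bandit-style dataset (the perturbation form and scales are illustrative assumptions, not the exact tabular algorithm): each agent's seed fixes a perturbation that is reused every step, so its sampled model stays stable (commitment) while differing from the other agents' models (diversity), and it only drifts as the shared data accumulates.

```python
import numpy as np

n_arms, n_agents = 5, 3
true_means = np.random.default_rng(42).uniform(size=n_arms)

# Data shared by all agents.
counts = np.ones(n_arms)
sums = np.zeros(n_arms)

# Each agent keeps a FIXED seed; the seed, not fresh randomness, maps the
# shared history to that agent's sampled model.
agent_rngs = [np.random.default_rng(seed) for seed in range(n_agents)]
fixed_noise = [arng.normal(size=n_arms) for arng in agent_rngs]  # drawn once

env_rng = np.random.default_rng(7)
for t in range(200):
    for k in range(n_agents):
        # Sampled model = shared empirical means + seed-specific perturbation
        # whose influence shrinks as shared data grows.
        sampled = sums / counts + fixed_noise[k] / np.sqrt(counts)
        a = int(np.argmax(sampled))        # agent k commits to its own sample
        r = env_rng.normal(true_means[a])
        counts[a] += 1                     # observations are shared by all
        sums[a] += r
```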
Moving to Non-Tabular
Epsilon-Greedy DQN
Used out of the box, this yields limited diversity, and the epsilon-random action selection doesn't yield much commitment either.
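A rough sketch of why this falls short, using a shared Q-table as a stand-in for the shared DQN (all names and sizes here are assumptions): the greedy action is identical across agents, and the epsilon deviations are independent single-step actions rather than sustained exploration plans.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, n_agents, eps = 4, 3, 5, 0.1
shared_q = rng.normal(size=(n_states, n_actions))  # one value model for all agents

def act(state: int) -> int:
    """Epsilon-greedy on the shared value estimates."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))   # one-step random deviation
    return int(np.argmax(shared_q[state]))    # same greedy action for everyone

state = 0
actions = [act(state) for _ in range(n_agents)]
# With probability (1 - eps)^n_agents every agent takes the identical greedy
# action, and the eps deviations last a single step, so there is neither much
# diversity nor any commitment to a longer exploration plan.
```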
Generalized Seed Sampling
The idea here is to draw random seeds that slightly perturb the observed rewards and that also sample an initial parameter vector toward which we L2-regularize during training.
The general idea is that each agent gives a different interpretation to the shared data.
This gives per-agent Bayesian regret that decreases with the number of agents, so the cumulative regret is sublinear.
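A minimal numpy sketch of the two seed-based ingredients, using linear reward regression in place of a deep network (the feature map, noise scale, and regularizer weight are illustrative assumptions): each agent's seed fixes a perturbation of every shared reward and a prior parameter that the fit is L2-regularized toward, so all agents train on the same data but reach different, stable interpretations of it.

```python
import numpy as np

d = 4                                   # feature dimension
rng = np.random.default_rng(0)

# Shared dataset: features X and observed rewards r, seen by every agent.
X = rng.normal(size=(50, d))
r = X @ rng.normal(size=d) + 0.1 * rng.normal(size=50)

lam, sigma = 1.0, 0.3                   # regularizer weight and noise scale (assumed)

def agent_params(seed: int) -> np.ndarray:
    """Map the SHARED data plus an agent-specific seed to that agent's model."""
    arng = np.random.default_rng(seed)
    theta_prior = arng.normal(size=d)        # sampled parameter to regularize toward
    z = sigma * arng.normal(size=len(r))     # fixed perturbation of each reward
    # Ridge regression: argmin ||X theta - (r + z)||^2 + lam * ||theta - theta_prior||^2
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ (r + z) + lam * theta_prior
    return np.linalg.solve(A, b)

# Same data, different seeds -> diverse but internally consistent models.
models = [agent_params(seed) for seed in range(3)]
```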