Coordinated Exploration

Tags: CS 234, Exploration, Extra

What is concurrent reinforcement learning?

Concurrent RL happens when many agents interact with an environment in parallel. The agents share a common prior, pool their observations, and benefit from each other's experience. This is a hard problem because we want exploration to be efficient across agents, which calls for three properties:

  1. Adaptivity (effective use of information as it comes in)
  2. Commitment (carry out long-term plans)
  3. Diversity (different agents do different things)

Failed Approaches

Concurrent UCRL

Every agent computes the same optimistic model from the shared data, so all agents explore in the same way: no diversity.

Posterior sampling

Resampling a fresh model at every time step gives diversity, but the constant resampling breaks up long-term plans, so there is little commitment.

Seed Sampling (tabular)

Each agent maps the shared observation history to a sampled MDP through a function of a random seed, and each agent holds a different, fixed seed.

This yields both diversity and commitment (similar in spirit to sampling Q-functions).
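A minimal sketch of what that mapping could look like in the tabular case. The Dirichlet posterior over transitions, the Normal posterior over mean rewards, and the tiny 5-state MDP are illustrative assumptions, not from the lecture; the point is that each agent's fixed seed makes its sampled MDP a deterministic function of the shared data.

```python
import numpy as np

S, A, GAMMA = 5, 2, 0.95  # small illustrative MDP

def sample_mdp(counts, reward_sums, seed):
    """Map the shared observation history to one sampled MDP, using a fixed
    per-agent seed so the sample is a deterministic function of (data, seed)."""
    rng = np.random.default_rng(seed)
    # Posterior sample of transitions: Dirichlet(1 + visit counts) per (s, a).
    P = np.array([[rng.dirichlet(1.0 + counts[s, a]) for a in range(A)]
                  for s in range(S)])
    # Posterior sample of mean rewards: Normal around the empirical means
    # (one pseudo-observation added to avoid dividing by zero).
    n = counts.sum(axis=2) + 1.0
    R = reward_sums / n + rng.normal(0.0, 1.0 / np.sqrt(n))
    return P, R

def greedy_policy(P, R, iters=200):
    """Solve the sampled MDP with value iteration and act greedily on it."""
    V = np.zeros(S)
    for _ in range(iters):
        V = (R + GAMMA * P @ V).max(axis=1)
    Q = R + GAMMA * P @ V
    return Q.argmax(axis=1)

# Shared data: sufficient statistics of the pooled history across all agents.
counts = np.zeros((S, A, S))
reward_sums = np.zeros((S, A))

# Same shared data, different fixed seeds -> different but internally
# consistent interpretations, hence diverse yet committed behavior.
policies = [greedy_policy(*sample_mdp(counts, reward_sums, seed))
            for seed in [0, 1, 2, 3]]
print(policies)
```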

Moving to Non-Tabular

Epsilon-Greedy DQN

Used out of the box, this yields limited diversity, and the per-step epsilon randomization doesn't yield much commitment either.
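A tiny illustrative sketch (hypothetical names, not from the notes) of why that is: the greedy action comes from the same shared network for every agent, and the exploration noise is redrawn independently at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS, EPSILON = 4, 0.1

def q_values(obs):
    # Placeholder for a shared DQN; every agent queries the same weights.
    return np.array([1.0, 0.5, 0.2, 0.1])

def epsilon_greedy_action(obs):
    # The dithering is drawn independently at every single step, so an agent
    # never follows a multi-step exploratory plan (no commitment), and with
    # probability 1 - EPSILON every agent picks the same greedy action
    # (little diversity across agents).
    if rng.random() < EPSILON:
        return int(rng.integers(NUM_ACTIONS))
    return int(np.argmax(q_values(obs)))

actions = [epsilon_greedy_action(obs=None) for _ in range(5)]  # five agents
print(actions)  # mostly the identical greedy action
```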

Generalized Seed Sampling

The idea here is to draw a random seed per agent that both perturbs the observed rewards slightly and samples an initial function parameter that we L2-regularize towards during training.

The general idea is that different agents give different interpretations to the same shared data.
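A minimal sketch of this, assuming a linear value model fit by ridge-style regression on a shared buffer of feature-reward pairs. The seed drives both the reward perturbations and the sampled parameter that the fit is pulled towards; the specific model, noise scale SIGMA_R, and penalty LAMBDA are illustrative assumptions, not from the lecture.

```python
import numpy as np

D = 8          # feature dimension
SIGMA_R = 0.5  # std of the seeded reward perturbations
LAMBDA = 1.0   # strength of the L2 pull toward the sampled theta0

def fit_agent(features, rewards, seed):
    """One agent's interpretation of the shared data, fixed by its seed."""
    rng = np.random.default_rng(seed)
    # (a) Seeded perturbation of every reward in the shared buffer.
    z = rng.normal(0.0, SIGMA_R, size=rewards.shape)
    # (b) Seeded initial parameter sample that the fit is regularized toward.
    theta0 = rng.normal(0.0, 1.0, size=D)
    # Closed-form solution of
    #   min_theta ||X theta - (r + z)||^2 + LAMBDA * ||theta - theta0||^2
    X, y = features, rewards + z
    A = X.T @ X + LAMBDA * np.eye(D)
    b = X.T @ y + LAMBDA * theta0
    return np.linalg.solve(A, b)

# Shared data seen by every agent.
rng = np.random.default_rng(123)
features = rng.normal(size=(50, D))
rewards = features @ rng.normal(size=D) + rng.normal(0.0, 0.1, size=50)

# Same data, different seeds -> different but committed interpretations.
thetas = [fit_agent(features, rewards, seed) for seed in range(4)]
print(np.round([t[:3] for t in thetas], 2))
```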

This gives us per-agent Bayesian regret $O(1/\sqrt{N})$ with $N$ agents, so the cumulative regret across agents is sublinear.
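As a quick sanity check (my own arithmetic, not from the lecture), summing the stated per-agent bound over the $N$ agents gives the sublinear total directly:

$$\mathrm{Regret}_{\text{total}}(N) \;=\; \sum_{k=1}^{N} \mathrm{Regret}_k \;\le\; N \cdot O\!\left(\tfrac{1}{\sqrt{N}}\right) \;=\; O(\sqrt{N}) \;=\; o(N).$$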