Coordinated Exploration
Tags: CS 234, Exploration, Extra
What is concurrent reinforcement learning?
Concurrent RL arises when many agents interact with an environment in parallel. The agents share a common prior and pool their information, so each benefits from the others' experience. The hard part is making exploration efficient across agents, which requires:
- Adaptivity (effective use of information as it comes in)
- Commitment (carry out long-term plans)
- Diversity (different agents do different things)
Failed Approaches
Concurrent UCRL
- agents form the same upper confidence bounds from the shared exploration data
- however, this yields no diversity in exploration: all agents behave identically, so there is no divide-and-conquer benefit
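A minimal sketch of the no-diversity failure, using a shared-data UCB bandit as a stand-in for full concurrent UCRL (the setting and constants here are illustrative assumptions): every agent computes the same bounds from the pooled counts, so every agent pulls the same arm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, n_agents = 5, 4
true_means = rng.uniform(size=n_arms)

# Statistics pooled across all agents.
counts = np.ones(n_arms)          # pull counts (init 1 to avoid div-by-zero)
sums = rng.normal(true_means)     # one initial observation per arm

for t in range(1, 100):
    ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
    # Every agent sees the same pooled data, so its argmax is identical:
    actions = [int(np.argmax(ucb)) for _ in range(n_agents)]
    assert len(set(actions)) == 1  # no diversity: all agents pick the same arm
    for a in actions:
        counts[a] += 1
        sums[a] += rng.normal(true_means[a])
```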
Posterior sampling
- sample a model for each agent and act on it
- if we resample at every timestep, there is no commitment and the agents' plans can oscillate too much
- in the bandit case commitment isn't an issue because each decision is a single step; in an MDP, however, it pays to commit to a long-horizon plan
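A hedged sketch of the per-timestep resampling issue, written for a Gaussian bandit (the prior and noise model are assumptions): each agent draws a fresh posterior sample every step, which is harmless here but, in an MDP, means the plan implied by the sample changes before it can be carried out.

```python
import numpy as np

rng = np.random.default_rng(1)
n_arms, n_agents, horizon = 3, 2, 50
true_means = rng.normal(size=n_arms)

# Gaussian posterior with unit observation noise: track sum and count per arm.
counts = np.zeros(n_arms)
sums = np.zeros(n_arms)

for t in range(horizon):
    for agent in range(n_agents):
        # Resample a model at EVERY timestep and act greedily on it.
        post_mean = sums / np.maximum(counts, 1)
        post_std = 1.0 / np.sqrt(np.maximum(counts, 1))
        sampled_means = rng.normal(post_mean, post_std)
        a = int(np.argmax(sampled_means))
        r = rng.normal(true_means[a])
        counts[a] += 1
        sums[a] += r
    # In a bandit this is fine (each decision is one step); in an MDP the
    # sampled model -- and hence the optimal policy -- changes every step,
    # so long-horizon exploration plans are never carried out.
```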
Seed Sampling (tabular)
Let the mapping from the shared observation history to a sampled MDP depend on a random seed that differs across agents.
This yields both diversity and commitment (similar in spirit to sampling Q functions).
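A minimal sketch of the seed idea on a shared bandit-style dataset (the perturbation form and scales are illustrative assumptions, not the exact tabular algorithm): each agent's seed fixes a perturbation that is reused every step, so its sampled model stays stable (commitment) while differing from the other agents' models (diversity), and it only drifts as the shared data accumulates.

```python
import numpy as np

n_arms, n_agents = 5, 3
true_means = np.random.default_rng(42).uniform(size=n_arms)

# Data shared by all agents.
counts = np.ones(n_arms)
sums = np.zeros(n_arms)

# Each agent keeps a FIXED seed; the seed, not fresh randomness, maps the
# shared history to that agent's sampled model.
agent_rngs = [np.random.default_rng(seed) for seed in range(n_agents)]
fixed_noise = [arng.normal(size=n_arms) for arng in agent_rngs]  # drawn once

env_rng = np.random.default_rng(7)
for t in range(200):
    for k in range(n_agents):
        # Sampled model = shared empirical means + seed-specific perturbation
        # whose influence shrinks as shared data grows.
        sampled = sums / counts + fixed_noise[k] / np.sqrt(counts)
        a = int(np.argmax(sampled))        # agent k commits to its own sample
        r = env_rng.normal(true_means[a])
        counts[a] += 1                     # observations are shared by all
        sums[a] += r
```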
Moving to Non-Tabular
Epsilon-Greedy DQN
Used out of the box, this yields limited diversity, and the epsilon-random action selection doesn't yield much commitment either.
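A rough sketch of why this falls short, using a shared Q-table as a stand-in for the shared DQN (all names and sizes here are assumptions): the greedy action is identical across agents, and the epsilon deviations are independent single-step actions rather than sustained exploration plans.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, n_agents, eps = 4, 3, 5, 0.1
shared_q = rng.normal(size=(n_states, n_actions))  # one value model for all agents

def act(state: int) -> int:
    """Epsilon-greedy on the shared value estimates."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))   # one-step random deviation
    return int(np.argmax(shared_q[state]))    # same greedy action for everyone

state = 0
actions = [act(state) for _ in range(n_agents)]
# With probability (1 - eps)^n_agents every agent takes the identical greedy
# action, and the eps deviations last a single step, so there is neither much
# diversity nor any commitment to a longer exploration plan.
```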
Generalized Seed Sampling
The idea here is to draw random seeds that slightly perturb the observed rewards and that also sample an initial parameter vector toward which we L2-regularize during training.
The general idea is that each agent gives a different interpretation to the shared data.
This gives per-agent Bayesian regret that decreases with the number of agents, so the cumulative regret is sublinear.
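A minimal numpy sketch of the two seed-based ingredients, using linear reward regression in place of a deep network (the feature map, noise scale, and regularizer weight are illustrative assumptions): each agent's seed fixes a perturbation of every shared reward and a prior parameter that the fit is L2-regularized toward, so all agents train on the same data but reach different, stable interpretations of it.

```python
import numpy as np

d = 4                                   # feature dimension
rng = np.random.default_rng(0)

# Shared dataset: features X and observed rewards r, seen by every agent.
X = rng.normal(size=(50, d))
r = X @ rng.normal(size=d) + 0.1 * rng.normal(size=50)

lam, sigma = 1.0, 0.3                   # regularizer weight and noise scale (assumed)

def agent_params(seed: int) -> np.ndarray:
    """Map the SHARED data plus an agent-specific seed to that agent's model."""
    arng = np.random.default_rng(seed)
    theta_prior = arng.normal(size=d)        # sampled parameter to regularize toward
    z = sigma * arng.normal(size=len(r))     # fixed perturbation of each reward
    # Ridge regression: argmin ||X theta - (r + z)||^2 + lam * ||theta - theta_prior||^2
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ (r + z) + lam * theta_prior
    return np.linalg.solve(A, b)

# Same data, different seeds -> diverse but internally consistent models.
models = [agent_params(seed) for seed in range(3)]
```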