Deep RL on Real Robots

Tags: CS 224R

What do we want?

Learning directly in the real world

The big issue with learning in the real world is safety. Not human safety, per se, but making sure that the robot stays in a recoverable position so that it can keep trying.

Safety-constrained SAC

SAC already solves a constrained optimization problem (the entropy constraint), so it's natural to add another constraint for safety. We want to keep the robot within a certain angle of level so that it doesn't tip over. This is straightforward to implement with dual gradient descent on a safety-cost constraint.
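The dual update can be sketched in a few lines. This is a minimal, illustrative loop, not the full SAC update: the minibatch of tilt costs is simulated here (with a made-up feedback so the "policy" gets safer as the multiplier grows), whereas in real training it would come from a safety critic evaluated on replay-buffer samples.

```python
import numpy as np

# Sketch of dual gradient ascent on a safety multiplier, analogous to
# SAC's entropy-temperature update. `cost_limit` and the simulated batch
# are illustrative assumptions, not part of any library API.

cost_limit = 0.3   # constraint: expected tilt magnitude must stay below this
log_lam = 0.0      # optimize log(lambda) so the multiplier stays nonnegative
lr = 3e-2

rng = np.random.default_rng(0)
for step in range(3000):
    # Stand-in for costs estimated by a safety critic over a minibatch;
    # here we fake a policy whose tilt shrinks as lambda grows.
    lam = np.exp(log_lam)
    batch_cost = np.abs(rng.normal(0.0, 0.5 / (1.0 + lam), size=256))

    # Dual ascent: push lambda up when the constraint is violated on
    # average, and down when the policy is comfortably within the limit.
    log_lam += lr * (batch_cost.mean() - cost_limit)

lam = np.exp(log_lam)
# The actor would then minimize: -Q(s,a) + alpha*log pi(a|s) + lam*Q_cost(s,a)
```

The multiplier settles where the expected cost equals the limit, so the policy pays for safety violations exactly as much as needed and no more.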

Sim2Real

Why can’t we plug and play?

The biggest problems are actuator dynamics and latency.

System identification

This is the easiest approach to sim2real. Just take apart the robot, measure the parameters of every part, and put them into the simulator. Then run RL in the simulator and plug and play.

What’s the problem? Well, there are some things that are easier to measure than others. But if you put in the effort, you can get some good models for things like dynamics and latency. You can also choose to use a neural model of the hardware.
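As a toy example of identifying the harder-to-measure quantities, here is one way to recover an actuator's gain and command latency from logged data. The "logged" signals are synthetic here (the true gain and delay are assumptions baked into the example); on a real robot they would come from commanding the motor and recording the encoder response.

```python
import numpy as np

# Illustrative sketch: estimate actuator gain and latency from logs.
# `true_gain` and `true_delay` exist only to generate fake data.

rng = np.random.default_rng(1)
true_gain, true_delay = 0.8, 7          # delay in control steps

cmd = rng.normal(size=5000)             # excitation signal sent to the motor
resp = np.zeros_like(cmd)
resp[true_delay:] = true_gain * cmd[:-true_delay]
resp += 0.01 * rng.normal(size=resp.size)   # sensor noise

# Latency: the lag that maximizes cross-correlation between command
# and response.
lags = np.arange(0, 20)
xcorr = [np.dot(cmd[: -lag or None], resp[lag:]) for lag in lags]
est_delay = int(lags[np.argmax(xcorr)])

# Gain: least squares on the delay-aligned signals.
x, y = cmd[:-est_delay], resp[est_delay:]
est_gain = np.dot(x, y) / np.dot(x, x)
```

A neural model of the hardware generalizes this idea: instead of a scalar gain and a fixed delay, a small network maps a window of recent commands and states to the measured response.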

Adaptive System Identification

You can have a system identification model which provides the parameters for the physics simulation. How does this get trained? Well, just compare the simulation trajectories and the real trajectories. You can play out the same sequence of actions in each, and measure the discrepancies.

Because some physics models are non-differentiable, you can use a gradient-free approach to optimize the parameters, like Covariance Matrix Adaptation (CMA-ES).
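The loop above can be sketched with a simplified evolution strategy in the spirit of CMA-ES (a real system would use a proper implementation such as pycma, and the full covariance adaptation). The "real" trajectory is generated by a toy 1D point-mass model with hidden mass and damping; those parameter names and values are assumptions for illustration.

```python
import numpy as np

# Fit simulator parameters so that replaying a fixed action sequence in
# sim matches the trajectory recorded on the "real" system.

rng = np.random.default_rng(2)
actions = rng.normal(size=200)          # fixed force sequence, replayed in both
dt = 0.05

def rollout(mass, damping):
    """1D point mass: integrate velocity under force and viscous damping."""
    v, traj = 0.0, []
    for f in actions:
        v += (f / mass - damping * v) * dt
        traj.append(v)
    return np.array(traj)

real_traj = rollout(mass=2.0, damping=0.5)   # stands in for robot logs

def discrepancy(params):
    mass, damping = np.abs(params) + 1e-3    # keep parameters positive
    return np.mean((rollout(mass, damping) - real_traj) ** 2)

# Simplified (mu, lambda) evolution strategy: sample, keep the elites,
# recenter, and anneal the step size (no covariance adaptation here).
mean, sigma = np.array([1.0, 1.0]), 0.5
for gen in range(60):
    pop = mean + sigma * rng.normal(size=(32, 2))
    scores = np.array([discrepancy(p) for p in pop])
    elite = pop[np.argsort(scores)[:8]]      # best quarter of the population
    mean = elite.mean(axis=0)
    sigma *= 0.95
```

No gradients of the simulator are ever taken, which is the whole point: the physics engine can be an arbitrary black box.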

This works well, although it requires manually selecting which physics parameters to tune, and it may overfit to the calibration trajectories.

Domain Randomization

The idea here is pretty simple too. Just randomize the MDP in the hopes that you will cover the distribution present in the real world.
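In code, this amounts to resampling the physics configuration at the start of every episode. The parameter names and ranges below are made up for illustration; in practice the ranges are chosen around nominal measured values.

```python
import numpy as np

# Hypothetical per-episode physics randomization for a legged robot.

rng = np.random.default_rng(3)

def sample_physics():
    """Draw one simulator configuration for the next training episode."""
    return {
        "mass": rng.uniform(0.8, 1.2),        # kg
        "friction": rng.uniform(0.5, 1.5),
        "motor_gain": rng.uniform(0.7, 1.0),
        "latency_steps": int(rng.integers(0, 4)),  # control-loop delays
    }

# Each training episode sees a freshly randomized MDP:
episodes = [sample_physics() for _ in range(3)]
```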

The limitation is that randomization typically produces a more conservative policy: when so many things can change unexpectedly, only cautious behaviors work across the whole distribution.

Domain Adaptation

Domain adaptation is a response to the over-conservatism of domain randomization. In reality, there is only one true environment; there's no need to prepare for every possibility. So can we somehow identify the truth after exploring in the real world?

As it turns out, yes! Just use an environment embedding z which is fed in as part of the state. We encode z from the larger physics-parameter vector μ. We do this compression to get a more meaningful embedding space, but also so that we can search across it later.

During runs in the real environment, we search across the z space.
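The test-time search might look like the toy sketch below. The "real environment" is a stand-in that secretly has a true embedding z*, and an episode return is higher the closer the conditioning z is to z*. On a robot, each call to `real_return` would be one real rollout, so the search must be sample-efficient and gradient-free.

```python
import numpy as np

# Toy recentering random search over the latent z at deployment time.
# `true_z` and the quadratic return are assumptions for illustration.

rng = np.random.default_rng(4)
true_z = np.array([0.4, -0.7])          # unknown real-world embedding

def real_return(z):
    """Stand-in for the return of one episode on the physical robot."""
    return -np.sum((z - true_z) ** 2) + 0.01 * rng.normal()

# Propose near the best z found so far, and narrow the proposal
# radius as real episodes accumulate.
best_z, best_ret = np.zeros(2), real_return(np.zeros(2))
radius = 1.0
for episode in range(100):
    z = best_z + radius * rng.normal(size=2)
    ret = real_return(z)
    if ret > best_ret:
        best_z, best_ret = z, ret
    radius *= 0.97
```

The policy itself is frozen throughout; only the low-dimensional z is tuned, which is why a hundred or so real episodes can be enough.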

There are still limitations. For example, the latent space may simply not contain the real truth. Also, we aren't updating the policy on the real data, which leaves performance on the table; maybe some finetuning goes a long way.

What’s next?

Simulations can't capture the complexities of the real world. This is especially true for things like water, sand, and snow.

The biggest challenge is to learn safely in simulation and then adapt using real-world fine-tuning. This is promising because at the start of training, we don't care about the quality of the demonstrations; we care about the quantity. Later on, however, we care about getting the parameters right.