Google DeepMind robotics lab tour with Hannah Fry
Watch the original video here: https://www.youtube.com/watch?v=UALxgn1MnZo
Welcome to Google DeepMind: The Podcast, with me, your host, Hannah Fry. Now, you might remember that earlier this year I got to sit down with Carolina Parada, who is the head of robotics at Google DeepMind, and she was talking all about taking Gemini's multimodal reasoning and embedding it into a physical body.
Since we were coming to California for this trip to see what Google DeepMinders are doing on this side of the Atlantic, obviously the robotics lab was top of the list. Now, you have to remember that these aren't those fancy pre-programmed robots you see doing backflips. This is something completely different. These robots are open-ended: they understand the instructions you give them and can flexibly respond and adapt to an unlimited number of tasks.
Now, our tour guide for the day is Keerthana, who is the director of robotics at Google DeepMind.
I haven't been into a DeepMind robotics lab since, I think, 2021.
Oh, okay.
I mean, already it looks quite different. You haven't got the privacy screens.
Yeah, they've gone. Yeah.
You don't need them anymore?
Uh, no. I mean, we have the whole lab here in the open.
Is it that they're more capable of focusing?
Uh, yeah. The models are now trained with much more robust visual backbones, so we don't care about the lighting or the backgrounds as much. The visual generalization part of the problem is much more solved than it was four years ago.
Big improvements.
Big improvements. Yeah. Okay. There've been a few big breakthroughs in robotics in the last couple of years and we're excited to show those today.
Yeah. I mean it might only be four years, but it's basically an ocean of time in terms of what's changed.
Robotics looks very different than it did four years ago.
What are the big changes then? I mean, large language models, multimodal models?
Yeah. So basically we want robots to be general, and to be general for human use, these robots must be able to understand general-purpose human concepts. The big breakthroughs in the last few years have come from building robotics on top of these other, bigger models, these large vision-language models, and it turns out they have a great understanding of the world in general.
So the latest robot models are now built on top of that, and we're seeing incredible improvements in how they generalize to new scenes, new visuals, and new instructions. So yeah, robotics is way more general than it was a few years ago.
Because I was talking to Carolina earlier this year and she was saying that it's not even just vision-language models to perceive the scene, but also to plan the actions the robot takes.
Yeah. So basically we developed these things called VLAs, which are Vision-Language-Action models. What we did there is take actions, the physical actions the robot performs in the world, and put them on the same footing as the vision and language tokens. So now these models can model these sequences and, given a new situation, figure out what sequence of actions to take.
We call this action generalization, and even here we've seen massive improvements in the last few years. In the previous release, you saw robots doing more short-horizon things, like picking things up and placing them somewhere else, or unzipping a bag. But to really be useful to humans, you want longer-horizon things. And there we now have an agent that can orchestrate some of these smaller moves into a much longer-horizon task. Say you want to pack your luggage for London: you first want to look up the weather in London. So this agent can check the weather, decide what you need, and then even pack your bag for you.
So it's like you've got this kind of fundamental layer, that sort of foundational model, and then you're building on top and on top and on top until you can chain sequences of actions all together to do a long complex task.
Yeah. And it makes it way more useful, because you don't want just that short-horizon thing. What you really want is for a robot to do the full thing for you. So this agent really brings that other layer of intelligence to the whole thing.
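To make the VLA idea described above a little more concrete, here is a minimal sketch of one common way to treat actions as tokens: each continuous action dimension is discretized into bins, and the resulting action tokens sit in the same sequence as image and text tokens. The constants, binning scheme, and function names below are illustrative assumptions, not Gemini Robotics' actual interfaces.

```python
# Hypothetical sketch: discretize continuous robot actions into tokens so a
# vision-language model can predict them like any other tokens in the sequence.
import numpy as np

NUM_BINS = 256                        # assumed discretization resolution per dimension
ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assumed normalized command range

def actions_to_tokens(action: np.ndarray) -> list[int]:
    """Map a continuous action vector (e.g. joint and gripper commands) to discrete tokens."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    bins = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW) * (NUM_BINS - 1)
    return bins.round().astype(int).tolist()

def tokens_to_actions(tokens: list[int]) -> np.ndarray:
    """Invert the discretization so predicted tokens can drive the robot."""
    bins = np.asarray(tokens, dtype=np.float32)
    return bins / (NUM_BINS - 1) * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

def build_training_sequence(image_tokens, instruction_tokens, action):
    """One training example: [image tokens][instruction tokens][action tokens]."""
    return image_tokens + instruction_tokens + actions_to_tokens(action)
```

Training then looks like standard next-token prediction over these mixed sequences, which is what lets the same backbone carry its visual and language understanding over to action prediction.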
And this is 1.5?
Yep. So there are two capabilities in 1.5: the agent component and the thinking component. And thinking is a word that's been used a lot. For robotics purposes here, what we're doing is making the robot think about the action it's about to take before it takes it. It'll output its thoughts and then take the action. And just this act of outputting its thoughts makes it more general and more performant, because we're forcing it to think about what it's going to do before it does it.
Because you see this in language models, right? Things like "take a deep breath before answering", or chain-of-thought prompting, actually do improve performance. But it's the same in robotics?
It's the same principle that we're applying to robotics and physical actions.
Isn't that weird? [Laughter] Just some of these emergent properties are just so weird.
Yeah. I mean, for robots, doing basic manipulation tasks is really difficult. We do these tasks very naturally, intuitively, without thinking about it. But for robots, it's hard. So getting it to think about these actions before it does them helps. It truly helps the robots.
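As a rough illustration of the "think, then act" recipe just described: the model is prompted to emit a short natural-language thought before its action, and only the action part is sent to the controller. `query_vla_model` below is a hypothetical stand-in for a model call, not a real API.

```python
# Hypothetical sketch of thinking-before-acting for a robot policy.
def act_with_thinking(image, instruction, query_vla_model):
    prompt = (
        f"Instruction: {instruction}\n"
        "First describe, in one sentence, what you are about to do and why.\n"
        "Then output the action on a new line prefixed with 'ACTION:'."
    )
    response = query_vla_model(image=image, text=prompt)

    # Split the free-form thought from the executable action.
    thought, _, action_str = response.partition("ACTION:")
    print("robot thought:", thought.strip())  # surfaced to the user, as in the demo
    return action_str.strip()                 # only this part goes to the controller
```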
Amazing. Okay. Well obviously I want to see... can we go see one of these ones?
Let's go. So let's take a look at the ALOHA robots here. It's going to pack us a lunch with some very dexterous moves, and it'll do a long-horizon task.
Thank you.
So, this is going to pack a lunch box. And this is one of our most difficult tasks, because it needs, you know, millimeter-level precision to grab the Ziploc in the correct way.
Yeah.
And then it'll try to get the bread in that tiny spot.
And it's just all visual servoing?
Oh my gosh, I'm so impressed. I mean, as soon as I said the word impressed, it started flinching slightly. Maybe it's getting stage fright. Yeah.
And does it correct itself?
It'll keep trying.
Hey, you've got lots of cameras pointing at you. I understand. I understand the stress. [Laughter]
The first time I went into a DeepMind Robotics Lab was maybe 2017 or so.
Okay.
And at that point, they had, you know, the big Lego for toddlers, and all they were trying to do was stack one block on top of the other. And honestly, the pile of discarded broken Lego in the corner was illustrative of just how difficult... But this idea of millimeter precision for the bag...
Wow. Look at that.
Nice. Okay. No. No way. I'm so impressed.
Try from the top. Give it another go. You want to see the bread and the Ziploc?
I'll try to do the... Okay. Oh, that is so almost, almost, almost. Wow.
Yeah, that's amazing. That's amazing. Because if it pressed too hard on that, you wouldn't be able to close it.
Yep. And if it's too soft, you're not going to be able to either.
Some more stuff.
I mean, that was easy. The chocolate bar. And now the grapes. Is it going to have a go on a grape? [Laughter] Almost certainly. That's some grape juice going on there.
This is really impressive. So this is the dexterity in action, just how precise it can get. And then it's going to try to close it, I think. Yeah. So it just learns from the data how to do this. This is just end to end.
But it's exactly end to end, as you say, right? Like this is just visual...
Vision and actions.
And what kind of data is it learning from? I mean do you have... it's not going to do the zip is it?
Let's find out.
What kind of data do you give it? So is this based on just allowing the robot to try lots of things or are you simulating?
So this is actually done via teleoperation. We kind of embody the robot and do the task through the robot, and it learns from that perspective. And it is going to... so it can pack you some lunches.
So you've demonstrated to it this is what it means to do it correctly.
Yep.
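A compact sketch of the imitation-learning setup this implies, under the usual behavioral-cloning assumption: each teleoperated demonstration yields (camera image, operator action) pairs, and a policy is trained with supervised learning to reproduce the operator's actions. The tiny network and shapes below are purely illustrative.

```python
# Hypothetical sketch: behavioral cloning from teleoperation data.
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    def __init__(self, action_dim: int = 14):   # e.g. two 7-DoF arms, as on ALOHA
        super().__init__()
        self.encoder = nn.Sequential(            # stand-in for a real vision backbone
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.head = nn.Linear(16 * 4 * 4, action_dim)

    def forward(self, images):                   # images: (batch, 3, H, W)
        return self.head(self.encoder(images))   # predicted actions: (batch, action_dim)

def behavioral_cloning_step(policy, optimizer, images, teleop_actions):
    """One gradient step: push the policy's output toward what the operator did."""
    loss = nn.functional.mse_loss(policy(images), teleop_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy data standing in for a real teleoperation dataset.
policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
behavioral_cloning_step(policy, opt, torch.randn(8, 3, 128, 128), torch.randn(8, 14))
```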
I see. All right. Thanks. [Laughter] That was so cool. I sort of want to give you a high five but your hands are quite pointy.
Yeah. Not these [laughter] ones. Okay. So we saw dexterity here. Let's take a look at another demo where we'll showcase the generalization capabilities of these robots. We talked about how VLMs are general world understanders. So, we'll see that on robotics now.
Cuz that one was a task that it does over and over and over again.
That one was more about the dexterity. Here it's more about the generalization. So, Colleen here is going to show us what the robots can do in a more general setting.
So, uh here we have our robot running a general policy so it can kind of interact with the objects and you can just speak to it because we have a Gemini layer on top. So, for example: Hey, can you put the blue block into the blue tray?
I am putting the blue block into the blue tray.
It's chatting while it does it.
Yeah. So, if you want to just ask it to do something, it's a push-and-hold mic.
Yeah. Can you put the green block in the orange tray, but do it as Batman would?
[Batman voice] Nice. I cannot perform actions as a specific character. [Laughter] However, I can put the green block in the orange tray for you.
Fine. [Laughter]
I am now moving the green block into the orange tray.
[Snorts] Okay, that's cool. So, if this is completely generalizable... okay, so I have a stress ball that I travel with, and it's never seen this before.
Yeah. So, if I do that in the scene...
Okay.
Um, and that's a pot, right?
Right. This is a little container and it lifts open.
Okay. Let's try this.
Open the lid of the green pear.
I'm getting started on opening the lid of the green pear.
It's going to be tricky.
That is difficult, right? That's small.
Amazing. Place the pink blob inside the green pear.
I'm working on placing the pink blob inside the green pear.
Difficult.
Oh, nice.
Go on. Go on. I want it to succeed so much. Oh, squishy. [Laughter] [Gasps] Yay. [Clears throat] Okay. Put the green pear lid back on the pot. Yes. Yes.
So impressed.
You know what? You look like a proud parent. [Laughter]
No, it really feels that way sometimes.
Amazing. This is amazing. Never done that before.
Yeah, that stress ball, it's never seen it before.
Yeah, the open-endedness of this is really extraordinary.
Yeah. So now we can chain together some of these short tasks into a long-horizon task, and it's way more useful, because as you saw, with a short-horizon task it can only do parts of it. But once you can string them together to do something more impressive and longer-term, then we get more useful tasks. So instead of instruction, instruction, instruction, you can just tell it to do something.
You can ask for some high-level thing and then an orchestrator will break that down into smaller instructions for the VLA and it will do the whole thing for you end to end. And we can see that now here.
Okay. I'm in San Francisco and I don't know the rules about sorting trash. Can you look it up for me and then tidy up?
In San Francisco, you're required to separate your waste into three categories: recyclables, compostables, and trash, each with its own color-coded bin.
Nice. Nice. Oh no. Oh yes. Yes. Yes. Wow.
Now I will put the rubbish into the black bin.
So it's chaining the... Yeah, you can see how the agent can orchestrate a few more tasks and make it way more useful.
So in terms of the architecture of that then, how does it work? I mean do you have sort of a separate system sitting on top that's giving instructions?
Yeah, we have two systems. One is our reasoner model, which is better at reasoning, and it orchestrates the other model, our VLA, which does the physical actions. Both of these come together to do these long-horizon tasks.
Okay. So the VLA being the Vision-Language-Action model and then the reasoner model being the...
A VLM, just a vision language model that's designed to, you know, be better at these kinds of tasks.
It's doing the reasoning.
Exactly.
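A rough sketch of that two-model layout: a reasoning VLM breaks the high-level request into short, concrete sub-instructions, and the VLA turns each one into low-level actions against the current camera view. Every callable here (`reasoner`, `vla`, `act`, `get_camera_image`) is a hypothetical placeholder, not a real Gemini Robotics API.

```python
# Hypothetical sketch: a reasoner VLM orchestrating a VLA for long-horizon tasks.
def run_long_horizon_task(request: str, get_camera_image, reasoner, vla, act):
    history = []
    while True:
        image = get_camera_image()
        # The reasoner sees the scene, the goal, and what has been done so far,
        # then proposes the next short-horizon step (or declares the task done).
        step = reasoner(
            image=image,
            text=f"Goal: {request}\nCompleted steps: {history}\n"
                 "Reply with the next short instruction, or DONE.",
        )
        if step.strip() == "DONE":
            return history
        # The VLA converts that short instruction into low-level robot actions.
        act(vla(image=image, instruction=step))
        history.append(step)
```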
I think if we're going for full science fiction future though, you don't want just arms.
You want the full humanoid.
You want the full human.
Let's go check out the humanoid lab.
Okay. Yes, please.
All right. So here we have a robot that will sort laundry for us. It'll put the dark clothes in the dark bin and the white clothes in the white bin. These are Stephanie and Michael, who are going to run the demo. And the cool thing is you can just read the thoughts of the robot as it's doing it, and you'll see, you know, what it is thinking. This is our thinking-and-acting model, where it'll first think and then take the action.
You get an insight into its brain.
Yes. You can look at what it's thinking now.
So, this is every time step, is it?
Yep.
I got you.
You want to throw in a few more clothes?
Absolutely.
Go for it.
Let's do it. Let's get a red one in there.
That's a red one in there. So, do not put that in the white one. Thank you.
So, do you have a system sitting on top of it that's kind of making these decisions? I mean, how does it work? Is it like hierarchical?
This one is pure end to end. It's thinking and acting in the same model. There's no hierarchy, so it's very closed-loop.
"The bottom clothes from the table... red cloth... the black box."
Beautiful. Nice. Beautiful. I mean, I would probably wash that separately, but [laughter] you do you.
Hop in no time.
Okay. So then if this is end to end, how do you extract this information out? Is it just outputting actions?
So the beauty of this is it's outputting both its thinking and its actions. So think about how Gemini outputs its thinking before it outputs the response to the user. This is doing kind of something similar.
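In contrast to the two-model setup shown earlier, a minimal sketch of this single end-to-end model might look like the loop below: at every control step the same model sees the latest frame and emits both a thought and a short chunk of actions, with no separate orchestrator. The output format and the `model` call are assumptions for illustration only.

```python
# Hypothetical sketch: one model, closed loop, interleaving thoughts and actions.
def closed_loop_control(instruction, get_camera_image, model, execute, max_steps=500):
    for _ in range(max_steps):
        frame = get_camera_image()
        out = model(image=frame, instruction=instruction)
        # Assumed output: {"thought": str, "actions": [low-level commands], "done": bool}
        print("thinking:", out["thought"])    # the text shown on screen in the demo
        for action in out["actions"]:         # execute a short action chunk, then re-observe
            execute(action)
        if out.get("done"):
            break
```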
Oh yeah. No, it's actually truly exciting. Like it's a different way of doing robotics that I feel like is...
Is very exciting to us. Yeah. All right. So here we have the robot where we showcase its generalization capabilities. And this is Kana, one of the researchers working on it.
So, let's just see what he can pick up and maybe he can pick something and put it in one of these things.
Um, I'd like the plant in the basket.
Also, none of these objects were seen by the robot during training.
So, they are completely new.
Many of them we bought just yesterday.
Oh, really? That's true. We went to Target and we bought a bunch of things yesterday. [Laughter]
So, this is about how the robot can handle completely new objects, things it's never seen before. Here we go.
Hi. Do it. Oh, that is quite tricky to pick up, isn't it?
Yeah, it's like kind of sliding away. Okay.
I'm just not sure if...
Nice.
The scruff of its neck. You did it.
What next?
Okay. I'd like um... I'd like the Doritos in the hexagon.
And you can move it as it's trying to do it. And you can see it... yeah. Trick it. Okay.
Oh. [Laughter] I hope you weren't planning to eat those.
The amazing thing is, okay, it's still a bit slow and it doesn't get it right 100% of the time, but you can see that it's on the right path, right? I think that's what feels very different from the last time I came to one of these labs. You can see the intention behind the actions, and it's generally trying to do the things that you're asking it for.
Do you feel as though the stuff that you're doing now isn't going to be thrown away and scrapped for a whole new technique, or do you feel like you're building the sort of foundational blocks?
No. Yeah. I think these are the foundational blocks that will lead to the final picture of robotics. So, we'll just have to build on top of this.
In that building on top, do you think it needs another revolution? Like, do we need another architecture or do you think that we've got enough already?
You know, I think we need at least one more big breakthrough. Like even now these robots, they take a lot of data to learn these tasks. So we need a breakthrough where they can learn more efficiently with data.
So do you think that's the only limiting factor then? Do you think if you had many, many more orders of magnitude of data, like you do with large language models or vision-language models, this would be sorted?
Uh, there is one hypothesis that that is all you need: if you can collect that much robot data, then we're done, we can pack it up. But there's still a long tail of problems to solve. They have to be safe, you know, they have to really master the task. So there are still challenges, but the core of the problem is still robot data, this physical interaction data, you know, what it feels like to do all of this stuff. It's just limited; it's not as big as the internet.
So right now we still have to collect all this experience on robots, but there is a lot of manipulation data that is collected by humans. Humans posting videos about how to do anything. We should be able to learn from that at some point and really increase how capable robots are. This is very unstructured. Like solving robotics, general manipulation is a very unstructured problem.
Yeah. And completely open-ended in terms of the type of things you could potentially ask it to do.
Amazing. [Music] I'm so impressed. Well done.
Sometimes these robots are a little bit on the slow side, right? Sometimes they're a bit clunky. But you have to remember that this idea of having a robot that can understand semantics, that can get a contextual view of the scene in front of it, that can reason through complex tasks... this was completely inconceivable just a few years ago.
And okay, there may still be some way to go, but the progress here is really limited by the amount of data that we have on physical interactions in the real world. Solve that, break through that barrier, and I don't think you're just going to be watching robots sort laundry. I think we could be on the cusp of a genuine robot revolution.
You have been watching and listening to Google DeepMind: The Podcast. If you enjoy this little taste of the future, then please do subscribe on YouTube so you won't miss an episode. See you next time.