Training General Robots for Any Task: Physical Intelligence’s Karol Hausman and Tobi Springenberg
Watch the original video here: https://www.youtube.com/watch?v=OJCT-HGxPjk
Just the fact that this whole thing works is kind of mind-blowing. You build this loosely brain-inspired thing that has a very general-purpose learning algorithm. You feed it data and it somehow gets it, and gets it way better than anything we've ever had before. And this applies to robots, and it applies to vision and language and sound and all kinds of other things. If you stop for a second and just think about how it works, and that it works, it's absolutely mind-blowing.
In this episode, we sit down with Karol and Tobi of Physical Intelligence, a company building foundation models for robotics. Karol and Tobi explain why the classical approach of breaking robotics down into perception, planning, and control was fundamentally wrong, and how end-to-end learning with reinforcement learning is finally making deployment possible. You'll hear how they achieved robust real-world performance, getting robots to make coffee for 13 hours straight, and how these models generalize across radically different tasks, from surgical robots to drone flying, in ways that we don't fully understand. We also talk about the technical insights behind Pi Star 0.6, Physical Intelligence's newest model, which learns from experience using reinforcement learning. Enjoy the show.
Karol, Tobi, thank you so much for joining us here today.
Thank you for having us. Excited to talk everything Physical Intelligence, general robotics, etc.
Maybe before we get into it, just for our audience, can you share a little bit about what Physical Intelligence is and the mission that you're after?
Yeah. So at Physical Intelligence, we are building robotic foundation models. These are models that, in principle, should be able to have any robot do any task. Over the past year and a half or so, we created the right building blocks that show how these models can scale. We've shown that they're able to control many different robotic form factors, many different types of robots. We've also shown that they're able to generalize, so you can bring them to completely new environments, and what it takes for them to generalize. And this last release that we just had, called Pi Star 0.6, which we also wanted to tell you more about, shows how we can get them to good performance so that they're starting to become deployable.

And this is really important to us because we want to see this technology actually deployed in the real world, but also because we don't have the benefit of free data on the internet. There is no data of robot actions, so we need to create the datasets ourselves. So we are after the problem of physical intelligence, the problem of creating foundation models for robots, and we've made quite a lot of progress.
Wonderful. And can I ask why the decision to build foundation models, as opposed to the companies that are building fully vertically integrated robotic products right now? The Sunday Robotics launch last month is in the back of my head: you can buy a cute little robot helper for your household. There are companies working on cooking robots. There are obviously the humanoid companies. Why build a foundation model versus building a robot yourselves?
Yeah. So I think if you look at the history of robotics, it's very clear to me, and I think to many roboticists, that we've always been bottlenecked on intelligence. We've had robots that are capable of doing incredible things, whether in the home or in industrial settings. We've seen robots more than a decade ago that, if teleoperated, can clean the entire house. The really important caveat is "if teleoperated." If there is a human mind behind it, it's clear that the hardware is capable of doing lots of different things. And for a very long time, robotics companies have been structured the way you described, where you think of creating a specific robot that's designed to do just a single task or a single application. Instead, what we thought would really help the field is to focus on the bottleneck, on the intelligence. So we created a company to focus on that bottleneck, because we think that if we address it, we can actually make robots happen. And if you do it any other way, you're basically not making as much progress on the bottleneck as you could be. So we thought we would target this problem head-on. Focus on the intelligence, and if we can do that, it will lead to many different vertical products. It will lead to robots being deployed in the home, in industrial settings, basically anywhere.
Can I just pressure-test that a little bit? On the hardware side, I've seen the latest videos, for example, of the Optimus hand. It's exquisite. It's a piece of art. And I hadn't seen the videos of teleoperated robots cleaning houses 10 years ago, but I'm wondering if there's a set of tasks that's maybe now just on the cusp of becoming possible, for example, cooking, or being able to peel and dice an onion, that you couldn't have done with hardware prior to where we currently are. So, how much of a "why now" do you think hardware is or isn't?
So, there's been a lot of progress in hardware, especially in humanoid hardware like dexterous hands, as you mentioned; they're much better now than they were even a few years ago.

But that still doesn't address the bottleneck. We could have had robots chopping vegetables or cooking even with simple grippers before. The problem is that we don't have the intelligence to operate these robots. And more complex hardware doesn't really resolve that bottleneck, right? It potentially allows you to do more, but you're still limited by the fundamental challenge of robots not being intelligent enough.
I see. So hardware may raise the ceiling on what you're able to do, but on the capability floor, we're not even there yet.
That's right. So even with simple robots we are not yet at the level of a human operator.
So the limit is the intelligence layer. What's the limit to developing the intelligence? Is it collecting data? Is it doing it cheaply? You've broken down the problem, and we're going to keep asking why and drill down further. So what's the next layer: what's the bottleneck for solving intelligence and generalization?
It's a good question. We think about it in terms of three factors, which we refer to as capability, generalization, and performance. With capability, our idea was that we want to get to the point where, as long as you can collect data for something, for a task or for a robot, you should have a model that's able to replicate it, to automate that task. This is something we got to fairly quickly. This was our Pi 0 release around a year ago or so, showing that it's basically possible: if you can collect data for any task, for any robot, the model should be able to learn it.

The next challenge is around generalization, and this is still an open challenge. We wanted to get to the point where the robots just work zero-shot: you can bring them to a new home, for instance, and they should know how to operate in that home. And this is a really difficult problem, right? If you put a robot in a new home, it needs to understand where different items are, that the counters look different, that the lighting is different from what it's seen in the past, and so on. I wouldn't say this problem is solved, but I think we're starting to get a handle on how to solve it and how it scales. And the only answer to generalization that we know in machine learning is through diversity of data. If you see a lot of different, diverse datasets, you should be able to generalize to a setting similar to ones you've seen. This is something we showed with our Pi 0.5 release in April of this year: we got to the point where we can bring a robot to a new home that it's never been to before, and it's able to operate in that home. It's not perfect yet, but at least it has some kind of common sense about how to go about simple tasks like cleaning up the kitchen.

And then the last challenge, which is also not fully solved yet, is performance. How can we get these models to the point where the performance is good enough that we can actually deploy them? Deployments here are really important because, as I mentioned before, we also need to gather data, and I think that is going to be the most scalable way of collecting it: you'll have robots out there in the world doing economically valuable tasks, so the cost of that data collection is basically negative. And the more broadly you can deploy this technology, the more data you'll be getting. In the limit, I think that will be the biggest source of data you can imagine, much bigger than internet data, for instance.
And how far away do you think we are from the generalization or performance level needed to deploy, maybe in a controlled environment, maybe in a general environment like homes or offices, but not the whole world? If you could limit the scope, where do you think generalization and performance need to be before we can deploy these kinds of robots?
I think we are actually fairly close to deploying these robots. We've started deploying them ourselves already. We thought it was going to take something like five years to get to the point where the technology is actually ready to deploy a robot in a commercial setting and have it do something valuable. But we did it, I think, two months ago or so. So I think we're now getting to the threshold where the models are useful enough, performant enough, and can do a wide enough variety of tasks to be actually useful. That's a really exciting moment; I think we just crossed that threshold. It's still to be determined how wide the aperture is of where we can deploy. There are some tasks where failure can be really catastrophic; maybe those are not the best tasks to deploy on just yet. There are some tasks that require a ton of generalization, like deploying in homes, or that have privacy or safety concerns, and so on; maybe those are not the best places to deploy just yet either. But the aperture is growing. As we collect more data and these models get better, we can deploy them in more and more settings. So I think we're starting to get there.
Where is the current aperture that you're deploying right now?
This is a really difficult question to answer, because with these foundation models, sometimes you don't fully know. It's similar to large language models: you train the model, you cook it in-house, you do the best job possible, and at the very end you get this artifact, and you can't really predict how good the artifact is going to be. You have to test it, and that's where we are with these models as well. For instance, we open source them so that we're not the only ones testing them and we're not the bottleneck in knowing what their capabilities are. And by open sourcing them, we see them being applied to many more applications than we could have imagined: things like driving, or surgical robots, or agriculture, and places like that. So I don't have a very good estimate of what the aperture is. I think it's wider than what I had expected, and I think it will keep growing over time. The more data these models get, the more mature they get, the more the aperture will continue to grow.
I would add, on the performance level: as you said, the aperture, the set of starting points, is probably wider than we thought. But at the same time, if you want each of those starting points, for each of those applications, to be at a level where people would want to use this day-to-day to run their businesses, there's probably still quite a bit of hill climbing to do in terms of performance, right? With this release that we're going to talk about in a bit, Pi Star, we've made progress on learning from experience data, getting that data back and making the models better when they are deployed. Still, for a lot of things, I can imagine there'd be lots of scenarios with a really long tail of things that can go wrong, or that you can encounter, that we don't yet have a great grasp on how to completely solve, I would say.
And you guys have been really great about publishing your results with a lot of transparency and releasing open source. So, whatever you're comfortable sharing, can you talk about what your overall technical architecture, so to speak, is? And do you think the architecture to get to this promised land is pretty much baked, and it'll be variations on the theme of where we are and we just need to collect a ton of data? Or do you think the architecture is still being figured out?
Maybe we can start by discussing where we're at now, and then go into how that might change. At the moment, the architecture is very analogous to how VLMs are built, the models most of you probably interact with on a day-to-day basis: you type something in, put an image in, ask it to read what's in the image, and so on. We started from the same standpoint: there's a model that's trained on internet-scale data, that has ingested image data and text, and we're adding all this robotics data. Our training is actually now predominantly on robotics data, data that we have collected ourselves. We have a little bit of internet data in the mix, but the majority is robotics data. The architecture is this vision-language model, and we add something on the side which we call the action expert: the part of the model that actually has to drive the robot, that looks at the image and the instruction it's getting, has to perform the task, and sends commands to the robot. Broadly, it's a transformer model, fairly large, up to a few billion parameters at this point, that we pre-train on our robotics data and on internet data. Initially it's trained largely on human demonstration data. Karol mentioned this earlier: teleoperated data of humans getting the robot to do things. So that's what the architecture looks like now. Roughly, the scaling we're getting is from scaling our data, and we use models similar to what comes from the VLM world.

How that might change, I think, is an open question. I think there are lots of opportunities in adding more capabilities to these models, which we're also exploring. You can imagine you might want more context in these models. You might want more cameras added to the robots, which the model then needs to be able to use. You might want a better understanding of the physical world, in the sense of understanding exactly what's in the room, what can break, what is easily movable, and so on. So there's lots to be done in those capabilities, and also in changing the architecture around. I wouldn't be surprised if in five or six years we look back and say, you know, the backbone of the model we used at the time, which currently comes from this VLM land, has changed; maybe we've moved on and use something slightly different. I think that will evolve over time, but the foundation, the data and how we bring it into the model, will probably stay like this.
Got it. And should I think about it as pixels or signals in, and then actions out? Like a single big neural net?
It's one big model. Yeah. It's basically images and text in, text and actions out, at this point. Yeah.
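To make that interface concrete, here is a minimal PyTorch sketch of a vision-language-action model of the kind described: images and an instruction in, a chunk of future actions out. Every layer, name, and shape is an illustrative assumption, not Physical Intelligence's actual architecture.

```python
# Minimal sketch of the VLA interface described above: images and a text
# instruction go in, a chunk of future actions comes out. All names and
# shapes are illustrative assumptions.
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    def __init__(self, d_model=512, action_dim=14, horizon=50):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 224 * 224, d_model)  # stand-in for a ViT
        self.text_encoder = nn.Embedding(32_000, d_model)        # stand-in for an LLM embedding
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # "Action expert": a head that decodes a chunk of future actions.
        self.action_head = nn.Linear(d_model, action_dim * horizon)
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, image, instruction_tokens):
        img_tok = self.vision_encoder(image.flatten(1)).unsqueeze(1)  # (B, 1, d)
        txt_tok = self.text_encoder(instruction_tokens)               # (B, T, d)
        feats = self.backbone(torch.cat([img_tok, txt_tok], dim=1))
        # Predict the next `horizon` actions (roughly 1-2 seconds of control) at once.
        return self.action_head(feats[:, 0]).view(-1, self.horizon, self.action_dim)

policy = ToyVLA()
actions = policy(torch.randn(1, 3, 224, 224), torch.randint(0, 32_000, (1, 12)))
print(actions.shape)  # torch.Size([1, 50, 14])
```

The point of the sketch is the I/O contract: one network, multimodal tokens in, a whole action chunk out.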
And do you have a separate locomotion versus manipulation stack? Maybe this is a good time to talk about the historical evolution in robotics and the various waves of learning, and how it pertains to your stack.
Yeah. So for a long time, even before learning arrived here, people thought that robotics was one of those problems where, if you put enough people on it, enough engineers, they could think really hard about it and eventually write the code that would have the robot do anything in the world. People tried really hard to do it this way, and it turned out that the world is just way too complex. You can't write down every single case you'll encounter in the real world. So that doesn't work. And as we were working on that version of the problem, what ended up happening is people did what they usually do: they broke the problem down into smaller subproblems. Rather than working on the full robotics problem, you would say there's a perception aspect of the problem, a control aspect, a planning part. And this almost grew into different communities: there's a planning community, there's a controls community, with their own conferences, their own problems, and all of that. Then, as we realized it's not really possible to handwrite all of these rules, people thought we should learn them, learn them from data, which seems like a really good idea, right? This is how we learn too. But what ended up happening is that they started learning each of those broken-down components separately. You would have a perception layer that's fully learned; maybe a control layer that's learned; maybe a planner that's learned. And that showed some progress; it was better than what we had before. But then it turned out that breaking the problem down into these subcomponents was exactly the piece that doesn't work, because when I try to pick up this glass, I don't think about it in terms of perception and then planning and then control. I just go for it. I just pick up the glass, and it's all very natural. So it turned out that the pipeline approach, where you have these predefined interfaces, where perception gives you the position of the object, the planner gives you the trajectory, and the controller executes it, those interfaces are the pieces that broke down. Everything we thought we knew about how we work was wrong.

So then we arrived at the next stage, where we said: well, maybe breaking down this problem was a bad idea to begin with; let's just train the whole thing end to end. We take whatever sensory inputs there are as input to the network, and actions as the output. That's what we refer to as the end-to-end approach, where you try to go straight from pixels to actions, and we have the network, the learning algorithm, figure out how to split it into these different components, if that's even possible. And while we were doing that, we found it actually requires a ton of data, and it often breaks when some kind of common sense is required. Gathering that common sense through first-person action datasets is really, really hard, because you would need to experience every single thing in the world. And that's where we stumbled upon vision-language-action models, where we can use models that were pre-trained on internet data and already have a pretty good understanding of how the world works. We can utilize that knowledge so we don't need to experience everything firsthand: you add an action component on top, keep the common world understanding, and connect it to how to actually perform things in the world.
I see.
And that's more or less where we're at today. Now, at Physical Intelligence, we figured out a few other things: how do you start to scale these models, how do you get them to generalize, how do you get them to perform much better and move much faster, how do you get them to the point where you can start deploying them. But I think largely we're still in this era of: how do you bring in common-sense knowledge from internet pre-training, and how do you make these models general enough that they can work on any robot and perform its motions?
And can I ask about something like reasoning? There's so much happening on the reasoning side of the large language model space. Do you get the benefits of that as part of your VLM backbone? Does reasoning emerge as a consequence of training these models end to end? Or, thinking about the advances happening in the LLM world, do they benefit you or not?
I mean, I think the models we have today are definitely already planning actions, not just the immediate action, but the next 50 things to do, the next 50 time steps. In some sense it's a very short horizon: 50 steps means a second or two, right? And it also already decomposes tasks into subtasks in language space. When we ask it, "clean the kitchen," the first subtask it might pick is: I have to drive to the counter, then pick up the glass, move the glass into the sink. So it already has those aspects in some sense. It decomposes tasks into subtasks, because it gives itself its own subtasks, and it predicts a bit of a horizon of how the actions go. Some of it is already there. In the future there will probably be more of it. I fully expect that the advances in RL training for reasoning will also make their way into robotics. Yeah.
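As a sketch of that two-level loop, the model writing its own subtask in language and then predicting a short chunk of low-level actions for it, here is roughly what inference could look like. `propose_subtask`, `predict_action_chunk`, and the `robot` interface are hypothetical stand-ins, not Physical Intelligence's API.

```python
# Hypothetical inference loop for the behavior described above: the model
# decomposes the task in language space, then predicts ~50-step action
# chunks (a second or two of control) for the current subtask.
def run_task(model, robot, task="clean the kitchen", chunk_size=50):
    while not robot.task_done(task):
        # High level: the model gives itself its own subtask in language,
        # e.g. "drive to the counter", "move the glass into the sink".
        subtask = model.propose_subtask(robot.camera_image(), task)
        # Low level: the action expert predicts a chunk of future actions.
        actions = model.predict_action_chunk(robot.camera_image(), subtask,
                                             horizon=chunk_size)
        for a in actions:          # execute the chunk, then re-plan
            robot.send_command(a)
```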
And I think it's interesting to think about, because it's maybe a little different than the RL for math problems that people do, for example. Those are very easy for us humans to think of as textual problems: you think through them in your head in text. Okay, if I change this formula this way, I'll get this outcome, and so on. For the physical-intelligence part, it will probably be a bit more than that. It's like when you try to learn a new sport. I recently started trying to learn how to play tennis, and I don't think through, in my head, "I need to grab the racket, move it here, do this swing." It's more that you think through the motion itself: how does your body move; maybe you plan, in some sense, trajectories of the objects around you in your head. Those are the things I think we'll see come into the models more over time.

Yeah, I suspect that too. Right now we're in a place where we benefit quite a bit from vision-language models. I think it's very likely that's going to reverse. A lot of the shortcomings we see in LLMs today are baked in because we're focused on the text problem, on problems like math and coding. And I think robotics will offer a new avenue where you need to rethink how to think about reasoning. Reasoning should probably happen in some kind of abstract space, where you can reason a little bit in text, a little bit in images, maybe in trajectories, or in all kinds of different spaces, to arrive at the answer. Robotics provides this really nice test bed where you're grounded in the physical world. There is not that much data yet, so you need to deal with the difficulties that come with that. But I think it will produce new findings that will then be reapplied to the LLM world.
Speaking of data, give us a sense, however you measure it, of the magnitude of data you've already collected and how much you'd like to collect in the next year. I'm sure more is better, but what's the magnitude we're talking about?
Yeah, data is one of those things that's actually fairly nuanced. It's not just a matter of quantity. Quality obviously matters, but so do things like diversity. And even when you think about the quality or diversity of robot data, these are not very strictly defined terms. If you go for the same task in ten different ways, is that diverse data or not? How does it compare to going for ten different glasses? This is something I don't think we as a community fully understand: how to characterize the data, how to describe its diversity and quality, how to make it rigorous. We're also finding that there are aspects of the data that really matter. For instance, if you want to get to a certain performance on a task, you're not going to get there by just increasing the quantity of the data you already have. We've been working on three different tasks for the Pi Star 0.6 release, and we noticed fairly early on that if we just keep collecting more and more data the same way we've been collecting it so far, the performance plateaus. You're not going to just keep getting better. So you need to find either new ways of collecting data, or you need to start thinking about what kind of data will result in better performance. And this is where things like reinforcement learning can really help.
Let's talk reinforcement learning, and let's talk Pi Star 0.6. Is the star a nod to Q-star, or...
Effectively, we're trying to get to policy star: the optimal policy.
Policy star. Okay. Wonderful. Can you maybe say a word on what you're doing with Pi Star 0.6, and then we can dive into what RL means for your world?
Yeah, for sure. To contrast it with what we talked about earlier: the main difference is that up to this point, basically all of the robotic foundation model learning we've done was from demonstration data, teleoperated, going into the model, and the model is trained to just imitate that data. With this new model, Pi Star 0.6, what we're using is RL from experience that the robot collects itself by actually running a policy. We start with the initial demonstration-trained policy, and then we deploy it. The robot tries to actually solve the task, and it additionally gets reward signals given by humans. It can also get corrections, where a human intervenes and says: actually, this is not right, let's do this a little differently. That data is collected and comes back in, and the model uses it to figure out which of the data it should reinforce and do more of, and which it should do less of, and it basically improves itself over time. That's the big distinction, and having that stream of real data coming in is the missing piece that Karol was talking about, the one that allows us to escape the plateau we were otherwise getting to.
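A rough sketch of that loop, assuming a simple advantage-weighted imitation recipe; the actual Pi Star 0.6 algorithm may differ in its details. Episode fields like `returns` and `is_human_correction`, and the `train_weighted_behavior_cloning` call, are illustrative names.

```python
# Sketch of one improvement step from experience data: human corrections are
# trusted outright, and autonomous steps are up- or down-weighted by how much
# better they did than the value function expected. All names are illustrative.
import numpy as np

def improvement_step(policy, value_fn, episodes, temperature=1.0):
    weighted_batch = []
    for ep in episodes:
        for t, (obs, action) in enumerate(zip(ep.observations, ep.actions)):
            if ep.is_human_correction[t]:
                weight = 1.0                       # "do more of this"
            else:
                # Advantage: observed return minus the predicted value.
                advantage = ep.returns[t] - value_fn(obs)
                weight = np.exp(advantage / temperature)
            weighted_batch.append((obs, action, weight))
    policy.train_weighted_behavior_cloning(weighted_batch)  # hypothetical trainer
    return policy
```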
Yeah. And I guess in my brain, I think of RL as hill climbing on your reward signal. So how do you make sure you're generalizing as you hill climb on these specific tasks?
The way we're thinking about this, for this specific kind of problem, is: you have this general model, and it achieves some performance that isn't great. Now your first goal actually isn't to further generalize; you want to solve this specific task first. So we deploy it, and we've picked three or four tasks, so the method still has to generalize across tasks. But when you're actually deploying it and starting this RL process, what you really care about is: let me make sure I nail down this task, and nail it down in a way where I can solve it from many different starting positions and deal with the whole long tail of failures I will encounter. In some sense, generalization and performance here may seem at odds if you look at it as "wait, now you're just doing this one task." But at the end of the day, we have the same method, the same process, that deploys to each of these tasks and gets the performance high. Then we have all of that data across all of these tasks, and we bring that data back. In that sense it's not actually at odds, if that makes sense.
Yep, that makes sense. How much of the RL are you doing in the real world? It sounds like this is in-real-life RL. Can you talk a little bit about your approach to how much RL you do in sim versus in real life?
We have taken quite a real-world-first approach, as opposed to using sim. We're exploring sim as a research tool, of course, but all the RL we've done for the Pi Star 0.6 paper is on real systems in the real world. The reason is that it's really hard to model, again, the long tail of failures you see when you do deployments. I can give you a lot of examples from the tasks we looked at for this release, where we saw failure modes that you might not have seen if you had just simulated the task. One of our tasks is building a box. This is an actual deployment task where the goal is to build little cardboard boxes that chocolate gets put into, so they can be packaged up and sent out; it's building a chocolate box, basically. Building this box initially worked great. Then new shipments of boxes came in, and they arrive as flattened sheets of cardboard. The cardboard in this new shipment wasn't perfectly perforated, so the sheets were sticking together. The robot starts grabbing them, puts them on the table to try to build the box, and suddenly has two boxes on the table. This is something that wouldn't happen in sim if you had written a nice simulator where you just get individual pieces of cardboard and fold them. Now you have to deal with this problem, and if you had learned everything in sim and then tried to deploy, you wouldn't have encountered it. We encounter it, and our method can figure out: actually, what I need to do is separate these, move that second piece back, and then build the box. And we do see a lot of success with RL applied in sim and transferred to the real world, especially in locomotion.

We haven't really seen that kind of success in manipulation with those methods, and I think one reason is that with locomotion, with trying to move around, the biggest part of the problem seems to be modeling your own body. If you can figure out how to model yourself as a robot, you're almost there. You can do this modeling and simulation exercise once, because you only have to do it for yourself, for this one robot, and then you're basically done; if you do it really, really well, it should transfer. With manipulation, however, the problem is not how you move your own body; it's how the world reacts to it. You're actually changing the world around you. It's not difficult to figure out how to move your hand from A to B; it's difficult to figure out how that affects the objects you're interacting with. And now the problem is no longer just modeling your own robot: you have to model the entire world, every single object you might interact with, every single task you can think of. That's where we see scaling problems, and that's why, I think, we haven't seen those methods be as effective in manipulation.
What was the headline result from Pi Star 0.6? Where do you see the model get to, after RL, on the tasks you cared about, and what do you think that means for your overall training recipe going forward?
Yeah, for me personally, the most impressive thing honestly was just seeing these models run for hours at a time, recover from lots of different failures, and basically just keep going, and at the same time do that at a rate that's much better than the initial model we started with. The headline figure is that we increased the throughput of the policies by over 2x on these three tasks. One task was the box-building task I already talked about; one was making coffee with an actual industrial-scale espresso machine; and the other was folding laundry. For each of them, we managed to make the base policy, trained just from demonstrations, much faster, and also much better at recovering from failures. Seeing that in action, when you sit there, and if you go to our website you can watch the videos, we have the robot serve coffee for 13 hours in a row, or fold laundry for 4 hours, things like that. Actually seeing that live changes the way you think about these models. It changes the way, at least, I think about it being realistic that we can deploy them, and do it in a way where it's not just a toy demo that's shown once, but actually doing the real thing fully.

And that's been a real challenge in robotics that I don't think many people are aware of. You see so many videos of robots doing cool things, and we post these videos too. For basically anything you want a robot to do, there's probably already a video of a robot doing it. But you can take as many takes as you want; you can keep recording until you get the perfect shot. The problem everybody encounters is the reliability of these models: how performant they are, how fast they can go about the task, for how long you can actually deploy them without failure. I think this is the biggest bottleneck in deploying these models in the real world, because if they break every other trial, they're not really deployable. And this, I think, is the most important breakthrough for us with the Pi Star 0.6 release: we can actually start getting to a place where they are deployable, where we use these robots in our office to serve us coffee, or we can give them to people at PI to fold laundry in their homes, or we can deploy them to fold boxes for real. And that is really exciting.
Should we think about what you're doing with reinforcement learning as primarily a customer-deployment reliability point, then? You can now make sure you can reliably deploy the coffee-making model at a customer site, it's going to be fast enough, and it's not going to fail over long time horizons. So is it more of a customer-deployment innovation versus a fundamental capability innovation, or is it both?
I think it's both. Karol said this a little earlier: to some extent, the robots we really want, the robot you want at home that can do your laundry, do your dishes, cook for you, drive around, and also the robot that people want in smaller businesses, solving a real problem they don't want to automate in the classical way because it's too expensive, like building a chocolate box, those are cases where the robot has to be reliable. It has to be good, and it has to have the capability to do a new task it hasn't seen in the initial training stages. I think it's unrealistic to assume we can just go with more and more human data collection, bigger and bigger. We will do that, but there's always going to be a limit to how much data you can get and how good the initial policy is going to be. So yes, as you said, if we want deployments, we need this. But also, increasingly over the next years, I expect we'll see that we do these deployments and that data becomes really valuable as a source for pre-training, for making the models better themselves. My prediction is that over the coming years we'll rely more and more on autonomous data collection to build that body of data, that convex hull of all the tasks we eventually want robots to do, so the model ingests it and becomes good at doing them and at interpolating.

And I think of it as a new capability. So far we haven't figured out how to learn from your own experience. There have been many attempts, but I don't think we've seen it done at scale, to the extent that it shows a convincing result that lets you deploy something. That's why this result was really important to us. We wanted to get to the point where the robots can learn from their own experience, because, similarly to how we learn: you can learn a little from watching videos, from practicing, maybe from others, but at some point you need to learn on the job. You need to try the thing yourself, see how your actions affect what you actually want to achieve, draw your own conclusions, and learn that way. I think this is the first step towards that.
You're reminding me: did you guys read the Rich Sutton era-of-experience paper this year? I loved it; I thought it was very profound. Do you think this unlocks continual learning in robotics for y'all? Will this be part of that?
It kind of depends on what people mean by continual learning. It's definitely more continual than what we've done in the past, where you have a big pre-training mixture, maybe a post-training mixture, and you sit down, work really hard, and come up with an artifact, and that's it: the artifact is done and there's not much you can do to change it. Now this is much more of a living thing. We start with a process similar to that, but then you deploy it, and it keeps on learning. It's much more continual in the sense that it tries new things, it tries to learn from its own experience, and it keeps getting better. I think there's still room for it to be much more continual, where it can acquire new skills that way, be even faster at doing this, and probably reason throughout the process. So there's a spectrum of how much you can learn on the job. This is really promising because it shows you can do it, but I think we can make it much, much better.

Yeah, I would agree. I'd say we're at the very beginning of this. It's definitely not continual learning in the classical sense people would have thought about, of data streams coming in and the whole thing just turning and ultimately leading all the way to, I don't know, AGI or something like that, yet. But it's a first step. We're moving in the right direction, and there's lots more to be done. I will say, even from this release, I was personally impressed, and to some extent shocked, by how good these models actually are at picking up little things you put back into the data. I was surprised that even just human corrections worked. There was one example with tamping. Tamping is a specific part of making an espresso: you put in the ground beans and you have to tamp down the...
The best part.
Yeah, the best part. You have to tamp down the coffee first.
There you go. See, I'm not a coffee expert. It's a skill issue.
That's right. And so our robot in the beginning tamped way too hard, because it just happened that the initial human demonstrations were focused on making sure the coffee grounds are flat so they can be put in. The robot was tamping really hard, almost lifting itself off the table. When we looked at it, we went: ooh, that's a bit much. And with just, I think, 30 to 50 episodes, a really small set of corrections humans did, we fed that data back, and the model actually starts being much more gentle and doing the correct thing. I was really surprised by that, because you think: this model has been pre-trained on millions and millions of episodes, and now you're doing just a little correction, and it actually works. Seeing that happen points towards this continual-learning part, which I find impressive.
Can I ask, though: the thing I'm still hung up on is generalization. So, as I learn how to tamp better, does that make me better at folding boxes or not?
Uh, in this specific case, no. But the mechanism is the same one you can employ to fix the "oh, I have two boxes in front of me that are stuck together and I need to pull them apart" problem. You can get 30 corrections for the tamping part, 30 corrections for pulling boxes apart, 30 corrections for "this box wasn't neatly folded together." And all of this accumulates to give you a more generalized improvement, I would say.
Okay. So it's a repeatable recipe, but they don't necessarily cross-pollinate.
Yeah. I would expect that as we scale this up, we might also see things actually transfer from A to B, if there are motions that are similar across tasks. But at this point, yeah, I'd say it's more like a repeatable recipe.
And we do see a lot of generalization from pre-training, where you train on more and more tasks, more and more data. You see that it's much easier to onboard any new task, or you see tasks working zero-shot that you didn't expect. And this keeps improving. We kick off a pre-training run at a certain cadence, and every time, we see the model keep getting better, because there's more data being fed in, more improvements we're making to the pre-training process, and so on. I also suspect that as we have more and more of these models deployed, doing all kinds of different tasks, they'll bring data back in. One place where I'm quite certain we'll see more generalization is that process: as you deploy these models, the data comes back, the models get better, you can deploy them more, then the models get better again, and so on.
And maybe it's worthwhile, on this point you brought up: we haven't really talked about one crucial aspect of this Pi recipe, which is that the model has two parts. One is the policy that's trying to improve via corrections and RL feedback. The other part is how you actually get this RL feedback. I've mentioned that humans might correct the robot; that's the human-correction part. The RL feedback part is a little different, and it already has some of the aspects of generalization that I think you're searching for. The way we do this is that we first get humans to tell us whether a specific attempt at making the coffee or building the box was successful or not, so there are human labels provided with these episodes. Then we train something called a value function, which tries to predict, from any given point in the task, whether I'm likely to succeed or fail. This value function is then used as a baseline to decide, for a given data point, whether to bump it up or bump it down, depending on whether I expect to be moving towards success or towards failure.

One thing we saw when we trained these value functions (they're trained from the same kind of backbone, the same kind of model, but pre-trained before the actual policy that runs the task) is that adding more data from different tasks actually helps there. The model starts being really quite good, at least for certain tasks, at knowing it will fail before it's obvious to me, for example when I watch a video of it trying to insert the...
Filter.
The portafilter. Thank you. See, I'm not good at making coffee. When it's trying to insert the portafilter into the coffee machine, it kind of knows that it doesn't quite have the right angle before the failure happens. Thirty or forty steps before, if you look at the prediction, the value function drops, saying: this is not going well in this specific episode, so I shouldn't include this data.
Interesting.
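A minimal sketch of what such a success-predicting value function could look like, assuming a shared backbone and binary human success labels; everything here is an illustrative assumption rather than Physical Intelligence's actual implementation.

```python
# Value function as a success predictor: every step of an episode inherits
# the episode's human-given success/failure label, and the head learns
# P(success | current observation). All names are illustrative.
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    def __init__(self, backbone, d_model=512):
        super().__init__()
        self.backbone = backbone          # shared pre-trained trunk
        self.head = nn.Linear(d_model, 1)

    def forward(self, obs):
        return torch.sigmoid(self.head(self.backbone(obs)))

def train_step(value_fn, optimizer, obs_batch, episode_succeeded):
    # episode_succeeded: float tensor of 0.0 / 1.0 human labels per step.
    pred = value_fn(obs_batch).squeeze(-1)
    loss = nn.functional.binary_cross_entropy(pred, episode_succeeded)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At deployment, a sustained drop in this prediction, as in the portafilter example, flags a likely failure dozens of steps before it happens, which is the signal used to decide which data to reinforce.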
I think RL is just such a vast field, and there are so many different approaches to it. People often associate RL with something like a policy-gradient method, or with very specific on-policy learning approaches. To me it's more of a problem definition, and there are many approaches that get around the problem you're referring to, which is that you only get the reward at the very end, which isn't really scalable for very-long-horizon tasks. There are things like value functions; there are things like temporal-difference learning, which get around this by constantly making predictions, in a sequential way. And this is maybe another one of those places where I think robotics can really help the broader AI community: we don't have the advantage of a perfect language simulator where you can run as many simulations as you'd like. Instead, you need to do it in the real world, so you need more efficient methods, and therefore we need to learn value functions and things like that. I think these will end up being really valuable everywhere.
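For reference, temporal-difference learning is the textbook mechanism being pointed at here: rather than waiting for a single reward at the very end of a long task, each step's value estimate is nudged toward the immediate reward plus the discounted estimate one step later. A toy tabular TD(0) update:

```python
# Tabular TD(0): V[s] <- V[s] + alpha * (r + gamma * V[s'] - V[s]).
# The value of each state is updated from the very next prediction, so
# learning doesn't have to wait for the end of a long-horizon task.
def td0_update(V, trajectory, alpha=0.1, gamma=0.99):
    for s, r, s_next in trajectory:   # (state, reward, next state)
        td_target = r + gamma * V.get(s_next, 0.0)
        V[s] = V.get(s, 0.0) + alpha * (td_target - V.get(s, 0.0))
    return V

V = td0_update({}, [("grasp", 0.0, "insert"), ("insert", 1.0, "done")])
print(V)  # {'grasp': 0.0, 'insert': 0.1}; more passes propagate value back to "grasp"
```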
Yeah. Can I push a little bit? I'd love to understand: internet video seems like it's part of the recipe, but not a huge focus right now, as I see it. Do you think there's gold left to be mined in internet video? And if you look at what's happening in video models right now, world models, to what extent do you think that's going to be a discontinuous jump in model capabilities and an important part of your model pipeline?
Yeah. I think there are two questions there. One is about the data: how do you bootstrap yourself to the point where you can start deploying. The other is about video models and the world-model aspects. On the data point, I think we're now in a bootstrap phase where basically anything goes: whatever you can figure out how to add to the model, to its benefit, is good, whether that's sim, human videos, some kind of handheld devices, or human teleoperation. It kind of doesn't matter; you just need to figure out some way to bootstrap yourself to the point where you can deploy these models. In the long term, there's going to be this bootstrap phase, and then there's going to be the deployment phase, and I think the deployment phase will provide much, much more data than anything you could do in the bootstrap phase. So we're in this kind of weird spot right now where we try many different things and see what sticks, just to get to the deployment threshold. Once you can deploy, that will vastly exceed anything you can do before it. That's what we're sprinting towards; that's why we want to start deploying these models, with many different tasks in many different environments, so that we have this very powerful data engine.

Now, on the world-modeling side of things, I think world models and our approaches are targeting the same problem: the problem of counterfactuals, the credit-assignment problem. How do you figure out which actions were the ones that actually mattered for your success, and how would the world have evolved had you taken a different action? One way to do this is by predicting what would have happened: rolling out a full video of, if I place this portafilter a little differently, where would I end up, and would that be a failure or a success? Or you can do it through reinforcement learning, which works through a slightly different mechanism, a little more implicitly, but fundamentally targets a very similar problem. We're exploring all of those approaches to see how to really solve the counterfactual problem. I don't think there's an answer yet, but we see a lot of progress with reinforcement learning, as we've just shown with Pi Star 0.6, and I think there's probably room for many other approaches too.
Awesome. Can we talk about what happens once you get past that bootstrap phase? Let's talk about customer deployments a little bit. What do you bring to a customer? What do you sell them? And how do you imagine that will evolve over time? Are you selling a fully vertically integrated robotic solution? Are you selling a model they have to figure out how to integrate into their operations? How does this all work?
The real answer is we don't know yet; we're still figuring that out. We're still quite early in the technology, as you can tell; we're just starting to get to the threshold where we can deploy these things. So we believe we should focus on the technology first, figure out how to get it to the point where it's actually easy to deploy, and expand this aperture we were talking about initially. The history of robotics startups very often gets to this point: you develop a technology for some period of time, you start with a grand vision of what it should enable and how general-purpose it will be, and as soon as you pick an application to apply it to, you're kind of stuck. You start cutting corners. You start building very special-purpose solutions just for that application. And very quickly you become an application company that focuses on, let's say, warehouse pick-and-place robots, and that's it. We really want to avoid that future. We think we have a chance to really solve physical intelligence, and the benefits of doing that will far outweigh any single application we could focus on now. So we want to make the technology as general as possible and as easily deployable as possible, the aperture as wide as possible, and then we'll start figuring out how to commercialize it. As you said, there could be many different ways of doing this. There are probably ways we can't think of just yet, because they'll depend on how the technology goes: whether you can be a model provider, a fully vertical solution, whether you sell robots, or whatever else. But I think it's a little too premature to answer this question.
It would give you a lot of comfort, you know, just to pick one of us. It would give Alfred a lot of comfort. No, you guys have a grand vision. So thank you for working on physical intelligence. Pi Star 0.6 is a wonderful improvement, a huge breakthrough, so congratulations on all the success you've had.
Thank you.
Can I follow up with a spicy question?
Sure.
So, as you said, this vision is so grand, so broad; you're doing all these different things. I'm sure you've studied all the previous robotics efforts, and as you said, they've largely gone application by application, getting narrower and narrower. One of the most successful cases of a large application is self-driving, and Waymo and Tesla have done enormously well. But going back in history: I learned about self-driving when Sebastian Thrun was on the TED stage in, I think, 2009 or 2010, and he talked about winning the DARPA challenge, which was 2007. And here we are in 2025, and the thing barely goes from San Francisco down here. They kind of can do it now, but they take local roads; they can't even get on the freeway. If you're doing such a generalized job, how long is the runway, or the timeline, that you're thinking about to build for generalization and performance?
Yeah. So there are some aspects of the problem that make it easier than self-driving, and some that make it harder. One thing that makes it easier is that we don't need to deploy only when it's 100% reliable. There are many, many tasks out there where even 95% reliability is totally fine. If you have a robot in your home folding your laundry, and every 100th item doesn't get folded perfectly, you'll be totally fine.
You just call your child to go fold the...
That's right. We still need chores. Exactly. And with self-driving, that's not the case, right? If you fail every 100th time catastrophically, that's a big problem. So in terms of deploying this technology, it might be easier. We also benefit from the fact that this is a different era of technology. We're in the era of vision-language models, of foundation models that have some common sense, and we learned a lot of lessons between, what was it, 2009 and 2025, and we can benefit from all of those. So I think that also really helps, and these are much more general-purpose solutions than what we had in the past. At the same time, there are some things that will be very challenging. There isn't just a single application: this is a very general-purpose solution that can be applied to driving, but also to manipulation, locomotion, flying, and all kinds of other things. It's to be seen how much harder this is. So far, based on what we've experienced, it doesn't seem to be that much harder, to be honest. If you tackle it with a very general-purpose mindset from the get-go, it turns out it can generalize fairly well. There's something about physical intelligence that we don't fully understand that allows these models to generalize between driving and making coffee and flying a drone and operating a surgical robot. Even though those seem so far apart from each other, and it seems they should all be different models and different applications, these models somehow make sense of all that data. That gives me a lot of hope that maybe the problem is not that much harder, and it might actually be easier. So I think it's a fair question, but I also don't want to draw the wrong conclusions from what we've seen in self-driving.
That's beautiful. Congratulations. What result outside of your own has impressed you the most?
That's a great question.
Yeah, it's a good question actually. I can start. I've been really impressed by the video models, which you mentioned earlier. I saw them, and worked on aspects of them, a few years ago, and I didn't expect the trajectory, the improvement, to be so steep. They're basically indistinguishable from reality right now, and they can do incredible things. That's been really impressive and really surprising to me.

I would say I'm still in awe, to some extent, that we've gotten to this place where we get models that seem generally intelligent, to a level that I really didn't foresee coming out of just next-token prediction. I'm still amazed by this, and by every little advance I see: winning IMO math challenges, applying it to finding new things in science. There were so many things this year where I thought, wow, there's still a lot of progress to be made, even though at the beginning of the year it felt like maybe this whole pre-training business of LLMs was petering out a bit. Realizing that there's this whole second breath of fresh air coming in.
Yeah, I would maybe add to this just the fact that this whole thing works; it's kind of mind-blowing. I don't think we fully realize how ridiculous this is. You build this loosely brain-inspired thing that has a very general-purpose learning algorithm, you feed it data, and it somehow gets it, and gets it way better than anything we've ever had before. And this applies to robots, and it applies to vision and language and sound and all kinds of other things. If you stop for a second and just think about how it works, and that it works, it's absolutely mind-blowing. The fact that we can have a robot, you can put it in a home it's never been to before and it kind of knows what to do, or it can make coffee for 13 hours straight, things like that. And this is from a very general-purpose thing that trains fully end to end, that we don't fully understand, but that seems to start to get it. That, to me, is just mind-blowing.
We're in a simulation; that's what Sonya believes, that we're living in a simulation. But it is interesting, right? In science, they teach you to take a big problem and break it up into smaller and smaller problems. And then basically somebody realized that's maybe not the best way to train machines, or robots of any kind.
And to be honest, the whole machine learning field, the whole AI field, made that same mistake to some extent, right? For a long time, people were working on solving individual problems very deeply. Then over time there was this notion of: oh, if we can put it all together, do multitask learning, and do that really well, we'd do much better. But the fact that it all happened just because we switched to this general pre-training objective, and then it all just falls out, that's the surprising bit, right?
Do you think it's like an accordion, where we go from one framework to the other? We take big problems, break them up into smaller and smaller ones, that works for a period of time, then it stops working, and we go, "All right, let's go back to the big problem and try to solve it more generally," back and forth?
I don't see us going back. There are a lot of approaches, a lot of people saying, that you need the best of both worlds: some way of incorporating the rules we already know, like Newtonian physics. You don't need to learn that; we already know how it works; so can you just put it into the weights somehow? But from what we've seen so far, it doesn't work. If you try to do this, you limit the ability to learn new things. I don't think there's a best of both worlds; I think we just go all the way with learning. And it's interesting how similar this is to how we learn. You would think that if there were a way to pre-bake all the intelligence, evolution would have figured it out: you would just be born knowing everything there is to know. We see this with some other species. Deer, when they're born, are basically as smart as they'll ever be; they don't really learn much throughout their lifetime. But intelligent species, like humans, but also crows, for instance, have these childhood periods, this adolescence, where they're not very smart to begin with and have to learn from their own experience. It doesn't come pre-baked; you have to earn it on your own. I think there's something to that: you need to experience the world and learn from it. And I think that's the lesson we're learning in machine learning as well, in AI: we think we know how we think, but we actually don't, and we just need to let the algorithm learn it from data.
Same thing with raising a child. I think I know how my son is thinking, but I don't.
Yeah. I have a small daughter, and it's just so surprising...
They learn so fast.
They learn so fast, and you don't know where they get it from.
Hopefully from the parents.
Hopefully.
She definitely knows some things that they didn't teach her. Thank you guys so much. It's a really beautiful mission you're going after. Thank you for coming to share.
Thank you.
Thanks for having us. Thanks for having us.