Quan Vuong: π0.7, A Generalist Model with Emergent Capabilities [ETHZ Robot Learning 2026]

Oier Mees
In this guest lecture for the ETH Zurich course "Robot Learning: From Fundamentals to Foundation Models" (Spring 2026), hosted and led by Oier Mees, Quan Vuong (Co-Founding Member of Physical Intelligence), who has been a central figure in the push towards large-scale pre-training for robot learning, first at Google DeepMind and now at Physical Intelligence, talks about the latest findings at Physical Intelligence when developing their frontier models that show emergent capabilities.
Hosts: Quan Vuong, Oier Mees
📅April 28, 2026
⏱️00:36:59
🌐English

Disclaimer: The transcript on this page is for the YouTube video titled "Quan Vuong: π0.7, A Generalist Model with Emergent Capabilities [ETHZ Robot Learning 2026]" from "Oier Mees". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.

Watch the original video here: https://www.youtube.com/watch?v=pzolgvyWEFY

00:00:00Oier Mees

So, it's my pleasure today to welcome Quan Vuong for our guest lecture. He co-founded Physical Intelligence and focuses on large-scale pre-training for cross-embodiment robot models. He obtained his PhD at UCSD, during which he was advised by Henrik Christensen and Hao Su.

00:00:25Oier Mees

And before co-founding Physical Intelligence, Quan worked at Google DeepMind, where he started and co-led the RT-X and Open X-Embodiment projects. And with this, please give him a warm round of applause. Go ahead, Quan.

00:00:43Quan Vuong

Sounds good. Yeah, I can actually see the Zoom link and I'm not going to try to make it show up. So, you know, if you have anything, feel free to interrupt me. I'll keep today really informal, just to share our recent progress, and I think my goal today is really to interact with the robotics community at ETH. So, if you have any questions, you can also save them for the end.

00:01:09Quan Vuong

So, $\pi_{0.7}$—this is a work that we released recently, and I would say that one of the main innovations in this work is to get a single model that can perform many tasks, and perform many tasks at an extremely high level of performance. Why is this important and why is this hard?

00:01:30Quan Vuong

Previously, when you wanted to get really high performance for any particular task in end-to-end learning, normally what you would do is take a pre-trained model that is somewhat reasonable in performance but not great, and then post-train that model on that specific task. The reason why getting a single model that can perform many tasks at a high level of performance is important is that it's much more scalable, right? You don't need to worry about which model checkpoint or which hyperparameters to use at test time for a particular task. The fewer models that you have to manage, the easier it is for you to scale. And it's hard because it is not obvious how to do it; it's a really open research question. And $\pi_{0.7}$ by no means solves this question of how we get the best model that can perform many tasks, but I think it's a meaningful improvement.

00:02:41Quan Vuong

And so, I think we're in the age of scaling for robotics. When people talk about scaling, usually there are two main axes: there's data and there's the model. One challenge is that it's not obvious what you scale and then how you scale it. For example, if you take all of the data that you might have in robotics and you train the model on it, most likely it won't be a great model. It won't be as good as a specialist model that has been really heavily tuned and optimized for a particular scenario. Taken at face value, what that means is that more data doesn't necessarily equal a higher-performing model. So, the majority of the rest of the talk will focus on how $\pi_{0.7}$ was able to produce a single checkpoint that can match specialist performance.

00:03:43Quan Vuong

So, if you look at the top right of this slide here, what this is showing is a figure from one of our previous works on RL fine-tuning. The x-axis is the length of the episodes seen during evaluation, and the y-axis is the success rate. What this is illustrating is that the green plots demonstrate the speed of a human teleoperator controlling the robot. The slightly pink base policy is illustrating the speed of a policy trained on the teleoperation data, which is actually slower than the speed of the human teleoperator, which is kind of interesting, right? Because you would expect that a good imitation learning policy will at least reproduce the speed that you see in your training data. And then the yellow shows what we wanted to demonstrate with RL fine-tuning: if you perform online reinforcement learning, you can get the speed to be superhuman in this case, where you can see that the yellow plots are to the left of the green plots.

00:04:57Quan Vuong

So, the first question that you can ask here is: how is it possible that you have a dataset, you take a policy, you train on that dataset, and the policy is actually slower than the dataset? And how do you get a generalist policy that has the same speed as the specialists that were trained using reinforcement learning?

00:05:23Quan Vuong

So here comes $\pi_{0.7}$. In terms of the architecture of the actual action generation component of the system, it has not changed significantly from our prior work. It's very similar to $\pi_{0.6}$, and very similar to $\pi_{0.5}$. We use a language model and a pre-trained vision encoder, and then we wrap an action expert onto it to teach the model how to speak "robot language," if you will.

00:05:56Quan Vuong

The model takes in a prompt, for example, "clean the kitchen." It might additionally take in a subtask prompt, which indicates one step of the main task; completing it along with the remaining subtasks completes the main task instruction. Additionally, there is metadata that indicates the speed and the quality of the episode that you're training on. One other innovation in $\pi_{0.7}$ is the goal-conditioning capability of the model, which leads to much better generalization and which I'll also touch on later.
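
A minimal sketch of the conditioning inputs listed here, written as a plain Python container. The class name, field names, label values, and the prompt format are all hypothetical choices for illustration; the talk does not specify how $\pi_{0.7}$ actually encodes these inputs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PolicyContext:
    """Hypothetical container for the conditioning inputs named in the talk."""
    task_prompt: str                      # e.g. "clean the kitchen"
    subtask_prompt: Optional[str] = None  # one step toward the main task
    speed: Optional[str] = None           # episode metadata (assumed label values)
    quality: Optional[str] = None         # episode metadata (assumed label values)
    goal_image_id: Optional[str] = None   # stand-in for a goal observation


def render_prompt(ctx: PolicyContext) -> str:
    """Flatten the text-like fields into one string, skipping unset fields."""
    parts = [f"task: {ctx.task_prompt}"]
    if ctx.subtask_prompt:
        parts.append(f"subtask: {ctx.subtask_prompt}")
    if ctx.speed:
        parts.append(f"speed: {ctx.speed}")
    if ctx.quality:
        parts.append(f"quality: {ctx.quality}")
    return "; ".join(parts)


ctx = PolicyContext("clean the kitchen", "wipe the counter", speed="fast", quality="high")
print(render_prompt(ctx))
```

The goal field is left out of the rendered string because a goal would presumably be an image rather than text; that split is also an assumption.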

00:06:43Quan Vuong

What I just talked about is the action generation component of the system. There is also a high-level component, and the way to think about the high-level component is that it's like a planner. The high-level instructions can come from the human, the high-level policy, or the world model here. None of these sources generates low-level actions itself. Instead, they generate plans, and the low-level policy figures out what actions to take to execute those plans.
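
The division of labor described here can be sketched as a small control loop. This is an illustrative toy under stated assumptions: `toy_planner` and `toy_policy` are hypothetical stand-ins (a lookup table and a string generator), not the actual high-level policy, world model, or action expert.

```python
from typing import Callable, List

# A planner maps a task instruction to an ordered list of subtask strings.
Planner = Callable[[str], List[str]]
# A low-level policy maps one subtask string to a list of placeholder actions.
LowLevelPolicy = Callable[[str], List[str]]


def toy_planner(task: str) -> List[str]:
    """Hypothetical planner: a lookup table standing in for a human,
    a high-level policy, or a world model."""
    plans = {"clean the kitchen": ["pick up the sponge", "wipe the counter"]}
    return plans.get(task, [task])  # unknown tasks pass through as one subtask


def toy_policy(subtask: str) -> List[str]:
    """Hypothetical low-level policy: emits two named placeholder actions."""
    return [f"action[{subtask}][{i}]" for i in range(2)]


def run(task: str, planner: Planner, policy: LowLevelPolicy) -> List[str]:
    """The planner only chooses subtasks; all actions come from the policy."""
    actions: List[str] = []
    for subtask in planner(task):
        actions.extend(policy(subtask))
    return actions


trace = run("clean the kitchen", toy_planner, toy_policy)
print(len(trace))
```

Because the planner is just a function from instruction to subtasks, swapping a human coach for a learned planner leaves the low-level policy untouched in this sketch.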

00:07:20Quan Vuong

To show you the kind of tasks that the $\pi_{0.7}$ model is capable of—these are tasks that stress test very different axes. There is laundry folding, which has to deal with an infinite state space. There is the really precise task of screwing a tiny screw into a robot arm. On the right, there's a task of building a box that requires really precise two-hand coordination. And there's a long-horizon task of taking out the trash on the bottom. The noteworthy thing here is that all of these videos and all of these tasks were performed by a single checkpoint from the $\pi_{0.7}$ pre-training run.

00:08:16Quan Vuong

These are the kinds of figures that I won't go into detail on, but the main takeaway here is—first, let's look at the different policies that are presented here. There is $\pi_{0.7}$, which is the pre-trained model. And there is what we call the RL specialist and the SFT specialist. To abstract away the details, the way to think about these specialists is that they are policies specifically tuned to get high performance on specific tasks. If you glance at the slide, you can see that the yellow, which is $\pi_{0.7}$, is either the same as or actually better than the specialists.

00:09:19Quan Vuong

This gets to the point I was trying to make: I think this is the first time where, at least for the $\pi$ models, it's possible for us to get a single checkpoint that performs as well as policies specifically tuned for different tasks. Now, the even more interesting bit here is that in some of these plots, you see that the yellow is actually better than the specialists.

00:09:40Quan Vuong

This is saying that when you have a specialist specifically tuned for a particular scenario that is already very high-performing, if you can get a generalist to also perform well in those scenarios, the generalist can actually perform better. Why is that the case? That is the case because, in large-scale robotic evaluation, really every evaluation is a generalization evaluation. Even when you try to have controlled experiments and fix the scene as much as possible, there's always going to be some factor that changes. Because you have a generalist, it's going to be generally more robust to subtle changes in the environment, and therefore it can actually outperform the specialist. We really tried our best to make it an apples-to-apples comparison here in terms of the data that different models are allowed to see.

00:10:33Quan Vuong

Now, one of the interesting plots from the $\pi_{0.7}$ paper is this plot on the left here. The x-axis is the percentage of data used in training, where as you go towards the left of the x-axis, you use less data, but the data is of higher quality. If you go to the right of the x-axis, you use more data, but the data is of lower quality. So, when you go across the x-axis to the right, you are increasing the volume of data that you train on by adding lower-quality data.

00:11:14Quan Vuong

There are two policies being compared here. There is $\pi_{0.7}$ without metadata, and we can see here that increasing the data volume actually leads to decreased performance when you go from 80% to 100% by adding the really low-quality data into training. This is experimental evidence illustrating the point I made at the beginning: that more data is not necessarily better.

00:11:48Quan Vuong

Now, the other interesting bit here is the second policy being shown, $\pi_{0.7}$ with metadata. There is a huge jump in performance going from 80% to 100% by adding lower-quality data. This shows that by representing information to your policy properly, you can actually benefit from lower-quality data, and benefit in a pretty dramatic way.
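
The contrast between the two policies can be sketched as two ways of preparing the same dataset: filtering out low-quality episodes, or keeping all of them and exposing quality as a label. The `Episode` type, the numeric quality score, the 0.5 threshold, and the prompt format are hypothetical; the talk does not say how quality is measured or encoded.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    task: str
    quality: float  # hypothetical score in [0, 1]; higher is better


def filter_top(episodes: List[Episode], keep_fraction: float) -> List[Episode]:
    """Without metadata: keep only the highest-quality fraction of the data."""
    ranked = sorted(episodes, key=lambda e: e.quality, reverse=True)
    return ranked[: max(1, round(keep_fraction * len(ranked)))]


def label_all(episodes: List[Episode], threshold: float = 0.5) -> List[str]:
    """With metadata: keep every episode and state its quality in the prompt."""
    return [
        f"task: {e.task}; quality: {'high' if e.quality >= threshold else 'low'}"
        for e in episodes
    ]


episodes = [Episode("fold shirt", 0.9), Episode("fold shirt", 0.2), Episode("fold shirt", 0.7)]
kept = filter_top(episodes, 0.67)  # drops the lowest-quality episode
labeled = label_all(episodes)      # keeps all three, each tagged
print(len(kept), len(labeled))
```

With labels in place, the model can in principle tell good demonstrations from poor ones instead of averaging over them. This is one reading of why the labeled variant benefits from the extra data, not a claim made in the talk.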

00:12:26Quan Vuong

In addition to all of the things I just mentioned about high-performing policies, one other really interesting result from $\pi_{0.7}$ is cross-embodiment transfer. How do we characterize cross-embodiment transfer here? Let's say you have these two robot arms here. You collected data for folding a shirt, which is a hard task that can require up to hundreds and hundreds of hours of data because it's a pretty precise manipulation task that also has to deal with a very large observation space.

00:13:06Quan Vuong

With $\pi_{0.7}$, it's possible for you to take the policy checkpoint that was trained on only shirt-folding data from the robot on the left, and run it on the robot on the right here, which is a station of two UR5s, and have that station perform shirt folding as well. We have never included any shirt-folding data at all from the dual UR5 station. So, this is generalization to an unseen robot and task combination. One thing that's interesting to me about this result is that this is not a simple task; it is very difficult, and we were very surprised by how well the policy works in this case.

00:14:07Quan Vuong

I haven't yet touched on the high-level planning aspect of $\pi_{0.7}$. What these videos are showing is that if you have a human that's coaching the robot by giving language instructions, you can actually get the system to perform tasks that you haven't collected data on before, and also on completely unseen environments and unseen objects. Like this task where the human is trying to coach the robot to fry a sweet potato, and the fryer and sweet potato are just not in the training data at all.

00:14:52Quan Vuong

So, at Pi, we believe that robotics is very far from being solved, and we really want to help propel the community forward whenever we can. Part of that effort is to create the infrastructure for everyone across the world to experiment with robotics. We open-sourced $\pi_0$ and $\pi_{0.5}$. People are always very surprised when they ask me the question, "Is there any difference between the open-source $\pi_0$ and $\pi_{0.5}$ models and the models that researchers use at Pi?" Actually, there is no difference. The checkpoints that we open-sourced for $\pi_0$ and $\pi_{0.5}$ are also the checkpoints that researchers at Pi use for their daily work.

00:15:47Quan Vuong

One other thing that I wanted to show is this blog post that we did called "The Physical Intelligence Layer." I'm showing the blog post here with the URL so that you can look it up yourself later on as well. The reason why this blog post is interesting is that you can ask the question: if Pi can build a really great robotic model, how does Pi externalize it? That is to say, how does Pi make that intelligence available for the rest of the world to use?

00:16:24Quan Vuong

As part of an exercise to understand how we make that model available, we work very closely with robotics companies that want to deploy robots today to see if the model and the systems we are building together are good enough for real tasks.

00:16:45Quan Vuong

The first video that you see when you go to this blog post is a video of a robot from Weave Robotics folding laundry in a real laundromat in the Mission District of San Francisco, which is where Pi and Weave are located. This is interesting because these are completely unseen clothing items, and the robot is performing laundry folding for a real customer order.

00:17:19Quan Vuong

In the blog post, there are also examples of various scenarios where the robot has to be somewhat intelligent to recover from errors or to be able to fold a really complex item of clothing. For example, these jeans with really long legs are pretty hard to fold because the two legs can get tangled together. So, that's one example.

00:17:52Quan Vuong

The other example that the blog post shows is this example of a robot packaging task. The task here is—well, I'm not sure how it works where you guys are, but in the US, if you order an item from Amazon, it sometimes comes in this small, soft pouch. The task here is that there is a tray containing various kinds of items, and the robot is supposed to pick them and put them through this really small gap into the packaging machine. Then, when the package is sealed, the robot is supposed to pick up the package and put it over here to be ready to be shipped out.

00:18:30Quan Vuong

This task is hard because, first, there are deformable items—there can be many different kinds of items that the robot is packing. The second reason why it's hard is because this is not a task that's easily automatable using a classical stack, because errors can happen and the pouch itself is a small, deformable object.

00:18:59Quan Vuong

One of the things that I really love about this video is that this is daytime, in the morning, when the robot is running. And if you scroll to the end of the video, you can see that it's nighttime. So, this is a robotic system performing a real task, packaging real customer orders in a real warehouse, and it's running a full-day workload. The majority of it is autonomous. I think this example is pretty extraordinary to me because it didn't even require that much data to get this level of autonomy.

00:19:40Quan Vuong

And so, the takeaway here is that we want to figure out how we externalize the intelligence of the models that we're building. One way to do so is to work with companies that want to deploy robots today, scale the deployment, and work really closely with them. We work as if we're on the same team with very free-flowing information to see if the systems we build together are good enough for deployment.

00:20:06Quan Vuong

One of the really critical aspects in these deployments is the ability to make mistakes and to remotely intervene when the system makes a mistake so that the system can recover. The key question is: can you get to a level of autonomy where you don't need to remotely intervene as much, and you essentially break even from a financial perspective so that you can scale deployment? I think for some of the tasks that we work on with partner companies, we are essentially there.

00:20:40Quan Vuong

So, with that, let me end the presentation. I want to leave some time for questions, answers, and discussion. Feel free to ask me anything that's on your mind. And if you want to learn more about us, please go to our website, `pi.website`, and take a look at our blog posts. We put a lot of effort into making them presentable. Thanks for listening.

00:21:10Audience Question 1

Hey, so Sergey Levine said that you can get quite far with like a gripper hand, but since you're now incorporating data from humans, is it maybe now time to get a more dexterous robot?

00:21:26Quan Vuong

Yeah, so the problem with what you're referring to as a multi-finger hand is the following: it really depends on whether you want to build hardware or you want to study intelligence today. Because if you want to study intelligence today, then you would like to have scale. We believe that scale is important for intelligent behavior to emerge.

00:21:54Quan Vuong

Now, what do you need to scale? In terms of hardware, you need hardware that is cheap, reliable, and easy to teleoperate. If you look at multi-finger hands, you can get two out of three today, but not all three. You can get a hand that is reliable and easy to teleoperate, but it's not going to be cheap. For the price of one multi-finger hand, you can get maybe six or seven robotic systems with a two-finger gripper. Or, you can get a multi-finger hand that is cheap and easy to teleoperate, but it's not going to be reliable. It would break so often that you would wonder why you are doing this in the first place.

00:22:40Quan Vuong

And so that's one argument why, if you want to focus on the intelligence question, studying that using multi-finger hands is a little bit of a distraction today. Now, this is not to say that multi-finger hands are not important. I think they are, and when the hardware gets to the point where it satisfies those three criteria, then we'll consider working on it.

00:23:11Quan Vuong

The other thing is, we've probably worked on thousands and thousands of tasks at Pi at this point, and we work on really realistic tasks. We haven't seen that many tasks where you actually need something more than a two-finger mechanism.

00:23:32Audience Question 2

First, nice presentation. I just wanted to ask about the scaling of these VLA models. Right now, I see that in your paper you are getting some good cross-embodiment generalization results, and the hardware is reaching a point where it's actually very cheap. So, what is stopping you from building big data generation farms with thousands of people collecting multitask data and then training a model that you can just deploy in any home? Do you have very obvious bottlenecks that you are trying to solve at this point, or do you think that we have actually reached this point?

00:24:11Quan Vuong

I see. Great question. So, let me tackle the last part of your question first, which is the deployment of robots into homes. I want to build a home robot at some point, but I think it is incredibly hard to do and may not be the right thing to do today. I think robots in the home, in the limit, are a physical-AI-complete task. If that works, then we would have solved physical intelligence.

00:24:50Quan Vuong

That's the scientific answer. There is also an economic answer, where consumer robotics is just really hard. Selling into the home in a way that creates a product that is compelling, cheap, and safe is just really difficult. Because we believe that robotics today is bottlenecked by model capability, we want to focus on the research question first. So, why not the home today? That's the first part.

00:25:24Quan Vuong

And then the earlier part of your question: why not collect lots of data and have data-collection farms? We are scaling data collection, and there are many other companies in the world that are scaling up robot data collection. The critical question is really: what do you scale? Because there are many axes. For example, do you collect data on many different tasks and only a little bit of data per task? Or do you collect data on not so many tasks but lots of data per task?

00:25:58Quan Vuong

What about human video: do you collect lots of it, too? What about UMI? What about the robot embodiment: do you collect data on lots of robots and not as much data per robot, or do you collect data with only a few robot platforms with lots of data per platform? How do you change out the objects? In what environments do you collect the data? I don't think these questions have an answer today, and they really need to be experimentally validated to see what is useful. We want to be good scientists, so we'll study how to scale by running experiments.

00:26:43Audience Question 3

Hi, thanks for the great presentation. I wanted to ask regarding the difference between $\pi_{0.6}$ and $\pi_{0.7}$. For $\pi_{0.6}$ in the paper, you talked a lot about the problem with interventions during training and how much human feedback you need. You also very clearly today highlighted some of the benefits of $\pi_{0.7}$, such as outperforming the specialists on some tasks. But I wonder, with regards to training, how much were you able to reduce the interventions, and how much more efficient is it to train $\pi_{0.7}$ beyond the performance benefits you showed?

00:27:26Quan Vuong

Are you referring to the interventions during data collection or evaluation?

00:27:30Audience Question 3

Evaluation.

00:27:33Quan Vuong

Yeah. I don't remember the number off the top of my head; the number should be in the paper. We generally measure the throughput of the overall systems, and there should be some intervention numbers in there as well.

00:27:55Quan Vuong

My general summary of the results is that there are just tasks that $\pi_{0.7}$ can perform at a high level of performance that $\pi_{0.6}$ just can't do at all. For example, screwing a very tiny screw into the robot arms. $\pi_{0.6}$ really can't do that task at all unless you run specialist post-training on it.

00:28:21Quan Vuong

And there was a figure that I showed where the $\pi_{0.7}$ model is actually outperforming the $\pi_{0.6}$ specialist, especially in terms of throughput, because the model is more robust to environment changes. So it generalizes better—like maybe the lighting has changed, the object has changed, or the presentation itself has changed.

00:28:42Quan Vuong

Now, you also mentioned the question of training efficiency. There are multiple ways to define training efficiency. One that I think a lot about is how much compute goes into the training run. I would say between $\pi_{0.6}$ and $\pi_{0.7}$, they're roughly comparable, actually. This is great because you get a much higher-performing model while not having to increase the scale of your compute significantly. I'm not sure if that answers your question exactly, but I'm happy to answer a follow-up.

00:29:21Audience Question 3

Yes, that was great. Thanks a lot.

00:29:24Audience Question 4

I have one more question. Thanks a lot for your presentation. I just heard that $\pi_{0.7}$ has a small world model for goal conditioning. This directly makes me think of Dreamer-style approaches as a full world action model. So I'm wondering: what are the differences and trade-offs? I guess it's not black and white. Are you leaning more towards world action models in the future, or are they not needed because of inference costs, training costs, and data collection? Also, have you reached a fundamental bottleneck with VLAs that would be solved with world models, or do you think that fundamentally there's not a difference between the different models?

00:30:15Quan Vuong

Yeah, that's a good question. I don't think there is a definitive answer to your question today; all of what you asked is still an open research question. We're exploring some of it, and the community is exploring some of it.

00:30:30Quan Vuong

Really, what's crucial when I think about this question is: number one, model capability. For example, you want a long context length, and you want a model that is durable and has high enough capacity to ingest really large datasets. I think VLAs or video models can do both, but the way that you train them might be quite different. So, model capability is one aspect.

00:31:03Quan Vuong

Number two is the inference time constraint, as you mentioned. I actually think it's a very solvable problem. If we get to the point where you have a really large model that is intelligent enough to perform tasks at a very high performance level, and then the only remaining obstacle is to make the model inference faster, I think that is highly solvable. And so, it would be great if we get to that point.

00:31:40Quan Vuong

Now, we need to scale up the model. Currently, the models are in the single-digit billions. I have no doubt that this year, we and the community will have low double-digit billions, and then it's going to go to mid-range double-digit billions, and then it'll go to high-range double-digit billions. So, the model size will continue to increase.

00:32:07Quan Vuong

I think we'll see much more interesting emergent capabilities when the model size increases. Whether it's a VLA or a video action model that scales better is something we can only find out by trying to scale both up. I don't think I have seen enough results in the community to say that one is obviously the better choice compared to the other.

00:32:29Audience Question 4

Thanks. Do you have time for one last question?

00:32:32Quan Vuong

Yeah, go for it.

00:32:34Audience Question 5

Hey, thanks for coming here and giving the talk. I was wondering if you could elaborate on what you said before on why you don't want to do a home robot right now. In your public demos, sometimes you do go to homes, collect data in homes, and show that you can do some home tasks to a reasonable success rate, and I think even generalize to new kitchens sometimes.

00:33:00Audience Question 5

So, is it that you think if you get a reasonable, non-100% success rate on home tasks, consumers won't buy it? Or are you concerned about safety and want to wait until you have a really good model to make sure it doesn't break things or hurt people? Why would you not focus on home robots? In my mind, I would think that consumers would tolerate a 60% success rate in a robot loading a dishwasher and folding laundry because it's so useful and fun, and there's no economic output tied to it as there would be in commercial settings.

00:33:36Quan Vuong

Yeah, so I think the dream of the home robot is very real. I would like very much to build a home robot at some point too.

00:33:43Quan Vuong

There are a few things in your question. The first is: why do we show a lot of demos on home tasks? Well, when we released $\pi_0$, I think the community thought of us as a laundry company because so many of the tasks were laundry-related tasks. We pick home tasks for a very specific reason: it's a really good testbed for research and for communicating that research. It's a good testbed for research because the full complexity of dealing with open-ended environments is there. It's a good set of tasks for communicating research because the tasks are very relatable. We know robotics, but if I were to show a very precise assembly task to a layperson, they're not going to be able to really grasp how interesting or impressive it is. Whereas if they see a laundry-folding robot, they kind of get the sense that it's hard and that this would be useful to them eventually. So, communicating progress to the layperson is much easier with these tasks. We can look at an assembly task and understand that it's hard because we know robotics, but that's not the case for most people.

00:35:06Quan Vuong

Regarding why not go for a home robot immediately right now, I think the models are just not good enough. You might be okay with a 60% success rate dishwasher robot, but someone else might not be. Someone else might care a lot more about human-robot interaction, or they might care a lot about the robot not taking up too much space because they don't want it to get in their way. I think diversity in human preferences is one of the reasons why consumer robotics is really hard.

00:35:46Quan Vuong

We think the bottleneck to building a really good home robot eventually is model capability, and we want to focus on that bottleneck first. I don't think you would be very successful if you took the models today and tried to build a home robot with them. I think multiple scientific breakthroughs need to happen for that to be true.

00:36:07Quan Vuong

Now, the strategy that we take is very explicitly general-purpose robotics. You can think about how, in language models, you don't have to fully solve language before the models are useful. We want to be in a position where our surface area for success is really large. In some of the tasks and some of the environments, the models are good enough today for deployment, and we want to go after them while still building up model capability so that the surface area for success increases. Eventually, home robots will be one of the things that the model is good enough for.

00:36:49Audience Question 5

Okay, thanks.

00:36:51Oier Mees

All right. I think that's the end for Quan. Let's thank him again for an amazing talk.
