Robotics' End Game: Nvidia's Jim Fan
Sequoia Capital
Watch the original video here: https://www.youtube.com/watch?v=3Y8aq_ofEVs
And up first, I'm delighted to introduce my friend Jim Fan. Jim leads the embodied AI research group at NVIDIA, otherwise known as NVIDIA Robotics. I think that robots are just one of the most thrilling things that's going to happen. A car is basically a big robot, but I'm excited for robots that can go beep boop and lift things for us. And so Jim was a standout at last year's AI Ascent, and we're delighted to have you back.
[Applause] Thanks everyone. Thanks. So, it was a summer day in 2016, actually right in this office that we're sitting in. There's a guy in a shiny leather jacket, you know, big biceps, hauling in this large metal tray. And on this large piece of metal, he wrote, "To Elon and the OpenAI team, to the future of computing and humanity, I present you the world's first DGX-1."
So, that was the first time I met Jensen. And as any good intern would do, I rushed to get in line to sign my name on it. So, can you spot it? My name is here. And can you spot another? That's Andrej right there. So, Andrej, we're going to the Computer History Museum. I feel like a dinosaur.
You know, back then, I had no clue what I was signing up for. And then, no one can describe what happened next better than Ilya himself. "If you believe in deep learning, deep learning will believe in you." And oh boy, did deep learning believe in all of us big time.
Three step functions, six years. That's all it took to bring us here today. The first step: GPT-3 pre-training. Next-token prediction is really about learning the rules of grammar, the shape of language. It's about simulating how thoughts and code and strings in general should unfold.
2022: InstructGPT, where supervised fine-tuning aligned the simulation to do useful work. Then o1: reasoning, using reinforcement learning to surpass imitation learning. And finally auto-research, accelerating the whole loop beyond what's humanly possible. So, as Andrej said, all the labs are getting to the final boss fight. For LLMs, they're in the thick of the endgame.
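To make that first step concrete, here is a minimal sketch of the next-token prediction objective; the function and variable names are illustrative, not from the talk.

```python
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Next-token prediction: every position learns to predict the token after it."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift the sequence by one
    logits = model(inputs)                            # (batch, seq, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # flatten all positions
        targets.reshape(-1),                          # the "next token" at each one
    )
```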
And honestly, I'm very jealous. Look at how happy Andrej was. Big smile on his face. The LLM folks are having the party of their lifetime. They're speedrunning AGI on mythical clusters literally called Eos. So, why can't robotics get a piece of the fun?
So, as any self-respecting scientist would do, I copied the homework and gave it a new name. I call it The Great Parallel. Instead of simulating strings, can we simulate the next physical world state? Then we align, through action fine-tuning, onto the thin slice of that simulation that matters for real robots, and we let reinforcement learning carry the last mile.
And that's it. The Great Parallel: copying the LLM success. If you can't beat them, join them. So, please join me in a new episode, Robotics: The Endgame. And sorry, I just couldn't resist. Nano Banana is too good. Thanks, Demis.
So, how do we play the endgame? It boils down to two things: model strategy and data strategy. Let's look at the model strategy first.
The last three years were dominated by VLAs, or Vision-Language-Action models; models like Pi0 and GR00T fall in this category. We assume that the pre-training is done by a VLM, and we simply graft an action head on top of it. But really, if you think about these models, they are LVAs, because most of the parameters are dedicated to language. Language is a first-class citizen, followed by vision and then action. And by design, VLAs are great at encoding knowledge and nouns, but not so much physics and verbs. It's kind of head-heavy in the wrong places.
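As a rough sketch of the recipe just described (a large pre-trained VLM backbone with a small action head grafted on top), the dimensions and module names below are illustrative assumptions, not the actual Pi0 or GR00T architectures.

```python
import torch.nn as nn

class ToyVLA(nn.Module):
    """Illustrative VLA: almost all parameters live in the language-heavy backbone."""
    def __init__(self, vlm_backbone, hidden_dim=4096, action_dim=22, chunk=16):
        super().__init__()
        self.vlm = vlm_backbone                  # billions of params: language + vision
        self.action_head = nn.Sequential(        # a few million params: action
            nn.Linear(hidden_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, action_dim * chunk)  # a short chunk of motor commands
        )

    def forward(self, images, instruction_tokens):
        h = self.vlm(images, instruction_tokens)  # assumed to return (batch, hidden_dim)
        return self.action_head(h)                # continuous actions
```

Counting parameters in a setup like this is exactly the "LVA" point: the capacity overwhelmingly sits with language, and action gets a sliver.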
This is my favorite example from the original VLA paper: move the Coke can to a picture of Taylor Swift. Yes, it has not seen Taylor Swift before. Yes, it's able to generalize. But this is not quite the pre-training ability that we're looking for. So what's the second pre-training paradigm? And I always thought that it would be something glorious.
Unfortunately, it turns out to be what we call AI video slop. You know, I can watch these cats playing banjo on a security cam all day. It's peak internet.
But really, look at this. No one can take this seriously [laughter] until we realize that these video models are learning to simulate the next world state internally. These are some results from Veo 3. You can see the models pick up gravity, buoyancy, lighting, reflection, refraction, all by themselves. None of this is coded in; physics emerged by predicting the next blob of pixels at scale.
And even visual planning emerges. Look at how Veo 3 solves these mazes: it solves them by running the simulation forward in pixel space. And I'll draw your attention to the lower-right corner here. This is my favorite example. Let's watch how Veo 3 solves this one; blink and you'll miss it. [Laughter]
It's super smart. You know, Veo 3 figures out that if you're not looking, geometry is optional. I call this physics slop.
So how do we make these world models useful? Well, we do action fine-tuning. We align this superposition of all possible future states and collapse that onto a thin slice that matters for real robots.
Introducing Dream Zero. It is a new type of policy model that dreams a couple of seconds into the future and acts accordingly.
And motor actions are high-dimensional continuous signals, so they look just like pixels; we can decode them at the same time as we render the video. So Dream Zero jointly decodes the next world states and the next actions, and as a result, it's able to zero-shot solve tasks and verbs that it has never seen in training.
And as a robot executes, we can visualize what it's dreaming about, and the correlation is very tight. If the video prediction works, the action works; if the video hallucinates, the action fails. So once again, vision and action are now first-class citizens.
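A minimal sketch of that joint-decoding idea, with actions treated as pixel-like continuous channels read off the same latent as the frames; the interface is my assumption, not the actual Dream Zero code.

```python
import torch.nn as nn

class ToyWorldActionModel(nn.Module):
    """WAM sketch: one shared trunk, two heads, dreamed frames and actions together."""
    def __init__(self, trunk, frame_head, action_head):
        super().__init__()
        self.trunk = trunk              # video world model: simulates future states
        self.frame_head = frame_head    # decodes the "dream" (future RGB frames)
        self.action_head = action_head  # decodes motor signals from the same latent

    def forward(self, past_frames, prompt_tokens):
        z = self.trunk(past_frames, prompt_tokens)   # shared future-state latent
        return self.frame_head(z), self.action_head(z)
```

Because both heads read the same latent, a hallucinated dream and a failed action should co-occur, which matches the tight correlation described above.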
And we have a lot of fun with Dream Zero. We just rolled a robot around in our lab and typed random things into the prompt box. Of course, Dream Zero is not going to do all of these tasks 100% robustly, but it's kind of like GPT-2: it's trying to get the shape of the motion correct in every case. So Dream Zero is our first step towards open-ended, open-vocabulary prompting for robotics.
And we call this new type of model World Action Models, or WAMs.
So let's all take a moment of silence for our dear friend VLAs. They've served us well. Rest in peace. Long live World Action Models.
[Clears throat] And next, data strategy.
This is NVIDIA's chief scientist Bill Dally doing teleoperation in our lab. And given his salary, I think this is by far the most expensive teleop trajectory ever collected in our dataset.
The past three years have been dominated by teleoperation. It's the golden era, right? VR headsets, extremely optimized latency for streaming, and these complex rigs that look like medieval torture devices. You know, so much investment in industry, so much pain and suffering.
And yet teleop is upper-bounded by 24 hours per robot per day, the fundamental physical limit. And actually, who am I kidding? It's more like three hours per robot per day, and only when the robot god is merciful, because they throw tantrums all the time. So how can we do better? Well, how about this: you just wear the robot hand on your own hand.
This is called UMI, or Universal Manipulation Interface, and it's a deceptively simple idea. You wear the robot actuator on your hand and directly collect the data as a human, while getting the rest of the robot body out of the loop. I would say UMI is perhaps one of the greatest papers ever written on robotics data, and it spawned two unicorn startups. On the left-hand side is Generalist, improving on this design so you can wear the gripper here. And on the right-hand side, Sundai made these three-finger data gloves.
Last year we took it one step further. We designed an exoskeleton that has a one-to-one mapping with five-finger dexterous robot hands, and we call it DexUMI. Let's look at it in action. On the left, the human directly collecting data, always the fastest. On the right, look at how difficult teleop is: the human operator here, one of our most skilled PhDs, has to align very carefully, and then it's super slow, and the success rate is low as well. And in the middle, you just wear this exoskeleton and collect data directly.
And we train a robot policy on this data. So here, what you see is a fully autonomous rollout of a policy that's trained on zero teleoperation data.
So, we're able to break the curse of 24 hours per robot per day, and see how happy these robots are because they no longer need to be in the loop for data collection. So, is this the answer? Have we solved scaling for robotics?
Anyone driving a Tesla or a Waymo here? Anyone? Right. You know, when you're driving, you're actually contributing to the biggest physical data flywheel. And the beauty is you don't even feel it during FSD, because the data upload is an ambient process. Yet wearing these UMI-style data wearables is still cumbersome, right? It's intrusive. It's not as seamless as just driving to work. So we need an FSD equivalent.
The data collection needs to get out of the way, fade into the background so we can capture the full glory of human dexterity across all walks of life, across all labors of economic value.
So we are going all in on human egocentric videos that come with detailed annotations, like hand-position tracking and dense language descriptions.
Introducing EgoScale, where 99.9% of the training mix is human egocentric video. The result is an end-to-end policy that maps directly from camera pixels to a 22-degree-of-freedom, high-dexterity robot hand, and what you see here is fully autonomous.
We pre-train EgoScale on 21k hours of in-the-wild egocentric human data, with zero robot data whatsoever. During pre-training, we predict hand joints and wrist poses. Then, in action fine-tuning, we collect only 50 hours of high-precision mocap-glove data and four hours of teleop. That's four hours of teleop: less than 0.1% of our training mix.
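As a quick sanity check on those numbers, the arithmetic on the stated hours:

```python
# Training mix for EgoScale, in hours, as stated in the talk
ego_video = 21_000   # in-the-wild egocentric human video (pre-training)
mocap     = 50       # high-precision mocap-glove data (fine-tuning)
teleop    = 4        # teleoperation data (fine-tuning)

total = ego_video + mocap + teleop
print(f"teleop share:    {teleop / total:.4%}")     # ~0.0190%, well under 0.1%
print(f"ego video share: {ego_video / total:.2%}")  # ~99.74% of total hours
```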
And with this, EgoScale is able to generalize to very dexterous tasks, like sorting cards or manipulating a syringe to transfer liquid. You know, someday we might have robot nurses at home; might as well try this. And it takes only a one-shot demonstration at test time to learn different shirt-folding strategies.
And perhaps the most fascinating finding from the paper is that we discovered a neural scaling law for dexterity. It's a very clean relationship between the number of hours we put into pre-training and the optimal validation loss. In fact, it's a clean log-linear equation, arriving six years after the original neural scaling law for language models.
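For reference, a log-linear scaling law of the kind described would take the following form, where H is the number of pre-training hours and a, b are fitted constants; the exact coefficients are not given in the talk.

```latex
% Optimal validation loss as a function of egocentric pre-training hours
L^{*}(H) = a - b \log H, \qquad a, b > 0
```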
So if we put all of these data strategies on this chart, X-axis is alignment to the robot hardware. Y-axis is scalability. This is what it looks like. Teleop, the least scalable. Data wearables, you can go up to hundreds of thousands of hours. And egocentric video, if we're able to spin the FSD flywheel, easily 10 million hours in the next year or so.
And if we draw a line here, everything to the left of this line is a new paradigm: sensorized human data. So let me make a few predictions. In the next year or two, we'll see teleop dropping and dropping to an almost negligible amount. And then there will be an ensemble of data wearables custom-designed for different hardware and use cases. And finally, the main diet for robotics will be egocentric videos.
So, a moment of silence for our dear friend Teleop. You have served us well. Rest in peace. Long live sensorized human data.
Are we done with the data strategy yet? Did you notice I put two rings on data strategy? What's the outer ring here? All the LLM frontier labs are now spending significant budget on acquiring millions of coding environments to do reinforcement learning. Robotics is the same: we urgently need to scale up environments.
And of course, you can always do reinforcement learning directly on the real robot. So in our lab, we use RL to push certain tasks to almost 100% success rate, so you can do continuous execution for hours on end. You know, it's kind of therapeutic to see these robots assembling GPUs just by themselves, or as a wise man would say, "Good boy, this task has been approved by my boss."
Yet we can't get to one million environments, because that would require one million robots if we did it the old way. So we need a better way. Let's say you take an iPhone picture: you can pass it through a 3D world-scan pipeline to extract all the objects, and then automatically re-synthesize them inside a classical physics simulator.
So all these objects are actually interactive after the scan, and then you can augment this infinitely in simulation with variations that we call digital cousins.
So now the iPhone basically becomes a pocket world scanner. We call this process real-to-sim-to-real, and it gives us a scalable way to port the physical world into the digital world.
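Sketched end to end, the pipeline looks something like this; every function here is a hypothetical stand-in for the stages just described, not a real NVIDIA API.

```python
from dataclasses import dataclass

@dataclass
class SimEnv:
    objects: list  # interactive assets recovered from the scan

def scan_scene(photos):
    """Placeholder for the 3D world-scan step: photos in, interactive objects out."""
    return ["mug", "drawer", "gpu_tray"]  # stand-ins for reconstructed meshes

def digital_cousins(objects, n=10):
    """Placeholder augmentation: each scanned scene fans out into varied cousins."""
    return [SimEnv(objects=[f"{o}_variant{i}" for o in objects]) for i in range(n)]

# Real-to-sim: one phone capture becomes many physics-simulator environments;
# sim-to-real is then training a policy in these and deploying it on the robot.
envs = digital_cousins(scan_scene(photos=["kitchen.jpg"]))
print(len(envs), "environments from a single scan")
```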
But still, this method relies on a classical graphics engine. Can we do better? Introducing Dream Dojo.
So, the bet has always been on video world models, turning them into full-fledged neural simulators. Dream Dojo takes as input continuous action signals and outputs the next RGB frames, as well as sensor states, in real time. Not a single pixel you see here is real. And Dream Dojo is able to capture and learn the mechanics of different robots through a purely data-driven approach. There's no physics equation, no graphics engine involved in this process.
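That maps naturally onto a standard simulator interface; a minimal sketch, assuming the world model is a callable that advances one tick per action (the class and method names are mine, not Dream Dojo's).

```python
class NeuralSimulator:
    """Hypothetical wrapper: a learned world model standing in for a physics engine."""
    def __init__(self, world_model, initial_frame):
        self.world_model = world_model  # trained on data; no physics equations inside
        self.frame = initial_frame      # current RGB observation

    def step(self, action):
        # The model maps (current frame, continuous action) to the next RGB
        # frame plus predicted sensor states, entirely in learned pixel space.
        self.frame, sensors = self.world_model(self.frame, action)
        return self.frame, sensors
```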
So the new post-training paradigm for robotics is a massively parallel RL system that runs on a few real robot stations, on a bunch of graphics cores running world scans, and heavy inference compute running world models.
Or as the equation goes: compute = environments = data. Or as a wise man would say, "The more you buy, the more you save." And this message has been approved by my boss.
So that's it. Putting it together, The Great Parallel that robotics will follow. And it's happening as we speak. And we're looking at the beginning of the endgame.
You guys play the video game Civilization? Still my favorite. I like to think of my research as unlocking game achievements on this civilizational technology tree.
[Clears throat] And there are three more achievements to unlock for robotics and then we're done. I can retire, and I can't wait for that.
The first is passing the physical Turing test: across a wide range of activities, you cannot tell the difference between a human doing the task and a robot doing it. Maybe not drunk humans, but you know, the physical Turing test is about unit energy in and unit labor out. And just judging by the sexy pose of this robot, I think we have our work cut out for us. So maybe it's two to three years away.
And next, physical API. You have a whole fleet of robots and they can be configured just like any other software using APIs and command lines orchestrated someday by Opus 9.0. And if we have this physical API, we'll be able to realize lights-out factories. Those are essentially printers of atoms. They take as input designs in markdown files and then output fully assembled products completely autonomously. Or these wet labs that automate scientific discoveries in chemistry, biology, and medicine.
And the final stop, physical auto-research. When robots start to design, improve, and build the next iteration of themselves far beyond what's humanly possible.
So, you might ask, is this too science fiction? Like, are we going to see this in our lifetime?
Well, it took the AI community 14 years to go from the first forward pass of AlexNet in 2012, a model that barely recognized cat versus dog, to AI Ascent today in 2026, where we talk about agentic auto-research. So let's just add another 14 years. How about that?
2026 is right in the middle of 2012 and 2040. And technology does not advance linearly. It advances exponentially. So, I can say with 95% certainty that we'll get to the end of the endgame, the end of the technology tree by 2040.
And we'll still be young.
If you believe in robotics, robotics will believe in you. And to all of us sitting here, I think our generation was born too late to explore the earth and too early to explore the stars. But we are born just in time to solve robotics.