Robotics' End Game: Nvidia's Jim Fan
Sequoia Capital
Watch the original video here: https://www.youtube.com/watch?v=3Y8aq_ofEVs
And up first, I'm delighted to introduce my friend Jim Fan. Jim leads the embodied AI research group at NVIDIA, otherwise known as NVIDIA Robotics. I think that robots are just one of the most thrilling things that's going to happen. A car is basically a big robot, but I'm excited for robots that can go beep boop and lift things for us. And so Jim was a standout at last year's AI Ascent, and we're delighted to have you back.
[Applause] Thanks everyone. Thanks. So, it was a summer day in 2016, actually right in this office that we're sitting in. There's a guy in a shiny leather jacket, you know, big biceps, hauling in this large metal tray. And on this large piece of metal, he wrote, "To Elon and the OpenAI team, to the future of computing and humanity, I present you the world's first DGX-1."
So, that was the first time I met Jensen. And as any good intern would do, I rushed to get in line to sign my name on it. So, can you spot it? My name is here. And can you spot another? That's Andrej right there. So, Andrej, we're going to the Computer History Museum. I feel like a dinosaur.
You know, back then, I had no clue what I was signing up for. And then, no one can describe what happened next better than Ilya himself. "If you believe in deep learning, deep learning will believe in you." And oh boy, did deep learning believe in all of us big time.
Three step functions, six years. That's all it took to bring us here today. The first step: GPT-3 pre-training. Next-token prediction is really about learning the rules of grammar, the shape of language. It's about simulating how thoughts and code and strings in general should unfold.
2022: InstructGPT, where supervised fine-tuning aligned the simulation to do useful work. Then o1: reasoning, using reinforcement learning to surpass imitation learning. And finally auto-research, accelerating the whole loop beyond what's humanly possible. So, as Andrej said, all the labs are getting to the final boss fight. For LLMs, they're in the thick of the endgame.
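To make that first step concrete, here is a minimal sketch of the next-token prediction objective; the function and variable names are illustrative, not from the talk.

```python
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Next-token prediction: every position learns to predict the token after it."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift the sequence by one
    logits = model(inputs)                            # (batch, seq, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # flatten all positions
        targets.reshape(-1),                          # the "next token" at each one
    )
```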
And honestly, I'm very jealous. Look at how happy Andrej was. Big smile on his face. The LLM folks are having the party of their lifetime. They're speedrunning AGI on mythical clusters literally called Eos. So, why can't robotics get a piece of the fun?
So, as any self-respecting scientist would do, I copied the homework and gave it a new name. I call it The Great Parallel. Instead of simulating strings, can we simulate the next physical world state? Then we align, through action fine-tuning, onto the thin slice of that simulation that matters for real robots, and we let reinforcement learning carry the last mile.
And that's it. The Great Parallel: copying the LLM success. If you can't beat them, join them. So, please join me in a new episode, Robotics: The Endgame. And sorry, I just couldn't resist. Nano Banana is too good. Thanks, Demis.
So, how do we play the endgame? It boils down to two things: model strategy and data strategy. Let's look at the model strategy first.
The last three years were dominated by VLAs, or Vision-Language-Action models; models like Pi0 and GR00T fall in this category. We assume that the pre-training is done by a VLM, and we simply graft an action head on top of it. But really, if you think about these models, they are LVAs, because most of the parameters are dedicated to language. Language is a first-class citizen, followed by vision and then action. And by design, VLAs are great at encoding knowledge and nouns, but not so much physics and verbs. It's kind of head-heavy in the wrong places.
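As a rough sketch of the recipe just described (a large pre-trained VLM backbone with a small action head grafted on top), the dimensions and module names below are illustrative assumptions, not the actual Pi0 or GR00T architectures.

```python
import torch.nn as nn

class ToyVLA(nn.Module):
    """Illustrative VLA: almost all parameters live in the language-heavy backbone."""
    def __init__(self, vlm_backbone, hidden_dim=4096, action_dim=22, chunk=16):
        super().__init__()
        self.vlm = vlm_backbone                  # billions of params: language + vision
        self.action_head = nn.Sequential(        # a few million params: action
            nn.Linear(hidden_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, action_dim * chunk)  # a short chunk of motor commands
        )

    def forward(self, images, instruction_tokens):
        h = self.vlm(images, instruction_tokens)  # assumed to return (batch, hidden_dim)
        return self.action_head(h)                # continuous actions
```

Counting parameters in a setup like this is exactly the "LVA" point: the capacity overwhelmingly sits with language, and action gets a sliver.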
This is my favorite example from the original VLA paper: move the Coke can to a picture of Taylor Swift. Yes, it has not seen Taylor Swift before. Yes, it's able to generalize. But this is not quite the pre-training ability that we're looking for. So what's the second pre-training paradigm? And I always thought that it would be something glorious.
Unfortunately, it turns out to be what we call AI video slop. You know, I can watch these cats playing banjo on a security cam all day. It's peak internet.
But really, look at this. No one can take this seriously [laughter] until we realize that these video models are learning to simulate the next world state internally. These are some results from Veo 3. You can see the models pick up gravity, buoyancy, lighting, reflection, refraction, all by themselves. None of this is coded in; physics emerged by predicting the next blob of pixels at scale.
And even visual planning emerges. Look at how Veo 3 solves these mazes: it solves them by running the simulation forward in pixel space. And I'll draw your attention to the lower-right corner here. This is my favorite example. Let's watch how Veo 3 solves this one; blink and you'll miss it. [Laughter]
It's super smart. You know, Veo 3 figures out that if you're not looking, geometry is optional. I call this physics slop.
So how do we make these world models useful? Well, we do action fine-tuning. We align this superposition of all possible future states and collapse that onto a thin slice that matters for real robots.
Introducing Dream Zero. It is a new type of policy model that dreams a couple of seconds into the future and acts accordingly.
And motor actions are high-dimensional continuous signals, so they look just like pixels; we can decode them at the same time as we render the video. So Dream Zero jointly decodes the next world states and the next actions, and as a result, it's able to zero-shot solve tasks and verbs that it has never seen in training.
And as a robot executes, we can visualize what it's dreaming about, and the correlation is very tight. If the video prediction works, the action works; if the video hallucinates, the action fails. So once again, vision and action are now first-class citizens.
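A minimal sketch of that joint-decoding idea, with actions treated as pixel-like continuous channels read off the same latent as the frames; the interface is my assumption, not the actual Dream Zero code.

```python
import torch.nn as nn

class ToyWorldActionModel(nn.Module):
    """WAM sketch: one shared trunk, two heads, dreamed frames and actions together."""
    def __init__(self, trunk, frame_head, action_head):
        super().__init__()
        self.trunk = trunk              # video world model: simulates future states
        self.frame_head = frame_head    # decodes the "dream" (future RGB frames)
        self.action_head = action_head  # decodes motor signals from the same latent

    def forward(self, past_frames, prompt_tokens):
        z = self.trunk(past_frames, prompt_tokens)   # shared future-state latent
        return self.frame_head(z), self.action_head(z)
```

Because both heads read the same latent, a hallucinated dream and a failed action should co-occur, which matches the tight correlation described above.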
And we have a lot of fun with Dream Zero. We just rolled a robot around in our lab and typed random things into the prompt box. Of course, Dream Zero is not going to do all of these tasks 100% robustly, but it's kind of like GPT-2: it's trying to get the shape of the motion correct in every case. So Dream Zero is our first step towards open-ended, open-vocabulary prompting for robotics.
And we call this new type of model World Action Models, or WAMs.
So let's all take a moment of silence for our dear friend VLAs. They've served us well. Rest in peace. Long live World Action Models.
[Clears throat] And next, data strategy.
This is NVIDIA's chief scientist Bill Dally doing teleoperation in our lab. And given his salary, I think this is by far the most expensive teleop trajectory ever collected in our dataset.
The past three years have been dominated by teleoperation. It's the golden era, right? VR headsets, extremely optimized latency for streaming, and these complex rigs that look like medieval torture devices. You know, so much investment in industry, so much pain and suffering.
And yet teleop is upper-bounded by 24 hours per robot per day, the fundamental physical limit. And actually, who am I kidding? It's more like three hours per robot per day, and only when the robot god is merciful, because they throw tantrums all the time. So how can we do better? Well, how about this: you just wear the robot hand on your own hand.
This is called UMI, or Universal Manipulation Interface, and it's a deceptively simple idea. You wear the robot actuator on your hand and directly collect the data as a human, while getting the rest of the robot body out of the loop. I would say UMI is perhaps one of the greatest papers ever written on robotics data, and it spawned two unicorn startups. On the left-hand side is Generalist, improving on this design so you can wear the gripper here. And on the right-hand side, Sundai made these three-finger data gloves.
Last year we took it one step further. We designed an exoskeleton that has a one-to-one mapping with five-finger dexterous robot hands, and we call it DexUMI. Let's look at it in action. On the left, the human directly collecting data, always the fastest. On the right, look at how difficult teleop is: the human operator here, one of our most skilled PhDs, has to align very carefully, and then it's super slow, and the success rate is low as well. And in the middle, you just wear this exoskeleton and collect data directly.
And we train a robot policy on this data. So here, what you see is a fully autonomous rollout of a policy that's trained on zero teleoperation data.
So, we're able to break the curse of 24 hours per robot per day, and see how happy these robots are because they no longer need to be in the loop for data collection. So, is this the answer? Have we solved scaling for robotics?
Anyone driving a Tesla or a Waymo here? Anyone? Right. You know, when you're driving, you're actually contributing to the biggest physical data flywheel. And the beauty is you don't even feel it during FSD, because the data upload is an ambient process. Yet wearing these UMI-style data wearables is still cumbersome, right? It's intrusive. It's not as seamless as just driving to work. So we need an FSD equivalent.
The data collection needs to get out of the way, fade into the background so we can capture the full glory of human dexterity across all walks of life, across all labors of economic value.
So we are going all in on human egocentric videos that come with detailed annotations, like hand-position tracking and dense language descriptions.
Introducing EgoScale, where 99.9% of the training mix is human egocentric video. The result is an end-to-end policy that maps directly from camera pixels to a 22-degree-of-freedom, high-dexterity robot hand, and what you see here is fully autonomous.
We pre-train EgoScale on 21k hours of in-the-wild egocentric human data, with zero robot data whatsoever. During pre-training, we predict hand joints and wrist poses. Then, in action fine-tuning, we collect only 50 hours of high-precision mocap-glove data and four hours of teleop. That's four hours of teleop: less than 0.1% of our training mix.
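As a quick sanity check on those numbers, the arithmetic on the stated hours:

```python
# Training mix for EgoScale, in hours, as stated in the talk
ego_video = 21_000   # in-the-wild egocentric human video (pre-training)
mocap     = 50       # high-precision mocap-glove data (fine-tuning)
teleop    = 4        # teleoperation data (fine-tuning)

total = ego_video + mocap + teleop
print(f"teleop share:    {teleop / total:.4%}")     # ~0.0190%, well under 0.1%
print(f"ego video share: {ego_video / total:.2%}")  # ~99.74% of total hours
```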
And with this, EgoScale is able to generalize to very dexterous tasks, like sorting cards or manipulating a syringe to transfer liquid. You know, someday we might have robot nurses at home; might as well try this. And it takes only a one-shot demonstration at test time to learn different shirt-folding strategies.
And perhaps the most fascinating finding from the paper is that we discovered a neural scaling law for dexterity. It's a very clean relationship between the number of hours we put into pre-training and the optimal validation loss. In fact, it's a clean log-linear equation, arriving six years after the original neural scaling law for language models.
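For reference, a log-linear scaling law of the kind described would take the following form, where H is the number of pre-training hours and a, b are fitted constants; the exact coefficients are not given in the talk.

```latex
% Optimal validation loss as a function of egocentric pre-training hours
L^{*}(H) = a - b \log H, \qquad a, b > 0
```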
So if we put all of these data strategies on this chart, X-axis is alignment to the robot hardware. Y-axis is scalability. This is what it looks like. Teleop, the least scalable. Data wearables, you can go up to hundreds of thousands of hours. And egocentric video, if we're able to spin the FSD flywheel, easily 10 million hours in the next year or so.
And if we draw a line here, everything to the left of this line is a new paradigm: sensorized human data. So let me make a few predictions. In the next year or two, we'll see teleop dropping and dropping to an almost negligible amount. And then there will be an ensemble of data wearables custom-designed for different hardware and use cases. And finally, the main diet for robotics will be egocentric videos.
So, a moment of silence for our dear friend Teleop. You have served us well. Rest in peace. Long live sensorized human data.
Are we done with the data strategy yet? Did you notice I put two rings on data strategy? What's the outer ring here? All the LLM frontier labs are now spending significant budget on acquiring millions of coding environments to do reinforcement learning. Robotics is the same: we urgently need to scale up environments.
And of course, you can always do reinforcement learning directly on the real robot. So in our lab, we use RL to push certain tasks to almost 100% success rate, so you can do continuous execution for hours on end. You know, it's kind of therapeutic to see these robots assembling GPUs just by themselves, or as a wise man would say, "Good boy, this task has been approved by my boss."
Yet we can't get to one million environments, because that would require one million robots if we did it the old way. So we need a better way. Let's say you take an iPhone picture: you can pass it through a 3D world-scan pipeline to extract all the objects, and then automatically re-synthesize them inside a classical physics simulator.
So all these objects are actually interactive after the scan, and then you can augment this infinitely in simulation with variations that we call digital cousins.
So now the iPhone basically becomes a pocket world scanner. We call this process real-to-sim-to-real, and it gives us a scalable way to port the physical world into the digital world.
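Sketched end to end, the pipeline looks something like this; every function here is a hypothetical stand-in for the stages just described, not a real NVIDIA API.

```python
from dataclasses import dataclass

@dataclass
class SimEnv:
    objects: list  # interactive assets recovered from the scan

def scan_scene(photos):
    """Placeholder for the 3D world-scan step: photos in, interactive objects out."""
    return ["mug", "drawer", "gpu_tray"]  # stand-ins for reconstructed meshes

def digital_cousins(objects, n=10):
    """Placeholder augmentation: each scanned scene fans out into varied cousins."""
    return [SimEnv(objects=[f"{o}_variant{i}" for o in objects]) for i in range(n)]

# Real-to-sim: one phone capture becomes many physics-simulator environments;
# sim-to-real is then training a policy in these and deploying it on the robot.
envs = digital_cousins(scan_scene(photos=["kitchen.jpg"]))
print(len(envs), "environments from a single scan")
```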
But still, this method relies on a classical graphics engine. Can we do better? Introducing Dream Dojo.
So, the bet has always been on video world models, turning them into full-fledged neural simulators. Dream Dojo takes as input continuous action signals and outputs the next RGB frames, as well as sensor states, in real time. Not a single pixel you see here is real. And Dream Dojo is able to capture and learn the mechanics of different robots through a purely data-driven approach. There's no physics equation, no graphics engine involved in this process.
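That maps naturally onto a standard simulator interface; a minimal sketch, assuming the world model is a callable that advances one tick per action (the class and method names are mine, not Dream Dojo's).

```python
class NeuralSimulator:
    """Hypothetical wrapper: a learned world model standing in for a physics engine."""
    def __init__(self, world_model, initial_frame):
        self.world_model = world_model  # trained on data; no physics equations inside
        self.frame = initial_frame      # current RGB observation

    def step(self, action):
        # The model maps (current frame, continuous action) to the next RGB
        # frame plus predicted sensor states, entirely in learned pixel space.
        self.frame, sensors = self.world_model(self.frame, action)
        return self.frame, sensors
```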
So the new post-training paradigm for robotics is a massively parallel RL system that runs on a few real robot stations, on a bunch of graphics cores running world scans, and heavy inference compute running world models.
Or as the equation goes: compute = environments = data. Or as a wise man would say, "The more you buy, the more you save." And this message has been approved by my boss.
So that's it. Putting it together, The Great Parallel that robotics will follow. And it's happening as we speak. And we're looking at the beginning of the endgame.
You guys play the video game Civilization? Still my favorite. I like to think of my research as unlocking game achievements on this civilizational technology tree.
[Clears throat] And there are three more achievements to unlock for robotics and then we're done. I can retire, and I can't wait for that.
The first is passing the physical Turing test: across a wide range of activities, you cannot tell the difference between a human doing the task and a robot doing it. Maybe not drunk humans, but you know, the physical Turing test is about unit energy in and unit labor out. And just judging by the sexy pose of this robot, I think we have our work cut out for us. So maybe it's two to three years away.
And next, physical API. You have a whole fleet of robots and they can be configured just like any other software using APIs and command lines orchestrated someday by Opus 9.0. And if we have this physical API, we'll be able to realize lights-out factories. Those are essentially printers of atoms. They take as input designs in markdown files and then output fully assembled products completely autonomously. Or these wet labs that automate scientific discoveries in chemistry, biology, and medicine.
And the final stop, physical auto-research. When robots start to design, improve, and build the next iteration of themselves far beyond what's humanly possible.
So, you might ask, is this too science fiction? Like, are we going to see this in our lifetime?
Well, it took the AI community 14 years to go from the first forward pass of AlexNet in 2012, a model that barely recognized cat versus dog, to AI Ascent today in 2026, where we talk about agentic auto-research. So let's just add another 14 years. How about that?
2026 is right in the middle of 2012 and 2040. And technology does not advance linearly. It advances exponentially. So, I can say with 95% certainty that we'll get to the end of the endgame, the end of the technology tree by 2040.
And we'll still be young.
If you believe in robotics, robotics will believe in you. And to all of us sitting here, I think our generation was born too late to explore the earth and too early to explore the stars. But we are born just in time to solve robotics.