Yann LeCun on What Comes After LLMs
Unsupervised Learning: With Jacob Effron
Disclaimer: The transcript on this page is for the YouTube video titled "Yann LeCun on What Comes After LLMs" from "Unsupervised Learning: With Jacob Effron". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.
Watch the original video here: https://www.youtube.com/watch?v=ngBraLDqzdI
You're one of the godfathers of AI. What's your kind of view of the path of progress here?
Five years, complete world domination. The best way to get breakthrough research is you hire the best people. You get the [ __ ] out of the way. Pardon my French.
You share the Turing Award with two others. When did your views start diverging?
In 2023.
How do you know it was time to leave Meta? It sounds like you were thinking through some of these things over a period of time.
Here's a big misconception about my role, my relation to Alex, and how AI was run at Meta.
What's like one thing you've changed your mind on in the last year? I mean, the whole idea of, uh...
Yann LeCun is one of the godfathers of AI. He's an absolute legend in the field, someone I've admired for a long time. And so it was such a treat to get him on Unsupervised Learning. He's been a noted skeptic of LLMs in many ways, and so we dug into what LLMs can do, what they can't do, some of the limitations he sees, and why he ultimately decided to pursue a different architecture.
We also talked about his time at Meta, the things he's proud of in setting up FAIR, how the last few years proceeded, and what ultimately led him to spin out and start his own company, AMI. I think it's just fascinating to get Yann's thoughts on everything happening in the AI ecosystem today, this tension between basic research and then pushing LLMs forward, and how that's happening in a bunch of organizations today, as well as his thoughts on just where the whole space is headed.
He's just an absolute giant in the field, and when I started this podcast, I hoped we'd get guests like him, so it is just such a treat. I think folks will really enjoy hearing the conversation we had. Without further ado, here's Yann.
Yann, this is such a pleasure. You're one of the godfathers of AI. I feel like when I started doing this podcast years ago, I was really hoping we might one day get someone like you on.
You know, I don't like that term because I live in New Jersey. When you're a godfather in New Jersey, it doesn't mean the same thing.
Very fair. Very fair. You know, obviously, your bet on neural nets when everyone doubted them is legendary. And I feel like today you're making a similar bet in many ways against LLMs and the predominant generative architectures that so many believe in. You've recently started a new company behind this theme.
And so, our goal today in the conversation is to leave our listeners with a lot more information about AMI, what you're doing there, some of your work at Tapestry, why you think the rest of the field is pointed in the wrong direction around some of these generative models, and then also just get your reflections on the way the field's unfolded, your time at Meta, and all that. So, modest goals for a single podcast episode.
I figured it'd be great to start with the meat, because the company feels like the clearest statement of your technical thesis going forward. You recently launched the company. It's focused on world models and scaling the JEPA architecture, which you obviously pioneered over at Meta. And so, I'm wondering if you could talk a little bit about the origins of that architecture and the extent to which you drew inspiration from the human brain and the way that works.
So first of all, I want to say there's nothing wrong with LLMs in the sense that LLMs are the basis for a lot of very useful AI products that all of us use, including me. They're great for what they do. They're just not a path towards human-level or human-like intelligence, or even animal-like intelligence. So that's my claim, okay? I'm not saying they're useless, right? I'm just saying they're not a path towards human-level intelligence.
I mean, you helped build some of the first major open-source ones, right?
Absolutely. So, what is AMI? AMI really stands for Advanced Machine Intelligence, and the subtitle, the motto if you want, is "AI for the real world." Basically, a lot of AI techniques that people know about today are good for language manipulation, either human language, computer code, mathematics, or legalese, which barely qualifies as human language.
Right, sadly.
You know, language is very special in a way, and it's particularly well-suited for the type of architectures that have been so successful recently—the large language models, GPT-style architectures. But what about the real world? What about understanding the physical world? Turns out reality is way more complicated than language, because it's high-dimensional, it's continuous, it's noisy, it's messy. And training a system to understand the real world is much, much harder.
So that's really what we're after. That's what I've been after for most of my career, and really working on in an accelerated fashion over the last five, six years or so, and making significant progress over the last two years. And so it made sense to really do a startup around it and sort of go into high gear pushing that. It became clear by the end of last year that Meta was really not the right place for that, which is why I left and started AMI Labs.
I think it's an interesting trend that we're seeing across the board, right, where it feels like there's many folks spinning out of either some of the large companies or research labs, that have a particular direction of research they're excited about. And you have such an interesting vantage point of this from your time at FAIR.
There's this almost tension that exists between pursuing as many different research directions as possible in these companies, versus "Hey, something's really working, this is the thing that we're going to sell for the next 6 to 12 months, go focus on that." I'm curious your thoughts on that and what you've seen in the industry at large.
Well, it's a strange trade-off. There's really two modes of R&D, right? There's a lot of exploratory research, a lot of research directions, right? And sometimes something kind of seems to work and you need to push it further, and it's not research anymore. I mean, the people working on it are researchers, or they're called researchers at least in the press, but really it's becoming more engineering and pushing for products, right?
So that happened a number of times at Meta because of things that were started at FAIR. Such a thing happened in early 2023, essentially, when LLaMA, which was developed at FAIR (LLaMA 1), was very promising. Meta created a whole organization, GenAI, to turn it into something real and a series of products, and produced LLaMA 2, LLaMA 3, and LLaMA 4, which was a bit of a disappointment.
And because Mark Zuckerberg was disappointed by it, he kind of rebooted the entire organization, reorganized it, and hired new people, etc. But what also happened over the last year is that basically the company, Meta, realized that they had fallen behind a little bit. And so that kind of refocused the strategy on trying to catch up with the industry.
The sad side effect of it is that a lot of the exploratory research was basically not given high priority anymore. I mean, it didn't concern the stuff I was working on, all the JEPA and world models, because Mark himself and Andrew Bosworth, the CTO, and a bunch of other people in the company were really interested in that project and really believed in the long-term impact. But the rest of the company was just totally, entirely focused on LLMs.
It made it clear to me that Meta was really not the right place to push for that project anymore, and then we started to have good results. So it was clear that we had to make that transition between research and actually developing the technology, scaling it up, and building products out of it. And we realized also that most of the applications were probably for things that Meta was not particularly interested in. A lot of the applications of the kind of stuff that we've been working on are in industry, like manufacturing and stuff like that.
Obviously, you're pursuing world models in that broader world. And I think there's other people that have come at the world model space from a more generative approach. So I think you've got the Google folks in Genie and the video models. You've got folks building VLAs on the robotics side. You've got Fei-Fei and the 3D spatial models.
As you think about the body of evidence that got you excited about the JEPA models and how you compare them to what the generative folks have done, where do you think we are today in terms of comparing these architectures and approaches?
Okay, so "world model" is quickly becoming a buzzword right now, certainly in research, but also in industry to some extent. And there are two factions, if you want. I'm not going to talk about VLAs, because VLAs are clearly now being seen as not going anywhere, like it's really not working. So VLAs are Vision-Language-Action models, right?
The idea is to basically use the LLM technology to train a system to produce actions, for like controlling a robot or something like this, right? So you have vision in, language in, action out, maybe language out too. And that's pretty much now seen as a failure. Not being reliable enough, requiring too much training data, things like that. Okay.
Then there are world models. Okay. So what is a world model? A world model at a very general level is something that allows an agentic system to anticipate the consequences of its own actions. Predict the consequences of its own actions. From my point of view, I cannot imagine how you can even think of building an agentic system without that system having the ability to predict the consequences of its actions. I mean, that's pretty essential, right?
When we act in the world, we have this ability, and when we take an action without thinking about the consequences, we are taking a big risk. And very often, other people think we're idiots. We have plenty of examples on the international political scene at the moment of people who have a complete inability to predict the consequences of their actions.
So that's the world model. That's all it is, right? Ability to predict the consequences of your own actions. If you have this ability, then you can plan a sequence of actions to accomplish a task, to satisfy a goal. And you do this by planning, reasoning, by a process of search and optimization. You don't do this by predicting one action after the other autoregressively.
You do this by searching for a sequence of actions that will accomplish the task you set for yourself. So the blueprint for this is completely different from what LLMs can do at the moment. LLMs do not have the ability to predict the consequences of their actions and they do not have any planning abilities, because inference is by predicting the next token, right? It's not by search.
So right there, you have the two characteristics that I think are essential for intelligent behavior: ability to predict consequences of your actions, and second, ability to plan by optimization, by search, to find a good sequence of actions that will produce the correct outcome.
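Illustration (not from the conversation): a minimal sketch of planning by search over a world model, as opposed to emitting one action after another. The names `world_model`, `cost`, and `plan` are hypothetical, and the one-dimensional "world" is a toy stand-in for a learned, action-conditioned model. The planner enumerates every action sequence up to a fixed horizon, rolls each one out through the model, and keeps the sequence whose predicted outcome best matches the goal.

```python
from itertools import product


def world_model(state, action):
    """Toy action-conditioned model: predicts the next 1-D position.

    Hypothetical stand-in for a learned model of the environment.
    """
    return state + action


def cost(state, goal):
    """Distance between a predicted state and the goal."""
    return abs(state - goal)


def plan(start, goal, horizon=5, actions=(-1.0, 0.0, 1.0)):
    """Search all action sequences and return the one with the lowest predicted cost."""
    best_seq, best_cost = None, float("inf")
    for seq in product(actions, repeat=horizon):
        state = start
        for action in seq:
            # Imagine the consequence of each action without executing it.
            state = world_model(state, action)
        c = cost(state, goal)
        if c < best_cost:
            best_seq, best_cost = seq, c
    return best_seq, best_cost


best_seq, best_cost = plan(start=0.0, goal=3.0)
print(best_seq, best_cost)
```

Exhaustive enumeration only works for tiny problems like this one. In practice the search would presumably be replaced by sampling or gradient-based optimization over the action sequence, but the structure stays the same: predict, score, and pick.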
And then there is a third characteristic, which is how do you predict the consequences of your actions? Okay, so if I have a water bottle in front of me—I realize some people will just listen to this and not have the picture—so I have an open, uncapped water bottle in front of me. If I push at the bottom, it's going to slide on the table. If I push near the top, it's probably going to flip.
We can't predict exactly how the bottle will fall, in which direction. We can't exactly predict how it's going to slide, how the water will spill, whether the table is tilted in one way and the water will flow in one direction or another. There's no way we can predict this at a pixel level. And so our mental model of the world predicts, but at an abstract level of representation.
So as you were working on this architecture, was a lot of it inspired by the human brain? I mean, obviously, the way you're articulating things is exactly how we do things.
Right, or at least by cognitive science, right? Whether you can sort of translate this into a neural architecture and things like this, there's a big gap there. So certainly cognitive science was a bit of a motivation. Or what psychologists call System 2: the idea that in deliberate, reflective behavior you imagine and predict the consequences of your actions and plan accordingly, contrary to System 1, where you just act reactively and instinctively.
So yeah, there is an inspiration, but also there is a lot of empirical evidence that you don't want to generate pixels. Okay, I've been really interested in that problem of learning models of the world by prediction for a very long time. And then I had an epiphany about five years ago, realizing that all of the architectures that have been successful at learning representations of images and videos are non-generative architectures, and all the generative ones basically have been failures, right?
VAEs, right? Variational Autoencoders or Autoencoders more generally. It's kind of a natural way to think about learning abstract representations of inputs, right? So you put an image at the input of a neural net and then you train it to just reproduce the input on its output. Now with a big neural net, if you just do it this way, your neural net will not do anything interesting. It will just learn the identity function.
Yeah.
Completely uninteresting. It doesn't work. Like, if you train a VAE to learn representations of images, you get something, but it's really not that great. Same with Sparse Autoencoders.
Then you have another set of techniques, and it's kind of derivative of something called the Denoising Autoencoder. Masked Autoencoder is a version of this; BERT is a version of this for NLP. So you take the image, you corrupt it in some way, and then you train this big neural net to recover the original image. There's a huge project at FAIR on this called MAE.
It was very disappointing. A lot of computation and not really a great, satisfying result. Simultaneously, some of the same people working on MAE, and some other people in Paris and in New York, were working on other techniques using non-generative architectures, Joint Embedding Architectures.
So take an image, corrupt it in some way, and then run the two images through encoders, and then try to predict the representation of the original image from the representation of the corrupted one. That's JEPA.
Yeah.
Okay. So JEPA means Joint Embedding Predictive Architecture, right? So you have one encoder that makes an observation, another encoder that makes a different observation. You try to predict the representation of the first one from the second one with a predictor. And those techniques turned out to work much better for representing images and video.
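Illustration (not from the conversation): a minimal sketch of where the JEPA-style loss is computed, assuming tiny hand-written linear maps for the encoder and predictor. All names and weights here are hypothetical. The point is only that the error is measured between representations, not between pixels; a real system would train both networks and would also need some mechanism to stop the encoder from collapsing to a constant output, which this sketch omits.

```python
def linear(x, weights):
    """Apply a linear map given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def jepa_loss(x_clean, x_corrupt, encoder_w, predictor_w):
    """Prediction error in representation space.

    The representation of the clean input is the target; the predictor
    tries to recover it from the representation of the corrupted input.
    """
    target = linear(x_clean, encoder_w)
    context = linear(x_corrupt, encoder_w)
    prediction = linear(context, predictor_w)
    return squared_distance(prediction, target)


# A 4-"pixel" input and a corrupted copy with the last two pixels masked out.
x_clean = [1.0, 2.0, 3.0, 4.0]
x_corrupt = [1.0, 2.0, 0.0, 0.0]

# Hypothetical untrained weights: a 4 -> 2 encoder and an identity predictor.
encoder_w = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
predictor_w = [[1.0, 0.0], [0.0, 1.0]]

loss = jepa_loss(x_clean, x_corrupt, encoder_w, predictor_w)
print(loss)
```

Training would adjust the encoder and predictor weights to reduce this loss, so the encoder is free to discard details of the input that cannot be predicted.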
So things like DINO, DINOv1, DINOv2, DINOv3 (a project that is still going on at FAIR in Paris), projects like JEPA, and then V-JEPA, and then before that there were like SimCLR and MoCo, and a whole bunch of different techniques, mostly from Meta. There was a bunch of others from other groups. But that turned out to be a much better way of learning representations of images than predicting pixels.
And so it just clicked in my mind, but you know, not just mine, that this was the way to go and predicting pixels was kind of a losing proposition.
You know, it feels like there's all these robotics demos that are released from some of the model companies that feel increasingly impressive, and maybe seem to resemble things like planning and reasoning when they maybe haven't seen a room or a specific version of a task before and are still able to execute that task. What would you say to our listeners that observe that stuff and feel like, "Ah, it feels like we're trending toward some real progress with some of the generative approaches"?
Well, there is real progress, and some of those demos are really impressive. But they are trained with enormous amounts of data collected either from operation or from just human action with things you hold in your hand that look like grippers, and you collect the data for that. Or just tracking hands and fingers of a person and then translating this into commands for a robot.
And so those things are trained with imitation learning mostly, right, and a little bit with reinforcement learning to fine-tune, mostly in simulation. So the issue with this is that you need a lot of data to train those systems through imitation, and it becomes expensive and it's a little brittle in the sense that you need to collect lots of data for every task you want the robot to solve.
Whereas if the system had a world model that allowed it to predict the outcome of an action, it would just plan an action to solve a new task without actually having to be trained to accomplish this task. So the degree of generalization you would get with a world model-based system is much, much larger, a wider spectrum of tasks, with less training data that would be required than a system trained with imitation learning and fine-tuning.
No doubt those approaches require more data, and I guess this question of generalization really is the big question, right? You know, I think some folks have shown some results around getting better at Task A helping with Task B, but that obviously still feels like the big unanswered question around those architectures.
I mean, you get this synergy between tasks. So the more tasks that you train the system to solve, the more tasks it is going to be able to acquire with a small amount of data, regardless of what technique you use. But the hope with world models is that the system can solve tasks zero-shot, which humans are completely capable of doing, right? And many animals as well.
So that's really the hope, like solving a lot more problems with either a small amount of training data or no training data at all, and just a little bit of maybe RL-style fine-tuning. Like, how is it that a 17-year-old can learn to drive in like a dozen hours or maybe 20 hours? We have millions of hours of training data of people driving cars. We still don't have level five self-driving cars, right? So imitation learning obviously doesn't work, even for just the task of autonomous driving.
Yeah, I guess it'll be a race between the ability to develop some of those capabilities, which may take time and lots of data, versus this kind of architecture. I feel like there's this dream of using video models to just generate tons of synthetic data for simulation. And, you know, even if these video models are not perfect from a physics perspective, it's helpful enough to improve robotics and the underlying physical world. What have you made of some of those approaches? Obviously, Nvidia has been focused there, and Google seems to be going down that road.
I'm sort of asking again the question, why can a 17-year-old learn to drive in 20 hours? You don't need millions of hours of demonstration, and you don't need synthetic data. You don't need any of that. So I want a system that can learn as fast as that.
If we crack that, then we don't need generated data, right? I mean, we might need to train a system in simulation, but not with the same amount of time or trials as current systems require. It's really a question of data efficiency.
You know, I interviewed Jerry Tworek on the podcast, who was at OpenAI and spun out to start his own lab. And you could sense a similar tension, where I think he actually might even agree that if you continued scaling RL the way we're scaling, you'd get more very impressive results, but I think he felt, "God, there's just got to be some way more efficient way to do this."
And it's an interesting tension, because you could imagine that if you're OpenAI and you know you could continue scaling something and it will keep getting better, there's not necessarily a ton of incentive from a business perspective to do something more data-efficient.
Right, there's no incentive for the other companies to do anything different either, because they're all chasing the same... like, they can't afford to fall behind the others, right? So, they all work on the same thing.
Yeah.
And there's a bit of this sort of herd behavior, mostly in Silicon Valley where everybody is digging the same trench.
Yeah.
And you know, so I purposely set up the headquarters of AMI Labs in Paris.
Yeah.
The American office being in New York, not Silicon Valley.
It's really interesting because I think it points to a tension that exists in the broader ecosystem today, where you could imagine the other side being, "Sure, maybe there are more data efficient methods out there, but like almost who cares because we can keep scaling what we have to better and better results." And then obviously, I think from both new things you can accomplish from these models, as well as just the joy of being a researcher and finding these new things, I get why there's such an attraction to these other architectures as well.
And it's a bet, but you know, we're pretty confident because we have results already.
And as you think about the initial spaces you're most excited about for the AMI technology, where do you think the technology goes and what are you most excited about?
Well, I mean, AI for the real world. Like, where is your domestic robot? Where is your level five self-driving car?
Yeah. When am I going to get a domestic robot? I'm excited about this.
Well, so this is several years down the line, okay? Despite the fact that there is like a huge number of companies building robots, none of those companies actually has any idea how to make them smart enough to be useful, right?
Or trust it around with a baby in the house or something.
Certainly not that. But even for a relatively narrow manufacturing task, right? I mean, none of them really knows how to do this reliably, other than by imitation learning for a small number of tasks. So how do we make those things useful? So that's kind of a relatively long-term objective.
Shorter term, there is a huge amount of applications in industry where you need to have an intelligent system that has the ability of predicting what's going to happen if I change this control variable on this complex system. Be it a jet engine, a chemical plant, a power plant, some manufacturing line, a patient, a human cell, right?
Those are systems that are sufficiently complex that you can't model their behavior with a small number of equations. Right? So the traditional way of modeling does not work. And what you need to do is train a deep learning system to model the dynamics of that system from data. And what you get at the end is a phenomenological model of that process, of that system.
And if it's action-conditioned, then you get basically a world model of that system that allows you to control it optimally for whatever purpose you have. And I think the number of applications of this in industry is mindboggling.
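Illustration (not from the conversation): a minimal sketch of learning an action-conditioned model from logged data and then using it to choose a control input. It assumes a scalar system whose true dynamics are x_next = 0.9 * x + 0.5 * u, and fits a linear model by least squares. The system, the numbers, and the function names are hypothetical, and a linear fit stands in for the deep learning model described above.

```python
def simulate(x0, controls, a=0.9, b=0.5):
    """Generate (state, control, next_state) triples from the 'real' system."""
    data, x = [], x0
    for u in controls:
        x_next = a * x + b * u
        data.append((x, u, x_next))
        x = x_next
    return data


def fit_dynamics(data):
    """Least-squares fit of x_next ~ a * x + b * u via the 2x2 normal equations."""
    sxx = sum(x * x for x, u, y in data)
    sxu = sum(x * u for x, u, y in data)
    suu = sum(u * u for x, u, y in data)
    sxy = sum(x * y for x, u, y in data)
    suy = sum(u * y for x, u, y in data)
    det = sxx * suu - sxu * sxu
    a_hat = (sxy * suu - sxu * suy) / det
    b_hat = (sxx * suy - sxu * sxy) / det
    return a_hat, b_hat


def choose_control(x, target, a_hat, b_hat):
    """Pick the control whose predicted next state equals the target."""
    return (target - a_hat * x) / b_hat


logged = simulate(1.0, [1.0, -1.0, 0.5, 0.0, 2.0, -0.5, 1.5, -2.0])
a_hat, b_hat = fit_dynamics(logged)
u = choose_control(x=2.0, target=3.0, a_hat=a_hat, b_hat=b_hat)
predicted_next = a_hat * 2.0 + b_hat * u
print(a_hat, b_hat, u, predicted_next)
```

The model is learned purely from observed transitions, with no equations of the system written down in advance, and once it is conditioned on the action it can be inverted or searched over to pick a control.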
Where do you think we'll be with JEPA models over the next couple of years? Are there milestones you'd point to, or what's your view of the path of progress here?
Okay, a couple of years is a little short. Like five years, complete world domination essentially.
Okay.
So somewhere on the path to world domination in five years. I mean, this is kind of a joke obviously, but this is a quote from Linus Torvalds, right? When people asked him, "What's your goal with Linux?", he said, "Total world domination." He actually managed to do that.
Yeah, very fair.
To a first approximation, every computer in the world runs Linux, right? So that's kind of a joke, but in the end, I think this is the blueprint for intelligent systems of the future. There will still be a small place for LLMs, you know, for like a language interface basically.
But what we're designing are systems that are capable of thinking. They may not be capable of talking or listening initially, but they'll do the thinking, and then you can add the talking and listening on top of that.
I'm sure you and the team are eagerly working to get the early proof points of this, and obviously you've already had some in the work you've done. How do you think about the interim steps of what you'll be able to show on that path to five-year world domination?
Well, so I think within a year or so, we'll have a general methodology to train hierarchical models on a very wide variety of modalities. We know we can do a good job on video, with some techniques that we're not completely happy with because they have some shortcomings, but we have sort of small-scale demonstration of a methodology that we think is really what we want.
So we need to scale that one up and get it to the same level of performance as the other techniques that are not as satisfying, if you want, on things like video, but also on other types of datasets that we would get from industry partners. Okay.
So we'll have demonstrations that we can train world models, perhaps action-conditioned world models that allow us to plan for a number of different use cases. Some of them will be robotics, some of them will be industrial process control of various types. Maybe some of them in healthcare as well, because we have partners in that domain.
And that should be within a year to 18 months. And then we'll push this methodology and those models into those use cases with partners, some of which are investors already in our company, and gain experience on how to essentially build a somewhat universal world model, if you want.
I mean, and you've obviously had this experience before of making this really contrarian bet on neural nets and being certainly proven abundantly right in the history books. I guess as you think about this bet, which I think if you talk to the majority of people at the cutting edge of various parts of AI maybe would say is contrarian today, in what timeframe do you think it will become apparent like, you know, "This was right"?
I think it'll happen faster than expected, perhaps, because you can see that "world model" is already becoming a buzzword, right? At least at a research level, and it's starting to permeate into the industry.
Yeah.
And a lot of people are realizing like, VLAs suck and LLMs don't work for real-world data. Industry has realized this already, certainly on the user side. And I think because of the importance of the robotics industry, a lot of people are trying to figure out, how do we get there? How do you make those robots useful?
So I think the realization that you need a change of paradigm is happening as we speak, and will become completely obvious to people by early 2027, I think. Now, that doesn't mean we'll have a solution by then. We hope we will, but you know, we'll see.
I guess switching gears to the LLM side, you mentioned some of this work you're doing with Tapestry, which I think would be really interesting for our listeners, and so maybe just speak to that a little bit.
Okay. So, this is kind of a little bit orthogonal to AMI Labs.
As if that wasn't enough to keep you busy.
Well, it's an idea I've been forming over the last three years or so: the fact that people increasingly use AI assistants for various things, right? I mean, you see a decrease in the use of traditional search engines, and you just ask a question to your favorite AI assistant.
And if the plan that Meta and others are developing, of having smart devices like smart glasses and stuff like that, is realized, basically you would just be talking to your AI assistant by voice through your smart glasses or maybe some other smart device. And so all of your information diet will be mediated by AI assistants.
And if you are someone somewhere in the world, let's say outside the US or China, and you have an AI assistant, and that AI assistant was built in California or Beijing or Shanghai or Shenzhen, it's not good for you. Like, you may speak a language that those systems really haven't been trained to handle particularly well.
You may have a culture that is not particularly well understood by people in Silicon Valley and China, not well represented by the training data that is publicly available on the internet. You may have a value system that is absolutely not represented by people building those models. And certainly, you'll almost certainly have political opinions that are absolutely not represented by the handful of AI assistants you might be able to get from the West Coast tech companies or from Chinese companies.
So what is the solution to this? Like, how do you serve a farmer in India, or even a philosopher in France or Germany? And what you need is a platform which basically is an open, free foundation model, LLM-style, that is fine-tunable by anyone to cater to the interests of people speaking a particular language, having a particular culture, having particular value systems, political biases, creeds, whatever it is.
And so what you need is a wide diversity of AI assistants. There's a lot of countries around the world that are neither the US nor China who absolutely want some level of sovereignty for AI, not just for their industry but also for the citizens. They don't want the citizens to get brainwashed by a Chinese model or a California model, actually. And so they want sovereignty. How do you get that?
So the way you get a platform, like an open platform like this, to get to the frontier, is you just train it on more and higher quality data than the proprietary systems. If you talk to people in India, in France, in Vietnam, in Morocco, in Switzerland, in Korea, Japan, Kazakhstan, everyone wants basically sovereignty.
And you tell them, "Like, you guys have been training your model locally, you don't have to share your data." So that's the crucial aspect of Tapestry. You would have international contributors to Tapestry, contributing to training a global model that would basically constitute a repository of all the world's knowledge and culture, if you want. But the contributors would contribute data and computing resources, but they would preserve the control on their data. They would not have to share that data with the other contributors. What they would contribute is parameter vectors.
Right? So it would be a kind of federated learning style thing where...
You have a bunch of data centers, you know, they get the parameter vector from the...
The global consensus of a model.
Think of it as an average of all the parameter vectors of all the contributors, right? So all the contributors periodically tell everyone else, through maybe a central server, "Here is my parameter vector, what is yours?" Okay. And so you exchange parameter vectors like this, and a local worker basically, whenever it updates its parameter vector, it tries to also make it as close as possible to the global consensus vector.
So as the training of this thing kind of progresses, all those parameter vectors converge towards a consensus model, essentially, which is kind of a repository of all human knowledge. Now you have an open model that is as good as if it had been trained on all the data in the world. And now you can fine-tune it for your own purpose, your own political, cultural, and linguistic biases, whatever you want, or centers of interest.
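Illustration (not from the conversation): a minimal sketch of the consensus idea described above, in which contributors exchange parameter vectors but never data. The conversation does not specify Tapestry's actual protocol, so the update rule, the `rho` pull strength, and the toy private objectives below are all assumptions. Each worker takes a gradient step on its own private loss and is also pulled toward the average of everyone's parameters.

```python
def mean_vector(vectors):
    """Average of all contributors' parameter vectors (the consensus)."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def local_step(w, local_target, consensus, lr=0.1, rho=1.0):
    """One local update.

    Combines the gradient of a private loss ||w - local_target||^2 with a
    pull of strength rho toward the consensus vector. Only w is ever shared.
    """
    return [
        wk - lr * (2.0 * (wk - tk) + rho * (wk - ck))
        for wk, tk, ck in zip(w, local_target, consensus)
    ]


# Each target stands in for what one contributor's private data would favor.
local_targets = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
params = [[0.0, 0.0] for _ in local_targets]

for _ in range(200):
    consensus = mean_vector(params)  # the only information exchanged
    params = [local_step(w, t, consensus) for w, t in zip(params, local_targets)]

consensus = mean_vector(params)
print(consensus)
```

In this toy setup the consensus settles at the point that minimizes the sum of all the private losses, even though no worker ever sees another worker's target, which is the property the federated description relies on.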
And I think there is a natural force for this to happen, because most countries that are neither the US nor China want sovereignty, but also because AI is fast becoming a platform, and there is a natural tendency for platforms to become open. That's what happened with Linux, right? And that's what happened with the software infrastructure of the internet or the wireless network. It's all open-source. It was proprietary initially, but that was all wiped out.
It's a really clever way to get around what would seem to be this trend of decreasing open-source. And obviously, I think there's been many fears that as the closed-source models get better, they'll be held back and they'll be used to train the next generation, and there'll kind of be this almost escape scenario for closed-source models where they get so much better than their open-source counterparts.
So remember who the big players of the internet infrastructure were in 1996: Sun Microsystems, HP, Dell, and a few others. So Sun Microsystems was selling you Solaris with their proprietary hardware. HP with HP-UX. They were claiming, "You know, Unix is so much more reliable than Windows, you're not going to run a web server on Windows." Dell was doing this, you know, with Windows NT, but like who is running Windows NT now as a web server?
All of this was totally wiped out by Linux. Like the entire internet runs on Linux. Even Azure, right? Even Microsoft, it runs Linux. So basically, OpenAI, Anthropic, etc., of today are the Sun Microsystems and HP-UX of yesterday.
Yeah, I mean, I guess implicit in that is obviously, you know, your view of the limitations of these models, that they can only get so good, and so it'll be possible over time for the open-source folks to catch up.
They've already run out of data, right? I mean, the openly available, publicly available text data is already all used. I mean, there's not more of it, right? So what those companies are doing is licensing commercial copyrighted data or training on synthetic data.
And I guess I'm curious because obviously there have been some impressive results in the last few years that they have been able to drive after these large-scale pre-training runs: IMO Gold, you know, the MMLU task, reasoning benchmarks keep going up.
Okay, that's very interesting. Now think about those two domains, right? Mathematics and code. Those are two domains where the language itself is the substrate of reasoning. It's not the only substrate of reasoning, but when you do mathematics, right, the formal way on a piece of paper, not the intuitive stuff, you manipulate language, right? And LLMs are really good at this.
So proving theorems and stuff like that, that's what LLMs are really good at. They're not so good at the sort of coming up with good concepts and definitions and things like that. It's more like, "Here is a problem, solve it." They're problem solvers. Mathematics is not just problem-solving, right? Most of it is actually a creative act that those things don't do.
And same for code. So LLMs are good programmers. They're not software architects. They're not computer scientists, right? But they can program for us. So they're not in a state where they can just replace humans entirely. It changes the world of humans. So humans now go one level up in the abstraction hierarchy, and our role is to decide what to build. But building it, you can get help from LLMs.
The important point is that LLMs are particularly successful at domains where the language itself is the substrate of reasoning, not for anything else.
Yeah. What would an LLM need to do to convince you otherwise?
So like a zero-shot agentic system, right? You have an agentic system, give it a new problem. It's not been trained to solve that particular problem, doesn't have a script for it. Is it going to be able to accomplish this task that it's never been trained to solve?
And unless this system has the ability of predicting the consequences of its actions and then using that for planning, it's not going to be able to do it, and you're not going to do this with an LLM. You're going to do this perhaps with a significantly augmented LLM that is capable of search and planning. And currently, LLMs that do math and code actually do this.
Yeah.
Right. Because they search for sequences of tokens that actually accomplish a particular task, and they can run the code or verify that the proof is correct or whatever. So you have a way of checking whether something that's produced is correct. But that's not a very efficient way of doing planning, and it only works in domains where this type of search can be performed in token space. What I'm talking about with JEPA is you don't do this in token space. You do this in abstract thought space.
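The "search in token space with a verifier" idea can be sketched with a toy example. The code below enumerates token sequences, runs each one as a small arithmetic expression, and keeps the first that passes every test case. The token set, the function names, and the brute-force enumeration are all assumptions for illustration; real systems sample candidates from a model rather than enumerating them.

```python
from itertools import product

TOKENS = ["x", "1", "2", "+", "*"]   # hypothetical toy vocabulary

def run(tokens, x):
    """Evaluate a token sequence as an arithmetic expression; None if it is invalid."""
    try:
        return eval(" ".join(tokens), {"__builtins__": {}}, {"x": x})
    except Exception:
        return None

def verify(tokens, examples):
    """The checker: a candidate is accepted only if it passes every test case."""
    return all(run(tokens, x) == y for x, y in examples)

def search(examples, max_len=5):
    """Enumerate token sequences, shortest first, until one passes verification."""
    for length in range(1, max_len + 1):
        for candidate in product(TOKENS, repeat=length):
            if verify(candidate, examples):
                return list(candidate)
    return None

examples = [(0, 1), (1, 3), (2, 5)]   # target behaviour: 2 * x + 1
solution = search(examples)
print(" ".join(solution))
```

The search only works because `verify` exists. The number of candidates also grows exponentially with sequence length, which is one way to read the remark that this is not an efficient way to plan.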
And I'm sure some people listening might think, "Well, you know, hey, even if it's inefficient and it works, and it works at things that are done in token space, that's still a large part of the economy."
I mean, if it works, it's fine. I mean, there's nothing wrong with using it for what they're good at. It's just not a path towards human intelligence; you're missing a huge domain.
You seem like, "Hey, it's going to tap out before it can become a software architect," whereas I'm sure...
It's not going to tap out. It's just going to have a limited ability to be deployed. It's going to become increasingly difficult to deploy it for an increasingly large number of use cases, because you're going to have to collect tons of training data for each of those use cases. And you're not going to be able to make those systems completely reliable, without hallucinations or dangerous stuff, unless those systems have the ability to predict the consequences of their actions, which means they're going to have to have explicit world models.
Yeah. So, I guess to bet against the 100% accuracy and then also the generalization across different tasks.
Right.
I guess one thing that's so interesting about the way the field has developed is obviously you shared the Turing Award with two others, and I feel like they seem much more convinced of the power or potential threats or safety risks of LLMs over time. I'm wondering, when did your views start diverging?
Uh, in 2023.
And what drove that in your mind?
I didn't change my mind. They changed their mind. Okay, and at just about the same time, and it was basically GPT-4. I mean, Geoff basically was not connected to any of that; he was never really interested in LLMs, and discovered GPT-4 in 2023 when it came out, and basically had an epiphany and said, "Oh my god, those systems are really close to human-level intelligence and possibly they have subjective experience."
And he did a quick calculation saying like, "Okay, the human cortex has about 16 billion neurons. If you want to do something like backprop, okay, the brain doesn't do backprop directly, but if it does something like backprop—some sort of gradient estimation for some sort of objective function—you probably need a network of a few neurons to reproduce the functionality of a virtual neuron in a neural net."
And so he said, "Let's assume maybe you need a circuit of 10 actual neurons to reproduce what a backprop neuron does. Then all of a sudden your cortex is only 1.6 billion neurons. Oh my god, GPT-4 is really close to this. So maybe it's going to get as smart as humans." I do not believe in this claim at all.
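The back-of-the-envelope arithmetic in that argument can be written out as follows. The symbols are labels introduced here, not terms used in the conversation: $N_{\text{cortex}}$ is the quoted neuron count, $k$ is the assumed number of biological neurons per backprop-style unit, and $N_{\text{virtual}}$ is the resulting count of equivalent units.

```latex
\[
N_{\text{virtual}} \approx \frac{N_{\text{cortex}}}{k}
  = \frac{16 \times 10^{9}}{10}
  = 1.6 \times 10^{9}
\]
```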
This is kind of Geoff's way of saying, "Okay, basically I can retire, I can declare victory. I searched for the learning algorithm of the cortex all my career. Maybe I didn't discover what it really was, but backprop seems to be a good substitute for it, works really well, and so maybe that's what we need. So I can retire and go around the world and give talks about the potential promises and dangers of AI."
That's basically what I think his intellectual trajectory has been. He's much less vocal about the potential dangers now than he was a year or two ago. He kind of realized there's probably a way to design truly intelligent systems. So first of all, he probably realized that current LLMs are not that smart, and second, that there's probably a need for a few conceptual breakthroughs before we get to human-like intelligence.
And third, that the blueprint of those systems would be quite different from LLMs, and we probably have a way of making them controllable and things like that.
Yeah.
I've been saying this for years, but okay, he sort of discovered this recently. There's a similar thing with Yoshua. I think what they are both worried about is the ability of society and the political system to make sure that the benefits of AI would be maximized and AI would not just make a few rich people even richer, accentuate inequalities, and cause major catastrophes because of bad usage. Okay, this is not the doomer scenario of AI taking over the world. It's more bad users.
Which seems possible with the LLMs of today.
Which is a danger, but I don't think it's as apocalyptic as what some people have claimed it is. Certainly not as apocalyptic as what even Anthropic has claimed and has tried to lobby governments into, scaring governments into regulating AI because of that. I don't subscribe to this at all.
They seem to genuinely believe it.
I think they genuinely believe it, but also I think there are some good commercial reasons for them to believe that, and to kind of brainwash some people and governments into thinking their systems are dangerous.
And it sounds like with these other architectures, because obviously it doesn't seem like you think we're particularly far away from some very compelling capabilities. How do you think about the safety around if these breakthroughs end up coming from newer architectures, and whether that should make us rest easier or not?
I'm going to say something that again might be controversial, and certainly some of my colleagues at Meta didn't like me saying this, but I think LLMs are intrinsically unsafe. I don't think they can be made reliable and safe. Okay, they cannot be made reliable because you can't stop them from hallucinating. And if they are agentic, you cannot guarantee they're not going to take an action that they didn't predict the outcome of.
I mean, does it surprise you they can do these like 15-hour coding tests given the concerns around reliability?
Well, coding is something where you can actually verify that the code you generate satisfies your specification. But not everything is coding, and there are examples of coding agents wiping out your hard drive, right? Or doing stupid things that make you lose a lot of money or data or whatever.
So I think LLMs in their current forms are intrinsically unsafe because they cannot predict the consequences of their actions, and because the task that they accomplish is determined subject to their training. You give them a prompt, and then they will accomplish a task that corresponds to that prompt only to the extent that their training has conditioned them to actually do the right task corresponding to this prompt. But there's no hardwired constraint that will force them to accomplish this task and then predict that the task would be accomplished properly.
Yeah. I mean, I think famously in the early days, right, you'd ask them a question and they'd keep asking the question.
Right, for example. Or I mean, also they don't have common sense, right? So I mean, there's the joke that was circulating like a month ago of, "I need to wash my car and the car wash is 100 yards from my house. Should I walk?" I tried it again like maybe two weeks ago. They all say yes, you should walk, except Gemini.
Gemini says...
So they're training on your video of having given that speech before.
It was not my video because I didn't come up with this.
I remember whoever came up with it.
Yeah. Right. Whoever came up with it. But there are a few instances where I said like, "An LLM can't do this," and then six months later it was capable of doing it. And it's simply because as soon as people watch the podcast of me saying an LLM can't do this, they of course type it into ChatGPT. So now it becomes part of the training set, right?
And now of course, the next version has that in the fine-tuning set. And of course, it can answer the question, but it's not because it became smart all of a sudden. It's just because it was explicitly trained with that question. So LLMs are intrinsically unsafe. I don't think there is any way to fix that in the current paradigm.
And what I've been proposing, the architecture I've been talking about, is objective-driven AI. So basically you give an objective to an AI system, which is, "Accomplish this task." Now, how does the system know it will accomplish this task? It has a world model, and it predicts the outcome of a sequence of actions it imagines taking.
And the system works by optimization: it searches for a sequence of actions whose predicted outcome, according to its model, minimizes a cost function that describes to what extent the task has been accomplished. If that is the way the system works, it can do nothing else.
Yeah.
Okay. And of course, there's many things that can go wrong there. In particular, the cost function might be inaccurate. It could be that the cost function you think is actually measuring to what extent the task has been accomplished is not accurate. The world model might be inaccurate. So the prediction that the system makes is actually not the right one. Its prediction of what was going to happen as a consequence of its action wasn't right.
Okay. So the system can still make mistakes, but it can predict the consequences of its actions to some extent, which is, I think, indispensable for any agentic system. Now, what you can add to that system is not just a cost function that guarantees a task has been accomplished, but you can also add a bunch of other objective functions, other cost functions or even constraints that are safety constraints, that say, "Okay, you know, don't hurt anybody on the way," right?
And you cannot specify this at an abstract level, but you can have low-level objective functions that, put together, will guarantee that the system will not be dangerous. And the system cannot violate those things by construction. It will have to satisfy those conditions. Not the case for an LLM. The LLM can always escape. There's a gap between your training error and test error. There's always going to be a prompt where the system is going to do really stupid things.
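The objective-driven loop described above (predict outcomes with a world model, score them with a cost function, reject anything that violates a constraint) can be sketched on a toy grid. The grid world, the exhaustive search over short action sequences, and every name below are assumptions for illustration, not the architecture being proposed. As the speaker notes, any guarantee holds only to the extent that the world model and cost function are accurate.

```python
from itertools import product

ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def world_model(state, action):
    """Predict the next state from the current state and an imagined action."""
    dx, dy = ACTIONS[action]
    return (state[0] + dx, state[1] + dy)

def rollout(state, actions):
    """Predicted sequence of states if the given actions were taken in order."""
    states = []
    for action in actions:
        state = world_model(state, action)
        states.append(state)
    return states

def task_cost(final_state, goal):
    """How far the predicted outcome is from accomplishing the task (0 means done)."""
    return abs(final_state[0] - goal[0]) + abs(final_state[1] - goal[1])

def is_safe(states, forbidden):
    """Hard constraint: no predicted state may fall in a forbidden cell."""
    return all(s not in forbidden for s in states)

def plan(start, goal, forbidden, horizon=4):
    """Pick the lowest-cost action sequence among those that satisfy the constraint."""
    best_actions, best_cost = None, float("inf")
    for candidate in product(ACTIONS, repeat=horizon):
        predicted = rollout(start, candidate)
        if not is_safe(predicted, forbidden):
            continue                      # unsafe plans are never even considered
        cost = task_cost(predicted[-1], goal)
        if cost < best_cost:
            best_actions, best_cost = candidate, cost
    return best_actions, best_cost

start, goal = (0, 0), (2, 0)
forbidden = {(1, 0)}                      # the direct route is blocked
best_actions, best_cost = plan(start, goal, forbidden)
print(best_actions, best_cost)
```

Because unsafe candidates are filtered out before the cost is compared, the planner cannot return a plan that its own model predicts will violate the constraint. That is the "by construction" property, relative to the model.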
To talk through one specific space around LLMs, like you know, I think you're obviously really excited about AMI and healthcare. And people have been using LLMs in healthcare for all sorts of things. And so I'm curious how you think about the set of things where LLMs are just not going to work in healthcare and you need a model that understands the world better.
So I mean designing a course of treatment for a chronic disease, for example, or even a non-chronic disease, for a particular patient which may not completely fit into templates that you've observed before. But if you have a good mental model of the dynamics of the physiology of the patient, you might design a course of treatment that will actually bring the patient to a good state.
Yeah.
When I'm saying "a patient", it can be a cell. Okay. How do you tell a stem cell to turn into a pancreas beta cell that produces insulin? Okay, you have a patient with type 1 diabetes and you know their immune system basically eats up their own beta cells, right? It's autoimmune. How do you keep making beta cells? Can you send a message? Do you have a model of a human cell that will allow you to figure out what sequence of messages you need to send to a stem cell so that it turns into a beta cell?
The less LLM-pill camp and the LLM-pill camp talk past each other. It's like, I think it's actually very possible that what LLMs can do—which is maybe scaling the treatment you get at the top doctor or the top place, scaling that around the world—like unbelievable potential impact of that if you're able to do that. And then what you're talking about, which is certainly still on the come for a lot of these things, is, "Okay, and well even better than the top doctor, like how do you go do that?"
But it's more than just a top doctor, right? Because what the LLM can do well is, it can sort of regurgitate knowledge that you can read in books mostly. But if medicine was only about accumulating declarative knowledge that exists in books, you could be a doctor by just reading books. And you can't be a doctor by reading books. You have to do residency and actually listen to the heart and press on the belly and things like that to diagnose appendicitis or whatever it is.
Yeah. Yeah. Right. It's interesting. I would be very curious to see whether LLMs themselves can provide top-quality healthcare globally. We'll have to check back in on that one. It seems like it's pretty close.
You know, I definitely also want to hit on your time at Meta, because you spent over a decade building one of the most respected research labs in the world. Obviously, you recently left. As you reflect back on the time there, what do you think you got most right and most wrong in your time running FAIR?
So the thing we got right is building a top research lab that really innovated, produced a lot of the basic methods and science and tools like PyTorch that are useful to the entire industry, right? I mean, the entire industry is built on PyTorch basically, except for a few people at Google. And I think a culture of openness and scientific process, which I think is necessary for breakthrough innovation.
Yeah.
Because there's a whole chain of innovation, right? You have blue-sky research, new concepts. A lot of that takes place in universities. Some of that takes place in advanced research labs in industry, which can be counted on the fingers of one hand—Google is a good one, FAIR was a good one, hopefully will still be, I'm not sure, and a few others.
Then you have, "Okay, this is a good idea, let's push it forward and see if it can be made useful." But still at the research level, in the sense of, we're not going to fool ourselves, we're not going to try to just find a solution that just works for this problem. We're going to see if this technique that we imagined or picked up from other people in the community can actually be pushed and be made practical—not as a product, but like we can show that it beats some record on some task or benchmark.
And then the next stage is for the company that hosts the research lab to say, "Okay, now we're going to push the button, devote big engineering effort to that vision, and then push it forward." That is where a lot of projects fail. That's where a lot of companies kind of fail to pick up. Meta was actually pretty good at this. Okay, but far from perfect. It was not a textbook example of how you do it wrong, like Xerox PARC totally missing out on GUI interfaces and the mouse and windowing systems, right?
Meta kind of missed a few steps essentially, and it's partly just organizational. It's partly because you need an organization that is pretty close to research but not completely a product organization, to take the relay of pushing a technology a little further. Not making a product with a three-month deadline, but pushing things. And we had that at one point.
Yeah.
At Facebook and Meta. And then we lost it, and FAIR was basically isolated within the company. Had lots of ideas that nobody picked up on. And then in 2023, the GenAI organization was created by taking about 60 or 70 scientists and engineers from FAIR initially, and then it built up. But then it was under so much short-term pressure that basically that organization, GenAI, didn't have time to talk to FAIR.
And so instead of being at the forefront and innovating in LLMs, GenAI basically had to focus on short-term things and became very conservative. So there was a gap, basically an impedance mismatch between research...
Yeah. And is that kind of what happened with Llama 4?
Yeah. Well, even starting with Llama 3. So Llama 1 was a small project within FAIR. In early 2023, GenAI was created. The Llama people were basically moved to GenAI. They started working on Llama 2, and then a bunch of them realized, "Like, I could do a startup." So that was the genesis of Mistral.
Yeah.
Okay. Two of the authors of Llama basically created Mistral with another guy from Google, and a few people kind of left and sort of did other things. This was not a happy time at Meta for various reasons. And so there were a bunch of people who kind of left.
And then the GenAI organization which kind of took over Llama 2 to some extent, and Llama 3 and 4, was under so much short-term pressure that they became very conservative. And you know, it's a combination of disparity of the groups, pressure from the leadership, and I mean there's many ways things can go wrong, and you can't blame anyone in particular. But yeah, that's kind of what happened.
I mean, it feels like a lot of these organizations obviously are under short-term pressure right now because there's just an incredible race going on. And so I'm curious, like obviously this FAIR setup you had, and there's a similar one at Google for many years, and certainly many researchers running around OpenAI and Anthropic trying many different things. Do you think that is still possible going forward? Or is one of the only paths to leave and do your own company? Are there still places within the industry that you think have this original ethos of FAIR even amidst the race dynamics that are happening?
I think there are a few places within Google Research and DeepMind where people actually do research. But increasingly, the industry has become more closed, right? I mean, Google certainly clammed up, and Meta and even FAIR are kind of going a bit in the same direction. There are more restrictions on publication now.
And so it's less appealing for people who really want to do breakthrough research, and they don't get as many resources. If they do something that is relevant in the medium term, they're told not to talk about it, and so it's not a good atmosphere, I think, for breakthrough research. It's not conducive, you know.
I mean basically, the best way to get breakthrough research of the type that we were getting in the early days of FAIR, and at Bell Labs in the good days, in Xerox PARC, is you hire the best people—and those are people who have a good nose to know what to work on, what projects to attack—you give them the means to succeed, and you get the [ __ ] out of the way, all right? Pardon my French.
Yeah, I mean I'm curious what impact it then ends up having on the broader research communities. Obviously, one of the legacies of FAIR is you trained so many researchers, right? And they're all throughout the ecosystem. And it feels like now the equivalent of those people that came in younger in their careers at FAIR, they're joining these labs with maybe shorter-term priorities and focus. I guess I'm wondering, in this current ecosystem where it feels like a lot of younger people getting into the field are thrust much more into these short-term dynamics, does that change anything about the way the ecosystem evolves?
Well, I mean, the people who tend to want to work with me are generally people who are sufficiently crazy to do it first.
Very fair.
And they kind of subscribe to the whole idea that in academia and during your PhD, you should work on the next generation of AI systems. You shouldn't work on the current generation.
Yeah.
Like if you work on LLMs in academia now, it's incredibly boring. At least to me, it's boring. It's basically studying how and why LLMs work and explaining why they work or what the limitations are. It's like descriptive science. It's really not very creative, like I don't find that particularly interesting. It's useful.
Yeah.
And you know, if you really want to kind of show how to do new things with LLMs, like you're not going to have the GPUs you need for that.
Totally.
So like forget that. Don't work on LLMs if you're doing a PhD. Like there's no point, you cannot contribute.
How do you know it was time to leave Meta? It sounds like you were thinking through some of these things over a period of time. You know, was there a moment that it crystallized, or?
Well, it was a combination of things, right. So first of all, you have to understand, a lot of people have a completely wrong idea about what my role at Facebook was. So I joined in late 2013, really kind of started early 2014. The first four and a half years, I was director of FAIR. So I built the FAIR organization, set up the culture, hired the key people, and sort of managed it.
And after four and a half years, I stepped down from that role for a number of reasons, and I became Chief AI Scientist. Okay. So, the reason is I was basically getting close to turning 60, first of all, I was 58, and I just didn't want to do management. Okay. I mean, I was ready to do it for a while to get the organization started, but I'm just not good at it. It's not my thing. I'm more like a scientific or technical visionary and engineer and scientist.
So other people are much better at management than I am. So I basically stepped down. Joelle Pineau and Antoine Bordes basically, yeah, took over the directorship of FAIR, and I became Chief AI Scientist. So I was reporting to the CTO.
And I had goals of basically restarting a research project that I thought was necessary, because the ambition of FAIR was always to build intelligent systems.
Yeah.
Right. And I thought, you know, I had put my own research on hold while I was running FAIR. I just didn't have the time, and I thought it was important to basically kind of design the architecture of human-level, human-like AI systems.
And I had come up with the concept that this was going to be based on self-supervised learning, and on prediction from sensory signals like video. I mean, these are old ideas. And world models. I actually gave a keynote at NeurIPS in 2016 where I said like, this is the way AI research should go, like world models, predict the consequences of your actions, and plan. And I said, you know, RL is not the thing that will take us there because it's too inefficient, supervised learning has shown its limits, and so the future is self-supervised learning and world models.
So how do we do self-supervised learning and world models? I started a few projects on this with a few avenues that didn't pan out, some projects on video prediction and stuff like that, and then came up with this concept that you could train self-supervised learning from video, but you have to train the system to make predictions in representation space. So that's the idea of JEPA, and if you have JEPA, you can turn it into a world model by making it action-conditioned, and then you can use it for planning.
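To make "predictions in representation space" concrete, here is a toy sketch of where the loss is computed in an action-conditioned setup. It is a hand-built illustration, not JEPA itself: the fixed linear encoder, the additive predictor, the four-number "observation", and all names are assumptions, and it omits training entirely, including whatever a real system needs to keep the encoder from collapsing to a trivial output.

```python
def encode(observation, weights):
    """Map a raw observation (list of floats) to a smaller representation."""
    return [sum(w * o for w, o in zip(row, observation)) for row in weights]

def predict(rep, action, gain):
    """Action-conditioned predictor that operates on representations only."""
    return [r + gain * action for r in rep]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def representation_loss(obs, action, next_obs, enc_w, gain):
    """Compare the predicted representation with the encoding of what actually happened."""
    s_now = encode(obs, enc_w)
    s_next = encode(next_obs, enc_w)
    return sq_dist(predict(s_now, action, gain), s_next)

# Hypothetical observation: two entries carry a position, two carry unpredictable noise.
enc_w = [[0.5, 0.5, 0.0, 0.0]]            # encoder keeps the position, drops the noise
obs = [1.0, 1.0, 0.3, -0.7]
action = 2.0                               # move the position by +2
next_obs = [3.0, 3.0, 0.9, 0.1]            # position moved; noise changed arbitrarily

rep_loss = representation_loss(obs, action, next_obs, enc_w, gain=1.0)

# For contrast: a naive predictor that must reproduce every raw entry.
pixel_prediction = [o + action for o in obs]
pixel_loss = sq_dist(pixel_prediction, next_obs)
print(rep_loss, pixel_loss)
```

In this contrived case the representation-space error is zero while the raw-input error is not, because the encoder is free to ignore details that cannot be predicted. Making the predictor take `action` as an input is what the passage means by turning it into a world model that can be used for planning.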
So I had this idea around 2020, and in 2022 I wrote a long vision paper. So I said I'm just going to write a paper with my entire vision. Okay, spill all my secrets, like I don't care. But maybe they will rally a bunch of people to that vision. And boy did it work. Because not only did I rally, you know, a bunch of students who kind of came working with me at NYU or in Paris because they wanted to work on this, but also a whole team at FAIR who said, like, "This sounds great. That's what we want to work on."
And then Joelle Pineau said, "Well, maybe this should be like a major mission of FAIR." We called it Advanced Machine Intelligence. That was the internal name of the project.
Interesting. Okay. And they let you run with it?
And now it's the name of the company. And, you know, Mark Zuckerberg kind of read that paper and knew what it was about and subscribed to the project. And Andrew Bosworth, the CTO, also, and Mike Schroepfer, the previous CTO, and Chris Cox, who was my direct manager, the Chief Product Officer, also loved the idea. So, like, you know, there was a lot of support in the leadership for this project that we internally called AMI.
And it started really kind of working for video. But, you know, the company kind of refocused all of its effort on LLMs. Despite support from Mark and Andrew—Boz, we call him Boz—um, you know, all the layers below didn't see the point, I think. And so politically it sort of became a little difficult.
The applications, as I said, of JEPA world models, there are applications in, like, wearable agents and stuff like that, and robotics. But Meta chose to get rid of its entire robotics AI group that was led by Jitendra Malik, who is now at Amazon. And so, you know, clearly it wasn't the right environment anymore. Most of the applications were in industry that Meta had no interest in. FAIR was increasingly getting pressure to kind of basically help mostly with LLMs.
So yeah, you know, that made it clear. And that framing worked really well with investors too, because when we had to raise money for AMI, everybody knew my story. You know, many investors and staff at various VCs had read my paper, had listened to my talks, and had bought my story. They were realizing, you know, that LLMs had limitations and were kind of interested in the idea of building the next generation of AI systems.
I guess, was the Scale acquisition like part of this catalyst of the pure LLM focus internally?
Yeah, definitely. I mean, there's probably some, you know, other reasons to it. I think, you know, maybe—I don't have any sort of inside information to comment on this—but it's possible that Mark sees in Alex kind of a potential successor to himself, like a younger version of himself.
Yeah, I feel like a lot of the popular narrative, you know, in the media has been like, "Oh, like, you know, when Alex comes in, it then gets harder to run like a research organization." You know, I don't know to the extent you felt that or...
Well, okay. So here's a big misconception about my role, my relation to Alex, and how AI was run at Meta. I had zero technical contribution to Llama, like none whatsoever. My one contribution to Llama was to argue for open sourcing Llama 2, because there was a big internal debate whether we should open source. Like, the legal department was against it, the policy department was kind of against it, the comms department was for it. All the engineering side was for it, like Boz was for it.
So there were enormous internal discussions at a very high level, you know, 40 people from Mark Zuckerberg down, every week for two hours for months. So really it was, you know, kind of a big debate internally. And I really, really, really, you know, pushed, argued for the fact that—and Boz also was very vocal about it—that the safety risks were basically overblown. The opportunities to create an industry were extremely strong, and that we were going to jumpstart the AI industry by open sourcing Llama 2. And in fact, that's exactly what happened.
But I had zero contribution to Llama, positive or negative. Like, I didn't do anything to stop it or slow it down or anything. There was a lot of people working on LLMs within FAIR, and it was fine. I never said anything against it. Okay. Um, other than saying this is not a path to human intelligence, but it's fine. It's useful. You know, same thing for speech recognition, translation, right?
And particularly since 2018, when I stepped down from being Director of FAIR, I didn't have any direct influence on what people were working on other than, you know, basically publishing my vision and then rallying people around my project. But you know, they were working with me because they wanted to, not because I was their boss. I wasn't telling them to work with me.
And so I had no positive or negative influence on LLMs, okay, within Meta. And I had some influence on the strategy, but it was more like the long term and how you maintain a research lab and things like this. And in the last year, and you know, I mean starting maybe early '24 and certainly in '25, the way FAIR was kind of... the direction in which it was moved and managed basically did not correspond to what I thought was necessary to preserve innovation, research, and breakthrough, and preserve the good people. Like, a lot of good people have left already.
Yeah. And I guess a lot of, you know, it probably was harder to get people to work on the stuff you were working on internally, and I'm sure there was pressure for you yourself to work on a lot of the LLM stuff.
Yeah. Yeah. No, but a lot of other people also have left, right?
No, it's fascinating. I mean, one thing I'm struck by throughout our whole conversation is I feel like you've had a remarkably consistent point of view, like you know, in the space for a long time. And you can go back to a bunch of the earlier talks you referenced. You know, obviously it is a fast-moving space and a ton of interesting things have happened in the last year. What's like one thing you've changed your mind on in the last year?
I mean, the whole idea of what we used to call unsupervised learning that we now call self-supervised learning. You know, until about 2003, the whole idea of unsupervised pre-training, where you get a good representation for the input data and then you fine-tune the model with a little bit of supervised labeled data, and it sort of gave us, you know, some evidence that this whole technique could work. I tried to apply this to video because ultimately what I wanted to do is train a system to understand how the world works by just watching the world go by.
Yeah. Right. I mean, that's the basic idea.
And sort of started to argue for this in the early 2010s. Did some work on simple video prediction. We didn't have GPUs. Okay. And then sort of doing this more seriously after the creation of FAIR by doing pixel-level video prediction, realizing that wasn't working. But then arguing for self-supervised learning. Okay, this whole idea of, like, training a system generically, not to solve a task but to basically just predict, and then using the representation that is learned this way as input to a downstream task that you can train supervised or reinforcement or whatever.
So that was a bit of the topic of my second half of my keynote at NIPS in 2016. It was still called NIPS at the time.
Yeah, of course.
In 2016. And then I kept kind of pushing for this idea and tried to discover some methods to get that to work. And what surprised me is that that became incredibly successful, but not for video, for language. LLMs basically are a blindingly successful example of self-supervised learning.
No, that they are. Um, well, I feel that's almost like the perfect note to end on, but I want to make sure to leave the last word to you. Um, I feel like all our listeners are very familiar with you, but I want to at least give you the mic to point them to anything that you think they should check out with some of the new stuff you're doing, or any of your work you want to point to. The mic is yours.
Okay. Let me tell you one thing. An LLM works because when you have a sequence of discrete symbols, making predictions is easy because there's only a finite number of possible symbols in your language, 100,000 possible tokens or something like that, right? And you can have your neural net produce a probability distribution over all possible tokens. And then you can sample from that distribution, shift the token into the input, and then produce the next token. And you can do autoregressive prediction.
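The loop described here can be sketched in a few lines. This is a minimal illustration, not how any production LLM is implemented: the six-word vocabulary and the `toy_logits` function are hypothetical stand-ins for a real tokenizer and a trained neural net.

```python
import math
import random

# Toy vocabulary (an assumption for illustration; real LLMs use on the
# order of 100,000 tokens, as mentioned above).
VOCAB = ["the", "cat", "sat", "on", "mat", "."]


def toy_logits(context):
    """Hypothetical stand-in for a neural net: one score per vocabulary entry."""
    last = context[-1]
    # Favor the token that follows the last one in VOCAB order; purely illustrative.
    return [2.0 if i == (last + 1) % len(VOCAB) else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    """Turn scores into a probability distribution over the finite vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, steps, rng):
    """Autoregressive loop: predict a distribution, sample from it, append, repeat."""
    context = list(prompt)
    for _ in range(steps):
        probs = softmax(toy_logits(context))
        next_token = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        context.append(next_token)  # shift the sampled token into the input
    return context


tokens = generate(prompt=[0], steps=5, rng=random.Random(0))
print(" ".join(VOCAB[t] for t in tokens))
```

The key point is that `softmax` can enumerate every possible next symbol, which is only possible because the vocabulary is finite and discrete.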
Okay, so that's a special case. If you have the real world, you can't use a generative model. So now you have to train a system that learns a representation and makes predictions in the representation space. There's a big issue with this which, until about five years ago, I didn't think was easily solvable, even though I invented one technique to solve it, you know, decades before that.
And the problem is this: you take two inputs, let's say the initial segment of a video and the continuation of that video, or you take one image and a corrupted version of it, you run them both through an encoder, and you train a predictor to predict the representation of one from the representation of the other. There's a very simple solution where the system basically predicts a constant representation, and the prediction problem becomes trivial. That's called representation collapse.
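A minimal sketch of why the collapsed solution is attractive to the optimizer, assuming a hypothetical encoder that ignores its input and a hypothetical identity predictor. The input vectors are made-up numbers, not real data.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def collapsed_encoder(x):
    """Hypothetical degenerate encoder: ignores the input, returns a constant."""
    return [0.0, 0.0, 0.0]


def identity_predictor(z):
    """Hypothetical predictor: guesses that the target representation equals its input."""
    return list(z)


def prediction_loss(x, y):
    """Loss for predicting the representation of y from the representation of x."""
    return mse(identity_predictor(collapsed_encoder(x)), collapsed_encoder(y))


video_start = [0.2, 0.9, 0.4, 0.1]          # made-up "initial segment"
video_continuation = [0.7, 0.3, 0.8, 0.5]   # made-up "continuation"
unrelated_image = [9.0, -3.0, 5.5, 0.0]     # made-up unrelated input

loss_related = prediction_loss(video_start, video_continuation)
loss_unrelated = prediction_loss(video_start, unrelated_image)
print(loss_related, loss_unrelated)  # prints "0.0 0.0"
```

The loss is zero for both pairs, so the prediction objective alone cannot tell a useful encoder from one that has thrown away all information about the input.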
So the big question of self-supervised learning for JEPA, for the Joint Embedding Predictive Architecture, is how do you prevent collapse? Yeah. The solution that I came up with many years ago, 1993, is contrastive learning. So basically you have examples of things that should be predictable from one another and examples of things that should not be predictable from one another. It turns out this method works, but it doesn't scale very well with dimension.
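As an illustration, here is one common margin-based form of a contrastive loss on pairs of embeddings. It is a sketch of the general idea, not necessarily the 1993 formulation; the function names and the `margin` value are assumptions.

```python
import math


def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_pair_loss(z_a, z_b, compatible, margin=1.0):
    """Pull compatible pairs together; push incompatible pairs at least `margin` apart."""
    d = distance(z_a, z_b)
    if compatible:
        return d ** 2
    return max(0.0, margin - d) ** 2


# With a collapsed encoder, every input maps to the same embedding.
collapsed = [0.0, 0.0]
loss_positive = contrastive_pair_loss(collapsed, collapsed, compatible=True)
loss_negative = contrastive_pair_loss(collapsed, collapsed, compatible=False)
print(loss_positive, loss_negative)  # prints "0.0 1.0"
```

The incompatible pair is what penalizes collapse: a constant embedding scores perfectly on compatible pairs but pays the full margin on every incompatible one. The number of incompatible examples needed to cover the embedding space grows with its dimension, which is one way to read the scaling concern raised above.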
There's another technique that was actually invented by Geoff Hinton and Sue Becker in the late 80s, where you have those two networks and you try to maximize the mutual information between them. Jürgen is mad at me because he also came up with a version of this in 1992 and he says that's JEPA. It's not JEPA. It's just another way of preventing collapse of a joint embedding architecture. Okay. Um, which is fine, but, you know, it's a particular way of doing it, which I don't think is particularly good.
So, okay. So now you have this JEPA architecture. You have to come up with a good way of preventing collapse and there are a couple of ways. So as I already said, contrastive methods I think are not a good approach. There's another set of methods that are kind of called distillation methods, and they do prevent collapse. We don't know why. So a good example of that is DINO. That's a joint embedding method using the distillation method.
Basically, one of the encoders is used as a teacher for the other encoder. The encoder that is being trained, you do backprop to it. The one that is not being trained, you don't do backprop, but you share the weights with the other one through an exponential moving average. It's a collection of recipes. There's a paper from DeepMind about it called Bootstrap Your Own Latent, which uses this trick. That trick is derived from some intuition from reinforcement learning and somehow it prevents collapse, but we don't know why.
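The weight-sharing step can be sketched as follows. This is a toy illustration with weights as flat lists of floats and made-up gradients; the `decay` and `lr` values are assumptions, and real implementations such as DINO or BYOL add further details not shown here.

```python
def student_step(student, grad, lr=0.1):
    """Ordinary gradient step: only the student receives backprop updates."""
    return [w - lr * g for w, g in zip(student, grad)]


def ema_update(teacher, student, decay=0.99):
    """Teacher weights track the student through an exponential moving average."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


student = [1.0, -2.0, 0.5]
teacher = list(student)  # the teacher starts as a copy of the student

# Made-up gradients standing in for backprop through the student only.
fake_gradients = [[0.5, -0.5, 1.0], [0.2, 0.1, -0.3], [-0.4, 0.0, 0.6]]

for grad in fake_gradients:
    student = student_step(student, grad)   # trained with backprop
    teacher = ema_update(teacher, student)  # no backprop, slow average instead

print(student)
print(teacher)
```

Because `decay` is close to 1, the teacher changes much more slowly than the student, so the student is always chasing a lagged, smoothed version of itself.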
Okay, there are a few theoretical papers on it that explain why it possibly might work in some simple cases, but it's not satisfactory. The cost function you think you're minimizing, you're not actually minimizing, and so you can't monitor it. It actually goes up when you train. I mean, it makes no sense, so we don't like this method. But it works, and some of the models we've trained, the large scale video representation learning systems V-JEPA, V-JEPA 2, V-JEPA 2.1, are trained using this method. I-JEPA also.
But we're moving away from this and now we have a few papers that came out recently on an explicit regularizer to prevent this collapse, which basically tries to maximize the information content coming out of the encoder. So it's in the same family as the Becker and Hinton from '89, and the Schmidhuber 1992, and a bunch of others since then. And to some extent contrastive techniques also, although it's not sample contrastive.
And then the question is how do you measure information content? How do you maximize the information content coming out of a neural net? And the problem is if you want to maximize a quantity, you either need to be able to measure it or you need to have a lower bound on it. For information content, we only have upper bounds. We cannot measure it. We can only come up with upper bounds.
And so we take an upper bound and we cross our fingers. Okay. And it kind of works. So the latest one is called SIGReg. That means Sketched Isotropic Gaussian Regularization. We had a previous one called VICReg, Variance Invariance Covariance Regularization.
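To make the VICReg name concrete, here is a minimal sketch of its three terms on a small batch of embeddings, following the general form described in the VICReg paper. The `gamma` and `eps` values and the toy batches are assumptions, and the weighting between the terms is left out.

```python
import math


def covariance_matrix(batch):
    """Sample covariance of the embedding dimensions (rows of batch are samples)."""
    n, d = len(batch), len(batch[0])
    mu = [sum(row[j] for row in batch) / n for j in range(d)]
    return [[sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in batch) / (n - 1)
             for j in range(d)] for i in range(d)]


def variance_term(batch, gamma=1.0, eps=1e-4):
    """Hinge penalty when a dimension's standard deviation falls below gamma."""
    cov = covariance_matrix(batch)
    d = len(cov)
    return sum(max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(d)) / d


def covariance_term(batch):
    """Penalty on off-diagonal covariance, pushing dimensions to be decorrelated."""
    cov = covariance_matrix(batch)
    d = len(cov)
    return sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / d


def invariance_term(batch_a, batch_b):
    """Mean squared distance between embeddings of paired inputs."""
    n = len(batch_a)
    return sum(sum((x - y) ** 2 for x, y in zip(ra, rb))
               for ra, rb in zip(batch_a, batch_b)) / n


collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]           # every input maps to one point
spread = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]  # varied, decorrelated

print(variance_term(collapsed), variance_term(spread))
print(covariance_term(spread), invariance_term(collapsed, collapsed))
```

The invariance term alone is minimized by collapse, as with any joint embedding prediction loss. The variance term is what makes a constant embedding costly, and the covariance term discourages the dimensions from all encoding the same thing.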
And the SIGReg stuff is really cool. So this is some work by Randall Balestriero, who was a postdoc with me and is an assistant professor at Brown now. And it basically consists in forcing the distribution of variables coming out of the encoder to be joint Gaussian, which essentially sort of maximizes information, if you want. It's just a very different way of doing it than what Jürgen Schmidhuber and Sue Becker and Geoff Hinton were doing.
And so this is super promising in my opinion, and we have, you know, variations of it. One that can produce sparse representations, another one that can produce isotropic representations but not necessarily Gaussians. And we have a paper with Randall and a student at Mila, where we train a world model with this. It's still small scale, but I think it's super promising. So if you want to read one paper, read that paper. It's LeWorldModel.
Awesome. We'll definitely link to it too.
Yeah, I'm not responsible for the name. Randall picked it.
Amazing. Well, Yann, seriously, thank you so much. It is such a privilege to get to spend the last bit of time with you and really appreciate you coming on the podcast.
Thanks for having me. That was fun.
I'm Jacob Effron and this has been Unsupervised Learning, a podcast where I get to talk to the smartest people in AI and ask them tons of questions about what's happening with models and what it means for businesses in the world. As I hope is clear, I have a ton of fun doing this. It's a nights and weekends project in addition to my day job as an investor at Redpoint. But our ability to get these incredible guests on really comes from folks like you subscribing to the podcast, sharing it with friends. It's really what ultimately makes this whole thing work. And so, please consider doing that. And thank you so much for your support and listening. We'll see you next episode.