Stanford CS25: Transformers United V6 | From Representation Learning to World Modeling

Stanford Online
For more information about Stanford’s graduate programs, visit: https://online.stanford.edu/graduate-education

April 9, 2026

This seminar covers:
• How world models are increasingly moving away from reconstruction and toward prediction in latent space
• Two recent JEPA-based approaches that illustrate this shift from complementary angles

Follow along with the seminar schedule. Visit: https://web.stanford.edu/class/cs25/

Guest Speakers: Hazel Nam & Lucas Maes (Brown University)

Instructors:
• Steven Feng, Stanford Computer Science PhD student and NSERC PGS-D scholar
• Karan P. Singh, Electrical Engineering PhD student and NSF Graduate Research Fellow in the Stanford Translational AI Lab
• Michael C. Frank, Benjamin Scott Crocker Professor of Human Biology; Director, Symbolic Systems Program
• Christopher Manning, Thomas M. Siebel Professor in Machine Learning, Professor of Linguistics and of Computer Science, Co-Founder and Senior Fellow of the Stanford Institute for Human-Centered Artificial Intelligence (HAI)
Hosts: Hazel, Lucas Maes, Host
📅April 22, 2026
⏱️01:11:03
🌐English

Disclaimer: The transcript on this page is for the YouTube video titled "Stanford CS25: Transformers United V6 I From Representation Learning to World Modeling" from "Stanford Online". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.

Watch the original video here: https://www.youtube.com/watch?v=GBd7iuJkW08

00:00:05Host

All right, great. So, hi everybody. Welcome to the second lecture of CS25 this quarter. And so, today we're very lucky to have two speakers here with us. So Hazel, or Hejang Nam, is here in person today. She's a master's student at Brown University working on representation learning, causality, and self-supervised learning. And we also have Lucas Maes over Zoom who will be speaking afterwards, and he's a PhD student at MILA and the University of Montreal working on JEPA and planning. So I'm sure they're going to give us a very insightful talk today. And so without further ado, I'll hand it off to Hazel.

00:00:48Hazel

Yeah, thank you for the introduction. I'm Hazel, a first-year master's student at Brown University working with Professor Randall Balestriero. Our lab works on JEPA, self-supervised learning, and some theory as well. Today I'm a bit nervous, this is my first time giving a talk in English, but at the same time I'm really, really excited to talk about JEPA, world models, and my recent work, causal world models. I'm pretty sure Lucas is also very excited to share his very recent world model work.

00:01:19Hazel

Okay, so the first part of this talk will be about the concepts of JEPA and world models, and after a brief introduction of those two concepts, we will talk about the Causal JEPA paper first, which asks the question about how to make a model understand object interaction. And then I will hand over to Lucas and he will talk about his L-World Model, and his model is about the end-to-end JEPA training without collapse.

00:01:49Hazel

So first of all, I think most of you guys may have heard of world models, and some of you guys may be really familiar with what a world model is. To talk about the world model, we should have this first: so this is an autoregressive model, right? You get the previous state and you predict the next step. However, this is sometimes not enough to describe the world. Why? Because our world has uncertainty. At the time that you're observing, someone might do something at that time, right? Someone can throw something. There's an inherent uncertainty. To handle this, you need an action. And this now is a world model. So the world model basically is a function that gets the previous state and the action to predict the next state. And in this sense, I perceive the world model terminology as a simulator.
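To make that concrete (the notation here is mine, not from the slides): an autoregressive dynamics model learns a transition of the form $s_{t+1} = f_\theta(s_t)$, while a world model conditions the transition on an action, $s_{t+1} = f_\theta(s_t, a_t)$, so the same state can lead to different futures depending on what the agent (or someone else in the scene) does.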

00:02:42Hazel

So as an AI researcher, what would you do to make this world model better? In my opinion, I think there are three points of designing. The first one is having a good state representation. This is about—so you probably are not going to give the raw pixel images or raw pixel values to the predictor. You somehow have to have a good representation that reflects the world faithfully.

00:03:10Hazel

And the second component will be a good transition model. For example, there should be a grounding rule underlying the environment. For example, there's a gravity that exists. We have to understand these rules to predict the world and, you know, to simulate the world. And the last component is a good dynamics model. Now you're going to do some action, and your model should react to your action in the appropriate way. Keep these three things in mind because I'll revisit this at the end of the Causal JEPA part to discuss how I address these three questions.

00:03:50Hazel

Today we will not cover generative world models. There are so many industries now that are doing the world modeling stuff. For example, GAIA is an autonomous driving model that gets the action and the previous state to predict the next driving scenario. And Genie, they want to scale up the world model, so they learned something called latent action, which is a proxy of the action because usually online data doesn't have per-frame action annotation. And Sora is technically a video generation model, but the fidelity of their generated scenes is very good, so they got a lot of notice. And Marble from Professor Fei-Fei Li at Stanford, they are doing 3D interactive environments so that we can interact or explore.

00:04:42Hazel

But today we will talk about Joint Embedding Predictive Architecture (JEPA), which is somewhat different from the generative world model. So on the left side, the thing you see is a generative world model because, as you can see, $X$ is the current state and let's say $Y$ is the thing that we have to predict, which is like a future state. You put the encoded current state to the predictor and you directly compare it to $Y$, which means that your model should output in the same pixel space as your target is.

00:05:16Hazel

However, on the right side, that is a Joint Embedding Predictive Architecture. You see both $X$ and $Y$ are going to the encoder. So now we are comparing our prediction and the true target in the latent space. So why does it matter? We don't have a decoder in the JEPA architecture, but it is not merely about having no decoder. It is about uncertainty in the world.
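To make the structural difference concrete, here is a minimal sketch in PyTorch style; the module names are illustrative assumptions, not code from the talk.

```python
# A minimal sketch (illustrative names, not code from the talk) of the structural
# difference between a generative world model and a JEPA.
import torch.nn.functional as F

def generative_loss(x, y, encoder, decoder_predictor):
    # Generative: predict the future directly in pixel space and compare with the raw target.
    y_hat = decoder_predictor(encoder(x))        # same shape as y (pixels)
    return F.mse_loss(y_hat, y)

def jepa_loss(x, y, encoder, predictor):
    # JEPA: encode both context and target, predict and compare in latent space (no decoder).
    z_x, z_y = encoder(x), encoder(y)
    return F.mse_loss(predictor(z_x), z_y)
```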

00:05:47Hazel

So when a human is thinking about what is going to happen next, are you predicting something at a very pixel level? You know, there's some kind of uncertainty that you cannot predict and there's obviously your area of interest. So, JEPA tries to deal with only having predictive information in your latent space so that your prediction is getting more meaningful and human-like. And also, the framework can be interpreted in a different way. First of all, generative world modeling is about likelihood. They learn the normalized likelihood over the future frames. But JEPA can be understood as an energy-based model.

00:06:34Hazel

So in the very original paper that Yann LeCun suggested, he framed JEPA as an energy-based model. And an energy-based model learns the energy score function that gives a high value if two values $X$ and $Y$ are incompatible, and it gives a low score if those are compatible. And compatibility in the world model means that $Y$ is a plausible future of $X$.
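In symbols (my notation, not from the slides): the model learns an energy $E_\theta(x, y) \ge 0$ that should be small when $y$ is a plausible future of $x$ and large otherwise; in a JEPA, a natural choice is the latent prediction error, $E_\theta(x, y) = \lVert \mathrm{Pred}(\mathrm{Enc}(x)) - \mathrm{Enc}(y) \rVert^2$.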

00:07:05Hazel

So the energy-based model can collapse unfortunately, and there are two ways to prevent this collapse. The first way is contrastive learning, and the second way is regularization-based methods. I'm not going to talk about this too deeply, but the regularization-based methods are making a very rich and well-defined energy landscape. So that is what JEPA is doing.

00:07:32Hazel

For example, V-JEPA, which is Video JEPA, took as input consecutive frames with spatiotemporal masking, put them through this encoder, got the representation, and tried to predict what is happening in the unseen area. So the target has every piece of information, and here there are some regularization tools to prevent collapse. For example, they use an EMA encoder. EMA here means exponential moving average. This prevents really trivial collapses, and also you do a stop-gradient for the target encoder and you put the mask to give the model a more challenging task.
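As a rough illustration of those two anti-collapse ingredients, here is a minimal sketch assuming two identical PyTorch encoders; it is not Meta's V-JEPA code.

```python
# A minimal sketch (not Meta's V-JEPA code) of the EMA target encoder and the
# stop-gradient on the target branch.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target_encoder, online_encoder, momentum=0.998):
    # The target encoder is a slow moving average of the online encoder and is
    # never updated by backpropagation.
    for p_t, p_o in zip(target_encoder.parameters(), online_encoder.parameters()):
        p_t.mul_(momentum).add_(p_o, alpha=1.0 - momentum)

def vjepa_style_loss(masked_video, full_video, online_encoder, target_encoder, predictor):
    z_ctx = online_encoder(masked_video)        # sees only the unmasked patches
    with torch.no_grad():                       # stop-gradient: the target branch gives no gradient
        z_tgt = target_encoder(full_video)      # sees the full video
    z_hat = predictor(z_ctx)                    # predict the masked regions in latent space
    return F.mse_loss(z_hat, z_tgt)
```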

00:08:16Hazel

V-JEPA 2 is basically the same architecture as V-JEPA 1, but they wanted to scale up, and they also did some interesting post-training. One of them is action-conditioned control. So now this became more like a world model because now, as you can see in the predictor, it gets the robot action and poses. So this is one of the post-training methods in V-JEPA 2.

00:08:46Hazel

And this is actually called the Dino world model, and it has the very same architecture as the action-conditioned post-training that you just saw. So what the Dino world model is doing is they use a frozen Dino v2 encoder. What they claim is, "Oh, do we actually have to train the JEPA encoder to get a meaningful abstraction for planning?" They said, "No, a pre-trained Dino encoder can play that role as well." So they generate past representations, and with the auxiliary variables—for example, action and proprioceptive signals—they predict the future state representation and they compare.
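A minimal sketch of that recipe, with illustrative function names rather than the released DINO-WM code:

```python
# Frozen pre-trained encoder + trainable latent predictor conditioned on actions (illustrative).
import torch
import torch.nn.functional as F

def frozen_encoder_dynamics_loss(frames, actions, frozen_encoder, predictor):
    # frames: (B, T, C, H, W) video clip; actions: (B, T, action_dim)
    with torch.no_grad():                                   # the encoder stays frozen
        z = torch.stack([frozen_encoder(frames[:, t]) for t in range(frames.shape[1])], dim=1)
    z_past, z_next = z[:, :-1], z[:, 1:]
    z_pred = predictor(z_past, actions[:, :-1])             # predict each next-step representation
    return F.mse_loss(z_pred, z_next)
```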

00:09:32Hazel

And this predictor is just a simple causal transformer which predicts the future autoregressively. But here I want you to think about: is this patch representation really what humans use? Like, are we patchifying the image to predict the next step? Probably not.

00:09:57Hazel

So today we're going to talk about Causal JEPA: learning world models through object-centric latent intervention. Before starting this talk, I would like to thank my collaborators Quentin, Lucas, Yann LeCun, and Randall Balestriero. So the goal of Causal JEPA is understanding object interaction and object dynamics. These are three datasets that I picked to show today.

00:10:28Hazel

The first one is Push-T, a pretty famous example in control experiments. The goal is you do an action on the blue ball and you want to move the gray T-block to perfectly overlap with the green T. And the second dataset is CLEVRER. This is a video question-answering dataset. The question can be predictive—for example, "what would happen in the next frame?"—it can be counterfactual—for example, "what would happen if the blue cylinder didn't exist?"—and it can be explanatory. So it's basically a video QA benchmark. And the third one is PHYRE. As you can see, many objects are interacting with each other under governing rules; for example, there's gravity, mass matters here, and so on.

00:11:27Hazel

So to understand these object dynamics, what the current models are doing is like this: they patchify the image and they try to predict what will happen in each patch. But what you are going to do to understand this mechanism is, you want to understand things like this. You have each object and you want to know how one object influences each other.

00:11:59Hazel

So what if, in the Dino world model, we just change the representation to an object-centric representation? Then it's closer to human-like thinking, right? But then we have to learn the object-centric representation. I'm not going to go very deeply into object-centric learning, but let me go through it briefly. If you're doing computer vision, you might have heard about object-centric learning. There's a very foundational way called slot attention from Francesco Locatello. This is basically an encoder-decoder framework. And between this encoder and decoder, you might have the feature space, right? And you bring buckets—like a basket—to put each feature in, and this basket is for each object. So there's a mechanism called slot attention that puts each feature into each slot. So there's a binding problem of features to slots, and they decode with this basket so that the model can have well-aligned information with the slots.

00:13:14Hazel

You don't have to read all of this, but I brought this because this is a Transformers United class, and slot attention is built on attention. So what I want to show is how to bind the features to the slots. You do attention where the keys are the input features (for example, the Dino v2 patch embeddings), the queries are the slots, and the values are also the input features, so you're basically assigning each patch embedding to a slot. And you update the slots using a GRU. So iteratively you update the slots, and eventually each slot holds an object-aligned representation. So this is how models usually get object-centric representations. This is a very basic approach; there are many more advanced ones—for example, this is for images, and there are video slot attention models as well. So yeah, you can give it a try.
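For the curious, here is a minimal sketch of one Slot Attention iteration, simplified from Locatello et al. (layer norms and the MLP refinement are omitted); it is an illustration, not the exact implementation discussed here.

```python
# One simplified Slot Attention iteration: slots (queries) compete for input features (keys/values).
import torch

def slot_attention_step(slots, inputs, to_q, to_k, to_v, gru):
    # slots:  (B, S, D) current slot estimates (queries)
    # inputs: (B, N, D) encoder features, e.g. patch embeddings (keys and values)
    q, k, v = to_q(slots), to_k(inputs), to_v(inputs)
    attn = torch.einsum('bsd,bnd->bsn', q, k) * (q.shape[-1] ** -0.5)
    attn = attn.softmax(dim=1)                            # softmax over slots: the binding step
    attn = attn / attn.sum(dim=-1, keepdim=True)          # weighted mean over input features
    updates = torch.einsum('bsn,bnd->bsd', attn, v)
    B, S, D = slots.shape
    # Each slot is refined with a shared GRU cell (torch.nn.GRUCell).
    slots = gru(updates.reshape(B * S, D), slots.reshape(B * S, D)).reshape(B, S, D)
    return slots
```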

00:14:15Hazel

And how can we tell whether a model truly understands the dynamics, rather than just behaving as if it does? Let's say the model learned a monkey eating a banana. If the model truly understands this eating mechanism, it should be able to infer what is happening to the banana when we put an invisible cloak on the banana. So even though you're not seeing the banana, if you see a monkey moving its mouth constantly, you can imagine, "Oh, the banana might be getting shorter," you know? And vice versa: if you make the monkey invisible and the banana is disappearing, you can infer that a monkey is eating something, right? This is the very core motivation and the core explanation of Causal JEPA. This is really relevant to what I'm doing right now.

00:15:15Hazel

And if you think of the previous example, the predictor doesn't have to be a causal transformer. As long as we are seeing only history, we're okay to use multiple time-step histories. So we're using a bidirectional transformer here. I would like to mention that.

00:15:34Hazel

And let's see how the model actually implements this monkey-and-banana mechanism in the transformer. Let's say we have a nicely aligned representation for each object. And we encode these object states into the representation. For example, let's say our lookback history is four frames. So you see four frames, and next-frame prediction will happen. But because this is not a causal autoregressive transformer (this is a bidirectional ViT-style transformer), we need placeholders for these future tokens as well. So we use a mask token for that.

00:16:20Hazel

And now, our goal is to predict this mask token correctly. And for example, each row means the evolution of each object. Now, as I told you before, I want to mask something here. Blue dots are observable slots, and the yellow ones are masked slots. And let's imagine that we have this mask slot here. What should the model do? What is the easiest way for the model to have a reasonably low loss and just predict the mask token? Maybe just average the previous and the next token, just like doing interpolation. But that is not what we want. What we want is to learn the object interaction.

00:17:11Hazel

So like the previous example of the monkey, I just mask everything. This is a bit of an aggressive way of masking. But now the model doesn't have a shortcut. It needs to infer the other slots to correctly infer the current state or the masked state.

00:17:30Hazel

But what if we want to mask two objects at the same time? This can happen because in slot attention, you have to fix the maximum number of slots. The number of slots doesn't vary during training. So you kind of give a plenty amount of slots to give the model freedom: "Okay, what information should I define as a slot?" And sometimes in the scene, for example, we define eight slots, but we only have three objects in the scene. Then just masking only one object—only one slot—might not be enough. And then now we want to put this into the transformer and we have to flatten it. So we have to do positional encoding before then.

00:18:17Hazel

Okay, there's no problem with temporal positional encoding. But when we are trying to do positional encoding in the slot axis, there's a problem, because what object-centric models are doing is they do not define the order of the objects, but rather they are permutationally equivariant with respect to the object orders. So this is basically not a list, but a set of objects. And then, if you're masking multiple objects, the model might not know what to predict because these slots do not have object identities.

00:19:05Hazel

So the video slot attentions can keep temporal consistency inside the video itself. For example, Video A kind of has consistency with the object order, but we cannot guarantee that Video A and Video B have the same order of objects. So now what we do is we do not mask the very first time step—in this case, time step $t-3$—and we use this information as a slot identity. And what we are doing is we define each mask token with the identity token plus a learnable mask token with positional encoding. And this is less aggressive because you have an initial condition of each object. So it makes more sense when you predict the mask tokens.
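A minimal sketch of this identity-plus-mask-token construction, with assumed tensor shapes; this is my own illustration, not the Causal JEPA code:

```python
# Mask tokens that carry a slot identity taken from the first, never-masked time step.
import torch

def build_masked_tokens(slot_tokens, mask, learned_mask_token, pos_emb):
    # slot_tokens: (B, T, S, D) object slots over T time steps
    # mask:        (B, T, S) True where a slot is hidden from the predictor
    # learned_mask_token: (D,) shared learnable vector
    # pos_emb:     (T, S, D) temporal + slot positional encodings
    identity = slot_tokens[:, :1]                       # slots at t-3 act as object identities
    mask_tok = identity + learned_mask_token + pos_emb  # identity + [MASK] + position
    tokens = torch.where(mask.unsqueeze(-1), mask_tok, slot_tokens)
    return tokens.flatten(1, 2)                         # (B, T*S, D) for the bidirectional predictor
```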

00:20:02Hazel

And another part I want to talk about is action conditioning. The world model should condition on the action properly. And what the Dino world model does is concatenate the action embedding behind the patch embedding. For example, Dino v2 small has 384 dimensions for the features, and let's say we have a 10-dimensional action embedding. What they do is they duplicate it to match the number of patches, and they just append the action embedding to the patch representation. So now it's 394 dimensions per patch because we added this action embedding after the patch representation. But this is not a really optimal way, I think, because what we want to learn is something like this: why don't we consider the action as another node of the graph?

00:21:03Hazel

The Causal JEPA does not recover any true causal graph, but its motivation is grounded in the causal graph. So, for example, because we defined each object representation, those are kind of playing the role of the nodes, and we also try to use this action as one of the nodes. So in this current situation, the action is added as a separate node, not as a part of the feature representation.
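A minimal sketch contrasting the two conditioning options, with illustrative names and shapes rather than code from either paper:

```python
import torch

def concat_conditioning(patch_tokens, action_emb):
    # DINO-WM-style: replicate the action embedding and append it to every patch feature,
    # growing the per-token dimension (e.g. 384 + 10 = 394).
    B, N, D = patch_tokens.shape
    a = action_emb.unsqueeze(1).expand(B, N, -1)
    return torch.cat([patch_tokens, a], dim=-1)        # (B, N, D + action_dim)

def action_as_token(object_tokens, action_emb, action_proj):
    # Causal-JEPA-style: project the action to the token dimension and append it as
    # one extra "node" in the sequence, leaving the object tokens untouched.
    a = action_proj(action_emb).unsqueeze(1)           # (B, 1, D)
    return torch.cat([object_tokens, a], dim=1)        # (B, S + 1, D)
```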

00:21:39Hazel

So this is the architecture of Causal JEPA. To sum up, you have the history frames, and you put this into the object-centric encoder to get the object representation. And then you select some amount of objects to mask, and you mask it and you put this into the predictor—which is a bidirectional transformer—with the action, and then you're predicting every mask token.

00:22:11Hazel

And now we will see some results. I did three experiments based on our goal. Our goal is understanding the object dynamics, and the first thing is reasoning on counterfactual questions, the second thing is planning and control, and the third is physical plausibility.

00:22:32Hazel

So on CLEVRER, we ran the experiment with the other existing models as well, but what I want to highlight is the model without masking. What I want to say here is that the performance is not just because we use the object-centric representation; the core is masking—like in the banana example. So if you look at the C-JEPA result, not only is the average accuracy better, you can also see a clear gain on the counterfactual questions. These counterfactual questions resonate strongly with our original motivation, because a counterfactual asks something like, "Oh, what if this object didn't exist? What if this object existed?" So you have to understand how the objects interact with each other.

00:23:28Hazel

And the next example is Push-T planning. The agent tries to control the object to reach the goal state. And here I want to emphasize the efficiency. If you look at the Dino world model baseline, you have 196 patches. If you imagine you are having $224 \times 224$ images, and each patch should have a 384-dimensional feature. And if you're using object-centric representation, the number of tokens is significantly less than that. And because now we have clear semantics from each token, the feature doesn't have to be super large. If you think about it, there's nothing much to include in the object: for example, texture, color, shape, rotational state, and location. That's not that much. We don't need a really huge representation space.

00:24:30Hazel

So after only putting object-centric representation instead of the patch representation in the Dino world model, its performance actually drops a lot. And this can be true because the Dino world model uses the causal transformer, and we use a bidirectional transformer, right? And the object representation doesn't necessarily have to encode the velocity or acceleration kind of stuff because you cannot define those properties by only looking at a single static image. So this drops the performance a lot.

00:25:08Hazel

And after we change the action conditioning method, treating them as a separated node, and we change the transformer to the bidirectional transformer, the performance gain is significant. It gains 15% in absolute percentages. And after masking, compared to the object-centric Dino world model, you gain 28%, which is pretty large. And compared to the OC-JEPA, the only difference is object masking, and this object masking actually helps the model to understand this dynamics.

00:25:53Hazel

And this is the effect of the action conditioning. As I told you before, the latent concatenation in the red line denotes the action conditioning method based on the Dino model. But after treating this action as a separated node, it clearly shows the margin.

00:26:17Hazel

And there's an additional experiment on PHYRE. PHYRE has, among the three datasets we have, the most complicated dynamics. There are a lot of different configurations, and you need to learn precisely what is happening in the scenario. When I compare OC-JEPA and C-JEPA, OC-JEPA often generates physically implausible scenes. You see the bar is floating below the fixed bar, which doesn't make sense. And this can happen from just learning correlations: like when two bars are close, they just stay there. But that's not true, that's not how physics works. So through the training method of object masking, you keep asking the model the question: "What would happen if this didn't exist? What should you consider to predict the masked token?" It can learn the true dynamics.

00:27:20Hazel

And this is the attention probing for the previous example. You can see the failure is actually coming from attending to the wrong irrelevant object. So for example, OC-JEPA relies on the cup which contains the blue ball, and the C-JEPA conditions on the right bar to predict its future state.

00:27:52Hazel

And here, the terminology "causal" in Causal JEPA can stand for many things, but here we use causal as temporally directed predictive dependencies. This means that because we're predicting the future from the history, the edge is directed. And to predict the mask token, you need to attend to the relevant object. So we called it temporally directed predictive dependencies. This is not the very conventional and traditional way of defining causal, but recent modern causal machine learning uses this kind of definition as well.

00:28:36Hazel

And here, to go back to the role of object masking, we would say this predictor finds the influence neighborhood. The influence neighborhood, we just define it as its name, but it's basically a predictively sufficient set. So it's like a minimal set that it needs to predict the mask token correctly.

00:28:58Hazel

And this can be true, more formally, with four assumptions. The first thing is we do not assume there's an instantaneous relationship. And the second assumption is that every training instance should share the same mechanism. For example, gravity should not change along videos. There's a governing rule applied in the whole dataset. And the third assumption is object-aligned representation. This is the trickiest in the practical sense because this assumes that our object-centric representation is constant throughout the video. It should not get swapped, the object should not be split into different slots, and our representation should reflect the scenario faithfully. And the fourth is history sufficiency. This is because we use finite history. For example, we see four previous frames and predict the next frame, and these four history frames should be enough to predict the future.

00:30:08Hazel

And to make this causal machine learning practical, we do not assume a first-order Markov process. As I told you before, object-centric representations usually do not follow the first-order Markov process. And we allow confounders. The confounder makes things really tricky because we cannot recover the true causal graph usually because of the confounder, but in the object-centric representation in the real world, this is kind of inevitable.

00:30:43Hazel

So I would like to answer these questions really quickly. What happens if the object-centric representation is not faithful? A little bit is fine because, you know, masking object slots is still a useful inductive bias even if the masking is not perfect. So slightly imperfect object slots can be okay, but if they're really bad, it doesn't work.

00:31:20Hazel

Is there any way to recover the true causal graph? No. In our method, there are confounders, so we cannot recover the true causal graph. And also sometimes it is really hard to define what is the true causal graph in many scenarios.

00:31:34Hazel

How to select the number of objects to mask? This is a good question. Because we have a fixed number of slots and the number of actual objects is varying depending on which frame we are looking at. The ideal amount is just masking only one foreground object without the background slots. But we sometimes cannot control this really well. So I'll say decide the number to mask based on the data statistics. Just guess your perfect number of masks, and you should sweep a bit to find the perfect number.

00:32:18Hazel

And the largest limitation is coming from the object-centric encoder. The object-centric representation does not work really well in occlusion situations, and you know, in the middle of the video some objects can appear and disappear, but this kind of slot attention cannot handle this scenario really well. So this is some pain point of this model.

00:32:42Hazel

And finally, we got back to these three components of a world model again. In the beginning of the talk, I said there are three components that we should consider to make a good world model. For a good state representation, we use this object representation. And for a good transition model, we did object masking to let the model learn the predictive sufficiency. And about the dynamics model, we kind of tweaked the method to condition the action. So we treat these action variables as separated nodes.

00:33:24Hazel

I think this is all of the Causal JEPA part, and let me quickly put it to Lucas.

00:33:36Lucas Maes

So hi everyone, my name is Lucas. I'm a third-year PhD student at MILA, advised by Damien Scieur, who is a researcher at Samsung, but I also work closely with Randall. And the work I'm going to talk about today is work I've done in collaboration with Quentin at NYU, with Yann LeCun, Damien, and Randall.

00:34:07Lucas Maes

So today what I'm going to talk about briefly is how to make this whole world model stuff and JEPA stuff pretty simple to train. And I'm not going to tell you again what a world model is. I think Hazel did it pretty well. So I'm going to go directly to what we did. So just before that, I would like to talk about a big problem.

00:34:33Lucas Maes

As Hazel said, JEPA aims to learn representations in an abstract space, and so it's directly in opposition to generative modeling, where in generative modeling what you try to do is you try to model your input space. Basically, you try to learn a representation of your input space and try to do stuff in your input space. And so JEPA says that this is, for most of your tasks, not desirable. For instance, if you do a self-driving car application, you most likely don't care to model the movement of the leaves of the tree on the road. You don't care to model that. If you do generative modeling, by definition, you will have to model that because you want to model all the details of your input. So your loss is going to give signal for that. So what JEPA proposes is to first encode all your inputs into an abstract space, typically with an encoder neural net, and then try to model the dynamics of your space in that latent space.

00:35:38Lucas Maes

Okay. So it's pretty nice when you say it like that, but if you look on the right, the image I put, if you just do that, you suffer from what people call collapse. And so, what is collapse? It's the failure mode where you can see that if I do nothing, if I don't put constraints on the distribution of my embeddings, your model on the right can simply learn to disregard the input and just produce a constant vector. And so you can minimize the prediction loss in the latent space just by saying, "Okay, I am going to encode everything as a constant vector like zero," and then it's trivially easy to predict what is going to be the next state—it's going to just be zero again. And so the whole research on JEPA is a big part of the research on collapse.

00:36:33Lucas Maes

So as Hazel said before—I should add Causal JEPA to that previous slide as well—you have V-JEPA that tries to avoid that collapse with exponential moving averages. You have the Dino world model that uses a pre-trained encoder, basically, to avoid the collapse, because if you use a pre-trained encoder and you freeze it, you can produce non-trivial embeddings and learn the dynamics directly in the embedding space. So it's essentially supervised learning of the dynamics. And then you have PLLDM that trains everything end-to-end. So they train the encoder, the predictor, and everything end-to-end; it's an action-conditioned model as well. But they use something called VICReg.

00:37:18Lucas Maes

And what is VICReg? It's just an anti-collapse regularization term that tries to make the covariance matrix of your latent features the identity. So basically you want each dimension to be decorrelated from the others in your feature space, and the diagonal to be one, so each dimension has unit variance. The problem with PLLDM is, as you can see on the slide, you need to apply one term for the variance and one term for the covariance minimization. You need to do that spatially and temporally, so you have four terms. They add an additional inverse dynamics loss to try to gain more information about the action, so it's five. Plus you have the prediction of the future, the L2 minimization you see on the right: your predictor tries to predict the future in the latent space and you compare that with what your encoder says the future should be. Okay, and so it gives you six terms. So it's pretty difficult to tune because you have six hyperparameters to tune, and that's why we propose L-World Model.
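For reference, a minimal sketch of a VICReg-style variance/covariance regularizer; the coefficients and the spatial/temporal application described above are simplified away here:

```python
import torch

def var_cov_reg(z, eps=1e-4):
    # z: (B, D) batch of latent features
    z = z - z.mean(dim=0)
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = torch.relu(1.0 - std).mean()              # push each dimension's std toward 1
    cov = (z.T @ z) / (z.shape[0] - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / z.shape[1]        # decorrelate the dimensions
    return var_loss, cov_loss
```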

00:38:38Lucas Maes

So what is L-World Model? It's basically a simple JEPA that doesn't use any tricks. So there is no exponential moving average, no masking, no stop gradient, no pre-trained encoder, and also no unstable loss. Why? Because we have a single hyperparameter, and we make the model such that it's only 15 million parameters. So it means you can train that on a single GPU. It's fully end-to-end from raw pixels. So there is no auxiliary information such as proprioception information or whatever. And we observe that it's 50 times faster than the Dino world model for doing planning.

00:39:17Lucas Maes

So basically, what is L-World Model? It's JEPA in its pure essence. So what you do is you take an observation $O_t$ and an observation $O_{t+1}$, so the next observation. You process both of them through a shared encoder and you get representations $z_t$ and $z_{t+1}$. The task is you will use $z_t$ and the action at step $t$ and try to learn a predictor to model the dynamics. So you're going to try to predict the future. So you have an estimation of the future called $\hat{z}_{t+1}$ and you compare that to what your encoder predicted for the next state. You use mean squared error, and you use what we call SIG to avoid collapse.

00:40:03Lucas Maes

What is nice is that basically what I told you about before with JEPA is literally what you see, and the code is exactly what I just described. So if you look at the pseudo-code on the right, it's actually not that much code. It's literally the true code. You encode, you predict, you compute your prediction error for the future, and you use SIG to avoid collapse. Okay.
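Here is a minimal reconstruction of that training step as described in the talk; the function signatures are assumptions, not the released L-World Model code:

```python
import torch
import torch.nn.functional as F

def lwm_step(o_t, o_tp1, a_t, encoder, predictor, sig_reg, lam=0.5):
    z_t = encoder(o_t)                      # latent state at time t
    z_tp1 = encoder(o_tp1)                  # latent state at t+1 (same encoder, no EMA, no stop-grad)
    z_hat = predictor(z_t, a_t)             # action-conditioned prediction of the next latent
    pred_loss = F.mse_loss(z_hat, z_tp1)    # prediction error in latent space
    reg_loss = sig_reg(torch.cat([z_t, z_tp1], dim=0))   # push embeddings toward an isotropic Gaussian
    return pred_loss + lam * reg_loss       # lam is the single hyperparameter
```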

00:40:26Lucas Maes

So you can see that at the bottom line, at the return, I have a single hyperparameter lambda. So this is the only stuff you need to tune, and because you have a single hyperparameter, you can do a bisection. So it's $\log(n)$ to find the optimal lambda. So it's pretty nice because you can really easily tune the model.

00:40:46Lucas Maes

And so what is SIG now? So this is the stuff that avoids your representation collapse and makes sure that your $z_t$ are informative about what is inside your observation. So basically you can see $z_t$ as a learned abstraction of the world, or a learned state. You have an observation of the world which is the image $O_t$, and you try to estimate the true state. Okay, and so how do you prevent collapse? Basically we use a simple regularization called SIG for Sketched Isotropic Gaussian regularizer that Randall and Yann LeCun introduced last November, I think, if I'm correct.

00:41:30Lucas Maes

The idea is very simple. The math is a bit tricky, but I invite you to read the paper. It's a very nice paper. But the idea is very simple. You take the distribution of your embeddings. Okay? So you take your batch, you look at how your embeddings $z_t$ are distributed, and you try to optimize that distribution to be isotropic Gaussian. So how do you do that? You could use generative modeling to do that with the KL divergence and everything, such as in variational autoencoders. We don't want that here. We don't want to use generative models. So what you do is you're going to try to use a statistical test. Okay. So in statistics, you have tests that tell you, given an empirical distribution, how close it is to a Gaussian distribution. The problem is that our embedding space is very high dimensional, and so the statistical tests suffer from the curse of dimensionality, so it's pretty difficult to directly use a test to optimize a high-dimensional distribution.

00:42:42Lucas Maes

And so the idea Randall had is to sample a lot of random directions in your latent space and project all your embeddings onto each direction. So for instance, if you look at the red arrow in this image and you look on the right, you can see that once you project your embeddings onto that direction, you get a univariate empirical distribution, and now you can optimize it to be Gaussian because it's just a 1D distribution. And if you do that in a lot of directions, there's a theorem called the Cramér-Wold theorem which says that if all the one-dimensional projections are Gaussian, then the joint distribution is Gaussian. So basically, if you do that for a lot of random directions, you can argue that your latent embedding distribution is going to be Gaussian as well. So it makes sure we end up with informative embeddings. So that's pretty nice.
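A minimal sketch of the random-projection idea; the real SIG regularizer uses a proper statistical test on each projection, so treat this only as an illustration:

```python
import torch

def sketched_gaussian_penalty(z, num_directions=64):
    # z: (B, D) batch of embeddings; push every 1D projection toward N(0, 1)
    B, D = z.shape
    dirs = torch.randn(D, num_directions, device=z.device)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)             # random unit directions
    proj = (z @ dirs).sort(dim=0).values                     # sorted 1D marginals, (B, num_directions)
    p = (torch.arange(1, B + 1, device=z.device) - 0.5) / B
    normal_quantiles = (2.0 ** 0.5) * torch.erfinv(2 * p - 1)  # standard-Gaussian targets
    # Quadratic mismatch between empirical and Gaussian quantiles (a 1D Wasserstein-like penalty);
    # a collapsed (constant) embedding cannot make this small, so it also prevents collapse.
    return ((proj - normal_quantiles.unsqueeze(1)) ** 2).mean()
```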

00:43:53Lucas Maes

Okay. So now, let's imagine you train your world model. How can you make sure that you learned a good world model? So there are two ways. The first way is online control. So can you use your learned world model that is action-conditioned, and can you in an online way optimize this action to perform control? So this is the first thing. And the second thing we are going to talk about after is trying to assess if your world model understands intuitive physics.

00:44:28Lucas Maes

But let's focus on control for now. So how we perform control is basically: you take your learned world model, you sample a random trajectory of actions, and you optimize that sequence of actions, by rolling out the future, to match a target goal. So for instance, let's say you have a goal frame $O_g$. You start from the current frame $O_1$. You process both of them through your encoder. You get the initial latent state $z_1$ and your target state $z_g$. Okay. And then you sample an initial sequence of actions, and you use the first one with your first state inside the predictor. It gives you $z_2$. Then you use the second one, etc. And after each step, you compare how far you are in your latent space from the representation of the goal. It can be MSE, for instance. You use the MSE loss to estimate how far you are from your goal. And because your predictor is differentiable, you can backpropagate down to the sequence of actions to minimize the distance to your goal. Okay.
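A minimal sketch of that gradient-based planner, with illustrative signatures and hyperparameters rather than the authors' implementation:

```python
import torch
import torch.nn.functional as F

def plan_actions(o_1, o_goal, encoder, predictor, horizon=10, iters=100, lr=0.1, action_dim=2):
    with torch.no_grad():
        z_1, z_g = encoder(o_1), encoder(o_goal)       # initial and goal latent states
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        z_t, loss = z_1, 0.0
        for t in range(horizon):                       # roll out the learned dynamics
            z_t = predictor(z_t, actions[t].unsqueeze(0))
            loss = loss + F.mse_loss(z_t, z_g)         # distance to the goal in latent space
        loss.backward()                                # gradients flow through the differentiable predictor
        opt.step()
    return actions.detach()
```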

00:45:51Lucas Maes

So I can show you some results. You can see that we consider four tasks. So the first one on the left is called Two Room. It's a simple 2D navigation task where you need to move from one side of a room to the other by passing through a door. Then you have Reacher. It comes from the DeepMind control suite, where you have a two-joint arm and you need to reach a target location for a given goal. Then you have Push-T. So Push-T, Hazel already presented it, but the goal of Push-T is to push the T to the green area. That's as simple as that, and you can only push, you cannot pull. And OG-Bench Cube. It's a 3D environment where the objective is to use a 3D manipulation robotic arm to make the cube match a target location as well.

00:46:56Lucas Maes

And so I think the most interesting result is the Push-T one, because you can see that the Dino model has been trained with proprioception, and you can see that L-World Model without proprioception beats the Dino model with proprioception, with half the parameters. And what is most interesting is that if you remove the proprioception from the Dino model, you see that the performance drops, and L-World Model, even though it doesn't have proprioception, beats the model with proprioception by a lot. And PLLDM, which is the only other one that doesn't use proprioception and that is fully trained end-to-end, also has very bad performance on Push-T. So it was very nice to see this result. For Reacher, we beat PLLDM and Dino. For OG-Bench Cube, for instance, we beat PLLDM but we don't beat the Dino world model. And the most likely explanation for that is that we only train on trajectories coming from the dataset. The Dino model has a pre-trained encoder; it has been pre-trained on 142 million natural images, and consequently, it has a better understanding of objects and 3D than we have, because it has been trained on a lot more data.

00:48:19Lucas Maes

And also, funnily, you can see that for Two Room, even though it's the simplest task, almost all the baselines ace the task, but we don't. And this is actually a limitation of SIG that exists for now. By intrinsic dimensionality I mean the true dimensionality you need to solve the problem; for instance, for Two Room it's just two, because you just need to know the $X$ and $Y$ of the agent. As soon as you know that and I give you a target location, you can solve the environment. If that intrinsic dimensionality is very much smaller than the embedding dimension you use, there is no way you can produce genuinely Gaussian embeddings. At least, when you optimize SIG, it is going to need to create fake information to make your latent space Gaussian. So it doesn't help you, basically. There is research going on to try to fix that, and actually we know that if you carefully tune the hyperparameters you can somewhat overcome this issue, but for the sake of fairness, we didn't tune per environment and kept the same hyperparameters across all the environments.

00:49:34Lucas Maes

But the most interesting part, why it's very nice, is when you look at the planning time. So for Dino world model, and we heavily optimized Dino world model to try to be as fast as possible, the fastest we could go for doing the full planning is 47 seconds. And the reason for that is because your predictor takes all the patches and you need to predict all the patches, so it's quite slow because of the quadratic cost of the attention to predict the future, right? As for us with L-World Model, we have a single embedding for representing the latent state because we just use the CLS token of the encoder, and it allows us to use some other tricks. We didn't optimize a lot for that, we could go a lot less than this, but we can go to a full planning time under a second, which is very nice. It's like almost 50 times faster.

00:50:34Lucas Maes

Another interesting thing is that if you fix the FLOPs for planning at those of L-World Model—so if you reduce the planning time by tweaking the hyperparameters for the planning until the Dino model plans under a second—you can see that the success rate drops by a lot for Push-T and for OG-Bench Cube, which is expected. But it shows that at a similar budget, we outperform even the Dino model on OG-Bench Cube.

00:51:12Lucas Maes

So that's for the control. For intuitive physics understanding, we did something pretty simple. The first thing is we probed the latent space. So we just took the encoder from the L-World Model, froze it, and trained a linear or non-linear MLP probe to try to predict the coefficients of the simulation for that state. Okay, so I will not go too much into detail, you can have a look at the paper for that. But what is pretty interesting is that, for instance, for OG-Bench Cube, you can see that the linear probe on L-World Model features almost always has a lower mean squared error, whereas if you look at the MLP probe on the right, it's almost always the Dino model that has the lowest mean squared error. And so what it suggests is that our latent space is less entangled than the one of the Dino model. So it's easier to recover the coefficients directly from the latent space than with the Dino model. And it does make sense, I think, because as we use SIG, it pushes the latent space in such a way that each dimension is somewhat meaningful. So it's not fully disentangled, but it's somewhat more disentangled than the Dino model, which is pretty nice.

00:52:35Lucas Maes

Another thing that you could do, which is pretty cool, is to try to see what happens when you violate the world model. So if a sudden change in the dynamics happens, does your world model predict that this is a violation? And so what we did for that is, as you can see on the left—sorry, the cube color and cube teleportation are inverted—but if you look on the left, you can see that we have a normal trajectory where your robotic arm picks up the cube and moves it to the right, for instance. Nothing happened. And then we considered two perturbations: for instance, we changed the color of the cube. So this is the one on the right, not in the middle. Suddenly, at a given frame. And another transformation we did is randomly, suddenly—this is the one in the middle—the cube teleports.

00:53:27Lucas Maes

Okay. And so if you look on the right, you can see that the X-axis is the time steps, and the Y-axis is the prediction error. So this is the difference between what your predictor predicts is going to happen and what the actual embedding of the next state looks like. And so what we can see is that if you don't have any perturbation, so this is the gray line, it's fine. If you change the cube color, it's a bit higher, but it's very negligible. So it means that basically your world model doesn't care much about the color of the cube, which is pretty cool because you don't need that for the dynamics, right? You don't care about the color of the cube. But if suddenly the cube teleports, then the prediction error shoots up a lot, meaning that your world model didn't predict that. Some people say to me often that, "Yeah, but it's just out of distribution." And I would say it's true, but I think it's not very meaningful to say that because, as humans, when you violate your model, it's also very out of distribution. That's why, for instance, when someone does a magic trick where they make a coin disappear in front of you, you're suddenly very surprised and it frustrates you a bit, you know? It's the same here; it's because the prediction error is very high.

00:54:46Lucas Maes

So we did two other small experiments as well. The first one is: we vary the location of the agent and of the T, and we embed each of these configurations. So the middle plot shows you the different locations, and we project the embedding space with a t-SNE plot to see whether we can recover the exact relative distances between all the different locations of the original space. And what you can see is that, up to permutation of the axes and rotation/reflection, this is exactly what happens. So we basically recover the relative distances of the original space, which is pretty nice.

00:55:34Lucas Maes

The last thing we did is we took the trained world model and trained a decoder on top to try to interpret what is happening when you make future predictions. And for that, you can see that we give a context to our model: the first three frames, shared by the top row and the second row. The first row, on the right where the open-loop prediction happens, is what actually happened in the trajectory. And then on the second row, you have what your world model imagines when you give it the same sequence of actions that was taken for the original sequence. And you can see that the world model predicts reality reasonably well when you give it the sequence of actions. But what is very interesting is that if you look at the second rollout, with the cube, you can see that we predict very well what is going to happen with the cube. But if you look very carefully, you can see that at frames 15 and 20, the angle of the gripper is opposite. And so basically, you can see that the world model didn't learn the rotation of the gripper, which was pretty interesting, because it was still able to more or less solve the environment.

00:56:55Lucas Maes

So there are many limitations for our model—or as I like to call them, research opportunities. So for now, you are doomed to short-term planning horizons. If you can unlock that, that would be very nice. Another problem is that you reason at a single temporal level. So we need hierarchies. Basically, for instance, when you think, "Oh, I need to go to the airport," you think at a different hierarchy, right? The airport is your goal, and you think, "Okay, I need to go to my car, then I need to go to the airport," and so on. And only at the end does it translate into muscle movements, right? You don't think in terms of muscle movements all the time. So we need that as well to be able to predict further into the future. A very important thing as well is to move beyond these toy environments. I largely agree with the criticism that these are very toy-ish experiments. So can we move to real-world robotics or very stochastic and partially observable environments like Minecraft? That would be very nice.

00:57:58Lucas Maes

And also a big problem, I think, is how do you specify your goal? So for now, you can see as I explained before, you need to provide a visual goal, but you don't all the time have access to that, and it doesn't tell you anything about how you should solve the task. So for instance, if you have a plane that needs to land, you don't want to just show a picture of the plane landed to do the planning. You want to specify how smooth the landing should be and everything. So we don't know how to do that for now with this kind of approach.

00:58:28Lucas Maes

I would like also to take just two minutes to do a bit of advertisement for something we have been pushing with a lot of Randall's students and a lot of people for the past few months, which is called Stable World Model. And if you're interested about world model research and this kind of stuff, you should definitely have a look for that. So it's a GitHub library fully open source that allows you to train world models very easily. So you have all the baselines I discussed and Hazel discussed that are implemented there and heavily tested. You have all solvers to do planning. You have many environments. We recently added the DeepMind control and Minecraft, and we are in discussion to add support for real robot data also very soon. And everything is very heavily tested and there is documentation. So yeah, feel free to give it a try and give feedback or contribute to the library, that would be very cool. Thank you.

00:59:27Host

Okay, great. Thank you so much. Let's give a round of applause to our speakers today.

00:59:37Host

So now we'll be soliciting some questions. We have some online on the Slido, but I'll also be looking for in-person questions in case anybody here wants to ask anything as well. So we'll kind of balance between both and I'll let you guys figure out who should answer each question, or if you both have insights for anything, that's also great. Does anybody here have any questions?

01:00:05Audience Question

Background? I'm just curious, how do world models transfer to physical AI space? Like, how can we see them moving forward in the robotics space?

01:00:15Hazel

I would give it to Lucas probably. Lucas, can you hear us?

01:00:20Lucas Maes

Yes. Yes. Can you repeat the question, please?

01:00:22Audience Question

Oh. Um, so how can we expect the world models to be in the physical AI space? Because we already have diffusion-based foundation models like GR or maybe Pi, and also like $\pi_0$ using world models. Can we expect some more applications in the robotics space?

01:00:41Lucas Maes

Okay. Okay. Okay. So first of all, I would be very skeptical that the current VLA models have a good understanding of the world. Like, there is no reason for that, okay? They are not trained to predict the consequence of their action, okay? And I would be very skeptical to see something that doesn't know what is the outcome of its action be very reliable to do physical AI. So as humans, why you're very good at what you do is because you can predict what is the consequence of your action in the real world, and that's what world models try to do. The VLAs, they don't do that. So if you want to have physical AI basically, you need world models; you cannot bypass that, I think. As Yann says often, like VLA and everything, they can be very helpful for simple stuff where you don't need to predict what is going to be the outcome of your action in the real world. For instance, if you want to make robots that do some dance or whatever. They only need to know their internal dynamics. Okay? They don't need to know how to interact with the real world. To interact with the real world, you need to predict the outcome of your action. So, you need world models. I hope that answers your question. If not, happy to clarify.

01:01:56Audience Question

Sure. Sure. Thank you.

01:01:58Hazel

I would like to add one more sentence. I think Professor Shuran Song at Stanford has the opinion that world models can be an evaluator for policies. So I think that can also be one use case.

01:02:15Host

We have someone on Slido. Let me see. Someone's asking: "Masking is shown to improve predictions, but is it really necessary to learn the world model? Does it mean that prediction loss alone is insufficient?"

01:02:29Hazel

Uh, the prediction loss. Okay, let me assume that you're still talking about the object-centric representation. If you only use the prediction loss with the object-centric representation, the shortcut the model learns can be just the self-dynamics of each object. We cannot guarantee that the model learns interaction-based dynamics. So I won't say the prediction loss is insufficient, but with the masking, I think you can push the model to learn the object interactions.

01:03:04Host

Yeah, that makes sense. Um, someone's asking, "Do you have any insights on how C-JEPA learns to plan ahead beyond just the next frame?"

01:03:19Hazel

So for the—uh, is that all the question?

01:03:23Host

Yeah.

01:03:24Hazel

Okay. Um, we strictly follow the evaluation method—I mean, the planning of the Dino model. So with the predicted future frame, we do it autoregressively to reach the long-horizon planning, and we use the same parameters for planning as well.

01:03:45Host

Any in-person questions? Okay, I'll continue with the online ones because we have quite a lot. Let me see. Someone's asking, "What would a JEPA-native agent architecture look like compared to current LLM-based agents?"

01:04:05Hazel

Lucas, do you want to answer?

01:04:09Lucas Maes

Of course. Can you just repeat the question?

01:04:12Host

Someone asked, "What would a JEPA-native agent architecture look like compared to current LLM-based agents?"

01:04:22Lucas Maes

Okay. Basically, the way it differs is that you learn action-conditioned models, right? So LLM agents, you don't learn action-conditioned models. As far as I know, I'm not an expert in LLMs either, but you learn to use tooling and to make tool calls. This is not what you do when you do world models. What you do is you learn: "Okay, I have an action. I am at a current state. What is going to happen in the future if I take that action?" Okay? This is not how you train LLMs, so this is one step that differs. And the second thing is that by construction, okay, you can just learn to model the future with world models and then use planning by using old concepts of control theory, such as model predictive control or whatever, to directly convert that into an agent. You don't need post-training or whatever; you can do that zero-shot. So that's pretty neat. For LLMs, I'm not sure, so I cannot reply for that.

01:05:19Host

I think that makes sense, yeah. I guess this is kind of—let me see. Yeah. "Do you think JEPA models are less prone to hallucination than transformer-based models since they hopefully learn a more accurate representation of the world?" Not sure if you guys might have any insights.

01:05:44Hazel

Uh, I can answer first and Lucas, you can just add your opinion on it. I'm not sure what kind of hallucination the person asked, but in the predictive sense, because JEPA models are—as I mentioned in the earlier part of the talk—JEPA is an energy-based model. So rather than just predicting the full pixel space, what it evaluates is, "Oh, is our predicted future making sense? Like, is it a possible situation or is it impossible?" So I think in that sense, in the sense of the representation, the representation will be more suitable for world models, I think. Yeah.

01:06:36Lucas Maes

Yes, so I can add to that. I think the original question was to compare JEPA and transformer-based models. So I would just like to emphasize that JEPA is not an architecture, not a new architecture. For instance, for L-World Model, our encoder and predictor are two transformers. So JEPA is really a framework to try to learn world models; it's not a new architecture. Okay? And second, I think you can definitely have hallucination if you don't learn a good model. Okay? But this is not at all the same hallucination as you have with LLMs, because LLMs have multiple sources of hallucination. One of them is that they are not grounded in the real world. Okay? With world models, you can easily fix that. Additionally, with world models you can have another source of hallucination, which is that when you optimize the action sequence for planning, you can end up with actions that are not meaningful in the real world. For instance, let's say that your action is between -1 and 1, but your optimizer says, "Okay, you should use -2 as an action." So you can have this kind of hallucination. Otherwise, it's just about the quality of your learned model, and that is just a matter of data and capacity. Hopefully that makes sense.

01:07:58Host

Yeah. No, great, I think that makes a lot of sense. Thanks. We have someone asking: "Are world models better than diffusion-based models for robot control?"

01:08:12Hazel

I think we had a similar answer as the previous one. The world model can still use the diffusion model.

01:08:21Lucas Maes

Yeah, of course. Of course, you could use diffusion models to train JEPA. Actually, some people try to do that. Diffusion models have some advantages and some limitations as well. So yeah.

01:08:35Host

Um, any in-person questions?

01:08:39Audience Question

All right, we got one at the back. Kind of a longer horizon question, but you mentioned that you might consider planning through this whole family of different algorithms developed for control like model predictive control, etc. And you mentioned that it might be possible to use these things as kind of a grounding signal to train policies. Do you think that kind of combining those approaches—maybe that's what you already meant by training policies, though—of running a lot of compute on something like an autoregressive controller for a relatively long horizon and then distilling those back down into policies might be an efficient way at getting some longer horizon behavior for one of those problems and/or research directions you mentioned?

01:09:19Lucas Maes

You want to go for it Hazel, or?

01:09:22Hazel

I can go—

01:09:24Lucas Maes

Okay, yeah, okay. Um, yeah, you can definitely train policies on that, because as soon as you have a world model, you can do one of two things. You can use it directly and optimize your action sequence to have a zero-shot policy. Or you can do the same as Dreamer, for instance, and use your pre-trained world model to train reinforcement learning policies on it, and periodically go into your environment, collect better data, fine-tune your model, improve your policy, and so on. You can do that as well. So in the future, I think you will have something similar to System 1 and System 2, the same as humans do, where you have a condensed, basically action-reaction policy, where you learn to directly predict what action you should take to achieve a given goal from a given state. Okay? And then for difficult tasks that you are not very confident about, you can use planning and model predictive control to plan more carefully in that situation. For instance, when you want to learn to drive a car, maybe for the first 20 hours you will use model predictive control to make sure you don't kill someone. And then after 20 hours, you think you're an expert, so you can basically distill your model-predictive-control planning into a direct policy that outputs the action directly.

01:10:45Host

Right. Great. Unfortunately, that's all the time we have. Um, Hazel will be staying after in case anybody has any questions or wants to talk to her. So thanks again to our speakers for the very insightful talk. Thank you.
