Deep Dive into Long Context

Google
Explore the synergy between long context models and Retrieval Augmented Generation (RAG) in this episode of the Release Notes podcast. Join Google DeepMind's Nikolay Savinov with host Logan Kilpatrick as they discuss scaling context windows into the millions, recent quality improvements, RAG versus long context, and what's next in the field.
Hosts: Nikolay Savinov, Logan Kilpatrick
📅 May 2, 2025
⏱️ 59:32
🌐 English

Disclaimer: The transcript on this page is for the YouTube video titled "Deep Dive into Long Context" from "Google". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.

Watch the original video here: https://www.youtube.com/watch?v=NHMJ9mqKeMQ

00:00:00 Nikolay Savinov

I'm really impressed by the work of our inference team.

00:00:03 Logan Kilpatrick

I've got a bunch of spicy RAG versus long context questions for you.

00:00:07 Nikolay Savinov

You can rely on context caching to make it both cheaper and faster to answer.

00:00:12 Logan Kilpatrick

What's the limitation of continuing to scale up beyond 1 to 2 million?

00:00:18 Nikolay Savinov

This thing is going to be incredible for coding applications.

00:00:21 Logan Kilpatrick

We will have lots more exciting long context stuff to share with folks.

00:00:51 Logan Kilpatrick

Welcome back to Release Notes, everyone. How's it going? Today we're joined by Nikolay Savinov, who's a staff research scientist at Google DeepMind and one of the co-leads for long context pre-training. Nikolay, how are you?

00:01:03 Nikolay Savinov

Yeah, hi. Thanks for inviting me.

00:01:05 Logan Kilpatrick

Let's start off at the most foundational level, and we'll build up from there. What is a token, and how should folks think about that?

00:01:17 Nikolay Savinov

So the way you should think about a token: it's basically slightly less than one word, in the case of text. A token could be a word, part of a word, or punctuation like commas, full stops, etc. For images and audio it's slightly different, but for text, just think of it as slightly less than one word.

00:01:49 Logan Kilpatrick

Yeah, and why do we need tokens? Humans are generally familiar with characters, so why do AI and LLMs have this special concept of a token? What does it actually enable?

00:02:01 Nikolay Savinov

Oh, this is a great question, and actually many researchers ask this question themselves. There have been quite a few papers trying to get rid of tokens and just rely on character-level generation. But while there are some benefits to doing that, there are also drawbacks, and the most important drawback is that generation is going to be slower, because you generate roughly one token at a time, and if you generate a word in one go, it's much faster than generating every character separately. So those efforts, I would say, didn't really succeed, and we are still using tokens.

00:02:48 Logan Kilpatrick

Yeah, for folks who haven't spent a bunch of time thinking about tokens, there are a bunch of good Andrej Karpathy videos and tweets about how tokenizers are the root of all the weirdness and complexity in LLMs, all these weird edge cases you run into. Most of them are rooted in the fact that the model is not looking at things at a character level; it's looking at them at a token level. The pertinent example folks love to go to these days is counting the characters in a single word. "How many Rs are there in strawberry" is a weird problem to solve, my understanding is, because tokenizers break the word into different parts; the model is not actually looking at the word at the individual character level. Is that an apt description?

00:03:33 Nikolay Savinov

Yeah, I think that's a pretty good description of the problem. One thing you should realize is that, due to tokenization, those models view the world very differently from how humans view it. When you see "strawberry", you see a sequence of letters, but what the model sees could even be one token. And then you ask it: hey, count the number of R letters in this token. That's pretty hard knowledge to get from pre-training, because you would need to associate the R-letter token you encountered somewhere on the web with the word "strawberry", which is also one token. If you think about the mental load of doing that, it's not such a trivial task, I would say. Although obviously, when the model can't do it, we start complaining: hey, if it's AGI, how come it can't count the number of R letters in strawberry? A child could do that.
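Savinov's point about token-level blindness can be sketched in a few lines of Python. This is a toy, made-up vocabulary (real tokenizers are learned from data, e.g. with BPE), but it shows how a whole word can collapse into a single token ID while the character-level answer stays trivial in ordinary code:

```python
# Toy illustration (hypothetical vocabulary): the model sees token IDs,
# not characters, so counting letters inside a token is not a direct lookup.
toy_vocab = {"straw": 0, "berry": 1, "strawberry": 2, " ": 3, "r": 4}

def toy_tokenize(text, vocab):
    """Greedy longest-match tokenizer over a tiny, made-up vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Pick the longest vocab entry that matches at position i.
        match = max(
            (w for w in vocab if text.startswith(w, i)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token covers {text[i]!r}")
        tokens.append(vocab[match])
        i += len(match)
    return tokens

# The whole word collapses into a single token ID ...
print(toy_tokenize("strawberry", toy_vocab))  # [2]
# ... while the character-level answer the user wants is trivial in code:
print("strawberry".count("r"))  # 3
```

The model only ever sees `[2]`; the three Rs are invisible at that level, which is the mismatch being described.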

00:04:44 Logan Kilpatrick

Yeah, it is super weird.

00:04:46 Nikolay Savinov

And actually, another interesting thing, if you watch some of the Karpathy videos, is that there are a lot of problems with whitespace. This is an interesting point, because normally most tokens are prefixed with whitespace. So some really weird effects might happen at the boundaries: you think you are concatenating something, but this concatenation is very unusual for the model to see.

00:05:26 Logan Kilpatrick

Hm, interesting. That is super interesting. I think this takes me to just generally talking about context windows. Obviously we're talking about long context, which sort of assumes you know what a context window is, but can you give the lay of the land of how folks should be thinking about what a context window actually is? Why do I, as a user of LLMs or somebody who's building with AI models, need to care about the context window?

00:05:54 Nikolay Savinov

So the context window is basically exactly those context tokens that we are feeding into the LLM. It could be the current prompt or the previous interactions with the user; it could be the files that the user uploaded, like videos or PDFs. When you supply context to the model, the model actually has knowledge from two sources. One source is what I would call in-weight, or pre-training, memory. This is knowledge the LLM picked up while it was trained on a slice of the internet. It learned something from there, and it doesn't need additional knowledge supplied in context to remember some of those facts. So even without context, there is some kind of memory present in the model. The other kind of memory is the explicit in-context memory that you supply to the model. It's pretty important to understand the distinction between the two, because in-context memory is much, much easier to modify and update than in-weight memory. For some kinds of knowledge, in-weight memory might be just fine. If you need to memorize simple facts, like that objects fall down and not up, these are very basic, common facts, and it's fine if this knowledge comes from pre-training. But some facts are true at the time of pre-training and then become obsolete at the time of inference, and you need to update those facts somehow. The context provides you a mechanism to do this update.
And it's not only about up-to-date knowledge; there are also other kinds of knowledge, like private information. The network doesn't know anything about you personally, and it can't read your mind. So if you want it to be really helpful for you, you should be able to supply your private information into context, and then it will be able to personalize. Without this personalization, it's going to give you the generic answers it would give to any human, instead of answers tailored to you. The final category of knowledge that needs to be inserted in context is rare facts: knowledge that was encountered very sparingly on the internet. I must say, I suspect this category might go extinct with time. Maybe future models will just learn the whole slice of the internet by heart, and we will not need to worry about those. But the reality at this point is that if something is mentioned once or twice on the whole internet, the models are actually unlikely to remember it, and they are going to hallucinate the answers. So you might want to insert those facts explicitly into context. The tradeoff we are dealing with is that for short context models, you have limited ability to provide additional context; basically, you have a competition between knowledge sources. If the context is really large, then you can be less picky about what you insert, and you can have higher recall and coverage of relevant knowledge. And if you have higher coverage in context, that means you're going to alleviate all those problems with in-weight memory.

00:09:54 Logan Kilpatrick

Yeah, there are so many angles to push on; that was a great description. One of the follow-ups from this: we talked about in-weight memory, and we talked about in-context memory. The third class is how to bring context in through RAG systems, retrieval augmented generation. Can you give a high-level description of RAG? And then I've got a bunch of spicy RAG versus long context questions for you.

00:10:26 Nikolay Savinov

Yeah, sure. What RAG does is, well, it's a simple engineering technique: an additional step before you pack the information into the LLM context. Imagine you have a knowledge corpus, and you chunk this corpus into small textual chunks. Then you use a special embedding model to turn every chunk into a real-valued vector. Based on those real-valued vectors, when you get a query at test time, you embed the query as well, and then you compare the query's real-valued vector to those of the chunks from the corpus. For the chunks which are close to the query, you say: hey, I found something relevant, so I'm going to pack those chunks into context, and now I'm running the LLM on this. That's how RAG works.
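The retrieval step described here can be sketched in plain Python. The bag-of-words "embedding" below is only a stand-in for a real learned embedding model, and the corpus is invented for illustration:

```python
import math
from collections import Counter

# Minimal RAG retrieval sketch: chunk a corpus, embed each chunk,
# embed the query, and keep the chunks closest to the query.

def embed(text):
    """Stand-in for a learned embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(corpus_chunks, query, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    scored = sorted(corpus_chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return scored[:top_k]

chunks = [
    "Context caching makes repeated questions cheaper.",
    "Tokens are slightly less than one word for text.",
    "RAG retrieves relevant chunks before calling the LLM.",
]
print(retrieve(chunks, "RAG retrieves chunks", top_k=1))
```

A production system would use a trained embedding model and an approximate nearest-neighbor index instead of a linear scan, but the pipeline shape is the same.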

00:11:31 Logan Kilpatrick

And why, and this is maybe a silly question... My sense has always been that RAG lets you get around the hard limits on context you can pass to the model. We have 1 million, we have 2 million, that's awesome, but if you look at internet scale, Wikipedia has billions of tokens, whatever it is. Why is RAG, this notion of bringing the right context to the model, not just baked into the model itself? Is it just, to the point of the conversation, the wrong research direction to go in? Or why don't we build that mechanism in? Because my face-value perspective is that it would be useful if the model could just do RAG: I could pass a billion tokens and let the model figure out, heuristically or through whatever mechanism, what the right tokens are. Or is that just a problem somewhere else in the stack that should be solved, so the model shouldn't have to think about it?

00:12:34 Nikolay Savinov

Well, one thing I want to say is that after we released the 1.5 Pro model, there were a lot of debates on social media about whether RAG is becoming obsolete. From my perspective, not really: enterprise knowledge bases, say, constitute billions of tokens, not millions, and for that scale you still need RAG. What I think is going to happen in practice is not that RAG is going to be eliminated right now, but rather that long context and RAG are going to work together. The benefit of long context for RAG is that you can retrieve more relevant needles from the corpus, and by doing that you increase the recall of the useful information. If previously you were setting some rather conservative threshold and cutting out many potentially relevant chunks, now you can say: hey, I have a long context, so I'm going to be more generous and pull in more facts. So I think there is a pretty good synergy between the two, and the real limitation is the latency requirements of your application. If you need real-time interactions, then you'll have to use shorter context. But if you can afford to wait a little bit more, then you're going to use long context, just because you can increase the recall by doing that.
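The recall tradeoff described here can be made concrete with a toy calculation (all numbers invented): a larger context budget lets you keep more retrieved chunks, so recall of the truly relevant ones rises.

```python
# Sketch of the recall tradeoff: with a longer context window you can
# keep more retrieved chunks, so recall of truly relevant chunks goes up.

def recall_at_budget(scored_chunks, relevant_ids, budget_tokens, chunk_tokens=100):
    """Keep the highest-scoring chunks that fit the budget; measure recall."""
    keep = min(budget_tokens // chunk_tokens, len(scored_chunks))
    ranked = sorted(scored_chunks, key=lambda x: x[1], reverse=True)
    kept = [cid for cid, _ in ranked[:keep]]
    hits = len(set(kept) & relevant_ids)
    return hits / len(relevant_ids)

# 10 chunks with noisy retrieval scores; chunks 0, 3, 7 are truly relevant,
# but the retriever did not rank all of them near the top.
scored = [(0, 0.9), (1, 0.8), (2, 0.7), (3, 0.6), (4, 0.5),
          (5, 0.4), (6, 0.3), (7, 0.2), (8, 0.1), (9, 0.05)]
relevant = {0, 3, 7}

print(recall_at_budget(scored, relevant, budget_tokens=200))   # short context
print(recall_at_budget(scored, relevant, budget_tokens=1000))  # long context
```

With a 200-token budget only the top two chunks fit and recall suffers; a 1000-token budget admits every candidate, which is exactly the "be more generous" behavior a long window allows.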

00:14:18 Logan Kilpatrick

Is 1 million just a marketing number, or is there something intrinsic, something actually technically happening around the million-token mark, from a long context perspective? Or is it literally just that we found a number that sounds good and then made the technology work, from a research perspective?

00:14:40 Nikolay Savinov

Well, when I started working on long context, the competition at the time was at about 128k, or maybe 200k tokens at most. So I was thinking how to set the... Generally, I feel like Needle in a Haystack is talked about a little bit, but is there another set of standard benchmarks that you're thinking about from a long context perspective?

00:30:48 Logan Kilpatrick

Yeah.

00:30:49 Nikolay Savinov

So that's one consideration. Another consideration is something which people call retrieval versus synthesis. Theoretically, if you just need to retrieve one needle from the haystack, that can be solved by RAG as well.

00:32:07 Logan Kilpatrick

Hmm.

00:32:08 Nikolay Savinov

But the tasks that we should really be interested in are the tasks which integrate information over the whole context. Summarization, for example, is one such task, and RAG would have a hard time dealing with it. Now, these tasks sound nice and like the right direction to go, but they're actually not so easy to use for automatic evaluation. For example, the metrics for summarization, like ROUGE, are known to be imperfect.

00:32:46 Logan Kilpatrick

Hmm.

00:32:47 Nikolay Savinov

And if you're doing hill climbing, then you're actually better off using, how do I say it, less gameable metrics.

00:33:00 Logan Kilpatrick

And just a quick follow-up: what makes them less useful, for summarization as an example? Is it just that it's more subjective what a good summary is versus what isn't, and there's no ground-truth source of truth? What makes that use case hard?

00:33:12 Nikolay Savinov

Yeah, those evals are going to be pretty noisy, because there will be relatively low agreement, even between the human raters. Of course, this is not to give the impression that we shouldn't work on summarization and we shouldn't measure it; these are important tasks. I'm just talking about my personal preference as a researcher, which is to hill climb on something which has a very strong signal.

00:33:42 Logan Kilpatrick

Yeah, that makes sense. Long context, especially for Gemini, is a core part of the capability story we're telling the world; it's a core differentiator for Gemini. And yet, at the same time, it feels like long context has always been an independent workstream. There are a ton of other teams hill climbing on a bunch of other stuff: factuality, reasoning, etc. Do you think the direction, from a research perspective, from a modeling perspective, is that long context just gets fused into every other workstream? Or do you think it still needs to be an independent workstream, because how you get the model to do useful stuff with long context is just fundamentally different from, say, reasoning, as a corollary example?

00:34:35 Nikolay Savinov

So, I guess my answer will be twofold. First of all, I find it helpful to have an owner for every important capability. But second, I think it's important for the workstream to also provide tools for people outside of it to contribute.

00:34:58 Logan Kilpatrick

Yeah, that makes a ton of sense. I have another follow-up, and I'm curious about the interplay between reasoning and long context. We had Jack Rae on, and we were both at dinner with Jack last night talking about reasoning stuff. Have you been surprised by how much, and you can correct me if this is wrong, the reasoning capability actually makes long context much more useful? Is that just the normal, expected outcome because the model's spending more time thinking, or is there some inherent, deep connection between reasoning capabilities and long context that makes it much more effective?

00:35:41 Nikolay Savinov

I would say there's a deeper connection, and the connection is this: if the next-token prediction task improves with increasing context length, then you can interpret that in two ways. One way is to say: hey, I'm going to load more context into the input, and the predictions for my short answer are going to improve as well. But another way to look at it is to say: hey, the output tokens are very similar to input tokens. So if you allow the model to feed the output into its own input, then the output kind of becomes input, and theoretically, if you have a very strong long context capability, it should also help you with reasoning. Another argument is that long context is pretty important for reasoning because, even if you are just going to make a decision by generating one token, even if the answer is binary and it's totally fine to generate just one token, it might be preferable to first generate a thinking trace.
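The "output tokens become input tokens" idea can be sketched as the basic autoregressive loop, with a stand-in next-token function in place of a real model: each generated token is appended back onto the context, so later steps can condition on earlier intermediate results.

```python
# Toy sketch of "output tokens become input tokens": each generated token
# is appended to the context, so later steps can read earlier intermediate
# results -- the mechanism behind thinking traces.

def toy_next_token(context):
    """Stand-in for an LLM step: continues a running count it finds in context."""
    last = context[-1]
    return last + 1 if isinstance(last, int) else 0

def generate(context, max_new_tokens):
    context = list(context)
    for _ in range(max_new_tokens):
        tok = toy_next_token(context)  # model reads its own prior output...
        context.append(tok)            # ...and writes it back into the input
    return context

print(generate(["count:"], 5))  # ['count:', 0, 1, 2, 3, 4]
```

Each step here depends on a value produced by an earlier step, which is exactly the kind of multi-step computation a single forward pass of fixed depth cannot unroll on its own.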

00:36:53 Logan Kilpatrick

Hmm.

00:36:54 Nikolay Savinov

And the reason is simply architectural. If you need to make many logical jumps through the context when making a prediction, then you are limited by the network depth, because the number of attention layers is roughly what limits you in terms of jumps through the context. So you are limited. But now, if you imagine that you are feeding the output into the input, then you are not limited anymore: basically, you can write into your own memory, and you can perform much harder tasks than you could by just utilizing the network depth.

00:37:37 Logan Kilpatrick

That's super interesting. Related to this reasoning-plus-long-context story, you and I have both been pushing for a long time to try to get long outputs landed in the models. And I think developers want this; I see pings all the time. I'm going to start sending them to you now, so that you have to answer this question, but lots of people are saying: hey, we want longer than 8,000 output tokens. We sort of have this to a certain extent now with the reasoning models; they have 65,000 output tokens, with the caveat that a large portion of those output tokens is actually for the model to do the thinking itself, versus generating some final response to the user. How connected are long context input versus long context output capabilities? Is there any interplay between those two things? A lot of the core use case people want is, you know, dump in a million tokens and then refactor that million tokens. Do you think we'll get to a world where those two things are actually the same capability? Do you look at them as the same capability, or are they two completely, fundamentally different things from a research perspective?

00:38:46 Nikolay Savinov

No, I don't think they are fundamentally different. The important thing to understand is that straight out of pre-training, there isn't really any limitation on the model side to generating a lot of tokens. You can just put in, say, half a million tokens and tell it, I don't know, "copy these half a million tokens," and it will actually do it. We tried it; it works. But this capability requires very careful handling in post-training.

00:39:21 Logan Kilpatrick

M.

00:39:22 Nikolay Savinov

And the reason it requires careful handling is that in post-training you have this special end-of-sequence token, and if your SFT data is short, then what's going to happen is the model is going to see this end-of-sequence token pretty early in the sequence. It's just going to learn: hey, you're always showing me this token within context length X, so I'm going to generate this token within context length X and stop generation; that's what you're teaching me. This is actually an alignment problem. But one point I want to make is that I feel like reasoning is just one kind of long output task. Translation, for example, is another kind. Reasoning has a very special format: it packs the reasoning trace inside some delimiters, and the model actually knows that we are asking it to do the reasoning in there. But for translation, the whole output, not just the reasoning trace, is going to be long. This is another kind of capability that we want to encourage the model to produce. So it's just a matter of properly aligning the model, and we are actually working on long output.

00:40:54 Logan Kilpatrick

I'm excited; people want it very badly. I think that gets to a broader point around how developers should be thinking about best practices for long context, and also potentially for RAG as well. And I know you gave a bunch of feedback on our long context developer documentation, so we have some of this documented already, but what's your general sense of the suggestions for developers as they're thinking about how to most effectively use long context?

00:41:23 Nikolay Savinov

So I think suggestion number one is: try to rely heavily on context caching.

00:41:29 Logan Kilpatrick

M.

00:41:30 Nikolay Savinov

So let me explain the concept of context caching. The first time you supply a long context to the model and ask a question, it's going to take longer and it's going to cost more. But if you're asking a second question on the same context after the first one, then you can rely on context caching to make it both cheaper and faster to answer. That's one of the features we are currently providing for some of the models. So yeah, try to rely heavily on this: try to cache the files that the user uploaded into context, because it's not only faster to process, it's going to cost you on average around four times less on input token price.
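A back-of-envelope sketch of the saving (prices in made-up units, using the roughly four-times-cheaper cached-input figure mentioned here, and ignoring any cache storage fees):

```python
# Back-of-envelope cost sketch for context caching. Prices are arbitrary
# units; cached input is billed at ~1/4 of normal input, per the episode.

INPUT_PRICE = 1.0                # cost per input token, arbitrary units
CACHED_PRICE = INPUT_PRICE / 4   # cached tokens bill at roughly a quarter

def cost_without_cache(context_tokens, question_tokens, num_questions):
    """Every question re-sends the full context at the normal input price."""
    return num_questions * (context_tokens + question_tokens) * INPUT_PRICE

def cost_with_cache(context_tokens, question_tokens, num_questions):
    """First question pays full price; later ones hit the cached prefix."""
    first = (context_tokens + question_tokens) * INPUT_PRICE
    rest = (num_questions - 1) * (context_tokens * CACHED_PRICE
                                  + question_tokens * INPUT_PRICE)
    return first + rest

ctx, q, n = 1_000_000, 100, 10   # million-token doc, 10 short questions
print(cost_without_cache(ctx, q, n))
print(cost_with_cache(ctx, q, n))
```

With many questions against the same large document, the cached total approaches a quarter of the uncached one, which is the "chat with my docs" sweet spot discussed next.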

00:42:19 Logan Kilpatrick

And just to give an example of this, and you can correct me if this is wrong or not the same mental model that you have: the most common application where this ends up being really useful is the "chat with my docs," or "chat with my PDF," or "chat with my data" type of application, where the actual original input context, to your point, is the same. And that's one of the requirements of using context caching, again correct me if my mental model is wrong: the original context you supply has to be the same. If for some reason that input context were changing on a request-by-request basis, context caching wouldn't actually end up being that effective, because you're paying to store some set of original input context that has to persist from user request to user request.

00:43:06 Nikolay Savinov

Yeah, I guess the answer is yes to both. It's important for cases where you want to chat with a collection of your documents, or some large video you want to ask questions about, or a code base. And you are correct to mention that this knowledge shouldn't change, or if it changes, then the best place for it to change is at the very end.

00:43:35 Logan Kilpatrick

Hmm.

00:43:36 Nikolay Savinov

Because what we're going to do under the hood is find the prefix which matches the cached prefix, and we are just going to throw away the rest. Sometimes developers ask: where should we put the question, before the context or after the context? Well, this is the answer: you want to put it after the context. If you want to rely on caching and profit from the cost saving, that's the place to put it, because if you intend to put all your questions at the beginning, then your caching is going to start from scratch every time.
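The prefix-matching behavior described here explains the placement advice. A small sketch with token lists (the matching below is illustrative, not the actual serving implementation):

```python
# Why the question goes after the context: caching reuses a shared token
# prefix, and a leading question breaks that prefix on every request.

def common_prefix_len(a, b):
    """Length of the longest shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

doc = ["<doc>"] * 8                  # stand-in for a long cached context
req1_after = doc + ["Q1?"]           # question placed after the context
req2_after = doc + ["Q2?"]
req1_before = ["Q1?"] + doc          # question placed before the context
req2_before = ["Q2?"] + doc

print(common_prefix_len(req1_after, req2_after))    # 8 -> whole doc reusable
print(common_prefix_len(req1_before, req2_before))  # 0 -> cache is useless
```

With the question last, consecutive requests share the entire document as a prefix; with the question first, the very first token differs and nothing can be reused.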

00:44:19 Logan Kilpatrick

Yeah, that's awesome; that's helpful. Other tips? Anything else besides context caching that folks should be thinking about from a developer perspective?

00:44:28 Nikolay Savinov

One thing we already touched on is the combination with RAG. If you need to go into billions of tokens of context, then you need to combine with RAG. But also, in some applications where you need to retrieve multiple needles, it might still be beneficial to combine with RAG even if you need much shorter context. Another thing we already discussed: don't pack the context with irrelevant stuff; it's going to hurt this multi-needle retrieval. Another interesting thing: we touched on the interaction between in-weight and in-context memory. One thing I must mention is that if you want to update your in-weight knowledge using in-context memory, then the network necessarily has two kinds of knowledge to rely on, so there might be a contradiction between the two. I think it's beneficial to resolve this contradiction explicitly by careful prompting. For example, you might start your question by saying "based on the information above," etc. When you say this, "based on the information above," you give a hint to the model that it actually has to rely on in-context memory instead of in-weight memory, so it resolves this ambiguity for the model.

00:45:59 Logan Kilpatrick

I love that; that's a great suggestion. And your comment about this tension between in-weight and not: how do you think about the fine-tuning angle of this, from a developer perspective? The only thing that's maybe more controversial than "is long context going to kill RAG" is "should people be fine-tuning at all?" Simon Willison has a bunch of threads about this: does anyone actually fine-tune models? Does it end up helping them? Would it be useful to do fine-tuning alongside long context for a similar corpus of knowledge, or does the fine-tuning piece potentially lead to better general outcomes? How do you think about that interplay?

00:46:44 Nikolay Savinov

Yeah, so let me maybe elaborate on how fine-tuning could actually be used on a knowledge corpus. What people sometimes do is get additional knowledge, let's say you have a big enterprise knowledge corpus, say a billion tokens, and continue training the network just like we do in pre-training: you apply the language-modeling loss and ask the model to learn to predict the next token on this knowledge corpus. But you should keep in mind that this way of integrating information, while it actually works, has limitations. One limitation is that because you are actually training the network instead of just supplying the context, you should be prepared for various problems: you will need to tune hyperparameters, you will need to know when to stop the training, and you'll have to deal with overfitting. Some people who tried to do this reported increased hallucinations from the process, and they hinted that maybe it's not the best way to supply knowledge to the network. But obviously this technique also has advantages. In particular, it's going to be pretty cheap and fast at inference time, because the knowledge is in the weights, so you're just sampling. There are also some privacy implications, because now the knowledge is cemented into the weights of the network. And if you actually want to update this knowledge, then you are back to the original problem: this knowledge is not easy to update, it's in the weights. So how are you going to do it? You will have to again supply the updated knowledge through the context.

00:48:51 Logan Kilpatrick

Yeah, I think it's such an interesting tradeoff problem from a developer perspective, about how rapidly you want to be able to update the information. And the cost piece of it: RAG is actually pretty reasonable, you're paying for a vector database, there are a lot of offerings, and that's reasonably efficient to do at scale. But continuously fine tuning new models is oftentimes not cheap. So there are a lot of interesting dimensions to take into account. I'm curious about the long-term direction, maybe not from a fine tuning but from a long context perspective: what can folks look forward to in the next three years for long context, maybe from an experience perspective? Will we even talk about long context in three years? Will it just be that the model does this thing and I don't need to care about it, it just works? Or yeah, how are you thinking about this?

00:49:49Nikolay Savinov

So I'll make a few predictions. What I think is going to happen first is that the quality of the current one or two million context is going to increase dramatically, and we are going to max out pretty much all the retrieval-like tasks quite soon. And the reason I think it's going to be the first step is, well, you could say, hey, why don't we extend the context, why stop at one million or two million? But the point is that the current million context is not close to perfect yet. And while it's not close to perfect, there's a question of why you would want to extend it. Because what I think is going to happen is that when we achieve close-to-perfect million context, it's going to unlock totally incredible applications, something we could never imagine would happen. The ability to process information and connect the dots will increase dramatically. This thing can already simultaneously take in more information than a human can. Like, I don't know, go watch a one-hour video and then immediately after that answer some particular question on that video, like at what second someone is dropping a piece of paper. You can't really do that very precisely as a human. So what I think is going to happen is these superhuman abilities are going to be more pervasive: the better long context we have, the more capabilities that we could never imagine are going to be unlocked. So that's going to be step number one. The quality is going to increase and we're going to get nearly perfect retrieval. After that, what's going to happen is the cost of long context is going to decrease. I think it will take maybe a little bit more time, but it's going to happen, and as the cost decreases, longer context also gets unlocked.
So I think reasonably soon we will see the 10 million context window as a commodity: it will basically be normal for providers to give a 10 million context window, which is currently not the case. When this happens, that's going to be a game changer for some applications like coding, because I think in one or two million you can only fit somewhere between a small and a medium-size code base in the context, but 10 million actually unlocks large coding projects to be included in the context completely. And by that point we'll have the innovations which enable near-perfect recall for the entire context. This thing is going to be incredible for coding applications, because of the way humans code: you need to hold in memory as much as possible to be effective as a coder, and you need to jump between the files all the time. And you always have this narrow attention span. But LLMs are going to circumvent this problem completely: they're going to hold all this information in their memory at once, and they're going to be able to reproduce any part of this information precisely. Not only that, they will also be able to really connect the dots. They will find the connections between the files, and so they will be very effective coders. I imagine we will very soon get superhuman coding assistants. They will be totally unrivalled, and they will basically become the new tool for every coder in the world. And so when this 10 million happens, that's the second step. And going to, say, 100 million, well, it's more debatable. I think it's going to happen, I don't know how soon it's going to come. And I also think we will probably need more deep learning innovations to achieve this.
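A quick back-of-envelope sketch of the "does the code base fit" point above: walk a repository, sum the bytes of source files, and divide by a rough characters-per-token heuristic. Real tokenizers vary by language and content, so the ~4 bytes/token constant and the extension list here are assumptions, not an exact measurement.

```python
import os

BYTES_PER_TOKEN = 4  # rough heuristic, not an exact tokenizer

def estimate_tokens(root, exts=(".py", ".js", ".go", ".java", ".c", ".h")):
    """Very rough token count for the source files under a directory."""
    total_bytes = 0
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name.endswith(exts):
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return total_bytes // BYTES_PER_TOKEN

def fits(tokens, window):
    """Does the estimated code base fit entirely in the context window?"""
    return tokens <= window

# e.g. a ~6 MB code base is roughly 1.5M tokens: it overflows a 1M window
# but sits comfortably inside the hypothetical 10M window discussed above.
```

By this estimate, a repository only a few megabytes in size already strains a one-million-token window, which is exactly why the jump to 10 million matters for whole-project coding.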

00:54:43Logan Kilpatrick

Yeah, I love that. Well, one quick follow-up across all three of those dimensions: how much, in your mind, is this a hardware or infrastructure story relative to a model story? There's obviously a lot of work that has to happen to actually serve long context at scale, which is why it costs more money to do long context, etc. Do you think about this from a research perspective, or is it like, hey, the hardware is sort of going to take care of itself, the TPUs will do their job, and I can just focus on the research side of things?

00:55:15Nikolay Savinov

Oh well, yeah, I mean, just having the chips is not enough. You also need very talented inference engineers. And I'm actually really impressed by the work of our inference team; what they pulled off with the million context was incredible. Without such strong inference engineers, I don't think we would have delivered one or two million context to customers. So it's a pretty big inference engineering investment as well. And no, I don't think it's going to resolve itself.

00:56:04Logan Kilpatrick

Yeah, our inference engineers are always working hard, because we always want long context on these models and it's not easy to make it happen. How do you think about the interplay of a bunch of these agentic use cases with long context? Is it a fundamental enabler of different agent experiences than you could have before, or what's the interplay between those two dynamics?

00:56:30Nikolay Savinov

Oh, this is an interesting question. I think agents can be considered both consumers and suppliers of long context. So let me explain this. For agents to operate effectively, they need to keep track of their past state, like the previous actions that they took, the observations that they made, etc., and of course the current state as well. So to keep all these previous interactions in memory, you need longer context. That's where longer context is helping agents; that's where agents are the consumers of long context. But there is also another, orthogonal perspective: agents are actually suppliers of long context as well, and this is because packing long context by hand is incredibly tedious. If you have to upload all the documents that you want by hand every time, or upload a video, or, I don't know, copy-paste some content from somewhere on the web, this is really tedious. You don't want to do that. You want the model to do it automatically, and one way to achieve this is through tool calls. So the model can decide on its own, like, hey, at this point I'm going to fetch some more information, and then it's going to just pack the context on its own. And so yeah, in that sense agents are the suppliers of long context.
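The "agents as suppliers of long context" idea can be sketched as a small loop: the model's chosen tool calls are executed, and their results are assembled into the prompt automatically instead of the user pasting documents in by hand. The tool registry, tool names, and budget check below are invented for illustration; they are not a real agent framework's API.

```python
from typing import Callable

# Hypothetical tools an agent could call to gather context on its own.
TOOLS: dict[str, Callable[[str], str]] = {
    "fetch_doc": lambda name: f"<contents of {name}>",
    "fetch_web": lambda url: f"<page at {url}>",
}

def pack_context(question: str, tool_calls: list[tuple[str, str]],
                 max_tokens: int = 1_000_000) -> str:
    """Run the agent's chosen tool calls and assemble the long context."""
    parts = []
    for tool, arg in tool_calls:
        parts.append(f"[{tool}({arg})]\n{TOOLS[tool](arg)}")
    context = "\n\n".join(parts)
    # Very rough budget check (~4 chars/token) before handing to the model.
    assert len(context) // 4 <= max_tokens, "context over budget"
    return f"{context}\n\nQuestion: {question}"

prompt = pack_context(
    "What changed in the design doc?",
    [("fetch_doc", "design.md"), ("fetch_web", "https://example.com/changelog")],
)
```

In a real system the model itself would emit the `tool_calls` list; the key point from the conversation survives even in this toy: the context is packed by the agent, not by the user.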

00:58:13Logan Kilpatrick

Yeah, that's such a great example. My two cents, and I've had many conversations with folks about this: I think this is actually one of the main limitations of how people interact with AI systems. Your example of it being tedious, it's so tedious. The worst part about doing anything with AI is that I have to go and find all the context that might be relevant for the model and personally bring that context in. And in many cases the context is already on my screen or on my computer, or I have the context somewhere, but I have to do all the heavy lifting. So I'm excited for, you know, we should build some long context agent system that just goes and gets your context from everywhere. I think that would be super, super interesting, and I feel like it solves a very fundamental problem not only for developers but for end users of AI systems. I wish the models could just go and fetch my context and I didn't have to do it all.

00:59:09Nikolay Savinov

Yeah, MCP for the win.

00:59:12Logan Kilpatrick

I love that. Nikolay, this was an awesome conversation. Thank you for taking the time. I'm glad we got to do this in person, and I appreciate all the hard work from you and the long context teams. Hopefully we'll have lots more exciting long context stuff to share with folks in the future.

00:59:27Nikolay Savinov

Yeah, thanks for inviting me. It was fun to have this conversation.

00:59:31Logan Kilpatrick

Yeah, I love it.
