
TextPurr


Jeff Dean's talk at ETH Zurich in April 2025 on important trends in AI

Google
Important Trends in AI: How Did We Get Here, What Can We Do Now and How Can We Shape AI’s Future? by Jeff Dean. This informative talk succinctly summarized the truly remarkable progress in model architectures, hardware, and systems for ML/AI.
Hosts: Unknown Host, Jeff Dean
📅April 20, 2025
⏱️01:18:47
🌐English

Disclaimer: The transcript on this page is for the YouTube video titled "Jeff Dean's talk at ETH Zurich in April 2025 on important trends in AI" from "Google". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.

Watch the original video here: https://www.youtube.com/watch?v=q6pAWOG_10k

00:00:00 Host

All right, welcome everyone. Great to see a full house. It is my great pleasure to introduce Jeff Dean, who is Google's Chief Scientist. He joined Google in 1999, where he has been building and co-designing and co-implementing the pillars of Google's distributed technology with systems like MapReduce, Bigtable, Spanner, TensorFlow, and more recently Pathways. In 2011 he co-founded the Google Brain team, and since then his research focus has been on systems and applications for AI. Today he's going to tell us about important trends in AI. I should also mention he has won many awards: he is the recipient of the ACM Prize in Computing, the IEEE John von Neumann Medal, and the Mark Weiser Award, and he is an ACM Fellow, among many others. So we are very, very excited to have you here, in case you can't tell by the turnout, and very much looking forward to your talk. So, a warm welcome to Jeff Dean.

00:00:55 Jeff Dean

Thank you. Thank you so much for the delightful introduction. I'm really excited to be here, and I'm going to talk to you today about important trends in AI: how did we get to where we are with the current state of what models can do, what can we do now that the field has advanced to the current level, and how can we shape what we want AI to do in the future? This is joint work with many, many people at Google and elsewhere, so it's not all my own work. Much of it is collaborative work, and some of it is not my work at all, but I think it's an important body of work to discuss.

00:01:39 Jeff Dean

Okay, so some observations, most of which are probably reasonably obvious to you. Most importantly, machine learning has really changed our expectations of what we think computers are able to do. If you think back 10 years ago, computers could barely see, with rather rudimentary computer vision performance. Speech recognition worked, but not super well. Language understanding, in terms of language models, was somewhat limited in its capabilities. What we've seen over the last 12, 13, 14 years is that increasing the scale of the compute used to train the models, the data, and the model size generally delivers better results. There's almost a truism to that: we've seen it over and over during the last 15 years that a bigger model and more data give you better performance on problems we actually care about, in terms of the capabilities of computers. Algorithmic and model architecture improvements have also been really important; it's not just a "throw more hardware at the problem" kind of thing. Algorithmic and model architecture improvements have actually been even more significant than the hardware improvements we've seen in the last decade. And finally, as a result of all of this, the kinds of computations we want to run on computing hardware are really changing, and how we think about building the computer hardware to run the applications of today and tomorrow is really shifting away from traditional CPU-based computation.

00:03:22 Jeff Dean

Okay, so first I'm going to go through a section that is a whirlwind of one slide per advance. Oh, I should relaunch Chrome within two days. Hang on. I agree I should probably relaunch Chrome, but let's try not to do it right now. So, a whirlwind of one or two slides per particular technique that has been really influential in getting modern models to where they are today. Let's launch right into that; it's going to be mostly chronological, but not quite.

00:03:59 Jeff Dean

Okay, so obviously a key building block from the last century is neural networks. Almost all of the advances you see in machine learning at the largest scale, and in the capabilities you see computers have, are based on neural network-based computation. These are made up of artificial neurons. They're loosely based on how real neurons behave in some ways, but they're very imperfect reproductions of how we even understand real neurons to behave, and there's lots we don't understand. But they are one of the underlying building blocks. Another key building block is backpropagation as a way to optimize the weights of the neural network. By essentially propagating errors backward from the output the model gave you toward the output you wanted the model to give you, it provides an effective algorithm for updating the weights of a neural network to minimize errors on the training data. And then, because of the generalization properties of neural networks, you can generalize to problems or particular examples the neural network has not seen. So those two things, backpropagation and neural nets, really are key to a lot of the deep learning revolution.
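
To make those two building blocks concrete, here is a minimal sketch (illustrative only, not from the talk): a tiny network of artificial neurons fit to a toy task by backpropagation and gradient descent. All sizes and learning rates are arbitrary choices for the example.

```python
# A tiny neural network trained by backpropagation on XOR (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)   # hidden layer of artificial neurons
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)   # output neuron

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(2000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error from the output back toward the input.
    dlogits = (p - y) / len(X)            # gradient of cross-entropy w.r.t. output logits
    dW2 = h.T @ dlogits; db2 = dlogits.sum(0)
    dh = dlogits @ W2.T * (1 - h ** 2)    # chain rule through tanh
    dW1 = X.T @ dh; db1 = dh.sum(0)
    # Gradient-descent update to reduce error on the training data.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.round(p, 2))   # typically approaches [0, 1, 1, 0] after training
```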

00:05:13 Jeff Dean

So, one of the things that I and some other people worked on in 2012 was this notion that maybe if we were to train really, really big neural networks, they would be even better than small ones. We had this hypothesis, and in 2012 we decided it would be kind of fun to train a very large neural network and see if we could do it using an unsupervised learning algorithm. So we trained this large neural network, which was about 60x bigger than the previously largest known neural network in 2012, using 16,000 CPU cores. At that time we didn't have GPUs in our data centers; we had a lot of regular old CPU computers. What we saw was that this unsupervised training objective, followed by some supervised training, gave a 70% relative improvement in the less frequently contested ImageNet 22K category. Most of the ImageNet results you hear about are in the 1,000-category section; this one was perhaps more interesting because it has 22,000 very fine-grained categories. That was quite a significant advance, and it kind of proved our hypothesis that larger models are more capable if you put sufficient training computation behind them. As part of doing that work, we developed our first large-scale neural network infrastructure project. This was called DistBelief, partly because it was distributed over many machines as a distributed computing system, but also because our colleagues didn't really think it was going to work, so it was a bit of a play on words. When you're training these large models and the model doesn't fit on a single computer, there are a few different ways you can parallelize that computation. The first is model parallelism: you take your model, which typically has many layers of neurons, and you slice it both vertically and horizontally, say, to place pieces of the model on each computer, and then manage the communication between the edges that cross the different splits you've made in your model. The other thing you can do is data parallelism, where you have many copies of the underlying model on different machines, perhaps combined with model parallelism, with each copy spanning many machines. You then partition the data you're training on across those different model replicas. In the case of what we were doing in DistBelief, we had a centralized system that could accept gradient updates from different replicas of the model and would apply them to the parameters. But we did this in a way that is not mathematically correct, because it was completely asynchronous. Different model replicas would get new copies of the parameters, compute on a bit of data, and send a gradient based on those parameters and that batch of training data back to the parameter server. But by then the parameters had moved, because other model replicas had applied their gradients in the interim. So this is clearly not mathematically correct according to the gradient descent algorithm, but it works.
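
Here is a toy sketch of that asynchronous data-parallel idea (names and numbers are illustrative, not the actual DistBelief code): several replicas compute gradients against whatever parameters they last fetched, and a central parameter server applies those possibly stale gradients as they arrive.

```python
# Toy asynchronous parameter-server training (illustrative sketch, linear regression).
import threading
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.params = np.zeros(dim)
        self.lock = threading.Lock()

    def fetch(self):
        with self.lock:
            return self.params.copy()

    def apply_gradient(self, grad, lr=0.1):
        with self.lock:
            self.params -= lr * grad        # applied even if computed from stale parameters

def worker(ps, data_shard):
    for x, y in data_shard:
        w = ps.fetch()                      # parameters may be stale by the time the
        grad = 2 * (w @ x - y) * x          # gradient is sent back
        ps.apply_gradient(grad)             # asynchronous update: no barrier between replicas

# Toy data, partitioned across four model replicas.
rng = np.random.default_rng(0)
true_w = np.array([3.0, -2.0])
data = [(x, x @ true_w) for x in rng.normal(size=(400, 2))]
shards = [data[i::4] for i in range(4)]

ps = ParameterServer(dim=2)
threads = [threading.Thread(target=worker, args=(ps, s)) for s in shards]
for t in threads: t.start()
for t in threads: t.join()
print(ps.params)   # ends up close to [3, -2] despite the "mathematically incorrect" staleness
```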

00:08:29 Host

So, that's nice.

00:08:30 Jeff Dean

And that's what really enabled us to scale to very large models, even using CPUs.

00:08:42 Jeff Dean

In 2013, we used that framework to scale up the training of dense representations of words, using a word embedding model called Word2Vec. One of the really useful things that came out of this piece of work is that by having a representation of a word that is a high-dimensional vector, you get two nice properties if you train it in particular ways. One way to train it is shown at the bottom here, where you take the vector representing the middle word and try to predict the nearby words from that representation. There's a different version where you take all the surrounding words and try to predict the middle word; both work roughly equally well. But if you train the embedding vectors for words in this way, what you find is that you can then represent words with these high-dimensional vectors, which have two nice properties. One is that nearby words in this high-dimensional space, after you train on lots of data, tend to be related, because you nudge all the words related to cats and pumas and tigers into the same part of the thousand-dimensional space, say. The other interesting thing is that directions are meaningful in the space. To transform the male version of a word to the female version, you go in roughly the same direction regardless of whether the words are king and queen, man and woman, or bull and cow, and so on. So interesting linguistic properties emerge from the training process in the directions between different points in the space.
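
A small sketch of those two properties, using hand-built toy vectors (real Word2Vec embeddings are learned from data; these are constructed just to illustrate the nearest-neighbor and direction arithmetic):

```python
# Toy word vectors illustrating "nearby words are related" and "directions are meaningful".
import numpy as np

royal  = np.array([1.0, 0.0, 0.0])
male   = np.array([0.0, 1.0, 0.0])
female = np.array([0.0, 0.0, 1.0])

emb = {
    "king":  royal + male,
    "queen": royal + female,
    "man":   male,
    "woman": female,
    "cat":   np.array([0.10, 0.20, 0.20]),
    "puma":  np.array([0.12, 0.19, 0.21]),   # near "cat": related words cluster together
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(vec, exclude=()):
    return max((w for w in emb if w not in exclude), key=lambda w: cosine(vec, emb[w]))

# Property 1: nearby vectors correspond to related words.
print(nearest(emb["cat"], exclude={"cat"}))                                   # -> puma

# Property 2: the male -> female direction applied to "king" lands near "queen".
print(nearest(emb["king"] - emb["man"] + emb["woman"],
              exclude={"king", "man", "woman"}))                              # -> queen
```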

00:10:30 Jeff Dean

In 2014, three of my colleagues, Ilya Sutskever, Oriol Vinyals, and Quoc Le, developed a model called sequence-to-sequence learning with neural networks. The idea is that you have some input sequence and you want to predict an output sequence from it. A very classic case is translation: you have the English sentence, and then, using the representation you've built up by processing the input English sentence one word at a time, you now have a dense representation from which you start to do what's called decoding of the French sentence. By processing lots of sentence pairs of English and French, you can essentially learn a language translation system purely from this kind of sequence-to-sequence neural network. If you use that representation to initialize the state of the neural decoder when you start to translate, it actually works, and you can scale up the LSTMs and show that it works better and better.
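
A minimal, untrained sketch of that encoder-decoder data flow (the weights here are random and the token IDs are made up, so it only shows the shape of the computation, not a working translator):

```python
# Sequence-to-sequence sketch: encode source tokens into one state, then decode greedily.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 100, 16
E     = rng.normal(scale=0.1, size=(vocab_size, dim))   # token embeddings
W_h   = rng.normal(scale=0.1, size=(dim, dim))          # recurrent weights (shared here for brevity)
W_x   = rng.normal(scale=0.1, size=(dim, dim))
W_out = rng.normal(scale=0.1, size=(dim, vocab_size))   # projects state to next-token scores

def step(state, token_id):
    return np.tanh(state @ W_h + E[token_id] @ W_x)

# Encode: absorb the source sentence one token at a time into a single dense state.
source = [11, 42, 7]                       # e.g. an English sentence as token IDs
state = np.zeros(dim)
for tok in source:
    state = step(state, tok)

# Decode: initialize the decoder with the encoder's final state and emit tokens one at a
# time, feeding each prediction back in (greedy decoding).
BOS, EOS = 1, 2
out, tok = [], BOS
for _ in range(10):
    state = step(state, tok)
    tok = int(np.argmax(state @ W_out))    # pick the highest-scoring next token
    if tok == EOS:
        break
    out.append(tok)
print(out)                                 # gibberish here; meaningful only after training
```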

00:11:38 Jeff Dean

So in about 2013 I started to get worried, because as we were making bigger and bigger neural networks for things like speech and vision, and in some sense language, I started to do some calculations: well, if, say, speech recognition starts to work better, people might use it, and that might be problematic if we actually want to serve a lot of users. So I did a very rough calculation: suppose 100 million of our users start to talk to their phone for three minutes a day. At that time the models were big enough that you couldn't run them on device; you had to run them in our data centers. And I discovered that rolling out a better speech model would be a problem. We actually had a better speech model at that time that reduced the error rate by 40%, which is an enormous improvement in speech recognition, so we knew it would be better if we could serve it to a lot of people. But the calculation said that at 100 million people talking for three minutes a day, we would need to double the number of computers Google had just to roll out that improvement to the speech recognition model, which is just one of our many products. So I started to talk to some of our colleagues in our technical infrastructure group who have hardware expertise, and we decided it would be sensible to build more customized hardware for neural network inference. That was the genesis of the Tensor Processing Unit (TPU) line, and the first version was really specialized for inference only: it used very reduced precision and had only 8-bit integer operations in its multiplier. The target was: let's build something that's really good at low-precision linear algebra, and that will be useful for serving a lot of different kinds of neural network-based models. You don't need all the other bells and whistles in, say, modern CPUs that make things much more complicated, like branch predictors or caches of various kinds. You can instead just try to build the fastest and smallest dense low-precision linear algebra machine you can. Sure enough, what a large team produced was a TPU that was 15 to 30 times faster than contemporary CPUs and GPUs at these kinds of tasks, and 30 to 80x more energy efficient. By the way, this is now the most cited paper in ISCA's 50-year history, which is pretty impressive because it was only published in 2017. But this really started our foray into more specialized compute for machine learning models.
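
The 100 million users and three minutes a day come from the talk; the per-second model cost and per-CPU throughput below are made-up placeholders, just to show the shape of that back-of-the-envelope capacity estimate.

```python
# Back-of-the-envelope serving estimate (placeholder numbers, illustrative arithmetic only).
users = 100_000_000
seconds_per_user_per_day = 3 * 60

flops_per_second_of_audio = 5e9     # assumed acoustic-model cost per second of speech (placeholder)
sustained_flops_per_cpu   = 1e10    # assumed sustained throughput of a 2013-era server (placeholder)

total_flops_per_day = users * seconds_per_user_per_day * flops_per_second_of_audio
cpu_seconds_per_day = total_flops_per_day / sustained_flops_per_cpu
machines_needed = cpu_seconds_per_day / 86_400       # machines running flat out all day

print(f"{machines_needed:,.0f} machines")            # with these placeholders: ~104,000
```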

00:14:26 Jeff Dean

Then we said, well, it would be really good to actually scale things up and also focus on training, not just inference. So that's where we started to think about systems that are really more like machine learning supercomputers, with a lot of chips densely connected by customized high-speed interconnect. We've now done six generations of TPU pods, which are really good for both inference and training, and these connect thousands of chips together. The initial one was 256 chips, then about 1,000, then about 4,000, and the most recent ones have been around 8,000 or 9,000 chips, all connected together with custom high-speed networks. Since version 4, they've had this really exotic optical network: you can take a rack of 64 chips and a rack of 64 chips over there and use optical switching and moving mirrors to make it seem as though they're next to each other on the data center floor, even if they're not. You can read all about that in that ISCA paper. We actually just announced the latest version of this last week; it's called Ironwood. We stopped naming them with numbers, which confuses me. Ironwood has a quite large pod size: it's got 9,216 chips, each of which can do 4,614 teraflops, so that's about 42.5 exaflops in one of these pods, of reduced-precision floating point; this is 8-bit floating point precision. It's quite a boost from the previous generation. Compared to the first training pod, it's about a 3,600x increase in compute capability in the pod, in seven years or so.
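
The pod-level number quoted here can be checked directly; this tiny script (not from the talk) just multiplies the per-chip figure by the pod size.

```python
# Quick check of the Ironwood pod figures quoted above.
chips_per_pod = 9216
teraflops_per_chip = 4614              # 8-bit floating point, per the talk

pod_exaflops = chips_per_pod * teraflops_per_chip * 1e12 / 1e18
print(round(pod_exaflops, 1))          # ~42.5 exaflops, matching the figure in the talk
```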

00:16:49 Jeff Dean

Another trend that's really important is that open-source tools for machine learning have enabled a much broader community to participate in improving those tools, but also to use them to tackle machine learning problems of all kinds across many different disciplines. There's TensorFlow, which we released in 2015; PyTorch, which I think came in 2016; and JAX, another Google-developed open-source framework with a bit more of a functional style, in 2017 or 2018, I think. These three packages have really pushed the field forward quite a lot in terms of how accessible machine learning is to people and how standardized the way algorithms are expressed in these different frameworks has become.

00:17:40 Jeff Dean

In 2017, several of my colleagues came up with the very nice observation that in a recurrent model you have this very sequential process of absorbing one token at a time and updating the internal state of the model before you advance to the next one. That inherent sequential step really limits how much parallelism you can get and how efficiently you can learn from large amounts of data. So they had this really nice observation: instead of taking the current state, advancing it one step to change that internal state, and then advancing to the next training token and updating the internal state again, they would just save all the internal states, and then have a mechanism called attention that enables you to refer back to all the states you went through to get to where you are. If you're 117 tokens in, you have all 117 states that you went through, and you can attend to them and look at which pieces of the representation seem most relevant for the task you're doing, which is often predicting the next token. This was a hugely influential paper, and part of the reason is that they were able to demonstrate, initially on machine translation, that with 10 to 100x less compute and 10x smaller models you could get better performance than the state-of-the-art LSTM and other model architectures of the time. And that's a log scale, so that difference is quite large even though the arrow is small. So that's been really, really important; nearly all the modern large language models you hear about use a Transformer as the underlying model architecture, with some variation.
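
A compact sketch of that attention idea (illustrative NumPy, not a full Transformer): instead of carrying one running state forward, keep every position's representation and let each position look back over all of them.

```python
# Scaled dot-product self-attention over all saved per-token states.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, causal=True):
    """X: (seq_len, dim) token representations; returns the same shape."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how relevant is each saved state?
    if causal:                                       # for next-token prediction, only look left
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ V                       # weighted mix of all saved states

rng = np.random.default_rng(0)
seq_len, dim = 117, 32
X = rng.normal(size=(seq_len, dim))
Wq, Wk, Wv = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (117, 32): token 117 attended to all 117 states
```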

00:19:32 Jeff Dean

This is probably not new in 2018, but it really came more into vogue in 2018: the realization that language modeling at scale can be done with self-supervised data. Take any piece of text: you can use parts of that text to predict other parts of it, and by doing so you have the right answer in some sense, namely the actual text. That gives you very, very large amounts of training data, and it's one of the major reasons these language models have gotten so good: you can get more and more text to train on and improve the quality. There are a couple of different kinds of training objectives. The first is autoregressive, where you look at the prefix of words and try to predict the next word. Many of the models you hear about today are of this form, and you can make lots of little training puzzles for your model: "Zurich is ___", "Zurich is the ___", "Zurich is the largest ___". The model is forced to use the context it sees to the left to try to predict the missing word. You can also use fill-in-the-blank style training objectives: "Zurich ___, the largest ___ in ___" is an interesting training example, and you can take the same text and turn it into a completely different training example with different things to fill in, like "Zurich is the ___ city ___ Switzerland". Both of these kinds of training objectives are very useful. The autoregressive one tends to be used more, because that's what you're doing when you're, say, operating a chatbot and having a conversation: you don't have the conversation to the right, because it hasn't happened yet; you only have the stuff to the left.
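
A sketch of how raw text becomes those "free" training puzzles: every prefix predicts its next word (autoregressive), and random words can be blanked out and predicted from the rest (fill-in-the-blank). Illustrative code, not any particular training pipeline.

```python
# Turning one sentence into many self-supervised training examples.
import random

text = "Zurich is the largest city in Switzerland".split()

# Autoregressive examples: (context to the left, target next word).
autoregressive = [(text[:i], text[i]) for i in range(1, len(text))]
# e.g. (['Zurich'], 'is'), (['Zurich', 'is'], 'the'), (['Zurich', 'is', 'the'], 'largest'), ...

# Fill-in-the-blank examples: the same sentence yields many different puzzles.
def fill_in_the_blank(words, n_blanks=2, seed=0):
    rng = random.Random(seed)
    positions = set(rng.sample(range(len(words)), n_blanks))
    corrupted = ["___" if i in positions else w for i, w in enumerate(words)]
    targets = [words[i] for i in sorted(positions)]
    return corrupted, targets

print(autoregressive[2])
print(fill_in_the_blank(text, seed=1))   # a different seed gives a different puzzle from the same text
```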

00:21:33 Jeff Dean

In 2021, some other colleagues of mine developed a way of mapping image tasks onto a Transformer-based model. Prior to that, most people were using convolutional neural networks of some form. Essentially, they take an image and break it into patches, and in the same way that Word2Vec embeds words into a dense representation, they take patches of pixels and do a similar thing, representing each patch with some high-dimensional vector. It might incorporate aspects of the color and the kinds of orientations of things, but those embeddings are learned, and once you have them, you feed them into the rest of the Transformer model: instead of using word embeddings for the input, you use patch embeddings. That enables you to deal with image data. As you'll see later, when you're training multimodal models you can combine these, so you can put in either text or images; the visual patches you embed with a visual model, and the text you embed with the early part of a text model. And you can visualize that the attention operation in the Transformer, on the right-hand side here, is actually attending to interesting parts of the image. When it's asked what's in the image, it's attending to the airplane or the dog; or, when it's got a snake and there's a lot of confusing stuff, the attention is less focused, looking all over the image for the visual clues that might enable it to predict the right thing. So this has been hugely influential in unifying Transformers for text with Transformers for images.
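
A sketch of that patch-embedding step (illustrative; the projection in a real model is learned): cut the image into patches, flatten each patch, and project it to the same kind of vector a word embedding would occupy, so the rest of the Transformer can treat patches like tokens.

```python
# Turning an image into a sequence of "visual tokens" via patch embeddings.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))              # H x W x channels
P, dim = 16, 768                               # patch size and embedding dimension

def patchify(img, patch=P):
    H, W, C = img.shape
    patches = img.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4)             # (rows, cols, patch, patch, C)
    return patches.reshape(-1, patch * patch * C)          # one flattened vector per patch

W_proj = rng.normal(scale=0.02, size=(P * P * 3, dim))     # learned in a real model
patch_embeddings = patchify(image) @ W_proj
print(patch_embeddings.shape)                  # (196, 768): a 14x14 grid of patch tokens
```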

00:25:25 Jeff Dean

In 2018, we started to think about what better software abstractions we could have for these large distributed machine learning computations. We knew we wanted to train models at larger scale, so each of these smaller boxes with yellow dots in it you can think of as a TPU pod, and you want to be able to train a model where you can connect together many of these TPU pods in software, and have an underlying distributed system manage the right communication mechanism when one chip needs to talk to another. When two yellow chips in the same small box need to talk to each other, you use the very high-speed TPU network. When, say, a chip in the upper-left box needs to talk to one in a pod in the same building, it uses the data center network within that building. If it needs to talk across buildings, it uses the network that goes between buildings in the same data center facility, and you can even have TPU pods connected together in different regions via larger wide-area network links, that big orange-red arrow. Having this nice scalable software simplifies running these large-scale computations. In fact, one of the abstractions that Pathways gives to the machine learning developer or researcher is that you just have a single Python process, and JAX has a notion of devices. Normally, if you're running on a single machine with, say, four TPU chips in it, it just shows up as a process with four devices. But when you run JAX with Pathways underneath it, all the chips in the entire training job show up as devices for JAX. So you have a single Python process, and it looks like you just have a single sea of, say, 10,000 or 20,000 TPU devices, and you can run computations on that; Pathways takes care of mapping the computation onto the actual physical devices. One of the things we did just last week was to make the Pathways system, which we've used internally for six or so years, available to cloud customers using our Cloud TPU products.
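
A small sketch of what that abstraction looks like from the researcher's side: a JAX program is one Python process with a flat list of devices, whether that is a handful of local chips or, under Pathways, many thousands spread across pods. This is illustrative only; the actual Pathways-on-Cloud setup details are not shown here.

```python
# One Python process, one flat list of devices, regardless of where the chips physically are.
import jax
import jax.numpy as jnp

devices = jax.devices()
print(len(devices), devices[:4])     # e.g. 4 on a single host; could be ~10,000 under Pathways

# Map a computation across whatever devices are visible, without caring where they live.
x = jnp.arange(len(devices) * 8).reshape(len(devices), 8)
y = jax.pmap(lambda v: v * 2)(x)     # one program, mapped over all visible devices
print(y.shape)
```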

00:27:57 Jeff Dean

Another observation by some colleagues of mine was that thinking longer at inference time is very useful. In the same way that your third-grade math teacher told you to show your work when solving problems, because you were more likely to get the sequence of steps right and solve the problem correctly, it turns out large language models are the same way. If you just give them an example problem, "Shawn has five toys. For Christmas he got two toys each from his mom and his dad. How many toys does he have now? The answer is 9.", that's the one-shot example in the input, and now you ask the new problem: "John takes care of 10 dogs. Each dog takes half an hour a day to walk and take care of business. How many hours a week does he spend taking care of dogs?" The model got this particular problem wrong; it said 50, which is not correct. But if you encourage the model to show its work, by actually showing, in the one example problem you've given it, the sequence of steps to work out the problem, "Shawn started with five toys. If he got two toys each from his mom and his dad, then he has four more toys. 5 + 4 = 9. The answer is 9.", that seems very simple, but it turns out this tremendously helps models become more accurate, because they are now encouraged to think through the steps of the problem in a finer-grained way. And you see that as the model scale increases, the solve rate goes up somewhat if you just use standard prompting, but goes up dramatically when you use chain-of-thought prompting. This is for a benchmark of roughly eighth-grade-level math problems. So prompting the model to show its work improves accuracy on reasoning tasks. You can also think of this as a way of using more compute at inference time, because the model now has to produce all these extra tokens in order to actually get to the right form of answer.
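
A sketch of the difference being described: the same one-shot example written either as a bare answer or with its intermediate steps spelled out. The prompt strings below paraphrase the examples from the talk; the exact wording sent to any particular model is an assumption.

```python
# Standard one-shot prompt vs. chain-of-thought one-shot prompt for the same question.
EXAMPLE_Q = ("Shawn has five toys. For Christmas he got two toys each from his "
             "mom and his dad. How many toys does he have now?")
NEW_Q = ("John takes care of 10 dogs. Each dog takes half an hour a day to walk and "
         "take care of business. How many hours a week does he spend taking care of dogs?")

standard_prompt = (
    f"Q: {EXAMPLE_Q}\nA: The answer is 9.\n\n"
    f"Q: {NEW_Q}\nA:"
)

chain_of_thought_prompt = (
    f"Q: {EXAMPLE_Q}\n"
    "A: Shawn started with 5 toys. He got 2 toys each from his mom and his dad, "
    "so that is 4 more toys. 5 + 4 = 9. The answer is 9.\n\n"
    f"Q: {NEW_Q}\nA:"
)

# Either string is sent to the model as-is; only the second invites step-by-step reasoning
# (10 dogs x 0.5 hours = 5 hours/day, x 7 days = 35 hours/week).
print(chain_of_thought_prompt)
```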

00:30:03 Jeff Dean

In 2014, Geoffrey Hinton, Oriol Vinyals, and I developed a technique called distillation, "distilling the knowledge in a neural network." The idea is that you have a really good model and you want to put its knowledge into a different model, typically a smaller one.

00:30:23 Jeff Dean

The typical way you train the small model is, let's say, next-token prediction. The prefix you see is "perform the concerto for ___", and the true next word is "violin". So you train your language model with that objective: if you guess "violin" correctly, great; if you guess wrong, you get some backpropagation error from the training objective.

00:30:48 Jeff Dean

It turns out that works okay, but if you can use your teacher model to give you not just the correct answer but a distribution over what it thinks are good answers for this particular word, it gives you a much richer training signal.

00:31:06 Jeff Dean

So think of the objective you get from the original hard target: just "violin", so you get a zero for everything except violin, and then a one. But here the distribution of probabilities is violin 0.4, piano 0.2, trumpet 0.01, and "airplane" is extremely unlikely in this circumstance. A concerto for airplane? I don't know. I guess you could have one, but it's unlikely.

00:31:36 Jeff Dean

And that really rich gradient signal is something you can use to inject much more knowledge into every training example for the smaller model, and it enables you to get to convergence much more quickly.
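
A sketch of that "richer signal": the student is trained toward the teacher's full distribution over next words rather than a one-hot target. The probabilities below are the illustrative numbers from the talk, not real model outputs.

```python
# Hard one-hot target vs. soft teacher targets for the word after "perform the concerto for".
import numpy as np

vocab = ["violin", "piano", "trumpet", "airplane"]
one_hot_target = np.array([1.0, 0.0, 0.0, 0.0])           # only "violin" counts
teacher_probs  = np.array([0.4, 0.2, 0.01, 1e-6])
teacher_probs  = teacher_probs / teacher_probs.sum()       # renormalize the illustrative numbers

student_logits = np.array([0.2, 0.1, -0.3, -1.0])
student_probs  = np.exp(student_logits) / np.exp(student_logits).sum()

def cross_entropy(target, pred):
    return -np.sum(target * np.log(pred + 1e-12))

hard_loss = cross_entropy(one_hot_target, student_probs)   # gradient only says "more violin"
soft_loss = cross_entropy(teacher_probs, student_probs)    # also says "piano is plausible,
print(hard_loss, soft_loss)                                 #  airplane is essentially impossible"
```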

00:31:49 Jeff Dean

If you look at some of these comparisons, this is a speech-based setting where you have a training frame accuracy, but what you really care about is the test frame accuracy: did you predict the sound in this frame of audio correctly? The baseline with 100% of the training data gets 58.9% test frame accuracy. If you strip the training set down to only 3% of the training data, your training frame accuracy actually goes up, because your model overfits to the now very small number of training examples, but your test frame accuracy plummets, because in that overfitting regime you can't do very well on new test examples you've never seen before. But if you use the soft targets produced by the distillation process and use only 3% of the training data, you get pretty good training frame accuracy and almost the same test frame accuracy, with only 3% of the data. That's a really nice property, because it means you can transfer the knowledge of a large neural network into a small neural network and make it almost as accurate as the large one.

00:33:01 Jeff Dean

So this was rejected from NeurIPS 2014. We published it in a workshop and put it on arXiv, and it now has 24,000 citations, so we'll take it.

00:33:17 Jeff Dean

In 2022, some colleagues and I looked at different ways of mapping computation onto our TPU pods for efficient inference. There's a whole bunch of variations one can consider: do you keep the weights stationary in one of the dimensions of the network? Do you keep them stationary in both dimensions, so that your weights are spread across a two-dimensional set of chips? Or do you gather the weights and bring them to the data? The details aren't that important, but there are a bunch of different ways of doing it, and the right choice actually depends on a lot of different factors. One is your batch size, which can have a lot of influence on which of these techniques is actually better, and latency constraints can also have a big effect. So we have these different techniques, 2D weight stationary, X weight gathered, and XY weight gathered, and there's even another one, XYZ weight gathered. What you see is that the little dotted segments at the bottom of these curves are the best thing to do at various batch sizes, and that the right answer changes as you change the batch size. That also means the floating point utilization of your hardware changes depending on your strategy, and the right answer depends on how large your batch size is: at very small batch sizes you want 2D weight stationary, and at larger batch sizes a weight-gathered approach (I reversed that at first). It's just to say there are a lot of complicated choices in how you decide to partition a model and do inference at scale.

00:35:12 Jeff Dean

In 2023, some colleagues of mine developed a technique called speculative decoding. The idea is that we're going to use a small draft model, maybe 10 to 20 times smaller than the large model, with the observation that many things are actually quite predictable by a small model. So we can sequentially predict from the very small drafter model much more rapidly than we can sequentially predict from the very large model. We predict the next K tokens with the small model, then ask the large model to check those K tokens in a row, and we can advance the generation by as many tokens as match in the prefix of size K. Essentially, if you do this with just the large, slow model, it's going to trundle along predicting one word at a time. But if you do this with the drafter model, you see the drafter predicting four or five words at a time, and then the larger model advances by as many of the words as match what the drafter model created for you. By making predictions for K words at a time, you essentially amortize the memory overhead of bringing in the weights of the large model over K words instead of just one.
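
A toy sketch of that loop. The "models" here are stand-in functions over a tiny vocabulary; the point is the accept-the-longest-matching-prefix control flow, not real model behavior (a real system also scores the K positions in one batched forward pass rather than a Python loop).

```python
# Speculative decoding control flow with placeholder draft and target "models".
def draft_next(context):            # small, fast drafter (placeholder logic)
    return (sum(context) * 7 + len(context)) % 50

def target_next(context):           # large, slow model (placeholder logic, mostly agrees)
    guess = draft_next(context)
    return guess if len(context) % 5 else (guess + 1) % 50   # occasionally disagrees

def speculative_decode(prompt, n_tokens=20, K=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Drafter proposes K tokens sequentially (cheap).
        draft = []
        for _ in range(K):
            draft.append(draft_next(out + draft))
        # 2. Large model predicts at the same K positions (one pass in a real system).
        verified = [target_next(out + draft[:i]) for i in range(K)]
        # 3. Accept the longest matching prefix, plus the large model's correction if any.
        n_match = next((i for i in range(K) if draft[i] != verified[i]), K)
        out.extend(draft if n_match == K else verified[:n_match + 1])
    return out

print(speculative_decode([3, 1, 4], n_tokens=12, K=4))
```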

00:36:43 Jeff Dean

So, there's an awful lot of things that have happened, all combining to really improve the quality of the models people are seeing today. Better accelerator hardware: that's true of TPUs, but NVIDIA GPUs have also gotten a lot better in recent years for machine learning focused applications. Software abstractions are really important, because they give you these nice layers where you can focus a lot on performance underneath the abstractions, and people on top can build useful things without necessarily having to think about the details underneath. Model architectures have seen huge improvements, in particular Transformers, Vision Transformers, and MoEs, which are heavily used in the most modern models. Training algorithms: unsupervised and self-supervised learning, asynchronous training, distillation; I didn't talk about supervised fine-tuning after you've pre-trained your model, or RL from human feedback or other kinds of computational feedback, but that's a super important aspect. Chain of thought, speculative decoding, and inference-time compute scaling. All of these are really, really important in the modern era.

00:38:02 Jeff Dean

So, now I'm going to talk a little bit about the Gemini models we've been training and how most of these innovations are used in various iterations of the Gemini models. Gemini is really a project that started as a collaboration between Google DeepMind, Google Research, and the rest of Google. We started it in February 2023, with our goal being to train the best multimodal models in the world and use them all across Google. There are all kinds of ways in which these models can help various Google products, and they're also available externally through our cloud APIs. This is a timeline of what we've been up to since February 2023: we released Gemini 1.0 in December 2023, followed soon thereafter by Gemini 1.5, and so on.

00:38:59 Jeff Dean

One of the things we wanted was to make these models multimodal from the very beginning, because we felt that text-only models were not as useful as models that could understand language, understand visual inputs, understand audio, and also produce all of those things. The initial versions of the model did not produce audio as output, but they could take audio, video, images, and text as input and produce images and text as output. We've since added the ability to produce audio output as well.

00:39:34 Jeff Dean

Gemini 1.5 introduced a very long context length, so you can provide inputs that are millions of tokens in length. Think about it: a thousand-page document is about a million tokens, so you can now put 50 research papers, or a very long book, or multiple books into the context window. One of the nice things about input data in the context window, particularly for Transformer models because of the attention mechanism, is that that information is very, very clear to the model. Unlike training data, where you've trained on trillions of tokens and optimized your billions or tens of billions of parameters with those trillions of tokens, so you've kind of stirred it all together and lost a little bit of the fidelity of the exact pieces of information, in the context window that information is very clear to the model, and it enables the model to extract, summarize, and reason over that data much more capably than over other kinds of data.

00:40:41 Jeff Dean

And in Gemini 2.0, as I said, these models build on a lot of these innovations. We use TPUs; we do cross-data-center training across metropolitan areas, using Pathways, with JAX on top of that. The distributed representations of words and image data are super important, as are Transformers, sparse mixture of experts, and distillation, and a lot more besides, but these all come together in our model training recipe and our model serving recipes. Just about a month ago, we released Gemini 2.5 Pro, which is our most recent model. It has been pretty well received, because it represents a pretty significant leap forward on some of the various benchmarks it performs on; it's gotten a lot better at coding compared to our previous Gemini models. There's also an arena for comparing model quality across different models, run by LM Arena, which is a Berkeley-affiliated group of grad students. They enable users to enter a prompt, then pick two random models behind the scenes and show the output from both models to the user anonymously, so you don't know which model is which, and you're asked which output you like better. So it's a head-to-head competition of language models, and through thousands of trials like this you can get a very good sense of the strength of the models, at least in terms of how well the answers reflect what the people using LM Arena like. We've found it pretty useful; it does correlate quite well with the strength of the models. And Gemini 2.5 Pro has a pretty significant Elo improvement over our previous models. It's also done pretty well on a whole bunch of independent evaluations people run across the web, and on various more academic benchmarks on the left there. We are sadly number four on New York Times Connections, so we'll have to work on that, but in general this set of leaderboards covers quite a broad set of areas: some are coding-related, some math-related, some multimodal. We really try to focus on making good general-purpose models that are good at a lot of different things.
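
Elo ratings like the LM Arena ones referenced here translate into expected head-to-head win rates via the standard Elo formula; the ratings in this small sketch are made up for illustration, not actual leaderboard numbers.

```python
# What an Elo gap means for those anonymous side-by-side votes.
def expected_win_rate(rating_a, rating_b):
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

print(expected_win_rate(1380, 1340))   # +40 Elo  -> preferred in ~56% of head-to-head trials
print(expected_win_rate(1440, 1340))   # +100 Elo -> preferred in ~64% of head-to-head trials
```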

00:43:17 Jeff Dean

Users are generally enjoying this. Some of this is a little over-the-top phrasing, but whatever. People do seem to like it.

00:43:28 Jeff Dean

In particular, the long context abilities are really, really good for coding, especially now that the reasoning capabilities of the model are also greatly improved. Having a million or two million tokens of context enables you to put quite large code bases entirely into the context window and then ask the model to do fairly complicated things, like: can you please refactor this for me, or can you introduce a new feature that has this sort of property? It also enables you to process other kinds of data. This person at the bottom has a dataset of a thousand poems, 230,000 tokens, and then asked a bunch of things that require reasoning over all those poems. They were quite impressed by that, because I guess that's hard. Very, very good. Anyway.

00:44:25 Jeff Dean

So, one of the things we really focus on: if you look at the Y axis here, this is the Elo score I mentioned from LM Arena, so a higher Elo score means a more capable, higher-quality model as judged by those users. On the X axis is the cost of a whole bunch of different commercial models. Importantly, the X axis is a log scale, so don't miss that important point. (What, an advertising opportunity? Oh, yes, that really rings a bell.) Where you want to be is as far up and to the right as you possibly can. We produce a series of different models with different quality and cost trade-offs. Our Flash models, over to the right, are generally quite cheap, like 15 cents per million tokens. Our most recent 2.5 Pro model is more expensive, because it's a much heavier-weight model and costs more for us to run, but it's still quite affordable for the quality you get. Generally, we like to see that we have a variety of offerings on the Pareto frontier of this quality-cost trade-off, and we're going to keep working to push up and to the right there as much as we possibly can.

00:45:51 Jeff Dean

Okay. Let me talk a little bit... wait, what happened? It was there. I didn't do it, I promise.

00:46:02 Jeff Dean

So, Gemini is a pretty large-scale effort. If you look at the Gemini 1.5 paper, we do have quite a few authors; it's very hard to write a short paper if you have to list all of them. But truly it's a large-scale team effort, and everyone here really contributed tremendously. One of the things we've had to figure out is how best to structure this so that many people can effectively contribute to a single model project. Some of the structuring techniques we use are to have different areas that people loosely affiliate with: some people are much more focused on, say, the pre-training process, or on data, or on safety, or on evals. Not to say these are hard boundaries, but generally some people affiliate with some of these areas more than others. There are overall tech leads for the project: that's myself, Oriol Vinyals, and Noam Shazeer. Then we have a really capable program management and product management team. Although Gemini is kind of a model-creation effort, it has a lot of product implications, because we want to release the model into lots of different surfaces at Google, and so interacting with all those other teams about what features they need, where they're seeing the model perform well, and, more importantly, where it's not performing well, and getting feedback from them, is really important. Then we have three broad categories of areas. One is model development: pre-training, where you're training on a large corpus of text and other multimodal data; post-training, where you've finished pre-training the model on lots of data and are now trying to coax it into behaving in certain ways with relatively small amounts of data, using things like reinforcement learning or supervised fine-tuning, so that it's polite in the responses it gives, or has a propensity to get mathematical problems more correct than the raw pre-trained model would; and on-device models, another important area, since we have Gemini models running on phones, which have a slightly different character than some of the larger data-center-based ones. The core areas are the ones that cut across most aspects of Gemini: training data, evaluations, infrastructure, the codebase for research and for expressing the production model training and inference systems; serving is really important; and longer-term research within Gemini. There's also a lot of research that happens outside of Gemini, and we keep an eye on that kind of work, and our colleagues will say, hey, we have something that might be sensible to consider for the next generation of Gemini. And then capabilities are generally about particular, narrower aspects of the model: can we make it safe and well behaved? Is it really good for coding? Can we make it good at vision tasks in particular, or audio tasks in particular? Agent behavior is now a very important aspect of what we're doing. Internationalization, because we want this thing to work well in hundreds of languages, not five. So those are the broad areas.

00:49:34 Jeff Dean

Of those people, roughly a third are in the San Francisco Bay Area; I'm based in Mountain View. About a third are in London, and a third are in a bunch of other places including Zurich, but New York City, Paris, Boston, Bangalore, Tel Aviv, and Seattle are some of the bigger concentrations outside the first two areas. Time zones are really annoying. I don't know if you all feel this, but the golden hours between the California West Coast and London or Europe during the work day are relatively limited: maybe two or three hours a day where you really have sensible meeting times for both sides, and past that one side is pretty much done for the day. Our poor Bangalore colleagues are never in golden hours with anyone else. But it is a worldwide effort, and there are some benefits to having people all around the world, because when the model is training, there's always someone awake and paying attention to a large-scale training run. Often you might fire off a question to a colleague in London and they're not there, but when you wake up in the morning they've answered and done a bunch of work on your behalf. So there are benefits, but distributed work is challenging.

00:50:52 Jeff Dean

One of the ways we've been able to make this work is that we have lots and lots of large and small discussions and information sharing conducted in virtual Google Chat spaces. I'm in 200 of these, so I wake up and I'm brushing my teeth and I get probably seven alerts while I'm brushing my teeth in the morning, because my London colleagues are busy at work and excited about sharing things in various chat rooms. We have a slightly formalized request-for-comments process; think of it as anywhere from a one-page to a ten-page document about some piece of work, or thread of work, or results that have been obtained, or experiments people are thinking about running. People give feedback, Google Docs style. We have a slightly formalized way for some of these to say: yes, we think this should make it into the next generation of our model training, the new recipe. We have leaderboards and common baselines to enable good data-driven decision making about how to improve the model. So there are many rounds of experimentation: lots of experiments at small scale, and you advance the smaller-scale experiments that seem promising to the next scale to see if the results hold up and are on trend. Every so often, every few weeks, you incorporate successful experiments demonstrated at the largest scale into a new candidate baseline, you run that candidate baseline and see if it's better than the previous baseline and whether there are any unexpected interactions among the few things you piled in there, and then you repeat. That's the way we do it, particularly for some of the pre-training recipe development.

00:52:38 Jeff Dean

I mentioned scaling of people, but scaling of computing hardware is also quite annoying. I'll give you just one example: silent data corruption. Despite best efforts, given the scale of these ML systems and the size of the training jobs, you will get hardware errors that sometimes are not detected by the hardware. And because it's a very large coupled system, these incorrect computations from one buggy chip can then spread to the entire model, non-deterministically producing incorrect results. That can happen with particular pieces of hardware, or on any piece of hardware randomly due to things like background radiation, and it becomes worse at scale with synchronous stochastic gradient descent, which can spread the bad results. So one of the things we do is, as we're training, we monitor the norm of our gradients, and if we see large spikes in it, we get concerned. Is it justified to be concerned? We don't know; it's certainly a large gradient relative to the ones we've seen recently. And you can also get anomalies with no silent data corruption error at all. The first one here was actually a silent data corruption error, and the way we detect that is we rewind a few steps and replay in a deterministic manner. If we see the same result, then we say, well, it must be in the data; it's probably not a hardware failure. If we see a different answer, though, that's concerning, because everything is supposed to be deterministic when we replay. In this case we saw an anomaly in the gradient, but when we replayed it we actually saw the same large gradient value occur in the replay as well. You can also detect SDCs if you happen to replay when there was no anomaly, right? That's probably the low bits of your exponent getting flipped by an error rather than the high bits. The high bits being flipped is bad, because then all of a sudden you have 10 to the 12 in the gradient when you expected 0.7. Yep. Okay.
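
A sketch of that detection logic (illustrative, not the actual training code): watch the gradient norm, and on a suspicious spike, rewind and replay deterministically; the same result again points at the data, a different result points at silent data corruption in the hardware.

```python
# Gradient-norm monitoring plus deterministic replay to triage suspected silent data corruption.
import numpy as np

def is_anomalous(grad_norm, recent_norms, factor=10.0):
    return len(recent_norms) > 0 and grad_norm > factor * np.median(recent_norms)

def check_step(compute_gradient, checkpoint, batch, recent_norms):
    grad = compute_gradient(checkpoint, batch)
    if not is_anomalous(np.linalg.norm(grad), recent_norms):
        return "ok"
    replayed = compute_gradient(checkpoint, batch)    # deterministic replay from the checkpoint
    if np.allclose(grad, replayed):
        return "anomaly reproduced: likely the data (or a deterministic bug), not the chip"
    return "anomaly not reproduced: suspect silent data corruption, quarantine the hardware"

# Toy usage with a fake, deterministic gradient function.
fake_grad = lambda ckpt, batch: np.full(4, 0.7)
print(check_step(fake_grad, checkpoint=None, batch=None, recent_norms=[1.0, 1.1, 0.9]))
```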

00:55:05 Jeff Dean

I'm going to skip that and give you some examples of what these models can do. They can help fix bugs in your code, which is nice. This person uploaded their entire code base and all the issues, and it identified the urgent one: the code was calling some handler twice, and so the fix added a flag recording whether the handler had been called, and only calls it if it hasn't.

00:55:32 Jeff Dean

In-context learning: Kalamang is a language spoken by about 200 people in the world, and there's a woman who wrote a PhD thesis on a grammar of Kalamang. There's effectively no written training data about Kalamang on the internet. But what we've observed is that if you put this book into the model's context and then ask it to translate English to Kalamang or Kalamang to English, it can actually do about as well as a human language learner who's been given the grammar book and a dictionary for Kalamang. That's kind of nice, because it shows in-context learning at the level of: I put in a 400-page PhD thesis about a topic the model has no idea about, and it's actually able to make sense of Kalamang and translate it with that.

00:56:26 Jeff Dean

Video of a bookshelf to JSON. It's kind of fun; you might not have thought of that as an input method, but you can do that. It's kind of good.

00:56:39 Jeff Dean

Video understanding and summarization: you can actually put in fairly long videos; a million tokens is about two hours of video. The prompt here is: in a table, please write the sport, the team and athletes involved, the year, and a short description of why each of these moments in sports is so iconic. The model gets to see the pixels of the video and the audio track; it's an 11-minute video, I think. And the output of the model is that, which is probably more structured data extraction than you thought you might be able to get out of an in-context video. So I think people have not yet clued into the fact that you can take multimodal data like that and do pretty interesting things.

00:57:26 Jeff Dean

Digitization of historical data: I just saw this the other day. You can take weather data that looks like that from 100 years ago and just say, please give it to me in JSON, and it will do that. They've done it for 144 tables, and it cost them 10 pence, and now they're able to unlock all this weather data.

00:57:48 Jeff Dean

Okay, code generation via a high-level prompt. Here's the prompt we're going to give to our Gemini 2.5 model: "p5.js to explore a Mandelbrot set." That's the prompt.

00:58:00 Jeff Dean

Oh, it can't... oh, I'm so sad. Why is it not able to do that? It was working before. Oh, I'm not on the Wi-Fi. It's true, I'm not. Well, anyway, it makes a really nice interactive visual Mandelbrot explorer, like that.

00:58:20 Jeff Dean

And then I will skip over the next thing, which... oops, sorry about that. I am going to skip to a very brief part of this because I'm now running longer than I thought.

00:58:34Jeff Dean

So now that we have these models, what will this all mean for us and society? I think it's a really important set of topics. So I and eight other co-authors recently got together and wrote this paper called Shaping AI's Impact on Billions of Lives: a bunch of computer scientists and people with machine learning backgrounds, from academia, big tech companies, and startups. And we wanted to propose what the impact of AI in the world could be given directed research and policy efforts. A lot of people in this space are thinking about what will happen with AI if we're just laissez-faire: will we all be doomed, or will we have incredible advances? I think a really pragmatic approach is to say, let's, as society and machine learning researchers and practitioners and experts, all work together to try to shape things so that we get the best aspects of AI and minimize the downsides. And really that is what this paper was intended to be: a discussion of how we might do that collectively. So, and there's the audience: it's all of those people. And we interviewed 24 different experts in seven different fields, including employment, education, healthcare, information, and media. So for example, we talked to former President Barack Obama; Sal Khan on education; John Jumper, who we talked to before he won the Nobel Prize, but he won the Nobel Prize later; Neal Stephenson; Dario Amodei; Bob Oter. And we uncovered five guidelines for AI for public good. I will skip everything after this, but you can see shaping.com; there's an arXiv paper from that site that I think is a pretty nice discussion of what will happen, or what could happen, in a bunch of different areas including employment, education, and healthcare. And it's pretty important for us to all work together to get this right. We also proposed some nice milestones for what people should work on in some of these areas; that was part of the motivation of the paper. So with that, I will conclude: these models are becoming incredibly powerful and useful tools. And I think you're going to see continued improvement as there's more investment and more people in the field doing research, and those advances get incorporated into the leading models. You're going to see even more capable models. This is going to have a dramatic impact in a lot of areas, and it's going to potentially make really deep expertise available to a lot of people across a lot of different areas. And I think that's one of the things that is both most exciting but also kind of disconcerting to some people: that expertise being widely available. But done well, I think our AI-assisted future is really bright. All right.

01:01:35Jeff Dean

Thank you.

01:01:48Host

Thank you very much for the great talk. A little token of appreciation from the department.

01:01:53Jeff Dean

Thank you so much.

01:01:54Host

Uh some chocolates and a systems group t-shirt.

01:01:56Jeff Dean

I love coming to Switzerland because I get chocolate and a t-shirt.

01:01:59Host

Thank you very much.

01:02:00Jeff Dean

So, yeah.

01:02:01Host

And uh we'll now proceed to the Q&A. So we have one mic and we have one cube that we can toss around. And we've discussed that we'll also try to prioritize students especially for questions. If you can, yeah, raise your hands if you have questions, and you can point in a general area; my aim is probably not that great anyway.

01:02:20Jeff Dean

Okay.

01:02:21Host

Okay.

01:02:22Jeff Dean

Okay. Nice.

01:02:25Host

Oh, well done.

01:02:31Audience Member

Hi, uh, yeah, thank you so much, and especially for the last paper you presented. Oh, yeah. Hold it, hold it up to your mouth. Like this? Yeah, perfect. Oh, there we go. So, thank you for the talk and especially the last paper. It's very important, I think. And so, on that point a bit: AI safety is definitely on our minds, I think. And it's super unclear, especially from outside, for example, the big research labs, what would even be positive, what would be really impactful. So, from the perspective of really making sure everything goes well, everything stays in human control and so on: what would you do as maybe a PhD student starting a thesis, a professor with a bunch of research grant money, or even a startup (let's say you could acquire a startup this year), what would it do in the area of AI safety?

01:03:22Jeff Dean

Particularly in the area of AI safety. Yes, exactly. Yeah. So, I mean, I think AI safety is a pretty broad topic. I think there's a bunch of concerns about the increasing capabilities of these models enabling people to do things that they wouldn't otherwise be able to do that are somewhat nefarious, or undesirable from a societal perspective. So I think some of that can be addressed with technical means, but I also think there are going to need to be policy-based and regulatory-based approaches that impose some restrictions on some aspects of that. One of the topics that we covered in the paper was misinformation and public discourse. And there, I think there's clearly an ability for AI models to create more realistic misinformation in the world, and to enable people to create it at mass scale with lower cost. Misinformation is not a new thing; you could always create it, but now you have these tools that enable more realistic and more rapid creation. So that is definitely an issue. I think there's a corresponding research question of how you detect misinformation that is perhaps generated by a different machine learning model. And then, turning the problem onto a more positive spin, one of the things we suggested in the paper was that there's actually some early evidence that AI models can be used to enable more constructive discourse in online forums. And so that's an area where I think looking at how AI models could encourage more positive conversations, and identify misinformation in the flows of conversations that people are having with each other, is pretty interesting. But there's a whole bunch of ideas in that paper that I think are worthy of study. And I don't think the solution is necessarily going to be purely technical for all of these problems.

01:05:33Jeff Dean

Okay?

01:05:34Jeff Dean

Yeah, thank you. Yep.

01:05:38Host

And send the cube over to him, but we'll take someone else for the moment. If that's okay. Sure. Yes. Where was the question here?

01:05:46Jeff Dean

I thought there was one over here. Yeah, there we go.

01:05:51Jeff Dean

Should I? Yep.

01:05:53Audience Member

All right. So, when I go to social networks, I'm very hyped, right? I see messages like the ones that you showed, so these LLMs are truly incredible. However, in my day-to-day work, when I try to use AI or LLMs, I'm often disappointed. Who needs the training here? Is it the LLM that needs more training, or is it me that's asking wrong?

01:06:22Jeff Dean

It's an excellent question. I suspect the answer is a bit of both, right? First, the arc of progress in these models has gotten quite steep. The Gemini models from eight months ago are not nearly as good as the Gemini models now, and so sometimes people develop an impression of what the models are capable of from a previous experience where they asked for something complicated and it failed miserably. But now that might be something that is on the border of possibility, or that actually works really well. So I think part of it is looking at what the current models can do, not what the ones of ancient history eight months ago could do. Another aspect is becoming facile with how to coax the models to do what you want. It's quite interesting that with a one-page, carefully crafted prompt, you can almost create a completely different application of a general model than if you craft a different one-page prompt. One one-page prompt might say: can you take these video contents and please make me an educational game that reflects the concepts explored in the lecture video? And it will actually, in some cases, create a fully working software-based game that highlights the concepts in an arbitrary lecture or scientific video. It doesn't always work, but that is kind of at the frontier of possibilities now; maybe 30% of the time it might work, or something. But also, more training for the models will help, because then the models are going to get better. And I think you're seeing this from Gemini 1 to 1.5 to 2 to 2.5: a lot of progress. And I suspect the Gemini 3.0 models and beyond will be substantially better than the current ones. That's a general trend in the industry: the models are becoming better.

01:08:27Jeff Dean

You with the cube. Yep.

01:08:28Host

Uh, thanks for your talk.

01:08:30Audience Member

Um, I noticed on your slide where you summarized all of the innovations in AI. Uh you listed hardware, you listed algorithms, you listed um, yeah, all the improvements, but data was absent. And there's lots of concerns in the field that data might be the new bottleneck. Um I'm curious about your personal opinion on this. Is it a bottleneck? And if not, how do people get by? How do we get past scraping all of the internet?

01:08:54Jeff Dean

Yeah, I guess I didn't list data, but it has been really, really important. It's just that there's not a specific artifact to point to for a lot of the data-related work; it's really about curation of high-quality data, which we spend a lot of time on, say, within the Gemini project. I think there's a concern I've heard that we're running out of high-quality data with which to improve the capabilities of these models, and I find that not very credible at the moment. First, there's an awful lot of data we're not training on right now. If you think about all the video data in the world: we're training on some video data, but it's a very tiny fraction of, say, the YouTube corpus, and that's only some of the video in the world. So I don't think we're close to running out of raw data. The other thing I would say, as an ML research problem, is that there's a whole bunch of work we can do to get more quality improvement from the model per unit of training, or per token of training data. We were just discussing this in a session earlier: say you have a two-sentence description of how to add numbers together. The model is just trained to absorb that by predicting the next token, but that doesn't generally mean it's actually learned the algorithm for adding two numbers together in a deep and algorithmic way. It's got a next-token predictor for predicting the rule, but in some sense it's oblivious to the actual algorithm. If you think about what you would really want the model to be able to do, it would be to read that algorithm and then build a representation internally that enables it to run that algorithm when it needs to. That would be extracting way more value out of those 15 tokens than what it does currently. So I think there's lots of room to go. The other thing I would say in this space is that in the ImageNet convolutional neural network era, people were training on a million images with a thousand categories, and one of the ways they would make the models more powerful is they would make many, many passes over that training data. The textual data corpus we have is large enough that we're not able to computationally afford to make lots and lots of passes over it. But with improving hardware capabilities, you might be able to make 50 passes over the data instead of three. And that would probably improve the quality of the model, but we don't know by how much, and so on.

01:11:38Host

Thanks a lot for the super interesting talk.

01:11:40Audience Member

Um, where in your personal life or work life do you use AI most, and where do you use it least because it doesn't work yet? Like, what are you surprised by on both ends of the capability spectrum, as an employee of, you know, a research lab? Or a leader, sorry.

01:11:58Jeff Dean

Yeah, I mean, I think where I personally use it, and where many of my colleagues use it, is helping to write some bits of code. I often tend to ask it to do things that are not super complicated; I think with the more capable models, I should start venturing out, as this gentleman perhaps should, to more and more expectations of what the model can do. But it will do a reasonable job of writing test cases for code I've written, or extensions of things that are straightforward. I've used it to generate images for various kinds of things; I think I used it for this kind of thing. I use it to summarize papers, or I put in a large piece of textual content and ask it questions about that. But I think more and more you're seeing people integrate use of these models into things they find they're able to do that are useful for them, and I think that's the general trend in society.

01:13:04Host

And where doesn't it work? Like, where have you tried it and it just doesn't work?

01:13:09Jeff Dean

Uh, I mean, I have asked it to do more complicated coding questions and sometimes it works, sometimes it doesn't and then you're like, oh okay, I understand why it didn't work because that's pretty complicated and it would have taken me a long time to figure out. So.

01:13:20Host

Thanks.

01:13:21Jeff Dean

Okay.

01:13:27Host

Yeah.

01:13:28Jeff Dean

And thank you.

01:13:29Host

Hang on. Let's let's get him a microphone.

01:13:32Audience Member

Um, thank you for your presentation, it was super interesting. I was wondering, for upcoming research, what would be the most interesting part to focus on? Is it more important to improve transformers for the computer vision area, or AI safety with regard to preventing hallucination in large language models? Which would be the most important part that you are going to focus on?

01:13:59Jeff Dean

Yeah, I mean, I think one of the beauties of this field is that it's not that there's just one important problem; there are many, many important problems. So one of the meta things I do when I'm trying to think about research topics is to try to pick something where, if I make progress on it, or we as a collective set of colleagues make progress on it, something important will be advanced. So I think you want to avoid incremental things where, even if the best possible outcome happens, you're still kind of unimpressed. But I think all the areas you mentioned, and like 50 other ones besides, are really important. Other ones that I'm personally thinking about: how can we have much more efficient inference hardware? How can you have much larger context windows for these models than a million tokens? How do you identify higher quality data? How do you scale infrastructure? How do you do asynchronous training in a better way, in a distributed fashion with low bandwidth between the systems? How do you have interesting, more exotic, sparser model structures than just branching out to experts and coming back together, which seems relatively too simple compared to truly sparse, interesting model structures? So there are like 50 other ideas I could rattle off. You should pick something you're really excited about and that you think will matter.

01:15:34Host

Thank you.

01:15:36Jeff Dean

one more question?

01:15:37Host

Yeah, one more question.

01:15:38Jeff Dean

Uh, uh, I don't know. You picked.

01:15:50Host

How about we get one further in the back because we have ignored the back. The gentleman in the back t-shirt there. And it's close enough to throw.

01:15:58Jeff Dean

There we go.

01:15:59Host

See?

01:15:59Jeff Dean

Oh.

01:16:00Host

We're two for two on our passing. I love it.

01:16:04Audience Member

Hi, uh, thank you very much for the presentation, it was incredible. My question is about what's the next challenge. Because I see that these models are getting better and better on all the benchmarks, gradually. But is there some sort of binary challenge, some outcome that they are not yet able to achieve? I don't know, formal reasoning, some activity that, let's call it, would be the next breakthrough?

01:16:28Host

Mhm.

01:16:29Jeff Dean

Yeah, I mean, I think one thing that's not quite a discrete step, but that I think is going to be very hard, is this: if you think about what we're going to want the models to be able to do, it's to operate a bit autonomously and to do fairly complicated things that you ask the model to do with relative independence. Can you go off and plan me a visit to Zurich for two days, because I have a couple of extra days and I want to do some fun stuff? That is a little ambiguous; it might require the model to use some tools to go figure out, well, what is this Zurich place and what could I do here? And then I think what you're seeing is that the models are capable of breaking down complex things into a few steps, maybe doing some limited amount of tool use to chain some things together, in order to do those relatively simple tasks. But you're not seeing models able to take a very complicated thing and break it down into 50 sub-steps on their own, use many, many complicated tools, and accomplish some major piece of work that might take you two months. And I think there's a huge, vast difference between where we are now, which is models that can do those kind of three or four or five step tasks with maybe 60 to 70% accuracy, and a system that can do a month of work in a thousand steps with 95% accuracy. That is where people would like to get systems going, but it's a very vast gulf between where we are now and what one imagines would be possible, but is definitely not now. So I think that's maybe a continuum rather than a single thing where suddenly now you can do this. But you will see more and more capabilities of the models as they can do, say, 10-step things with 90% accuracy as an intermediate point.
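
One way to see how wide that gulf is: if you assume (simplistically, my assumption rather than anything from the talk) that each step succeeds independently with the same probability p, then a chain of n steps succeeds with probability about p^n, so long chains demand near-perfect per-step reliability.

```python
# Rough arithmetic for chained agent steps under an independence assumption.

def chain_success(per_step: float, steps: int) -> float:
    return per_step ** steps

def required_per_step(target: float, steps: int) -> float:
    return target ** (1.0 / steps)

print(chain_success(0.70, 5))          # ~0.17: 70% per-step reliability over 5 steps
print(chain_success(0.95, 1000))       # ~5e-23: 95% per-step is hopeless over 1000 steps
print(required_per_step(0.95, 1000))   # ~0.99995 per-step needed for 95% end-to-end
```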

01:18:40Host

All right.

01:18:41Jeff Dean

All right.

01:18:41Host

Thank you very much. Let's thank Jeff one more time for the great talk.
