TextPurr Logo

TextPurr

Loading...
Loading...

Solving Wordle using information theory

3Blue1Brown
An excuse to teach a lesson on information theory and entropy. These lessons are funded by viewers: https://www.patreon.com/3blue1brown Special thanks to these supporters: https://3b1b.co/lessons/wordle#thanks An equally valuable form of support is to simply share the videos. Contents: 0:00 - What is Wordle? 2:43 - Initial ideas 8:04 - Information theory basics 18:15 - Incorporating word frequencies 27:49 - Final performance Original wordle site: https://www.powerlanguage.co.uk/wordle/ Music by Vincent Rubinetti. https://www.vincentrubinetti.com/ Shannon and von Neumann artwork by Kurt Bruns. https://www.instagram.com/p/CZpRKhMJnD6/ Code for this video: https://github.com/3b1b/videos/tree/master/_2022/wordle These animations are largely made using a custom python library, manim. See the FAQ comments here: https://www.3blue1brown.com/faq#manim https://github.com/3b1b/manim https://github.com/ManimCommunity/manim/ You can find code for specific videos and projects here: https://github.com/3b1b/videos/ Thanks to these viewers for their contributions to translations German: Thadaeus, styrix560, wolfsgier ------------------ 3blue1brown is a channel about animating math, in all senses of the word animate. And you know the drill with YouTube, if you want to stay posted on new videos, subscribe: http://3b1b.co/subscribe Various social media stuffs: Website: https://www.3blue1brown.com Twitter: https://twitter.com/3blue1brown Reddit: https://www.reddit.com/r/3blue1brown Instagram: https://www.instagram.com/3blue1brown_animations/ Patreon: https://patreon.com/3blue1brown Facebook: https://www.facebook.com/3blue1brown
Hosts: Grant Sanderson
📅February 06, 2022
⏱️00:30:38
🌐English

Disclaimer: The transcript on this page is for the YouTube video titled "Solving Wordle using information theory" from "3Blue1Brown". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.

Watch the original video here: https://www.youtube.com/watch?v=v68zYyaEmEA

00:00:00Grant Sanderson

The game Wordle has gone pretty viral in the last month or two, and never wanting to overlook an opportunity for a math lesson, it occurs to me that this game makes for a very good central example in a lesson about information theory, and in particular, a topic known as entropy.

💬 0 comments
00:00:12Grant Sanderson

You see, like a lot of people, I got kind of sucked into the puzzle, and like a lot of programmers, I also got sucked into trying to write an algorithm that would play the game as optimally as it could. And what I thought I'd do here is just talk through with you some of my process in that and explain some of the math that went into it, since the whole algorithm centers on this idea of entropy.

💬 0 comments
00:00:37Grant Sanderson

First things first, in case you haven't heard of it: what is Wordle? And to kill two birds with one stone here, while we go through the rules of the game, let me also preview where we're going with this, which is to develop a little algorithm that will basically play the game for us. So, I haven't done today's Wordle—this is February 4th—and we'll see how the bot does.

💬 0 comments
00:00:55Grant Sanderson

The goal of Wordle is to guess a mystery five-letter word, and you're given six different chances to guess. For example, my Wordle bot suggests that I start with the guess "crane". Each time that you make a guess, you get some information about how close your guess is to the true answer. Here, the gray box is telling me there's no C in the actual answer. The yellow box is telling me there is an R, but it's not in that position. The green box is telling me that the secret word does have an A, and it's in the third position. And then there's no N, and there's no E.

💬 0 comments
00:01:25Grant Sanderson

So let me just go in and tell the Wordle bot that information. We started with "crane"; we got gray, yellow, green, gray, gray. Don't worry about all the data that it's showing right now—I'll explain that in due time—but its top suggestion for our second pick is "shtick". And your guess does have to be an actual five-letter word, but as you'll see, it's pretty liberal with what it will actually let you guess.

💬 0 comments
00:01:46Grant Sanderson

In this case, we try "shtick", and all right, things are looking pretty good. We hit the S and the H, so we know the first three letters. We know that there's an R, and so it's going to be like "sh-a-something-r" or "sh-a-r-something". And it looks like the Wordle bot knows that it's down to just two possibilities: either "shard" or "sharp". It's kind of a toss-up between them at this point, so I guess probably just because it's alphabetical, it goes with "shard", which—hooray!—is the actual answer. So we got it in three.

💬 0 comments
00:02:12Grant Sanderson

If you're wondering if that's any good, the way I heard one person phrase it is that with Wordle, four is par and three is birdie, which I think is a pretty apt analogy. You have to be consistently on your game to be getting four, but it's certainly not crazy, and when you get it in three, it just feels great.

💬 0 comments
00:02:30Grant Sanderson

So, if you're down for it, what I'd like to do here is just talk through my thought process from the beginning for how I approached the Wordle bot. And like I said, really, it's an excuse for an information theory lesson. The main goal is to explain: what is information, and what is entropy?

💬 0 comments
00:02:47Grant Sanderson

My first thought in approaching this was to take a look at the relative frequencies of different letters in the English language. So I thought, okay, is there an opening guess or an opening pair of guesses that hits a lot of these most frequent letters? And one that I was pretty fond of was doing "other" followed by "nails".

💬 0 comments
00:03:03Grant Sanderson

The thought is that if you hit a letter—you know, you get a green or a yellow—that always feels good, it feels like you're getting information. But in these cases, even if you don't hit and you always get grays, that's still giving you a lot of information, since it's pretty rare to find a word that doesn't have any of these letters. But even still, that doesn't feel super systematic because, for example, it does nothing to consider the order of the letters. Why type "nails" when I could type "snail"? Is it better to have that S at the end? I'm not really sure.

💬 0 comments
00:03:29Grant Sanderson

Now, a friend of mine said that he liked to open with the word "weary", which kind of surprised me because it has some uncommon letters in there, like the W and the Y. But who knows? Maybe that is a better opener. Is there some kind of quantitative score that we can give to judge the quality of a potential guess?

💬 0 comments
00:03:43Grant Sanderson

Now, to set up for the way that we're going to rank possible guesses, let's go back and add a little clarity to how exactly the game is set up. So, there's a list of words that it will allow you to enter that are considered valid guesses; that's just about 13,000 words long. But when you look at it, there's a lot of really uncommon things, things like "aahed" or "alifs" and "argh"—the kind of words that bring about family arguments in a game of Scrabble.

💬 0 comments
00:04:07Grant Sanderson

But the vibe of the game is that the answer is always going to be a decently common word. And in fact, there's another list of around 2,300 words that are the possible answers, and this is a human-curated list—I think specifically by the game creator's girlfriend, which is kind of fun.

💬 0 comments
00:04:21Grant Sanderson

But what I would like to do—our challenge for this project—is to see if we can write a program solving Wordle that doesn't incorporate previous knowledge about this list. For one thing, there's plenty of pretty common five-letter words that you won't find in that list, so it would be better to write a program that's a little more resilient and would play Wordle against anyone, not just what happens to be the official website.

💬 0 comments
00:04:41Grant Sanderson

And also, the reason that we know what this list of possible answers is is because it's visible in the source code, but the way that it's visible in the source code is in the specific order in which answers come up from day to day. So you could always just look up what tomorrow's answer will be. So clearly, there's some sense in which using the list is cheating, and what makes for a more interesting puzzle and a richer information theory lesson is to instead use some more universal data, like relative word frequencies in general, to capture this intuition of having a preference for more common words.

💬 0 comments
00:05:11Grant Sanderson

So, of these 13,000 possibilities, how should we choose the opening guess? For example, if my friend proposes "weary", how should we analyze its quality? Well, the reason he said he likes that unlikely W is that he likes the long-shot nature of just how good it feels if you do hit that W. For example, if the first pattern revealed was something like this, then it turns out there are only 58 words in this giant lexicon that match that pattern. So that's a huge reduction from 13,000.

💬 0 comments
00:05:37Grant Sanderson

But the flip side of that, of course, is that it's very uncommon to get a pattern like this specifically. If each word was equally likely to be the answer, the probability of hitting this pattern would be 58 divided by around 13,000. Of course, they're not equally likely to be answers—most of these are very obscure and even questionable words—but at least for our first pass at all of this, let's assume that they're all equally likely and then refine that a bit later. The point is, the pattern with a lot of information is, by its very nature, unlikely to occur. In fact, what it means to be informative is that it's unlikely.

💬 0 comments
00:06:10Grant Sanderson

A much more probable pattern to see with this opening would be something like this, where, of course, there's not a W in it, maybe there's an E, and maybe there's no A, there's no R, there's no Y. In this case, there are 1,400 possible matches, so if all were equally likely, it works out to be a probability of about 11% that this is the pattern you would see. So the most likely outcomes are also the least informative.

💬 0 comments
00:06:32Grant Sanderson

To get a more global view here, let me show you the full distribution of probabilities across all of the different patterns that you might see. So each bar that you're looking at corresponds to a possible pattern of colors that could be revealed, of which there are $3^5$ possibilities, and they're organized from left to right, most common to least common.

💬 0 comments
00:06:51Grant Sanderson

So the most common possibility here is that you get all grays—that happens about 14% of the time—and what you're hoping for when you make a guess is that you end up somewhere out in this long tail, like over here, where there's only 18 possibilities for what matches this pattern, that evidently look like this. Or, if we venture a little further to the left, you know, maybe we go all the way over here. Okay, here's a good puzzle for you: what are the three words in the English language that start with a W, end with a Y, and have an R somewhere in them? Turns out the answers are... uh, let's see: "wordy", "wormy", and "wryly".

💬 0 comments
00:07:27Grant Sanderson

So, to judge how good this word is overall, we want some kind of measure of the expected amount of information that you're going to get from this distribution. If we go through each pattern and we multiply its probability of occurring times something that measures how informative it is, that can maybe give us an objective score.

💬 0 comments
00:07:44Grant Sanderson

Now, your first instinct for what that something should be might be the number of matches—you know, you want a lower average number of matches—but instead, I'd like to use a more universal measurement that we often ascribe to information, and one that will be more flexible once we have a different probability assigned to each of these 13,000 words for whether or not they're actually the answer.

💬 0 comments
00:08:09Grant Sanderson

The standard unit of information is the bit, which has a little bit of a funny formula, but it's really intuitive if we just look at examples. If you have an observation that cuts your space of possibilities in half, we say that it has one bit of information. In our example, the space of possibilities is all possible words, and it turns out about half of the five-letter words have an E—a little less than that, but about half—so that observation would give you one bit of information.

💬 0 comments
00:08:33Grant Sanderson

If, instead, a new fact chops down that space of possibilities by a factor of four, we say that it has two bits of information. For example, it turns out about a quarter of these words have a T. If the observation cuts that space by a factor of eight, we say it's three bits of information, and so on and so forth: four bits cuts it into a 16th, five bits cuts it into a 32nd.

💬 0 comments
00:08:55Grant Sanderson

So now is when you might want to take a moment and pause and ask for yourself: what is the formula for information, for the number of bits, in terms of the probability of an occurrence? Well, what we're saying here is basically that when you take $(1/2)^{\text{bits}}$, that's the same thing as the probability, which is the same thing as saying $2^{\text{bits}} = 1/\text{probability}$, which rearranges further to saying the information is the $\log_2(1/\text{probability})$. And sometimes you see this with one more rearrangement still, where the information is the $-\log_2(\text{probability})$. Expressed like this, it can look a little bit weird to the uninitiated, but it really is the very intuitive idea of asking how many times you've cut down your possibilities in half.

💬 0 comments
00:09:35Grant Sanderson

Now, if you're wondering, "You know, I thought we were just playing a fun word game; why are logarithms entering the picture?" One reason this is a nicer unit is it's just a lot easier to talk about very unlikely events—much easier to say that an observation has 20 bits of information than it is to say that the probability of such and such occurring is $0.00000095$.

💬 0 comments
00:09:52Grant Sanderson

But a more substantive reason that this logarithmic expression turned out to be a very useful addition to the theory of probability is the way that information adds together. For example, if one observation gives you two bits of information, cutting your space down by four, and then a second observation, like your second guess in Wordle, gives you another three bits of information, chopping you down further by another factor of eight, the two together give you five bits of information. In the same way that probabilities like to multiply, information likes to add. So as soon as we're in the realm of something like an expected value, where we're adding a bunch of numbers up, the logs make it a lot nicer to deal with.

💬 0 comments
00:10:29Grant Sanderson

Let's go back to our distribution for "weary" and add another little tracker on here, showing us how much information there is for each pattern. The main thing I want you to notice is that the higher the probability, as we get to those more likely patterns, the lower the information—the fewer bits you gain.

💬 0 comments
00:10:43Grant Sanderson

The way we measure the quality of this guess will be to take the expected value of this information, where we go through each pattern, we say, "How probable is it?" and then we multiply that by, "How many bits of information do we get?" And in the example of "weary", that turns out to be $4.9$ bits. So on average, the information you get from this opening guess is as good as chopping your space of possibilities in half about five times.

💬 0 comments
00:11:05Grant Sanderson

By contrast, an example of a guess with a higher expected information value would be something like "slate". In this case, you'll notice the distribution looks a lot flatter. In particular, the most probable occurrence of all grays only has about a 6% chance of occurring, so at minimum you're getting, evidently, $3.9$ bits of information. But that's a minimum; more typically, you'd get something better than that, and it turns out when you crunch the numbers on this one and you add up all of the relevant terms, the average information is about $5.8$. So in contrast with "weary", your space of possibilities will be about half as big after this first guess, on average.

💬 0 comments
00:11:44Grant Sanderson

There's actually a fun story about the name for this expected value of information quantity. You see, information theory was developed by Claude Shannon, who was working at Bell Labs in the 1940s, but he was talking about some of his yet-to-be-published ideas with John von Neumann, who was this intellectual giant of the time, very prominent in math and physics and the beginnings of what was becoming computer science.

💬 0 comments
00:12:04Grant Sanderson

And when he mentioned that he didn't really have a good name for this expected value of information quantity, von Neumann supposedly said—so the story goes—"Well, you should call it entropy, and for two reasons. In the first place, your uncertainty function has been used in statistical mechanics under that name, so it already has a name. And in the second place, and more important, nobody knows what entropy really is, so in a debate you'll always have the advantage."

💬 0 comments
00:12:27Grant Sanderson

So if the name seems a little bit mysterious, and if this story is to be believed, that's kind of by design. Also, if you're wondering about its relation to all of that second law of thermodynamics stuff from physics, there definitely is a connection, but in its origins, Shannon was just dealing with pure probability theory, and for our purposes here, when I use the word "entropy", I just want you to think: the expected information value of a particular guess.

💬 0 comments
00:12:51Grant Sanderson

You can think of entropy as measuring two things simultaneously. The first one is: how flat is the distribution? The closer a distribution is to uniform, the higher that entropy will be. In our case, where there are $3^5$ total patterns, for a uniform distribution, observing any one of them would have information $\log_2(3^5)$, which happens to be $7.92$. So that is the absolute maximum that you could possibly have for this entropy.

💬 0 comments
00:13:16Grant Sanderson

But entropy is also kind of a measure of how many possibilities there are in the first place. For example, if you happen to have some word where there's only 16 possible patterns and each one is equally likely, this entropy—this expected information—would be four bits. But if you have another word where there's 64 possible patterns that could come up and they're all equally likely, then the entropy would work out to be six bits.

💬 0 comments
00:13:39Grant Sanderson

So if you see some distribution out in the wild that has an entropy of six bits, it's sort of like it's saying there's as much variation and uncertainty in what's about to happen as if there were 64 equally likely outcomes.

💬 0 comments
00:13:54Grant Sanderson

For my first pass at the Wordle bot, I basically had it just do this: it goes through all of the different possible guesses that you could have—all 13,000 words—it computes the entropy for each one, or more specifically, the entropy of the distribution across all patterns that you might see for each one, and then it picks the highest, since that's the one that's likely to chop down your space of possibilities as much as possible.

💬 0 comments
00:14:16Grant Sanderson

And even though I've only been talking about the first guess here, it does the same thing for the next few guesses. For example, after you see some pattern on that first guess, which would restrict you to a smaller number of possible words based on what matches with that, you just play the same game with respect to that smaller set of words. For a proposed second guess, you look at the distribution of all patterns that could occur from that more restricted set of words, you search through all 13,000 possibilities, and you find the one that maximizes that entropy.

💬 0 comments
00:14:44Grant Sanderson

To show you how this works in action, let me just pull up a little variant of Wordle that I wrote that shows the highlights of this analysis in the margins. So after doing all its entropy calculations, on the right here, it's showing us which ones have the highest expected information. Turns out the top answer, at least at the moment—we'll refine this later—is "tares", which means, of course, a vetch, the most common vetch.

💬 0 comments
00:15:11Grant Sanderson

Each time we make a guess here, where maybe I kind of ignore its recommendations and go with "slate" because I like "slate", we can see how much expected information it had. But then on the right of the word here, it's showing us how much actual information we got given this particular pattern. So here, it looks like we were a little unlucky: we were expected to get $5.8$, but we happened to get something with less than that.

💬 0 comments
00:15:30Grant Sanderson

And then on the left side here, it's showing us all of the different possible words given where we are now. The blue bars are telling us how likely it thinks each word is. So at the moment, it's assuming each word is equally likely to occur, but we'll refine that in a moment. And then this uncertainty measurement is telling us the entropy of this distribution across the possible words, which right now, because it's a uniform distribution, is just a needlessly complicated way to count the number of possibilities. For example, if we were to take $2^{13.66}$, that should be around the 13,000 possibilities—a little bit off here, but only because I'm not showing all the decimal places at the moment. That might feel redundant and like it's overly complicating things, but you'll see why it's useful to have both numbers in a minute.

💬 0 comments
00:16:12Grant Sanderson

So here, it looks like it's suggesting the highest entropy for a second guess is "ramen", which again, just really doesn't feel like a word. So to take the moral high ground here, I'm going to go ahead and type in "reins". And again, it looks like we were a little unlucky: we were expecting $4.3$ bits and we only got $3.39$ bits of information. So that takes us down to 55 possibilities.

💬 0 comments
00:16:33Grant Sanderson

And here, maybe I'll just actually go with what it's suggesting, which is "comus"—whatever that means. And okay, this is actually a good chance for a puzzle. It's telling us this pattern gives us $4.7$ bits of information, but over on the left, before we see that pattern, there were $5.78$ bits of uncertainty. So as a quiz for you, what does that mean about the number of remaining possibilities? Well, it means that we're reduced down to one bit of uncertainty, which is the same thing as saying that there's two possible answers—it's a 50/50 choice.

💬 0 comments
00:17:05Grant Sanderson

And from here, because you and I know which words are more common, we know that the answer should be "abyss", but as it's written right now, the program doesn't know that. So it just keeps going, trying to gain as much information as it can, until there's only one possibility left, and then it guesses it. So obviously we need a better endgame strategy. But let's say we call this Version 1 of our Wordle solver, and then we go and run some simulations to see how it does.

💬 0 comments
00:17:30Grant Sanderson

So, the way this is working is it's playing every possible Wordle game. It's going through all of those 2,315 words that are the actual word answers; it's basically using that as a testing set. And with this naive method of not considering how common a word is and just trying to maximize the information at each step along the way until it gets down to one and only one choice, by the end of the simulation, the average score works out to be about $4.124$.

💬 0 comments
00:17:54Grant Sanderson

Which, you know, is not bad, to be honest. I kind of expected to do worse. But the people who play Wordle will tell you that they can usually get it in four; the real challenge is to get as many in three as you can. It's a pretty big jump between the score of four and the score of three. The obvious low-hanging fruit here is to somehow incorporate whether or not a word is common, and how exactly do we do that?

💬 0 comments
00:18:22Grant Sanderson

The way I approached it is to get a list of the relative frequencies for all of the words in the English language, and I just used Mathematica's word frequency data function, which itself pulls from the Google Books English Ngram public dataset. And it's kind of fun to look at. For example, if we sort it from the most common words to the least common words, evidently these are the most common five-letter words in the English language—or rather, these are the eighth most common: first is "which", after which there's "there" and "their". "first" itself is not first, but ninth, and it makes sense that these other words could come about more often, where those after "first" are "after", "where", and those being just a little bit less common.

💬 0 comments
00:18:59Grant Sanderson

Now, in using this data to model how likely each of these words is to be the final answer, it shouldn't just be proportional to the frequency, because, for example, "which" is given a score of $0.002$ in this dataset, whereas the word "braid" is, in some sense, about a thousand times less likely, but both of these are common enough words that they're almost certainly worth considering. So we want more of a binary cutoff.

💬 0 comments
00:19:19Grant Sanderson

The way I went about it is to imagine taking this whole sorted list of words and then arranging it on an $x$-axis, and then applying the sigmoid function, which is the standard way to have a function whose output is basically binary—it's either $0$ or it's $1$—but there's a smoothing in between for that region of uncertainty. So essentially, the probability that I'm assigning to each word for being in the final list will be the value of the sigmoid function above wherever it sits on the $x$-axis.

💬 0 comments
00:19:48Grant Sanderson

Now, obviously, this depends on a few parameters. For example, how wide a space on the $x$-axis those words fill determines how gradually or steeply we drop off from $1$ to $0$, and where we situate them left to right determines the cutoff. And to be honest, the way I did this was kind of just licking my finger and sticking it into the wind: I looked through the sorted list and tried to find a window where, when I looked at it, I figured about half of these words are more likely than not to be the final answer, and used that as the cutoff.

💬 0 comments
00:20:15Grant Sanderson

Now, once we have a distribution like this across the words, it gives us another situation where entropy becomes this really useful measurement. For example, let's say we were playing a game and we start with my old openers, which were "other" and "nails", and we end up with a situation where there's four possible words that match it, and let's say we consider them all equally likely. Let me ask you: what is the entropy of this distribution?

💬 0 comments
00:20:40Grant Sanderson

Well, the information associated with each one of these possibilities is going to be the $\log_2(4)$, since each one is one in four, and that's two—it's two bits of information. Four possibilities, all very well and good.

💬 0 comments
00:20:53Grant Sanderson

But what if I told you that actually, there's more than four matches? In reality, when we look through the full word list, there are 16 words that match it, but suppose our model puts a really low probability on those other 12 words of actually being the final answer—something like one in a thousand because they're really obscure. Now let me ask you: what is the entropy of this distribution?

💬 0 comments
00:21:14Grant Sanderson

If entropy was purely measuring the number of matches here, then you might expect it to be something like the $\log_2(16)$, which would be four—two more bits of uncertainty than we had before. But of course, the actual uncertainty is not really that different from what we had before, because just because there's these 12 really obscure words doesn't mean that it would be all that more surprising to learn that the final answer is "charm", for example. So when you actually do the calculation here and you add up the probability of each occurrence times the corresponding information, what you get is $2.11$ bits, just saying it's basically two bits—it's basically those four possibilities, but there's a little more uncertainty because of all of those highly unlikely events, though if you did learn them, you'd get a ton of information from it.

💬 0 comments
00:21:56Grant Sanderson

So zooming out, this is part of what makes Wordle such a nice example for an information theory lesson. We have these two distinct-feeling applications for entropy: the first one telling us what's the expected information we'll get from a given guess, and the second one saying, "Can we measure the remaining uncertainty among all of the words that we have possible?"

💬 0 comments
00:22:15Grant Sanderson

And I should emphasize, in that first case where we're looking at the expected information of a guess, once we have an unequal weighting to the words, that affects the entropy calculation. For example, let me pull up that same case we were looking at earlier of the distribution associated with "weary", but this time using a non-uniform distribution across all possible words. So let me see if I can find a part here that illustrates it pretty well... uh, okay, here, this is pretty good.

💬 0 comments
00:22:42Grant Sanderson

Here we have two adjacent patterns that are about equally likely, but one of them we're told has 32 possible words that match it. And if we check what they are, these are those 32, which are all just very unlikely words. As you scan your eyes over them, you know, it's hard to find any that feel like plausible answers—maybe "yells". But if we look at the neighboring pattern in the distribution, which is considered just about as likely, we're told that it only has eight possible matches. So a quarter as many matches, but it's about as likely. And when we pull up those matches, we can see why: some of these are actual plausible answers like "ring" or "wrath" or "wraps".

💬 0 comments
00:23:18Grant Sanderson

To illustrate how we incorporate all that, let me pull up Version 2 of the Wordle bot here, and there are two or three main differences from the first one that we saw. First off, like I just said, the way that we're computing these entropies—these expected values of information—is now using the more refined distributions across the patterns that incorporates the probability that a given word would actually be the answer. As it happens, "tares" is still number one, though the ones following are a bit different.

💬 0 comments
00:23:44Grant Sanderson

Second, when it ranks its top picks, it's now going to keep a model of the probability that each word is the actual answer, and it'll incorporate that into its decision, which is easier to see once we have a few guesses on the table. Again, ignoring its recommendation because we can't let machines rule our lives.

💬 0 comments
00:24:01Grant Sanderson

And I suppose I should mention another thing different here is over on the left: that uncertainty value, that number of bits, is no longer just redundant with the number of possible matches. Now, if we pull it up and, you know, we calculate, say, $2^{8.02}$, which should be a little above 256—I guess... $259$—what it's saying is even though there are 526 total words that actually match this pattern, the amount of uncertainty it has is more akin to what it would be if there were 259 equally likely outcomes. You can think of it like this: it knows "bora" is not the answer, same with "yorts" and "zorro" and "zorus", so it's a little less uncertain than it was in the previous case. This number of bits will be smaller.

💬 0 comments
00:24:37Grant Sanderson

And if I keep playing the game—I'm refining this down with a couple guesses that are apropos of what I would like to explain here—by the fourth guess, if you look over at its top picks, you can see it's no longer just maximizing the entropy. So at this point, there's technically seven possibilities, but the only ones with a meaningful chance are "dorms" and "words", and you can see it ranks choosing both of those above all of these other values that, strictly speaking, would give more information.

💬 0 comments
00:25:06Grant Sanderson

The very first time I did this, I just added up these two numbers to measure the quality of each guess, which actually worked better than you might suspect, but it really didn't feel systematic. And I'm sure there's other approaches people could take, but here's the one I landed on.

💬 0 comments
00:25:18Grant Sanderson

If we're considering the prospect of a next guess, like in this case, "words", what we really care about is the expected score of our game if we do that. And to calculate that expected score, we say: what's the probability that "words" is the actual answer? Which, at the moment, it ascribes 58% to. So we say with a 58% chance, our score in this game would be four, and then with a probability of one minus that 58%, our score will be more than that four. How much more, we don't know, but we can estimate it based on how much uncertainty there's likely to be once we get to that point.

💬 0 comments
00:25:51Grant Sanderson

Specifically, at the moment, there's $1.44$ bits of uncertainty. If we guess "words", it's telling us the expected information we'll get is $1.27$ bits, so if we guess "words", this difference represents how much uncertainty we're likely to be left with after that happens.

💬 0 comments
00:26:07Grant Sanderson

What we need is some kind of function, which I'm calling $f$ here, that associates this uncertainty with an expected score. And the way I went about this was to just plot a bunch of the data from previous games, based on Version 1 of the bot, to say, "Hey, what was the actual score after various points with certain very measurable amounts of uncertainty?"

💬 0 comments
00:26:25Grant Sanderson

For example, these data points here that are sitting above a value that's around like $8.7$ or so are saying, for some games, after a point at which there were $8.7$ bits of uncertainty, it took two guesses to get the final answer; for other games, it took three guesses; for other games, it took four guesses.

💬 0 comments
00:26:42Grant Sanderson

If we shift over to the left here, all the points over zero are saying whenever there's zero bits of uncertainty—which is to say there's only one possibility—then the number of guesses required is always just one, which is reassuring. Whenever there was one bit of uncertainty, meaning it was essentially just down to two possibilities, then sometimes it required one more guess, sometimes it required two more guesses, and so on and so forth.

💬 0 comments
00:27:05Grant Sanderson

Here, maybe a slightly easier way to visualize this data is to bucket it together and take averages. For example, this bar here is saying among all the points where we had one bit of uncertainty, on average, the number of new guesses required was about $1.5$. And the bar over here is saying among all of the different games where, at some point, the uncertainty was a little above four bits—which is like narrowing it down to 16 different possibilities—then on average, it requires a little more than two guesses from that point forward.

💬 0 comments
00:27:34Grant Sanderson

And from here, I just did a regression to fit a function that seemed reasonable to this. And remember, the whole point of doing any of that is so that we can quantify this intuition that the more information we gain from a word, the lower the expected score will be.

💬 0 comments
00:27:47Grant Sanderson

So with this as Version 2.0, if we go back and we run the same set of simulations, having it play against all 2,315 possible Wordle answers, how does it do? Well, in contrast to our first version, it's definitely better, which is reassuring. All said and done, the average is around 3.6, although, unlike the first version, there are a couple of times that it loses and requires more than six in this circumstance, presumably because there's times when it's making that trade-off to actually go for the goal rather than maximizing information.

💬 0 comments
00:28:17Grant Sanderson

So can we do better than 3.6? We definitely can. Now, I said at the start that it's most fun to try not incorporating the true list of Wordle answers into the way that it builds its model, but if we do incorporate it, the best performance I could get was around $3.43$. So if we try to get more sophisticated than just using word frequency data to choose this prior distribution, this $3.43$ probably gives a maximum at how good we could get with that—or at least, how good I could get with that.

💬 0 comments
00:28:45Grant Sanderson

That best performance essentially just uses the ideas that I've been talking about here, but it goes a little further, like it does a search for the expected information two steps forward rather than just one. Originally, I was planning on talking more about that, but I realized we've actually gone quite long as it is.

💬 0 comments
00:28:59Grant Sanderson

The one thing I'll say is, after doing this two-step search and then running a couple of sample simulations, among the top candidates so far, for me at least, it's looking like "crane" is the best opener. Who would have guessed?

💬 0 comments
00:29:11Grant Sanderson

Also, if you use the true word list to determine your space of possibilities, then the uncertainty you start with is a little over 11 bits. And it turns out, just from a brute-force search, the maximum possible expected information after the first two guesses is around 10 bits, which suggests that, best-case scenario, after your first two guesses with perfectly optimal play, you'll be left with around one bit of uncertainty, which is the same as being down to two possible guesses.

💬 0 comments
00:29:38Grant Sanderson

So I think it's fair, and probably pretty conservative, to say that you could never possibly write an algorithm that gets this average as low as three, because with the words available to you, there's simply not room to get enough information after only two steps to be able to guarantee the answer in the third slot every single time, without fail.

💬 0 comments
Video Player