Building Software Systems At Google and Lessons Learned
Watch the original video here: https://www.youtube.com/watch?v=modXC5IWTJI
Okay. Can you hear me? Okay. Welcome to, I guess this is EE380, but it's also been sort of overridden with our distinguished lecture series. Today's speaker is Jeff Dean of Google. I don't want to use up all his time describing all his accomplishments, but I'll say he did get his degree at the University of Washington. But since his adviser was a graduate of ours, we can claim him as a grandchild of our institution, and we certainly do that.
He initially landed after his degree at DEC Research Lab, worked there for a while, and then made the smart move to jump over to Google in the early days where he was instrumental in building a lot of the systems that Google uses to make money—basically their search and language translation, all these things. Along the way, he was involved with some very influential pieces of infrastructure including Bigtable, MapReduce, and Protocol Buffers. With that introduction, I'll let Jeff describe them to you.
Thank you, Malle. So welcome. Thank you for having me. I will try to speak louder. So the plan for today is, I'm going to connect my microphone. I'm going to talk about the evolution of various different kinds of systems at Google. One is how our computing hardware has evolved over about the last decade or so. And another thing is about how our core web search and retrieval systems have evolved—what we've done in order to scale those systems as traffic and index sizes have grown and so on.
I'll talk a little bit about some of the infrastructure software we've built that also underlies a bunch of Google products. And then the last little bit of the talk, I'm going to talk about some of the techniques that we've developed to build high-performance and reliable systems that have kind of cropped up in the process of building some of these systems, and sort of general patterns that seem useful across a wide variety of kinds of systems. This is joint work with a huge number of people at Google. I've worked on all the systems I'm describing, but there were many, many people involved and I always appreciate my colleagues.
So when I started working on web search in 1999 at Google, these are some metrics that I think you can use to evaluate different dimensions of a retrieval system. One is how many documents you're indexing, and that scale has increased by about a factor of a thousand from 1999 to today. The second thing that is another important dimension is how many queries you have to handle on a given day, and that's also grown by a factor of a thousand. I have a quick experiment. How many of you used Google regularly in 1999?
Yeah, this may be a biased audience, but how about 2002? How about now? Okay. Yeah. Well, so our traffic has grown, but I don't see a thousand more hands than I did earlier.
Another important thing that is pretty vital for improving your ranking quality is often you want to keep additional information around about each document in your index and use that to drive more sophisticated ranking algorithms. The amount of information we keep in our index today is about three times as much as it was then per document. One of the things, when I was making this slide, that was kind of surprising is that the metric that's grown the most or improved the most is actually our update latency. In 1999, we were basically updating our index once a month if we were lucky, and once every couple of months if something horrible went wrong. And now we have portions of our index we can update within a matter of seconds from crawling the page. So that's a pretty substantial improvement.
The other important factor for users is how quickly do you get your responses? This is measured at the server side, so it doesn't include client-side network latency, but basically, we've had about a 5x improvement in there. So the difficulty in kind of engineering a retrieval system is in some sense the product of all these things because you're dealing with larger indices, you're trying to handle more queries with more information per document, you're trying to update it more often, and you're trying to do it faster.
Now one thing that's really helped us a lot is that we've been able to use more machines and faster machines since 1999. We've upgraded our hardware a bit, and that's given us about a 1000x improvement in sort of computational oomph. The other kind of cool thing about working on search is that a lot of the stuff kind of happens behind the scenes and we don't necessarily change the user interface every time we change how the guts of our search system work. And so over the last 11 years, we've rolled out about seven very significant revisions to how our search system works. Often these have been rolled out without users really realizing we've made major fundamental changes underneath the covers, other than perhaps seeing a larger index or responses faster. The UI kind of looks the same.
Okay, so I will start at the beginning. As you obviously know, Larry and Sergey were grad students here and were doing research on how to build search engines for web pages and using the link structure of the web in order to do interesting ranking experiments. Apparently, their advisers were very mean and would not buy them enough computers. So they apparently survived by going down to the loading dock and volunteering to set up other research groups' machines and then living off the float for a while where they would actually use the machines for a little while. One problem with this approach is you end up with kind of a heterogeneous mix of Suns and IBMs and all kinds of crazy things rather than a nice homogeneous cluster which you might prefer from a software standpoint.
But the research project essentially looked like this. You have a system that takes in a query received by a front-end web server. The index is actually divided into a bunch of partitions by document. So each of these partitions has some of the documents across the whole system. You send a request to the index serving system to each partition. It computes the best results for its subset of the documents and sends results back. And the doc servers are used to generate the actual title and snippets you're going to show once you've decided which of the documents that you actually want to put on a results page.
So the basic principles of a system designed like this: given a query, each index server can return you a doc ID, which is just a numeric identifier for a document, and a score for that document, and gives you a bunch of those pairs. The cost of the index serving portion of the system is essentially order number of queries times the number of documents in the index that you're searching.
On the doc server side, the goal is given a doc ID and a query, you want to generate a title and a snippet. One of the early things Google did that was different from other search engines at the time is that the snippet was query-dependent. So given the search results, it would actually use the words in the query to decide what summary of the document to show, as opposed to just showing the first 30 words of the document or something. That actually is a really important innovation, but it does mean that the snippet is query-dependent. So you can't sort of precompute what are the snippets you want to show for every document. You have to do it for every (document, query) pair. The cost of the doc servers is really just order number of queries because for every query you're just going to show up to 10 results or 20 results or something. From a performance standpoint in a search engine, the cost is dominated by the index serving portion. The doc serving portion really doesn't matter at all, which is why you won't hear me talk about it too much.
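To make the index- and doc-server split concrete, here is a minimal single-process sketch of the document-sharded fan-out: each shard scores only its own documents, and the front end merges the partial results into a global top-k. The types, the toy scoring function, and the in-process "shards" are assumptions for illustration; the real system issues these as parallel RPCs.

```cpp
// Minimal single-process sketch of document-sharded index serving: each shard
// scores its own subset of documents; the front end fans the query out and
// merges the partial top-k lists. Types and scoring are illustrative only.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct Result {
  uint64_t doc_id;
  double score;
};

// Stand-in for one index partition: it only knows about its own documents.
struct IndexShard {
  std::vector<std::pair<uint64_t, std::string>> docs;  // (doc_id, contents)

  std::vector<Result> Search(const std::string& query, std::size_t k) const {
    std::vector<Result> hits;
    for (const auto& [doc_id, contents] : docs) {
      if (contents.find(query) != std::string::npos) {
        hits.push_back({doc_id, 1.0 / (1.0 + doc_id % 7)});  // toy ranking score
      }
    }
    std::sort(hits.begin(), hits.end(),
              [](const Result& a, const Result& b) { return a.score > b.score; });
    if (hits.size() > k) hits.resize(k);
    return hits;  // this shard's best k (doc_id, score) pairs
  }
};

// Front end: in production the per-shard searches are parallel RPCs; here we
// just loop, then merge the partial results into the global top-k.
std::vector<Result> SearchAllShards(const std::vector<IndexShard>& shards,
                                    const std::string& query, std::size_t k) {
  std::vector<Result> merged;
  for (const auto& shard : shards) {
    auto partial = shard.Search(query, k);
    merged.insert(merged.end(), partial.begin(), partial.end());
  }
  std::sort(merged.begin(), merged.end(),
            [](const Result& a, const Result& b) { return a.score > b.score; });
  if (merged.size() > k) merged.resize(k);
  return merged;  // only these top-k doc_ids go on to the doc servers
}
```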
Okay, so now that Google was a real company, we decided we should build our own hardware. We were going to live off the commodity hardware curve that was driving a lot of the plummeting prices for desktop computers. So we actually decided that we would build our own hardware from these commodity components. The commodity components were great because they were really low price for what you got. They didn't have a lot of fancy server-level features that you might see in higher-end machines, but they were pretty cheap. So we actually bought components, assembled them ourselves, and this was our first attempt before we hired mechanical engineers.
They're affectionately known as "corkboards" because we actually have trays here... there are several lessons from this particular design. One is each tray had four computers on it and they all shared a power supply to save some little bit of money. But it kind of induces additional failure modes that you really don't like when the power supply fails. It had four reset switches so an operator could reset each individual computer. And it had a thin layer of cork to insulate the motherboards from the metal tray that they were sitting on.
So the system in 1999 looked pretty similar to the original research project except it had grown a bit. Actually, the first thing I worked on when I showed up at Google is they said, "We need an ad system." So I said, "Okay, let's write an ad system." I'm not going to talk much about it in the rest of the talk, but there's all kinds of interesting features in the ad system. You can view the advertising system as really another form of information retrieval with some additional kinds of constraints like budgets for advertisers and cost-per-click metrics and so on. But I'm not going to really dwell on it in this talk.
In order to add more search capacity in a system, you essentially take the index data and you replicate it so that you have a whole bunch of machines that can deal with index shard zero and a whole bunch of machines that can deal with index shard one, and you pick one of the replicas for each shard and you send the request there.
You got the light. Did I? Thank you.
So, you pick one of the replicas for each shard and you send the request there. The other thing we added, obviously, was cache servers. This is sort of a no-brainer thing to do. So caching in web search is a very useful thing to do. We actually cache both the results of index results and doc requests. Hit rates you typically see are maybe 30 to 60% depending on a lot of different factors. One is how often you flush the cache or update the index and have to invalidate the cache. The other is kind of the mix of query traffic. If it's a data center in Europe, it'll see a wider variety of languages, and so that will mean you'll typically get fewer cache hits than a data center in the US where you see more English queries and fewer other language queries. How much you personalize the search results and how that affects what you can cache also matters.
The main benefits of the caching system are obviously performance because a few machines dedicated to caching and maintaining a cache on disk do the work that hundreds or thousands of machines in your backend systems are actually trying to do. So a few machines do 50% of your work and then the other thousand do the other 50%. You also get much lower query latency. So if you get cache hits, you just basically have to do one disk seek to read the data in the cache and you return to the users. You don't have to do any of this distributed distribution of RPCs to shards and so on.
Also, the queries that hit in the cache tend to be both popular—because obviously they hit in the cache, someone else has already issued this query—and expensive. They tend to be shorter queries, single-word queries that tend to have a lot of results, and you end up having lots of documents that get retrieved for these queries and lots of documents to score. One of the things to be aware of is that there's a big latency spike and a big capacity drop when you flush the cache or you update the index. So we also had to carefully stage when we would roll out a new index when we did it once a month to make sure we didn't do it at peak traffic times and so on.
So I'm not going to talk too much about our indexing system except maybe a little bit in the later versions of it. But in 1998 and 1999, it was basically a simple batch indexing system. The initial versions didn't really even have checkpointing. So you just kind of try to take a whole bunch of raw documents on disk, take all the words from them, sort them, and invert them, and you would essentially then end up with data structures that you could use to serve the index shards. We didn't actually have checksumming of the raw data and the machines we bought at that time, consumer-class machines, typically didn't have ECC or parity in their memory.
So a rather frustrating thing when you're sorting a terabyte of data without any parity is that it ends up mostly sorted. And if you try it again, it ends up mostly sorted a slightly different way. It's especially bad for merge sorts because you flip a bit and then all of a sudden you ignore all the data downstream of that particular input file you're merging. So a colleague of mine likened this to programming with adversarial memory. And so we developed one of the early kind of modules, a file abstraction for small records that actually appended little checksums and could resynchronize with resynchronization patterns when it detected corruption. This is still in use today for a variety of things and it's pretty useful.
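As a rough illustration of that record-file idea, a minimal sketch might look like the following; the sync-marker value and the simple checksum here are stand-ins, not the actual production format.

```cpp
// A minimal sketch of the record-file idea described above: every record is
// written as [sync marker][payload length][checksum][payload], so a reader
// that hits corruption can scan forward for the next sync marker and resume,
// instead of ignoring everything downstream. The marker value and the simple
// checksum are illustrative assumptions, not the production format.
#include <cstdint>
#include <fstream>
#include <string>

constexpr uint32_t kSyncMarker = 0x5AFEC0DE;  // arbitrary, easy to scan for

uint32_t SimpleChecksum(const std::string& data) {
  uint32_t sum = 2166136261u;  // FNV-1a, standing in for a real CRC
  for (unsigned char c : data) { sum ^= c; sum *= 16777619u; }
  return sum;
}

void AppendRecord(std::ofstream& out, const std::string& payload) {
  uint32_t len = static_cast<uint32_t>(payload.size());
  uint32_t crc = SimpleChecksum(payload);
  out.write(reinterpret_cast<const char*>(&kSyncMarker), sizeof(kSyncMarker));
  out.write(reinterpret_cast<const char*>(&len), sizeof(len));
  out.write(reinterpret_cast<const char*>(&crc), sizeof(crc));
  out.write(payload.data(), len);
}

// Returns true and fills *payload if a valid record starts at the current
// position; on a mismatch the caller can scan byte-by-byte for the next
// kSyncMarker rather than trusting the rest of the file.
bool ReadRecord(std::ifstream& in, std::string* payload) {
  uint32_t marker = 0, len = 0, crc = 0;
  if (!in.read(reinterpret_cast<char*>(&marker), sizeof(marker))) return false;
  if (marker != kSyncMarker) return false;
  in.read(reinterpret_cast<char*>(&len), sizeof(len));
  in.read(reinterpret_cast<char*>(&crc), sizeof(crc));
  payload->resize(len);
  in.read(payload->data(), len);
  return in.good() && SimpleChecksum(*payload) == crc;
}
```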
So as we got more mature, we kind of got more comfortable with our computer design. We ended up building computers with cases and all the connections on the front, which was a good idea. So basically this is kind of a design where we still had pretty dense computational power per rack. So we actually were much denser than any other sort of data center user at that time. We had a single power cord and a single network connection per machine. That's all you had to plug in. And at the time we were not running our own data centers. We were in hosting centers that charged us by the square foot and not any other factors. So which was kind of a curious business model, but what do we know? So our incentives were to pack as many machines as we possibly could into these square feet. And we often had to help them a little bit with some cooling. As a consequence, we actually got pretty good at moving out of bankrupt service providers and into other ones. You can get pretty efficient at it. You can have the racks all ready to go, you just wheel them in and then you just cable together the top-of-rack switches and away you go.
So one of the important things in this period of 1999 to 2001 is we were really growing two of those dimensions at the same time. The index grew by a factor of 20 or so over that period. At the same time we were getting kind of 15-20% traffic increases per month and signing big deals. So we signed a deal with Yahoo to provide their search service in July 2000 and basically our traffic doubled overnight as a result of that deal. So you know, July 5th or whatever it was, we turn on the spigot and now we have to handle twice as much traffic.
So we did a lot of work on performance of the index serving system. We were deploying more machines kind of as fast as we could, but it was still pretty challenging, and operationally you couldn't deploy them as fast as the traffic and the index were growing. So we basically needed to come up with lots of software improvements over that period. We would kind of roll out a new index, then work on software improvements for a while, roll out the new index with some new server software that was hopefully higher performance, and so on.
So over this period, this is kind of the way things went. We wanted to increase our index size because we believed that index size was a very important way of increasing search quality. Basically, you want to be the place where if you do a search and there's a page on the web, you find that page. And in order to do that, you have to have a very large index. So we're constantly increasing the size of the index. When you increase the size of the index, you have to keep partitioning it more finely in order to keep response times reasonable. If you tried to have like one or two index partitions, your latency would be too high. So essentially, as you're increasing the index size, you're adding more and more index partitions. And as traffic is growing, you're adding more and more replicas. So there go more shards, a bunch more shards, more replicas, more shards.
Now, a rather large problem with sharding the index this way is that you end up basically on every index shard doing a seek for every term in the query. And disks are not the most highly performant operations one can do. And so you end up basically being very disk-seek limited and not really utilizing all the disk bandwidth you have there. Now there are a lot of tricks you can do. You can build other kinds of auxiliary data structures on disk that allow you to kind of get more information per seek by pre-intersecting terms that are commonly appearing in queries. You can try to compress the index data so you have to read less data from disk which will help you because then in the time you're not seeking you have to read less stuff. But essentially it becomes more and more problematic.
Now one issue is as you try to work on index compression and you add more and more machines, you eventually realize that, "Hmm, you know, if I look at all these machines we're using to serve this index, I could actually hold one copy of the index in memory across all these machines." So in 2001, that's actually what we did is we had a completely in-memory index system and we've had that pretty much ever since for all of our index servers.
So this was basically the design we came up with. And there's another layer of distribution in there, what we call balancers, where the front-end web server talks to each balancer in a shard. The balancer then talks to each machine in the shard because there's now only one replica of any given piece of index data. And so it has to talk to all these machines within the shard and gets results. The balancer kind of combines them and sends them back up to the web server. We still have cache servers and doc servers; those work pretty much the same way.
Now there are a lot of really good things about this. First, you get a really big increase in throughput because you're no longer doing disk seeks, you're just reading from in-memory data structures. And you also get a very big decrease in query latency because you're not waiting for mechanical disks, especially at the tail. So really expensive queries in a disk-based indexing system... this was our kind of canonical example that caused us all kinds of headaches. We were looking through query logs and trying to find what queries were really expensive. And this one was just orders of magnitude more than the rest of the queries in the sample of queries we were looking at: "Circle of life".
So the problem with this query is all three of these words are actually relatively common. "Of" is extremely common and they hardly ever occur as a phrase on the web. And that's just the worst thing you can possibly have in a retrieval system because you're essentially streaming through all this data for "circle" and for "of" and for "life" looking for one rare document that has them next to each other in a phrase. I think we did like 30 gigabytes of IO for this query. And in the memory system, it's very fast because the seeks don't really cost you much. You just kind of move to a different position of the posting list in memory and away you go.
Now, when you have the index system in memory, there are two main issues that really kind of bite you as you're first starting to deploy such a system. One is variance. So the query is now going to touch thousands of machines, not just dozens. And so things like randomized cron jobs can cause you trouble. Our operations folks had decided that we were going to have cron jobs that run every five minutes on the machines to do various kinds of housekeeping things, and that it would be good to randomize those across different machines so that not all the machines were running those cron jobs at the same five-minute intervals but were kind of spaced out a bit.
It turns out for an in-memory index, this is the entirely opposite thing of what you actually want. Because at any given point, if you've randomized things, at least one machine is doing this cron job activity and so has less CPU available, or is busy churning its disk, or sending stuff over its network. Whereas if you were to just fix it at five-minute intervals so that everyone had a little hiccup every five minutes, that would actually be better, because most of the time all the machines would be running at full throttle, and there'd be a slight hiccup that would affect a relatively small number of queries but not every query throughout the five-minute period.
The other big problem you have is availability. So you used to have many replicas in the disk-based index system, but now you only have one replica. And so one thing that happens is machines do fail periodically. And so you'd like it to be the case that you don't suddenly have CNN.com drop off the face of the world when that happens. So for very important documents, you can just pick like the top PageRank documents and you can just replicate them across multiple different machines so that you don't completely lose those important documents.
The other thing that happens is you can have things called "queries of death," where when you receive the request and you start doing some processing for it, there's just some bug in your software that all of a sudden causes the process to abort. You know, it segfaults, or you didn't allocate enough space in some buffer for a really long query and you never encountered this before. Obviously your system gets more and more robust the more of these you see, but it's never going to be completely prevented. And so if you're going to send the same request to a thousand machines and it's a query of death, really, really bad things will happen. In particular, they'll all crash and your whole data center kind of goes belly up for many minutes at a time as it recovers.
So the solution is you basically can send what we call a "canary request" first to one machine. And if you get a response, excellent. Yes, then you're free to send it to all of them. If it fails, well, depending on how ambitious you're feeling, you could say, "Well, okay, I give up," or you can try another machine because it could have just been a coincidence that the RPC was sent to a machine and that machine crashed but not because of that query. If it crashes a second time, you're pretty sure this is not going to be a good thing to send to all your backends. And so you can then just reject that request. The nice thing is you then crash only a few servers, not thousands of them. And it's also a good idea to log what the query was there so that you can then investigate further.
Where was your blacklist?
Oh, we didn't really have one. I mean, we had a kind of a dynamic one in memory that we would keep in the balancers so that you would only crash a few backends and then if the user hit reload, then you'd say, "Okay, I'm going to reject it beforehand."
Were people trying to screw you or this is just...
No, not usually. It's just you know, some bug in some new release you've rolled out. You're also, remember, rolling out new releases really fast and you've replayed all of last month's logs or a large fraction of last month's logs to make sure it doesn't crash, but you're always getting new kinds of queries and things like that.
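A minimal sketch of the canary-request flow described above, assuming a stand-in RPC callback; the retry limit and the logging are illustrative choices rather than the production logic.

```cpp
// Minimal sketch of the canary-request idea: try the query on one backend
// (and possibly a second), and only fan it out to every replica once a canary
// has survived. SendToBackend is a stand-in for a real RPC; the retry count
// and the logging are illustrative choices.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

enum class RpcStatus { kOk, kBackendCrashed };

using SendToBackend =
    std::function<RpcStatus(int backend, const std::string& query)>;

bool FanOutWithCanary(const std::vector<int>& backends, const std::string& query,
                      const SendToBackend& send, int max_canary_attempts = 2) {
  // Phase 1: canary. A single crash might be coincidence, so allow one more
  // attempt on a different backend before concluding it's the query's fault.
  int attempts = std::min(max_canary_attempts, static_cast<int>(backends.size()));
  bool canary_ok = false;
  for (int i = 0; i < attempts; ++i) {
    if (send(backends[i], query) == RpcStatus::kOk) {
      canary_ok = true;
      break;
    }
  }
  if (!canary_ok) {
    // Probable "query of death": reject it and log it for later investigation
    // instead of crashing every replica in the data center.
    std::fprintf(stderr, "rejecting probable query of death: %s\n", query.c_str());
    return false;
  }
  // Phase 2: the canary survived, so it's safe to fan out to all backends.
  for (int b : backends) send(b, query);
  return true;
}
```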
So we kind of caught our breath a bit and then redesigned things a fair amount in 2004. This is kind of the first time we were able to rethink things from scratch, right? Essentially we unified the index servers and doc servers. So we now have a multi-level tree for query distribution, kind of generalized from what we had before where we had a fixed number of levels. We have the leaf servers that are able to handle both index and doc requests.
We have a thing called a Repository Manager sitting on the side that deals with... this index is made up of a whole bunch of shards, thousands of shards. And as a new shard becomes available, we'll just switch that one shard. So an index switch is not kind of this big monolithic event. It's now kind of a very gradual process that's happening all the time. And the cache servers cache partial results and then you issue requests for the pieces of the index that are new since you last had that cache request.
Yeah, so we kind of were able to at that point clean up a lot of the abstractions that had kind of evolved over time. Um one of the things we wanted in this system was the ability to do easy experiments. So when you have a new ranking idea or a new ranking algorithm you want to try, often you need to access some new kind of information that you want to precompute for every document in the index. And the old way of doing that was you had to kind of build a test index with the new information baked in, which was a very painful process to build a whole new index just to try out your ranking idea. So we wanted it to be easy to experiment so that you could attach new data to an existing index, roll out a new ranking algorithm, try that for a fraction of the traffic. If things look good, then roll it out to a larger fraction.
We also wanted it to be higher performance because we were getting more queries all the time. And we'd actually gone a little too far in terms of our index compression schemes and were using very densely encoded bit-level formats, because the original scheme was designed for things to be on disk. Ironically, the fact that it was so small was what made it actually end up fitting in memory for the first time. But once you have a system where you now care a lot about throughput, you actually want to relax how tightly you compress things to make it a little faster to decode. Because now you're not streaming it off disk, where you don't really care about CPU usage since the CPUs are so much faster than disk; now you actually care about CPU usage.
Yeah, so this is basically just saying we came up with a new format, so I'm going to skip ahead. In the new format we actually have a single flat position space, unlike the old format, which used a two-level (document identifier, word position within document) pair. And this new format needs to be compact. We can't really afford to have a 32-bit value for every word occurrence in the index. We'd like it to be more compact than that. We want it to be very fast to decode. One of the great things about retrieval systems is you get to work on all kinds of fun encoding formats.
So I will take a brief diversion. Given that we wanted to be fast, we know we're not going to be using bit-level encoding formats but something byte-aligned. So a very simple thing you can do when you have variable-length integers to encode is to use seven bits out of every byte to encode seven bits of the value, and then a continuation bit to say whether there is another byte that follows with another seven bits of data. So numbers between 0 and 127 you can encode with one byte. Those that need seven more bits take two bytes, and so on. And that continuation bit delimits whether a byte is part of the current number's encoding or starts a new number.
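A small sketch of this byte-at-a-time varint scheme, with seven payload bits per byte and a continuation bit; the least-significant-bits-first byte order is an assumption made for illustration.

```cpp
// Sketch of the variable-length integer scheme described above: seven payload
// bits per byte, with the high bit set when another byte follows.
#include <cstddef>
#include <cstdint>
#include <vector>

void VarintEncode(uint32_t value, std::vector<uint8_t>* out) {
  while (value >= 0x80) {
    out->push_back(static_cast<uint8_t>(value) | 0x80);  // 7 bits + continuation
    value >>= 7;
  }
  out->push_back(static_cast<uint8_t>(value));  // final byte, high bit clear
}

// Returns the decoded value and advances *pos past the bytes consumed.
uint32_t VarintDecode(const std::vector<uint8_t>& in, std::size_t* pos) {
  uint32_t value = 0;
  int shift = 0;
  while (true) {
    uint8_t byte = in[(*pos)++];
    value |= static_cast<uint32_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) break;  // continuation bit clear: number complete
    shift += 7;
  }
  return value;
}
```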
Now one problem with this is there's a lot of shifting and masking and looking at continuation bits in order to actually decode the number. So one thing you could do is you can say, "Well, maybe I'll just encode a number between zero and three that tells me how many bytes the full encoding of the number is and then use the other 30 bits of the number to actually encode the value." So I can limit myself to 30-bit numbers instead of 32 bits, steal two bits, and then zero means it's a one-byte number, one means it's a two-byte number, and so on. And this is actually faster. You have fewer branches and less shifting and masking to decode things if you're willing to limit yourself to 30-bit numbers.
What we actually have in our index is obviously a whole stream of numbers that we want to encode. And so we can actually afford to do things where we say we want to encode a group of four numbers. That's the task we're going to give ourselves. And what we're essentially going to do is take that two-bit length encoding from the second scheme on the previous slide and pull those bits out into a one-byte prefix for this group of four numbers. So now we have it all packed together as a single byte that has the length information for each of the four numbers in this group. And then the rest of the data will be in the remaining 4 to 16 bytes of the group, so the whole group will be between five and 17 bytes. Then you just kind of encode the numbers that way.
One nice thing about this is decoding is pretty fast. You basically load this prefix byte. You can then use that value to look up in a 256-entry table and it tells you exactly where the offsets are in the rest of the group for the numbers and what the masks are that you should use to decode them. Furthermore, you get a lot of instruction-level parallelism. Once you've done this lookup, you can essentially decode all four of those numbers in parallel and you don't take any branch mispredicts or anything really. It's actually a factor of two roughly faster than most of these alternatives. So you know, there's lots of different encoding formats. It's all fun.
What's the raw number? The raw number you didn't...
Oh, it's probably quite a bit faster, but it's kind of a memory trade-off, right? So if you just were streaming through memory, you could do it at gigabytes per second, but it would cost you, you know, 3 to 4x in memory probably.
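A sketch of the group-of-four scheme, assuming a little-endian host; the loop-based decoder below is written for clarity, whereas the table-driven version described above would pull all four values out without branching.

```cpp
// Sketch of the "group of four" scheme described above: one prefix byte holds
// four 2-bit length tags (length - 1 for each value), followed by 4 to 16 data
// bytes. The loop-based decoder is for clarity; the production trick is a
// 256-entry table keyed on the prefix byte that gives byte offsets and masks
// for all four values so they can be extracted with no branches.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

void GroupVarintEncode(const uint32_t values[4], std::vector<uint8_t>* out) {
  std::size_t tag_pos = out->size();
  out->push_back(0);  // placeholder for the prefix byte
  uint8_t tag = 0;
  for (int i = 0; i < 4; ++i) {
    uint32_t v = values[i];
    int len = (v < (1u << 8)) ? 1 : (v < (1u << 16)) ? 2 : (v < (1u << 24)) ? 3 : 4;
    tag |= static_cast<uint8_t>(len - 1) << (2 * i);
    for (int b = 0; b < len; ++b) out->push_back(static_cast<uint8_t>(v >> (8 * b)));
  }
  (*out)[tag_pos] = tag;
}

// Decodes one group of four values starting at *pos and advances *pos.
// Assumes a little-endian host for the partial-word loads.
void GroupVarintDecode(const std::vector<uint8_t>& in, std::size_t* pos,
                       uint32_t values[4]) {
  uint8_t tag = in[(*pos)++];
  for (int i = 0; i < 4; ++i) {
    int len = ((tag >> (2 * i)) & 0x3) + 1;  // 1..4 bytes for this value
    uint32_t v = 0;
    std::memcpy(&v, &in[*pos], len);         // little-endian load of `len` bytes
    *pos += len;
    values[i] = v;
  }
}
```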
So then the system obviously has continued to evolve. One of the major changes we made in 2007 was this notion of Universal Search. So previously we would just search web documents when you went to google.com and you'd have to go to these other properties that we developed over time, like books.google.com and news.google.com, to search these other kinds of corpora. And a problem with that is the user doesn't really know necessarily that these other corpora even exist in some cases, or they're certainly not going to take the time to go to each one of these and perform their search on all the different properties. So we'd like to actually search all the information we have when they go to google.com.
And in order to do that, we essentially took all these different properties and said, "Okay, we're going to search them at full web traffic levels." And we have actually an indexing service that sits underneath it that is able to roll out new index updates to these different corpora and switch things over. So the main issues here are performance. Obviously most of the corpora, since they were just their own isolated property, were used to traffic levels that were 10 to 100 times lower than what they were going to get if we were searching them with all web traffic. So a lot of them had ranking functions that weren't necessarily designed as efficiently as they could be. So we had to do a lot of work there.
There's a bunch of issues in the information retrieval area about how do you decide which corpora are relevant to a given query. One thing we found was that it's very hard to predict just from the query which corpora are going to give you good results. And so actually what we found easier is to issue the query to all the corpora, get the scores back, and then use that information to try to decide which corpora are most relevant. It's a much harder problem to do it just from the couple of words in the query. And then there's lots of interesting UI issues about how you organize the results from different corpora. Should you interleave them? Should you have separate sections for different kinds of things? And actually turns out for different kinds of documents, the answer is we do both depending on the kind of document.
Okay. So I'm going to switch gears a bit and talk about how kind of our underlying system infrastructure has evolved. This is a fairly modern incarnation of our rack design. We're back to not having cases on our machines. You get better air flow as it turns out, but they look better than the corkboard, don't they? So essentially these are still commodity-class machines. They have kind of a rack-level switch that connects to a central switch. They're all running Linux plus a bunch of in-house software that I'll describe.
Now this is kind of a list of things that go wrong that I collected from one of our operations people. Things get a little bit better after the first year of a cluster because it's kind of broken in a little bit. But you know, these are the kinds of things that you have to deal with when you're building software on top of this kind of computing platform. You know, obviously things like individual machine failures or disk drive failures you would kind of expect. But there are a lot of kinds of things that affect multiple machines at the same time, like a whole switch dying. That's actually not so bad because those machines just kind of drop off the face of the map. But if it gets slightly flaky and starts dropping some packets but not all, you can still kind of talk to the machines but it's just very slow, then that's actually kind of worse from a software standpoint because the machines kind of look alive but are kind of half there. That's actually quite painful.
In terms of long-distance links that connect our data centers, these are all actual reasons that have caused our long-distance links to die. Things you would not expect. They're not taught in software and hardware reliability courses. What should I tell you about? Dead horses. Yes. Turns out horse graves are very deep and will actually sever fiber that you've buried. Drunken hunters. This was in Oregon apparently. There wasn't actually much to hunt, and so some hunters saw some nice things mounted on poles across the valley and decided it would be interesting to try to shoot them.
So in this kind of environment, you really have to have the reliability and availability come from the software, not from the hardware. And even if you were to spend more money and buy more reliable hardware, at the scale that we're operating, that hardware is still going to fail. So you still need that reliability and availability to come from the software. And so I'd actually much rather have three times as many machines that are less reliable because you get a lot more computing oomph per dollar that way.
Okay. But assuming you have a lot of machines and you're running in this environment, there are a few things you'd like to be able to do. One is you'd like to be able to store data persistently with high availability. So that means not on a single machine obviously, or maybe not even on a single rack given the problems we saw previously. And you'd like high read and write bandwidth to the data you've stored. You'd also like to have the ability to run large-scale computations reliably and without having to sort of deal with individual machine failures.
So let me briefly talk about the Google File System (GFS) that was developed at Google in 2003. Essentially it was a file system optimized for the kinds of workloads we had, which were very large files: typically whole files of records of things like documents we'd crawled. So the files were hundreds of megabytes or gigabytes in size, not tiny 5k files. The design that we came up with was that the Master was going to manage the file system metadata—so file names and information about which different servers were storing different pieces of that file—but the data would actually be spread across lots of machines in the data center running a process we call the Chunkserver.
Clients when they actually wanted to read the data in the file would talk directly to the appropriate Chunkserver. So clients talk to the Master to figure out where the data is, but then they read directly from the Chunkserver. And the files are broken into chunks of roughly 64 megabytes in size. And because we want to tolerate machine failures, we're going to replicate chunks across multiple machines. So we're going to make multiple identical replicas of the same chunk, typically three. So in this example, I've shown C0 is replicated on both Chunkserver 1 and Chunkserver N. C5 is actually replicated on three of the ones illustrated here, and so on. So a client actually has three possible locations they can pick from to read the data.
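A toy in-process illustration of the read path just described: the client asks the master which chunkservers hold the chunk covering a file offset, then reads the bytes from one of those chunkservers directly. The structs and method names are invented for the sketch and are not the real GFS interfaces.

```cpp
// Toy, in-process illustration of the GFS read path: one metadata lookup at
// the master, then a data read from any chunk replica. All types and names
// are illustrative; this is not the real GFS interface.
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

constexpr uint64_t kChunkSize = 64ull * 1024 * 1024;  // 64 MB chunks

struct ChunkLocation {
  uint64_t chunk_id;
  std::vector<int> chunkservers;  // typically three replicas
};

// Master: holds only metadata (filename -> chunks and their replica locations).
struct Master {
  std::map<std::string, std::vector<ChunkLocation>> files;

  ChunkLocation Lookup(const std::string& filename, uint64_t offset) const {
    const auto& chunks = files.at(filename);
    return chunks.at(offset / kChunkSize);  // which chunk covers this offset
  }
};

// Chunkserver: holds the actual chunk bytes on its local disks.
struct Chunkserver {
  std::map<uint64_t, std::string> chunks;  // chunk_id -> data

  std::string Read(uint64_t chunk_id, uint64_t offset, std::size_t length) const {
    return chunks.at(chunk_id).substr(offset, length);
  }
};

// Client read: ask the master where the chunk lives, then read from a replica.
// The data itself never flows through the master.
std::string ClientRead(const Master& master, const std::vector<Chunkserver>& servers,
                       const std::string& filename, uint64_t offset,
                       std::size_t length) {
  ChunkLocation loc = master.Lookup(filename, offset);
  const Chunkserver& replica = servers[loc.chunkservers.front()];  // any replica
  return replica.Read(loc.chunk_id, offset % kChunkSize, length);
}
```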
So a cluster in our environment is actually somewhere between 5,000 and 20,000 machines. Typically one or a handful of hardware configurations. So the different configurations might be some machines with one disk and some machines with 12 disks, but other than that they're pretty much identical. Different clusters built at different times will have slightly different evolutions of our hardware. So this one built two years ago might have older processors that are a little bit slower than one built today, but within a cluster is typically pretty homogeneous. As I said, they're all running on this commodity hardware. They're all running Linux. They're all running Chunkserver processes.
And within the cluster somewhere, there's a GFS Master. We actually have a scheduling system so that you can run jobs and tasks. So each machine runs a scheduling daemon that the scheduling master talks to to kind of coordinate task startup and find out when machines have failed or have come back up and so on. We also run something I'm not really going to talk about much: a distributed lock service, so that different processes running on these machines can talk to this lock service and grab distributed locks or share small amounts of configuration information amongst themselves. And then on top of that we just run a whole bunch of different kinds of user-level jobs. So there might be one job that is the index servers in our web search system. Another one might be a big batch production job to rebuild the index.
So one of the problems that you have when you have lots of data is that it takes a long time to do anything with it. And for example, you know, if you have 20 billion web pages and that's going to be 400 terabytes of data, if you tried to do this all on one computer, you basically take a month, several months to do anything with it. And that's obviously not going to be very reasonable. So you have to parallelize computations somehow. The good news is that if you parallelize them, you can actually get pretty decent response times for fairly data-intensive tasks. Like I want to do something with all the web pages. If I'm able to parallelize that across a thousand machines, then I can do it in three hours instead of three months.
The bad news is there's a lot of issues that make this difficult. You have to communicate between the different pieces of your parallelized job. You have to coordinate, figure out who's going to do what, recover from machine failure somehow. Status reporting is pretty important because I'm sitting there impatiently waiting for my three-hour job to finish. And so on. And the "bad news part two" is that you're going to repeat a lot of this for every different problem that you're going to solve in slightly different ways.
So in 2003, my colleague Sanjay Ghemawat and I were working on rewriting our indexing system to try to make it easier to incorporate new kinds of information into it. One of the things about an indexing system is it starts with raw page contents on disk and then it goes through a whole bunch of phases to kind of compute intermediate data structures that you're eventually going to bake into either the index serving system or the doc serving system. And over time you kind of accrete more and more of these phases to compute other kinds of derived information that you either know is useful in your ranking algorithm or if you're experimenting you think it might be cool to have this information available.
Prior to this period, each phase was essentially this handwritten parallel computation where we hand-parallelized it across a bunch of different chunks of input. We would have handwritten checkpointing code to basically deal with fault tolerance. So if a machine crashed, you would revert to the last checkpoint that that machine had saved and restart the computation from there and roll forward.
So eventually we squinted at all these different phases in our indexing system and said, "You know, a lot of these look pretty similar and that they're extracting something from some input and transforming it in some way and then producing some output." So we kind of squinted at this and came up with this programming model called MapReduce that allows you to express relatively simply in a couple of functions what it is you're trying to do to the input data and how you want to transform it and then produce the output data. And the nice thing about that is if you take computations expressed in this way, you can hide a lot of the messy details that were previously kind of intermingled with the actual simple computation you're trying to do. And put all that gunk in a library and let the library deal with a lot of these issues and use that library for all kinds of different computations.
So the typical problem you're trying to solve in MapReduce is you want to read a lot of data, extract something you care about with a map function. The library internally will shuffle and sort it and then you want to apply a reduce function that says how you want to combine data that you generated in the map phase. And so this outline is basically the MapReduce programming model and that outline stays the same and you write the map and reduce functions to fit the problem.
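As a concrete instance of that outline, here is a minimal single-process simulation of the model using the canonical word-count example; the std::map shuffle stands in for what the library does across machines, and none of this is Google's actual MapReduce API.

```cpp
// Single-process simulation of the MapReduce programming model using word
// count: the user writes Map and Reduce, and the "library" (here a std::map
// standing in for the shuffle/sort machinery) groups intermediate values by
// key. Purely an illustration of the model, not Google's MapReduce library.
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using Emit = std::multimap<std::string, int>;

// Map: for each input document, emit (word, 1) for every word it contains.
void Map(const std::string& document, Emit* intermediate) {
  std::istringstream words(document);
  std::string word;
  while (words >> word) intermediate->emplace(word, 1);
}

// Reduce: for one key, combine all of its intermediate values (here, sum them).
int Reduce(const std::string& /*word*/, const std::vector<int>& counts) {
  int total = 0;
  for (int c : counts) total += c;
  return total;
}

int main() {
  std::vector<std::string> inputs = {"the quick brown fox", "the lazy dog", "the fox"};
  Emit intermediate;
  for (const auto& doc : inputs) Map(doc, &intermediate);  // map phase

  // Shuffle/sort phase: group values by key (the framework does this for you).
  std::map<std::string, std::vector<int>> grouped;
  for (const auto& [word, count] : intermediate) grouped[word].push_back(count);

  for (const auto& [word, counts] : grouped)  // reduce phase
    std::cout << word << ": " << Reduce(word, counts) << "\n";
}
```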
I'm going to walk through an example that's maybe a little more interesting than the typical examples used to introduce MapReduce. This is actually a real example from our production maps system. So our input in this case is a whole bunch of lists of geographic features, roads in the world, and we want to end up generating map tiles where we've actually rendered all those roads at some resolution. I'm ignoring multiple resolutions here; let's say we just have one resolution we care about. And so we need to somehow collect together all the data we need for rendering each tile in the world, starting with this input data.
So we can actually assign a numeric key to each tile in the world. In this case, I'm showing two tiles: Tile 0, Tile 1. This is actually Seattle if you're familiar with Seattle. And so there are four roads shown here. I-5, which actually intersects both these tiles. So my map function is basically going to take a feature... so the map function gets called on each of these geographic features and it's going to figure out which tiles that that particular feature intersects with and emit information about that feature to each tile that it intersects with. So in this case when it's called on I-5, it's going to determine that it intersects both Tile 0 and Tile 1. It's going to emit all the data it needs for I-5 and generate that set of key-value pairs.
Same thing with Lake Washington, which actually runs north-south and intersects both tiles. So we're going to do the same thing there. The 520 bridge actually only goes in the top tile, so we're going to emit just one copy of that to Tile 0. Similarly with I-90, which intersects only Tile 1. Then there's this big shuffle phase where we take all the keys that the map function has output and we combine things by key. So we cluster everything together that has key 0, and we have all the data we need to render Tile 0. And then we cluster everything that has key 1, and we have everything we need there. And then we invoke the user's reduce function, which for every tile is going to get a list of all the geographic features that intersect that tile. And the actual reduce function in this case is going to render things into a JPEG image and generate the JPEG image as the output. So it's a pretty simple computation when you express it in MapReduce, and the underlying implementation will take care of a lot of the details.
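A sketch of what the map and reduce functions for this tile example might look like, with the tile-intersection geometry and the JPEG rendering replaced by stand-ins; the shapes of the two functions are the point, not the geometry.

```cpp
// Sketch of the map and reduce functions for the tile-rendering example.
// Tile intersection and JPEG rendering are replaced with stand-ins.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Feature {
  std::string name;            // e.g. "I-5", "Lake Washington"
  std::vector<int64_t> tiles;  // tile ids this feature intersects (precomputed
                               // here; a real map function would compute them
                               // from the feature's geometry)
};

struct KeyValue {
  int64_t tile_id;   // the intermediate key
  std::string data;  // everything needed to render this feature
};

// Map: called once per geographic feature; emits one copy of the feature's
// rendering data for every tile it intersects.
void MapFeature(const Feature& feature, std::vector<KeyValue>* emit) {
  for (int64_t tile : feature.tiles) emit->push_back({tile, feature.name});
}

// Reduce: called once per tile with every feature that intersects it; a real
// implementation would rasterize them into a JPEG tile image.
std::string ReduceTile(int64_t tile_id, const std::vector<std::string>& features) {
  std::string rendered = "tile " + std::to_string(tile_id) + ":";
  for (const auto& f : features) rendered += " " + f;
  return rendered;  // stand-in for the encoded image bytes
}

int main() {
  std::vector<Feature> features = {{"I-5", {0, 1}},
                                   {"Lake Washington", {0, 1}},
                                   {"520 bridge", {0}},
                                   {"I-90", {1}}};
  std::vector<KeyValue> intermediate;
  for (const auto& f : features) MapFeature(f, &intermediate);

  // Shuffle: group the emitted feature data by tile id (done by the framework).
  std::map<int64_t, std::vector<std::string>> by_tile;
  for (const auto& kv : intermediate) by_tile[kv.tile_id].push_back(kv.data);

  for (const auto& [tile, feats] : by_tile)
    std::cout << ReduceTile(tile, feats) << "\n";
}
```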
In particular, there's going to be a Master that's going to coordinate the process of splitting the input into a bunch of different chunks so that it can get parallelized across a whole bunch of different machines. I didn't mention that the user can specify a partitioning function for the reduce phase to group different intermediate keys into different partitions, so they can control the level of parallelism you get on the reduce side. And essentially the Master is going to find out who the free workers are. It's going to assign them map tasks, and the worker is going to read all the records in that particular map task, invoke the user's map function, and generate the intermediate data into files. And the Master is then going to assign reduce tasks to free workers, and those workers will read the intermediate data that's been generated by the workers who performed the map tasks, sort it, and then apply the user's reduce function.
So the way this looks in graphical form is essentially like this. You have a whole bunch of input data sitting on a distributed file system. The Master breaks it into chunks, will tell different machines to process different chunks. The map tasks generate output onto their local disk or in memory if there's not very much output, and then it'll get shuffled by the framework and then the user reduce function will be invoked and will produce output. One thing to note is that for really large computations, it's really more about network bandwidth and making efficient use of your disks and network transfers that you're able to do in your network rather than CPU and DRAM. For small computations or CPU intensive ones, that's not necessarily the case, but a lot of the performance in large MapReduces is guided by the shuffle and the framework kind of takes care of that for you.
Another thing that's worth pointing out is that you have pretty fine-granularity tasks. We don't want it to be the case that in the course of a single MapReduce computation, each worker does just one map task. We'd rather have them do 10 or 20, so that when a machine fails, you can actually recover from that very fast by giving the 20 tasks that machine had done to 20 other machines, one each. And recovery is very fast that way. It also allows you to pipeline the network transfer. So here's a picture with three map tasks, two reduce tasks, and the first two workers are assigned to do the map tasks. This guy is first assigned map task one. Then he finishes that because it was pretty fast and does map task three. And then this guy's going to work on map task two, which takes a little longer.
And so as soon as map task one finishes, these two guys which have been assigned to do the reduce work, they can actually start reading the output of map task one from across the network. So you're actually shuffling in parallel with running other map task computations. And so you end up with a lot of pipelining in this system and you get better dynamic load balancing by having finer granularity tasks.
It's actually pretty fault tolerant. So you can actually handle a lot of the issues when a machine fails by just re-executing the work it was supposed to have done and doing a little bookkeeping to keep track of who actually now has the current result output for map task 7 or whatever. It's actually pretty robust. So we were running experimental MapReduces early in the MapReduce development and we kind of got our signals crossed with the operations people, and it turns out they were rewiring the network in the data center we were running our large MapReduce on. And so they kept unplugging rack after rack of machines and rewiring it. You know, we'd lose 80 machines and go, "Oh well, I wonder what's going on." And the MapReduce framework would recover. And then they'd unplug the next rack and we'd lose another 80 machines. It's quite painful... But we didn't really know what was going on at the time, and it took a little longer than we thought, but we actually handled this and only discovered it after the fact, which is kind of cool.
Another refinement you can make in this environment that's all handled automatically by the framework is backup tasks. So if you have slow workers that are slow for reasons independent of the actual input data—you know, maybe they're just running a lot of other jobs at the same time or their network is congested—then they can lengthen the completion time a lot, not because of the input data but because of other issues. So one of the things we do is near the end of either the map or reduce phase, we spawn backup copies of these tasks. So we actually run multiple copies of map task 7, and whichever one finishes first wins. And this brings down the job completion time tremendously for large MapReduces.
Another important thing in our case is a locality optimization. So the scheduling policy is such that we're going to ask GFS for the locations of the replicas of the input file blocks, and we're going to try to schedule map tasks so that they're local to a machine holding that data, or at least on the same rack, in the interest of conserving network bandwidth. So here are some stats about how MapReduce has been used over time within Google. I won't really read it all. We're roughly processing an exabyte a month now, running 4 million jobs a month. Kind of interesting. I don't know.
That's a petabyte.
946,000 terabytes. Yeah, that's a petabyte... no, nearly an exabyte. Exabyte. Yes. Yes. August '04 was about three petabytes.
So I'm not going to touch on this too much. I'll just briefly mention a current system I'm working on with seven or eight other people. Basically it's a system called Spanner that is designed to be a piece of infrastructure that runs in a single instance across multiple data centers. So a lot of the systems we developed at Google run one instance in one data center and then another instance in another data center. And this is kind of the first system we've really tried to build that spans many data centers at a large scale.
So it has a single global namespace for data so that if you decide to move or change the replication of some data or move a copy of some data, the name of that data doesn't change, which is a pretty important operational property within Google. Because right now if the calendar group has some state stored in this data center and they decide to move it, that actually changes the name when they store it in a different GFS cell, which is kind of a pain because that means sharing across different groups is difficult. You always have to go ask the calendar people, "Okay, where's your current data now?" And we're working on supporting a mix of strong and weakly consistent operations across different... across data centers. And the hope is that the system is much more automated, especially when it's moving data across different data centers, than right now which is kind of a fairly manual labor-intensive process.
So this is kind of a very broad, high-level view of the zone design of Spanner. So you're going to have a whole bunch of zones around the world in different data centers. And these zones are going to store copies of data and might also have replicas of that data in other zones. And we'd like the zones to be semi-autonomous, so they can still continue functioning and doing load balancing within the zone on their own, even if they're partitioned away from the rest of the system. And we also want them to be consistent after a partition heals; like, let's say those two network links are severed and then they come back up, we'd like to recover a consistent view of the data. And users are going to specify kind of high-level desires rather than very specific things. We'd rather they say "I want three copies of the data, two in Europe and one in the US," rather than saying "I want it in exactly these three data centers."
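A purely hypothetical sketch of what such a high-level placement request might look like as data, as opposed to naming specific data centers; the struct and its fields are invented for illustration and are not Spanner's actual configuration interface.

```cpp
// Hypothetical illustration of a high-level placement desire ("three copies,
// two in Europe, one in the US"). The system, not the user, would decide which
// concrete zones satisfy the constraints and could rebalance over time.
// Invented for clarity; not Spanner's actual configuration interface.
#include <map>
#include <string>

struct PlacementPolicy {
  int total_replicas = 3;
  // Coarse regional constraints rather than specific data centers.
  std::map<std::string, int> min_replicas_per_region = {{"EU", 2}, {"US", 1}};
};
```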
Okay. All right. The final portion of the talk is a set of kind of design experiences and patterns that I think you'll see derived from some of the systems I've described. They're not obviously all-encompassing, but they're kind of good rules of thumb that we've found crop up across multiple systems.
So, one of the first things that is pretty important when you're developing large complicated software systems is to be able to break them down into separate subsystems. And especially at Google because everything is going to be distributed across many machines. Those separate distributed services provide good natural interface points where you can, you know, fairly easily specify the interface to the spelling correction system. And then this group of people can work on the spelling correction system fairly independently of the people working on the ad system or the clients of the spelling system, and they can roll out new versions at whatever pace makes sense for them and are largely decoupled from the other people working on other distributed services. So this is pretty important. Small teams can work independently of other ones by carefully defining these interfaces and services. That also makes it easier for us to have a lot of engineering offices around the world. You know, we went through a big expansion about three or four years ago where we went from, you know, two or three engineering offices to 30 around the world. And part of the reason we're able to do this is because we were able to break things down into these separate services and give ownership of those services to different offices. So as an example, every search you do on Google.com touches 200 separate services. Probably more than 200. Who can keep track?
Another pretty important thing when you're trying to build a design for a system is given some basic problem definition, how can you choose what the best solution is? And the "best" has a lot of kind of qualitative aspects to it. Is it easy to understand? Is the interface clean? But it also has some quantitative aspects like how is it going to perform? And so on. And so a really important skill when you're designing systems is being able to estimate with a back of the envelope kind of calculation what the performance of a system or several alternative designs is going to be without actually having to build it. You know, if you have to build four versions and measure them in order to figure out that, wow, those three were really bad ideas, then that's not going to be very good.
So, a while ago, someone asked me to give a talk and I put together this list of numbers I think everyone should know. Maybe not everyone, but everyone designing software systems. And you know, they're a pretty broad spectrum in terms of magnitude. You know one of the things you notice about data centers is that they're far apart and so it takes you a long time to go between them. The other thing is that memory is really fast and disks are pretty slow. The middle one there, "compress a kilobyte with cheap compression algorithm," that's actually pretty interesting because often you can get a factor of two compression with a very lightweight compression algorithm and save yourself actually quite a bit of network bandwidth. And try to avoid disk seeks if ever possible. You know, motherhood and apple pie, I guess.
But that's pretty impressive about Netherlands because that's... I mean that would be 30,000 miles if you were going at the speed of light, but you're obviously not, but it's a significant fraction of the speed of light.
Yes. Well, I mean it is basically sending over fiber optic cables.
So here's an example of how you could do this. How long is it going to take you to generate an image results page? Let's say I'm tasked with this: I have to generate 30 thumbnails for an image search. And I have one design where, for each of the 30 images, I'm going to do a disk seek, read the quarter-megabyte image, and then go on to the next image. So that's one design. It would take me, you know, roughly half a second. Design two: it's not too difficult to figure out that if you issue the reads in parallel and you have enough disks, then it'll actually go quite a bit faster. Back of the envelope tells you it should be 20 milliseconds or something. Actually, probably a bit more because of variance when you're talking to lots of disks; some of them are going to be a little bit slower than others, but it'll be significantly better than the first design.
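The arithmetic behind those two estimates, using assumed era-appropriate round numbers of about 10 ms per disk seek and about 30 MB/s of sequential read bandwidth.

```cpp
// Back-of-the-envelope arithmetic for the two thumbnail-page designs above,
// assuming roughly 10 ms per disk seek and ~30 MB/s sequential disk reads.
#include <cstdio>

int main() {
  const double kSeekMs = 10.0;                // one disk seek
  const double kReadMsPerMB = 1000.0 / 30.0;  // ~30 MB/s sequential read
  const int kImages = 30;
  const double kImageMB = 0.25;               // quarter-megabyte image

  // Design 1: read the 30 images one after another from a single disk.
  double serial_ms = kImages * (kSeekMs + kImageMB * kReadMsPerMB);
  // Design 2: issue all 30 reads in parallel across enough disks.
  double parallel_ms = kSeekMs + kImageMB * kReadMsPerMB;

  std::printf("serial:   ~%.0f ms\n", serial_ms);    // ~550 ms, about half a second
  std::printf("parallel: ~%.0f ms\n", parallel_ms);  // ~18 ms, call it 20 ms
}
```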
And the back-of-the-envelope calculations allow you to work out lots of different variations. If you're going to cache in this system, does it make sense to cache the thumbnails of single images? Should you cache a whole set of thumbnails in one cache entry? Does it make sense to precompute thumbnails? These are the kinds of calculations you should be doing over and over in your head when you're designing a system. The other thing that's pretty important is to know the back-of-the-envelope numbers for the things you're building on top of. If you don't know roughly how long it takes to do a write into your caching system (and by cache I mean a higher-level, disk-based software cache), then you really can't tell whether it's a good idea to cache things or not.
So another thing I've found when you're designing software infrastructure that will be used by a bunch of different groups or people is that if you ask them what they want, they'll tell you many different things. It's really important to listen for the commonalities and figure out what it is they all really want. If they tell you they want eight different things, usually six of them will be in common, and you should pick those six and do them, because that's clearly going to help a lot of people. If you really stretch yourself, you can usually handle an extra one that would help a lot of people, too. But if you try to do all eight, it's probably going to result in a worse system for everyone, because the system will get much more complicated; that eighth feature just drove you over the edge of complexity in some way. And that really compromises things for all the other clients, and possibly even for the clients you were trying to help by adding it.
What about what they need instead of what they want?
Yeah. Well, it's a subtle distinction, but it is also important to listen to what they're saying and then sometimes translate that into "if I did this feature, then you wouldn't need this other thing you're asking for." And sometimes they say, "Oh yeah, that's true."
Another thing I've found is that you don't want to build infrastructure just because it's fun to build infrastructure. You want to build it to address real needs. Another trap, when you're designing something, is to imagine hypothetical uses, like "what if we had a client who wanted to do this," and design in a lot of complexity because you imagine it would be useful, even though no one is actually asking for it today. So you need to try to distinguish likely potential uses from unlikely ones that would just be challenging to handle. And the best approach, ideally, is to use your own infrastructure at the same time you're building something that sits on top of it, because you get very, very fast iteration that way. You can put a feature in the infrastructure if it makes sense there, or in your application if it doesn't, and get quick feedback on how the interfaces are evolving and how hard they are to use. If you can't do that, then you should at least sit very close to the people trying to use your infrastructure and get very rapid feedback from them.
Another thing is to design for growth. You try to anticipate which parameters of your system are actually going to grow, and by how much, over time. Obviously you can't do this perfectly; if you could, you'd be predicting the future and we'd all be better off. But I've often found that it's not really useful to design a system so that it can scale infinitely. If you think about the in-memory example, our original disk-based index evolved pretty well. But once the number of queries we were handling per day crossed a threshold of something like 100x what it was handling originally, a very different design made sense: not having any of the index data on disk, keeping it completely in memory. So when you have parameters that change by two orders of magnitude, that probably means the original design is not going to be the right one at the much larger scale.
Another common pattern that crops up is when you're building a distributed system, there's a temptation to build completely distributed state as opposed to having a centralized component. Turns out you can actually go quite far in a lot of distributed systems with a centralized component that has some amount of state that makes a lot of things easier. So good examples of this are GFS where we had the centralized metadata master; Bigtable, which I didn't talk about but has a centralized master that coordinates load balancing across a bunch of machines; MapReduce has the Master for coordinating handing out of map tasks; and various other examples.
It's important to keep the master from being a scalability bottleneck, so you don't want it to be involved in very frequent kinds of operations. But in terms of oversight of a much larger system, this pattern works pretty well and will actually scale to thousands or 10,000 workers, though probably not to 100,000 machines, and you have to design carefully to keep the master out of the common-case operations. The benefit of having a centralized component is that it's much simpler to reason about the state of the system, and you also get a nice central place where you can hook in status information, a place you can go to find out the state of the system. You often want to have a hot standby of the master, because if that single machine fails, you want to be able to recover quickly.
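As a toy sketch of this pattern, the following Python keeps the master responsible only for coarse metadata (which worker owns which chunk), while clients cache that mapping and talk to workers directly, so the master never sits on the data path. The names and structure are illustrative only, not GFS's or Bigtable's actual design.

```python
# Toy "small centralized master" pattern: the master tracks chunk ownership,
# clients cache the mapping and read directly from workers.
import random


class Master:
    def __init__(self, workers):
        self.workers = list(workers)
        self.assignment = {}                      # chunk_id -> worker

    def locate(self, chunk_id):
        """Rare, metadata-only operation: where does this chunk live?"""
        if chunk_id not in self.assignment:
            self.assignment[chunk_id] = random.choice(self.workers)
        return self.assignment[chunk_id]


class Worker:
    def __init__(self, name):
        self.name = name
        self.chunks = {}                          # chunk_id -> data

    def read(self, chunk_id):
        """Frequent data-path operation: never goes through the master."""
        return self.chunks.get(chunk_id, b"")


class Client:
    def __init__(self, master):
        self.master = master
        self.location_cache = {}                  # avoid asking the master twice

    def read(self, chunk_id):
        if chunk_id not in self.location_cache:
            self.location_cache[chunk_id] = self.master.locate(chunk_id)
        return self.location_cache[chunk_id].read(chunk_id)


if __name__ == "__main__":
    workers = [Worker(f"w{i}") for i in range(3)]
    client = Client(Master(workers))
    client.read("chunk-42")       # one metadata lookup, then direct reads
    client.read("chunk-42")       # served from the client's location cache
```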
Obviously, when you're sending a request to lots of machines, you don't want the network interface or the CPU of the root node that's sending the requests out to be a bottleneck. That wide fan-in tends to cause dropped TCP packets and then retransmits, which adds significant latency, and the CPU cost of either sending out the requests or processing the responses at the root can be a big bottleneck. So introducing a tree is a good idea: each parent is responsible for sending the request to a subset of the leaves in the system. The other benefit is that the CPU cost of processing responses gets spread across a whole bunch of parents, and the network congestion is much lower, especially if the parent is able to take some of the data sent by the leaves and filter it in some way.
In our search systems, for example, the leaves generate their best 10 or 15 responses, and then the parent sends back the best 20 or 30 out of the 30 leaves it's responsible for. That's quite a bit of reduction in the amount of data that gets sent up to the root compared to the root talking directly to the leaves. It also gives you an opportunity to co-locate responsibility, so that a parent on a given rack is responsible for talking to all the leaves on that rack, and that can help if your network has interesting properties where you don't have full bisection bandwidth but do have more bandwidth within a rack of machines than across racks.
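A small Python sketch of that tree, with made-up scores and shard contents: each parent fans a query out to its own leaves and forwards only its best few results, so the root never sees the full fan-in from a thousand leaves.

```python
# Tree-distributed request with partial aggregation at each parent.
# Scores and shard contents are fabricated for illustration.
import heapq
import random


def leaf_search(shard_id, query, k=15):
    # Stand-in for a real index shard: return (score, doc_id) pairs.
    random.seed(hash((shard_id, query)) & 0xFFFF)
    return [(random.random(), f"doc-{shard_id}-{i}") for i in range(k)]


def parent_search(leaf_ids, query, k=30):
    # Merge the leaves' candidates but forward only the top k upward,
    # cutting the data (and CPU) that reaches the root.
    candidates = []
    for leaf in leaf_ids:
        candidates.extend(leaf_search(leaf, query))
    return heapq.nlargest(k, candidates)


def root_search(parents, query, k=10):
    candidates = []
    for leaf_ids in parents:
        candidates.extend(parent_search(leaf_ids, query))
    return heapq.nlargest(k, candidates)


if __name__ == "__main__":
    # 1,000 leaves arranged under parents of 30 leaves each (e.g. one per rack).
    leaves = list(range(1000))
    parents = [leaves[i:i + 30] for i in range(0, len(leaves), 30)]
    for score, doc in root_search(parents, "back of the envelope"):
        print(f"{score:.3f}  {doc}")
```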
Another pattern that crops up is backup requests to minimize latency. I described this briefly for MapReduce, where we spawn backup map tasks near the end of the computation in order to take the first one that finishes. It's actually useful in fairly fine-grained settings too, like query serving systems. If you have multiple replicas, then towards the end of a request that went out to all thousand leaves, once you've heard back from 995 of them and you have replicas for the other five, why not send five extra requests to those replicas and take whichever copies respond first? The granularity there is milliseconds, as opposed to map tasks, where the granularity is many seconds.
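A minimal sketch of backup requests at query-serving granularity, using threads and made-up replica timings; a production system would also cancel the losing request rather than just ignore it, and the thresholds here are assumptions.

```python
# Backup requests: if the primary is slow, fire the same request at a replica
# and take whichever answers first. Replica names and delays are made up.
import concurrent.futures
import random
import time


def query_replica(replica, request):
    # Stand-in for an RPC; occasionally a replica is slow (a "straggler").
    delay = 0.005 if random.random() > 0.05 else 0.5
    time.sleep(delay)
    return f"{replica} answered {request!r} in {delay * 1000:.0f} ms"


def fetch_with_backup(replicas, request, backup_after_s=0.010):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(query_replica, replicas[0], request)]
    done, _ = concurrent.futures.wait(futures, timeout=backup_after_s)
    if not done:
        # Primary is taking too long: send a backup request to a replica.
        futures.append(pool.submit(query_replica, replicas[1], request))
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
    result = next(iter(done)).result()
    pool.shutdown(wait=False)      # don't block on the losing request
    return result


if __name__ == "__main__":
    print(fetch_with_backup(["replica-a", "replica-b"], "query shard 997"))
```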
Another good thing is multiple smaller units of work per machine. You want to minimize recovery time when machines crash. If a machine is responsible for a single monolithic piece of work, that's bad, because some other machine will then have to load up state proportional to that whole monolithic thing. It's also fairly inflexible for load balancing: if this machine is overloaded because its particular work chunk is slightly more expensive than some other chunk, you don't really have any flexibility, because it's going to be slow on some other machine too. If instead you give, say, 10, 20, or 100 pieces of work to a given machine, you get both fast recovery, because if this machine dies, ten other machines can each pick up one of its chunks and recover in roughly a tenth of the time the monolithic version would take, and you get load balancing, because I can move one of those chunks off an overloaded machine and it now carries only nine-tenths of the work. This is present in a lot of our systems.
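Rough arithmetic for that point, with assumed figures: many small chunks mean a failed machine's state is reloaded in parallel, and load can be shifted in small increments.

```python
# Illustrative arithmetic only; the state size and bandwidth are assumptions.
STATE_PER_MACHINE_GB = 100
RECOVERY_BANDWIDTH_GB_PER_S = 0.1     # per recovering machine
CHUNKS_PER_MACHINE = 10

# Monolithic: one other machine reloads all 100 GB by itself.
monolithic_recovery_s = STATE_PER_MACHINE_GB / RECOVERY_BANDWIDTH_GB_PER_S

# Chunked: 10 machines each reload one 10 GB chunk in parallel.
chunked_recovery_s = (STATE_PER_MACHINE_GB / CHUNKS_PER_MACHINE
                      / RECOVERY_BANDWIDTH_GB_PER_S)

print(f"monolithic recovery: ~{monolithic_recovery_s:.0f} s")   # ~1000 s
print(f"chunked recovery:    ~{chunked_recovery_s:.0f} s")      # ~100 s

# Load balancing: moving one of 10 chunks off a hot machine sheds ~10% of its
# load, instead of the all-or-nothing choice you get with one big chunk.
```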
And a final pattern that you want to think about is elastic systems. So if you can't quite figure out what your peak load in your system is going to be—it's often difficult to figure that out, it changes over time throughout the day, you can get things like denial of service attacks where someone suddenly floods you with queries you didn't expect—you want to design the system to adapt. Ideally, it should automatically shrink the capacity at night when traffic is low and then grow it back during the day when it's high, but that takes a little bit of time. And so, you also want to make the system resilient when you do get overload unexpectedly. So you kind of want to do something reasonable even when you get twice as much load as you were expecting.
And there's lots of ways you can do this in different systems. So for example in web search, you can stop searching the full index, just search, you know, the first so many billion documents and chop off a little bit of the tail. Your quality will suffer a little bit but you're overloaded and it's better to return responses than not. You can drop the spelling correction tips if the spelling correction system suddenly starts being slow. And another thing that helps in overload is to start employing more aggressive load balancing when the imbalance becomes more severe.
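A sketch of that kind of graceful degradation, with made-up thresholds and stand-in functions: when measured load is high, the handler searches a smaller fraction of the index and sheds optional work like spelling tips instead of failing outright.

```python
# Graceful degradation under overload. Thresholds and helper functions are
# illustrative stand-ins, not a real search stack.
def handle_query(query, current_load, capacity):
    utilization = current_load / capacity

    # Under heavy load, search fewer shards (chop off the tail of the index).
    fraction_of_index = 1.0 if utilization < 0.9 else 0.6
    results = search_index(query, fraction_of_index)

    # Spelling tips are nice to have; shed them first when overloaded.
    spelling = suggest_spelling(query) if utilization < 1.0 else None

    return {"results": results, "spelling": spelling,
            "degraded": fraction_of_index < 1.0 or spelling is None}


def search_index(query, fraction_of_index):
    # Stand-in for the real distributed search.
    return [f"result for {query!r} over {fraction_of_index:.0%} of the index"]


def suggest_spelling(query):
    # Stand-in for the real spelling-correction service.
    return f"did you mean: {query}?"


if __name__ == "__main__":
    print(handle_query("envelope", current_load=50, capacity=100))
    print(handle_query("envelope", current_load=120, capacity=100))
```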
I think I'm going to skip this slide in the interest of time and go ahead to final thoughts. So, you know, I think one of the really exciting things about the period that we're in now is that we have really large collections of computational power in single locations and have very interesting data sets available. You know, large fractions of the world's knowledge are now online, there's all kinds of interesting new data sets available, and there's a proliferation of more powerful client devices that I think can interact with these data center-based services in interesting ways. So that brings about lots of interesting opportunities for new kinds of research. Thank you. I put together a list of papers that cover some of the material there, but if you go to labs.google.com/papers, there's a bunch of other ones there too and I'll take questions.
So when applications are deployed to data centers, do they express their demands as demands for a small number of fungible resources, i.e., so many CPUs, so many disks, or is it more complicated than that?
So the question is how application resource requirements are actually specified in our scheduling system. You actually ask for a certain number of tasks, and each task has a set of properties, like "I need two CPU cores, I need this many megabytes or gigabytes of memory." The scheduling system then tries to fit all these different tasks onto machines; basically it's a big bin-packing problem. That's the level at which people schedule things, and then on top of that people build higher-level things where you don't have to specify quite as much.
It's a multi-dimensional bin-packing thing? It is, yes. How many dimensions are there? Is it a handful, or is it...
It is just a handful. Basically memory, CPU, network, and disk are the main ones.
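A toy sketch of that picture: each task asks for a handful of resource dimensions, and placement is treated as bin packing over machines. The first-fit heuristic and all the names here are illustrative only, not Google's actual scheduler.

```python
# Multi-dimensional bin packing of tasks onto machines, first-fit style.
from dataclasses import dataclass


@dataclass
class Resources:
    cpu_cores: float
    memory_gb: float
    disk_gb: float
    network_mbps: float

    def fits_in(self, other):
        return (self.cpu_cores <= other.cpu_cores
                and self.memory_gb <= other.memory_gb
                and self.disk_gb <= other.disk_gb
                and self.network_mbps <= other.network_mbps)

    def subtract(self, other):
        return Resources(self.cpu_cores - other.cpu_cores,
                         self.memory_gb - other.memory_gb,
                         self.disk_gb - other.disk_gb,
                         self.network_mbps - other.network_mbps)


def first_fit(tasks, machines):
    """Assign each task to the first machine with room; return the placement."""
    free = list(machines)
    placement = {}
    for name, need in tasks:
        for i, capacity in enumerate(free):
            if need.fits_in(capacity):
                free[i] = capacity.subtract(need)
                placement[name] = i
                break
        else:
            placement[name] = None       # no machine had room
    return placement


if __name__ == "__main__":
    machines = [Resources(16, 64, 2000, 1000) for _ in range(2)]
    tasks = [("indexer/0", Resources(2, 8, 100, 50)),
             ("indexer/1", Resources(2, 8, 100, 50)),
             ("frontend/0", Resources(12, 48, 200, 400))]
    print(first_fit(tasks, machines))
```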
Seems very nerd-centric.
Node-centric. Sorry.
I was hoping it was nerd-centric.
Yeah, well that too. But node-centric.
Yeah, I mean, I didn't talk a lot about the networking because I tend to work on system-level software. But yeah, there are lots of interesting networking issues, and over time the networks we've been building in our data centers have grown in capability and bandwidth and come down in latency.
Does that... do you try to coordinate with the networking side, or does that sort of go on in parallel?
It goes on in parallel, but we kind of keep each other informed about what we want to see and what we expect to see in the network a year from now, and how we can take advantage of that.
What are the top things that keep you awake at night in terms of the future of infrastructure? What are the big problems?
So the question is what the big infrastructure problems are. I think one of them, which Spanner is partly a response to, is that it's very hard to operate a service that needs to run in multiple data centers, because a lot of the infrastructure we've built to this point has been very data-center-centric. So we've cobbled together a bunch of tools on the side that solve some aspects of cross-data-center service deployment: there's one tool that helps you monitor your jobs in multiple data centers, another that helps you transfer files around. I think that's caused a fair amount of complexity, because there are lots of different tools that each solve one aspect of the larger problem, and it means more complexity for people trying to do those deployments. So I think reducing complexity in this area is a big one.
So the volume of data on the web is growing, and the amount of information you can store in solid state in an office is also growing. How do those trends compare? I mean, are we reaching a point where you can put the whole web in a room, or is it burgeoning to where it'll be impossible to buy enough machines to keep up?
So the question is how the growth of the web is keeping up with our friends in the disk drive and flash industries. I think it depends how you define the web. The textual web is actually not growing as fast as device capabilities are, so you can use a few hundred disk drives now and keep a large fraction of the web in one cabinet or something. If you include video, though, there's a proliferation of high-quality video devices, which have a tremendous propensity to generate very high bandwidth requirements and very large amounts of data. So I think that will continue to be a big issue for the foreseeable future, because video will generate a lot more data than storage devices are really ready to deal with. You will not have all the world's video in your pocket anytime soon, let me put it that way.
How will you pay the electricity bills in 20 years?
How will we pay the electricity bills in 20 years? Probably through ads. But power usage within the data centers is a big issue, and there's a lot of both software and hardware work that can go on there: trying to generate idleness, trying to make systems more energy proportional, so that if you're using half of a machine's CPU it uses half the power. Right now, at half utilization, it's more like 70 or 80 percent of peak power. So I think a bunch of factors like that can help. Looking at lower-power CPUs is another interesting area. All these things together will, I think, make it so that we will be able to pay the electricity bills, but it's not an easy thing.
What kinds of special things do you do when computing PageRank with MapReduce? There are a lot of optimizations for PageRank, but a lot of them aren't on MapReduce, and I haven't seen much from Google in particular about what you guys actually do.
Yeah, there are a lot of things we don't necessarily publish very detailed papers on. It is possible to compute PageRank with MapReduce; it's just an iterative process. In each iteration, you essentially read through the link graph, flow the weights for that iteration, and then generate the outputs. It's not necessarily the best tool for that particular task, so you can build a more specialized system, and if computing PageRank very fast is very important to you, then you might not use MapReduce. I think MapReduce is pretty effective where the model fits really well, or where you're trying to get something quick and dirty running and you don't want to worry about optimizing the heck out of it. Once you know that computing PageRank is going to need a thousand nodes or something, then it makes sense to focus a little more optimization effort and maybe build something more customized.
Do you have any patterns that follow a MapReduce chain kind of thing?
So the question is whether we have MapReduce chains. Yeah, we would tend to implement that as a sequence of MapReduces, where the output of one MapReduce becomes the input to the map tasks of the next. A pretty common pattern is to have a whole chained sequence of MapReduces that each do some part of a larger algorithm in a sequence of steps.
So you do write the output out and then read it back in as a separate...?
Sometimes you might write out to a cheaper storage system, for example, if it's intermediate data and we know we could restart the whole computation from scratch if we want.
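As a simplified, in-memory stand-in for the chained-MapReduce PageRank just discussed, the sketch below expresses one iteration in a map/reduce style and chains the iterations, with the output of each feeding the next map phase. It is illustrative only (no real distributed shuffle, and dangling links are glossed over), not Google's implementation.

```python
# PageRank as a chain of map/reduce-style iterations, all in one process.
from collections import defaultdict

DAMPING = 0.85


def map_phase(link_graph, ranks):
    # For each page, emit (target, contribution) along its outlinks, plus
    # (page, 0.0) so pages with no inlinks still appear in the output.
    for page, outlinks in link_graph.items():
        yield page, 0.0
        if outlinks:
            share = ranks[page] / len(outlinks)
            for target in outlinks:
                yield target, share


def reduce_phase(mapped, num_pages):
    # "Shuffle": group contributions by page, then sum and apply damping.
    grouped = defaultdict(list)
    for page, contribution in mapped:
        grouped[page].append(contribution)
    return {page: (1 - DAMPING) / num_pages + DAMPING * sum(parts)
            for page, parts in grouped.items()}


def pagerank(link_graph, iterations=20):
    n = len(link_graph)
    ranks = {page: 1.0 / n for page in link_graph}
    for _ in range(iterations):                   # the "chain" of MapReduces
        ranks = reduce_phase(map_phase(link_graph, ranks), n)
    return ranks


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, rank in sorted(pagerank(graph).items()):
        print(page, round(rank, 3))
```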
Can you say something about Google Instant and what the impact of infrastructure is?
Sure. So Google Instant is basically a system that, in the background, predicts what query you're actually trying to issue when you've typed a few characters of it, and then prefetches the results for that. From a web search standpoint, it's essentially just getting certain kinds of requests. One of the things we do, related to what I said about making your system resilient to overload, is that those requests come tagged with a bit saying they're a predictive prefetch, not something the user actually pressed enter on. So if we get overloaded, we can drop those requests. Other than that, it's basically just capacity: you need more machines because you're issuing more searches. It turns out the predictive stuff does cache reasonably well, because you're essentially predicting queries that users have issued in the past. But it's mostly just a resource issue, not fundamental changes in the underlying infrastructure.
Okay, one last question. What is your experience on using distributed transactions within the... within a single data center?
So the question is what is our experience with using distributed transactions? Um so we don't have a huge amount of experience with that. A lot of the infrastructure we've built in the past has kind of avoided implementing distributed transactions. For example, Bigtable has single-row transactions but not distributed transactions. In retrospect, I think that was a mistake. We probably should have added them in because what ended up happening is a lot of people did want distributed transactions and so they hand-rolled their own protocols, sometimes incorrectly, and it would have been better to build it into the infrastructure. So in Spanner for example, we do have distributed transactions. We don't have a lot of experience with it yet. There are actually places in the Bigtable implementation where it would have been useful because we also hand-rolled our own protocols for a couple of the subtle features in there and it would have been better just to have that available to everyone. Okay, thank you.
For more, please visit us at stanford.edu.