This stuff is tough
(Cross-posted from the Google European Public Policy Blog)
Yesterday's news that the European Commission has opened a preliminary inquiry into competition complaints from three companies has generated a lot of questions about how Google's ranking works. Here, Amit Singhal, a Google Fellow responsible for ranking, who has worked in search for almost 20 years, explains the principles behind our algorithm.
Pop quiz. Get ready. You're only going to have a few milliseconds to answer this question, so look sharp. Here goes: "know the way to San Jose?" Now display the answer on a screen that’s about 14 inches wide and 12 inches tall. Find the answer from among billions and billions of documents. Wait a second - is this for directions or are we talking about the song? Too late. Just find the answer and display it. Now on to the next question. Because you'll have to answer hundreds of millions each day to do well at this test. And in case you find yourself getting too good at it, don’t worry: at least 20% of those questions you get every day you’ll have never seen before. Sound hard? Welcome to the wild world of search at Google. More specifically, welcome to the world of ranking.
Google ranking is a collection of algorithms used to seek out relevant and useful results for a user's query. There's a ton that goes into building a state-of-the-art ranking system like ours. Our algorithms use hundreds of different signals to pick the top results for any given query. Signals are indicators of relevance, and they include items as simple as the words on a webpage or more complex calculations such as the authoritativeness of other sites linking to any given page. Those signals and our algorithms are in constant flux, and are constantly being improved. On average, we make one or two changes to them every day.
Lately, I’ve been reading about whether regulators should look into dictating how search engines like Google conduct their ranking. While the debate unfolds about government-regulated search, let me provide some general thinking behind our approach to ranking. Future ranking experts (inside or outside government) might find it helpful. Our philosophy has three main elements:
1. Algorithmically-generated results.
2. No query left behind.
3. Keep it simple.
After nearly two decades, I’ve lost count of how many times I've been asked why Google chooses to generate its search results algorithmically. Here's how we see it: the web is built by people. You are the ones creating pages and linking to pages. We are utilizing all this human contribution through our algorithms to order and rank our results. We think that's a much better solution than a hand-arranged one. Other search engines approach this differently, selecting some results one at a time and manually curating what you see on the page. We believe that approach, which relies heavily on an individual's tastes and preferences, just doesn't produce the quality and relevance of ranking that our algorithms do. And given the hundreds of millions of queries we have to handle every day, it wouldn't be feasible to handle each by hand anyway.
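To make the idea of signals a bit more concrete, here is a toy sketch of ordering pages by combining two of the signal types mentioned above: the words on a page and the authority of the sites linking to it. Everything in it is an assumption made up for illustration: the `Page` fields, the two signal functions, and the weights are hypothetical, and a real ranking system uses hundreds of signals combined in ways this sketch does not attempt to reflect.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    url: str
    text: str
    # Hypothetical authority scores of the sites that link to this page.
    inbound_authority: List[float] = field(default_factory=list)


def term_signal(query: str, page: Page) -> float:
    """Fraction of query words that appear in the page text (toy signal)."""
    words = query.lower().split()
    if not words:
        return 0.0
    page_words = set(page.text.lower().split())
    return sum(w in page_words for w in words) / len(words)


def link_signal(page: Page) -> float:
    """Total authority of linking sites, squashed into the range [0, 1)."""
    total = sum(page.inbound_authority)
    return total / (1.0 + total)


def rank(query: str, pages: List[Page],
         w_term: float = 0.6, w_link: float = 0.4) -> List[str]:
    """Order page URLs by a weighted sum of the two toy signals.

    The weights are arbitrary illustrative values, not real ones.
    """
    scored = [
        (w_term * term_signal(query, p) + w_link * link_signal(p), p.url)
        for p in pages
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [url for _, url in scored]


pages = [
    Page("a", "directions to san jose california", [1.0, 1.0]),
    Page("b", "do you know the way to san jose lyrics"),
    Page("c", "cooking recipes", [3.0]),
]
print(rank("way to san jose", pages))
```

Nothing here is arranged by hand: the order falls out of what people wrote on the pages and which sites chose to link to them, so changing a weight or a signal changes the results for every query at once rather than for one.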
This brings me to the next point: leaving no query behind. Usually once I've explained to people the thinking behind algorithmically-generated results, some will ask me, "But what if you do a search, and the results you see are just plain lousy? Why wouldn't you just go in there by hand and change them?" The valid part of this question is that results are sometimes lousy. It happens. It happens all the time. Every day we get the right answers for people, and every day we get stumped. And we love getting stumped. Because more often than not, a broken query is just a symptom of a potential improvement to be made to our ranking algorithm. Improving the underlying algorithm not only improves that one query, it improves an entire class of queries, and often for all languages around the world in over 100 countries. I should add, however, that we do have clear written policies for websites that are included in our results, and we do take action on sites that are in violation of our policies or for a small number of other reasons (such as legal requirements, child porn, spam, viruses/malware, etc.). But those cases are quite different from the notion of rearranging the page you see one result at a time.
Finally, simplicity. This seems pretty obvious. Isn't it the desire of all system architects to keep their systems simple? We work very hard to keep our system simple without compromising on the quality of results. This is an ongoing effort, and a worthy one. Our commitment to simplicity has allowed us to innovate quickly, and it shows.
Ultimately, search is nowhere near a solved problem. Although I've been at this for almost two decades now, I'd still guess that search isn't quite out of its infancy yet. The science is probably just about at the point where we're crawling. Soon we'll walk. I hope that in my lifetime, I'll see search enter its adolescence.
In the meantime, we're working hard at our ongoing pop quizzes. Here's one last one: "search engine." In 0.14 seconds from among a few hundred million pages, our initial results are: AltaVista, Dogpile Web Search, Bing and Ask.com. I guess I'd better get back to work.