Working with news publishers
(Cross-posted from the European Public Policy Blog)
Last week, a group of newspaper and magazine publishers signed a declaration stating that "Universal access to websites does not necessarily mean access at no cost," and that they "no longer wish to be forced to give away property without having granted permission."
We agree, and that's how things stand today. The truth is that news publishers, like all other content owners, are in complete control not only of what content they make available on the web, but also of who can access it and at what price. This is the very backbone of the web -- there are many confidential company web sites, university databases, and private files of individuals that cannot be accessed through search engines. If they could be, the web would be much less useful.
For more than a decade, search engines have routinely checked for permissions before fetching pages from a web site. Millions of webmasters around the world, including news publishers, use a technical standard known as the Robots Exclusion Protocol (REP) to tell search engines whether or not their sites, or even just a particular web page, can be crawled. Webmasters who do not wish their sites to be indexed can and do use the following two lines to deny permission:
User-agent: *
Disallow: /
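That blanket rule is only one option. As an illustration (the directory name and the choice to single out our crawler are hypothetical, not drawn from any publisher's actual configuration), a site that wanted to keep only its paid archive out of Google while leaving the rest of the site crawlable could instead publish:
User-agent: Googlebot
Disallow: /archive/
Crawlers not named in the file are unaffected, and Googlebot skips anything under /archive/ while continuing to crawl every other page on the site.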
If a webmaster wants to stop us from indexing a specific page, he or she can do so by adding '<meta name="googlebot" content="noindex">' to the page. In short, if you don't want to show up in Google search results, it doesn't require more than one or two lines of code. And REP isn't specific to Google; all major search engines honor its commands. We're continuing to talk with the news industry -- and other web publishers -- to develop even more granular ways for them to instruct us on how to use their content. For example, publishers whose material goes into a paid archive after a set period of time can add a simple unavailable_after specification on a page, telling search engines to remove that page from their indexes after a certain date.
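In HTML terms, and purely as a sketch (the date shown is illustrative, though the tag names follow the syntax described above), the two page-level controls look like this:
&lt;!-- Keep this page out of Google's index altogether: --&gt;
&lt;meta name="googlebot" content="noindex"&gt;
&lt;!-- Or leave it indexable now, but have it dropped once it enters the paid archive (date is illustrative): --&gt;
&lt;meta name="googlebot" content="unavailable_after: 25-Aug-2009 15:00:00 GMT"&gt;
Either tag goes in the page's head; nothing else about the page, or the rest of the site, has to change.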
Today, more than 25,000 news organizations across the globe make their content available in Google News and other web search engines. They do so because they want their work to be found and read -- Google delivers more than a billion consumer visits to newspaper web sites each month. These visits offer publishers a business opportunity: the chance to hook a reader with compelling content, to make money from advertisements, or to sell online subscriptions. If at any point a web publisher feels we're not delivering value and wants us to stop indexing their content, they can do so quickly and effectively.
Some proposals we've seen from news publishers are well-intentioned, but would fundamentally change -- for the worse -- the way the web works. Our guiding principle is that whatever technical standards we introduce must work for the whole web (big publishers and small), not just for one subset or field. There's a simple reason behind this. The Internet has opened up enormous possibilities for education, learning, and commerce, so it's important that search engines make it easy for those who want to share their content to do so -- while also providing robust controls for those who want to limit access.
Update on 7/20/2009: The word "crawling" in the fourth paragraph has been replaced with "indexing."