Google's cache causes copyright concern

Like other online publishers, The New York Times charges readers to access articles on its Web site. But why pay when you can use Google instead?

Through a caching feature on the popular Google search site, people can sometimes call up snapshots of archived stories at NYTimes.com and other registration-only sites. The practice has proved a boon for readers hoping to track down Web pages that are no longer accessible at the original source, for whatever reason.

But the feature has recently been putting Google at odds with some unhappy publishers.

"We are working with Google to fix that problem -- we're going to close it so when you click on a link it will take you to a registration page," said Christine Mohan, a spokeswoman at New York Times Digital, the publisher of NYTimes.com. "We have established these archived links and want to maintain consistency across all these access points."

Google offers publishers a simple way to opt out of its temporary archive, and scuffles have yet to erupt into open warfare or lawsuits. Still, Google's cache links illustrate a slippery side of innovation on the Web, where cool new features that seem benign on the surface often carry unintended consequences.

The issue is particularly relevant at Google, a company that prides itself on creativity and routinely floats trial balloons for new features and services. Its culture of innovation may become increasingly risky as Google, which draws millions of visitors to its site daily and redirects them to others through secretive search formulas, cements its position as one of the most popular and powerful companies on the Web.

At the heart of Google's caching dilemma lies a thorny legal problem involving a core Web technology: When is it acceptable to copy someone else's Web page, even temporarily?

A phantom life for dead pages

Google's cache, a feature introduced in 1997, is unique among commercial search engines, but it's not unlike other archival sites on the Web that keep digital copies of Web pages.

Google's relatively little-known feature lets people access a copy of almost any Web page, within Google's own site, as it appeared when the search giant last indexed it. That could mean the page accessed is either minutes or months old, depending on when Google last crawled it.

Unlike formal Web archive projects, Google says its cache feature does not attempt to create a permanent historical record of the Web. Rather, the company actively seeks to delete dead links; once a Web page disappears, the search engine seeks to purge that record and any related cached page as quickly as possible.

Still, Google's cached pages have proven to be a treasure trove for investigators seeking to recover data pulled from public Web sites.

In one high-profile example, security and privacy expert Richard Smith copied Web pages detailing the backgrounds of Dr. John Poindexter, head of the Pentagon's Information Awareness Office (IAO), and other officials, from the Google cache days after they were removed from the IAO Web site. The pages were deleted after public reports surfaced on the office's development of a massive computer system to spy on Americans and potential terrorists.

"When something's been yanked, Google cache is a good place to grab it and save for posterity, because you don't know how long Google will have it," said Smith.

Google claims its caching feature benefits Web surfers by letting them access a site that may be malfunctioning or offline. Also, its cached pages highlight terms that match a search query "to make it easier for users to find relevant information," according to a spokesman at the California-based company.

Lawyers, start your search engines

As seemingly benign and beneficial as it is, some Web site operators take issue with the feature and digitally prevent Google from recording their pages in full by adding special code to their sites. Among other arguments, they say that cached pages at Google have the potential to detour traffic from their own site, or, at worst, constitute trademark or copyright violations. In the case of an out-of-date news page in Google's cache, a Web publisher could even face legal troubles because of false data remaining on the Web but corrected at its own site.

For this reason, search experts and copyright lawyers expect the issue to come up in a court of law, joining the ranks of copyright disputes that have surfaced because of technology innovation.

"It's very much an issue that has yet to be tested, and I fully expect that it will be," said Danny Sullivan, industry pundit and editor of Search Engine Watch.

Admittedly, Google's cache is like any number of backdoors to information on the Web. For example, proxy servers can be the key to a site that a visitor's hosting Web server has banned. And technically, any time a Web surfer visits a site, that visit could be interpreted as a copyright violation, because the page is temporarily cached in the user's computer memory.

The digital universe is constantly changing, but its content can be either fleeting or permanent. Several Web sites, including the Internet Archive Wayback Machine and the 11 September Digital Archive, have surfaced to preserve information on the Web and to keep permanent historical accounts of events and Web pages. Yet many more pages, even those in Google's cache, are eventually lost in the digital ether. The average lifespan of a Web site is 100 days, according to estimates by the Internet Archive.

Still, copyright lawyers and industry experts say that there are legally uncharted waters around a commercial caching service.

"Many of us copyright lawyers have been waiting for this issue to come up: Google is making copies of all the Web sites they index and they're not asking permission," said Fred Lohman, an attorney at the Electronic Frontier Foundation. "From a strict copyright standpoint, it violates copyright."

Most search engines make a statistical record of a Web page when they "spider" it, or use "robots" to scan the page for meaning or context relevant to search queries. For example, the engine can point to specific information contained on a page that's related to a search term, but it often doesn't have the complete picture of the page.

Google goes one step further, however, by taking a digital picture of pages and making it available to visitors in cached links. Those pictures exist temporarily on its site until the next time Google crawls that particular page, which can happen in a few days or in six weeks or more.

Legally, what could differentiate Google from other archival sites that record pages is that it is a commercial site and that it has enormous scope and influence on the Web.

But what's kept the feature off most Web sites' radar is that, anecdotally, most people don't click on the cache. Even Google says people only "occasionally" click its cached links. If more people did, Web publishers might lose visitors -- and potentially advertising dollars, which no one can afford as Web publishing gets back on its feet.

Practically speaking, Web sites can "opt out," or include code in their pages that bars Google from caching the page. A robots exclusion file, such as the one at "www.nytimes.com/robots.txt", or a "NOARCHIVE" tag typically does the job. And that's largely what's kept the cache feature from being controversial.

Search Engine Watch's Sullivan said that, even though some publishers are wary of the caching feature, many don't block Google's robots for fear of losing favour in the company's powerful search rankings.

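As a rough illustration of the two opt-out mechanisms mentioned above, the Python sketch below checks a robots exclusion file and a "NOARCHIVE" tag. The robots.txt rules, the page markup and the example.com address are made-up samples, not taken from any real publisher's site, and the helper names are hypothetical. It shows only how a well-behaved crawler might read these signals, not how Google actually does it.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Assumed sample: a robots.txt that bars Google's crawler from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""

# Assumed sample: a page that can still be indexed but asks not to be cached.
PAGE_HTML = """\
<html><head>
<meta name="robots" content="noarchive">
<title>Example story</title>
</head><body>Story text</body></html>
"""


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules let this crawler fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class NoArchiveFinder(HTMLParser):
    """Looks for a robots meta tag whose content includes 'noarchive'."""

    def __init__(self) -> None:
        super().__init__()
        self.noarchive = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name in ("robots", "googlebot"):
            directives = [d.strip() for d in content.split(",")]
            if "noarchive" in directives:
                self.noarchive = True


def asks_not_to_be_cached(html: str) -> bool:
    """Return True if the page carries a 'noarchive' robots meta tag."""
    finder = NoArchiveFinder()
    finder.feed(html)
    return finder.noarchive


if __name__ == "__main__":
    url = "https://example.com/story.html"  # placeholder address
    print("Googlebot may crawl:", crawler_allowed(ROBOTS_TXT, "Googlebot", url))
    print("Page asks not to be cached:", asks_not_to_be_cached(PAGE_HTML))
```

The two signals differ in reach: a robots.txt "Disallow" rule keeps the crawler away from the page altogether, while the "noarchive" tag leaves the page in the search index and only asks that no cached copy be offered.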
Sullivan said some Webmasters believe there's a stigma associated with the "no cache" tag, because many sites that use it have been accused of attempting to use banned methods to manipulate Google's rankings. Google said the "no cache" tag does not affect rankings.

Cache now, pay later?

Some legal experts say Google may be on shaky ground by caching first and asking questions later.

A provision in the Digital Millennium Copyright Act (DMCA) includes a safe harbour for Web caching. The safe harbour is narrowly defined to protect Internet service providers that cache Web pages to make them more readily accessible to subscribers. For example, AOL could keep a local copy of high-trafficked Web pages on its servers so that its members could access them with greater speed and less cost to the network. Copyright lawyers are divided on whether the safe harbour would protect Google if it were tested.

"Most people agree that the caching exception in the DMCA is obsolete," Lohman said. "I don't think it would cover Google's cache. Google is not waiting for users to request the page. It spiders the page before anyone asks for it."

Still, other lawyers argue that Google's practice would be protected by fair-use laws. A judge might look at the market impact of Google's caching and find that it's valuable, given that it could ultimately drive traffic to the cached site. Or the reverse could be true, depending on the nature of the page.

For its part, Google is confident that the service is within the law.

"We've evaluated this from a legal perspective, including copyright law, and have determined that Google's cached-page service complies with the law," a Google spokesman said.

A similar issue has played out in the courts in an image-searching case, Kelly v. Arriba Soft, filed in April 1999. Leslie Kelly, a photographer, sued the company for copyright infringement when its visual search finder catalogued thumbnails and full-sized images of his digital photos and made them accessible via its own search engine. The court initially ruled against Kelly based on the "established importance of search engines," but Kelly appealed and won.

In February 2002, the 9th US Circuit Court of Appeals held that Arriba's use of thumbnail images of Kelly's photos was fair use, but its display of full-size images was not fair use, because it was likely to harm the market for Kelly's work by reducing visits to his Web site and by allowing free downloads. But the opinion on full-size images was remanded by the 9th Circuit Court this week and is set to go to trial in the lower court of central California.

Judith Jennison, defence lawyer for Arriba Soft, said that one of the issues in the case is that Arriba Soft, in its process of indexing the Web, made copies of Kelly's photos and saved them for 24 hours in its servers. The 9th Circuit Court agreed that creating that copy is fair use under copyright law, she said, adding that there would be a slightly different analysis in a case related to Google. Also, the fact that the search site has an opt-out program would probably illustrate that the market for original copyrighted works can be protected, which is a significant factor in fair-use analysis.

"In Google's case, the result would likely be the same, because the temporary caching for indexing purposes would be fair use per Kelly v. Arriba Soft," Jennison said.

While it seems that many Net publishers haven't formed an official policy on Google caching, they say they are examining how it affects their business.

Randy Stearns, executive producer for ABCNews.com, said he's somewhat concerned about his company's news pages being archived temporarily on Google, because readers might access information that is not up-to-date or, in the worst case for a daily news outlet, is inaccurate. Theoretically, if a news report was issued with errors and was subsequently fixed on the publisher's site, but the erroneous report still existed in a cached version, it could raise legal issues for the publisher, he said.

Other publishers dismiss any threat, saying that not enough people actually click on those links to be a detriment to traffic.

"People who find objection to what Google does likely spend enormous amounts (of time) on their content and refresh it regularly," said Harry Lin, head of ABC.com.

In contrast with the priorities of some news publishers, Web archivists say preserving pages as they first appeared can offer important documentary records for historians and others.

Brewster Kahle, head of the Wayback Machine, said many people use its archive for patent research, or "prior art" searches. Designers and students have used the archive to see the evolution of Web site design and display, he added, and the Smithsonian has used subsets of the collection in the Presidential Election memorabilia room.

News publishers agree that Google's cache is also valuable if, for example, their site was inaccessible because of technical difficulties.

"It's a great, wonderful feature, and I don't know that copyright laws would protect them," said Search Engine Watch's Sullivan. "But most people are concerned about getting into Google, not getting out of it."