Google's cache causes copyright concern

Like other online publishers, The New York Times charges readers to access articles on its Web site. But why pay when you can use Google instead?

Through a caching feature on the popular Google search site, people can sometimes call up snapshots of archived stories at NYTimes.com and other registration-only sites. The practice has proved a boon for readers hoping to track down Web pages that are no longer accessible at the original source, for whatever reason. But the feature has recently been putting Google at odds with some unhappy publishers.

"We are working with Google to fix that problem -- we're going to close it so when you click on a link it will take you to a registration page," said Christine Mohan, a spokeswoman at New York Times Digital, the publisher of NYTimes.com. "We have established these archived links and want to maintain consistency across all these access points."

Google offers publishers a simple way to opt out of its temporary archive, and scuffles have yet to erupt into open warfare or lawsuits. Still, Google's cache links illustrate a slippery side of innovation on the Web, where cool new features that seem benign on the surface often carry unintended consequences.

The issue is particularly relevant at Google, a company that prides itself on creativity and routinely floats trial balloons for new features and services. Its culture of innovation may become increasingly risky as Google, which draws millions of visitors to its site daily and redirects them to others through secretive search formulas, cements its position as one of the most popular and powerful companies on the Web.

At the heart of Google's caching dilemma lies a thorny legal problem involving a core Web technology: When is it acceptable to copy someone else's Web page, even temporarily?

A phantom life for dead pages
Google's cache, a feature introduced in 1997, is unique among commercial search engines, but it's not unlike other archival sites on the Web that keep digital copies of Web pages. Google's relatively little-known feature lets people access a copy of almost any Web page, within Google's own site, in the form it was in whenever last indexed by the search giant. That could mean the page accessed is either minutes or months old, depending on when Google last crawled it.

Unlike formal Web archive projects, Google says its cache feature does not attempt to create a permanent historical record of the Web. Rather, the company actively seeks to delete dead links; once a Web page disappears, the search engine seeks to purge that record and any related cached page as quickly as possible.

Still, Google's cached pages have proven to be a treasure trove for investigators seeking to recover data pulled from public Web sites. In one high-profile example, security and privacy expert Richard Smith copied Web pages detailing the backgrounds of Dr. John Poindexter, head of the Pentagon's Information Awareness Office (IAO), and other officials, from the Google cache days after they were removed from the IAO Web site. The pages were deleted after public reports surfaced on the office's development of a massive computer system to spy on Americans and potential terrorists.

"When something's been yanked, Google cache is a good place to grab it and save for posterity, because you don't know how long Google will have it," said Smith.

Google claims its caching feature benefits Web surfers by letting them access a site that may be malfunctioning or offline. Also, its cached pages highlight terms that match a search query "to make it easier for users to find relevant information," according to a spokesman at the California-based company.

Lawyers, start your search engines
As seemingly benign and beneficial as it is, some Web site operators take issue with the feature and digitally prevent Google from recording their pages in full by adding special code to their sites. Among other arguments, they say that cached pages at Google have the potential to detour traffic from their own site, or, at worst, constitute trademark or copyright violations. In the case of an out-of-date news page in Google's cache, a Web publisher could even face legal troubles because of false data remaining on the Web but corrected at its own site.

For this reason, search experts and copyright lawyers expect the issue to come up in a court of law, joining the ranks of copyright disputes that have surfaced because of technological innovation. "It's very much an issue that has yet to be tested, and I fully expect that it will be," said Danny Sullivan, industry pundit and editor of Search Engine Watch.

Admittedly, Google's cache is like any number of backdoors to information on the Web. For example, proxy servers can be the keys to a site that is banned by a visitor's hosting Web server. And technically, any time a Web surfer visits a site, that visit could be interpreted as a copyright violation, because the page is temporarily cached in the user's computer memory.

The digital universe is constantly changing, but its content can be either fleeting or permanent. Several Web sites, including the Internet Archive Wayback Machine and the 11 September Digital Archive, have surfaced to preserve information on the Web and to keep permanent historical accounts of events and Web pages. Yet, many more pages, and even those in Google's cache, are eventually lost in the digital ether. The average lifespan of a Web site is 100 days, according to estimates by the Internet Archive.

Still, copyright lawyers and industry experts say that there are legally uncharted waters around a commercial caching service.
"Many of us copyright lawyers have been waiting for this issue to come up: Google is making copies of all the Web sites they index and they're not asking permission," said Fred Lohman, an attorney at the Electronic Frontier Foundation. "From a strict copyright standpoint, it violates copyright." Most search engines make a statistical record of a Web page when they "spider" it, or use "robots" to scan the page for meaning or context to related queries. For example, the engine can point to specific information contained on a page that's related to a search term, but it often doesn't have the complete picture of the page. Google goes one step further, however, by taking a digital picture of pages and making it available to visitors in cached links. Those pictures exist temporarily on its site until the next time Google crawls that particular page, which can happen in a few days or in six weeks or more. Legally, what could differentiate Google from other archival sites that record pages is that it is a commercial site and that it has enormous scope and influence on the Web. But what's kept the feature off most Web sites' radar is that, anecdotally, most people don't click on the cache. Even Google says people only "occasionally" click its cached links. If more people did, Web publishers might lose visitors -- and potentially advertising dollars, which no one can afford as Web publishing gets back on its feet. Practically speaking, Web sites can "opt out," or include code in their pages that bars Google from caching the page. A tag to exclude "robots" such as "www.nytimes.com/robots.txt" or "NOARCHIVE" typically does the job. And that's largely what's kept the cache feature from being controversial. Search Engine Watch's Sullivan said that, even though some publishers are wary of the caching feature, many don't block Google's robots for fear of losing favour in the company's powerful search rankings. 
Sullivan said some Webmasters believe there's a stigma associated with the "no cache" tag, because many sites that use it have been accused of attempting to use banned methods to manipulate Google's rankings. Google said the "no cache" tag does not affect rankings.

Cache now, pay later?
Some legal experts say Google may be on shaky ground by caching first and asking questions later. A provision in the Digital Millennium Copyright Act (DMCA) includes a safe harbour for Web caching. The safe harbour is narrowly defined to protect Internet service providers that cache Web pages to make them more readily accessible to subscribers. For example, AOL could keep a local copy of high-trafficked Web pages on its servers so that its members could access them with greater speed and less cost to the network. Copyright lawyers are divided on whether the safe harbour would protect Google if it were tested.

"Most people agree that the caching exception in the DMCA is obsolete," Lohman said. "I don't think it would cover Google's cache. Google is not waiting for users to request the page. It spiders the page before anyone asks for it."

Still, other lawyers argue that Google's practice would be protected by fair-use laws. A judge might look at the market impact of Google's caching and find that it's valuable, given that it could ultimately drive traffic to the cached site. Or the reverse could be true, depending on the nature of the page.

For its part, Google is confident that the service is within the law. "We've evaluated this from a legal perspective, including copyright law, and have determined that Google's cached-page service complies with the law," a Google spokesman said.

A similar issue has played out in the courts in an image-searching case, Kelly v. Arriba Soft, filed in April 1999. Leslie Kelly, a photographer, sued the company for copyright infringement when its visual search finder catalogued thumbnails and full-sized images of his digital photos and made them accessible via its own search engine. The court initially ruled against Kelly based on the "established importance of search engines," but Kelly appealed and won.
In February 2002, the 9th US Circuit Court of Appeals held that Arriba's use of thumbnail images of Kelly's photos was fair use, but its display of full-size images was not fair use, because it was likely to harm the market for Kelly's work by reducing visits to his Web site and by allowing free downloads. But the opinion on full-size images was remanded by the 9th Circuit Court this week and is set to go to trial in the lower court of central California.

Judith Jennison, defence lawyer for Arriba Soft, said that one of the issues in the case is that Arriba Soft, in its process of indexing the Web, made copies of Kelly's photos and saved them for 24 hours in its servers. The 9th Circuit Court agreed that creating that copy is fair use under copyright law, she said, adding that there would be a slightly different analysis in a case related to Google. Also, the fact that the search site has an opt-out program would probably illustrate that the market for original copyrighted works can be protected, which is a significant factor in fair-use analysis.

"In Google's case, the result would likely be the same, because the temporary caching for indexing purposes would be fair use per Kelly v. Arriba Soft," Jennison said.

While it seems that many Net publishers haven't formed an official policy on Google caching, they say they are examining how it affects their business. Randy Stearns, executive producer for ABCNews.com, said he's somewhat concerned about his company's news pages being archived temporarily on Google, because readers might access information that is not up-to-date or, in the worst case for a daily news outlet, is inaccurate. Theoretically, if a news report was issued with errors and was subsequently fixed on the publisher's site, but the erroneous report still existed in a cached version, it could raise legal issues for the publisher, he said.
Other publishers dismiss any threat, saying that not enough people actually click on those links to be a detriment to traffic. "People who find objection to what Google does likely spend enormous amounts (of time) on their content and refresh it regularly," said Harry Lin, head of ABC.com.

In contrast with the priorities of some news publishers, Web archivists say preserving pages as they first appeared can offer important documentary records for historians and others. Brewster Kahle, head of the Wayback Machine, said many people use its archive for patent research, or "prior art" searches. Designers and students have used the archive to see the evolution of Web site design and display, he added, and the Smithsonian has used subsets of the collection in the Presidential Election memorabilia room.

News publishers agree that Google's cache is also valuable if, for example, their site was inaccessible because of technical difficulties. "It's a great, wonderful feature, and I don't know that copyright laws would protect them," said Search Engine Watch's Sullivan. "But most people are concerned about getting into Google, not getting out of it."