Anti-spam tool helps digitise books

A group of Carnegie Mellon University programmers has launched a service called "ReCaptcha" that can help cut down on spam while letting people digitise books.

The project is a variation of the widely used "Captcha" (Completely Automated Public Turing test to tell Computers and Humans Apart) technique, which is used to weed out computer abuse such as sending spam email or posting spam in blog comments. Captchas require users to pass little pattern-recognition tests, commonly reading distorted or obscured words.

ReCaptcha turns this chore into a productive task by letting users digitise scanned images of words that computers couldn't figure out.

"Not only can you solve your problems with spam, you can help preserve mankind's written history into the digital age," said Ben Maurer, the project's chief architect and a Carnegie Mellon University undergraduate, announcing the project on his blog on Wednesday.

Since the project launched on Tuesday, 150 websites have begun using it, said Luis von Ahn, a Carnegie Mellon assistant professor and ReCaptcha's "executive producer". In just the first half of Thursday, the project had digitised 8,000 words, he said.

It's a new example of how the internet can harness the collective energies of large numbers of people. Other examples include news sites such as Digg and Slashdot, which give prominence to content that users rate highly, and stock photography seller iStockphoto, which is beta testing an "Image Fight" site to rate photo quality.

ReCaptcha has the potential to digitise vast quantities of words. Von Ahn estimates that people perform 60 million Captcha tests daily.

The service presents users with two words: one from a conventional Captcha test and the other an unknown word that computerised optical character recognition (OCR) software couldn't figure out. If the user correctly identifies the known word, he or she is presumed to have decoded the unknown one as well. Currently, ReCaptcha requires three separate people to digitise the word the same way before it's determined to be correct, von Ahn said.
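
The two-word check described above can be sketched in a few lines of Python. This is only an illustration of the idea as von Ahn describes it, not ReCaptcha's actual code: the class name, method names and the case-insensitive matching are all assumptions made for the example. The only detail taken from the article is the threshold of three matching answers.

```python
from collections import Counter, defaultdict

# From the article: three separate people must give the same answer.
REQUIRED_AGREEMENT = 3


class WordPairChallenge:
    """Hypothetical model of the known-word / unknown-word test."""

    def __init__(self):
        # unknown word id -> tally of normalised transcriptions
        self.votes = defaultdict(Counter)
        # unknown word id -> users whose answer has already been counted
        self.voters = defaultdict(set)
        # unknown word id -> transcription accepted as correct
        self.accepted = {}

    def submit(self, user_id, control_answer, control_truth,
               unknown_id, unknown_answer):
        """Return True if the user passes the test (known word is right)."""
        if control_answer.strip().lower() != control_truth.strip().lower():
            # Failed the known word: reject and discard the other answer.
            return False
        if unknown_id in self.accepted or user_id in self.voters[unknown_id]:
            # Word already settled, or this user has already been counted.
            return True
        self.voters[unknown_id].add(user_id)
        guess = unknown_answer.strip().lower()
        self.votes[unknown_id][guess] += 1
        if self.votes[unknown_id][guess] >= REQUIRED_AGREEMENT:
            self.accepted[unknown_id] = guess
        return True
```

In this sketch a transcription is trusted only once three different users who also passed the known word have typed the same thing, so a single wrong or malicious answer cannot settle a word on its own.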

Von Ahn was a member of the Carnegie Mellon team that developed Captcha in response to a Yahoo request for technology to keep computers from registering bogus email accounts, according to Carnegie Mellon. He's a recipient of a MacArthur Foundation "genius grant", which funded some ReCaptcha work.

Digital libraries
The ReCaptcha project is digitising books in the Internet Archive, a project that is building a digital library of cultural materials and that operates the Wayback Machine, an archive of historical website snapshots.

Among the first books being digitised is Psychology by philosopher John Dewey, von Ahn said. The project is considering other book archives, too, he added.

The ReCaptcha service is available now through an application programming interface (API) for people to integrate into their websites. Software plug-ins to use the API are open-source software packages hosted at Google Code.

ReCaptcha also can be used to shield email addresses from computers that harvest them for spam mailing lists.

Von Ahn's specialty is what he calls "human computation", which he defines as "novel techniques for utilising the computational abilities [or "cycles"] of humans".

Microsoft Research has its own philanthropic variation of Captcha technology: a project called Asirra that shows pictures of cats and dogs rather than text. Computers do a poor job telling the animals apart, but people can. To get a supply of constantly refreshed pet images, Microsoft pulls photos — and "adopt me" links — from the Petfinder website.

Two of von Ahn's higher-profile projects were the online games ESP Game and Peekaboom, which rely on crowds to label images. Like reading obfuscated text, labelling images is a task at which computers are poor.

Google licensed the ESP Game technology and offers it as its Google Image Labeler to improve its own image-search technology.

Carnegie Mellon is hosting the ReCaptcha service on $30,000 (£15,112) worth of servers donated by Intel, von Ahn said. Other sponsors include Novell, which contributed its Suse Linux Enterprise Server support subscriptions, and Carnegie Mellon.
