Why reCAPTCHA is Good for Humanity

Last week we talked about KittenAuth, a novel CAPTCHA system used to differentiate between humans and spambots -- by using pictures of kittens. Today let's take a look at reCAPTCHA, the system in use by this very blog. What does it do, and why is it good for humanity?

What's a CAPTCHA?

First let's review the term CAPTCHA. It's a loose acronym for "Completely Automated Public Turing test to tell Computers and Humans Apart." The idea is to force humans to do a (relatively) simple task, like reading a few words presented in an image and typing them into the form -- but this trick only works if the task is hard for computers (ahem, spambots) to do.

CAPTCHA systems are used on forms all over the web in order to cut down on spam form submissions. If you've ever run a blog, you'll know that legions of spambots are crawling the web, submitting every form they find -- so having a CAPTCHA on the form drastically reduces form spam. However, in most CAPTCHA systems the text you type in is meaningless, purposely scrambled text. reCAPTCHA is different.

What's Different About reCAPTCHA?

reCAPTCHA was born when Luis von Ahn, an assistant professor at Carnegie Mellon, realized that millions of people were spending time typing meaningless words into forms. Why not turn this word-decipherment into useful work that helped with some common goal? What if there were a set of words (as images) that needed to be viewed and deciphered by humans? It turns out that book scanning projects (including the Internet Archive) have just this problem: when scanning a print book into a computer -- particularly an old book in poor condition -- some words can't be deciphered automatically by Optical Character Recognition (OCR) software, and need a human to figure them out. In order to get a good text-only copy of a scanned book, lots of human attention is needed.

So reCAPTCHA is conceptually simple: take the words the OCR software can't read and put them in front of human users. If multiple users type the same text for the same hard-to-read word, reCAPTCHA can safely assume that it has been properly deciphered, and feed that word back into the book scanning project, slotting it into its associated book. Thus, text that is by definition difficult or impossible for a computer to accurately scan has been deciphered by humans -- and the humans doing the work generally don't even know it!
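
For the curious, the agreement idea can be sketched in a few lines of Python. This is only an illustration of the concept, not reCAPTCHA's actual code: the function name consensus and the thresholds min_votes and min_share are made up for this example, and the real system's rules aren't described here.

```python
from collections import Counter


def consensus(transcriptions, min_votes=3, min_share=0.75):
    """Return the agreed-upon reading of a word, or None if users don't agree yet.

    min_votes and min_share are illustrative assumptions, not reCAPTCHA's real values.
    """
    # Ignore case and stray whitespace so "Morning " and "morning" count as the same answer.
    normalized = [t.strip().lower() for t in transcriptions if t.strip()]
    if len(normalized) < min_votes:
        return None  # not enough people have seen this word yet

    text, count = Counter(normalized).most_common(1)[0]
    if count >= min_votes and count / len(normalized) >= min_share:
        return text  # enough users typed the same thing
    return None  # answers are too split to trust any of them
```

With these made-up thresholds, three users typing "morning" would settle the word, while a three-to-two split between "morning" and "mourning" would leave it undecided until more people weigh in.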

Yeah, But...

There's one technical catch -- what's to stop people from typing in random gibberish as "decipherment" of the words? Given that reCAPTCHA by definition doesn't know the correct decipherment of its subject words, how can it judge whether you've gotten it right? To solve this problem, reCAPTCHA presents two words together: one unknown and one known (the latter meaning a word for which reCAPTCHA already has a good decipherment). You have to get the known word correct, and the unknown word is (as described above) compared with other users' decipherments to eventually determine whether it's correct. There's also an audio variant for users with visual impairment, in which they listen to spoken language and convert it to written text.
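
Here's a rough Python sketch of that two-word check, just to make the logic concrete. Everything in it is hypothetical: the Challenge class, the check_response function, and the votes dictionary are names invented for this example, and the real service surely differs in its details (for instance, in how forgiving it is about typos).

```python
from dataclasses import dataclass


@dataclass
class Challenge:
    known_answer: str  # the word reCAPTCHA already has a good decipherment for
    unknown_id: str    # hypothetical identifier for the word OCR couldn't read


def check_response(challenge, typed_known, typed_unknown, votes):
    """Pass the user only if the known word is right; record the other guess only then."""
    if typed_known.strip().lower() != challenge.known_answer.lower():
        return False  # failed the control word, so the other guess is discarded

    # The user proved they can read, so their guess for the unknown word
    # is stored to be compared with other users' guesses later.
    guess = typed_unknown.strip().lower()
    votes.setdefault(challenge.unknown_id, []).append(guess)
    return True
```

The point of the design is that a spambot typing gibberish fails the known word, so its gibberish never counts as a vote on the unknown one.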

So next time you fill out a reCAPTCHA form when commenting on a Mental Floss blog post, remember: you're helping to digitize books!

Further reading: Carnegie Mellon press release, Wikipedia page, reCAPTCHA project site.
