Last week we talked about KittenAuth, a novel CAPTCHA system used to differentiate between humans and spambots — by using pictures of kittens. Today let’s take a look at reCAPTCHA, the system in use by this very blog. What does it do, and why is it good for humanity?
First let’s review the term CAPTCHA. It’s a loose acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart.” The idea is to force humans to do a (relatively) simple task like read a few words presented in an image, then type them into the form — but this trick only works if the task is hard for computers (ahem, spambots) to do.
CAPTCHA systems are used on forms all over the web in order to cut down on spam form submissions. If you’ve ever run a blog, you’ll know that legions of spambots are crawling the web, submitting every form they find — so having a CAPTCHA on the form drastically reduces form spam. However, in most CAPTCHA systems the text you type in is meaningless, purposely scrambled text. reCAPTCHA is different.
reCAPTCHA was born when Luis von Ahn, an assistant professor at Carnegie Mellon, realized that millions of people were spending time typing meaningless words into forms. Why not turn this word-decipherment into useful work that helped with some common goal? What if there was a set of words (as images) that needed to be viewed and deciphered by humans? It turns out that book scanning projects (including the Internet Archive) have just this problem: when scanning a print book into a computer — particularly an old book in poor condition — some words can’t be deciphered automatically by Optical Character Recognition (OCR) software, and need a human to figure them out. In order to get a good text-only copy of a scanned book, lots of human attention is needed.
So reCAPTCHA is conceptually simple: take the words the OCR software can’t read and put them in front of human users. If multiple users independently type the same text for the same hard-to-read word, reCAPTCHA can safely assume that the word has been properly deciphered and feed it back into the book scanning project, slotting it into its associated book. Thus, text that is by definition difficult or impossible for a computer to accurately scan has been deciphered by humans, and the humans doing the work generally don’t even know it!
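For the curious, here is a rough Python sketch of that agreement step. It is only an illustration of the idea, not reCAPTCHA's actual code: the function name `consensus` and the thresholds `min_votes` and `min_share` are made up for this example.

```python
from collections import Counter


def consensus(transcriptions, min_votes=3, min_share=0.75):
    """Return the agreed-upon text, or None if readers have not yet agreed.

    min_votes and min_share are hypothetical thresholds chosen for
    illustration; the real system's rules are not described in this post.
    """
    # Normalize lightly so "Bourbon " and "bourbon" count as the same answer.
    cleaned = [t.strip().lower() for t in transcriptions if t.strip()]
    if len(cleaned) < min_votes:
        return None

    # Find the most common answer and how many readers typed it.
    text, votes = Counter(cleaned).most_common(1)[0]
    if votes >= min_votes and votes / len(cleaned) >= min_share:
        return text
    return None
```

With these made-up thresholds, three matching answers are enough to accept a word, while a split vote leaves it undecided until more people have typed it in.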
There’s one technical catch — what’s to stop people from typing in random gibberish as “decipherment” of the words? Given that reCAPTCHA by definition doesn’t know the correct decipherment of its subject words, how can it judge whether you’ve gotten it right? To solve this problem, reCAPTCHA presents two words together: one unknown and one known (the latter meaning a word for which reCAPTCHA already has a good decipherment). You have to get the known word correct, and the unknown word is (as described above) compared with other users’ decipherments to eventually determine whether it’s correct. There’s also an audio variant for users with visual impairment, in which they listen to spoken language and convert it to written text.
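The two-word check can be sketched in a few lines of Python. Again, this is a guess at the general shape rather than the real implementation: `check_submission`, the in-memory `guesses` store, and the choice to record a guess only when the known word is correct are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical in-memory store: unknown word image id -> list of guesses.
guesses = defaultdict(list)


def check_submission(known_answer, typed_known, unknown_id, typed_unknown):
    """Pass the user only if the known (control) word matches.

    The guess for the unknown word is recorded for later comparison with
    other users' guesses. Recording it only on a pass is an assumption.
    """
    passed = typed_known.strip().lower() == known_answer.strip().lower()
    if passed:
        # Only readers who got the control word right get a vote
        # on the unknown word.
        guesses[unknown_id].append(typed_unknown.strip().lower())
    return passed
```

The design point is that the known word does the gatekeeping, so random gibberish fails the form, while the unknown word simply collects votes until enough readers agree.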
So next time you fill out a reCAPTCHA form when commenting on a Mental Floss blog post, remember: you’re helping to digitize books!
Further reading: Carnegie Mellon press release, Wikipedia page, reCAPTCHA project site.
Look ma, I’m helping type books!
I was just thinking about this, and considering using it in future projects that require forms on the web. Thank you for the great writeup and the extra resources you’ve provided here.
posted by Benjamin M. Strozykowski on 6-10-2008 at 12:30 pm
Amazing. My reCAPTCHA is “Cooley ty-four” by the way, whatever that means.
posted by Xavier on 6-10-2008 at 12:37 pm
That’s awesome. :)
mine = Feist Dunwoodie
posted by caitlen315 on 6-10-2008 at 1:04 pm
That is cool, kind of makes me want to post a bunch of nonsense comments just to help translate books.
posted by Witty Nickname on 6-10-2008 at 2:08 pm
You don’t have to read this comment.
posted by Witty Nickname on 6-10-2008 at 2:09 pm
Just doing my part for the sake of literacy.
‘metals his_name,’ whatever that means.
posted by adrienne on 6-10-2008 at 2:10 pm
Witty Nickname is on to something here…
autograph 251
posted by adrienne on 6-10-2008 at 2:13 pm
Is my reCAPTCHA political or social commentary? highways fail
posted by Judy on 6-10-2008 at 2:27 pm
Russian Christ
posted by Amy on 6-10-2008 at 2:40 pm
Haha, this actually gives me some incentive to comment. I hope more places pick this up, it’s a great idea.
posted by jopari on 6-10-2008 at 3:36 pm
That’s brilliant!
(Clington 8vo….???)
posted by Dawn on 6-10-2008 at 4:12 pm
I really love this idea! However, it does make my librarian-brain open up with all the thoughts about the good and bad stuff about digitizing books. Copyright, fair use, the potential for damage to books in the digitization process, blah blah.
That being said, I’m still commenting. :)
posted by Julia on 6-10-2008 at 4:28 pm
I heard about this before. Nice to see it boiled down like this instead of in a ten minute video. They spent the whole time saying the same thing.
I have to see if I can misspell my captcha. Evil, I know, but others will get it right and mine will be voted out.
posted by Nicole on 6-10-2008 at 4:34 pm
My reCAPTCHA says “Claus Lydia.” Sounds like a weirdoid performance artist to me!
posted by Chris Higgins on 6-10-2008 at 5:28 pm
That’s pretty cool.
Accurate Maritime, sounds like a band xD
posted by Nick on 6-10-2008 at 5:32 pm
The fastest way through Amish country: Obadiah Railroad. Too bad they can’t use it!
posted by pocketdoc on 6-10-2008 at 5:34 pm
I often have to reload the page to get two words I can decipher. Now I can try just guessing if I can read one of the words. For this comment, they are giving me a long number.
posted by Miss Cellania on 6-10-2008 at 6:28 pm
Totally cool!! I just may tell a blogger friend about this, maybe he can put it on his blog. He’s an English teacher by training, and a newspaper reporter by trade, so he might really dig this.
posted by Amy on 6-10-2008 at 6:54 pm
Oh, also note that if you can’t read your reCAPTCHA words, you can hit the little “reload” button (two arrows making a circle) in the reCAPTCHA itself — it’ll load up two more words.
(Ahem, “Associated Judge” this time.)
posted by Chris Higgins on 6-10-2008 at 7:00 pm
That is awesome!
posted by krikketgirl on 6-10-2008 at 8:28 pm
I had been wondering what the reCAPTCHA had to do with books, and now I know! Thanks, mental_floss for doing the research that I am too busy playing MarioKart Wii to do!
The best one I ever had was a couple of weeks ago: ‘butt stickers.’
Today? ‘masked man’
posted by Rachel on 6-10-2008 at 8:33 pm
So if they want us to decipher text that OCR can’t read, why do they make it wavy and put a line thru it?
persuasive PUNTA
posted by PartiallyDeflected on 6-10-2008 at 8:42 pm
On track for record posts b/c ya know now we HAVE to post our fantastically random (or poetically appropriate) recaptchas. This far addicting than looking for meaning in your horoscope or fortune cookie. And you can always hit refresh to get a better answer!
My 1st was “trading hardly”; intriguing, but since I could not decipher a fully relevant meaning I moved on… 2nd was “society durling” (alas, how i wanted it to declare “society darling”, so my southern grandmother could have her intentions of me being a debutante realized, at least by a computer.
My 3rd and final recaptha was Migration Bourbon. I’m not sure what it means, but I really like the sound of it, so I’ll stop…for now…
posted by ashleyrobin on 6-10-2008 at 10:53 pm
Or…this time it says
On track for record posts b/c ya know now we HAVE to post our fantastically random (or poetically appropriate) recaptchas. This IS far MORE addicting than looking for meaning in your horoscope or fortune cookie. And you can always hit refresh to get a better answer!
My 1st was “trading hardly”; intriguing, but since I could not decipher a fully relevant meaning I moved on… 2nd was “society durling” (alas, how i wanted it to declare “society darling”, so my southern grandmother could have her intentions of me being a debutante realized, at least by a computer.
My 3rd and final recaptha was Migration Bourbon. I’m not sure what it means, but I really like the sound of it, so I’ll stop…for now…
It probably doesn’t read any better, my spellcheck filter is off, due to wine consumption, but i could have sworn that some of the words necessary for basic comprehension were previously there….oh, well, i got “yes, easily” for the repost, obviously i’m in love with recaptha :-)
posted by ashleyrobin on 6-10-2008 at 11:49 pm
Ignore me, I’m just here to help bump up the comment count.
front Leaden
blanks pei
signer loan (gonna go with this one)
posted by adrienne on 6-10-2008 at 11:59 pm
this is awesome in several ways
posted by pc on 6-11-2008 at 1:00 am
I admit that reCAPTCHA is cool but the one with the pictures on it is better!
posted by Leizl on 6-11-2008 at 1:34 am
Trading hardly… I’m teaching English in Korea at the moment and this is a really common mistake students make here! “I am studying so hardly!” they proclaim.
Amandes bread is my recaptcha.
posted by Steve on 6-11-2008 at 2:12 am
HA! My reCAPTCHA? submitted post
posted by Mandy on 6-11-2008 at 8:45 am
I’m with PartiallyDeflected on this one – if they want us to decipher text that OCR can’t read, why do they make it wavy and put a line thru it?
“Millard of” – eh
“Cates side-tracking” Perfect! If only they had spelled my name right.
posted by caitlen315 on 6-11-2008 at 9:36 am
I’m going to be so much less annoyed now! I had no idea. I thought the words were random.
Mine right now says “gruesome man”
posted by Melissa on 6-11-2008 at 11:13 am
Funny I never noticed that it said “stop spam read books” under the recaptcha logo.
Makes sense.
research de- was mine
posted by Nerak on 6-11-2008 at 11:19 am
Mine- Mrs snowy. I love it!
posted by Christina on 6-11-2008 at 11:50 am
Awesome!
previously pores
As headliners
Mortgage hat
Cortlandt aid
Nippu would
posted by Jess on 6-11-2008 at 11:59 am
Mine sounds like something I’d say in everyday speech
“k anyway”…
posted by Fruppi on 6-11-2008 at 12:22 pm
I’ve just gotta try this reCAPTCHA thing out…
Duffield plays
Hmmm… Yes they do… I guess…
posted by LinuXtreme on 6-11-2008 at 5:27 pm
My reCAPTCHA is:
grindings have
Grindings have? Have what? Don’t leave me hanging!
posted by Pitr on 6-11-2008 at 5:30 pm
Mine says “vigor includes” and now I’m curious what vigor actually does include!
posted by Elphaba on 6-11-2008 at 6:31 pm
“unconstrained 10”
I have a new nickname for my incredible wife!
posted by teammilehi on 6-12-2008 at 9:38 am
I commented just to do the captcha
posted by . on 6-15-2008 at 11:46 pm
stated prettiest. (nya nya)
posted by reCAPTCHA says I am, so it is on 8-21-2008 at 4:18 pm
prone Masonic
posted by . on 10-11-2009 at 2:36 pm
One for the Books.
posted by jen on 3-25-2010 at 3:29 am
Just figured I’d comment for the fun of it.
posted by Alex on 6-16-2010 at 3:03 pm
That’s one of the best ideas I’ve heard in a while
posted by Gerald on 3-26-2011 at 6:59 pm
People use a program to bypass the capthca now and program it to use the word “nigger” instead of the scanned word. If you try it you will see it works. Just type the word that is the one the captcha knows then for the scanned word use “nigger” It passes it no problem.
posted by cookies on 4-11-2011 at 3:46 pm
i don’t enjoy being used without pay
posted by Ziggymonkeydoo on 4-11-2011 at 9:25 pm
Just to help out
posted by lolt on 4-11-2011 at 11:24 pm
what they said
posted by Ryan on 5-30-2011 at 8:48 am
hmm
posted by Leah on 6-4-2011 at 10:52 pm
reCaptcha has been so terrible that, even as a human, I have a hard time proving it.
There are better ways to prevent spam, such as the honeypot, which is unobtrusive.
I’d rather not punish my visitors by making them figure out illegible words like a child.
posted by Joe Sak on 6-22-2011 at 1:08 am
so all of the people on 4chan that figure out which the unknown word is and then always type the N word or something else for it are seriously messing with books and science?
posted by Anon on 6-22-2011 at 8:08 am
Superb idea…awesome….amazing just amazing
posted by Aarsha on 8-27-2011 at 6:59 am