Why reCAPTCHA is Good for Humanity
by Chris Higgins - June 10, 2008 - 12:11 PM

Last week we talked about KittenAuth, a novel CAPTCHA system used to differentiate between humans and spambots — by using pictures of kittens. Today let’s take a look at reCAPTCHA, the system in use by this very blog. What does it do, and why is it good for humanity?

What’s a CAPTCHA?

First let’s review the term CAPTCHA. It’s a loose acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart.” The idea is to make humans do a (relatively) simple task, such as reading a few words presented in an image and typing them into the form. The trick only works if the task is hard for computers (ahem, spambots) to do.

CAPTCHA systems are used on forms all over the web in order to cut down on spam form submissions. If you’ve ever run a blog, you’ll know that legions of spambots are crawling the web, submitting every form they find — so having a CAPTCHA on the form drastically reduces form spam. However, in most CAPTCHA systems the text you type in is meaningless, purposely scrambled text. reCAPTCHA is different.

What’s Different About reCAPTCHA?

reCAPTCHA was born when Luis von Ahn, an assistant professor at Carnegie Mellon, realized that millions of people were spending time typing meaningless words into forms. Why not turn this word-decipherment into useful work that helped with some common goal? What if there was a set of words (as images) that needed to be viewed and deciphered by humans? It turns out that book scanning projects (including the Internet Archive) have just this problem: when scanning a print book into a computer (particularly an old book in poor condition), some words can’t be deciphered automatically by Optical Character Recognition (OCR) software, and need a human to figure them out. In order to get a good text-only copy of a scanned book, lots of human attention is needed.

So reCAPTCHA is conceptually simple: take the words the OCR software can’t read and put them in front of human users. If multiple users type the same text for the same hard-to-read word, reCAPTCHA can safely assume that it has been properly deciphered, and feed that word back into the book scanning project, slotting it into its associated book. Thus, text that is by definition difficult or impossible for a computer to accurately scan has been deciphered by humans, and the humans doing the work generally don’t even know it!
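
For the code-curious, here is a rough sketch of that agreement check in Python. It is an illustration only, not reCAPTCHA’s actual code: the function name `consensus` and the three-vote threshold are assumptions, since all we know is that “multiple users” have to agree.

```python
from collections import Counter

# Assumed threshold: the article only says "multiple users" must agree.
MIN_AGREEING_USERS = 3


def consensus(transcriptions, min_agree=MIN_AGREEING_USERS):
    """Return the agreed-upon reading of a word, or None if no consensus yet."""
    # Ignore stray whitespace and empty answers, but keep the spelling as typed.
    counts = Counter(t.strip() for t in transcriptions if t.strip())
    if not counts:
        return None
    word, votes = counts.most_common(1)[0]
    # Accept the most popular reading only once enough users have typed it.
    return word if votes >= min_agree else None


if __name__ == "__main__":
    print(consensus(["Bourbon", " Bourbon ", "Bourbon", "Burbon"]))
    print(consensus(["Bourbon", "Burbon"]))
```

With three matching answers the word is accepted; with only two conflicting answers the function returns `None`, meaning the word would go back out to more users.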

Yeah, But…

There’s one technical catch — what’s to stop people from typing in random gibberish as “decipherment” of the words? Given that reCAPTCHA by definition doesn’t know the correct decipherment of its subject words, how can it judge whether you’ve gotten it right? To solve this problem, reCAPTCHA presents two words together: one unknown and one known (the latter meaning a word for which reCAPTCHA already has a good decipherment). You have to get the known word correct, and the unknown word is (as described above) compared with other users’ decipherments to eventually determine whether it’s correct. There’s also an audio variant for users with visual impairment, in which they listen to spoken language and convert it to written text.
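
Here is a minimal Python sketch of that two-word check. The name `check_submission`, the case-insensitive comparison, and the choice to record guesses only from users who pass are all assumptions made for illustration, not details confirmed by the reCAPTCHA project.

```python
def check_submission(known_answer, typed_known, typed_unknown, votes):
    """Pass the user if the known word matches; record their guess at the unknown word.

    `votes` is a plain list collecting guesses for the unknown word, to be
    compared with other users' guesses later.
    """
    # Assumption: the known word is compared ignoring case and stray spaces.
    passed = typed_known.strip().casefold() == known_answer.strip().casefold()
    if passed:
        # Assumption: only guesses from users who got the known word right count.
        votes.append(typed_unknown.strip())
    return passed


if __name__ == "__main__":
    votes = []
    print(check_submission("highways", "Highways ", "fail", votes), votes)
    print(check_submission("highways", "hiways", "gibberish", votes), votes)
```

A user who types random gibberish fails on the known word, so the gibberish never reaches the pile of guesses for the unknown one.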

So next time you fill out a reCAPTCHA form when commenting on a Mental Floss blog post, remember: you’re helping to digitize books!

Further reading: Carnegie Mellon press release, Wikipedia page, reCAPTCHA project site.

Comments (41)
  1. Look ma, I’m helping type books!

    I was just thinking about this, and considering using it in future projects that require forms on the web. Thank you for the great writeup and the extra resources you’ve provided here.

  2. Amazing. My reCAPTCHA is “Cooley ty-four” by the way, whatever that means.

  3. That’s awesome. :)

    mine = Feist Dunwoodie

  4. That is cool, kind of makes me want to post a bunch of nonsense comments just to help translate books.

  5. You don’t have to read this comment.

  6. Just doing my part for the sake of literacy.

    ‘metals his_name,’ whatever that means.

  7. Witty Nickname is on to something here…

    autograph 251

  8. Is my reCAPTCHA political or social commentary? highways fail

  9. Russian Christ

  10. Haha, this actually gives me some incentive to comment. I hope more places pick this up, it’s a great idea.

  11. That’s brilliant!

    (Clington 8vo….???)

  12. I really love this idea! However, it does make my librarian-brain open up with all the thoughts about the good and bad stuff about digitizing books. Copyright, fair use, the potential for damage to books in the digitization process, blah blah.

    That being said, I’m still commenting. :)

  13. I heard about this before. Nice to see it boiled down like this instead of in a ten minute video. They spent the whole time saying the same thing.

    I have to see if I can misspell my captcha. Evil, I know, but others will get it right and mine will be voted out.

  14. My reCAPTCHA says “Claus Lydia.” Sounds like a weirdoid performance artist to me!

  15. That’s pretty cool.

    Accurate Maritime, sounds like a band xD

  16. The fastest way through Amish country: Obadiah Railroad. Too bad they can’t use it!

  17. I often have to reload the page to get two words I can decipher. Now I can try just guessing if I can read one of the words. For this comment, they are giving me a long number.

  18. Totally cool!! I just may tell a blogger friend about this, maybe he can put it on his blog. He’s an English teacher by training, and a newspaper reporter by trade, so he might really dig this.

  19. Oh, also note that if you can’t read your reCAPTCHA words, you can hit the little “reload” button (two arrows making a circle) in the reCAPTCHA itself — it’ll load up two more words.

    (Ahem, “Associated Judge” this time.)

  20. That is awesome!

  21. I had been wondering what the reCAPTCHA had to do with books, and now I know! Thanks, mental_floss for doing the research that I am too busy playing MarioKart Wii to do!

    The best one I ever had was a couple of weeks ago: ‘butt stickers.’

    Today? ‘masked man’

  22. So if they want us to decipher text that OCR can’t read, why do they make it wavy and put a line thru it?

    persuasive PUNTA

  23. On track for record posts b/c ya know now we HAVE to post our fantastically random (or poetically appropriate) recaptchas. This far addicting than looking for meaning in your horoscope or fortune cookie. And you can always hit refresh to get a better answer!
    My 1st was “trading hardly”; intriguing, but since I could not decipher a fully relevant meaning I moved on… 2nd was “society durling” (alas, how i wanted it to declare “society darling”, so my southern grandmother could have her intentions of me being a debutante realized, at least by a computer.
    My 3rd and final recaptha was Migration Bourbon. I’m not sure what it means, but I really like the sound of it, so I’ll stop…for now…

  24. Or…this time it says
    On track for record posts b/c ya know now we HAVE to post our fantastically random (or poetically appropriate) recaptchas. This IS far MORE addicting than looking for meaning in your horoscope or fortune cookie. And you can always hit refresh to get a better answer!
    My 1st was “trading hardly”; intriguing, but since I could not decipher a fully relevant meaning I moved on… 2nd was “society durling” (alas, how i wanted it to declare “society darling”, so my southern grandmother could have her intentions of me being a debutante realized, at least by a computer.
    My 3rd and final recaptha was Migration Bourbon. I’m not sure what it means, but I really like the sound of it, so I’ll stop…for now…

    It probably doesn’t read any better, my spellcheck filter is off, due to wine consumption, but i could have sworn that some of the words necessary for basic comprehension were previously there….oh, well, i got “yes, easily” for the repost, obviously i’m in love with recaptha :-)

  25. Ignore me, I’m just here to help bump up the comment count.

    front Leaden
    blanks pei
    signer loan (gonna go with this one)

  26. this is awesome in several ways

  27. I admit that reCAPTCHA is cool but the one with the pictures on it is better!

  28. Trading hardly… I’m teaching English in Korea at the moment and this is a really common mistake students make here! “I am studying so hardly!” they proclaim.

    Amandes bread is my recaptcha.

  29. HA! My reCAPTCHA? submitted post

  30. I’m with PartiallyDeflected on this one - if they want us to decipher text that OCR can’t read, why do they make it wavy and put a line thru it?

    “Millard of” - eh
    “Cates side-tracking” Perfect! If only they had spelled my name right.

  31. I’m going to be so much less annoyed now! I had no idea. I thought the words were random.

    Mine right now says “gruesome man”

  32. Funny I never noticed that it said “stop spam read books” under the recaptcha logo.
    Makes sense.

    research de- was mine

  33. Mine- Mrs snowy. I love it!

  34. Awesome!

    previously pores
    As headliners
    Mortgage hat
    Cortlandt aid
    Nippu would

  35. Mine sounds like something I’d say in everyday speech

    “k anyway”…

  36. I’ve just gotta try this reCAPTCHA thing out…

    Duffield plays

    Hmmm… Yes they do… I guess…

  37. My reCAPTCHA is:

    grindings have

    Grindings have? Have what? Don’t leave me hanging!

  38. Mine says “vigor includes” and now I’m curious what vigor actually does include!

  39. “unconstrained 10”

    I have a new nickname for my incredible wife!

  40. I commented just to do the captcha

  41. stated prettiest. (nya nya)
