Even humans sometimes misread sarcastic comments (especially online and in text messages), so imagine how hard deciphering sarcasm must be for a robot. Sarcasm is a complex cognitive process—understanding it means not only figuring out the meaning of the words, but examining the context and intent of the speaker. Luckily, there’s Reddit.
In order to help train artificial intelligence programs in natural language processing, Princeton University computer scientists recently scraped together a huge dataset of sarcastic remarks from self-tagged comments on Reddit, according to their paper posted on arXiv.org.
Reddit is a treasure trove of data on sarcasm because users themselves identify their comments as sarcastic, which leaves little room for misclassification: we know a remark was meant sarcastically because its author said so. On the site, users end sarcastic statements with the marker “/s” to prevent confusion, since it can be hard to read sarcasm without facial expressions, tone of voice, or other in-person contextual clues.
The Self-Annotated Reddit Corpus consists of 1.3 million sarcastic remarks from the social media site, which the researchers say is 10 times more than any other training dataset for sarcastic language. The corpus also contains non-sarcastic remarks, for a total of 500 to 600 million Reddit comments. The comments pulled include only those from users who have employed the “/s” tag in their posts, meaning that they are familiar with and use the tag, so their posts are less likely to contain unmarked examples of sarcasm.
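For readers curious how that kind of self-labeling and filtering might look in practice, here is a minimal Python sketch. It is not the researchers' actual pipeline: the `Comment` type, the function names, and the exact rule for spotting a trailing “/s” are all assumptions made for illustration.

```python
import re
from typing import List, NamedTuple, Tuple


class Comment(NamedTuple):
    # Hypothetical record type; real Reddit data carries many more fields.
    author: str
    text: str


# Assumed rule: "/s" counts as a marker only when it is the last token,
# so a link ending in ".com/s" is not mistaken for sarcasm.
SARCASM_MARKER = re.compile(r"(?:^|\s)/s\s*$")


def is_sarcastic(text: str) -> bool:
    """Return True if the comment ends with the self-applied "/s" marker."""
    return SARCASM_MARKER.search(text) is not None


def build_corpus(comments: List[Comment]) -> List[Tuple[str, bool]]:
    """Label comments, keeping only authors who have used "/s" at least once."""
    # Authors known to use the marker; their untagged comments are
    # less likely to be unmarked sarcasm.
    taggers = {c.author for c in comments if is_sarcastic(c.text)}
    return [(c.text, is_sarcastic(c.text)) for c in comments if c.author in taggers]


sample = [
    Comment("alice", "Oh great, another Monday. /s"),
    Comment("alice", "The meeting is at noon."),
    Comment("bob", "I like this dataset."),
]
corpus = build_corpus(sample)
```

In this toy sample, both of alice's comments are kept (one labeled sarcastic, one not), while bob's comment is dropped because he has never used the tag.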
Future artificial intelligence and natural-language processing researchers can now make use of this dataset to teach machines sarcasm, creating a future in which Siri can talk back to us.