INLS 512 Spring 2011
15 March 2011
Intentional insincerity, which includes sarcasm, verbal irony, and satire, is a popular tone in
natural language, but it can be difficult for humans to detect, much less computers. Some work has been
done in recent years on automatic detection of this tone, but the field is still in its infancy.
Utsumi (1996) was a foundational work on developing a computational model of irony. The
paper described the ironic environment, which contains certain requirements for speech or writing to be
considered ironic. Utsumi found that an ironic utterance “implicitly displays the fact that its utterance
situation is surrounded by ironic environment” (p. 2). This environment is displayed when an utterance
“alludes to the speaker’s expectation, violates pragmatic principles, and implies the speaker’s emotional
attitude” (p. 1). Specifically, the utterance must allude to an expectation and the failure of that
expectation, resulting in the speaker’s disappointment or other negative attitude. Utsumi concluded
that for a person to interpret an utterance as ironic, only two of these three components need
to be displayed (p. 5).
Utsumi’s unified theory of irony laid the foundation for automating irony detection. He criticized
previous theories of irony for not being “clear enough to be formalized in a computable fashion” (p. 2),
but as he wrote, “This paper provides a basis for dealing with irony in NLP systems” (p. 6). Some
important implications of his research were the potential of a system to distinguish between ironic and
non-ironic utterances and a demonstration that ironic utterances can be interpreted without
intonational cues (p. 6).
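Utsumi's two-of-three criterion lends itself to a direct formalization. The sketch below is an illustrative Python encoding only; the component names paraphrase the paper, and the threshold of two comes from his conclusion (p. 5):

```python
# Toy encoding of Utsumi's ironic-environment test: an utterance that
# displays at least two of the three components (allusion to a failed
# expectation, pragmatic violation, implied negative attitude) can be
# interpreted as ironic.
def is_interpretable_as_irony(alludes_to_expectation,
                              violates_pragmatic_principles,
                              implies_negative_attitude):
    components = [alludes_to_expectation,
                  violates_pragmatic_principles,
                  implies_negative_attitude]
    return sum(components) >= 2
```

In practice, of course, deciding whether each component is displayed is the hard part; the rule itself is simple.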
How well can humans detect intentional insincerity?
Researchers have struggled to automate the detection of intentional insincerity, partly due to its
subtlety; even humans do not always interpret it correctly. Kreuz and Caucci (2007) tested human
recognition of sarcasm by asking college students to evaluate the sarcasm levels in several texts. Some
of the texts originally included the phrase “said sarcastically.” Kreuz and Caucci deleted the word
“sarcastically” in each of these texts and randomly interspersed them with control texts with utterances
that were not sarcastic in tone. The students rated each of the excerpts (including the utterance and the
two paragraphs before and after it) on a seven-point scale of likely sarcastic intent (p. 2). Then two
judges coded each text based on the presence of adjectives and adverbs, the presence of interjections,
and the use of exclamation points or question marks (pp. 2-3). These dimensions were hypothesized to
be relevant to human readers’ interpretation of sarcasm. Kreuz and Caucci found that the sarcastic
excerpts were rated significantly higher (more likely to be sarcastic) than the control excerpts (p. 3). The
hand-coded dimensions were also analyzed, but only the presence of interjections was found to be a
significant predictor of high ratings of sarcasm (p. 3). Kreuz and Caucci’s research demonstrates that
humans have an ability to detect sarcasm and suggests some of the lexical cues they use. It could also
prove useful for automatic detection of sarcasm. The results “suggest that, in some contexts, the use of interjections, and perhaps other textual factors, may provide reliable cues for identifying sarcastic intent.”
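The surface cues Kreuz and Caucci coded by hand could be extracted automatically. The sketch below counts interjections and punctuation marks; the interjection list is an assumption for illustration, since the paper does not publish its lexicon, and adjective/adverb counts are omitted because they would require a part-of-speech tagger:

```python
import re

# Hypothetical interjection lexicon (the paper does not publish one).
INTERJECTIONS = {"oh", "wow", "gee", "gosh", "ah", "hey", "yeah", "ugh", "huh"}

def sarcasm_cue_features(text):
    """Count surface cues of the kind Kreuz and Caucci hand-coded:
    interjections, exclamation points, and question marks."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "interjections": sum(1 for t in tokens if t in INTERJECTIONS),
        "exclamations": text.count("!"),
        "questions": text.count("?"),
    }
```

On their findings, only the first of these counts was a significant predictor, but all three are cheap to compute over a corpus.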
Kreuz’s earlier work with Roberts (1995) assessed the relative importance of hyperbole and
veridicality (truthfulness) in interpretation of irony. Kreuz and Roberts compiled short scenario texts and
asked college students to evaluate their levels of verbal irony. Kreuz and Roberts explained that
nonveridicality “is essential for the perception of irony. That is, an ironic statement must be contrary to
the true state of affairs to be interpreted correctly. There must be some discrepancy between the reality
and the utterance, and the listener must recognize this discrepancy in order to interpret the utterance
as it was intended” (p. 22). They also explained the importance of hyperbole, suggesting that “There
seems to be a standard frame for such [ironic] utterances in English; it can be characterized as an
adverb, followed by an extreme, positive adjective” (p. 24).
In the study, Kreuz and Roberts presented college students with texts that included scenarios
with different variations of veridicality and hyperbole (p. 26). In other words, some scenarios set up
situations with utterances that made sense and others with utterances that were contrary to fact. In
some scenarios, the utterances were exaggerated, and in others they were non-hyperbolic. In this way,
the researchers were able to compare the relative importance of veridicality and hyperbole. They found
that both veridicality and hyperbole were significant, with the scenarios presenting a combination of the
two factors being rated most sarcastic (p. 27). Kreuz and Roberts did not suggest any computer-based
applications for their findings, but the lexical patterns they discovered could perhaps be used in
automatic irony detection.
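One way such use might look: the adverb-plus-extreme-positive-adjective frame can be approximated with a regular expression. The word lists below are assumptions for illustration; Kreuz and Roberts describe the frame but do not supply a lexicon:

```python
import re

# Illustrative word lists, not from the paper.
INTENSIFIERS = r"(?:absolutely|simply|just|really|totally|utterly)"
EXTREME_ADJ = r"(?:wonderful|fantastic|brilliant|perfect|amazing|great)"

# The "standard frame" Kreuz and Roberts describe: an adverb followed by
# an extreme, positive adjective.
HYPERBOLE_FRAME = re.compile(
    rf"\b{INTENSIFIERS}\s+{EXTREME_ADJ}\b", re.IGNORECASE
)

def has_hyperbole_frame(utterance):
    """True if the utterance matches the hyperbole frame associated
    with verbal irony."""
    return HYPERBOLE_FRAME.search(utterance) is not None
```

A matcher like this would flag candidates only; their study shows the frame signals irony most strongly when the utterance is also nonveridical, which a lexical pattern alone cannot check.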
How do automatic systems compare?
Tsur et al (2010) devised the most successful system to date: a novel algorithm for sarcasm
identification using 66,000 Amazon.com reviews as a corpus. Their Semi-supervised Algorithm for
Sarcasm Identification (SASI) included two steps: a semi-supervised pattern acquisition algorithm and a
classification algorithm (p. 163). The pattern acquisition algorithm was trained with manually labeled
sentences rated one to five (not at all sarcastic to definitely sarcastic) (p. 163). The researchers then
extracted syntactic and pattern-based features (p. 163). The patterns were based on high-frequency
words and content words (p. 164). The strongest patterns were selected for the feature vectors, which
also included sentence length and punctuation features for analysis (p. 164). The researchers added to
their data set by searching on the web for sentences with similar patterns (p. 165). They compared their
results to a star-sentiment baseline based on the star rating associated with each review on
Amazon.com (p. 165). They used 5-fold cross validation and a gold-standard annotation to evaluate their
results. For the 5-fold cross validation, the combination of all features yielded the best results, with
patterns+punctuation close behind (p. 166). The gold-standard annotation evaluation revealed a
significant improvement over the baseline (p. 165).
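The shape of a SASI-style feature vector can be sketched as follows. The two patterns below are toy stand-ins; the real algorithm acquires its patterns from labeled data using high-frequency words and content-word slots, rather than hand-written regular expressions:

```python
import re

# Toy stand-ins for acquired patterns (assumed, not from the paper).
PATTERNS = [
    re.compile(r"\bdoes not \w+ much about\b"),
    re.compile(r"\bas \w+ as it gets\b"),
]

def sasi_style_vector(sentence):
    """Build a feature vector in the spirit of Tsur et al.: one slot per
    pattern, plus sentence length and punctuation counts."""
    vector = [1.0 if p.search(sentence.lower()) else 0.0 for p in PATTERNS]
    vector.append(len(sentence.split()))  # sentence length in words
    vector.append(sentence.count("!"))    # punctuation features
    vector.append(sentence.count("?"))
    return vector
```

Vectors like these would then feed the classification step, which assigns each new sentence a sarcasm score based on its similarity to labeled vectors.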
Davidov et al (2010), written by the same three researchers as Tsur et al, used the SASI
algorithm to investigate 66,000 Amazon.com product reviews and 5.9 million Twitter messages (p. 107).
Davidov et al used the same training process as described in Tsur et al (2010). They used the #sarcasm
Twitter hashtag to train the system on the Twitter corpus, but this proved too noisy, so they performed
cross-domain training instead, using the Amazon data set (p. 111). The researchers used 15 annotators
from Amazon's Mechanical Turk service for annotating a gold standard for evaluation purposes (p. 112).
They also used the #sarcasm Twitter hashtag as a secondary gold standard; all tweets with this hashtag
were considered to be sarcastic (p. 113). As in Tsur et al, the combination of all features yielded the best
results, again with patterns+punctuation close behind (p. 113). The gold-standard evaluations were high
for the new sentences (both Twitter and Amazon), and the Mechanical Turk standard outperformed the
#sarcasm Twitter hashtag (p. 113). Davidov et al built on the success of Tsur et al and showed that the
SASI algorithm could be expanded successfully.
Carvalho et al (2009) investigated the use of linguistic cues for detecting irony in user comments
on a Portuguese newspaper website. They achieved relatively high levels of precision “by exploring
certain oral or gestural clues in user comments, such as emoticons, onomatopoeic expressions for
laughter, heavy punctuation marks, quotation marks and positive interjections” (p. 53). The paper
followed a previous Carvalho et al study on opinion mining that achieved high precision for negative
opinions but lower precision for positive opinions. One of the major errors was found to be verbal irony,
and the 2009 paper investigated how to detect verbal irony in order to avoid false positive opinions in
opinion detection (p. 53). In particular, it focused on “the specific case where a word or expression with
prior positive polarity is figuratively used for expressing a negative opinion” (p. 53).
Carvalho et al devised eight linguistic patterns that they hypothesized would be related to verbal
irony. Each pattern was constrained to include a word of positive opinion polarity and a human
named entity (p. 54). The patterns included diminutive forms, demonstrative determiners,
interjections, verb morphology, cross-constructions, heavy punctuation, quotation marks, and laughter
expressions (pp. 54-55). They used a named-entity lexicon and a sentiment lexicon to find excerpts from
the newspaper corpus that fit the constraints of the study (p. 54). The researchers evaluated common
sentence patterns and marked them as ironic, not ironic, undecided, or ambiguous (p. 54). They found
that the most productive patterns were the ones that relied on punctuation and keyboard characters,
“which are ways of representing oral or gestural expressions in written text” (p. 54). The patterns based
on laughter and quotation yielded the best results (p. 55).
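The oral and gestural cues Carvalho et al found most productive can be approximated with simple pattern matching. The regular expressions below are English-oriented illustrations only; the paper's actual patterns target Portuguese text and are more constrained:

```python
import re

# Illustrative versions of the best-performing cue types (laughter,
# heavy punctuation, quotation marks, emoticons); not the paper's exact patterns.
CUES = {
    "laughter": re.compile(r"\b(?:ha(?:ha)+|he(?:he)+|lol)\b", re.IGNORECASE),
    "heavy_punctuation": re.compile(r"[!?]{2,}"),
    "quoted_expression": re.compile(r'"[^"]{1,30}"'),
    "emoticon": re.compile(r"[;:]-?[)(pPD]"),
}

def irony_cues(comment):
    """Return the names of the cue patterns that fire on a comment."""
    return [name for name, rx in CUES.items() if rx.search(comment)]
```

As in the paper, each firing cue is only evidence of irony, not proof; precision comes from requiring the cue to co-occur with a positive-polarity expression about a named entity.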
Burfoot and Baldwin (2009) used support vector machines and feature weighting to
differentiate between true and satirical news stories using newswire and satirical news articles for their
corpus. They focused on three feature types that were strongly related to satirical stories: headlines,
profanity, and slang (pp. 162-63). They also evaluated the validity of stories by comparing the
combinations of named entities in each story with web queries for the same combinations. Valid (and
less likely to be satirical) combinations of named entities had more web matches than the novel
combinations of named entities that characterized satirical stories (p. 163). The classifiers achieved high
precision but low recall. They detected the most obvious satirical stories but they could not catch the
subtler ones (p. 164).
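Their validity feature can be sketched as a threshold on web co-occurrence counts. Everything below is hypothetical: `web_hit_count` stands in for a real search-engine query, and the threshold is an invented placeholder, not a value from the paper:

```python
# Sketch of Burfoot and Baldwin's validity idea: entity combinations that
# are rare on the web suggest the novel pairings characteristic of satire.
def web_hit_count(query):
    """Placeholder for a real search-API call; returns canned counts."""
    fake_index = {"Obama Congress": 1_000_000, "Obama Mars colony": 3}
    return fake_index.get(query, 0)

def validity_score(named_entities, threshold=100):
    """1.0 when the entity combination is well attested on the web
    (likely a true story), 0.0 when it is novel (more likely satire)."""
    hits = web_hit_count(" ".join(named_entities))
    return 1.0 if hits >= threshold else 0.0
```

In their system this score was one weighted feature among several fed to the support vector machine, not a classifier on its own.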
Tepperman et al (2006) studied sarcasm recognition in spoken dialogue using prosodic, spectral,
and contextual cues (p. 1838). They built their study around occurrences of the expression “yeah right”
from telephone dialogues in the Switchboard and Fisher corpora in order to capture a variety of
sarcastic and non-sarcastic uses (p. 1838). First they categorized each example of the expression “yeah
right” as one of four types of speech act: acknowledgment, agreement/disagreement, indirect
interpretation, or internal phrase (pp. 1838-39). They also coded whether the expression was
preceded or followed by laughter or by a pause, whether it served as a question or an answer,
whether it began or ended a turn, and the gender of the
speaker (p. 1839). They used 19 prosodic features to characterize the tone of voice for each utterance,
as well (p. 1839). Spectral information was recorded automatically. Two human annotators annotated
the excerpts, both with and without the context of the two or three turns before and after the target
expression (p. 1840). Having the context included improved inter-annotator agreement significantly,
suggesting that the context was an important factor in correctly interpreting the excerpts (p. 1840). The
researchers found that contextual and spectral features outperformed prosodic features (p. 1841).
Laughter was found to be the most important predictive feature (p. 1841). They concluded “that
prosody alone is not sufficient to discern whether a speaker is being sarcastic” and “that spectral and
contextual features can be used to detect sarcasm as well as a human annotator would” (p. 1838).
Why is it important?
The automatic detection of irony, sarcasm, and satire is important to the broader area of
sentiment detection, both as an interesting computational problem and as a way to clear away
the noise in texts that causes false sentiment classifications. For example, a sarcastic movie review
might contain all positive words but actually portray negative sentiment. Several researchers have run
up against this problem in sentiment detection research. Read (2005) performed a sentiment
classification study employing emoticons. He compiled a corpus of excerpts that included smile or frown
emoticons (p. 45) and optimized them for sentiment classification on a second corpus (p. 46). The
optimized system performed well on data from the emoticon corpus but not on data from the second
corpus (p. 47). Read suggested that the emoticon extracts “may be noisy with
respect to sentiment” (p. 47), and that sarcasm was a significant contributor to this noise.
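Read's labeling approach can be sketched as a small distant-supervision function: emoticons stand in for sentiment labels when collecting training data. The emoticon sets below are assumptions for illustration, not Read's exact lists:

```python
import re

# Assumed smile/frown emoticon patterns (illustrative, not Read's lists).
SMILE = re.compile(r"[:;]-?[)D]")
FROWN = re.compile(r":-?\(")

def emoticon_label(text):
    """Return 'pos', 'neg', or None; unambiguous excerpts become
    automatically labeled training data."""
    if SMILE.search(text) and not FROWN.search(text):
        return "pos"
    if FROWN.search(text) and not SMILE.search(text):
        return "neg"
    return None
```

Read's result is precisely that labels harvested this way carry noise, with sarcastic smiles on negative text being one source.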
Das et al (2009) used unsupervised learning to detect sentiment in political blogs by detecting
their themes and orientations. The results were mixed, partly because of problematic noise caused by
sarcasm. Das et al wrote, “Some articles have a vocabulary that is dominated by all terms related to the
promises made by one candidate, but ends with a sentence that changes the overall tone of the article.
Some articles are humor based where all the policies made by a candidate is debated using sarcasm or
jokes” (p. 91). They concluded, “Detecting sarcasm in text is indeed very hard and remains an open
problem” (p. 92).
Davidov et al (2010) summarized the problem: “The difficulty in recognition of sarcasm causes
misunderstanding in everyday communication and poses problems to many NLP systems such as online
review summarization systems, dialogue systems or brand monitoring systems due to the failure of state
of the art sentiment analysis systems to detect sarcastic comments” (p. 107).
Automatic detection of intentional insincerity has a rich variety of potential applications. Tsur et
al (2010) wrote:
Beyond the obvious psychology and cognitive science interest in suggesting models for the use
and recognition of sarcasm, automatic detection of sarcasm is interesting from a commercial
point of view. Studies of user preferences suggest that some users find sarcastic reviews biased
and less helpful while others prefer reading sarcastic reviews. (p. 163)
Tsur et al also list content ranking personalization, recommendation systems, and review summarization
and opinion mining systems as areas of potential application (p. 164). Meanwhile, Kreuz and Caucci
(2007) wrote about the potential to investigate “certain formulaic expressions (e.g., thanks a lot, good
job), foreign terms (e.g., au contraire), rhetorical statements (e.g., tell us what you really think), and
repetitions (e.g., perfect, just perfect) [that] are also common in sarcastic statements” (p. 4). Further
research building on the early successes in automatic irony, sarcasm, and satire detection should
continue to yield better results with commercial applications, particularly in product reviews and social media.
References
Burfoot, C. & Baldwin, T. (2009). Automatic satire detection: Are you having a laugh? Proceedings of the
ACL-IJCNLP 2009 Conference Short Papers, 161–164.
Carvalho, P., Silva, M., Sarmento, L., & de Oliveira, E. (2009). Clues for detecting irony in user-generated
contents: Oh...!! It's "so easy" ;-). TSA'09 – 1st International CIKM Workshop on Topic-Sentiment
Analysis for Mass Opinion Measurement, 53–56.
Das, P., Srihari, R., & Mukund, S. (2009). Discovering voter preferences in blogs using mixtures of topic
models. AND '09: Proceedings of the Third Workshop on Analytics for Noisy Unstructured Text Data.
Davidov, D., Tsur, O., & Rappoport, A. (2010). Semi-supervised recognition of sarcastic sentences in
Twitter and Amazon. Proceedings of the Fourteenth Conference on Computational Natural
Language Learning, 107–116.
Kreuz, R. & Caucci, G. (2007). Lexical influences on the perception of sarcasm. Proceedings of the
Workshop on Computational Approaches to Figurative Language, 1–4.
Kreuz, R. & Roberts, R. (1995). Two cues for verbal irony: Hyperbole and the ironic tone of voice.
Metaphor and Symbol, 10(1), 21–31.
Read, J. (2005). Using emoticons to reduce dependency in machine learning techniques for sentiment
classification. Proceedings of the ACL Student Research Workshop, 43–48.
Tepperman, J., Traum, D., & Narayanan, S. (2006). "Yeah right": Sarcasm recognition for spoken dialogue
systems. INTERSPEECH 2006 – ICSLP, 1838–1841.
Tsur, O., Davidov, D., & Rappoport, A. (2010). ICWSM – A great catchy name: Semi-supervised
recognition of sarcastic sentences in online product reviews. Proceedings of the Fourth
International AAAI Conference on Weblogs and Social Media, 162–169.
Utsumi, A. (1996). A unified theory of irony and its computational formalization. COLING, 962–967.