                        Making Computers Laugh:
             Investigations in Automatic Humor Recognition

     Rada Mihalcea                      Carlo Strapparava
     Department of Computer Science    Istituto per la Ricerca Scientifica e Tecnologica
     University of North Texas         ITC – irst
     Denton, TX 76203, USA             I-38050, Povo, Trento, Italy
     rada@cs.unt.edu                   strappa@itc.it

Abstract

Humor is one of the most interesting and puzzling aspects of human behavior. Despite the attention it has received in fields such as philosophy, linguistics, and psychology, there have been only a few attempts to create computational models for humor recognition or generation. In this paper, we bring empirical evidence that computational approaches can be successfully applied to the task of humor recognition. Through experiments performed on very large data sets, we show that automatic classification techniques can be effectively used to distinguish between humorous and non-humorous texts, with significant improvements observed over a priori known baselines.

1 Introduction

   ... pleasure has probably been the main goal all along. But I hesitate to admit it, because computer scientists want to maintain their image as hard-working individuals who deserve high salaries. Sooner or later society will realize that certain kinds of hard work are in fact admirable even though they are more fun than just about anything else. (Knuth, 1993)

Humor is an essential element in personal communication. While it is often merely considered a way to induce amusement, humor also has a positive effect on the mental state of those using it and has the ability to improve their activity. Computational humor therefore deserves particular attention, as it has the potential of turning computers into a creative and motivational tool for human activity (Stock et al., 2002; Nijholt et al., 2003).

Previous work in computational humor has focused mainly on the task of humor generation (Stock and Strapparava, 2003; Binsted and Ritchie, 1997), and very few attempts have been made to develop systems for automatic humor recognition (Taylor and Mazlack, 2004). This is not surprising, since, from a computational perspective, humor recognition appears to be significantly more subtle and difficult than humor generation.

In this paper, we explore the applicability of computational approaches to the recognition of verbally expressed humor. In particular, we investigate whether automatic classification techniques are a viable approach to distinguishing between humorous and non-humorous text, and we bring empirical evidence in support of this hypothesis through experiments performed on very large data sets.

Since a deep comprehension of humor in all of its aspects is probably too ambitious and beyond existing computational capabilities, we chose to restrict our investigation to the type of humor found in one-liners. A one-liner is a short sentence with comic effects and an interesting linguistic structure: simple syntax, deliberate use of rhetorical devices (e.g. alliteration, rhyme), and frequent use of creative language constructions meant to attract the reader's attention. While longer jokes can have a relatively complex narrative structure, a one-liner must produce the humorous effect "in one shot", with very few words. These characteristics make this type of humor particularly suitable for use in an automatic learning setting, as the humor-producing features are guaranteed to be present in the first (and only) sentence.

We attempt to formulate the humor-recognition

problem as a traditional classification task, and feed positive (humorous) and negative (non-humorous) examples to an automatic classifier. The humorous data set consists of one-liners collected from the Web using an automatic bootstrapping process. The non-humorous data is selected such that it is structurally and stylistically similar to the one-liners. Specifically, we use three different negative data sets: (1) Reuters news titles; (2) proverbs; and (3) sentences from the British National Corpus (BNC). The classification results are encouraging, with accuracy figures ranging from 79.15% (One-liners/BNC) to 96.95% (One-liners/Reuters). Regardless of which non-humorous data set plays the role of negative examples, the performance of the automatically learned humor recognizer is always significantly better than a priori known baselines.

The remainder of the paper is organized as follows. We first describe the humorous and non-humorous data sets, and provide details on the Web-based bootstrapping process used to build a very large collection of one-liners. We then show experimental results obtained on these data sets using several heuristics and two different text classifiers. Finally, we conclude with a discussion and directions for future work.

2 Humorous and Non-humorous Data Sets

To test our hypothesis that automatic classification techniques represent a viable approach to humor recognition, we first needed a data set consisting of both humorous (positive) and non-humorous (negative) examples. Such data sets can be used to automatically learn computational models for humor recognition, and at the same time to evaluate the performance of such models.

2.1 Humorous Data

For the reasons outlined earlier, we restrict our attention to one-liners, short humorous sentences that have the characteristic of producing a comic effect in very few words (usually 15 or fewer). The one-liner humor style is illustrated in Table 1, which shows three examples of such one-sentence jokes.

It is well known that large amounts of training data have the potential of improving the accuracy of the learning process, and at the same time provide insights into how increasingly larger data sets can affect the classification precision. The manual construction of a very large one-liner data set may, however, be problematic, since most Web sites or mailing lists that make such jokes available do not usually list more than 50–100 one-liners. To tackle this problem, we implemented a Web-based bootstrapping algorithm able to automatically collect a large number of one-liners, starting with a short seed list consisting of a few manually identified one-liners.

[Figure 1: Web-based bootstrapping of one-liners. Flow: seed one-liners → Web search → webpages matching thematic constraint (1)? → candidate webpages → enumerations matching stylistic constraint (2)? → automatically identified one-liners.]

The bootstrapping process is illustrated in Figure 1. Starting with the seed set, the algorithm automatically identifies a list of webpages that include at least one of the seed one-liners, via a simple search performed with a Web search engine. Next, the webpages found in this way are HTML-parsed, and additional one-liners are automatically identified and added to the seed set. The process is repeated several times, until enough one-liners are collected.

An important aspect of any bootstrapping algorithm is the set of constraints used to steer the process and prevent as much as possible the addition of noisy entries. Our algorithm uses: (1) a thematic constraint applied to the theme of each webpage; and (2) a structural constraint, exploiting HTML annotations indicating text of a similar genre.

The first constraint is implemented using a set of keywords, of which at least one has to appear in the URL of a retrieved webpage, thus potentially limiting the content of the webpage to a theme related to that keyword. The set of keywords used in the current implementation consists of six words that explicitly indicate humor-related content: oneliner, one-liner, humor, humour, joke,

funny. For example, http://www.berro.com/Jokes or http://www.mutedfaith.com/funny/life.htm are the URLs of two webpages that satisfy this constraint.

            One-liners
  Take my advice; I don't use it anyway.
  I get enough exercise just pushing my luck.
  Beauty is in the eye of the beer holder.
            Reuters titles
  Trocadero expects tripling of revenues.
  Silver fixes at two-month high, but gold lags.
  Oil prices slip as refiners shop for bargains.
            BNC sentences
  They were like spirits, and I loved them.
  I wonder if there is some contradiction here.
  The train arrives three minutes early.
            Proverbs
  Creativity is more important than knowledge.
  Beauty is in the eye of the beholder.
  I believe no tales from an enemy's tongue.

Table 1: Sample examples of one-liners, Reuters titles, BNC sentences, and proverbs.

The second constraint is designed to exploit the HTML structure of webpages, in an attempt to identify enumerations of texts that include the seed one-liner. This is based on the hypothesis that enumerations typically include texts of a similar genre, and thus a list including the seed one-liner is likely to include additional one-line jokes. For instance, if a seed one-liner is found in a webpage preceded by the HTML tag <li> (i.e. "list item"), other lines found in the same enumeration preceded by the same tag are also likely to be one-liners.

Two iterations of the bootstrapping process, started with a small seed set of ten one-liners, resulted in a large set of about 24,000 one-liners. After removing duplicates using a measure of string similarity based on the longest-common-subsequence metric, we were left with a final set of approximately 16,000 one-liners, which are used in the humor-recognition experiments. Note that since the collection process is automatic, noisy entries are also possible. Manual verification of a randomly selected sample of 200 one-liners indicates an average of 9% potential noise in the data set, which is within reasonable limits, as it does not appear to significantly impact the quality of the learning.

2.2 Non-humorous Data

To construct the set of negative examples required by the humor-recognition models, we tried to identify collections of sentences that were non-humorous, but similar in structure and composition to the one-liners. We do not want the automatic classifiers to learn to distinguish between humorous and non-humorous examples based simply on text length or obvious vocabulary differences. Instead, we seek to force the classifiers to identify humor-specific features, by supplying them with negative examples similar in most of their aspects to the positive examples, but different in their comic effect.

We tested three different sets of negative examples, with three examples from each data set illustrated in Table 1. All non-humorous examples are required to follow the same length restriction as the one-liners, i.e. one sentence with an average length of 10–15 words.

1. Reuters titles, extracted from news articles published in the Reuters newswire over a period of one year (8/20/1996 – 8/19/1997) (Lewis et al., 2004). The titles consist of short sentences with simple syntax, and are often phrased to catch the reader's attention (an effect similar to the one rendered by one-liners).

2. Proverbs, extracted from an online proverb collection. Proverbs are sayings that transmit, usually in one short sentence, important facts or experiences that are considered true by many people. Their property of being condensed but memorable sayings makes them very similar to one-liners. In fact, some one-liners attempt to reproduce proverbs, with a comic effect, as in e.g. "Beauty is in the eye of the beer holder", derived from "Beauty is in the eye of the beholder".

3. British National Corpus (BNC) sentences, extracted from the BNC, a balanced corpus covering different styles, genres and domains. The sentences were selected such that they were similar in content to the one-liners: we used an information retrieval system implementing a vectorial model to identify the BNC sentence most similar to each of the 16,000 one-liners. (The most similar sentence is identified by running the one-liner against an index built for all BNC sentences with a length of 10–15 words; we use a tf.idf weighting scheme and a cosine similarity measure, as implemented in the Smart system, ftp.cs.cornell.edu/pub/smart.) Unlike the Reuters titles or the proverbs, the BNC sentences typically have no added creativity. However, we decided to add this set of negative examples to our experimental setting, in order

to observe the level of difficulty of a humor-recognition task when performed with respect to simple text.

To summarize, the humor-recognition experiments rely on data sets consisting of humorous (positive) and non-humorous (negative) examples. The positive examples consist of 16,000 one-liners automatically collected using a Web-based bootstrapping process. The negative examples are drawn from: (1) Reuters titles; (2) proverbs; and (3) BNC sentences.

3 Automatic Humor Recognition

We experiment with automatic classification techniques using: (a) heuristics based on humor-specific stylistic features (alliteration, antonymy, slang); (b) content-based features, within a learning framework formulated as a typical text classification task; and (c) combined stylistic and content-based features, integrated in a stacked machine learning framework.

3.1 Humor-Specific Stylistic Features

Linguistic theories of humor (Attardo, 1994) have suggested many stylistic features that characterize humorous texts. We tried to identify a set of features that were both significant and feasible to implement using existing machine-readable resources. Specifically, we focus on alliteration, antonymy, and adult slang, which were previously suggested as potentially good indicators of humor (Ruch, 2002; Bucaria, 2004).

Alliteration. Some studies on humor appreciation (Ruch, 2002) show that structural and phonetic properties of jokes are at least as important as their content. In fact, one-liners often rely on the reader's awareness of attention-catching sounds, through linguistic phenomena such as alliteration, word repetition and rhyme, which produce a comic effect even if the jokes are not necessarily meant to be read aloud. Note that similar rhetorical devices play an important role in wordplay jokes, and are often used in newspaper headlines and in advertisements. The following one-liners are examples of jokes that include one or more alliteration chains:

  Veni, Vidi, Visa: I came, I saw, I did a little shopping.
  Infants don't enjoy infancy like adults do adultery.

To extract this feature, we identify and count the number of alliteration/rhyme chains in each example in our data set. The chains are automatically extracted using an index created on top of the CMU pronunciation dictionary (available at http://www.speech.cs.cmu.edu/cgi-bin/cmudict).

Antonymy. Humor often relies on some type of incongruity, opposition or other form of apparent contradiction. While an accurate identification of all these properties is probably difficult to accomplish, it is relatively easy to identify the presence of antonyms in a sentence. For instance, the comic effect produced by the following one-liners is partly due to the presence of antonyms:

  A clean desk is a sign of a cluttered desk drawer.
  Always try to be modest and be proud of it!

The lexical resource we use to identify antonyms is WordNet (Miller, 1995), and in particular the antonymy relation among nouns, verbs, adjectives and adverbs. For adjectives we also consider an indirect antonymy via the similar-to relation among adjective synsets. Despite the relatively large number of antonymy relations defined in WordNet, its coverage is far from complete, and thus the antonymy feature cannot always be identified. A deeper semantic analysis of the text, such as word sense disambiguation or domain disambiguation, could probably help detect other types of semantic opposition, and we plan to exploit these techniques in future work.

Adult slang. Humor based on adult slang is very popular. Therefore, a possible feature for humor recognition is the detection of sexually oriented lexicon in a sentence. The following are examples of one-liners that include such slang:

  The sex was so good that even the neighbors had a cigarette.
  Artificial Insemination: procreation without recreation.

To form the lexicon required for the identification of this feature, we extract from WordNet Domains (WordNet Domains assigns each synset in WordNet one or more "domain" labels, such as Sport, Medicine, Economy; see http://wndomains.itc.it) all the synsets labeled with the domain Sexuality. The list is further processed by removing all words with high polysemy. Next, we check for the presence of the words in this lexicon in each sentence in the corpus, and annotate them accordingly. Note that, as in the case of antonymy, WordNet coverage is not complete, and the adult slang feature cannot always be identified.

Finally, in some cases, all three features (alliteration, antonymy, adult slang) are present in the same sentence, as for instance in the following one-liner:

  Behind every great man is a great woman, and behind every great woman is some guy staring at her behind!

3.2 Content-based Learning

In addition to stylistic features, we also experimented with content-based features, through experiments where the humor-recognition task is formulated as a traditional text classification problem. Specifically, we compare results obtained with two frequently used text classifiers, Naïve Bayes and Support Vector Machines, selected based on their performance in previously reported work, and for their diversity of learning methodologies.

Naïve Bayes. The main idea in a Naïve Bayes text classifier is to estimate the probability of a category given a document using joint probabilities of words and documents. Naïve Bayes classifiers assume word independence, but despite this simplification, they perform well on text classification. While there are several versions of Naïve Bayes classifiers (variations of multinomial and multivariate Bernoulli), we use the multinomial model, previously shown to be more effective (McCallum and Nigam, 1998).

Support Vector Machines. Support Vector Machines (SVM) are binary classifiers that seek the hyperplane that best separates a set of positive examples from a set of negative examples, with maximum margin. Applications of SVM classifiers to text categorization led to some of the best results reported in the literature (Joachims, 1998).

4 Experimental Results

Several experiments were conducted to gain insights into various aspects of the automatic humor-recognition task: classification accuracy using stylistic and content-based features, learning rates, impact of the type of negative data, and impact of the classification methodology.

All evaluations are performed using stratified ten-fold cross-validation, for accurate estimates. The baseline for all experiments is 50%, which represents the classification accuracy obtained if a label of "humorous" (or "non-humorous") were assigned by default to all the examples in the data set. Experiments with uneven class distributions were also performed, and are reported in section 4.4.

4.1 Heuristics using Humor-specific Features

In a first set of experiments, we evaluated the classification accuracy using stylistic humor-specific features: alliteration, antonymy, and adult slang. These are numerical features that act as heuristics, and the only parameter required for their application is a threshold indicating the minimum value admitted for a statement to be classified as humorous (or non-humorous). These thresholds are learned automatically using a decision tree applied to a small subset of humorous/non-humorous examples (1,000 examples). The evaluation is performed on the remaining 15,000 examples, with results shown in Table 2. (We also experimented with decision trees learned from a larger number of examples, but the results were similar, which confirms our hypothesis that these features are heuristics, rather than learnable properties whose accuracy improves with additional training data.)

  Heuristic      One-liners/Reuters  One-liners/BNC  One-liners/Proverbs
  Alliteration        74.31%             59.34%           53.30%
  Antonymy            55.65%             51.40%           50.51%
  Adult slang         52.74%             52.39%           50.74%
  ALL                 76.73%             60.63%           53.71%

Table 2: Humor-recognition accuracy using alliteration, antonymy, and adult slang.

Considering the fact that these features represent stylistic indicators, the style of Reuters titles turns out to be the most different with respect to one-liners, while the style of proverbs is the most similar. Note that for all data sets the alliteration feature appears to be the most useful indicator of humor, which is in agreement with previous linguistic findings (Ruch, 2002).

4.2 Text Classification with Content Features

The second set of experiments was concerned with the evaluation of content-based features for humor recognition. Table 3 shows the results obtained using the three different sets of negative examples, with the Naïve Bayes and SVM text classifiers. Learning curves are plotted in Figure 2.

  Classifier     One-liners/Reuters  One-liners/BNC  One-liners/Proverbs
  Naïve Bayes         96.67%             73.22%           84.81%
  SVM                 96.09%             77.51%           84.48%

Table 3: Humor-recognition accuracy using Naïve Bayes and SVM text classifiers.
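The threshold-learning step described in Section 4.1 can be illustrated with a one-level decision tree (a decision stump) over a single numeric feature. This is a sketch under the assumption that a simple accuracy-maximizing cut is learned per feature; the paper does not publish its exact tree-induction details, and the toy data below is invented.

```python
def learn_threshold(values, labels):
    """Find the cut t maximizing accuracy of the rule `value >= t -> humorous`.

    values -- per-example feature scores (e.g. number of alliteration chains)
    labels -- 1 for humorous, 0 for non-humorous
    Returns (threshold, training_accuracy).
    """
    best = (min(values), 0.0)
    for t in sorted(set(values)):  # only observed values can change the split
        acc = sum((v >= t) == bool(y) for v, y in zip(values, labels)) / len(values)
        if acc > best[1]:
            best = (t, acc)
    return best
```

For example, learn_threshold([0, 1, 2, 3], [0, 0, 1, 1]) returns (2, 1.0): classifying every example with a feature score of at least 2 as humorous separates this toy set perfectly.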

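The multinomial Naïve Bayes model described in Section 3.2 can be sketched from scratch as follows. Add-one smoothing and whitespace tokenization are assumptions of this sketch, not details reported in the paper, and the phrases in the usage note are invented stand-ins for the real data sets.

```python
import math
from collections import Counter

class MultinomialNB:
    """Minimal multinomial Naive Bayes text classifier with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}  # word counts per class
        self.vocab = set()
        for doc, y in zip(docs, labels):
            tokens = doc.lower().split()
            self.counts[y].update(tokens)
            self.vocab.update(tokens)
        self.total = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        def log_posterior(c):
            return self.prior[c] + sum(
                math.log((self.counts[c][t] + 1) / (self.total[c] + len(self.vocab)))
                for t in doc.lower().split())
        return max(self.classes, key=log_posterior)
```

Trained on a handful of one-liner-like and headline-like strings, such a model already routes "take my advice" to the humorous class and "gold prices slip" to the news class, purely on word statistics.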
[Figure 2: Classification learning curves for Naïve Bayes and SVM, one panel per pairing of one-liners with a negative data set: classification accuracy (%) as a function of the fraction of training data (%).]

                                                             (a)                                                                                         (b)                                                                                         (c)

   Figure 2: Learning curves for humor-recognition using text classification techniques, with respect to three
   different sets of negative examples: (a) Reuters; (b) BNC; (c) Proverbs.
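The evaluation protocol behind these learning curves can be sketched as follows. The tiny multinomial Naive Bayes and the two-sentence synthetic "corpus" below are illustrative stand-ins only; the paper's experiments use full Naive Bayes and SVM text classifiers over the one-liner data sets:

```python
# Sketch of the learning-curve protocol of Figure 2: train on growing
# fractions of the training data and record held-out accuracy each time.
import math
import random
from collections import Counter

class TinyNaiveBayes:
    def fit(self, docs, labels):
        self.priors = Counter(labels)                      # class frequencies
        self.word_counts = {c: Counter() for c in self.priors}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        def log_score(c):
            total = sum(self.word_counts[c].values())
            score = math.log(self.priors[c])
            for w in doc.split():
                # Laplace-smoothed word likelihood
                score += math.log((self.word_counts[c][w] + 1) /
                                  (total + len(self.vocab)))
            return score
        return max(self.priors, key=log_score)

def learning_curve(train, test, fractions):
    """Accuracy on `test` after training on each fraction of `train`."""
    accuracies = []
    for frac in fractions:
        subset = train[:max(1, int(frac * len(train)))]
        docs, labels = zip(*subset)
        model = TinyNaiveBayes().fit(docs, labels)
        hits = sum(model.predict(d) == l for d, l in test)
        accuracies.append(hits / len(test))
    return accuracies

random.seed(0)
humorous = [("take my wife please she said", "humor")] * 20
regular  = [("the quarterly report was filed on tuesday", "non-humor")] * 20
train = humorous[:15] + regular[:15]
random.shuffle(train)
test = humorous[15:] + regular[15:]
print(learning_curve(train, test, [0.2, 0.6, 1.0]))
```

Plotting the resulting accuracies against the training fractions reproduces the shape of curves like those in Figure 2; a flat tail indicates that additional data no longer helps.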

Once again, the content of Reuters titles appears to be the most different with respect to one-liners, while the BNC sentences represent the most similar data set. This suggests that joke content tends to be very similar to regular text, although a reasonably accurate distinction can still be made using text classification techniques. Interestingly, proverbs can be distinguished from one-liners using content-based features, which indicates that despite their stylistic similarity (see Table 2), proverbs and one-liners deal with different topics.

4.3 Combining Stylistic and Content Features

Encouraged by the results obtained in the first two experiments, we designed a third experiment that attempts to jointly exploit stylistic and content features for humor recognition. The feature combination is performed using a stacked learner, which takes the output of the text classifier, joins it with the three humor-specific features (alliteration, antonymy, adult slang), and feeds the newly created feature vectors to a machine learning tool. Given the relatively large gap between the performance achieved with content-based features (text classification) and stylistic features (humor-specific heuristics), we decided to implement the second learning stage in the stacked learner using a memory-based learning system, so that low-performance features are not eliminated in favor of the more accurate ones⁵. We use the Timbl memory-based learner (Daelemans et al., 2001), and evaluate the classification using stratified ten-fold cross-validation. Table 4 shows the results obtained in this experiment, for the three different data sets.

    One-liners/Reuters   One-liners/BNC   One-liners/Proverbs
         96.95%              79.15%             84.82%

Table 4: Humor-recognition accuracy for combined learning based on stylistic and content features.

Combining classifiers results in a statistically significant improvement (paired t-test) with respect to the best individual classifier for the One-liners/Reuters and One-liners/BNC data sets, with relative error rate reductions of 8.9% and 7.3% respectively. No improvement is observed for the One-liners/Proverbs data set, which is not surprising since, as shown in Table 2, proverbs and one-liners cannot be clearly differentiated using stylistic features, and thus the addition of these features to content-based features is not likely to result in an improvement.

⁵ Using a decision tree learner in a similar stacked learning experiment resulted in a flat tree that takes a classification decision based exclusively on the content feature, completely ignoring the remaining stylistic features.

4.4 Discussion

The results obtained in the automatic classification experiments reveal that computational approaches represent a viable solution for the task of humor recognition, and that good performance can be achieved using classification techniques based on stylistic and content features.

Despite our initial intuition that one-liners are most similar to other creative texts (e.g. Reuters titles, or the sometimes almost identical proverbs), and that the learning task would thus be more difficult in relation to these data sets, comparative experimental results show that it is in fact more difficult to distinguish humor with respect to regular text (e.g. BNC

sentences). Note however that even in this case the combined classifier leads to a classification accuracy that improves significantly over the a priori known baseline.

An examination of the content-based features learned during the classification process reveals interesting aspects of the humorous texts. For instance, one-liners seem to constantly make reference to human-related scenarios, through the frequent use of words such as man, woman, person, you, I. Similarly, humorous texts seem to often include negative word forms, such as the negative verb forms doesn't, isn't, don't, or negative adjectives like wrong or bad. A more extensive analysis is likely to reveal additional humor-specific content features, which could also be used in studies of humor generation.

In addition to the three negative data sets, we also performed an experiment using a corpus of arbitrary sentences randomly drawn from the three negative sets. The humor recognition with respect to this mixed negative data set resulted in 63.76% accuracy for stylistic features, 77.82% for content-based features using Naïve Bayes, and 79.23% using SVM. These figures are comparable to those reported in Tables 2 and 3 for One-liners/BNC, which suggests that the experimental results reported in the previous sections do not reflect a bias introduced by the negative data sets, since similar results are obtained when the humor recognition is performed with respect to arbitrary negative examples.

As indicated in Section 2.2, the negative examples were selected to be structurally and stylistically similar to the one-liners, making the humor-recognition task more difficult than in a real setting. Nonetheless, we also performed a set of experiments where we made the task even harder, using uneven class distributions. For each of the three types of negative examples, we constructed a data set using 75% non-humorous examples and 25% humorous examples. Although the baseline in this case is higher (75%), the automatic classification techniques for humor recognition still improve over this baseline. The stylistic features lead to a classification accuracy of 87.49% (One-liners/Reuters), 77.62% (One-liners/BNC), and 76.20% (One-liners/Proverbs), and the content-based features used in a Naïve Bayes classifier result in accuracy figures of 96.19% (One-liners/Reuters), 81.56% (One-liners/BNC), and 87.86% (One-liners/Proverbs).

Finally, in addition to classification accuracy, we were also interested in the variation of classification performance with respect to data size, an aspect particularly relevant for directing future research. Depending on the shape of the learning curves, one could decide to concentrate future work either on the acquisition of larger data sets, or on the identification of more sophisticated features. Figure 2 shows that regardless of the type of negative data, there is significant learning only until about 60% of the data (i.e. about 10,000 positive examples, and the same number of negative examples). The rather steep ascent of the curve, especially in the first part of the learning, suggests that humorous and non-humorous texts represent well-distinguishable types of data. An interesting effect can be noticed toward the end of the learning, where for both classifiers the curve becomes completely flat (One-liners/Reuters, One-liners/Proverbs), or even has a slight drop (One-liners/BNC). This is probably due to the presence of noise in the data set, which starts to become visible for very large data sets⁶. This plateau also suggests that more data is not likely to help improve the quality of an automatic humor-recognizer, and that more sophisticated features are probably required.

⁶ We also like to think of this behavior as if the computer were losing its sense of humor after an overwhelming number of jokes, much as humans get bored and stop appreciating humor after hearing too many jokes.

5 Related Work

While humor is relatively well studied in scientific fields such as linguistics (Attardo, 1994) and psychology (Freud, 1905; Ruch, 2002), to date there is only a limited number of research contributions made toward the construction of computational humour prototypes.

One of the first attempts is perhaps the work described in (Binsted and Ritchie, 1997), where a formal model of semantic and syntactic regularities was devised, underlying some of the simplest types of puns (punning riddles). The model was then exploited in a system called JAPE that was able to automatically generate amusing puns.

Another humor-generation project was the HAHAcronym project (Stock and Strapparava, 2003), whose goal was to develop a system able to automatically generate humorous versions of existing

acronyms, or to produce a new amusing acronym constrained to be a valid vocabulary word, starting with concepts provided by the user. The comic effect was achieved mainly by exploiting incongruity theories (e.g. finding a religious variation for a technical acronym).

Another related work, devoted this time to the problem of humor comprehension, is the study reported in (Taylor and Mazlack, 2004), focused on a very restricted type of wordplay, namely the "Knock-Knock" jokes. The goal of the study was to evaluate to what extent wordplay can be automatically identified in "Knock-Knock" jokes, and whether such jokes can be reliably distinguished from other non-humorous text. The algorithm was based on automatically extracted structural patterns and on heuristics closely tied to the peculiar structure of this particular type of joke. While the wordplay recognition gave satisfactory results, the identification of jokes containing such wordplays turned out to be significantly more difficult.

6 Conclusion

A conclusion is simply the place where you got tired of thinking.
(anonymous one-liner)

The creative genres of natural language have traditionally been considered outside the scope of computational modeling. In particular humor, because of its puzzling nature, has received little attention from computational linguists. However, given the importance of humor in our everyday life, and the increasing importance of computers in our work and entertainment, we believe that studies related to computational humor will become increasingly important.

In this paper, we showed that automatic classification techniques can be successfully applied to the task of humor recognition. Experimental results obtained on very large data sets showed that computational approaches can be efficiently used to distinguish between humorous and non-humorous texts, with significant improvements observed over a priori known baselines. To our knowledge, this is the first result of this kind reported in the literature, as we are not aware of any previous work investigating the interaction between humor and techniques for automatic classification.

Finally, through the analysis of learning curves plotting the classification performance with respect to data size, we showed that the accuracy of the automatic humor-recognizer stops improving after a certain number of examples. Given that automatic humor recognition is a rather understudied problem, we believe that this is an important result, as it provides insights into potentially productive directions for future work. The flattened shape of the curves toward the end of the learning process suggests that rather than focusing on gathering more data, future work should concentrate on identifying more sophisticated humor-specific features, e.g. semantic oppositions, ambiguity, and others. We plan to address these aspects in future work.

References

S. Attardo. 1994. Linguistic Theory of Humor. Mouton de Gruyter, Berlin.
K. Binsted and G. Ritchie. 1997. Computational rules for punning riddles. Humor, 10(1).
C. Bucaria. 2004. Lexical and syntactic ambiguity as a source of humor. Humor, 17(3).
W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch. 2001. TiMBL: Tilburg Memory Based Learner, version 4.0, reference guide. Technical report, University of Antwerp.
S. Freud. 1905. Der Witz und seine Beziehung zum Unbewussten. Deutike, Vienna.
T. Joachims. 1998. Text categorization with Support Vector Machines: learning with many relevant features. In Proceedings of the European Conference on Machine Learning.
D.E. Knuth. 1993. The Stanford GraphBase: A Platform for Combinatorial Computing. ACM Press.
D. Lewis, Y. Yang, T. Rose, and F. Li. 2004. RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361–397.
A. McCallum and K. Nigam. 1998. A comparison of event models for Naive Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization.
G. Miller. 1995. WordNet: A lexical database. Communications of the ACM, 38(11):39–41.
A. Nijholt, O. Stock, A. Dix, and J. Morkes, editors. 2003. Proceedings of the CHI-2003 Workshop: Humor Modeling in the Interface, Fort Lauderdale, Florida.
W. Ruch. 2002. Computers with a personality? Lessons to be learned from studies of the psychology of humor. In Proceedings of the April Fools' Day Workshop on Computational Humour.
O. Stock and C. Strapparava. 2003. Getting serious about the development of computational humour. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI-03), Acapulco, Mexico.
O. Stock, C. Strapparava, and A. Nijholt, editors. 2002. Proceedings of the April Fools' Day Workshop on Computational Humour, Trento.
J. Taylor and L. Mazlack. 2004. Computationally recognizing wordplay in jokes. In Proceedings of CogSci 2004, Chicago.
