									                Proceedings of the 37th Hawaii International Conference on System Sciences - 2004

      An Exploratory Study on Promising Cues in Deception Detection and
                         Application of Decision Tree

                           Tiantian Qin, Judee Burgoon, Jay F. Nunamaker, Jr.
                                          University of Arizona
                             {tqin, jBurgoon, jnunamaker}

                      Abstract

     Automatic deception detection (ADD) is becoming increasingly important, and it can be facilitated by the development of data mining techniques. In this paper we focus on using decision trees to classify deceptions automatically. The major question is how to select experiment data (the input data for training a decision tree) so that it maximally benefits decision tree performance. We investigate how promising the cues in the experiment data are, and then adjust their use in the decision tree accordingly. Five comparative decision tree experiments demonstrate that tree performance, in terms of accuracy rate and complexity, is dramatically improved by selecting cues statistically and semantically.

1.   Introduction

     Deception means that messages are transmitted to cause a false impression or conclusion [3]. The challenge of detecting untrue information has become more significant since the events of 9/11. Meeting it requires tremendous effort to analyze messages, compare them with previous records, and identify suspicious information. Furthermore, empirical studies provide evidence that humans are typically very poor at detecting deception and fallacious information [11], especially when the messages come from text-based, computer-mediated communication, where accuracy is little better than chance [9]. Tools that facilitate human deception detection are therefore valuable; such tools would also benefit law enforcement in criminal investigations.
     Previous literature offers many prospective cues that might be useful for distinguishing deception from truth in text-based messages while remaining manageable by computer software tools [2,7]. Such linguistic cues include number of sentences, number of words, sentence complexity, etc. Judging whether someone is deceptive amounts to deciding the classification (true or deceptive) of that person's message from a set of attributes (linguistic cues). If the suspicious message is treated as a data record with a list of attributes and an unknown classification (true or deceptive), deception detection is nothing more than a data classification process.
     Many data mining techniques are now available to classify data, such as neural networks, Bayesian networks, k-means, decision trees, etc. [8,7]. All these data mining techniques require a training process in which data with known classifications are input and their attributes are analyzed automatically. The training process then produces a classification baseline, which could be a tree structure (decision tree), a network structure (Bayesian network), and so on. Future data can be classified based on the baseline.
     Research on cues and decision trees makes the deception detection objectives achievable. Briefly, the final research objective is to decide automatically whether a text-based message (such as an email) is true or not. In data mining terms, the goal is to classify the data (message) into one of 2 categories (true or deceptive) based on its attributes (linguistic cues). Among the many data mining methods, we focus on the decision tree (C4.5) in this paper, because it is powerful and the tree structure shows interactions of cues. The most challenging part of the goal is the training process, i.e., constructing a baseline structure, or threshold values, distinguishing between deception and truth.
     Training data is critical to the training process. At the current stage, we obtain training data from experiments [2]. However, we claim that directly applying experiment data as training data (a quick and dirty method) does not guarantee a reliable baseline, for two major reasons:

                                      0-7695-2056-1/04 $17.00 (C) 2004 IEEE                                             1

     First, deceptive behavior (represented as a cue) is not consistent across all contexts. Cues (attributes) need to be validated according to their context, since deceptive behaviors (cues) vary significantly, and sometimes even reverse, under different circumstances [9]. For example, people chatting vocally might communicate more informally than people chatting by text, where messages contain more complicated and/or longer sentences. It is necessary to analyze experiment data semantically and to construct multiple baselines, one for each corresponding context.
     Second, even in the same context, characteristics of cues may differ. It has been noted that the selection of attributes tremendously influences data mining [7]. Some cues are more promising than others for use in detecting deception. For example, some may be more significant in a statistical sense. Because including superfluous attributes (cues) in training data would decrease the performance of the decision tree, cue selection is also an important issue in selecting training data.
     In this paper, we concentrate on how to select cues (attributes) according to context and promise in order to improve training data (i.e., the preprocessing of experiment data in order to generate a qualified training data set for a decision tree). We demonstrate that semantic and statistical analysis of the experiment data can result in drastic improvements in decision trees. Meanwhile, consistent (under some context) and promising cues can maintain a focus on important indicators of the direction future deception detection experiments should take. As the first paper on semantic selection of training data for decision-tree implementation in a deception detection context, it sheds light on questions and discussions in this area.
     This paper is organized into the following sections. The Method section describes the deception detection experiments that provide the source data and the linguistic cues (attributes). In the Results and Discussion section, we analyze the cues and summarize their potential power in application; we also point out preprocessing methods that could enhance decision tree performance. The Application section shows five comparative decision tree experiments that support the preprocessing methods of the previous section. Next, we take a close look at the decision trees generated and explain the trees in the deception detection context. We then finish the paper by summarizing the findings in the Conclusion section.

2.   Method

     The mock theft experiment [2] was designed as a pilot study. One of the purposes of this study was to reveal useful linguistic cues for detecting deception. In this experiment, participants staged a mock theft and were subsequently interviewed by untrained and trained interviewers via text chat (Txt), audio conferencing (Audio), or face-to-face (FtF) interaction. The FtF interactions were later transcribed, and the transcripts and chats were submitted to linguistic analysis of such features as number of words, number of sentences, number of unique words (lexical diversity), emotiveness, pronoun usage, plus several others that are available in the Grammatik tool within WordPerfect. Due to the small sample size, only a few of the differences between innocents (truth tellers) and thieves (deceivers) were statistically significant. Furthermore, because of the effect of interaction modality (text chat, face-to-face, or audio chat), cues might behave differently, or even oppositely, between truth tellers and deceivers under different modalities. For example, truth tellers had more sentences than deceivers in Txt but fewer in Audio, i.e., cues are not “consistent” across different modalities.
     In the mock theft experiment, students were recruited from a multi-sectioned communication class by offering them credit for participation and the chance to win money if they were successful at their task. Half of the students were randomly assigned to be “thieves,” i.e., those who would be deceiving about a theft, and the other half became “innocents,” i.e., those who would be telling the truth.
     Interviewees in the deceptive condition were assigned to “steal” a wallet that was left in a classroom. In the truthful condition, interviewees were told that a “theft” would occur in class on an assigned day. All of the interviewees and interviewers then appeared for interviews according to a pre-assigned schedule. We attempted to motivate serious engagement in the task by offering interviewers $10 if they could successfully detect whether their interviewee was innocent or guilty and successfully detect whether they were deceiving or telling the truth on a series of the interview questions. In turn, we offered interviewees $10 if they convinced a trained interviewer that they were innocent and that their answers to several questions were truthful. An additional incentive was a $50 prize to be awarded to the most successful interviewee.
     Interviewees were then interviewed by one of three trained interviewers under one of three modalities: face-to-face (FtF), text chat, or audio conferencing. The interviews followed a standardized Behavioral Analysis Interview format that is taught to criminal investigators.
     There are three segments of questions. The first segment consists of questions about previous classes, such as: what is your favorite class? In this segment every subject gave true answers. The second segment is on


questions about previous jobs, such as: describe your last job. In this segment, subjects acting as “thieves” were supposed to fake their job stories, and had to make one up if they did not have any job experience. The third segment consists of questions about the “stealing” event: describe your experience in the classroom where the wallet went missing. These three segments recorded the patterns of subjects when they were telling the truth, telling low-risk lies on familiar topics, and telling high-risk lies. The third segment is the situation that we are most interested in, since the deceptive behavior is close to the kind that could bring the most dangerous (high-risk) consequences in the real world. In the Application section, the decision tree will be trained on data from the third segment.
     Interviews were subsequently transcribed and submitted to linguistic analysis. Clusters of potential indicators were included, all of which could be calculated automatically with a shallow parser (Grok or Iskim) or with a look-up dictionary. The specific classes of cues and respective indicators were as follows:

     1. Quantity (number of syllables, number of words, number of sentences, number of short sentences, number of simple sentences)
     2. Vocabulary-level complexity (number of big words, number of syllables per word, vocabulary complexity, and lexical complexity)
     3. Sentence-level complexity (Flesch-Kincaid grade level, average number of words per sentence, sentence complexity, number of conjunctions)
     4. Specificity and expressiveness (emotiveness index, rate of adjectives and adverbs, number of affective terms, sensory and RM terms)
     5. Informality (total number of flagged errors (from WordPerfect))

     These cue classes were chosen because they distinguish truth from deception while being machine extractable. In general, deceivers are higher on quantity, specificity and expressiveness, and lower on complexity [10]. For informality, deceivers are less informal than truth-tellers [3].
     The FtF data set is small since the transcription process has not finished; therefore we did not consider FtF. We applied a design for statistical testing with 3 segments, 2 modalities, and 2 conditions (truth teller and deceiver). Total 58 subjects: 28 are for Txt (16 are “thieves”), and 20 (9 are “thieves”) are for Audio.
     We want cues that are both consistent and significant. “Consistent” means that people’s deceptive behaviors are identical across some contexts. This paper considers context only in two categories: segment (low-risk, high-risk) and modality (text, audio). Therefore there are 4 contexts: low-risk Text, high-risk Text, low-risk Audio, and high-risk Audio. As an example of a consistent cue, we deem sentence complexity consistent if deceivers use simpler sentences (lower sentence complexity) in all 4 contexts. “Significant” means that the values of cues are statistically different between deceivers and truth-tellers. In reality, it is usually difficult to find a cue that is consistent under all contexts while remaining significant. We need to decide from the combined performance of the cues and come up with a set of good cues. These good cues are referred to as promising cues in the rest of the paper. In the next section, we discuss how to select promising cues.

3.   Results and Discussion

     In this section, we discuss issues that influence how promising a cue is: the statistical significance level of the cues, relationships among the cues, and consistency of the cues (in the contexts of high-risk Txt and high-risk Audio). We summarize the promising cues and predict that considering only those promising cues (attributes) in defining the training data should result in better decision trees, with a reliable correct classification rate and less complexity, than using all 19 cues. Empirical evidence supporting this prediction is given in the next section, in which several decision trees, built using different training data, are compared.

3.1. Relations among cues

     Including redundant attributes in training data cannot improve tree performance, but adds unnecessary computational complexity. For example, if number-of-syllables and number-of-words are highly correlated and can be statistically demonstrated to be substitutable one for the other, retaining both cues is almost equivalent to using the same cue twice. Therefore, we decided to check the relationships among cues in order to eliminate superfluous ones.
     The 19 cues can be grouped into 5 dimensions: quantity, vocabulary-level complexity, sentence-level complexity, specificity and expressiveness, and informality. Table 1 shows that cues in the first column can be replaced by the corresponding cues in the second column, for the reasons shown in the third column, where correlations between cues and reliability tests are listed. For example, number-of-simple-sentences and number-of-short-sentences are highly correlated with number-of-sentences; number-of-syllables is highly correlated with number-of-words. Reliability testing confirmed that these


variables measure similar information and can be replaced one for the other. This suggests that eliminating # simple sentences, # short sentences, and # syllables can reduce complexity without losing information.

                Table 1. Cues relation

 Duplicated          Represented     Reason
 variables           by              Correlation     Reliability test
 Simple sentences    Sentences       0.692**         0.7074
 Short sentences     Sentences       0.863**         0.8045
 Syllables           Words           0.993**         0.9376

 **: Significant at .05 level.

     For simplicity, these 3 cues (number-of-simple-sentences, number-of-short-sentences, and number-of-syllables) will be referred to as “duplicated”. Getting rid of duplicated cues is expected to improve, or at least not reduce, the performance of trees, since noise is reduced by keeping just sufficient information in the training set.

3.2. Statistical significance

     A cue is considered statistically significant if the difference in its occurrence between deceivers and truth-tellers is statistically significant, so that we have reason to believe that this difference is not due to chance. Significant cues are more important because they represent a systematic difference. However, because a decision tree (like many other existing data mining tools) cannot automatically determine the statistical significance of attributes (cues), we rely on traditional statistical methods.
     A series of GLM and independent-samples t-tests were applied. F and p values are shown in Table 2. The multivariate testing of cues occurring frequently produced no significant multivariate effects (p>0.01). T-tests provided weak support for the significance of number-of-words (p = .096).
     Multivariate testing of complexity at both the sentence level (simple sentences, long sentences, short sentences, sentence complexity, Flesch-Kincaid grade level, number of conjunctions, average-words-per-sentence (AWS)) and the vocabulary level (vocabulary complexity, number of big words, average-syllables-per-word (ASW)) did not show significance. T-tests provided evidence for effects of deception condition on several individual variables (AWS, with p=.021; Flesch-Kincaid grade level, with p=.056; and sentence complexity, with p=.082). This implies that sentence complexity cues are helpful for distinguishing between deceptive and true messages.
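As an illustration of the independent-samples t-test used above, the following minimal Python sketch computes the pooled-variance t statistic and its degrees of freedom for one cue. The function name and the sample values are hypothetical and are not taken from the experiment. It is also an assumption that the pooled-variance form applies; the paper does not state which variant was used. Converting t to a p-value additionally requires the t distribution (from a table or a statistics library), which is omitted here.

```python
import math
import statistics


def independent_t(group_a, group_b):
    """Return (t, df) for a pooled-variance independent-samples t-test.

    t is computed as mean(group_a) - mean(group_b) divided by the
    standard error of that difference.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # statistics.variance is the sample variance (n - 1 denominator).
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, df


# Hypothetical values of one cue for two small groups (not experiment data).
truth_tellers = [1, 2, 3, 4, 5]
deceivers = [2, 3, 4, 5, 6]
t_stat, df = independent_t(truth_tellers, deceivers)
print(f"t = {t_stat:.3f}, df = {df}")
```

A cue would then be called significant at a chosen level if the two-sided p-value for this t and df falls below that level.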

                    Table 2. F (p) and t (p) values of the 19 cues

                                  Multivariate test      Independent samples
 Cue                              between subjects       t-test
 Syllables                        1.842(.182)            1.502(.140)
 Words                            2.407(.128)            1.702(.096)*
 Sentences                        .001(.972)             .111(.912)
 Short sentences                  .588(.447)             -.725(.472)
 Long sentences                   6.566(.014)*           2.781(.008)*
 Simple sentences                 .002(.969)             .061(.951)
 Big words                        .288(.594)             .616(.541)
 Average syllables per word       1.703(.199)            -1.668(.102)
 Average words per sentence       4.368(.042)*           2.414(.021)*
 Flesch-Kincaid grade level       2.690(.108)            1.958(.056)*
 Sentence complexity              2.055(.159)            1.779(.082)*
 Vocabulary complexity            .512(.478)             -.997(.324)
 # of Conjunctions                2.569(.116)            1.426(.163)
 Rate Adjectives and Adverbs      .329(.569)             -.596(.554)
 Emotiveness                      .054(.818)             -.233(.817)
 Lexicon complexity               .771(.404)             .618(.542)
 Sens&RM                          .568(.457)             .769(.447)
 Total flagged errors             3.945(.056)*           2.174(.037)*
 Affect                           3.291(.214)            1.630(.110)

 *: Significant at .1 level
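The starred entries in the t-test column of Table 2 can be recovered by thresholding the p-values at the .1 level. The short Python sketch below does this with the t-test p-values copied from Table 2; the names T_TEST_P and ALPHA are illustrative choices, not part of the original analysis.

```python
# t-test p-values copied from Table 2 (independent samples t-test column).
T_TEST_P = {
    "Syllables": 0.140,
    "Words": 0.096,
    "Sentences": 0.912,
    "Short sentences": 0.472,
    "Long sentences": 0.008,
    "Simple sentences": 0.951,
    "Big words": 0.541,
    "Average syllables per word": 0.102,
    "Average words per sentence": 0.021,
    "Flesch-Kincaid grade level": 0.056,
    "Sentence complexity": 0.082,
    "Vocabulary complexity": 0.324,
    "# of Conjunctions": 0.163,
    "Rate Adjectives and Adverbs": 0.554,
    "Emotiveness": 0.817,
    "Lexicon complexity": 0.542,
    "Sens&RM": 0.447,
    "Total flagged errors": 0.037,
    "Affect": 0.110,
}

ALPHA = 0.1  # significance level stated in the Table 2 footnote

# Keep only the cues whose t-test p-value is below the threshold.
significant_cues = sorted(cue for cue, p in T_TEST_P.items() if p < ALPHA)

for cue in significant_cues:
    print(f"{cue}: p = {T_TEST_P[cue]:.3f}")
```

Six cues pass the threshold: number of words, long sentences, average words per sentence, Flesch-Kincaid grade level, sentence complexity, and total flagged errors.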


     For specificity and expressiveness (adjectives and adverbs, emotiveness, Sens&RM, and affect), the multivariate and t-tests showed no significance, p>.1.
     Informality of message (total flagged errors) was significantly different between deception and truth by both the multivariate test (p=.056) and the t-test (p=.037), implying that informality might also be useful in training data.
     In general, number-of-words, number-of-long-sentences, AWS, Flesch-Kincaid grade level, sentence complexity and total-flagged-errors can be considered more significant than the other cues. We refer to them as significant for simplicity. This investigation confirmed that deceivers behave differently from truth tellers in text chat and/or audio chat communications.

3.3. Consistency

     A cue is considered consistent if it shows the same behavior under different circumstances (modality and segment). For example, if deceivers speak or write less than truth tellers in response to both low-risk and high-risk questions (segment), and/or in both Txt and Audio situations (modality), we consider the measure to be consistent.
     We combine modality and segment effects to decide the consistency of a cue. Consistency is clearly visible on profile plots such as that for number-of-words in Figure 1.
     In Figure 1, Modality = 1 refers to Txt and 2 refers to Audio. Questions 1, 2 and 3 represent segments, i.e., question 1 is the segment where all subjects told the truth; questions 2 and 3 are segments in which subjects acting as “thieves” told low- and high-risk lies, respectively. We focus on segments 2 and 3. In the Txt situation, deceivers said or wrote fewer words in both segments 2 and 3. In the Audio situation, however, whether deceivers said or wrote fewer words varies with the segment, i.e., more in the low-risk context and fewer in the high-risk context. Therefore, number-of-words was not a consistent cue, because it depended on the segment. Although this method is subjective, it provides a vivid and effective way to look at the consistency of cues. Using the comparison method on all cues in the quantity dimension (number of syllables, number of words, number of sentences, number of short sentences, number of simple sentences) reveals a trend in which quantity cues are consistent only in Txt, since the deceiver has less quantity than the truth teller in both the low-risk Txt and high-risk Txt contexts.
     Continuing the consistency comparison for the remaining cue dimensions (lexical-level complexity, sentence-level complexity, expressiveness, specificity, and informality), we observe that sentence complexity is the most consistent. All profile plots of sentence complexity are shown in Figure 2. The only inconsistent point occurs in number-of-conjunctions, in the Audio modality. The inconsistency implies that the Audio modality is less reliable than Txt, which remains true for all cue dimensions other than sentence complexity. We then predict that using only Txt data as the training set could result in better decision trees than using data from both modalities, because the Audio modality is not consistent and will bring noise into the decision trees.

     [Figure 1: two profile-plot panels, "At MODALITY = 1" and "At MODALITY = 2"; y-axis: Estimated Marginal Means; x-axis: QUESTION (1, 2, 3); legend GUILT: guilty, innocent.]

       Figure 1. Profile plot for number-of-words

3.4. Summary

     In Table 3, we summarize the promising cues by combining significance and consistency.
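The consistency criterion of Section 3.3 can also be stated as a simple sign check: a cue is consistent over a set of contexts if the difference between the deceiver mean and the truth-teller mean has the same sign in every context. The Python sketch below illustrates this under stated assumptions; the function name and all numeric means are hypothetical, chosen only to mimic the qualitative number-of-words pattern described in the text, and are not values read from Figure 1.

```python
def is_consistent(means_by_context):
    """Return True if deceivers differ from truth tellers in the same
    direction in every context.

    means_by_context maps a context label to a pair
    (deceiver_mean, truth_teller_mean).
    """
    signs = set()
    for deceiver_mean, truth_mean in means_by_context.values():
        diff = deceiver_mean - truth_mean
        # Sign of the difference: -1, 0, or +1.
        signs.add((diff > 0) - (diff < 0))
    # Consistent only if one direction is seen and it is not a tie.
    return len(signs) == 1 and 0 not in signs


# Hypothetical means for number-of-words (illustrative values only).
number_of_words = {
    ("Txt", "low-risk"): (50.0, 70.0),      # deceivers use fewer words
    ("Txt", "high-risk"): (60.0, 90.0),     # deceivers use fewer words
    ("Audio", "low-risk"): (150.0, 120.0),  # deceivers use more words
    ("Audio", "high-risk"): (100.0, 130.0), # deceivers use fewer words
}

txt_only = {ctx: m for ctx, m in number_of_words.items() if ctx[0] == "Txt"}

print(is_consistent(number_of_words))  # all four contexts
print(is_consistent(txt_only))         # Txt contexts only
```

With these illustrative numbers the cue fails the check across all four contexts but passes when restricted to Txt, which mirrors the argument for training on Txt data only. The check looks only at the direction of the difference, not at its statistical significance.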


     [Profile plots (figure not reproduced), one row per cue: Words per sentence; Flesch-Kincaid grade level; # Conjunctions. Each row has two panels, "At MODALITY = 1" and "At MODALITY = 2"; y-axis: Estimated Marginal Means of MEASURE_1; x-axis: QUESTION (1, 2, 3); legend GUILT: guilty, innocent.]
                                 3 .0
                                                                                                          g u ilt y                                                            5                                                                           g u ilt y

                                 2 .5                                                                     in n o c e n t                                                       4                                                                           in n o c e n t
                                        1                              2                          3                                                                                  1                              2                             3

                                            Q U E S T IO N                                                                                                                           Q U E S T IO N

R ateM odals:
                                             E s t im a t e d M a r g in a l M e a n s o f M E A S U R E _ 1                                                                             E s t im a t e d M a r g in a l M e a n s o f M E A S U R E _ 1
                                             A t M O D A L IT Y = 1                                                                                                                      A t M O D A L IT Y = 2
                                  .0 4                                                                                                                                       .0 4

                                  .0 3                                                                                                                                       .0 3
     Estimated Marginal Means

                                                                                                                           Estimated Marginal Means

                                  .0 2                                                                                                                                       .0 2

                                                                                                      G UILT                                                                                                                                      G UILT
                                  .0 1                                                                                                                                       .0 1
                                                                                                          g u ilt y                                                                                                                                      g u ilt y

                                 0 .0 0                                                                   in n o c e n t                                                    0 .0 0                                                                       in n o c e n t
                                            1                             2                       3                                                                                  1                             2                          3

                                             Q U E S T IO N                                                                                                                              Q U E S T IO N

Sentence com plexity:
                                        E s t im a t e d M a r g in a l M e a n s o f M E A S U R E _ 1
                                                                                                                                                                                     E s t im a t e d M a r g in a l M e a n s o f M E A S U R E _ 1
                                        A t M O D A L IT Y = 1
                                 60                                                                                                                                                  A t M O D A L IT Y = 2

   Estimated Marginal Means

                                                                                                                                              Estimated Marginal Means


                                                                                                      G UILT
                                 30                                                                                                                                                                                                                   G UILT
                                                                                                          g u ilt y
                                                                                                                                                                                                                                                           g u ilt y

                                 20                                                                       in n o c e n t                                                      20                                                                           in n o c e n t
                                        1                             2                           3                                                                                  1                             2                              3

                                        Q U E S T IO N                                                                                                                               Q U E S T IO N

                                                                              Figure 2. Profile plots of sentence level complexity


     Although expressiveness and specificity did not appear consistent or
significant in this pilot study, they are important because they represent a
new information dimension. Their lack of significance could be a consequence
of the small data set and of the dictionary used to calculate these cues. We
need to continue evaluating them with more experimental data and a better
dictionary.

     In short, removing the three duplicated cues could decrease the
complexity of the decision tree. The six significant cues have the potential
to be the most efficient cues for training data. Cues are more consistent in
the Txt situation, suggesting that using only Txt data as a training set
could produce better decision trees.

                         Table 3. Summary of promising cues

Cue class                   Representative cue                   Significant         Consistency        Promising
                                                                 difference (D/T)                       level
--------------------------  -----------------------------------  ------------------  -----------------  ---------
Quantity                    # Words                              Yes                 Better under Txt   mid-high
Lexical level complexity                                         No                  No                 low-mid
Sentence level complexity   AWS, FK Grade, Sentence complexity   Yes                 Good               high
Expressiveness,             Could be good indicators. Need more tests.
Specificity
Informacy                   Errors                               Yes                 Better under Txt   mid-high

4.   Application

     The decision trees were constructed with C4.5 (Weka). Cross validation
was used for more reliable results [7]. We used the data in segment 3 because
this type of deception (high risk) is more interesting.

     Figure 3 shows the results of five experiments. In "Original" we used
all 19 cues and both Txt and Audio data as the training set. In "No Duplicate"
we used 16 cues, leaving out the 3 duplicates (number of simple sentences,
short sentences, and syllables). In "Significant" we used only the 6
statistically significant cues (number of words, long sentences, AWS,
Flesch-Kincaid grade level, sentence complexity, total flagged errors). The
last two tests contain only Txt data, with the no-duplicate cues and the
significant cues, respectively. Each test was repeated 20 times and we
recorded the highest prediction rate (the rate at which the tree correctly
classified deception and truth). Results for Audio-only data are not displayed
because the previous analysis and profile plots revealed inconsistency in the
Audio data. However, ways to automatically distinguish deception in the Audio
context may exist; the Audio context is a highly sensitive scenario that needs
further semantic analysis and deeper refinement of the data.

[Figure 3: chart of the prediction rate for the five experiments, labeled
Original; No Duplicate; Significant; Only Txt, No Duplicate; Only Txt,
Significant. Values shown on the chart: 58.3%, 60.42%, 62.5%, 75%, 78.58%.]

               Figure 3. Performance (prediction rate) of C4.5

     As shown in the figure, the prediction rate increases across the
experiments. Training with Txt was significantly better than training with
combined data from Txt and Audio. This means that deceptive behavior in Txt
and in Audio is so different that combining them is likely to produce more
noise in the training set, thus damaging the prediction rate. That cues are
also more consistent in Txt means that Txt data are more reliable.

     Although the improvement is not significant, organizing training data
without duplicated cues is slightly better than training with all cues.
Simplifying the cues so that they still represent sufficient information in
the messages can help to improve performance, and the performance is even
better when using only significant cues.
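
The evaluation protocol described in this section (cross validation, repeated
20 times per cue set, keeping the highest prediction rate) can be sketched in a
few lines of Python. This is only an illustration under stated assumptions, not
the Weka setup used in the study: the learner is a one-split decision stump
standing in for C4.5, the fold count k is an assumption (the text does not
state it), and the records, cue values, and function names are hypothetical.

```python
import random
from statistics import mean


def majority(labels):
    """Most frequent label; ties are broken alphabetically for determinism."""
    return max(sorted(set(labels)), key=labels.count)


def train_stump(rows, labels):
    """Stand-in for C4.5: the single threshold split with the fewest training errors."""
    default = majority(labels)
    # Start from "always predict the majority class" (threshold at +infinity).
    best = (sum(y != default for y in labels), 0, float("inf"), default, default)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        for low, high in zip(values, values[1:]):
            t = (low + high) / 2  # candidate threshold between two observed values
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    _, f, t, l_lab, r_lab = best
    return lambda row: l_lab if row[f] <= t else r_lab


def cross_val_rate(rows, labels, k, rng):
    """One k-fold cross-validation run; returns the fraction classified correctly."""
    order = list(range(len(rows)))
    rng.shuffle(order)
    correct = 0
    for fold in (order[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        model = train_stump([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(model(rows[i]) == labels[i] for i in fold)
    return correct / len(rows)


def repeated_rates(records, labels, cues, repeats=20, k=4, seed=0):
    """Repeat cross validation on the chosen cues; return (highest, mean) rate.

    k=4 is an assumed fold count chosen for the tiny demo data set below.
    """
    rng = random.Random(seed)
    rows = [[rec[c] for c in cues] for rec in records]
    rates = [cross_val_rate(rows, labels, k, rng) for _ in range(repeats)]
    return max(rates), mean(rates)


# Hypothetical records with two cues per message; not data from the study.
records = [
    {"average_words_per_sentence": a, "sentence_complexity": s}
    for a, s in [(8, 30), (9, 25), (10, 41), (11, 28), (12, 35), (9.5, 24),
                 (20, 21), (21, 33), (22, 19), (23, 27), (24, 18), (25, 29)]
]
labels = ["deceptive"] * 6 + ["true"] * 6

best, average = repeated_rates(
    records, labels, ["average_words_per_sentence", "sentence_complexity"]
)
print(f"highest rate: {best:.2%}, mean rate: {average:.2%}")
```

Passing a different list of cue names to repeated_rates corresponds to switching
between cue sets such as "No Duplicate" and "Significant". Returning the mean
next to the highest rate shows how much the rate varies between repetitions,
which matters for a data set this small.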


5.    A close look at decision trees

     In this section, the two best decision trees are displayed and explained
in detail: 1) a tree built with Txt-only data and the 16 non-duplicated cues;
and 2) a tree built with Txt-only data and the 6 significant cues. These two
trees had little noise in their training sets and therefore exhibit more
robust structures than the other trees. All 5 trees are shown in the appendix.

[Figure 4 (tree diagram), written out as text; leaf labels follow the diagram
and the matching appendix listing:
  ASL <= 15.75
      Sen&RM <= 1: Deceptive
      Sen&RM > 1
          ASL <= 12.52: Deceptive
          ASL > 12.52: True
  ASL > 15.75
      ASW <= 1.4: True
      ASW > 1.4
          ASL <= 18: True
          ASL > 18: Deceptive]

          Figure 4. Txt only, with 16 no duplicated cues

     As shown in figure 4, the decision tree picked up
average_words_per_sentence (the average sentence length, ASL), Sens&RM, and
average_syllables_per_word (ASW), and organized them in a hierarchical way.
Except for average_syllables_per_word (ASW), most of them were significant
cues.

     The tree in figure 5 reads as follows: if the average sentence length is
greater than 15.75, the message is considered true; otherwise, the message is
deceptive if sentence complexity is greater than 22, and true if sentence
complexity is 22 or less.

     In figure 5, where only the 6 significant cues are used in training, the
decision tree is much simpler, yet no less accurate than that in figure 4.
This supports the expectation that semantic analysis and pre-selection of the
cues in a training set can reduce complexity, since decision trees cannot
automatically filter out noisy data. For instance, the non-significant cue,
ASW, did not increase the accuracy but introduced more complexity. Only
through semantic analysis can a noisy cue be detected and purged.

     These two trees demonstrate the importance of sentence-level complexity
in deception detection. Average sentence length (ASL) is the most important
cue. In figure 5, a truth-teller uses longer sentences than deceivers. This
implies that more information is available in each sentence. On the other
hand, that deceivers use shorter sentences implies that they pause more often
to make up fake stories, possibly under a heavy cognitive load [1]. Sentence
complexity (more compound sentences and longer words) also plays a role in
deception detection. A truth-teller, who feels at ease and undergoes less
cognitive load, uses simpler sentences while recalling a previous experience.
Compared with truth-tellers, deceivers strive to make a credible impression
[9]. As a result, they use more formal and more complex sentences, hoping
that formal writing appears more credible. In short, the structures of the
decision trees show reasonable patterns that coincide with previous research.

[Figure 5 (tree diagram), written out as text; leaf labels follow the diagram
and the matching appendix listing:
  ASL <= 15.75
      Sentence complexity <= 22: True
      Sentence complexity > 22: Deceptive
  ASL > 15.75: True]

          Figure 5. Txt, with 6 significant cues

     The example trees are for Txt-only data in a high-risk context. Further
analysis needs to be conducted for other contexts, since deception detection
is sensitive to context.

6.   Conclusion

     This paper reports on a preliminary study of selecting cues to generate
more reliable training data. We describe a method of purifying experimental
data by eliminating unpromising cues. We demonstrate that the purifying
method enhanced the performance of decision trees, with the best decision
tree resulting from using only Txt data.

     Given such a small data set, the current experiment showed large
variance in tree structure and prediction performance. However, light is shed
on the potential power of selecting the training data semantically and
statistically.


7.    References

[1] Buller, D. B., & Burgoon, J. K. (1996). "Interpersonal Deception Theory."
Communication Theory, 6, 203-242.
[2] Burgoon, J., Blair, J., & Moyer, E. (2003, November). "Effects of
Communication Modality on Arousal, Cognitive Complexity, Behavioral Control
and Deception Detection during Deceptive Episodes." Paper submitted to the
annual meeting of the National Communication Association, Miami.
[3] Burgoon, J. K., Buller, D. B., Ebesu, A., & Rockwell, P. (1994).
"Interpersonal deception: V. Accuracy in deception detection." Communication
Monographs, 61, 303-325.
[4] Burgoon, J. K., Buller, D. B., Guerrero, L. K., Afifi, W. A., & Feldman,
C. M. (1996). "Interpersonal Deception: XII. Information management dimensions
underlying deceptive and truthful messages." Communication Monographs, 63,
52-69.
[5] Burgoon, J., Marett, K., & Blair, J. (in press). "Detecting Deception in
Computer-Mediated Communication." In J. George (Ed.), Social Issues of
Computing.
[6] Levine, T., & McCornack, S. (1992). "Linking love and lies: A formal test
of the McCornack and Parks model of deception detection." Journal of Social
and Personal Relationships, 9, 143-154.
[7] Quinlan, J. R. (1993). C4.5. San Mateo, CA: Morgan Kaufmann Publishers.
[8] Spangler, W., May, J., & Vargas, L. (1999). "Choosing Data-Mining Methods
for Multiple Classification: Representational and Performance Measurement
Implications for Decision Support." Journal of Management Information
Systems, 16, 37-62.
[9] Vrij, A., Edward, et al. (2000). "Detecting deceit via analysis of verbal
and nonverbal behavior." Journal of Nonverbal Behavior, 24, 239-263.
[10] Zhou, L., Twitchell, D., Qin, T., Burgoon, J. K., & Nunamaker, J. F., Jr.
(2003). "An Exploratory Study into Deception Detection in Text-based
Computer-Mediated Communication." Proceedings of the 36th Annual Hawaii
International Conference of System Sciences, Big Island. Los Alamitos, CA:
IEEE.
[11] Zuckerman, M., DePaulo, B., & Rosenthal, R. (1981). "Verbal and nonverbal
communication of deception." In L. Berkowitz (Ed.), Advances in experimental
social psychology (Vol 14, pp. 1-59). NY: Academic Press.

Appendix Decision tree output
Original data of Txt and Audio, 19 cues
Long_sentences <= 1
|    FK_grade <= 5.408709: 2 (9.0)
|    FK_grade > 5.408709
|    |     Affect <= 0
|    |     |      Average_syllables_per_word <= 1.44
|    |     |      |    LexComp <= 3.679012: 2 (6.0/1.0)
|    |     |      |    LexComp > 3.679012
|    |     |      |    |    total_flagged_errors <= 7: 1 (10.0/1.0)
|    |     |      |    |    total_flagged_errors > 7: 2 (2.0)
|    |     |      Average_syllables_per_word > 1.44: 2 (9.0/1.0)
|    |     Affect > 0: 1 (6.0/1.0)
Long_sentences > 1: 1 (6.0)

Original, no duplicated, 16 cues
Long_sentences <= 1
|    FK_grade <= 5.408709: 2 (9.0)
|    FK_grade > 5.408709
|    |     Affect <= 0
|    |     |     Average_syllables_per_word <= 1.44
|    |     |     |     LexComp <= 3.679012: 2 (6.0/1.0)
|    |     |     |     LexComp > 3.679012
|    |     |     |     |     Flagged <= 7: 1 (10.0/1.0)
|    |     |     |     |     Flagged > 7: 2 (2.0)
|    |     |     Average_syllables_per_word > 1.44: 2 (9.0/1.0)
|    |     Affect > 0: 1 (6.0/1.0)
Long_sentences > 1: 1 (6.0)
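
Each leaf in these listings ends in "class (N)" or "class (N/E)". In Weka's
tree output, N is the number of training instances that reach the leaf and E,
when present, is how many of them the leaf misclassifies. A minimal sketch,
assuming that reading of the notation, sums the leaves of the 16-cue listing
above to get the tree's fit on its own training data. The name leaf_summary is
illustrative.

```python
import re

# Matches a leaf such as ": 2 (6.0/1.0)" or ": 1 (6.0)" at the end of a line.
LEAF = re.compile(r":\s*(\S+)\s*\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")


def leaf_summary(listing):
    """Return (instances reaching leaves, instances misclassified at leaves)."""
    total = errors = 0.0
    for line in listing.splitlines():
        match = LEAF.search(line)
        if match:  # internal nodes carry no "(N/E)" suffix and are skipped
            total += float(match.group(2))
            errors += float(match.group(3) or 0.0)
    return total, errors


listing = """\
Long_sentences <= 1
|    FK_grade <= 5.408709: 2 (9.0)
|    FK_grade > 5.408709
|    |     Affect <= 0
|    |     |     Average_syllables_per_word <= 1.44
|    |     |     |     LexComp <= 3.679012: 2 (6.0/1.0)
|    |     |     |     LexComp > 3.679012
|    |     |     |     |     Flagged <= 7: 1 (10.0/1.0)
|    |     |     |     |     Flagged > 7: 2 (2.0)
|    |     |     Average_syllables_per_word > 1.44: 2 (9.0/1.0)
|    |     Affect > 0: 1 (6.0/1.0)
Long_sentences > 1: 1 (6.0)
"""

total, errors = leaf_summary(listing)
training_fit = (total - errors) / total
print(f"{total:.0f} instances, {errors:.0f} misclassified, fit {training_fit:.2%}")
```

For this listing the leaves cover 48 training instances, 4 of them
misclassified. That figure describes the fit on the training data itself, so it
is not comparable to the cross-validated prediction rates in Figure 3.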


Original, significant, 6 statistically significant cues
FK_grade <= 5.408709: 2 (9.0)
FK_grade > 5.408709
|    Long_sentences <= 0
|    |      Sent_comp <= 34
|    |      |     Flagged_error <= 2
|    |      |     |      Average_words_per_sentence <= 13.7: 2 (3.0)
|    |      |     |      Average_words_per_sentence > 13.7: 1 (2.0)
|    |      |     Flagged_error > 2: 1 (6.0)
|    |      Sent_comp > 34: 2 (10.0/2.0)
|    Long_sentences > 0
|    |      Average_words_per_sentence <= 18: 2 (3.0)
|    |      Average_words_per_sentence > 18
|    |      |     Average_words_per_sentence <= 31: 1 (12.0/1.0)
|    |      |     Average_words_per_sentence > 31: 2 (3.0/1.0)

Txt only, with 16 no duplicated cues
Average_words_per_sentence <= 15.75
|    Sens&RM <= 1: 2 (10.0)
|    Sens&RM > 1
|    |      Average_words_per_sentence <= 12.52: 2 (2.0)
|    |      Average_words_per_sentence > 12.52: 1 (2.0)
Average_words_per_sentence > 15.75
|    Average_syllables_per_word <= 1.4: 1 (10.0)
|    Average_syllables_per_word > 1.4
|    |      Average_words_per_sentence <= 18: 1 (2.0)
|    |      Average_words_per_sentence > 18: 2 (2.0)

Txt, with 6 significant cues
Average_words_per_sentence <= 15.75
|     Sent_comp <= 22: 1 (3.0/1.0)
|     Sent_comp > 22: 2 (11.0)
Average_words_per_sentence > 15.75: 1 (14.0/2.0)
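
Read as code, the last listing is a short nested conditional. The sketch below
assumes that class 1 means true and class 2 means deceptive, which is how
Section 5 describes the same tree (Figure 5). The function and parameter names
are illustrative.

```python
def classify_txt_6_cues(average_words_per_sentence, sentence_complexity):
    """Apply the 'Txt, with 6 significant cues' tree to one message.

    Assumes class 1 = "true" and class 2 = "deceptive" in the listing.
    """
    if average_words_per_sentence > 15.75:
        return "true"          # Average_words_per_sentence > 15.75: 1
    if sentence_complexity > 22:
        return "deceptive"     # Sent_comp > 22: 2
    return "true"              # Sent_comp <= 22: 1


# Hypothetical cue values for three messages.
for aws, comp in [(18.0, 40), (12.0, 30), (12.0, 20)]:
    print(aws, comp, classify_txt_6_cues(aws, comp))
```

The thresholds come from a small training set, so the function illustrates how
to read the tree rather than a rule that should be expected to generalize.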
