									                                                      (IJCSIS) International Journal of Computer Science and Information Security,
                                                      Vol. 9, No. 1, January 2011

Email Authorship Identification Using Radial Basis Function

A. Pandian
Asst. Professor (Senior Grade)
Department of MCA
SRM University, Chennai, India

Dr. Md. Abdul Karim Sadiq
Ministry of Higher Education
College of Applied Sciences, Sohar,
Sultanate of Oman

Abstract - Email authorship identification helps in tracking fraudulent emails. This research proposes the extraction of unique words from the emails. These unique words are used as representative features to train a Radial Basis Function (RBF) network. Final weights are obtained and subsequently used for testing. The percentage of identification of email authorship depends upon the number of RBF centers and the type of functional words used for training the RBF. One hundred fifty authors with one hundred files from the sent folder of the Enron database are considered. A total of 300 unique words, with the number of characters in each word ranging from 3 to 7, are considered. Training and testing of the RBF are done by taking words of different lengths. The percentage of authorship identification ranges from 95% to 97%. Simulation shows the effectiveness of the proposed RBF network for email authorship identification.

Keywords: email authorship identification; word frequency; radial basis function

I. INTRODUCTION

The principal objective of author identification is to classify [Moshe 2002] the emails belonging to an author. This approach is used in forensics for author identification in malicious emails. Some commercial software packages such as copycatch gold, jvocalize, signature stylometric system, textaz, Antconc, yoshikoder, lexico3, T-lab, wordsmithtools etc. use statistical methods to identify an author. These packages use parameters such as total number of different words, number of content words used in the list, total number of words in the text / vocabulary items used, vocabulary richness, mean sentence length, mean paragraph length, mean of 2-3 letter words, mean of vowel starting words, cumulative summation method, bigrams and many more. Users who intend to utilize the software for email author identification need to choose the statistical analysis options that best identify the author of an email and obtain the characteristics that remain constant over a large number of emails written by the author. Each author follows a style, characterized by what are called functional words. By using these functional words and their frequencies, identification of the author is easy [David 2005].

Authorship identification is important as the number of documents on the internet is increasing. Researchers have focused on different properties of texts. Two properties of texts are used in classification: the content of the text and the style of the author. Stylometry [Goodman 2007], the statistical analysis of literary style, complements traditional literary scholarship since it offers a means of capturing the often elusive character of an author's style [Zheng 2006] by quantifying some of its features. Most stylometry studies [Pavelec 2007 and Diederich 2008] employ items of language, and most of these are lexically based.

The usefulness of function words in authorship attribution [Diederich 2003] has been examined. Experiments were conducted with support vector machine classifiers on twenty novels, and success rates above 90% were obtained. The use of functional words is a valid and good approach in authorship attribution [Koppel 2006].

Stamatatos 2001 measured success rates of 65% and 72% in a study on authorship recognition, which is an implementation of multiple regression and discriminant analysis. Joachim Diederich 2003 and his collaborators conducted experiments with support vector classifiers and detected the author with 60-80% success rates with different parameters.

The effect of word sequences in authorship attribution [Abbasi 2005] has been studied. The researchers aimed to consider both stylistic and topic features of texts. In this work the documents are identified by the set of word sequences that combine functional and content words. The experiments are done on a dataset consisting of poems using a naïve

                                                                                                 ISSN 1947-5500

Bayes classifier [Peng 2004]; the researchers claim that they achieved good results.

II. MATERIALS AND METHODS

2.1 Materials

Words of working type, action oriented words, different categories of prepositions, pronouns, adjectives, adverbs, conjunctions and interjections are given in Table 1 to Table 3. These words are used for filtering and as templates. When an email is analyzed for uniqueness, the extracted features are based on the lists of words presented in the tables. Hence, unnecessary words are eliminated and the number of unique words that represent an email is kept to a minimum.

TABLE 1 SAMPLE WORDS USED FOR FILTERING

Work (70)     Action (524)    Preposition_1 (94)    Preposition_2 (30)
analyze       Accelerate      Aboard                according to
annotate      Accommodate     About                 ahead of
ascertain     Accomplish      Above                 as of
attend        Accumulate      Absent                as per
audit         Achieve         Across                as regards
build         Acquire         After                 aside from
calculate     Act             Against               because of
consider      Activate        Along                 close to
construct     Adapt           Alongside             due to
control       Add             Amid                  except for

TABLE 2 SAMPLE WORDS USED FOR FILTERING

Preposition_3 (16)     Preposition_4 (9)    Pronoun (77)    Adjectives (395)
as far as              apart from           All             early
as well as             but                  Another         abundant
by means of            except               Any             adorable
in accordance with     plus                 anybody         adventurous
in addition to         save                 Anyone          aggressive
in case of             concerning           anything        agreeable
in front of            considering          Both            alert
in lieu of             regarding            Each            alive
in place of            worth                each other      amused
in point of                                 Either          ancient

TABLE 3 SAMPLE WORDS USED FOR FILTERING

Adverbs (331)      Conjunctions (25)    Interjections (77)
Abnormally         And                  Absolutely
absentmindedly     But                  Achoo
Accidentally       For                  Ack
Acidly             Nor                  Agreed
Actually           Or                   Aha
Adventurously      So                   Ahem
Afterwards         Yet                  Ahh
Almost             after                Ahoy
Always             although             Alack
Angrily            as                   Alas

Work words: To avoid misinterpretation, work words are used to analyze how an author writes an email and what clarity the author has in the mail. The number of work words indicates performance task requirements in a neat, unambiguous manner, since work words translate exactly what an author has in mind.
Action words: These indicate actions expressed in the email.
Prepositions, adjectives, adverbs, conjunctions and interjections have their standard meanings.

The total number of words used as the basic dictionary is 1648 (work + action + prepositions + pronouns + adjectives + adverbs + conjunctions + interjections). The numbers mentioned in parentheses are the totals in each category, whereas only a few words are shown in the tables for understanding.

A schematic diagram for implementation of the proposed work is presented in Figure 1.

Fig. 1 (a) Training the system: Emails -> Extract words -> Filter words using template -> Find the frequency and the words for each category -> Create author matrix -> Train RBF and store final weights

Fig. 1 (b) Testing the system: Emails -> Extract words -> Filter words using template words given -> Find the frequency and the words for each category -> Process with final weights -> Identify the author

Email: The email received in the system.
Extract words: All the words in the email are arranged.
Filter words: The words given in Tables 1-3 are searched for in the extracted words. Subsequently, the word frequencies are found.
Author matrix: A matrix with authors as columns and word frequencies as rows.
Training patterns: The columns of the matrix are used as training patterns, and labeling is introduced.
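The extract, filter and author-matrix steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `FILTER_WORDS`, `extract_words`, `filtered_frequencies` and `author_matrix` are hypothetical, the filter list holds only a handful of the sample words from Tables 1-3 rather than the full 1648-word dictionary, and the two emails are invented toy inputs.

```python
import re
from collections import Counter

# Assumption: a tiny subset of the sample words shown in Tables 1-3.
FILTER_WORDS = {
    "analyze", "audit", "build", "consider",   # work
    "achieve", "act", "add",                   # action
    "about", "after",                          # preposition_1
    "and", "but",                              # conjunctions
    "all", "early", "always",                  # pronoun, adjective, adverb
}

def extract_words(email_text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", email_text.lower())

def filtered_frequencies(email_text):
    """Keep only words found in the filter list and count their occurrences."""
    return Counter(w for w in extract_words(email_text) if w in FILTER_WORDS)

def author_matrix(emails_by_author):
    """Build a matrix with one column per author and one row per filtered word."""
    authors = sorted(emails_by_author)
    per_author = {
        a: sum((filtered_frequencies(t) for t in emails_by_author[a]), Counter())
        for a in authors
    }
    vocabulary = sorted(set().union(*per_author.values()))
    # A Counter returns 0 for words an author never used.
    matrix = [[per_author[a][w] for a in authors] for w in vocabulary]
    return authors, vocabulary, matrix

# Toy emails (invented for illustration only).
emails = {
    "allen-p": ["Please audit the plan and build it after lunch."],
    "bass-e": ["We act and add value, but consider the cost."],
}
authors, vocabulary, matrix = author_matrix(emails)
for word, row in zip(vocabulary, matrix):
    print(word, row)
```

Each column of `matrix` then plays the role of one training pattern, with the author's position in `authors` available as a label.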


2.2 Methods

2.2.1 Radial Basis Function

The concept of distance measure is used to associate the input and output pattern values. RBFs are capable of producing approximations to an unknown function 'f' from a set of input data abscissa. The approximation is produced by passing an input point through a set of basis functions, each of which contains one of the RBF centers.

An exponential function is used as the activation function for the input data. The distances between the input data and a set of centers chosen from the input data are found and passed through the exponential activation function. A bias value of f is used along with the data. These data are further processed to get a set of final weights between the radial basis functions and the target value.

The topology of the RBF network is 12 nodes in the input layer, 10 nodes in the hidden layer and 1 node in the output layer. The difference between the input data and a center is passed through exp(-x) and is called the RBF. A rectangular matrix is then obtained, for which the inverse is found. The resultant value is processed with all the inputs and target values to obtain the final weights.

Details of Figure 2 are given below:

Read input pattern: The columns of the author matrix are used as training patterns. The number of patterns is equal to the number of authors.

Create center: One hundred training patterns are used as centers.

Create RBF: Calculate the distance between the training patterns and the one hundred centers. The resultant values are passed through the activation function, exp(-x), to produce the outputs of the RBF nodes in the hidden layer of the network.

The number of training patterns and the number of centers produce a rectangular matrix. This is converted into a square matrix, whose inverse is found and processed with the labeling to get the final weights.

Fig. 2 Radial basis function flow chart: Read input pattern -> Create centers -> Create RBF, rb -> G = rb^T rb -> compute D (expression not legible) -> Is D == 0? Yes: Find SVD(D). No: B = Inv(G) -> E = B*G' -> Find weights F = E*Target

III. EXPERIMENTAL PROCEDURE

The Enron email dataset has been used for evaluating the efficiency of RBF in email authorship identification. This email dataset was made public by the Federal Energy Regulatory Commission during its investigation. It contains all kinds of emails, personal and official. William Cohen from CMU has put up the dataset on the web for researchers. It contains around 517,431 emails from 151 users. Each mail in the folders contains the sender and receiver email addresses, date and time, subject, body, text and some other email specific technical details. It is available in the form of a MySQL database with a size of 400MB. The Enron database contains four tables. The first table contains information on each of the 151 employees. The second table contains the information of the email message: the sender, subject, text and other information. The third table contains the recipient's information. The fourth table contains information about whether a message is a forward or a reply. Table 4 presents the names of a few folders under each author. Only 146 authors have been considered for study.
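The RBF training procedure of Section 2.2.1 can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code. It assumes that x in exp(-x) is the squared Euclidean distance between a pattern and a center, it omits the bias term, and it computes the weights in the standard least-squares form (rb^T rb)^-1 rb^T target, which is one reading of the flow chart in Figure 2. The function names `rbf_layer`, `solve`, `train` and `predict` are hypothetical, the data are toy values, and the SVD fallback for a singular matrix is not implemented (an error is raised instead).

```python
import math

def rbf_layer(patterns, centers):
    """Hidden-layer outputs exp(-x), with x the squared distance to each center."""
    return [
        [math.exp(-sum((a - b) ** 2 for a, b in zip(p, c))) for c in centers]
        for p in patterns
    ]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            # The flow chart switches to SVD here; this sketch simply stops.
            raise ValueError("G is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def train(patterns, targets, centers):
    """Return final weights from the square matrix G = rb^T rb."""
    rb = rbf_layer(patterns, centers)
    k = len(centers)
    g = [[sum(row[i] * row[j] for row in rb) for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * t for row, t in zip(rb, targets)) for i in range(k)]
    return solve(g, rhs)

def predict(pattern, centers, weights):
    """Network output for one pattern using the final weights."""
    hidden = rbf_layer([pattern], centers)[0]
    return sum(w * h for w, h in zip(weights, hidden))

# Toy data: three patterns with numeric author labels 1, 2, 3 as targets.
patterns = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
targets = [1.0, 2.0, 3.0]
centers = patterns  # every training pattern is used as a center
weights = train(patterns, targets, centers)
for p in patterns:
    print(p, round(predict(p, centers, weights), 4))
```

When every training pattern is used as a center, as here, the network reproduces the training targets; with fewer centers the fit becomes a least-squares approximation.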


TABLE 4 DETAILS OF ENRON FOLDER
(of the folder-name column headings, only "Notes inbox" survives)

allen-p      602    628     2   361    412    66    48    562   345
arnold-j     814   1047     X   723    401   142    84    816   723
arora-h        X     65     X   197     57    79     X      9    68
badeer-r      52    299     2    13    277     3   115      X     7
bailey-s       X     16     X   434      X     4    10      X    14
bass-e      1409   2037     X   415   1386   310   601   1363   258
Baughm-d       X    389     4   431    384   383     X           96
beck-s      1093   3137     7   309   2630   751   190   1099   482
benson-        X     84     X   203     77   274    75      7     9
blair-l       39      2     X   662      X   291     X      X   929

X represents no information

There are 15 unique words that are identified in all the emails under consideration by using the filtering words given in Tables 1-3. The list of unique words is presented in Table 5.

TABLE 5 UNIQUE WORDS

our        when
out        which
plan       with
please     you
that       your
to         yours
we         zip
what

IV. IMPLEMENTATION

Characterization and feature extraction for training the radial basis function (RBF) are based on vowels at the beginning of words and some of the grammatical rules present in the emails of an author. Figure 3 presents authors on the x-axis and the number of words with vowels at the beginning on the y-axis. Each stem is the average number of words with vowels at the beginning, considering all the emails of an author. Figure 4 presents the number of work words. Figure 5 presents the number of action words. Figures 6 to 9 present the number of prepositions.

Fig. 3 Number of words with vowel in the beginning of words (panel title: Vowels; x-axis: Authors; y-axis: Frequency)
Fig. 4 Work words for each author (panel title: Work; x-axis: Authors; y-axis: Frequency)
Fig. 5 Action words for each author (x-axis: Authors; y-axis: Frequency)
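The vowel-based feature plotted in Figure 3 (the average number of words beginning with a vowel over all emails of an author) can be sketched as below. This is only an illustration under stated assumptions: the helper names `vowel_initial_count` and `average_vowel_initial` are hypothetical, words are taken to be runs of ASCII letters, and the two sample emails are invented.

```python
import re

VOWELS = set("aeiou")

def vowel_initial_count(email_text):
    """Count the words in one email that begin with a vowel."""
    words = re.findall(r"[A-Za-z]+", email_text)
    return sum(1 for w in words if w[0].lower() in VOWELS)

def average_vowel_initial(emails):
    """Average the vowel-initial word count over all emails of one author."""
    if not emails:
        return 0.0
    return sum(vowel_initial_count(e) for e in emails) / len(emails)

# Toy emails for a single author (invented for illustration only).
sample = ["An update is attached.", "Please call us on Monday."]
avg = average_vowel_initial(sample)
print(avg)
```

The same averaging could be applied to the counts of each word category (work, action, prepositions and so on) to obtain the remaining per-author features shown in the figures.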


Fig. 6 Preposition 1 for each author (panel title: Preposition-1)
Fig. 7 Preposition 2 for each author (panel title: Preposition-2)
Fig. 8 Preposition 3 for each author
Fig. 9 Preposition 4 for each author (panel title: Preposition-4)
Fig. 10 Pronoun for each author
Fig. 11 Adjectives for each author (panel title: Adjectives)
Each plot shows Frequency against Authors (0 to 150).


             10                                                             their frequencies of all emails of all authors by using
                                                                            the filtering words available in Table 1 -3.

                                                                                     Create a matrix with rows equivalent to total
                                                                            number of unique words considering all emails of all

                                                                            authors. The number of columns is equivalent to
                                                                            number authors.

                                                                                     Based on the dictionary of words obtained
                                                                            from all the emails, fill up a column of the zero
             1                                                              matrix based on the availability of the words in a
              0            50                       100        150
                                                                            document with their frequencies. Each column will
                                    Authors                                 be treated as a pattern for training. A labeling is done
                                                                            for each pattern.
                    Fig.12 Adverbs for each author

                                  Conjunctions                                       Train the RBF network with patterns
                                                                            considered for training. A final weight matrix is
                                                                            obtained which is further used to test the incoming
                                                                            mails that belongs to existing authors else, the mail
              7                                                             can belong to some other person other than these
              6                                                             existing authors considered in this experiment.



              3                                                                                                            V.    RESULTS AND DISCUSSIONS

               0           50                       100        150
                                                                            RBF output and author identification

                                    Authors                                                                        100

                   Fig.13 Conjunctions for each author
                                                                                                                                                                     Center= 2
Fig.14 Interjections for each author (x-axis: Authors; y-axis: Interjections)

We use the following algorithm for email authorship identification by
neural network training and testing:
Find the number of words and their number of occurrences (frequencies) in
an email and in all the emails of an author. Similarly, find the number of
words and

Fig.15 Performance of RBF center selection (x-axis: email; legend: Center = 25, 50, 75, 146)

Figure 15 presents the performance of the RBF in training the patterns.
When the number of centers used is less than 50% of the total number of
input patterns, the performance of author identification is at its lowest.
As the number of centers increases, author identification improves. The
legend shows the number of centers. Figure 16 presents the performance of
the RBF. In this plot, the output obtained from the RBF overlaps the target
outputs. The plot shows emails versus author identification. With 146
centers, the RBF identifies the maximum number of authors.
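The effect of the number of centers on training performance can be illustrated with a minimal sketch. It is not the implementation used in this work: the Gaussian basis, the choice of the first k training patterns as centers, the one-hot targets, the least-squares solution for the final weights, the fixed width, and the synthetic frequency patterns are all assumptions, and the names rbf_design, train_rbf and identify are hypothetical.

```python
import numpy as np

def rbf_design(X, centers, width):
    """Gaussian RBF activations of every pattern in X for every center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def train_rbf(X, T, n_centers, width):
    """Take the first n_centers patterns as centers (an assumption) and
    solve for the output weights by linear least squares."""
    centers = X[:n_centers]
    Phi = rbf_design(X, centers, width)
    W, *_ = np.linalg.lstsq(Phi, T, rcond=None)
    return centers, W

def identify(X, centers, W, width):
    """Return the index of the author with the largest RBF output."""
    return np.argmax(rbf_design(X, centers, width) @ W, axis=1)

# Synthetic stand-in for averaged word-frequency patterns, one per author.
rng = np.random.default_rng(0)
n_authors, n_words = 40, 30
X = rng.random((n_authors, n_words))
T = np.eye(n_authors)          # one-hot target per author (an assumption)
width = 1.0                    # basis width, chosen arbitrarily

accuracy = {}
for k in (10, 20, 30, 40):
    centers, W = train_rbf(X, T, k, width)
    predicted = identify(X, centers, W, width)
    accuracy[k] = float(np.mean(predicted == np.arange(n_authors)))
    print(f"centers={k:3d}  training identification={accuracy[k]:.2f}")
```

When the number of centers equals the number of training patterns, the Gaussian design matrix is square and invertible for distinct patterns, so the network reproduces the targets exactly. With fewer centers the fit is only approximate, which is consistent with the trend described for Figure 15.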


Fig.16 Performance of RBF (y-axis: RBF output and author identification; legend: RBF output)

This work has presented a novel method of identifying email authorship
using RBF. Patterns of training data have been collected by averaging the
frequencies of words of each person and fixing a target value for the
person. A testing pattern has been created by modifying the existing
contents of an email. A new word has been considered during testing. If the
new word does not fit into the patterns used for training, then the word is
excluded during testing. As we do not know to which author the email
belongs, all the training patterns are treated as test patterns after
adding the frequencies of the new mail. As only 146 authors are considered,
146 outputs are obtained after testing.

The receiver operating characteristics (ROC) of the authorship
identification answer the following questions:
     Is the author of a different document that belongs to this author
correctly identified (true positive)?
     Is the author wrongly identified, that is, does the document not
belong to him but to some other person, or to none of the ten authors under
experiment (false positive)?
     Is a document that belongs to some other author, not considered in
this experiment, treated as a document of one of the ten authors (false
negative)?
     Does a document taken from outside the training group belong to that
outside group and not to the authors considered in this experiment (true
negative)?

Table 6 presents the confusion matrix values and the ROC values. Author
emails that belong to the training group and emails that do not belong to
the training group have been considered. All the emails in the (sent /
sent_mail) folders are used for training. The emails in the remaining
folders of all authors have been considered for testing. The performance of
the RBF has been calculated using a confusion matrix. The plot (Figure 17)
indicates that the proposed RBF system suits author identification from
given emails. This is inferred from the points obtained above the diagonal
of the ROC curve.

TABLE 6 CONFUSION MATRIX FOR RECEIVER OPERATING CHARACTERISTICS

No.   True       False      Sensitivity   False      True       Specificity   1 - Specificity
      Positive   Negative                 Positive   Negative
1     80         20         0.80          10         40         0.80          0.20
2     82         18         0.82          8          42         0.84          0.16
3     90         10         0.90          5          45         0.90          0.10
4     85         15         0.85          7          43         0.86          0.14
5     92         8          0.92          8          42         0.84          0.16

Sensitivity = True Positive Rate = True Positive / (True Positive + False Negative)
False Positive Rate = 1 - Specificity

Fig.17 Receiver Operating Characteristics

VI. CONCLUSION

The proposed RBF has been used for author identification of emails.
Different numbers of RBF centers and their effectiveness in author
identification are presented. The receiver operating characteristics curve
has been presented, and it shows that the proposed RBF network performance
is acceptable. As further work, the large number of words can be
meaningfully filtered down to those that are more specific to an author,
which can then be used for author identification.
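The rates listed in Table 6 follow directly from the four counts in each row. The sketch below recomputes them and checks that every ROC point lies above the diagonal. The counts are taken from Table 6, while the names Counts, rates and TABLE_6 are hypothetical and only serve this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Counts:
    tp: int  # true positives
    fn: int  # false negatives
    fp: int  # false positives
    tn: int  # true negatives

def rates(c):
    """Return (sensitivity, specificity, false positive rate) for one row."""
    sensitivity = c.tp / (c.tp + c.fn)   # true positive rate
    specificity = c.tn / (c.tn + c.fp)   # true negative rate
    return sensitivity, specificity, 1.0 - specificity

# Counts copied from the five rows of Table 6.
TABLE_6 = [
    Counts(tp=80, fn=20, fp=10, tn=40),
    Counts(tp=82, fn=18, fp=8, tn=42),
    Counts(tp=90, fn=10, fp=5, tn=45),
    Counts(tp=85, fn=15, fp=7, tn=43),
    Counts(tp=92, fn=8, fp=8, tn=42),
]

# Each ROC point is (false positive rate, true positive rate).
roc_points = []
for row, c in enumerate(TABLE_6, start=1):
    tpr, spec, fpr = rates(c)
    roc_points.append((fpr, tpr))
    print(f"{row}: sensitivity={tpr:.2f} specificity={spec:.2f} 1-specificity={fpr:.2f}")

# A point above the diagonal has a true positive rate above its false positive rate.
above_diagonal = all(tpr > fpr for fpr, tpr in roc_points)
print("all points above the ROC diagonal:", above_diagonal)
```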


REFERENCES

1.  Abbasi A. and Chen H., “Applying Authorship Analysis to Extremist-Group
    Web Forum Messages”, IEEE Intelligent Systems, pp. 67–75, 2005.
2.  David Madigan, Alexander Genkin, David Lewis, Shlomo Argamon, Dmitriy
    Fradkin, and Li Ye, “Author Identification on the Large Scale”, Proc. of
    The Meeting of the Classification Society of North
3.  Diederich, J., and Chen, H., “Writeprints: A stylometric approach to
    identity-level identification and similarity detection”, ACM
    Transactions on Information Systems (26:2), pp. 7, 2008.
4.  Diederich, J., Kindermann, J., Leopold, E. and Paass, G., “Authorship
    Attribution with Support Vector Machines”, Applied Intelligence 19(1),
    pp. 109-123, 2003.
5.  Goodman R., Hahn M., Marella M., Ojar C., and Westcott S., “The Use of
    Stylometry for Email Author Identification: A Feasibility Study”, Proc.
    Student/Faculty Research Day, CSIS, Pace University, White Plains, NY,
    pp. 1-7, May 2007.
6.  Koppel, M., Schler, J., Argamon, S. and Messeri, E., “Authorship
    Attribution with Thousands of Candidate Authors”, in Proc. 29th ACM
    SIGIR Conference on Research & Development on Information Retrieval,
    2006.
7.  Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni, “Automatically
    Categorizing Written Texts by Author Gender”, Literary and Linguistic
    Computation, 17(4), pp. 401-412, 2002.
8.  Pavelec, D., Justino, E., and Oliveira, L. S., “Author Identification
    Using Stylometric Features”, Inteligencia Artificial (11:36),
    pp. 59-65, 2007.
9.  Peng, F., Schuurmans, D., Wang, S., “Augmenting Naive Bayes Text
    Classifier with Statistical Language Models”, Information Retrieval,
    7(3-4), pp. 317-345,
10. Zheng R., Li J., Chen H., Huang Z., “A Framework for Authorship
    Identification of Online Messages: Writing-Style Features and
    Classification Techniques”, Journal of the American Society for
    Information Science and Technology, 57(3), pp. 378–93, 2006.

AUTHORS PROFILE

A. PANDIAN received his B.Sc. and MCA degrees from Bharathidasan
University, Tiruchi. He received his M.Tech degree from Punjabi University,
Patiala, Punjab and M.Phil. degree from Periyar University, Salem. He is
doing Ph.D. (Computer Science & Engineering) in SRM University, Chennai. He
has over fourteen years of experience in teaching. He is working as
Assistant Professor (Sr.G) in the Department of MCA, SRM University,
Chennai. His areas of interest are text processing, information retrieval
and machine learning. He is a member of ISTE and ISC.

Dr. M. Abdul Karim Sadiq holds Ph.D. in Computer Science & Engineering from
Indian Institute of Technology, Madras. He has over fourteen years of
experience in software development, research, management and teaching. His
areas of interest are text processing, information retrieval and machine
learning. Having published papers in many international conferences and
refereed journals of repute, he filed a patent in the United States Patent
and Trademark Office. He is an associate editor of Soft Computing
Applications in Business, Springer. Moreover, he organized certain
international conferences and is on the program committees. He has been
awarded a star performer in the software

