Email Authorship Identification Using Radial Basis Function
The International Journal of Computer Science and Information Security (IJCSIS) is a reputable venue for publishing novel ideas, state-of-the-art research results and fundamental advances in all aspects of computer science and information & communication security. IJCSIS is a peer reviewed international journal with a key objective to provide the academic and industrial community a medium for presenting original research and applications related to Computer Science and Information Security. The core vision of IJCSIS is to disseminate new knowledge and technology for the benefit of everyone ranging from the academic and professional research communities to industry practitioners in a range of topics in computer science & engineering in general and information & communication security, mobile & wireless networking, and wireless communication systems. It also provides a venue for high-calibre researchers, PhD students and professionals to submit on-going research and developments in these areas. IJCSIS invites authors to submit their original and unpublished work that communicates current research on information assurance and security regarding both the theoretical and methodological aspects, as well as various applications in solving real world information security problems. Frequency of Publication: MONTHLY. ISSN: 1947-5500 [Copyright © 2011, IJCSIS, USA]

(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 1, January 2011
Email Authorship Identification Using Radial Basis Function

A. Pandian
Asst. Professor (Senior Grade), Department of MCA, SRM University, Chennai, India
pandiana@ktr.srmuniv.ac.in

Dr. Md. Abdul Karim Sadiq
Ministry of Higher Education, College of Applied Sciences, Sohar, Sultanate of Oman
abdulkarim.soh@cas.edu.om

Abstract - Email authorship identification helps in tracking fraudulent emails. This research proposes extracting unique words from emails. These unique words are used as representative features to train a radial basis function (RBF) network. Final weights are obtained and subsequently used for testing. The percentage of correct email authorship identification depends upon the number of RBF centers and the type of functional words used for training the RBF. One hundred fifty authors with one hundred files from the sent folder of the Enron database are considered. A total of 300 unique words, with the number of characters in each word ranging from 3 to 7, are considered. Training and testing of the RBF are done with words of different lengths. The percentage of authorship identification ranges from 95% to 97%. Simulation shows the effectiveness of the proposed RBF network for email authorship identification.

Keywords: email authorship identification; word frequency; radial basis function

I. INTRODUCTION

The principal objective of author identification is to classify [Moshe 2002] the emails belonging to an author. This approach is used in forensics to identify the authors of malicious emails. Commercial software packages such as copycatch gold, jvocalize, signature stylometric system, textaz, Antconc, yoshikoder, lexico3, T-lab and wordsmithtools use statistical methods to identify an author. These packages use parameters such as the total number of different words, number of content words used in the list, total number of words in the text / vocabulary items used, vocabulary richness, mean sentence length, mean paragraph length, mean of 2-3 letter words, mean of vowel starting words, cumulative summation method, bigrams and many more. Users who intend to use such software for email author identification need to choose the statistical analysis options that best identify the author of an email and obtain the characteristics that remain constant over a large number of emails written by the author. Each author follows a style, which is captured by so-called functional words. Using these functional words and their frequencies, identification of the author is easy [David 2005].

Authorship identification is important as the number of documents on the internet is increasing. Researchers have focused on different properties of texts. Two properties of texts are used in classification: the content of the text and the style of the author. Stylometry [Goodman 2007], the statistical analysis of literary style, complements traditional literary scholarship since it offers a means of capturing the often elusive character of an author's style [Zheng 2006] by quantifying some of its features. Most stylometry studies [Pavelec 2007 and Diederich 2008] employ items of language, and most of these are lexically based.

The usefulness of function words in authorship attribution [Diederich 2003] has been examined. Experiments were conducted with support vector machine classifiers on twenty novels, and success rates above 90% were obtained. The use of functional words is a valid and good approach in authorship attribution [Koppel 2006].

Stamatatos 2001 measured success rates of 65% and 72% in a study of authorship recognition that implements multiple regression and discriminant analysis. Joachim Diederich 2003 and his collaborators conducted experiments with support vector classifiers and detected the author with 60-80% success rates for different parameters.

The effect of word sequences in authorship attribution [Abbasi 2005] has been studied. The researchers aimed to consider both stylistic and topic features of texts. In that work, documents are identified by the set of word sequences that combine functional and content words. The experiments are done on a dataset consisting of poems using a naïve Bayes classifier [Peng 2004]; the researchers claim that they achieved good results.

II. MATERIALS AND METHODS

2.1 Materials

Words of working type, action-oriented words, different categories of prepositions, pronouns, adjectives, adverbs, conjunctions and interjections are given in Table 1 to Table 3. These words are used for filtering and as templates. When an email is analyzed for uniqueness, the extracted features are based on the lists of words presented in the tables. Hence, unnecessary words are eliminated and the number of unique words that represent an email is kept to a minimum.

TABLE 1 SAMPLE WORDS USED FOR FILTERING

Work (70)    Action (524)    Preposition_1 (94)    Preposition_2 (30)
analyze      Accelerate      Aboard                according to
annotate     Accommodate     About                 ahead of
ascertain    Accomplish      Above                 as of
attend       Accumulate      Absent                as per
audit        Achieve         Across                as regards
build        Acquire         After                 aside from
calculate    Act             Against               because of
consider     Activate        Along                 close to
construct    Adapt           Alongside             due to
control      Add             Amid                  except for

TABLE 2 SAMPLE WORDS USED FOR FILTERING

Preposition_3 (16)    Preposition_4 (9)    Pronoun (77)    Adjectives (395)
as far as             apart from           All             early
as well as            but                  Another         abundant
by means of           except               Any             adorable
in accordance with    plus                 anybody         adventurous
in addition to        save                 Anyone          aggressive
in case of            concerning           anything        agreeable
in front of           considering          Both            alert
in lieu of            regarding            Each            alive
in place of           worth                each other      amused
in point of                                Either          ancient

TABLE 3 SAMPLE WORDS USED FOR FILTERING

Adverbs (331)     Conjunctions (25)    Interjections (77)
Abnormally        And                  Absolutely
absentmindedly    But                  Achoo
Accidentally      For                  Ack
Acidly            Nor                  Agreed
Actually          Or                   Aha
Adventurously     So                   Ahem
Afterwards        Yet                  Ahh
Almost            after                Ahoy
Always            although             Alack
Angrily           as                   Alas

Work words: To avoid misinterpretation, work words are used to analyze how an author writes his email and what clarity he has in the mail. The number of work words indicates performance task requirements in a neat, unambiguous manner, since work words translate exactly what an author has in his mind. Action words: These indicate actions expressed in the email. Prepositions, adjectives, adverbs, conjunctions and interjections have their standard meanings.

The total number of words used as the basic dictionary is 1648 (work + action + prepositions + pronouns + adjectives + adverbs + conjunctions + interjections). The numbers in parentheses are the totals in each category, whereas only a few words are shown in the tables for illustration.

A schematic diagram of the proposed work is presented in Figure 1.

[Figure 1(a): Emails -> Extract words -> Filter words using template words -> Find the frequency and the words for each category -> Create author matrix -> Train RBF and store final weights]
Fig.1 (a) Training the system

[Figure 1(b): Emails -> Extract words -> Filter words using template words given -> Find the frequency and the words for each category -> Process with final weights -> Identify the author]
Fig.1 (b) Testing the system

Email: The email received in the system.
Extract words: All the words in the email are arranged.
Filter words: The words given in Tables 1-3 are searched for in the extracted words. Subsequently, the word frequencies are found.
Author matrix: A matrix with authors as columns and word frequencies as rows.
Training patterns: The columns of the matrix are used as training patterns, and labels are assigned.
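
The extraction, filtering and author-matrix steps of Section 2.1 can be illustrated with a minimal sketch. It is not the authors' implementation: the names `TEMPLATE_WORDS`, `extract_words`, `filter_frequencies` and `author_matrix` are hypothetical, the template set is a tiny stand-in for the 1648-word dictionary, and only single-word templates are matched (multi-word prepositions such as "according to" would need phrase matching).

```python
import re
from collections import Counter

# Hypothetical, shortened template list; the paper's dictionary has 1648 words.
TEMPLATE_WORDS = {"analyze", "build", "about", "after", "and", "but", "always"}

def extract_words(email_text):
    """Lower-case the email and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", email_text.lower())

def filter_frequencies(email_text, template=TEMPLATE_WORDS):
    """Count only the extracted words that appear in the template list."""
    return Counter(w for w in extract_words(email_text) if w in template)

def author_matrix(emails_by_author, template=TEMPLATE_WORDS):
    """Return (vocabulary, authors, matrix); rows are words, columns are authors."""
    authors = sorted(emails_by_author)
    totals = {a: Counter() for a in authors}
    for a in authors:
        for text in emails_by_author[a]:
            totals[a].update(filter_frequencies(text, template))
    # Unique words are those template words that actually occur in some email.
    vocabulary = sorted(set().union(*(totals[a].keys() for a in authors)))
    matrix = [[totals[a][w] for a in authors] for w in vocabulary]
    return vocabulary, authors, matrix

emails = {
    "allen-p": ["Please analyze the plan and build it.", "About time, but always late."],
    "bass-e": ["And after that, build again and again."],
}
vocabulary, authors, matrix = author_matrix(emails)
for word, row in zip(vocabulary, matrix):
    print(word, row)
```

Each column of `matrix` corresponds to one author and would serve as one training pattern in the sense described above.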
69 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
2.2 Methods

2.2.1 Radial Basis Function

The concept of distance measure is used to associate the input and output pattern values. RBFs are capable of producing approximations to an unknown function 'f' from a set of input data abscissae. The approximation is produced by passing an input point through a set of basis functions, each of which contains one of the RBF centers.

An exponential function is used as the activation function for the input data. The distances between the input data and a set of centers chosen from the input data are found and passed through the exponential activation function. A bias value of f is used along with the data. These data are further processed to get a set of final weights between the radial basis functions and the target value.

The topology of the RBF network is 12 nodes in the input layer, 10 nodes in the hidden layer and 1 node in the output layer. The difference between the input data and a center is passed through exp(-x), and the result is called the RBF. A rectangular matrix is then obtained, for which the inverse is found. The resultant value is processed with the entire inputs and target values to obtain the final weights.

[Figure 2: flow chart. Read input pattern -> Create centers -> Create RBF, rb -> G = rb^T rb -> D = det(G) -> Is D == 0? If yes: find SVD(D), G = U*W*V^T. If no: B = Inv(G) -> E = B*G' -> Find weights, F = E*Target]
Fig 2 Radial basis function flow chart

Details of Figure 2 are given below:

Read input pattern: The columns of the author matrix are used as training patterns. The number of patterns is equal to the number of authors.

Create center: One hundred training patterns are used as centers.

Create RBF: Calculate the distances between the training patterns and the one hundred centers. The resultant values are passed through the activation function, exp(-x), to produce the outputs of the RBF nodes in the hidden layer of the network.

The number of training patterns and the number of centers produce a rectangular matrix. This is converted into a square matrix, whose inverse is found and processed with the labels to get the final weights.

III. EXPERIMENTAL PROCEDURE

The Enron email dataset has been used for evaluating the efficiency of RBF in email authorship identification. This email dataset was made public by the Federal Energy Regulatory Commission during its investigation. It contains all kinds of emails, personal and official. William Cohen from CMU has put the dataset on the web for researchers. It contains around 517,431 emails from 151 users. Each mail in the folders contains the sender and receiver email addresses, date and time, subject, body, text and some other email-specific technical details. It is available in the form of a MySQL database with a size of 400MB. The Enron database contains four tables. The first table contains information on each of the 151 employees. The second table contains the information on the email message: the sender, subject, text and other information. The third table contains the recipient's information. The fourth table contains information on whether a message is a forward or a reply. Table 4 presents the names of a few folders under each author. Only 146 authors have been considered for study.
TABLE 4 DETAILS OF ENRON FOLDER

Person        sent_mail    All documents    contacts    Deleted_items    Discussion_threads    inbox    Notes inbox    Sent    sent_items
allen-p       602          628              2           361              412                   66       48             562     345
arnold-j      814          1047             X           723              401                   142      84             816     723
arora-h       X            65               X           197              57                    79       X              9       68
badeer-r      52           299              2           13               277                   3        115            X       7
bailey-s      X            16               X           434              X                     4        10             X       14
bass-e        1409         2037             X           415              1386                  310      601            1363    258
baughman-d    X            389              4           431              384                   383      X              96
beck-s        1093         3137             7           309              2630                  751      190            1099    482
benson-r      X            84               X           203              77                    274      75             7       9
blair-l       39           2                X           662              X                     291      X              X       929

X represents no information

There are 15 unique words that are identified in all the emails under consideration by using the filtering words given in Tables 1-3. The list of unique words is presented in Table 5.

TABLE 5 UNIQUE WORDS

our       when
out       which
plan      with
please    you
that      your
to        yours
we        zip
what

IV. IMPLEMENTATION

Characterization and feature extraction for training the radial basis function (RBF) are based on vowels at the beginning of words and some of the grammatical rules present in the emails of an author. Figure 3 presents authors on the x-axis and the number of words with vowels at the beginning on the y-axis. Each stem is the average number of words with vowels at the beginning, considering all the emails of an author. Figure 4 presents the number of work words. Figure 5 presents the number of action words. Figures 6 to 9 present the numbers of prepositions.

Fig.3 Number of words with vowel in the beginning of words (x-axis: Authors, y-axis: Frequency)
Fig.4 Work words for each author (x-axis: Authors, y-axis: Frequency)
Fig.5 Action words for each author (x-axis: Authors, y-axis: Frequency)
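
The per-author averages plotted in the feature figures can be sketched as below. This is an illustrative assumption about how such features might be computed, not the authors' code: `CATEGORIES` holds only three hypothetical, heavily shortened word lists (the paper uses eleven categories from Tables 1-3, which together with the vowel count would give twelve features), and `email_features` and `author_profile` are hypothetical helper names. The sketch assumes each author has at least one email.

```python
import re

VOWELS = "aeiou"

# Hypothetical, heavily shortened category lists for illustration only.
CATEGORIES = {
    "work": {"analyze", "audit", "build"},
    "action": {"achieve", "acquire", "add"},
    "conjunction": {"and", "but", "or"},
}

def email_features(text, categories=CATEGORIES):
    """Count vowel-initial words and words of each category in one email."""
    words = re.findall(r"[a-z]+", text.lower())
    feats = {"vowel_initial": sum(1 for w in words if w[0] in VOWELS)}
    for name, vocab in categories.items():
        feats[name] = sum(1 for w in words if w in vocab)
    return feats

def author_profile(emails, categories=CATEGORIES):
    """Average each feature over all emails of one author (emails must be non-empty)."""
    rows = [email_features(t, categories) for t in emails]
    return {k: sum(r[k] for r in rows) / len(rows) for k in rows[0]}

profile = author_profile(["Add an audit and build it", "We achieve or acquire"])
print(profile)
```

One such profile per author corresponds to one stem in each of the feature plots.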
Fig.6 Preposition 1 for each author (x-axis: Authors, y-axis: Frequency)
Fig.7 Preposition 2 for each author (x-axis: Authors, y-axis: Frequency)
Fig.8 Preposition 3 for each author (x-axis: Authors, y-axis: Frequency)
Fig.9 Preposition 4 for each author (x-axis: Authors, y-axis: Frequency)
Fig.10 Pronoun for each author (x-axis: Authors, y-axis: Frequency)
Fig.11 Adjectives for each author (x-axis: Authors, y-axis: Frequency)
Fig.12 Adverbs for each author (x-axis: Authors, y-axis: Frequency)
Fig.13 Conjunctions for each author (x-axis: Authors, y-axis: Frequency)
Fig.14 Interjections for each author (x-axis: Authors, y-axis: Frequency)

We use the following algorithm for email identification by neural network training and testing:

Find the words and their numbers of occurrences (frequencies) in an email and in all the emails of an author. Similarly, find the words and their frequencies in all emails of all authors by using the filtering words available in Tables 1-3.

Create a zero matrix with the number of rows equal to the total number of unique words considering all emails of all authors. The number of columns is equal to the number of authors.

Based on the dictionary of words obtained from all the emails, fill up a column of the zero matrix with the frequencies of the words present in a document. Each column is treated as a pattern for training. A label is assigned to each pattern.

Train the RBF network with the patterns considered for training. A final weight matrix is obtained, which is then used to test whether incoming mails belong to the existing authors; otherwise, the mail may belong to some person other than the existing authors considered in this experiment.

V. RESULTS AND DISCUSSIONS

Fig.15 Performance of RBF center selection (x-axis: email, y-axis: RBF output and author identification; legend: Center = 2, 25, 50, 75, 100, 146)

Figure 15 presents the performance of the RBF in training the patterns. When the number of centers used is less than 50% of the total number of input patterns, the performance of author identification is at its lowest. As the number of centers increases, author identification improves. The legend shows the number of centers. Figure 16 presents the performance of the RBF. In this plot, the output obtained from the RBF overlaps the target outputs. The plot shows emails versus author identification. With 146 centers, the RBF identifies the maximum number of authors.
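
The effect of the number of centers can be illustrated with a minimal NumPy sketch of the training procedure of Section 2.2.1 (Figure 2). Several details are assumptions rather than the authors' method: the distance is taken as squared Euclidean, the first `n_centers` patterns serve as centers, a bias column of ones is appended, patterns are stored as rows, and targets are integer author labels. `np.linalg.pinv`, which is SVD-based, stands in for the det/SVD/inverse branching of the flow chart. The function names are hypothetical.

```python
import numpy as np

def rbf_layer(patterns, centers):
    """Hidden-layer outputs exp(-d), with d the squared distance to each center."""
    diff = patterns[:, None, :] - centers[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    hidden = np.exp(-dist)
    # Append a constant bias column (assumed to be ones).
    return np.hstack([hidden, np.ones((patterns.shape[0], 1))])

def train_rbf(patterns, targets, n_centers):
    """Return (centers, weights) fitted by least squares on the RBF outputs."""
    centers = patterns[:n_centers]
    rb = rbf_layer(patterns, centers)
    G = rb.T @ rb
    # The pseudo-inverse handles both singular and non-singular G;
    # the second argument is the relative cutoff for small singular values.
    weights = np.linalg.pinv(G, 1e-10) @ rb.T @ targets
    return centers, weights

def predict(patterns, centers, weights):
    """Network output for each pattern."""
    return rbf_layer(patterns, centers) @ weights

# Toy data: four well-separated patterns labelled 1..4.
patterns = 2.0 * np.eye(4)
targets = np.array([1.0, 2.0, 3.0, 4.0])

for n in (1, 4):
    centers, weights = train_rbf(patterns, targets, n)
    error = np.sum((predict(patterns, centers, weights) - targets) ** 2)
    print(n, "centers, squared training error:", round(float(error), 6))
```

On this toy data the training error falls as centers are added, which is the qualitative trend reported for Figure 15.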
Fig.16 Performance of RBF (x-axis: Email, y-axis: RBF output and author identification; curves: Target, RBF output)

This work has presented a novel method of identifying email authorship using RBF. Patterns of training data have been collected by averaging the frequencies of words of each person and fixing a target value for the person. A testing pattern has been created by modifying the existing contents of an email. A new word has been considered during testing. If the new word does not fit into the patterns used for training, the word is excluded during testing. As we are unaware of which author the email belongs to, all the training patterns are treated as test patterns after adding the frequencies of the new mail. As only 146 authors are considered, 146 outputs are obtained after testing.

Receiver operating characteristics (ROC) analysis of the authorship identification addresses the following questions.

Is the author correctly identified for a different document that belongs to this author (True positive)?

Is the author wrongly classified, in that the document does not belong to him and belongs to some other person, or the document does not belong to any one of the ten authors under experiment (False positive)?

Is a document that belongs to some other author not considered in this experiment treated as a document of one of the ten authors (False negative)?

Is a document from outside the training group found to belong to that outside group and not to the authors considered in this experiment (True negative)?

Table 6 presents the confusion matrix values and the ROC values. The author emails considered include those that belong to the training group and those that do not belong to the training group. All the emails that belong to the (sent / sent_mail) folders are used for training. The emails of the remaining folders of all authors have been considered for testing. The performance of the RBF has been calculated using the confusion matrix. The plot (Figure 17) indicates that the proposed RBF system suits author identification from given emails. This is inferred from the points obtained above the diagonal of the ROC curve.

TABLE 6 CONFUSION MATRIX FOR RECEIVER OPERATING CHARACTERISTICS

Instances    True Positive    False Negative    Sensitivity    False Positive    True Negative    Specificity    False Positive Rate
1            80               20                0.80           10                40               0.80           0.20
2            82               18                0.82           8                 42               0.84           0.16
3            90               10                0.90           5                 45               0.90           0.10
4            85               15                0.85           7                 43               0.86           0.14
5            92               8                 0.92           8                 42               0.84           0.16

Sensitivity = True Positive Rate = True Positive / (True Positive + False Negative)
False Positive Rate = 1 - Specificity

Fig.17 Receiver Operating Characteristics

VI. CONCLUSION

The proposed RBF has been used for author identification of emails. Different numbers of RBF centers and their effectiveness in author identification are presented. The receiver operating characteristics curve has been presented, and it shows that the performance of the proposed RBF network is acceptable. As further work, the large number of words can be meaningfully filtered down to those that are more specific to an author, which can then be used for author identification.
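
As a cross-check on the rates reported in Table 6, the following short sketch recomputes sensitivity, specificity and false positive rate from the four confusion-matrix counts of each instance. It assumes the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); the function name `rates` and the list `TABLE_6` are hypothetical.

```python
def rates(tp, fn, fp, tn):
    """Return (sensitivity, specificity, false positive rate), rounded to 2 places."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return round(sensitivity, 2), round(specificity, 2), round(1 - specificity, 2)

# (True Positive, False Negative, False Positive, True Negative) per instance of Table 6
TABLE_6 = [
    (80, 20, 10, 40),
    (82, 18, 8, 42),
    (90, 10, 5, 45),
    (85, 15, 7, 43),
    (92, 8, 8, 42),
]

for instance, counts in enumerate(TABLE_6, start=1):
    print(instance, rates(*counts))
```

Each (false positive rate, sensitivity) pair gives one point of a ROC plot; points with sensitivity greater than the false positive rate lie above the diagonal.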
REFERENCES

1. Abbasi A. and Chen H., "Applying Authorship Analysis to Extremist-Group Web Forum Messages", IEEE Intelligent Systems, pp. 67-75, 2005.
2. David Madigan, Alexander Genkin, David Lewis, Shlomo Argamon, Dmitriy Fradkin, and Li Ye, "Author Identification on the Large Scale", Proc. of the Meeting of the Classification Society of North America, 2005.
3. Diederich, J., and Chen, H., "Writeprints: A stylometric approach to identity-level identification and similarity detection", ACM Transactions on Information Systems (26:2), pp. 7, 2008.
4. Diederich, J., Kindermann, J., Leopold, E. and Paass, G., "Authorship Attribution with Support Vector Machines", Applied Intelligence 19(1), pp. 109-123, 2003.
5. Goodman R., Hahn M., Marella M., Ojar C., and Westcott S., "The Use of Stylometry for Email Author Identification: A Feasibility Study", Proc. Student/Faculty Research Day, CSIS, Pace University, White Plains, NY, pp. 1-7, May 2007.
6. Koppel, M., Schler, J., Argamon, S. and Messeri, E., "Authorship Attribution with Thousands of Candidate Authors", in Proc. 29th ACM SIGIR Conference on Research & Development on Information Retrieval, 2006.
7. Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni, "Automatically Categorizing Written Texts by Author Gender", Literary and Linguistic Computing, 17(4), pp. 401-412, 2002.
8. Pavelec, D., Justino, E., and Oliveira, L. S., "Author Identification Using Stylometric Features", Inteligencia Artificial (11:36), pp. 59-65, 2007.
9. Peng, F., Schuurmans, D., Wang, S., "Augmenting Naive Bayes Text Classifier with Statistical Language Models", Information Retrieval, 7(3-4), pp. 317-345, 2004.
10. Zheng R., Li J., Chen H., Huang Z., "A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques", Journal of the American Society for Information Science and Technology 57(3):378-93, 2006.

AUTHORS PROFILE

A. PANDIAN received his B.Sc. and MCA degrees from Bharathidasan University, Tiruchi. He received his M.Tech degree from Punjabi University, Patiala, Punjab and M.Phil. degree from Periyar University, Salem. He is doing a Ph.D. (Computer Science & Engineering) at SRM University, Chennai. He has over fourteen years of experience in teaching. He is working as Assistant Professor (Sr.G) in the Department of MCA, SRM University, Chennai. His areas of interest are text processing, information retrieval and machine learning. He is a member of ISTE and ISC.

Dr. M. Abdul Karim Sadiq holds a Ph.D. in Computer Science & Engineering from the Indian Institute of Technology, Madras. He has over fourteen years of experience in software development, research, management and teaching. His areas of interest are text processing, information retrieval and machine learning. Having published papers in many international conferences and refereed journals of repute, he filed a patent in the United States Patent and Trademark Office. He is an associate editor of Soft Computing Applications in Business, Springer. Moreover, he has organized certain international conferences and is on their program committees. He has been awarded a star performer in the software industry.