"Email Authorship Identification Using Radial Basis Function"
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, January 2011
http://sites.google.com/site/ijcsis/ ISSN 1947-5500

A. Pandian
Asst. Professor (Senior Grade), Department of MCA, SRM University, Chennai, India
email@example.com

Dr. Md. Abdul Karim Sadiq
Ministry of Higher Education, College of Applied Sciences, Sohar, Sultanate of Oman
firstname.lastname@example.org

Abstract - Email authorship identification helps in tracking fraudulent emails. This research proposes extracting unique words from emails. These unique words are used as representative features to train a radial basis function (RBF) network. Final weights are obtained and subsequently used for testing. The percentage of correct authorship identification depends on the number of RBF centers and the type of functional words used for training the RBF. One hundred fifty authors with one hundred files from the sent folder of the Enron database are considered. A total of 300 unique words, with 3 to 7 characters in each word, are considered. Training and testing of the RBF are done with words of different lengths. The percentage of authorship identification ranges from 95% to 97%. Simulation shows the effectiveness of the proposed RBF network for email authorship identification.

Keywords: email authorship identification; word frequency; radial basis function

I. INTRODUCTION

The principal objective of author identification is to classify [Moshe 2002] the emails belonging to an author. This approach is used in forensics to identify the authors of malicious emails. Commercial software packages such as copycatch gold, jvocalize, signature stylometric system, textaz, Antconc, yoshikoder, lexico3, T-lab and wordsmithtools use statistical methods to identify an author. These packages use parameters such as the total number of different words, the number of content words used in the list, the total number of words in the text / vocabulary items used, vocabulary richness, mean sentence length, mean paragraph length, mean of 2-3 letter words, mean of vowel-starting words, the cumulative summation method, bigrams and many more. Users who intend to apply such software to email author identification need to choose the statistical analysis options that best identify the author of an email and obtain the characteristics that remain constant over a large number of emails written by the author. Each author follows a style that is reflected in the use of functional words. By using these functional words and their frequencies, identification of the author is easy [David 2005].

Authorship identification is important as the number of documents on the internet is increasing. Researchers have focused on different properties of texts. Two properties of texts are used in classification: the content of the text and the style of the author. Stylometry [Goodman 2007], the statistical analysis of literary style, complements traditional literary scholarship since it offers a means of capturing the often elusive character of an author's style [Zheng 2006] by quantifying some of its features. Most stylometry studies [Pavelec 2007 and Diederich 2008] employ items of language, and most of these are lexically based.

The usefulness of function words in authorship attribution [Diederich 2003] has been examined. Experiments were conducted with support vector machine classifiers on twenty novels, and success rates above 90% were obtained. The use of functional words is a valid and good approach in authorship attribution [Koppel 2006].

Stamatatos 2001 measured success rates of 65% and 72% in a study of authorship recognition based on multiple regression and discriminant analysis. Joachim Diederich 2003 and his collaborators conducted experiments with support vector classifiers and identified authors with 60-80% success rates for different parameters.

The effect of word sequences in authorship attribution [Abbasi 2005] has been studied. The researchers aimed to consider both stylistic and topic features of texts. In that work the documents are identified by the set of word sequences that combine functional and content words. The experiments are done on a dataset consisting of poems using a naïve Bayes classifier [Peng 2004]; the researchers claim that they achieved good results.

II. MATERIALS AND METHODS

2.1 Materials

Work words, action words, different categories of prepositions, pronouns, adjectives, adverbs, conjunctions and interjections are given in Table 1 to Table 3. These words are used for filtering and as templates. When an email is analyzed for uniqueness, the extracted features are based on the lists of words presented in the tables. Hence, unnecessary words are eliminated and the number of unique words that represent an email is kept to a minimum.

TABLE 1 SAMPLE WORDS USED FOR FILTERING

Work (70)    Action (524)   Preposition_1 (94)   Preposition_2 (30)
analyze      Accelerate     Aboard               according to
annotate     Accommodate    About                ahead of
ascertain    Accomplish     Above                as of
attend       Accumulate     Absent               as per
audit        Achieve        Across               as regards
build        Acquire        After                aside from
calculate    Act            Against              because of
consider     Activate       Along                close to
construct    Adapt          Alongside            due to
control      Add            Amid                 except for

TABLE 2 SAMPLE WORDS USED FOR FILTERING

Preposition_3 (16)    Preposition_4 (9)   Pronoun (77)   Adjectives (395)
as far as             apart from          All            early
as well as            but                 Another        abundant
by means of           except              Any            adorable
in accordance with    plus                anybody        adventurous
in addition to        save                Anyone         aggressive
in case of            concerning          anything       agreeable
in front of           considering         Both           alert
in lieu of            regarding           Each           alive
in place of           worth               each other     amused
in point of                               Either         ancient

TABLE 3 SAMPLE WORDS USED FOR FILTERING

Adverbs (331)     Conjunctions (25)   Interjections (77)
Abnormally        And                 Absolutely
absentmindedly    But                 Achoo
Accidentally      For                 Ack
Acidly            Nor                 Agreed
Actually          Or                  Aha
Adventurously     So                  Ahem
Afterwards        Yet                 Ahh
Almost            after               Ahoy
Always            although            Alack
Angrily           as                  Alas

Work words: Work words indicate how an author writes an email and how clearly the message is expressed. The number of work words reflects how neatly and unambiguously task requirements are stated, since work words translate exactly what the author has in mind.

Action words: These indicate actions expressed in the email.

Prepositions, adjectives, adverbs, conjunctions and interjections have their standard meanings.

The total number of words used as the basic dictionary is 1648 (work + action + prepositions + pronouns + adjectives + adverbs + conjunctions + interjections, that is 70 + 524 + 94 + 30 + 16 + 9 + 77 + 395 + 331 + 25 + 77). The numbers in parentheses in the tables are the totals in each category, whereas only a few words of each category are shown for illustration.

A schematic diagram of the implementation of the proposed work is presented in Figure 1.

Fig.1 (a) Training the system: Emails -> Extract words -> Filter words using template -> Find the frequency and the words for each category -> Create author matrix -> Train RBF and store final weights

Fig.1 (b) Testing the system: Emails -> Extract words -> Filter words using template words given -> Find the frequency and the words for each category -> Process with final weights -> Identify the author

Email: The email received in the system.

Extract words: All the words in the email are arranged.

Filter words: The words given in Tables 1-3 are searched for in the extracted words. Subsequently, the word frequencies are found.

Author matrix: A matrix with authors as columns and word frequencies as rows.

Training patterns: The columns of the matrix are used as training patterns, and a label is assigned to each.

2.2 Methods

2.2.1 Radial Basis Function

The concept of a distance measure is used to associate the input and output pattern values. RBFs are capable of producing approximations to an unknown function 'f' from a set of input data abscissae. The approximation is produced by passing an input point through a set of basis functions, each of which contains one of the RBF centers.

An exponential function is used as the activation function for the input data. The distances between the input data and a set of centers chosen from the input data are found and passed through the exponential activation function. A bias value of f is used along with the data. These data are further processed to get a set of final weights between the radial basis functions and the target value.

The topology of the RBF network is 12 nodes in the input layer, 10 nodes in the hidden layer and 1 node in the output layer. The difference between the input data and a center is passed through exp(-x), and the result is called the RBF. A rectangular matrix is obtained, for which an inverse is found. The resultant value is processed with the entire inputs and target values to obtain the final weights.

Fig 2 Radial basis function flow chart: Read input pattern -> Create centers -> Create RBF, rb -> G = rb^T rb -> Find D = det(G) -> Is D == 0? (Yes: SVD(D), G = U*W*V^T; No: B = Inv(G)) -> E = B*G' -> Find weights, F = E*Target

Details of Figure 2 are given below:

Read input pattern: The columns of the author matrix are used as training patterns. The number of patterns is equal to the number of authors.

Create center: One hundred training patterns are used as centers.

Create RBF: Calculate the distance between the training patterns and the one hundred centers. The resultant values are passed through the activation function exp(-x) to produce the outputs of the RBF nodes in the hidden layer of the network.

The number of training patterns and the number of centers produce a rectangular matrix. This is converted into a square matrix, its inverse is found, and the result is processed with the labels to get the final weights.

III. EXPERIMENTAL PROCEDURE

The Enron email dataset has been used for evaluating the efficiency of RBF in email authorship identification. This email dataset was made public by the Federal Energy Regulatory Commission during its investigation. It contains all kinds of emails, personal and official. William Cohen from CMU has put the dataset on the web for researchers. It contains around 517,431 emails from 151 users. Each mail in the folders contains the sender and receiver email addresses, date and time, subject, body text and some other email-specific technical details. It is available in the form of a MySQL database with a size of 400 MB. The Enron database contains four tables. The first table contains information on each of the 151 employees. The second table contains information on the email message: the sender, subject, text and other information. The third table contains the recipient's information. The fourth table records whether a message is a forward or a reply. Table 4 presents the names of a few folders under each author. Only 146 authors have been considered for this study.
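The training procedure of Section 2.2.1 (Figure 2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, Euclidean distance is assumed as the distance measure, a constant bias input of 1.0 is assumed, and a small ridge term stands in for the SVD branch that the flow chart uses when det(G) is zero.

```python
import math


def hidden_outputs(pattern, centers):
    """Outputs of the RBF nodes, exp(-distance), for one pattern, plus a bias term."""
    outputs = []
    for center in centers:
        distance = math.sqrt(sum((p - c) ** 2 for p, c in zip(pattern, center)))
        outputs.append(math.exp(-distance))
    return outputs + [1.0]  # bias input of 1.0 is an assumption of this sketch


def solve_linear(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for k in range(col, n + 1):
                    m[r][k] -= factor * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]


def train_rbf(patterns, targets, centers, ridge=1e-6):
    """Final weights from the normal equations (rb^T rb) w = rb^T targets."""
    rb = [hidden_outputs(p, centers) for p in patterns]  # rectangular matrix
    size = len(rb[0])
    # Square matrix G = rb^T rb; the ridge term keeps it invertible
    # (an assumption replacing the SVD branch of the flow chart).
    g = [[sum(row[i] * row[j] for row in rb) for j in range(size)] for i in range(size)]
    for i in range(size):
        g[i][i] += ridge
    rhs = [sum(row[i] * t for row, t in zip(rb, targets)) for i in range(size)]
    return solve_linear(g, rhs)


def predict_rbf(pattern, centers, weights):
    """Network output for one pattern: weighted sum of the RBF node outputs."""
    return sum(w * h for w, h in zip(weights, hidden_outputs(pattern, centers)))


# Toy usage: three made-up two-feature patterns, labelled 1, 2 and 3.
toy_patterns = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_targets = [1.0, 2.0, 3.0]
toy_weights = train_rbf(toy_patterns, toy_targets, centers=toy_patterns)
print([round(predict_rbf(p, toy_patterns, toy_weights), 3) for p in toy_patterns])
```

In this sketch every training pattern also serves as a center, which mirrors the observation later in the paper that identification improves as the number of centers approaches the number of patterns. An author would then be identified by picking the label closest to the network output.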
TABLE 4 DETAILS OF ENRON FOLDER

Person       sent_mail  all_documents  contacts  deleted_items  discussion_threads  inbox  notes_inbox  sent  sent_items
allen-p      602        628            2         361            412                 66     48           562   345
arnold-j     814        1047           X         723            401                 142    84           816   723
arora-h      X          65             X         197            57                  79     X            9     68
badeer-r     52         299            2         13             277                 3      115          X     7
bailey-s     X          16             X         434            X                   4      10           X     14
bass-e       1409       2037           X         415            1386                310    601          1363  258
Baughman-d   X          389            4         431            384                 383    X            96
beck-s       1093       3137           7         309            2630                751    190          1099  482
benson-r     X          84             X         203            77                  274    75           7     9
blair-l      39         2              X         662            X                   291    X            X     929

X represents no information. The Baughman-d row lists only eight values in the source; they are shown in their original order.

Fifteen unique words are identified in all the emails under consideration by using the filtering words given in Tables 1-3. The list of unique words is presented in Table 5.

TABLE 5 UNIQUE WORDS

our       when
out       which
plan      with
please    you
that      your
to        yours
we        zip
what

IV. IMPLEMENTATION

Characterization and feature extraction for training the radial basis function (RBF) are based on vowels at the beginning of words and on some of the grammatical rules present in the emails of an author. Figure 3 presents the authors on the x-axis and the number of words with vowels at the beginning on the y-axis. Each stem is the average number of words beginning with a vowel, taken over all the emails of an author. Figure 4 presents the number of work words. Figure 5 presents the number of action words. Figures 6 to 9 present the numbers of prepositions.

Fig.3 Number of words with vowel in the beginning of words (x-axis: Authors, y-axis: Frequency)

Fig.4 Work words for each author (x-axis: Authors, y-axis: Frequency)

Fig.5 Action words for each author (x-axis: Authors, y-axis: Frequency)
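The per-author features plotted in these figures can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the category lists below are tiny excerpts of Tables 1 and 3 rather than the full 1648-word dictionary, all function names are hypothetical, tokenization is a simple letters-only split, and multi-word prepositions such as "according to" are not handled.

```python
import re
from collections import Counter

# Tiny illustrative excerpts of the filtering words (assumption: the real
# lists are the full categories summarized in Tables 1-3).
CATEGORIES = {
    "work": {"analyze", "annotate", "build", "calculate"},
    "action": {"accelerate", "achieve", "act", "add"},
    "conjunction": {"and", "but", "for", "nor", "or", "so", "yet"},
}
FEATURES = ["vowel_initial", *CATEGORIES]


def tokenize(text):
    """Lower-case the text and keep runs of letters as words."""
    return re.findall(r"[a-z]+", text.lower())


def email_features(text):
    """Count vowel-initial words and words of each category in one email."""
    words = tokenize(text)
    counts = Counter()
    counts["vowel_initial"] = sum(1 for w in words if w[0] in "aeiou")
    for name, vocabulary in CATEGORIES.items():
        counts[name] = sum(1 for w in words if w in vocabulary)
    return counts


def author_profile(emails):
    """Average each feature over all emails of one author."""
    if not emails:
        raise ValueError("an author needs at least one email")
    totals = Counter()
    for text in emails:
        totals.update(email_features(text))
    return {name: totals[name] / len(emails) for name in FEATURES}


def author_matrix(emails_by_author):
    """Rows are features, columns are authors (sorted by name)."""
    authors = sorted(emails_by_author)
    profiles = {a: author_profile(emails_by_author[a]) for a in authors}
    return authors, [[profiles[a][f] for a in authors] for f in FEATURES]


# Toy usage with two made-up authors and one email each.
sample = {
    "alice": ["Please analyze and build the report."],
    "bob": ["Add an item or act."],
}
names, matrix = author_matrix(sample)
for feature, row in zip(FEATURES, matrix):
    print(feature, dict(zip(names, row)))
```

Each column of the resulting matrix plays the role of one training pattern, in line with the author matrix described in Section 2.1.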
Fig.6 Preposition 1 for each author (x-axis: Authors, y-axis: Frequency)

Fig.7 Preposition 2 for each author (x-axis: Authors, y-axis: Frequency)

Fig.8 Preposition 3 for each author (x-axis: Authors, y-axis: Frequency)

Fig.9 Preposition 4 for each author (x-axis: Authors, y-axis: Frequency)

Fig.10 Pronoun for each author (x-axis: Authors, y-axis: Frequency)

Fig.11 Adjectives for each author (x-axis: Authors, y-axis: Frequency)

Fig.12 Adverbs for each author (x-axis: Authors, y-axis: Frequency)

Fig.13 Conjunctions for each author (x-axis: Authors, y-axis: Frequency)

Fig.14 Interjections for each author (x-axis: Authors, y-axis: Frequency)

We use the following algorithm for email identification by neural network training and testing:

Find the words and their numbers of occurrences (frequencies) in an email and in all the emails of an author. Similarly, find the words and their frequencies in all emails of all authors by using the filtering words available in Tables 1-3.

Create a zero matrix with the number of rows equal to the total number of unique words over all emails of all authors. The number of columns is equal to the number of authors.

Based on the dictionary of words obtained from all the emails, fill in each column of the zero matrix with the frequencies of the words present in the corresponding documents. Each column is treated as a pattern for training. A label is assigned to each pattern.

Train the RBF network with the patterns considered for training. A final weight matrix is obtained, which is then used to test whether an incoming mail belongs to one of the existing authors; otherwise, the mail belongs to a person other than the authors considered in this experiment.

V. RESULTS AND DISCUSSIONS

Fig.15 Performance of RBF center selection (x-axis: email, y-axis: RBF output and author identification; legend: Center = 2, 25, 50, 75, 100, 146)

Figure 15 presents the performance of the RBF in training the patterns. When the number of centers used is less than 50% of the total number of input patterns, the performance of author identification is at its lowest. As the number of centers increases, the author identification improves. The legend shows the number of centers. Figure 16 presents the performance of the RBF. In this plot, the output obtained from the RBF overlaps the target outputs. The plot shows emails versus author identification. With 146 centers, the RBF identifies the maximum number of authors.

Fig.16 Performance of RBF (x-axis: Email, y-axis: RBF output and author identification; legend: Target, RBF output)

This work has presented a novel method of identifying email authorship using RBF. Patterns of training data have been collected by averaging the frequencies of the words of each person and fixing a target value for the person. A testing pattern has been created by modifying the existing contents of an email. A new word has been considered during testing. If the new word does not fit into the patterns used for training, then the word is excluded during testing. As we do not know to which author the email belongs, all the training patterns are treated as test patterns after adding the frequencies of the new mail. As only 146 authors are considered, 146 outputs are obtained after testing.

The receiver operating characteristics (ROC) analysis of the authorship identification considers the following questions.

Is the author of a different document that belongs to this author correctly identified (True positive)?

Is the author wrongly classified, so that the document is judged not to belong to him but to some other person, or to none of the ten authors under experiment (False positive)?

Is a document that belongs to some other author not considered in this experiment treated as a document of one of the ten authors (False negative)?

Is a document from outside the training group correctly judged to belong to that outside group and not to the authors considered in this experiment (True negative)?

Table 6 presents the confusion matrix values and the ROC values. Emails that belong to the training group and emails that do not belong to the training group have been considered. All the emails that belong to the (sent / sent_mail) folders are used for training. The emails of the remaining folders of all authors have been considered for testing. The performance of the RBF has been calculated using the confusion matrix. The plot (Figure 17) indicates that the proposed RBF system is suitable for author identification from given emails. This is inferred from the points obtained above the diagonal of the ROC curve.

TABLE 6 CONFUSION MATRIX FOR RECEIVER OPERATING CHARACTERISTICS

Instances  True Positive  False Negative  Sensitivity  False Positive  True Negative  Specificity  False Positive Rate
1          80             20              0.80         10              40             0.80         0.20
2          82             18              0.82         8               42             0.84         0.16
3          90             10              0.90         5               45             0.90         0.10
4          85             15              0.85         7               43             0.86         0.14
5          92             8               0.92         8               42             0.84         0.16

Sensitivity = True Positive Rate = True Positive / (True Positive + False Negative)
Specificity = True Negative / (True Negative + False Positive)
False Positive Rate = 1 - Specificity

For instance 1, sensitivity = 80 / (80 + 20) = 0.80, specificity = 40 / (40 + 10) = 0.80, and the false positive rate is 1 - 0.80 = 0.20.

Fig.17 Receiver Operating Characteristics

VI. CONCLUSION

The proposed RBF has been used for author identification of emails. Different numbers of RBF centers and their effectiveness in author identification are presented. The receiver operating characteristics curve has been presented, and it shows that the performance of the proposed RBF network is acceptable. As further work, the large number of words can be meaningfully filtered to retain those that are more specific to an author, and these can be used for author identification.

REFERENCES

1. Abbasi A. and Chen H., "Applying Authorship Analysis to Extremist-Group Web Forum Messages", IEEE Intelligent Systems, pp. 67-75, 2005.
2. David Madigan, Alexander Genkin, David Lewis, Shlomo Argamon, Dmitriy Fradkin, and Li Ye, "Author Identification on the Large Scale", Proc. of the Meeting of the Classification Society of North America, 2005.
3. Diederich J. and Chen H., "Writeprints: A stylometric approach to identity-level identification and similarity detection", ACM Transactions on Information Systems 26(2), pp. 7, 2008.
4. Diederich J., Kindermann J., Leopold E. and Paass G., "Authorship Attribution with Support Vector Machines", Applied Intelligence 19(1), pp. 109-123, 2003.
5. Goodman R., Hahn M., Marella M., Ojar C., and Westcott S., "The Use of Stylometry for Email Author Identification: A Feasibility Study", Proc. Student/Faculty Research Day, CSIS, Pace University, White Plains, NY, pp. 1-7, May 2007.
6. Koppel M., Schler J., Argamon S. and Messeri E., "Authorship Attribution with Thousands of Candidate Authors", in Proc. 29th ACM SIGIR Conference on Research & Development on Information Retrieval, 2006.
7. Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni, "Automatically Categorizing Written Texts by Author Gender", Literary and Linguistic Computing 17(4), pp. 401-412, 2002.
8. Pavelec D., Justino E., and Oliveira L. S., "Author Identification Using Stylometric Features", Inteligencia Artificial 11(36), pp. 59-65, 2007.
9. Peng F., Schuurmans D., Wang S., "Augmenting Naive Bayes Text Classifier with Statistical Language Models", Information Retrieval 7(3-4), pp. 317-345, 2004.
10. Zheng R., Li J., Chen H., Huang Z., "A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques", Journal of the American Society for Information Science and Technology 57(3), pp. 378-393, 2006.

AUTHORS PROFILE

A. PANDIAN received his B.Sc. and MCA degrees from Bharathidasan University, Tiruchi. He received his M.Tech degree from Punjabi University, Patiala, Punjab and his M.Phil. degree from Periyar University, Salem. He is doing a Ph.D. (Computer Science & Engineering) at SRM University, Chennai. He has over fourteen years of experience in teaching. He is working as Assistant Professor (Sr.G) in the Department of MCA, SRM University, Chennai. His areas of interest are text processing, information retrieval and machine learning. He is a member of ISTE and ISC.

Dr. M. Abdul Karim Sadiq holds a Ph.D. in Computer Science & Engineering from the Indian Institute of Technology, Madras. He has over fourteen years of experience in software development, research, management and teaching. His areas of interest are text processing, information retrieval and machine learning. Having published papers in many international conferences and refereed journals of repute, he filed a patent in the United States Patent and Trademark Office. He is an associate editor of Soft Computing Applications in Business, Springer. Moreover, he has organized international conferences and serves on program committees. He has been awarded a star performer award in the software industry.