IJCSIS Vol. 10, No. 3, March 2012 Edition
ISSN 1947-5500 © IJCSIS, USA & UK.

									                                               (IJCSIS) International Journal of Computer Science and Information Security,
                                               Vol. 10, No. 3, March 2012

Data Mining Techniques: A Key for Detection of Financial Statement Fraud

Rajan Gupta
Research Scholar, Dept. of Computer Sc. & Applications, Maharshi Dayanand University, Rohtak (Haryana) – India. Email:

Nasib Singh Gill
Head, Dept. of Computer Sc. & Applications, Maharshi Dayanand University, Rohtak (Haryana), India. Email:

Abstract

In recent times, news from the business world has been dominated by financial statement fraud. A financial statement becomes fraudulent when management intentionally incorporates false information into it. This paper implements data mining techniques, namely CART, Naïve Bayesian classifier and Genetic Programming, to identify companies that issue fraudulent financial statements. Each of these techniques is applied to a dataset from 114 companies. CART outperforms the other techniques in the detection of fraud.

1. Introduction

Financial statement fraud is a serious social and economic problem worldwide and is more severe in growing economies. A company listed on any stock exchange is required to publish its financial statements, such as the balance sheet, income statement, statement of retained earnings and cash flow statement, yearly and quarterly. The financial statements of a company reflect its actual financial health, and by analysing them stockholders can make an informed decision about investing in the company. An intentional distortion of information in a financial statement is termed financial statement fraud. Conventionally, auditors are held responsible for identifying and detecting fraudulent financial statements, although their formal role is to report whether the statements conform to GAAP. With the increase in the number of high-profile fraud cases, auditors are overburdened with the additional duty of detecting fraud. Hence, various data mining techniques are being used to ease this extra pressure on auditors.

Some of the world's major fraud cases include Enron, WorldCom and Satyam. A number of Chinese companies listed on US stock exchanges have faced accusations of accounting fraud, and in June 2011 the U.S. Securities and Exchange Commission warned investors against investing in Chinese firms listing via reverse mergers. While over 20 US-listed Chinese companies were de-listed or halted in 2011, a number of others have been hit by the resignation of their auditors. Warnings of fraud in US-listed Chinese companies have grown in recent months. In January 2011, the shares of China Forestry Holdings were suspended after the auditor KPMG informed the board of directors of possible irregularities in its accounting books. On 11 April 2011, the SEC suspended trading in RINO International due to questions surrounding the accuracy and completeness of information contained in RINO's public filings, and the company's failure to report the resignation of its chairman, directors of the board, and an outside lawyer and forensic accountants brought in to investigate allegations of fraud. The finger was pointed at Sino-Forest Corporation, a Toronto-listed forestry firm, on 2 June 2011, after a short-seller accused the firm of inflating its assets. More recently, the unravelling of Longtop Financial Technologies Ltd highlighted the scale of the problem. The company regularly reported income that was slightly higher than executives' predictions [1]. The "cash balance" on Longtop's balance sheet was fake, a fiction created by the company's managers with bank complicity [2].

Data mining is an iterative process within which progress is defined by the discovery of knowledge. It is most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an "interesting" outcome [3]. The application of data mining techniques to the detection and identification of financial statement fraud is a fertile research area. Several law enforcement agencies and special investigative units have used data mining techniques successfully for the detection of financial fraud.

In this study, we analyse the financial statements of various organisations to detect financial statement fraud using data mining techniques. This research aims at identifying the financial ratios / items from financial statements that can help auditors assess the probability of financial fraud. In this study, the data mining techniques CART, Naïve Bayesian Classifier and Genetic Programming are tested for their applicability in the detection of fraudulent
financial statements and in differentiating between fraud and non-fraud reporting. The dataset consists of financial ratios obtained from publicly available financial statements.

The paper is organised as follows: Section 2 discusses the relevant prior research, followed by Section 3, which describes the various tricks adopted by management for falsifying financial statements. Section 4 presents the key variables and financial ratios related to the detection of financial statement fraud. Section 5 provides an insight into the data mining techniques used in this study. Section 6 analyses the results, followed by concluding remarks (Section 7).

2. Related Work

An overview of the academic literature concerning the detection of financial statement fraud is given here. A number of studies, such as PwC [4] and ACFE [5], describe how fraud comes to be detected. Their findings suggest that fraud has often been detected by chance or by accident. For example, the PwC report [4] reveals that 41% of the fraud cases were detected by means of tip-offs or by chance.

Several groups of researchers have devoted significant effort to studying the use of data mining techniques in the detection of financial statement fraud from different perspectives. Beasley [6] used logit regression, with a sample of 150 American firms, to test the prediction that the inclusion of larger proportions of outside members on the board of directors significantly reduces the likelihood of financial statement fraud. The study found that non-fraud firms have boards with significantly higher percentages of outside members than fraud firms. Green and Choi [7] presented a neural network fraud classification model employing endogenous financial data. A classification model created from the learned behaviour pattern is then applied to a test sample. Fanning and Cogger [8] also used an artificial neural network to predict management fraud. Using publicly available predictors of fraudulent financial statements, they found a model of eight variables with a high probability of detection. Kirkos et al. [9] carried out an in-depth examination of publicly available data from the financial statements of various firms in order to detect fraudulent financial statements (FFS) using data mining classification methods. In that study, three data mining techniques, namely Decision Trees, Neural Networks and Bayesian Belief Networks, are tested for their applicability in management fraud detection. Spathis et al. [10] compared multi-criteria decision aids with statistical techniques such as logit and discriminant analysis in detecting fraudulent financial statements. Cecchini et al. [11] developed a novel financial kernel using support vector machines for the detection of management fraud. An innovative fraud detection mechanism was developed by Huang et al. [12] on the basis of Zipf's Law. This technique reduces the burden on auditors of reviewing overwhelming volumes of data and assists them in identifying potential fraud records. Hoogs et al. [13] presented a genetic algorithm approach to detecting financial statement fraud. Cerullo and Cerullo [14] explained the nature of fraud and financial statement fraud along with the characteristics of neural networks (NN) and their applications. They illustrated how NN packages could be utilized by various firms to predict the occurrence of fraud. Koskivaara [15] proposed NN-based support systems as a possible tool for use in auditing, and demonstrated that the main application areas of NN were the detection of material errors and of management fraud. Busta and Weinberg [16] used NN to distinguish between 'normal' and 'manipulated' financial data. They examined the digit distribution of the numbers in the underlying financial information. Koh and Low [17] constructed a decision tree to predict hidden problems in financial statements by examining the following six variables: quick assets to current liabilities, market value of equity to total assets, total liabilities to total assets, interest payments to earnings before interest and tax, net income to total assets, and retained earnings to total assets. Belinna et al. [18] examined the effectiveness of CART in the identification and detection of financial statement fraud. They concluded that CART is a very effective technique for distinguishing fraudulent financial statements from non-fraudulent ones. Further, Deshmukh and Talluru [19] demonstrated the construction of a rule-based fuzzy reasoning system to assess the risk of management fraud and proposed an early warning system built on 15 rules related to the probability of management fraud. Zhou and Kapoor [20] examined the effectiveness and limitations of data mining techniques such as regression, decision trees, neural networks and Bayesian networks. They explored a self-adaptive framework based on a response surface model with domain knowledge to detect financial statement fraud. Recently, Ravisankar et al. [21] used data mining techniques such as Multilayer Feed Forward Neural Network (MLFF), Support Vector Machines (SVM), Genetic Programming (GP), Group Method of Data Handling (GMDH), Logistic Regression (LR) and Probabilistic Neural Network (PNN) to identify companies that resort to financial statement fraud. They found that PNN outperformed all the other techniques without feature selection, and that GP and PNN outperformed the others with feature selection, with marginally equal accuracies.

If we summarize the existing academic research, we arrive at the conclusion that detection of financial statement fraud is an instance of a classification and
decision problem. In the present research, we apply the same idea and implement data mining classification methods to differentiate between fraudulent and non-fraudulent observations.

3. Artifices used by top-level executives for fraudulent financial reporting

Financial statements are a company's basic documents for reflecting its financial status [22]. A complete and thorough analysis of financial statements can help investors judge the financial status of a company. Any material misstatement in a financial statement is a major concern for investors worldwide. The techniques associated with the production of fraudulent financial statements are discussed in Schilit's book "Financial Shenanigans" [23]. The book reports seven common tricks:
(1) Recording revenue before it is earned;
(2) Creating fictitious revenue;
(3) Boosting profits with non-recurring transactions;
(4) Shifting current expenses to a later period;
(5) Failing to record or disclose liabilities;
(6) Shifting current income to a later period; and
(7) Shifting future expenses to an earlier period.
The first five tricks aim at boosting current-year earnings, and the last two shift current-year earnings to the future in order to create an illusion of steady income over the years.

A number of studies have been conducted to find indicators of fraudulent financial statements. One such study, conducted by C. Fei in China, found four types of companies that are more prone to financial scandals [24]:
(i) Companies with frequent capital operations and related-party transactions;
(ii) Companies with high and volatile stock prices;
(iii) Initial Public Offering (IPO) companies; and
(iv) Companies in a declining or over-competitive business environment.

Management of an organisation may falsify the financial statements to achieve the following:
      a. Obtaining sanction of a large loan from a bank
      b. Paying lower dividends to shareholders
      c. Avoiding payment of taxes
      d. Inflating stock prices
At present there is a steady increase in the number of companies that falsify their financial statements in order to present a rosy picture of their financial status to stockholders and to make selfish gains. Hence, detection of such fraud is an additional responsibility of auditors and is the need of the hour.

4. Key financial items / ratios relevant to the detection of fraud

The values and numbers present in financial statements can be easily interpreted with the help of different financial ratios. Financial ratios assist investors / auditors in evaluating the actual position of the company. On the basis of existing academic research and experts' knowledge, we identify the following financial variables / ratios (Table 1).

(a) Z-score: Financial distress may be a motivation for management fraud [8]. The Z-score was developed by Altman [25] to measure financial distress. It is a formula for estimating the financial status of a company and is also helpful in bankruptcy prediction. The Z-score for public companies is given by:

\[
Z = 1.2\,\frac{\text{Working capital}}{\text{Total assets}}
  + 1.4\,\frac{\text{Retained earnings}}{\text{Total assets}}
  + 3.3\,\frac{\text{Earnings before interest and taxes}}{\text{Total assets}}
  + 0.6\,\frac{\text{Market value of equity}}{\text{Book value of total liabilities}}
  + 0.999\,\frac{\text{Sales}}{\text{Total assets}}
\]

(b) Debt structure: A high debt structure may be an indicator of fraudulent financial reporting, because it shifts the risk from managers to debt owners. Hence higher levels of debt may increase the likelihood of financial statement fraud, and the financial ratios related to debt structure should be considered carefully.

(c) Continuous growth: The need for continuous growth may be another motivational factor for financial statement fraud [26]. So, the sales growth ratio should be measured as an indicator of fraudulent financial statements:

\[
\text{Sales growth} = \frac{\text{Current year's sales} - \text{Last year's sales}}{\text{Last year's sales}}
\]

(d) Other items: A company may manipulate accounts receivable, inventories and gross margin. Accounts receivable may be manipulated by recording sales before they are earned. Inventory is also prone to manipulation. Managers may manipulate inventory either by reporting inventory at lower cost, by carrying obsolete inventory, or both. A company may also use gross margin as a means of falsifying its financial statements. The company may not match its sales with the corresponding cost of goods sold, thus increasing gross margin and net income and strengthening the balance sheet [8].

(e) Other qualitative variables: Qualitative variables such as the qualification or composition of the administrative board of a company, previous auditors, high turnover of CEO and CFO, and the size and age of a company could prove helpful in searching for indicators of cooked books.
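
The Z-score in item (a) and the sales growth ratio in item (c) above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `Financials` container, its field names and both function names are hypothetical, and the fourth Z-score term is assumed to be market value of equity over book value of total liabilities, as in Altman's published model for public companies.

```python
from dataclasses import dataclass


@dataclass
class Financials:
    """Hypothetical container for the statement items used below."""
    working_capital: float
    retained_earnings: float
    ebit: float                  # earnings before interest and taxes
    market_value_equity: float   # assumption: numerator of the fourth term
    total_liabilities: float     # book value of total liabilities
    sales: float
    total_assets: float


def altman_z(f: Financials) -> float:
    """Altman Z-score for a public company, using the weights quoted in item (a)."""
    if f.total_assets == 0 or f.total_liabilities == 0:
        raise ValueError("total assets and total liabilities must be non-zero")
    ta = f.total_assets
    return (1.2 * f.working_capital / ta
            + 1.4 * f.retained_earnings / ta
            + 3.3 * f.ebit / ta
            + 0.6 * f.market_value_equity / f.total_liabilities
            + 0.999 * f.sales / ta)


def sales_growth(current_sales: float, last_sales: float) -> float:
    """Relative change in sales from last year to the current year, as in item (c)."""
    if last_sales == 0:
        raise ValueError("last year's sales must be non-zero")
    return (current_sales - last_sales) / last_sales


if __name__ == "__main__":
    # Made-up figures, for illustration only.
    firm = Financials(working_capital=200.0, retained_earnings=300.0, ebit=150.0,
                      market_value_equity=600.0, total_liabilities=400.0,
                      sales=1000.0, total_assets=1000.0)
    print("Z-score:", altman_z(firm))
    print("Sales growth:", sales_growth(1000.0, 800.0))
```

Both functions reject zero denominators rather than returning infinities, so that a missing or zero balance-sheet item is caught before the ratio is fed to a classifier.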


5. Research Methodology

5.1 Dataset

The dataset used in this research is obtained from 114 companies listed on different stock exchanges. Of these 114 firms, 85 have not reported their financial statements fraudulently, whereas 29 face various charges of fraudulent financial reporting. The data has been collected from … for all the 114 companies. We reviewed AAERs (Accounting and Auditing Enforcement Releases) published by the SEC (U.S. Securities and Exchange Commission) between 2007 and 2012 to identify companies accused of falsifying financial statements. All the firms in the sample have been checked by auditors. There was a clear indication of fraudulent financial reporting for the 29 fraud firms. Indicators of fraud include: resignation of the auditors, chairman or board of directors, doubts reported by auditors, and observations by the tax authorities.

The 29 fraud firms have been matched with 85 non-fraud organisations. These firms are classified as non-fraud because no published indication or proof of fraud is present. However, the absence of proof does not guarantee that these firms have not falsified their financial statements or will not do so in future. This research only assures that no fraudulent reporting has been found for these firms.

5.2 Variables

All the variables considered as candidates for the input vector have been extracted from published financial statements such as the income statement and balance sheet. The dataset contains 52 financial ratios / items for each of the 114 companies. A list of these financial ratios / items is presented in Table 1. The selection of these financial variables is based on prior research and on financial ratios covering the liquidity, safety, profitability and efficiency of the organisations under consideration. During the preprocessing stage, each of the independent financial variables has been normalized. In order to further improve the reliability of the results, we perform ten-fold cross validation.

Table 1: Items / Ratios from financial statements to be used for detection of financial statement fraud
S.No.   Financial items / Ratios
1       Debt
2       Total assets
3       Gross profit
4       Net profit
5       Primary business income
6       Cash and deposits
7       Accounts receivable
8       Inventory/Primary business income
9       Inventory/Total assets
10      Gross profit/Total assets
11      Net profit/Total assets
12      Current assets/Total assets
13      Net profit/Primary business income
14      Accounts receivable/Primary business income
15      Primary business income/Total assets
16      Current assets/Current liabilities
17      Primary business income/Fixed assets
18      Cash/Total assets
19      Inventory/Current liabilities
20      Total debt/Total equity
21      Long term debt/Total assets
22      Net profit/Gross profit
23      Total debt/Total assets
24      Total assets/Capital and reserves
25      Long term debt/Total capital and reserves
26      Fixed assets/Total assets
27      Deposits and cash/Current assets
28      Capitals and reserves/Total debt
29      Accounts receivable/Total assets
30      Gross profit/Primary business profit
31      Undistributed profit/Net profit
32      Primary business profit/Primary business profit of last year
33      Primary business income/Last year's primary business income
34      Accounts receivable/Accounts receivable of last year
35      Total assets/Total assets of last year
36      Debt / Equity
37      Accounts Receivable / Sales
38      Inventory / Sales
39      Sales – Gross Margin
40      Working Capital / Total Assets
41      Net Profit / Sales
42      Sales / Total Assets
43      Net income / Fixed Assets
44      Quick assets / Current Liabilities
45      Revenue / Total Assets
46      Current Liabilities / Revenue
47      Total Liability / Revenue
48      Sales Growth Ratio
49      EBIT
50      Z-Score
51      Retained Earnings / Total Assets
52      EBIT / Total Assets

We compiled all the financial items / ratios of Table 1 and applied one-way ANOVA to the dataset, both to reduce dimensionality and to test whether the difference between the two classes, fraud and non-fraud, is significant for each variable. Variables with a high p-value are considered non-informative. Variables with a p-value <= 0.05 are considered informative and are tested further using data mining methods. The informative financial ratios are presented in Table 2 along with their F-values and p-values.

Table 2: Informative financial ratios / items
S.No.   Financial Ratios / Items             P-value   F-value
1       Debt                                 0.028     1.345
2       Inventory/Primary business income    0.001     3.031
3       Inventory/Total assets               0.046     5.74
4       Net profit/Total assets              0.001     3.04
5       Cash/Total assets                    0.001     2.906
6       Total debt/Total assets              0.002     …03
7       Fixed assets/Total assets            0.05      2.075
8       Deposits and cash/Current assets     0.001     2.93
9       Working Capital / Total Assets       0.001     2.86
10      Sales / Total Assets                 0.002     12.077
11      Net income / Fixed Assets            0.001     3.0
12      Revenue / Total Assets               0.002     12.077
13      EBIT                                 0.026     4.29
14      Z score                              0.001

… with pre-assigned classes. CART is a classification method that differs from traditional statistical methods. It is a decision tree learning technique that produces a classification tree if the dependent variable is categorical and a regression tree otherwise. The method classifies the samples into a number of non-overlapping regions. Tree construction using the CART methodology includes three steps. The first step, also known as the growing phase, consists of constructing the maximum tree, which means that splitting of the learning sample continues up to the point where each terminal node contains observations of only one class. This step is the most time consuming because each iteration seeks the best splitting variable. A tree constructed in this way may consist of hundreds of levels and insignificant nodes or subtrees. Therefore, this complex tree
14                                                      3.13
                                                                      should be pruned by using cross v   validation as onne
        Account receivable/Prim
                ts             mary        0.018
15      business income                                    99
                                                                      of the p              thm. This pruning will result in
                                                                              pruning algorit
        Primary business incomee/Total     0.001                      a right size tree which will be used in the third ste
                                                                                              h                           ep
16      assets                                          3.04
                                                           48         for classsifying new daata.
        Primary business incomee/Fixed     0.001                           CAART as a cla                 m
                                                                                             assification method does n  not
17      assets                                             57
                                                        3.05                 e                             n
                                                                      require variables to be selected in advance. Th    his
        Capitals and reserves/To debt      0.003                             d
                                                                      method automatica      ally identifi ies significa ant
18                                                        2.2
                                                                      variables and ignor   res the non significant on   ne.
               rofit/Primary bus
        Gross pr               siness      0.008
                                                                      Moreov CART is very sensitive to the trainin
                                                                              ver,                         e              ng
19      profit                                          3.92
        Account Receivable / S
               ts             Sales         0.013
                                                                      data if it consists o outliers. Ou
                                                                             f              of                           ery
                                                                                                            utliers are ve
20                                                      2.31
                                                           11         promin                 al
                                                                             nent in financia data due to financial crise es.
        Retined earnings / Total Assets     0.001                     Non p   parametric na ature and its capability of
21                                                      3.04
        EBIT / T
               Total Assets                0.001                             ng
                                                                      handlin noisy data are one of the reasons f        for
22                                                      3.05
                                                           59                 ng             ne           hod
                                                                      selectin CART as on of the meth to be used in
                                                                      this res

                                                                           5.3             Bayesian Class
                                                                                     Naïve B            sifier
             Data Mining Methods
         5.3 D           M
                                                                      Naive BBayesian class sifier is a proba                ng
                                                                                                             abilistic learnin
     Detection of financial statement fra    aud can be                      que
                                                                      techniq based on a                   es’
                                                                                             applying Baye theorem wi        ith
                d              l
     considered as a classical problem of c classification.           class c condition ind dependence as   ssumption. Th   his
                               wo            e
     Classification includes tw steps. In the first step, a           strong (naive) indepe endence assum  mption states th hat
                t              et
     model that describes a se of predeterm mined classes             presenc or absence o an attribute of a class is n
                                                                             ce              of                             not
     is construccted. The sam               he
                             mple used in th process is                                     e
                                                                      related to presence or absence of any oth             her
     known as training sam     mple. Each tu uple in the              attribut Bayes’ the
                                                                             te.                            tes
                                                                                            eorem calculat the posteri       ior
     training se is supposed to belong to a predefined                probability as
     class as determined by th class label atttribute. This                      P(H|X) = (PP(X|H) * P(H)) / P(X)
     step of sup                ing         ed
                 pervised learni is followe by second                 Where, H is a hypo
                                                                             ,               othesis such a the object X
     step in wh hich the model attempts to classify new               belongs to class C.
     objects wh                e
                 hich form the validation sample. Data                If an object X belo    ongs to one of i alternativ      ve
     mining su   uggests a n                 c
                               number of classification               classes, in order to c classify the obbject a Bayesia   an
     techniques which have an excellent re   eputation for                   ier             he             es
                                                                      classifi calculates th probabilitie P(Ci|X) for a      all
     their class sification cappabilities. Most of these              the pos               Ci
                                                                             ssible classes C and assigns the object to th    he
                ion methods are derived fro artificial
     classificati              a             om                             with             um             y
                                                                      class w the maximu probability P(Ci|X).
                               s.            c
     intelligence and statistics Three such classification            The coonditional distr                he
                                                                                            ribution over th class variab   ble
     methods n                  ,            ian
                namely CART, naive Bayesi classifier                         be
                                                                      C can b expressed as   s
     and Genet Programmi        ing are emplooyed in this
     research sttudy.
          5.3.1      CART
     Classification and R   Regression T Tree is a
     computerizzed, non – pa             a
                            arametric data exploration
                            e             istorical data
     and prediction technique which uses hi
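Under the class conditional independence assumption, Bayes' theorem for a class Ci takes the standard naive Bayes form sketched below. The notation is an assumption made for illustration: X = (x_1, ..., x_n) denotes the attribute values of the object, and the scaling factor Z equals P(X). This is the textbook expression and not necessarily the exact form printed by the authors.

```latex
P(C_i \mid X) \;=\; \frac{1}{Z}\, P(C_i) \prod_{j=1}^{n} P(x_j \mid C_i),
\qquad
Z \;=\; P(X) \;=\; \sum_{k} P(C_k) \prod_{j=1}^{n} P(x_j \mid C_k)
```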


Where Z (the evidence) is a scaling factor dependent only on the feature variables (attributes), i.e., a constant if their values are known. If the assumption of class conditional independence holds true, the naive Bayesian classifier produces the best accuracy rates. This classification technique requires only a small amount of data to estimate the parameters, such as means and variances of the variables, necessary for classification. This assumption greatly reduces the computational cost, as only the class distribution has to be counted.

However, this assumption of independence may not be valid in many cases, because attributes are generally dependent in nature. This naive design and simplified assumption should not be taken as a limitation, because the naive Bayes classifier works much better in many complex and real world situations.

     5.3.3 Genetic Programming

Genetic programming (GP) is an evolutionary learning technique that offers great potential for classification [27]. GP follows Darwin's theory of evolution, commonly known as "survival of the fittest". There is a randomly generated initial population of solutions that reproduce with each other using various genetic operators such as reproduction, crossover, mutation etc. This process of evolution is termed a generation.

GP is essentially considered to be a variant of genetic algorithms (GA) that uses a complex representation language to codify individuals [28]. The basic difference between GP and GA is the representation of solutions. Genetic programming follows these sequential steps for solving a problem [29].
     a) Create a random population of programs, or rules, using the symbolic expressions provided as the initial population.
     b) Evaluate each program or rule by assigning a fitness value according to a predefined fitness function that can measure the capability of the rule or program to solve the problem.
     c) Use the reproduction operator to copy existing programs into the new generation.
     d) Generate the new population with crossover, mutation, or other operators from a randomly chosen set of parents.
     e) Repeat the second to the fourth steps for the new population until a predefined termination criterion is satisfied, or a fixed number of generations is completed.
     f) The solution to the problem is the genetic program with the best fitness within all generations.

The most important operation for generating a new population in GP is crossover. Crossover is achieved by selecting two parent trees and reproducing to form two new solutions. The parent trees are selected from the initial population by a function of the fitness of the solutions. The creation of the offspring from the crossover operation is accomplished by deleting the crossover fragment of the first parent and then inserting the crossover fragment of the second parent. The second offspring is produced in a symmetric manner. The fitness function used to search for the most efficient computer program that can solve the given problem is given below [29].

          Fitness = (No. of samples classified correctly) / (No. of samples used for training during evaluation)

The application of GP in pattern classification offers the following advantages.
     1) GP is very flexible, which means it can be adapted to the needs of each particular problem.
     2) GP can be employed on the data in its original form.
     3) A priori knowledge about the distribution of the data is not required, since GP is free from data distribution.
     4) GP can easily express unknown relationships among the data in mathematical expressions.
     5) GP can be useful in preprocessing and postprocessing along with classification in order to enhance the classifier.
     6) GP can be helpful in finding out the majority of discriminating features of a class in the training stage.

     6. Experimental Results and Analysis

The three data mining methods discussed above have been implemented on the dataset and compared on the basis of sensitivity and specificity. The sensitivity of a particular method is measured as the ratio of the number of fraudulent organisations identified accurately as fraudulent to the total number of actual fraudulent firms, whereas specificity is the ratio of the number of non-fraud firms identified as non-fraud to the total number of real non-fraudulent companies (Table 8).

In this study, the CART model is constructed using SIPINA Research edition software (32-bit version). The tree given below has been built by using the whole sample as the training set with a confidence level of 0.05.

          Figure: 1 CART
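The sensitivity and specificity defined in this section can be checked directly against the confusion matrices reported in Tables 3, 5 and 6. The sketch below is illustrative only: the class and function names (`ConfusionMatrix`, `summarize`, `REPORTED`) are hypothetical, and it assumes that the table rows give the actual class and the columns give the predicted class, a reading that reproduces the CART figures of 86.2 % and 100 % in Table 8.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ConfusionMatrix:
    """Two-class counts with fraud as the positive class.

    Assumption: rows of the paper's tables are actual labels,
    columns are predicted labels.
    """
    nf_as_nf: int  # actual non-fraud, classified as non-fraud
    nf_as_f: int   # actual non-fraud, classified as fraud
    f_as_nf: int   # actual fraud, classified as non-fraud
    f_as_f: int    # actual fraud, classified as fraud

    def sensitivity(self) -> float:
        """Percentage of actual fraud firms identified as fraud."""
        return 100.0 * self.f_as_f / (self.f_as_nf + self.f_as_f)

    def specificity(self) -> float:
        """Percentage of actual non-fraud firms identified as non-fraud."""
        return 100.0 * self.nf_as_nf / (self.nf_as_nf + self.nf_as_f)

    def accuracy(self) -> float:
        """Percentage of all firms classified correctly."""
        total = self.nf_as_nf + self.nf_as_f + self.f_as_nf + self.f_as_f
        return 100.0 * (self.nf_as_nf + self.f_as_f) / total


# Counts copied from Tables 3, 5 and 6 of this section.
REPORTED: Dict[str, ConfusionMatrix] = {
    "CART": ConfusionMatrix(85, 0, 4, 25),
    "Naive Bayesian Classifier": ConfusionMatrix(83, 2, 10, 19),
    "Genetic Programming": ConfusionMatrix(84, 1, 13, 16),
}


def summarize(
    matrices: Dict[str, ConfusionMatrix],
) -> Dict[str, Tuple[float, float, float]]:
    """Return (sensitivity, specificity, accuracy) in percent, one decimal."""
    return {
        name: (
            round(m.sensitivity(), 1),
            round(m.specificity(), 1),
            round(m.accuracy(), 1),
        )
        for name, m in matrices.items()
    }


if __name__ == "__main__":
    for name, (sens, spec, acc) in summarize(REPORTED).items():
        print(f"{name}: sensitivity={sens}%, specificity={spec}%, accuracy={acc}%")
```

Each matrix sums to 114 firms (85 non-fraud and 29 fraud), which matches the sample size stated in the conclusion.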


The confusion matrix is given below (Table: 3).

      Table: 3 (Confusion Matrix for CART)
Label               NF (Non Fraud)      F (Fraud)
NF (Non Fraud)      85                  0
F (Fraud)           4                   25

CART correctly classifies 96 % of the cases. This method classifies all the non-fraud cases correctly (100 %) and misclassifies only 4 fraud cases. The percentage of correct classification for fraud cases is 86 %. The tree presented here uses the deposits and cash to current assets ratio as the first splitter. This ratio indicates how well the company converts its non-liquid assets into cash. At the second level of the tree, retained earnings / total assets and fixed assets / total assets have been used as splitters. Table 4 consists of all the ratios used by the tree.

                        Table: 4
S. No.    Financial Ratios / Items
  1       Net profit/Total assets
  2       Fixed assets/Total assets
  3       Deposits and cash/Current assets
  4       Working Capital / Total Assets
  5       Sales / Total Assets
  6       Retained earnings / Total Assets

The second classification technique, the Naïve Bayesian Classifier, has been implemented using SIPINA Research edition software (32-bit version). The method correctly classifies 89 % of the cases. The confusion matrix is given below (Table 5):

      Table: 5 (Confusion Matrix for Naïve Bayesian Classifier)
Label               NF (Non Fraud)      F (Fraud)
NF                  83                  2
F                   10                  19

Genetic programming has been implemented using the tool Discipulus version 5.1. The data set has been divided into training and validation data. The training data set is used to train the model and the validation data set is used exclusively for validation. 80% of the whole dataset is used for training, while 20% is used for validation. Since our dependent variable (target output) is binary, we select "hits then fitness" as the fitness function. Every single run of Discipulus has been set to terminate after it has gone 50 generations with no improvement in fitness. The confusion matrix for genetic programming is given as Table 6.

      Table: 6 (Confusion Matrix for Genetic Programming)
Label               NF (Non Fraud)      F (Fraud)
NF                  84                  1
F                   13                  16

Table 7 shows the impact of the various input variables on the model.

      Table: 7 Impact of input variables (Genetic Programming)
S.No.   Variable                                      Frequency   Average Impact   Maximum Impact
  1     Debt                                          0.06        00.00000         00.00000
  2     Inventory/Primary business income             0.35        22.52747         53.84615
  3     Inventory/Total assets                        0.35        09.70696         20.87912
  4     Net profit/Total assets                       0.06        02.19780         02.19780
  5     Cash/Total assets                             0.29        03.84615         05.49451
  6     Total debt/Total assets                       0.12        00.00000         00.00000
  7     Fixed assets/Total assets                     0.00        00.00000         00.00000
  8     Deposits and cash/Current assets              0.18        06.59341         06.59341
  9     Working Capital / Total Assets                0.06        00.00000         00.00000
 10     Sales / Total Assets                          0.00        00.00000         00.00000
 11     Net income / Fixed Assets                     0.41        07.69231         09.89011
 12     Revenue / Total Assets                        0.29        09.01099         14.28571
 13     EBIT                                          0.06        05.49451         05.49451
 14     Z score                                       0.06        19.78022         19.78022
 15     Accounts receivable/Primary business income   0.29        00.54945         01.09890
 16     Primary business income/Total assets          0.18        02.74725         03.29670
 17     Primary business income/Fixed assets          0.41        03.29670         08.79121
 18     Capitals and reserves/Total debt              0.00        00.00000         00.00000
 19     Gross profit/Primary business profit          0.53        05.65149         09.89011
 20     Accounts Receivable / Sales                   0.00        00.00000         00.00000
 21     Retained earnings / Total Assets              0.18        02.93040         04.39560
 22     EBIT / Total Assets                           0.24        03.29670         05.49451

     7. Conclusion

In this study, data mining methods of good repute are implemented on a dataset collected from the financial statements of 114 companies for classifying organizations as fraud or non-fraud. We collected and compiled 52 financial variables / ratios. Then, one-way ANOVA is used for finding informative variables on the basis of p-value. Then three intelligent classification methods, namely CART, Naïve Bayesian Classifier and Genetic Programming, are applied on the 22 informative ratios. In order to have better reliability of the results, ten-fold cross validation has been implemented throughout the study. All three methods have been compared on the basis of sensitivity and specificity. CART produces the best sensitivity and specificity as compared with the other two methods. The accuracy rate of these methods can be further enhanced by using some qualitative information, such as the composition of the administrative board, along with the financial ratios used in this research.

References:

[3] Kantardzic, M. (2002). Data Mining: Concepts, Models, Methods, and Algorithms. Wiley-IEEE Press.
[4] PriceWaterhouse&Coopers (2007). Economic crime: People, culture and controls. The 4th Biennial Global Economic Crime Survey, available at:
[5] Association of Certified Fraud Examiners (2006). 2006 ACFE Report to the nation on Occupational fraud and abuse. Technical report, Association of Certified Fraud Examiners, USA, available at:
[6] Beasley, M. (1996). An empirical analysis of the relation between board of director composition and financial statement fraud. The Accounting Review, 71(4), 443–466.

           Table: 8 (Performance Matrix)                        Green, B. P., & Choi, J. H. (1997). Assessing the risk of
                                                              management fraud through neural-network technology.
                                                              Auditing: A Journal of Practice and Theory, 16(1), 14–28.
S.No.    Predictor       Sensitivity      Specificity
                         (%)              (%)
 1       CART               86.2             100                Fanning, K., & Cogger, K. (1998). Neural network detection of
                                                              management fraud using published financial data. International
 2       Naïve              65.5             97.6
                                                              Journal of Intelligent Systems in Accounting, Finance &
         Bayesian                                             Management, 7(1), 21–24.
 3       Genetic               53            99.2             9
         Programming                                             Efstathios Kirkos, Charalambos Spathis & Yannis
                                                              Manolopoulos (2007). Data mining techniques for the detection
                                                              of fraudulent financial statements. Expert Systems with
                                                              Applications 32 (23) (2007) 995–1003
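The evaluation procedure referred to above (ten-fold cross validation, with methods compared on sensitivity and specificity) can be illustrated with a minimal sketch. It is not the implementation used in this study: the single-ratio threshold rule `fit_threshold` is a hypothetical stand-in for CART, Naive Bayesian or Genetic Programming, the helper names are chosen for illustration only, and the data in the usage example are synthetic. Label 1 is assumed to mark a fraudulent statement.

```python
import random


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity %, specificity %); label 1 marks a fraudulent firm."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud flagged as fraud
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-fraud passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-fraud flagged
    return 100.0 * tp / (tp + fn), 100.0 * tn / (tn + fp)


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_threshold(xs, ys):
    """Hypothetical stand-in classifier: cut at the midpoint of the class means."""
    fraud = [x for x, y in zip(xs, ys) if y == 1]
    clean = [x for x, y in zip(xs, ys) if y == 0]
    mean_fraud = sum(fraud) / len(fraud)
    mean_clean = sum(clean) / len(clean)
    return (mean_fraud + mean_clean) / 2.0, mean_fraud > mean_clean


def cross_validate(xs, ys, k=10):
    """Predict every sample while it is held out, then score the pooled predictions."""
    y_pred = [0] * len(xs)
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        threshold, fraud_is_high = fit_threshold(
            [xs[i] for i in train], [ys[i] for i in train]
        )
        for i in fold:
            y_pred[i] = int((xs[i] > threshold) == fraud_is_high)
    return sensitivity_specificity(ys, y_pred)


if __name__ == "__main__":
    # Synthetic single-ratio data, for illustration only (not the study's sample).
    rng = random.Random(1)
    ys = [0] * 50 + [1] * 50
    xs = [rng.gauss(0.2, 0.1) for _ in range(50)] + [rng.gauss(0.5, 0.1) for _ in range(50)]
    sens, spec = cross_validate(xs, ys)
    print(f"sensitivity = {sens:.1f}%, specificity = {spec:.1f}%")
```

Pooling the held-out predictions before computing the two rates is one common convention; averaging the per-fold rates is another, and the paper does not state which was used.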


Rajan Gupta obtained a master's degree in computer application from
the Department of Computer Science & Application, Guru Jambheshwar
University, Hisar, Haryana, India, and a Master of Philosophy degree
in Computer Science from Madurai Kamaraj University, Madurai, India.
He is currently pursuing a doctorate in Computer Science at the
Department of Computer Science & Application, Maharshi Dayanand
University, Rohtak, Haryana, India.

Dr Nasib S. Gill obtained a doctorate in computer science and carried
out post-doctoral research in Computer Science at Brunel University,
U.K. He is currently Professor and Head of the Department of Computer
Science and Application, Maharshi Dayanand University, Rohtak,
Haryana, India. He has more than 22 years of teaching and 20 years of
research experience. His areas of interest include software metrics,
component based metrics, testing, reusability, Data Mining and Data
warehousing, NLP, AOSD, and Information and Network Security.
