Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover’s Distance (EMD)
Speaker: Po-Jiu Wang, Institute of Information Science, Academia Sinica
Author: Anthony Y. Fu, Department of Computer Science, City University of Hong Kong
IEEE 2006

Outline
   What is phishing
   Various phishing techniques
   Previous anti-phishing works
   Evaluating webpage distance with EMD
     What is EMD, and its advantages
     Color and coordinate distance with EMD
   Conclusion and tentative work

What is phishing
   Phishing is a criminal trick for stealing
    personal information by luring people
    into visiting a fake webpage.

   How are people lured?
     Through phishing emails, BBS posts, chat rooms, etc.
     With spoofed pretexts: a free gift, identity confirmation, etc.

Various phishing techniques
   The most straightforward way for a phisher
    to deceive people is to make links and
    webpages look similar to the real ones.

Various phishing techniques
(Link-based phishing obfuscation)
   Link-based phishing obfuscation can be
    carried out in the four ways below:

     Adding a suffix to the domain name of the URL.
      E.g., changing www.citibank.com to
      www.citibank.com.us.ebanking;

     Using an actual link that differs from the visible link.
      E.g., the HTML line:
      <a href="http://www.citibank.com.us.ebanking">www.citibank.com</a>;

Various Phishing Techniques
(Link-based phishing obfuscation, continued)
     Using a bug in the real website to redirect to
      other webpages.
      E.g., a bug in the eBay website:
      http://cgi.ebay.com/ws/eBayISAPI.dll?MfcISAPICommand=RedirectToDomain&DomainUrl=PHISHINGLINK
      can direct you to any specified PHISHINGLINK;

     Replacing characters in the real link with
      similar-looking ones.
      E.g., replacing “I” (uppercase “i”) with “l” (lowercase “L”) or “1” (the
      digit one), such as changing WWW.CITIBANK.COM to
      WWW.C1TlBANK.COM.

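The link-based tricks listed above can be checked mechanically. The following is a minimal sketch, not a method from the talk: the class and function names (`LinkCollector`, `host_of`, `mismatched_links`, `has_suffix_trick`, `is_lookalike`) are hypothetical, and the lookalike table covers only the "I", "l", "1" confusion mentioned on the slide.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(text):
    """Return the lowercase host of a URL-like string, or "" if it has none."""
    text = text.strip()
    if " " in text or "." not in text:
        return ""  # visible text such as "Click here" is not a URL
    if "//" not in text:
        text = "//" + text  # lets urlparse read a bare "www.example.com" as a host
    return (urlparse(text).hostname or "").lower()


def mismatched_links(html):
    """Links whose visible text names one host while the href points to another."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [(href, text) for href, text in collector.links
            if host_of(text) and host_of(text) != host_of(href)]


def has_suffix_trick(host, protected_host):
    """True if the protected host name is used as a prefix of a longer host."""
    return host.lower().startswith(protected_host.lower() + ".")


# Assumption: only the "I" / "l" / "1" confusion from the slide is modeled.
LOOKALIKES = str.maketrans({"1": "l", "i": "l"})


def is_lookalike(host, protected_host):
    """True if two different hosts become identical after merging lookalikes."""
    a, b = host.lower(), protected_host.lower()
    return a != b and a.translate(LOOKALIKES) == b.translate(LOOKALIKES)
```

For example, `is_lookalike("WWW.C1TlBANK.COM", "WWW.CITIBANK.COM")` is true, while a page that links to the genuine host under the text "Click here" is not reported by `mismatched_links`.
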
Various Phishing Techniques
(Webpage-based obfuscation)
   Webpage-based obfuscation can be
    carried out in the three basic ways below:

     Using a webpage downloaded from the real
      website to make the phishing webpage
      look and behave exactly like the
      real one;

Various Phishing Techniques
(Webpage-based obfuscation, continued)
     Using a script or a browser add-in to
      cover the address bar, so that users
      believe they have entered the correct
      website;

     Using visual content (e.g.,
      images, Flash, video) rather than HTML
      to avoid HTML-based phishing detection.

Previous Anti-Phishing Works
   Anti-spamming
     Phishing email is spam. Phishers harvest
     email addresses and broadcast to the
     potential victims.

   Human-aided
     Banks (e.g., HSBC) employ a group of people
     to monitor phishing activities.

Previous Anti-Phishing Works (continued)
   Duplicate document detection approaches,
    which focus on plain-text documents and
    use pure text features in the similarity measure.

Motivation
   Phishing Web pages always have high
    visual similarity with the real Web pages.

   An effective approach called image-based
    EMD is proposed to calculate the visual
    similarity of Web pages.


Evaluating webpage distance with EMD
   EMD is the Earth Mover’s Distance; it is
    based on the well-known transportation
    problem.
     Suppose we have m producers
          $P = \{(p_1, w_{p_1}), (p_2, w_{p_2}), \ldots, (p_m, w_{p_m})\}$
     and n consumers
          $C = \{(c_1, w_{c_1}), (c_2, w_{c_2}), \ldots, (c_n, w_{c_n})\}$
     A distance matrix $D = [d_{ij}]$ is given, where $d_{ij}$ is the
      distance between $p_i$ and $c_j$.

Evaluating webpage distance with EMD (transportation fee)
   The task is to find a flow matrix $F = [f_{ij}]$, where
    each entry $f_{ij}$ is the amount of
    product to be moved from producer $p_i$ to
    consumer $c_j$.

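Evaluating webpage distance with EMD (total cost of transportation fee)
   The total transportation fee can be
    represented as:

    $$\mathrm{COST}(P, C, F) = \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}$$

    subject to (ST):

    $$f_{ij} \ge 0, \quad 1 \le i \le m,\ 1 \le j \le n$$
    $$\sum_{j=1}^{n} f_{ij} \le w_{p_i}, \quad 1 \le i \le m$$
    $$\sum_{i=1}^{m} f_{ij} \le w_{c_j}, \quad 1 \le j \le n$$
    $$\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\left(\sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{n} w_{c_j}\right)$$
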
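Evaluating webpage distance with EMD (final equation of EMD)
   The EMD can be represented as:

    $$\mathrm{EMD}(P, C, D) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}$$

    where $F = [f_{ij}]$ is the flow that minimizes COST(P, C, F).
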
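The slides do not say how the minimizing flow is found. As one possible illustration, the sketch below solves the transportation problem with a successive-shortest-path min-cost flow, assuming integer weights (for example, pixel counts). The function name `emd` and this choice of solver are assumptions made here, not the author's implementation; a general-purpose linear programming solver would work equally well.

```python
def emd(weights_p, weights_c, dist):
    """EMD between two signatures with positive integer weights.

    weights_p[i] is w_pi, weights_c[j] is w_cj, dist[i][j] is d_ij.
    """
    m, n = len(weights_p), len(weights_c)
    source, sink, nodes = m + n, m + n + 1, m + n + 2
    graph = [[] for _ in range(nodes)]  # node -> indices of outgoing edges
    to, cap, cost = [], [], []

    def add_edge(u, v, capacity, c):
        # Forward edge at an even index, its residual twin right after it.
        graph[u].append(len(to)); to.append(v); cap.append(capacity); cost.append(c)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-c)

    for i in range(m):
        add_edge(source, i, weights_p[i], 0.0)          # sum_j f_ij <= w_pi
        for j in range(n):
            add_edge(i, m + j, min(weights_p[i], weights_c[j]), dist[i][j])
    for j in range(n):
        add_edge(m + j, sink, weights_c[j], 0.0)        # sum_i f_ij <= w_cj

    total_flow = min(sum(weights_p), sum(weights_c))    # required total flow
    sent, total_cost = 0, 0.0
    while sent < total_flow:
        # Bellman-Ford: cheapest augmenting path in the residual graph.
        best = [float("inf")] * nodes
        prev_edge = [-1] * nodes
        best[source] = 0.0
        for _ in range(nodes - 1):
            changed = False
            for u in range(nodes):
                if best[u] == float("inf"):
                    continue
                for e in graph[u]:
                    if cap[e] > 0 and best[u] + cost[e] < best[to[e]] - 1e-12:
                        best[to[e]] = best[u] + cost[e]
                        prev_edge[to[e]] = e
                        changed = True
            if not changed:
                break
        # Find the bottleneck along the path, then push that much flow.
        push, v = total_flow - sent, sink
        while v != source:
            e = prev_edge[v]
            push = min(push, cap[e])
            v = to[e ^ 1]
        v = sink
        while v != source:
            e = prev_edge[v]
            cap[e] -= push
            cap[e ^ 1] += push
            total_cost += push * cost[e]
            v = to[e ^ 1]
        sent += push
    return total_cost / total_flow  # COST divided by the total flow


# Partial match: only min(3, 1) = 1 unit has to be moved.
print(emd([3], [1], [[5.0]]))
```

Because the total flow equals the smaller of the two total weights, a signature with surplus weight is only partially matched, which is the "partial match" property of EMD.
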
Advantages of EMD
   Represents problems involving multi-featured
    signatures

   Allows for partial matches in a very natural
    way

   Fits cognitive distance evaluation

Color and its coordinate distance with EMD (Preprocess image data)
   Preprocess image data
     Compress the images to 10*10 pixels
          Experiments show that the calculation time can be heavily
           reduced through image size compression without reducing
           the precision and recall

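The slides do not state which resampling method is used for the size compression. The sketch below uses nearest-neighbor sampling purely as an assumption, with a hypothetical function name `resize_nearest`, to show what reducing a page snapshot to a fixed grid looks like.

```python
def resize_nearest(pixels, new_w, new_h):
    """Shrink an image given as a list of rows to new_w x new_h pixels.

    Assumption: nearest-neighbor sampling; the talk does not name a method.
    """
    old_h, old_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * old_h // new_h][x * old_w // new_w] for x in range(new_w)]
        for y in range(new_h)
    ]


# A 4x4 image whose "pixels" are their own (row, column) positions.
image = [[(y, x) for x in range(4)] for y in range(4)]
small = resize_nearest(image, 2, 2)
print(small)
```
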
The calculation of the distance of pixel color and coordinate
   Get the signatures of webpage1 and webpage2
    using pixel color and coordinate
   Calculate $D = [d_{ij}]$, where
    $d_{ij}$ = Distance(Color($pixel_i$), Color($pixel_j$),
         Coordinate($pixel_i$), Coordinate($pixel_j$))

   EMDColorAndCoordinate =
      EMDDist(Signature1, Signature2, D)

The improved color space
   The color of each pixel in the resized
    images is represented using the ARGB
    (alpha, red, green, and blue) scheme with
    4 bytes (32 bits).
    To reduce the number of distinct colors, a degraded color space,
    controlled by a Color Degrading Factor (CDF), is needed.

    Thus, the size of the degraded color space is $(2^8/\mathrm{CDF})^4$.

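As a small worked example of the degraded color space, the sketch below assumes that degrading means integer division of each 8-bit channel by CDF; the slides give only the resulting space size, so the function names and the division rule are assumptions.

```python
def degrade(argb, cdf=32):
    """Map an (A, R, G, B) color with 8-bit channels to its degraded color.

    Assumption: each channel is integer-divided by CDF.
    """
    return tuple(channel // cdf for channel in argb)


def degraded_space_size(cdf=32):
    """Number of distinct degraded colors: (2^8 / CDF)^4."""
    return (256 // cdf) ** 4


print(degrade((255, 200, 64, 31)))  # each channel falls into one of 8 bins
print(degraded_space_size(32))      # 8 bins per channel, 4 channels
```

With CDF = 32 each channel has 8 levels, so the space shrinks from 2^32 colors to 8^4 = 4096.
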
The centroid of degraded color space
   The centroid of each degraded color is calculated
    using:

    $$C_{dc} = \frac{\sum_{i=1}^{N_{dc}} C_{dc,i}}{N_{dc}}$$

    where $C_{dc}$ is the centroid of degraded color dc,
    $C_{dc,i}$ is the coordinates of the ith pixel that has degraded color dc,
    and $N_{dc}$ is the total number of pixels that have degraded color dc.

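The centroid formula can be illustrated by building a signature from a small image. This is a sketch under assumptions: the `signature` function name and the dictionary layout are hypothetical, and degrading is again taken to be integer division of each channel by CDF.

```python
from collections import defaultdict


def signature(pixels, cdf=32):
    """Return {degraded color: (N_dc, centroid)} for an image.

    pixels is a list of rows of (A, R, G, B) tuples; the centroid is the
    mean (x, y) position of the pixels sharing one degraded color.
    """
    sums = defaultdict(lambda: [0, 0, 0])  # count, sum of x, sum of y
    for y, row in enumerate(pixels):
        for x, argb in enumerate(row):
            dc = tuple(channel // cdf for channel in argb)  # assumed degrading rule
            entry = sums[dc]
            entry[0] += 1
            entry[1] += x
            entry[2] += y
    return {dc: (n, (sx / n, sy / n)) for dc, (n, sx, sy) in sums.items()}


red, blue = (255, 255, 0, 0), (255, 0, 0, 255)
sig = signature([[red, red], [red, blue]])
for color, (count, centroid) in sig.items():
    print(color, count, centroid)
```

Here the three red pixels share one degraded color with centroid (1/3, 1/3), and the single blue pixel has centroid (1, 1).
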
Computing visual similarity from EMD
   First, the normalized Euclidean distance of the
    degraded ARGB colors is calculated, and then
    the normalized Euclidean distance of the centroids is
    calculated.

The maximum color distance

   Suppose $feature_i = \langle dc_i, C_{dc_i} \rangle$, where
    $dc_i = \langle dA_i, dR_i, dG_i, dB_i \rangle$, and $feature_j = \langle dc_j, C_{dc_j} \rangle$,
    where $dc_j = \langle dA_j, dR_j, dG_j, dB_j \rangle$. The maximum
    color distance is

The normalized color distance

   The normalized color distance NDcolor is defined
    as




The normalized centroid distance

   The maximum centroid distance is $MD_{centroid} = \sqrt{w^2 + h^2}$,
    where w and h are the width and height of the
    resized images, respectively. The normalized
    centroid distance $ND_{centroid}$ is defined as

Final equation of EMD

   The two distances are added up with weights p
    and q, respectively, to form the feature distance,
    where p + q = 1.

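The equations for the normalized distances were not preserved with these slides, so the following sketch rests on stated assumptions: each normalized distance is taken to be the Euclidean distance divided by its maximum, the maximum color distance is taken to be the distance between the two extreme degraded colors, and the feature distance is the weighted sum p * ND_color + q * ND_centroid described above. The function name `feature_distance` is hypothetical.

```python
import math


def feature_distance(feature_i, feature_j, w, h, cdf=32, p=0.5, q=0.5):
    """Weighted sum of normalized color and centroid distances (p + q = 1).

    Each feature is (degraded color, centroid), for example
    ((dA, dR, dG, dB), (x, y)).
    """
    (dc_i, c_i), (dc_j, c_j) = feature_i, feature_j
    top = 256 // cdf - 1
    # ASSUMPTION: maximum color distance = distance between the extreme
    # degraded colors (0, 0, 0, 0) and (top, top, top, top).
    md_color = math.dist((0, 0, 0, 0), (top, top, top, top))
    md_centroid = math.hypot(w, h)  # sqrt(w^2 + h^2), as on the slide
    # ASSUMPTION: normalized distance = Euclidean distance / maximum distance.
    nd_color = math.dist(dc_i, dc_j) / md_color
    nd_centroid = math.dist(c_i, c_j) / md_centroid
    return p * nd_color + q * nd_centroid


a = ((0, 0, 0, 0), (0.0, 0.0))
b = ((7, 7, 7, 7), (3.0, 4.0))
print(feature_distance(a, a, w=3, h=4))  # identical features
print(feature_distance(a, b, w=3, h=4))  # opposite colors, opposite corners
```

Under these assumptions every entry of the distance matrix D lies between 0 and 1, so the resulting EMD does as well.
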
Computing EMD-based visual similarity of two images
   $\alpha \in (0, +\infty)$ is the amplifier of visual similarity

An improved adjusted threshold for classification

   A special threshold for each given protected web
    page is used to classify a web page as either a
    phishing web page or a normal one.

    $T_i$ ($1 \le i \le N_{protected}$) denotes the threshold of the
    ith protected Web page:

    $$T_i = \arg\min_{t \in VSS_i} \mathrm{MissClassification}(t)$$

Two types of misclassifications
   False alarm
       The visual similarity is larger than or equal to t but, in fact, the web
        page is not a phishing web page (false positive).

   Missing
       The visual similarity is less than t but, in fact, the web page is a
        phishing one (false negative).

   $VSS_i$ is associated with two accessory parameters: the number of false
    alarms and the number of false negatives.

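Choosing a threshold that minimizes misclassifications can be sketched as follows. The slides do not define the candidate set or the exact cost, so this example assumes that the candidates are the observed similarity values of the training pages and that false alarms and misses count equally; the function name `train_threshold` is hypothetical.

```python
def train_threshold(similarities, is_phishing):
    """Pick the threshold t with the fewest misclassifications.

    similarities[k] is the visual similarity of training page k to one
    protected page; is_phishing[k] says whether page k really targets it.
    """
    samples = list(zip(similarities, is_phishing))

    def misclassifications(t):
        false_alarms = sum(1 for vs, phish in samples if vs >= t and not phish)
        misses = sum(1 for vs, phish in samples if vs < t and phish)
        return false_alarms + misses  # ASSUMPTION: both errors weigh the same

    # ASSUMPTION: candidate thresholds are the observed similarity values.
    candidates = sorted(set(similarities))
    return min(candidates, key=misclassifications)


t = train_threshold([0.2, 0.5, 0.9, 0.95], [False, False, True, True])
print(t)
```

In this toy data set the two phishing pages have similarities 0.9 and 0.95, and a threshold equal to the lower of them separates the classes without error.
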
The way to classify a phishing page
   When a suspected web page arrives, its visual
    similarity vector can be represented as

    $$VS = (vs_1, vs_2, \ldots, vs_{N_{protected}})$$

    and the classification result is obtained using the following
    equation:

    $$\mathrm{IsPhishing}(VS) = \begin{cases} 1 & \text{if } \max(VS - T) \ge 0 \\ 0 & \text{if } \max(VS - T) < 0 \end{cases}$$

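The classification rule compares each similarity with the threshold of the corresponding protected page. A minimal sketch, with a hypothetical function name and made-up similarity values:

```python
def is_phishing(vs, thresholds):
    """Return 1 if any similarity reaches its page's threshold, else 0.

    vs[i] is the visual similarity to the ith protected page and
    thresholds[i] is that page's trained threshold T_i.
    """
    return int(max(v - t for v, t in zip(vs, thresholds)) >= 0)


# Two protected pages; the similarity values are illustrative only.
thresholds = [0.8469, 0.9434]
print(is_phishing([0.50, 0.95], thresholds))  # second page's threshold is reached
print(is_phishing([0.50, 0.90], thresholds))  # no threshold is reached
```

The index at which `v - t` is largest also identifies which protected page the suspected page most likely imitates.
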
Experiment configuration of phishing detection performance
   10,272 homepages are selected from the web.
   9 phishing web pages targeting 8 real
    protected web pages are collected.
   The 10,272 + 9 web pages are mixed together to
    form the Suspected Webpage Set.
   1,000 web pages randomly selected from the
    10,272 are combined with the 9 phishing
    web pages to form the Training Webpage Set.

Train a threshold vector
   We use the Training Webpage Set to train a
    threshold vector:

       Protected Webpage                   Threshold (T)
       real-Bank of Oklahoma - Online      0.8469
       real-ebay1                          0.9434
       real-eBay2                          0.9493
       real-ICBC(Asia)                     0.7385
       real-Key Bank                       0.9323
       real-us bank                        0.9573
       real-Washington Mutual              0.8541
       real-Wells Fargo Sign On            0.9255

Classification precision, phishing recall,
and false alarm list
( = 0, 9281 Suspected Web Pages)




Classification precision, phishing recall,
and false alarm list
( = 0.005, 9281 Suspected Web Pages)




   This setting reduces the possibility of false negatives.

Phishing detection performance of image-based EMD
   There are 65 false alarms.

Phishing detection performance of HTML/DOM-based EMD
   There are 849 false alarms.

Phishing detection performance of similarity assessment-based EMD
   There are 697 false alarms.

Experiment results
   The threshold vector is used to classify a
    suspected webpage.

   In order to reduce false negative possibilities,
     there is a necessary sacrifice needed under
      0.005
   The parameters are set empirically, by tuning, to w = h = 100,
    $\alpha$ = 0.5, |Ss| = 20, p = q = 0.5, and CDF = 32 in our
    experiments.

The number of ground truth web pages for each protected web page

The configuration for tuning the parameters
   Take $N_{sample} = 5, 10, \ldots, 50$ as the sample
    number for each protected web page.

   If a web page among the $N_{sample}$ collected web pages
    is in the corresponding ground truth group, it is
    counted as a correctly detected similar web page.

Tuning the parameters
(w and h)
   Four configuration options ($w = h = 10$, $10\sqrt{10}$, $100$,
    and $100\sqrt{10}$) are used to tune w and h.

Tuning the parameters
(p and q)
   11 configuration options (p : q = 0 : 1, 0.1 : 0.9,
    0.2 : 0.8, ..., 0.9 : 0.1, 1 : 0) are used to tune p
    and q.

Tuning the parameters
(sample color number)
   Six configuration options (|Ss| = 5, 10, 15, 20, 25,
    and 30) are used to tune |Ss|.




Tuning the parameters
(CDF)
   Eight configuration options (CDF = 8, 16, 24, 32,
    40, 48, 56, and 64) are used to tune CDF.

The architecture of the built anti-phishing system

Conclusions
   This approach works at the pixel level of Web
    pages rather than at the text level.

   Experiments show that our method can
    achieve satisfactory classification precision and
    phishing recall.

   The time efficiency of computation is also
    acceptable for online phishing detection.

Tentative work
   Continue with more phishing examples and
    even larger-scale datasets.

   The method cannot detect phishing pages that are
    not visually similar to the real ones.

   Keep working on developing a client-side
    application.

Thanks for your attention.




				