Detecting Phishing Web
Pages with Visual
Similarity Assessment
Based on Earth
Mover’s Distance (EMD)
Speaker
Po-Jiu Wang
Institute of Information Science Academia Sinica
Author
Anthony Y. Fu
Department of Computer Science, City University of Hong
Kong
IEEE 2006
Outline
What is phishing
Various phishing techniques
Previous anti-phishing works
Evaluating webpage distance with EMD
What is EMD, and its advantage
Color and its coordinate distance with EMD
Conclusion and tentative work to do
2
What is phishing
Phishing is a criminal trick that steals
personal information by luring people
into accessing a fake webpage.
How are people lured?
Through phishing emails, BBS posts, chat rooms, etc.
Spoofing pretexts: free gifts, identity confirmation, etc.
3
Various phishing techniques
The most straightforward way for a phisher
to spoof people is to make the appearance
of webpage links and webpages similar to
the real ones.
4
Various phishing techniques
(Link based phishing obfuscation)
Link based phishing obfuscation can be
carried out in the four ways below:
Adding a suffix to the domain name of the URL.
E.g., revise www.citibank.com to
www.citibank.com.us.ebanking;
Using an actual link that differs from the visible link.
E.g., the HTML line: <a
href="http://www.citibank.com.us.ebanking">
www.citibank.com</a>;
5
Various Phishing Techniques
(Link based phishing obfuscation 1)
Using a bug in the real website to redirect to
other webpages.
E.g., the bug of the eBay website:
http://cgi.ebay.com/ws/eBayISAPI.dll?MfcISAPICommand=RedirectToDomain&DomainUrl=PHISHINGLINK
can direct you to any specified PHISHINGLINK;
And replacing characters in the real link with
similar-looking ones.
E.g., replace “I” (uppercase “i”) with “l” (lowercase “L”) or “1” (Arabic
numeral one), such as WWW.CITIBANK.COM to
WWW.C1TlBANK.COM.
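Two of the link tricks above (a visible link that differs from the actual link, and look-alike characters) can be checked mechanically. The following is a minimal sketch using only the Python standard library; the names `mismatched_links`, `looks_like`, and the small look-alike table are hypothetical helpers for illustration and are not part of the presented method.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorCollector(HTMLParser):
    """Collect (href, visible_text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(text):
    """Return the lowercase host of a URL or bare domain, or '' if none."""
    if "//" not in text:
        text = "//" + text
    return (urlparse(text).hostname or "").lower()


def mismatched_links(html):
    """Return links whose visible text names a different host than the href."""
    parser = AnchorCollector()
    parser.feed(html)
    return [
        (href, text)
        for href, text in parser.links
        if "." in text and host_of(text) != host_of(href)
    ]


# Assumed, deliberately tiny table of look-alike characters.
LOOKALIKES = str.maketrans({"1": "l", "i": "l", "0": "o"})


def looks_like(candidate, real):
    """True if candidate differs from real only by look-alike characters."""
    same_skeleton = (
        candidate.lower().translate(LOOKALIKES)
        == real.lower().translate(LOOKALIKES)
    )
    return candidate.lower() != real.lower() and same_skeleton
```

For the HTML example above, `mismatched_links` flags the anchor because the visible host is www.citibank.com while the actual host is www.citibank.com.us.ebanking. Checks of this kind only cover link based obfuscation, which is why the talk moves on to visual similarity.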
6
Various Phishing Techniques
(webpage based obfuscation)
Webpage based obfuscation can be
carried out in the three basic ways below:
Using a webpage downloaded from the real
website to make the phishing webpage
appear and react exactly the same as the
real one;
7
Various Phishing Techniques
(webpage based obfuscation 1)
Using a script or a web browser add-in to
cover the address bar so that users
believe they have entered the correct
website;
And using visual based content (e.g.,
image, flash, video, etc.) rather than HTML
to avoid HTML based phishing detection.
8
Previous Anti-Phishing Works
Anti-Spamming
Phishing email is spam. Phishers harvest email
addresses and broadcast to the
potential victims.
Human aided
Banks employ a group of people to monitor
phishing activities, e.g., HSBC.
9
Previous Anti-Phishing Works (1)
Duplicate document detection approaches,
which focus on plain text documents and
use pure text features in similarity measure.
10
Motivation
Phishing Web pages always have high
visual similarity with the real Web pages.
An effective approach called image-based
EMD is proposed to calculate the visual
similarity of Web pages.
11
Evaluating webpage distance with
EMD
EMD is the Earth Mover’s Distance; it is
based on the well-known transportation
problem.
Suppose we have m producers
\( P = \{(p_1, w_{p_1}), (p_2, w_{p_2}), \ldots, (p_m, w_{p_m})\} \)
and n consumers
\( C = \{(c_1, w_{c_1}), (c_2, w_{c_2}), \ldots, (c_n, w_{c_n})\} \).
A distance matrix \( D = [d_{ij}] \) is given.
12
Evaluating webpage distance with
EMD (transportation fee)
The task is to find a flow matrix F =[fij] which
contains factors indicating the amount of
product to be moved from one producer to
one consumer.
13
Evaluating webpage distance with
EMD (total cost of transportation fee)
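The total cost of transportation fee can be
represented as:
\[
\mathrm{COST}(P, C, F) = \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}
\]
subject to:
\[
\begin{aligned}
& f_{ij} \ge 0, && 1 \le i \le m,\ 1 \le j \le n \\
& \sum_{j=1}^{n} f_{ij} \le w_{p_i}, && 1 \le i \le m \\
& \sum_{i=1}^{m} f_{ij} \le w_{c_j}, && 1 \le j \le n \\
& \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\Bigl( \sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{n} w_{c_j} \Bigr)
\end{aligned}
\]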
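To make the constraints concrete, here is a minimal Python sketch that checks whether a candidate flow matrix satisfies them and evaluates its transportation cost. It does not search for the optimal flow; the function names and the tolerance `tol` are assumptions introduced for illustration.

```python
def is_feasible_flow(flow, wp, wc, tol=1e-9):
    """Check the four transportation constraints for a candidate flow.

    flow[i][j] is the amount moved from producer i to consumer j,
    wp[i] is the weight of producer i, wc[j] the weight of consumer j.
    """
    m, n = len(wp), len(wc)
    # Every flow entry must be non-negative.
    if any(flow[i][j] < -tol for i in range(m) for j in range(n)):
        return False
    # A producer cannot send more than its weight.
    if any(sum(flow[i]) > wp[i] + tol for i in range(m)):
        return False
    # A consumer cannot receive more than its weight.
    if any(sum(flow[i][j] for i in range(m)) > wc[j] + tol for j in range(n)):
        return False
    # The total flow must equal the smaller of the two total weights.
    total = sum(sum(row) for row in flow)
    return abs(total - min(sum(wp), sum(wc))) <= tol


def transport_cost(flow, dist):
    """Sum of flow[i][j] * dist[i][j] over all producer/consumer pairs."""
    return sum(
        f * d
        for flow_row, dist_row in zip(flow, dist)
        for f, d in zip(flow_row, dist_row)
    )


wp, wc = [0.4, 0.6], [0.5, 0.5]
dist = [[0.0, 1.0], [1.0, 0.0]]
flow = [[0.4, 0.0], [0.1, 0.5]]
feasible = is_feasible_flow(flow, wp, wc)
cost = transport_cost(flow, dist)
```

In this toy instance only 0.1 units travel a distance of 1, so the cost is 0.1. Finding the flow that minimizes the cost is a linear program and would need a solver that is not shown here.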
14
Evaluating webpage distance with
EMD (final equation of EMD)
The EMD can be represented as:
\[
\mathrm{EMD}(P, C, D) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}
\]
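The normalization by total flow is what lets EMD handle signatures of unequal total weight. The sketch below evaluates the EMD ratio for a given flow; it assumes the flow passed in is already the cost-minimizing one, and the name `emd_from_flow` is hypothetical.

```python
def emd_from_flow(flow, dist):
    """Return total transportation cost divided by total flow.

    Assumes `flow` is the optimal flow of the transportation problem;
    this function only evaluates the ratio, it does not optimize.
    """
    total_flow = sum(f for row in flow for f in row)
    if total_flow <= 0:
        raise ValueError("flow must move a positive amount")
    cost = sum(
        f * d
        for flow_row, dist_row in zip(flow, dist)
        for f, d in zip(flow_row, dist_row)
    )
    return cost / total_flow


# One producer of weight 1.0 and two consumers of weight 0.25 each:
# only 0.5 units can be moved, so this is a partial match.
flow = [[0.25, 0.25]]
dist = [[1.0, 3.0]]
emd = emd_from_flow(flow, dist)
```

Here the cost is 0.25 * 1.0 + 0.25 * 3.0 = 1.0 and the total flow is 0.5, so the ratio is 2.0, the average distance travelled per unit moved.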
15
Advantage of EMD
Represent problems involving multi-
featured signatures
Allow for partial matches in a very natural
way
Fit for cognitive distance evaluation
16
Color and its coordinate distance
with EMD (Preprocess image data)
Preprocess image data
Compress them to 10*10 pixels
Experiments show that the calculation time can be heavily
reduced through image size compression without reducing
the precision and recall
E.g.
17
The calculation of the distance of
pixel color and coordinate
Get the signatures of webpage1 and webpage2
using pixel color and coordinate.
Calculate D = [dij], where
dij = Distance(Color(pixeli), Color(pixelj), Coordinate(pixeli), Coordinate(pixelj)).
EMDColorAndCoordinate = EMDDist(Signature1, Signature2, D)
18
The improved color space
The color of each pixel in the resized
images is represented using the ARGB
(alpha, red, green, and blue) scheme with
4 bytes (32 bits).
A degraded color space is needed; it is obtained with a
Color Degrading Factor (CDF), which reduces the 2^8 levels
of each channel to 2^8/CDF.
Thus, the degraded color space contains \( (2^8/\mathrm{CDF})^4 \) colors.
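As a rough illustration of color degrading, the sketch below quantizes each ARGB channel by integer division with the CDF. Integer division is an assumption: the slide only gives the size of the degraded space, not the exact mapping. The function names are hypothetical.

```python
def degrade_argb(pixel, cdf=32):
    """Map an (A, R, G, B) tuple with 0..255 channels to a degraded color.

    Assumption: each channel is quantized by integer division with CDF.
    """
    return tuple(channel // cdf for channel in pixel)


def degraded_space_size(cdf=32):
    """Number of degraded colors, (2^8 / CDF)^4, for a CDF dividing 256."""
    return (2 ** 8 // cdf) ** 4


degraded = degrade_argb((255, 200, 100, 0))
size = degraded_space_size(32)
```

With CDF = 32 each channel keeps 8 levels, so the pixel (255, 200, 100, 0) becomes (7, 6, 3, 0) and the space shrinks from 2^32 colors to 8^4 = 4096.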
19
The centroid of degraded color space
The centroid of each degraded color is calculated
using:
\[
C_{dc} = \frac{\sum_{i=1}^{N_{dc}} C_{dc,i}}{N_{dc}}
\]
where \( C_{dc} \) is the centroid of degraded color dc,
\( C_{dc,i} \) is the coordinates of the ith pixel
that has degraded color dc, and
\( N_{dc} \) is the total number of pixels
that have degraded color dc.
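The centroid formula amounts to averaging the coordinates of all pixels that share a degraded color. A minimal sketch follows; the input layout (a list of coordinate and color pairs) and the name `color_centroids` are assumptions made for illustration.

```python
from collections import defaultdict


def color_centroids(pixels):
    """Compute the centroid and pixel count of each degraded color.

    pixels: iterable of ((x, y), degraded_color) pairs.
    Returns {degraded_color: ((centroid_x, centroid_y), pixel_count)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), dc in pixels:
        acc = sums[dc]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return {dc: ((sx / n, sy / n), n) for dc, (sx, sy, n) in sums.items()}


pixels = [
    ((0, 0), (7, 0, 0, 0)),
    ((2, 0), (7, 0, 0, 0)),
    ((1, 3), (7, 7, 7, 7)),
]
centroids = color_centroids(pixels)
```

The pixel count returned alongside each centroid is the natural candidate for the weight of that color in the signature, although the slides do not spell out the weighting.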
20
Computing visual similarity from
EMD
First, the normalized Euclidean distance of the
degraded ARGB colors is calculated, and then
the normalized Euclidean distance of centroids is
calculated.
21
The maximum color distance
Suppose feature i is \( \langle dc_i, C_{dc_i} \rangle \), where
\( dc_i = \langle dA_i, dR_i, dG_i, dB_i \rangle \), and feature j is \( \langle dc_j, C_{dc_j} \rangle \),
where \( dc_j = \langle dA_j, dR_j, dG_j, dB_j \rangle \). The maximum
color distance is
22
The normalized color distance
The normalized color distance NDcolor is defined
as
23
The normalized centroid distance
The maximum centroid distance is \( MD_{centroid} = \sqrt{w^2 + h^2} \),
where w and h are the width and height of the
resized images, respectively. The normalized
centroid distance NDcentroid is defined as
24
Final equation of EMD
The two distances are added up with weights p
and q, respectively, to form the feature distance,
where p + q = 1.
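A minimal sketch of the weighted feature distance is given below. Two points are assumptions rather than statements from the slides: the maximum color distance is taken to be the Euclidean distance between the two extreme degraded colors (every channel at 0 versus every channel at 2^8/CDF - 1), and each normalized distance is taken to be the Euclidean distance divided by its maximum. The name `feature_distance` is hypothetical.

```python
import math


def feature_distance(feature_i, feature_j, w, h, cdf=32, p=0.5, q=0.5):
    """Weighted sum of normalized color and centroid distances.

    Each feature is (degraded_color, centroid), with degraded_color a
    4-tuple (dA, dR, dG, dB) and centroid an (x, y) pair.
    """
    (dc_i, c_i), (dc_j, c_j) = feature_i, feature_j
    # Assumed largest value of one degraded channel.
    max_channel = 2 ** 8 // cdf - 1
    # Assumed maximum color distance: all four channels differ maximally.
    md_color = math.sqrt(4 * max_channel ** 2)
    # Maximum centroid distance: the diagonal of the resized image.
    md_centroid = math.sqrt(w ** 2 + h ** 2)
    nd_color = math.dist(dc_i, dc_j) / md_color
    nd_centroid = math.dist(c_i, c_j) / md_centroid
    return p * nd_color + q * nd_centroid


same_color = feature_distance(
    ((3, 3, 3, 3), (0.0, 0.0)), ((3, 3, 3, 3), (30.0, 40.0)), w=100, h=100
)
```

With equal colors the color term vanishes, and the result is 0.5 times the centroid distance of 50 divided by the image diagonal. Because both terms are normalized to [0, 1] and p + q = 1, the feature distance also stays in [0, 1].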
25
Computing EMD-based visual
similarity of two images
The amplifier of visual similarity takes a value in (0, +∞).
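The similarity equation itself did not survive on this slide. As a hedged sketch only, the code below assumes the similarity has the form one minus the EMD raised to the amplifier, with the EMD already normalized to [0, 1]; both the form and the parameter name `alpha` are assumptions, not taken from the slide.

```python
def visual_similarity(emd, alpha):
    """Assumed form: similarity = 1 - emd ** alpha.

    emd must lie in [0, 1]; alpha is the amplifier and must be positive.
    """
    if not 0.0 <= emd <= 1.0:
        raise ValueError("emd must be in [0, 1]")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 - emd ** alpha


similarity = visual_similarity(0.25, alpha=0.5)
```

Under this assumed form, an amplifier below 1 stretches small EMD values apart: an EMD of 0.25 gives a similarity of 0.5 rather than 0.75.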
26
An improved adjusted threshold for
classification
A special threshold for each given protected web
page is used to classify a web page as a
phishing web page or a normal one.
\( T_i \) (\( 1 \le i \le N_{protected} \)) denotes the threshold of the
ith protected Web page:
\[
T_i = \arg\min_{t \in VSS_i} \mathrm{MissClassification}(t)
\]
27
Two types of misclassifications
False alarm
The visual similarity is larger than or equal to t but, in fact, the web
page is not a phishing Web page (false positive).
Missing
The visual similarity is less than t but, in fact, the web page is a
phishing one (false negative).
VSSi is associated with two accessory parameters: the number of
false alarms and the number of false negatives.
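Threshold training for one protected page can be sketched as a search over candidate thresholds. The code below assumes that the candidate set VSSi is the set of similarity values observed on the training pages, that MissClassification(t) is the sum of false alarms and misses, and that ties go to the smallest candidate; none of these details are stated on the slides.

```python
def train_threshold(similarities, is_phishing):
    """Pick the candidate threshold with the fewest misclassifications.

    similarities[k] is the visual similarity of training page k to the
    protected page; is_phishing[k] tells whether page k is a phishing page.
    """
    labelled = list(zip(similarities, is_phishing))

    def misclassifications(t):
        # False alarm: similarity >= t but the page is not phishing.
        false_alarms = sum(1 for vs, ph in labelled if vs >= t and not ph)
        # Missing: similarity < t but the page is phishing.
        missing = sum(1 for vs, ph in labelled if vs < t and ph)
        return false_alarms + missing

    # min() returns the first minimal element, so ties go to the smallest t.
    return min(sorted(set(similarities)), key=misclassifications)


threshold = train_threshold(
    [0.20, 0.40, 0.90, 0.95], [False, False, True, True]
)
```

In this toy set the candidate 0.90 separates the two normal pages from the two phishing pages with no errors, so it is selected.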
28
The way to classify phishing
page
When a suspected web page comes, its visual
similarity vector can be represented as
\( VS = \langle vs_1, vs_2, \ldots, vs_{N_{protected}} \rangle \),
and the classification result is given by the following
equation:
\[
\mathrm{IsPhishing}(VS) =
\begin{cases}
1 & \text{if } \max(VS - T) \ge 0 \\
0 & \text{if } \max(VS - T) < 0
\end{cases}
\]
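The decision rule compares each similarity with the threshold of the corresponding protected page and flags the suspect if any of them reaches its threshold. A minimal sketch, with the hypothetical name `is_phishing`:

```python
def is_phishing(vs, thresholds):
    """Return 1 if any similarity reaches its page's threshold, else 0.

    vs[i] is the similarity of the suspected page to protected page i,
    thresholds[i] is the trained threshold T_i of that protected page.
    """
    if len(vs) != len(thresholds):
        raise ValueError("vs and thresholds must have the same length")
    return 1 if max(s - t for s, t in zip(vs, thresholds)) >= 0 else 0


# Thresholds of two protected pages, as examples.
thresholds = [0.8469, 0.9434]
flagged = is_phishing([0.50, 0.95], thresholds)
passed = is_phishing([0.50, 0.90], thresholds)
```

Taking the maximum of the element-wise differences means a suspect only needs to resemble one protected page closely enough to be flagged.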
29
Experiment configuration of
phishing detection performance
10,272 homepages are selected from the web.
9 phishing web pages target 8 real
protected web pages.
The 10,272+9 web pages are mixed together to
form the Suspected Webpage Set.
1,000 web pages randomly selected from the
10,272 are combined with the 9 phishing
webpages to form the Training Webpage Set.
30
Train a threshold vector
We use the Training Webpage Set to train a
threshold vector
Protected Webpage                 Threshold (T)
real-Bank of Oklahoma - Online    0.8469
real-ebay1                        0.9434
real-eBay2                        0.9493
real-ICBC(Asia)                   0.7385
real-Key Bank                     0.9323
real-us bank                      0.9573
real-Washington Mutual            0.8541
real-Wells Fargo Sign On          0.9255
31
Classification precision, phishing recall,
and false alarm list
( = 0, 9281 Suspected Web Pages)
32
Classification precision, phishing recall,
and false alarm list
( = 0.005, 9281 Suspected Web Pages)
Reduce false negative possibilities !!
33
Phishing detection performance of
image-based EMD
There are 65
false alarms
34
Phishing detection performance of
HTML/DOM-based EMD
There are 849
false alarms
35
Phishing detection performance of
similarity assessment-based EMD
There are 697
false alarms
36
Experiment results
The threshold vector is used to classify a
suspected webpage.
In order to reduce false negative possibilities,
there is a necessary sacrifice needed under
0.005
Empirically set the parameters w =h =100,
=0.5,|Ss| =20, p=q=0.5, and CDF=32 in our
experiments by tuning.
37
The number of ground truth web
pages for each protected web page
38
The configuration of tuning the
parameters
Take Nsample = 5, 10, ..., 50 as the sample
number for each protected web page.
If a web page among the Nsample collected web pages
is in the corresponding ground truth group, it is
counted as a correctly detected similar web page.
39
Tuning the parameters
(w and h)
Four configuration options (w = h = 10, 10√10, 100,
and 100√10) are used to tune w and h.
40
Tuning the parameters
(p and q)
11 configuration options (p : q = 0 : 1; 0.1 : 0.9;
0.2 : 0.8; ...; 0.9 : 0.1; 1 : 0) are used to tune p
and q.
41
Tuning the parameters
(sample color number)
Six configuration options (|Ss| = 5, 10, 15, 20, 25,
and 30) are used to tune |Ss|.
42
Tuning the parameters
(CDF)
Eight configuration options (CDF = 8, 16, 24, 32,
40, 48, 56, and 64) are used to tune CDF.
43
The architecture of the built anti-phishing
system
44
Conclusions
This approach works at the pixel level of Web
pages rather than at the text level.
Experiments show that our method can
achieve satisfactory classification precision and
phishing recall.
The time efficiency of computation is also
acceptable for online phishing detection.
45
Tentative work
Continue with more phishing examples and
even larger-scale datasets.
The method cannot detect phishing pages that are
not visually similar to the protected pages.
Keep working on developing a client-side
application.
46
Thanks for your attention.
47