CANTINA: A Content-Based Approach to Detecting Phishing Web Sites

    Authors: Yue Zhang, Jason Hong,
              Lorrie Cranor

        Presented By: Kim Giglia
         CSC 682 10/7/2008
Introduction

   Automated tool to detect phishing web sites:
    CANTINA
   June 2006: 9,255 unique phishing sites
    reported
   Estimated costs of phishing websites: $1 -
    $2.8 billion per year
   Previous studies found only one phishing
    detection tool with > 60% accuracy
Tools to detect/prevent phishing

   Education
   Tools/Marks that show trustworthiness
   One-way password hashes
   Proxies that are browser extensions
    (PassPet and WebWallet)
   ISP provided toolbars, services
Two major methods to detect phishing:

   Heuristics – Often produce false positives
   Blacklists – Labor intensive
       One-time URLs reduce the effectiveness of
        blacklists
How CANTINA works (without added
 heuristics)

   Calculate the TF-IDF scores of terms on a
    page
   Generate lexical signature (five terms)
   Search for lexical signature (Google)
   Compare domain name of page to domain names
    of top N results; a match means the page is
    treated as legitimate (N = 30 appears to be
    sufficient)
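
The comparison step above can be sketched in a few lines. This is only an illustration: the helper names `host_of` and `domain_in_top_results` are hypothetical, the result list is passed in directly rather than fetched from a search engine, and matching on the exact host name is an assumption about how "domain name" is compared.

```python
from urllib.parse import urlsplit

def host_of(url):
    """Return the lower-cased host part of a URL ('' if there is none)."""
    return (urlsplit(url).hostname or "").lower()

def domain_in_top_results(page_url, result_urls, n=30):
    """True if the page's host appears among the first n search results.

    Exact host comparison is an assumption made for this sketch.
    """
    page_host = host_of(page_url)
    return any(host_of(u) == page_host for u in result_urls[:n])

# Hypothetical search results for a lexical signature.
results = ["https://www.ebay.com/", "https://en.wikipedia.org/wiki/EBay"]
print(domain_in_top_results("https://www.ebay.com/signin", results))
print(domain_in_top_results("http://ebay.example.net/signin", results))
```

Only the first `n` results are inspected, so a match that appears later in the list does not count.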
What is TF-IDF?

   TF (term frequency) – number of times a
    term appears in a given document
   IDF (inverse document frequency) –
    measure of importance of a term – higher
    when the term is rare in the corpus
   A high TF-IDF weight occurs when TF is
    high and the term appears in few documents
    of the corpus (IDF is high)
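
A small worked example may make the weighting concrete. The sketch below uses one common formulation (raw term count times a smoothed logarithmic IDF); the exact formula, the add-one smoothing, and the toy corpus are assumptions for illustration, not details taken from the slides.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Return {term: TF-IDF weight} for one tokenized document.

    corpus is a list of tokenized documents used only to estimate
    document frequency; it stands in for a real reference corpus.
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in tf.items():
        # Number of corpus documents that contain the term.
        df = sum(1 for d in corpus if term in d)
        # Add-one smoothing (an assumption) keeps unseen terms finite.
        idf = math.log((1 + n_docs) / (1 + df))
        weights[term] = count * idf
    return weights

# Toy corpus and page, invented for this example.
corpus = [
    ["sign", "in", "to", "your", "account"],
    ["your", "account", "settings"],
    ["the", "weather", "in", "your", "city"],
]
page = ["ebay", "sign", "in", "ebay", "your", "account", "ebay"]
scores = tf_idf(page, corpus)
for term, weight in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

Here "ebay" is frequent on the page and absent from the corpus, so it gets the highest weight, while "your" occurs in every corpus document and gets a weight of zero.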
Develop the Lexical Signature

   Take 5 highest weighted TF-IDF terms
   Develop the Robust Hyperlink
       Ex:
        http://dom.com/page.html?ls=t1+t2+t3+t4+t5
   Add the current domain name to the lexical
    signature
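
The signature steps above could look roughly like this. The function names are hypothetical, the `ls` parameter follows the example URL on this slide, and taking the second-to-last host label as the "domain name" is a crude assumption (it would be wrong for hosts such as `example.co.uk`).

```python
from urllib.parse import urlencode, urlsplit

def lexical_signature(weights, k=5):
    """Pick the k terms with the highest TF-IDF weight (ties broken by name)."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

def robust_hyperlink(url, signature):
    """Append the signature as an 'ls' query parameter, terms joined by '+'."""
    return url + "?" + urlencode({"ls": " ".join(signature)})

def search_terms(url, signature):
    """Signature plus the page's domain label (crude extraction, an assumption)."""
    labels = (urlsplit(url).hostname or "").split(".")
    name = labels[-2] if len(labels) >= 2 else labels[0]
    return signature + [name]

# Hypothetical weights for six terms on a page.
weights = {"t1": 5.0, "t2": 4.0, "t3": 3.0, "t4": 2.0, "t5": 1.0, "t6": 0.5}
sig = lexical_signature(weights)
print(robust_hyperlink("http://dom.com/page.html", sig))
print(search_terms("http://dom.com/page.html", sig))
```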
Search for lexical signature

   ZMP (Zero Results Means Phishing) – if
    Google returns no results, the page is
    labeled a phishing site
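
As a rough sketch, the ZMP rule can be folded into the final decision as follows. The function `classify` and its string labels are hypothetical, and the list of result hosts is assumed to have been extracted from the search results already.

```python
def classify(page_host, result_hosts, n=30):
    """Label a page from the search results for its lexical signature.

    result_hosts: host names of the search results, best match first.
    """
    if not result_hosts:
        # ZMP: zero results means phishing.
        return "phishing"
    # Otherwise the page passes only if its host is among the top n results.
    return "legitimate" if page_host in result_hosts[:n] else "phishing"

print(classify("www.ebay.com", ["www.ebay.com", "en.wikipedia.org"]))
print(classify("ebay.example.net", ["www.ebay.com", "en.wikipedia.org"]))
print(classify("ebay.example.net", []))
```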
Additional Heuristics

   Age of Domain
   Known Images
   Suspicious URLs/Links
   IP Address
   Dots in URL
   Forms
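
Four of these heuristics depend only on the URL or the page source and can be sketched directly. The function names, the threshold of five dots, the `@`/`-` test, and the password-field check are assumptions chosen for illustration; age of domain and known images need external data (registration records, a logo set) and are left out.

```python
import ipaddress
import re
from urllib.parse import urlsplit

def has_ip_address(url):
    """IP Address: the host part is a literal IP rather than a name."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def too_many_dots(url, max_dots=5):
    """Dots in URL: more dots than an assumed threshold."""
    return url.count(".") > max_dots

def suspicious_url(url):
    """Suspicious URL: contains '@', or '-' in the host (assumed rule)."""
    host = urlsplit(url).hostname or ""
    return "@" in url or "-" in host

def has_password_form(html):
    """Forms: the page source contains a password input field."""
    pattern = r"<input[^>]*type\s*=\s*[\"']?password"
    return re.search(pattern, html, re.I) is not None

url = "http://www.ebay.com@192.0.2.7/login"
print(has_ip_address(url), too_many_dots(url), suspicious_url(url))
```

Each check returns a boolean; how the results are weighted and combined with the TF-IDF test is not shown here.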
CANTINA Implementation

   Written in C# using .NET 2003
   800 lines of code and 4 libraries
   Microsoft IE extension
   Document corpus: British National Corpus –
    67,962,112 total words and 9,022 unique
    words
   Analyze the text content of the DOM
   Simple user interface: red traffic light
Experimental results

   Experiments #1 through #4 (result charts not
    preserved in this text)
Limitations

   Doesn’t deal with all JavaScript
    modifications to pages
   DOM parser sometimes returns wrong text
   Some legit sites composed mostly of images
   Logos in logo heuristic must be maintained
   No dictionary for other languages
   Time lags in querying from Google
   Doesn’t deal with invisible text
Conclusions/Thoughts

   Neat idea, but needs work on performance
    issues
   Also needs work on heuristics to reduce
    false positives without reducing
    effectiveness
   As long as search engine will return mostly
    legit sites, CANTINA works, but if not…
   Needs work on web pages that dynamically
    change using JavaScript

				