Google Hacking A Crash Course


Source: Alex Keller, Network/Systems Administrator for BSS Computing @ SFSU
                                  and
  Harry R. Erwin, PhD & Peter Dunne, PhD, University of Sunderland - CSEM02
                     retrieved from the Web June 2005
What is "Google"?
    Definition: Google
     Pronunciation: 'gü-"gol
     Function: noun

     Google is a play on the word googol, coined by Milton Sirotta, nephew of
     American mathematician Edward Kasner, and popularized in the book,
     "Mathematics and the Imagination" by Kasner and James Newman.

     It refers to the number represented by 1 followed by 100 zeros. Google's use of
     the term reflects the company's mission to organize the immense, seemingly
     infinite, amount of information available on the web.

    Originally called "Backrub", the Google search engine logic was developed by
     graduate students Larry Page and Sergey Brin at Stanford University in 1995.

     Their first place of business was a garage, chosen because it had a
     washer/dryer and a hot tub out back; they were serving 10,000 searches per day.
                                      http://www.google.com/corporate/history.html
     How We Got Here....
   Google is the undisputed leader in online search
    technology.

   Altavista, FAST, and Inktomi had the larger databases;
    but poorer search algorithms.

   Google's profit is partially ad driven, but sponsors do not
    'buy' higher rankings in searches.
    Google does Searching and Beyond...
   Localization
   Language options
   Toolbar
   Blogger
   Translation
   Calculator
   Stock Quotes
   Phonebook
   Newsgroups
   Programming tools
   Intra-network searches
   Print searching
   Desktop search
   Mobile Access
   News
   Spell Checker
   Pricing
Search Engine Supremacy




         http://searchenginewatch.com/reports/article.php/2156481
How Big is Google?




          http://searchenginewatch.com/reports/article.php/2156481
     Searches Per Day in Millions
     [Pie chart. Legend: Google, Yahoo/Overture, Inktomi, Looksmart, Others. Values shown: 250, 167, 80, 80, 45.]
           http://searchenginewatch.com/reports/article.php/2156461
So How Does Google Work?
   Crawls and indexes web pages

   Stores copies of web pages and graphics on
    caching servers

   Provides simple GUI for querying database of
    cached pages

   Returns search results in order
    based on relevancy
            Anatomy of a Search
            [Diagram: Server Side and Client Side]
http://computer.howstuffworks.com/search-engine1.htm
What Can Google Search?
   Adobe Portable Document Format (pdf)
   Adobe PostScript (ps)
   Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wks, wku)
   Lotus WordPro (lwp)
   MacWrite (mw)
   Microsoft Excel (xls)
   Microsoft PowerPoint (ppt)
   Microsoft Word (doc)
   Microsoft Works (wks, wps, wdb)
   Microsoft Write (wri)
   Rich Text Format (rtf)
   Shockwave Flash (swf)
   Text (ans, txt)
       So What Determines Page
       Relevance and Rating?
      Exact Phrase:
       are keywords found as an exact phrase in any pages?

      Adjacency:
       how close are keywords to each other?
      Weighting:
       how many times do keywords appear in the page?
      PageRank/Links:
       how many links point to the page? how many links are
       actually in the page?

Equation: (Exact Phrase Hit)+(AdjacencyFactor)+(Weight) * (PageRank/Links)
                                   From: Google 201, Advanced Googology - Patrick Crispen, CSU
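The equation above leaves operator precedence open. As a rough sketch only, the Python below takes one possible reading: the three keyword factors are summed and the total is scaled by the link factor. The function name, the page names, and all of the numeric inputs are made up for illustration; this is not Google's actual ranking code.

```python
def relevance_score(exact_phrase_hit, adjacency_factor, weight, link_factor):
    """Toy score: (exact phrase + adjacency + weight) scaled by the link factor.

    The grouping is an assumption; the slide's equation does not say
    whether the link factor applies to the whole sum or only to the weight.
    """
    return (exact_phrase_hit + adjacency_factor + weight) * link_factor


# Hypothetical pages with invented factor values.
pages = {
    "page-a": relevance_score(1.0, 0.5, 0.5, 2.0),
    "page-b": relevance_score(0.0, 0.5, 1.5, 1.0),
}

# Highest score first, the way results are returned in order of relevancy.
ranked = sorted(pages, key=pages.get, reverse=True)
print(ranked)
```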
How Do I Get Results?
   Pick your keywords carefully & be specific
   Do NOT exceed 10 keywords
   Use Boolean modifiers
   Use advanced operators
   Google ignores some words*:
a, about, an, and, are, as, at, be, by, from, how, i, in, is, it, of,
on, or, that, the, this, to, we, what, when, where, which, with



       *From: Google 201, Advanced Googology - Patrick Crispen, CSU
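The keyword rules above can be sketched as a small Python helper. It drops the ignored words listed on this slide and keeps at most 10 keywords. The function and constant names are hypothetical, and the helper is meant for plain keywords only: it would also drop an uppercase OR, so apply it before adding any operators.

```python
# Words this slide lists as ignored by Google.
IGNORED_WORDS = {
    "a", "about", "an", "and", "are", "as", "at", "be", "by", "from",
    "how", "i", "in", "is", "it", "of", "on", "or", "that", "the",
    "this", "to", "we", "what", "when", "where", "which", "with",
}

# The slide's advice: do not exceed 10 keywords.
MAX_KEYWORDS = 10


def prepare_keywords(text):
    """Return the keywords in text, minus ignored words, capped at 10."""
    words = text.split()
    kept = [w for w in words if w.lower() not in IGNORED_WORDS]
    return kept[:MAX_KEYWORDS]


print(prepare_keywords("how to disable directory listings in Apache"))
```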
    Google's Boolean Modifiers
   AND is always implied.
   OR: Escobar (Narcotics OR Cocaine)
   "-" = NOT: Escobar -Pablo
   "+" = MUST: Escobar +Roberto
   Use quotes for exact phrase
    matching: "nobody puts baby in a corner"
                                      OR
    "there are known knowns; there are things we know we know. We also
    know there are known unknowns; that is to say we know there are
    some things we do not know. But there are also unknown unknowns,
    the ones we don't know we don't know."
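As a small illustration, the modifiers above can be assembled into a query string in Python. The function name and its parameters are hypothetical; it only builds the text you would type into the search box.

```python
def build_query(terms, any_of=(), exclude=(), must=(), phrase=None):
    """Combine keywords with the Boolean modifiers from this slide."""
    parts = list(terms)                              # AND is always implied
    if any_of:
        parts.append("(" + " OR ".join(any_of) + ")")  # OR group
    parts.extend("-" + t for t in exclude)           # "-" = NOT
    parts.extend("+" + t for t in must)              # "+" = MUST
    if phrase:
        parts.append('"' + phrase + '"')             # quotes = exact phrase
    return " ".join(parts)


print(build_query(["Escobar"], any_of=["Narcotics", "Cocaine"]))
print(build_query(["Escobar"], exclude=["Pablo"]))
print(build_query(["Escobar"], must=["Roberto"]))
```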
      Wildcards
Google supports word wildcards but NOT
 stemming.
     "It's the end of the * as we know it" works.
     but "American Psycho*" won't get you
      decent results on American Psychology or
      American Psychophysics.
 Advanced Searching
Advanced Search Page:
 http://www.google.com/advanced_search
Advanced Operators
   cache:
   define:
   info:
   intext:
   intitle:
   inurl:
   link:
   related:
   stocks:
   filetype:
   numrange 1973..2005
   source:
   phonebook:

   DEMO:
   on-2-13-1973..2004
   visa 4356000000000000..4356999999999999




    http://www.googleguide.com/advanced_operators.html
Typical Search Filetypes
   Pdf
   Ps
   Xls
   Ppt
   Doc
   Rtf
   Txt
    Searches to Worry About

   site:
   intitle:index.of
   error|warning
   login|logon
   username|userid|employee.ID|"your username is"
   password|passcode|"your password is"
   admin|administrator
   -ext:html -ext:htm -ext:shtml -ext:asp -ext:php
   inurl:temp|inurl:tmp|inurl:backup|inurl:bak
   intranet|help.desk
Commonly Available Sensitive
Information
   HR files
   Helpdesk files
   Job listings
   Company information
   Employee names
   Personal websites and blogs
   E-mail and e-mail addresses
Directory Listings
   Search for intitle:index.of
   Or intitle:index.of "parent directory"
   Or intitle:index.of name size
   Or intitle:index.of inurl:admin
   Or intitle:index.of filename
   This can then lead to a directory traversal
   Look for filetype:bak, too, particularly if you
    want to expose sql data generated on the fly
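To check your own site for exposed directory listings, the searches above can be generated in one go. This Python sketch assumes you restrict each search with the site: operator; the function name is hypothetical and example.com is a placeholder for your own domain.

```python
# Directory-listing searches taken from this slide.
DIRECTORY_LISTING_PATTERNS = [
    'intitle:index.of',
    'intitle:index.of "parent directory"',
    'intitle:index.of name size',
    'intitle:index.of inurl:admin',
]


def audit_queries(domain):
    """Return the directory-listing searches restricted to one domain."""
    queries = ["site:%s %s" % (domain, p) for p in DIRECTORY_LISTING_PATTERNS]
    # Backup files are worth a separate look, as the slide suggests.
    queries.append("site:%s filetype:bak" % domain)
    return queries


for query in audit_queries("example.com"):
    print(query)
```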
    Web-Enabled Network Devices

   Google webspider often encounters web-
    enabled devices. These allow
    administrators to query status or manage
    configuration using a web browser.

   You may be able to access network
    statistics this way.
    Network Mapping
   Site:domain name
   Site crawling, particularly by indicating negative
    searches for known domains
   Lynx is convenient if you want lots of hits:
      lynx -dump "http://www.google.com/search?q=site:name+-knownsite&num=100" > test.html


   Or use a Perl script with the Google API
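The search URL used with lynx can also be built in Python, which takes care of the escaping. This is a sketch using only the standard library; the function name and the example.com hosts are placeholders, and it only builds the URL without fetching anything.

```python
from urllib.parse import urlencode


def search_url(site, known_site, num=100):
    """Build a search URL for pages on `site`, excluding a known host."""
    query = "site:%s -%s" % (site, known_site)
    # urlencode escapes the colon and turns the space into "+".
    return "http://www.google.com/search?" + urlencode({"q": query, "num": num})


print(search_url("example.com", "www.example.com"))
```

Repeating the search with each newly found host added as another negative term narrows the results to hosts you have not seen yet.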
    Link Mapping

   Explore target site to see what it links to.
    (Owners of linked sites may be trusted,
    but have weak security.)
   The link operator supports this search.
   Also check newsgroups for questions
    from people at the organization.
Extras...
     Translation and Language options - over 100 to choose from:
      http://www.google.com/language_tools
     Stock Quotes - enter stocks:, example: stocks:GOOG
     Newsgroups - http://groups.google.com
     Calculator - "1024 minus 768" or "12 to the 10 power"
     Froogle - http://froogle.google.com
     Images - http://images.google.com
     Spell Checking - just type it in: "convienence"
     Blogger - http://www.blogger.com/start



  Extras can be found at http://www.google.com/help/features.html
    Google, doesn't make it
    right...
GOOD
 FAIR - Fairness and Accuracy in Reporting http://www.fair.org/
 Federation of American Scientists: http://www.fas.org/main/home.jsp

BAD
   Holocaust Never Happened? http://www.air-photo.com/english
   School of the Americas:
http://carlisle-www.army.mil/usamhi/usarsa/main.htm

UGLY
   Pixyland!
http://www.pixyland.org/peterpan/photo_closeups_pp4.htm
Protecting Yourself
   Solid security policy
   Public web servers are Public!
   Disable directory listings
   Block crawlers with robots.txt
   <META NAME="ROBOTS" CONTENT="NOARCHIVE">
   NOSNIPPET is similar.
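One way to confirm that a robots.txt file blocks what you expect is to parse it with Python's standard urllib.robotparser module. The rules, paths, and function name below are made up for illustration. Keep in mind that robots.txt only asks well-behaved crawlers to stay away; it does not protect the files themselves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers out of two directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /backup/
"""


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if the rules in robots_txt allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_crawlable(ROBOTS_TXT, "http://www.example.com/admin/users.html"))
print(is_crawlable(ROBOTS_TXT, "http://www.example.com/index.html"))
```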
More Protection
   Passwords
   Delete anything you don't need from the
    standard webserver configuration
   Keep your system patched.
   Hack yourself
   Use the URL removal tools to delete sensitive
    data that gets into Google.
Bibliography and Further Research
Search Engine Watch:
http://searchenginewatch.com

Google Hacks: 100 Industrial-Strength Tips & Tools
by Tara Calishain, Rael Dornfest

Johnny I Hack Stuff:
http://johnny.ihackstuff.com

Google:
http://www.google.com

HowStuffWorks:
http://computer.howstuffworks.com/search-engine1.htm