
									Week 1 You'll always find what you want
          A deep and uncharted web
                   Space
• The web is huge
• The problem is that time and space have a different
  meaning on the web.
• Everything that 'happens' is carved forever:
   – Just try to pull something 'off the web'.
   – Everything you write and publish will defy eternity, carved in
     electrons: the very moment you put something on the web,
     someone, somewhere, will make a copy of it.
   – It is bound to reappear, somewhere, sometime: the indestructible
     and redoubtable powers of the void.



        A deep and uncharted web
                  Time
• Time is different too.
• As anyone who has real web experience knows,
  something you wrote or published can remain
  unanswered, and apparently uncared for, for
  months or years... and then, all of a sudden,
  when you have almost forgotten it yourself, a dozen
  people begin contacting you out of the void,
  with an enormous and, to you, inexplicable
  interest in what you wrote so long ago.

                    Structure
• Four main areas:
  –   Core (the Nucleus)
  –   Hidden databases
  –   Outside linkers
  –   Outside linked
• Different techniques are used to access these
  different areas.




                 Hidden databases
• Hidden databases.
   – These are pages that the Nucleus points to.
   – They may (or may not) point back to the Nucleus.
   – Access is restricted: visitors to sites located here are
     supposed to "pay" in order to access them. As you may imagine,
     these pages are NOT mutually linked.

• Fortunately the web was originally built in order to
  share knowledge.
• The building blocks, the "basic frames" behind the
  structure of the web are still the same regardless of the
  commercial twist.


               Hidden databases
• If I may dare a comparison: exactly as it is pretty easy to
  break any software protection written in a higher
  language if you know (and use) assembly, so it is easy to
  break any server-user delivered barrier to a given
  database if you know (and can outflank) the protocols
  used by browsers and servers.

• As a result, let's simply say that for some it is relatively
  easy to access all pages in this area by reversing the (simple)
  Perl or JavaScript tricks used to keep them "off limits".
• (You won't even have to resort to common exploits à la
  "politically correct" :-)

               Outside linked
• The "outside linked".
• The sites in this area are linked from the Nucleus
  but do not point back to it.
• For instance, the elements of a database of
  images, linked from the Nucleus but not
  necessarily pointing back to it.
• These pages are "outside" the nucleus, yet not
  particularly difficult to find.


                 Outside linkers
• Like matter and antimatter, the "outside linked" pages
  have an inversely related counterpart on the web: the
  "outside linkers" pages.
• All the pages located in this specific area of the
  web "point" to the Nucleus but are not pointed to
  from it.
• Imagine, as an example, the personal links page of a
  scientist: lots of interesting links to the Nucleus, yet no
  need to publicize its existence.
• A page with information you may need is there,
  somewhere, without any link whatsoever that could
  bring you to it. Indeed there are no links back from the
  Nucleus to these pages.

                 Outside linkers
• The "outside linkers" are a part of the web you cannot
  reach using "normal" search techniques, since no link
  whatsoever points to them.
• Yet they may hoard knowledge you need. There are,
  fortunately, some techniques that you can apply in order
  to find them, the simplest and most common one being
  'klebing'.
• Klebing means using the information found in the
  referrer fields of your site's logs and statistics when your
  target visits your site. The trick is to lure your potential
  targets to an interesting page that you create and keep
  updated, until they land there and unsuspectingly leave
  an entry in your web log or stats server.
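
The log-reading half of klebing can be sketched in a few lines of Python. This is only an illustration: it assumes your server writes the common "combined" access-log format, and the sample lines, host names, and the `referrers` helper are all made up for the example.

```python
import re
from collections import Counter

# Combined log format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def referrers(lines, own_host):
    """Count referring URLs, skipping empty ones and links from our own site."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # not a combined-format line
        ref = match.group("referer")
        if ref in ("", "-") or own_host in ref:
            continue  # no referrer recorded, or an internal link
        counts[ref] += 1
    return counts

# Hypothetical log entries for a lure page on mysite.example
SAMPLE_LOG = [
    '192.0.2.7 - - [10/Mar/2003:13:55:36 +0100] "GET /lure.htm HTTP/1.0" 200 2326 "http://unlisted.example/links.htm" "Mozilla/4.0"',
    '192.0.2.9 - - [10/Mar/2003:14:02:11 +0100] "GET /lure.htm HTTP/1.0" 200 2326 "-" "Mozilla/4.0"',
    '192.0.2.7 - - [10/Mar/2003:14:05:00 +0100] "GET /index.htm HTTP/1.0" 200 512 "http://mysite.example/lure.htm" "Mozilla/4.0"',
]

for url, hits in referrers(SAMPLE_LOG, "mysite.example").most_common():
    print(hits, url)
```

Any referring URL that no search engine knows about is a candidate "outside linker" page.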


The Triad




                    Why Teoma
• Because you can refine, refine, refine... and that's it!
• This is NOT just a simple "search within these results"
  thingy.
• Best choice for starting a query on BROAD topics.
• Teoma is resistant to index spamming (a huge problem
  for Google).
• Teoma eliminated all advertisement banners and
  interstitials (popups) in January 2003
• While Google's 'global linking' gives credit to every link
  equally, Teoma (should) instead find 'the links that
  count'... and count them.
• Teoma "creates on the fly clusters of web pages into
  topic specific web communities"

                 Why Google
• With Google you can forget hyperlinks: just find
  pages by selecting a set of very peculiar words that
  uniquely identify a given page.
• Google searches inside .pdf, .doc, etc. files.
  Moreover, it locates the text most relevant to
  your specific query and highlights your
  keywords and their context.




                  Why Fast
• Because it is fast.
• Because it covers parts of the web that are not
  covered by Google.
• Because it is less polluted by useless
  commercial sites.
• Because it mines the "deep"
  and "obscure" web more than Google or Teoma.
• Because it is less censored than Google.
• Because it is simply the best main engine for
  complex multi-word searches.

                 Searching
• 1) LOWERCASE
  Always enter your search terms in lower case
  (unless you want to limit your search). Most
  search engines will then find both upper- and
  lower-case occurrences of your search string,
  whereas a term containing capitals matches only
  that exact capitalization:
  "pAris" is NOT the same as "paris"




                  Searching
• 2) EXACT SEQUENCE [""]
  Enclose terms in double quotation marks if you
  want to retrieve those exact terms in that exact
  sequence. This may be very useful in order to
  find a specific page. Thus "saerch engine" will
  retrieve some (11) pages WITH THIS SAME
  MISSPELLING.
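
The lowercase rule and the exact-sequence rule can be pictured with a small Python sketch. It only illustrates the behaviour these slides describe, not how any particular engine is implemented; `term_matches` and `phrase_matches` are hypothetical helper names, and punctuation handling is left out.

```python
def term_matches(query, word):
    """Lowercase terms match any capitalization; terms with capitals match exactly."""
    if query == query.lower():
        return word.lower() == query
    return word == query

def phrase_matches(phrase, text):
    """True if the words of `phrase` occur in `text` consecutively and in order."""
    target = phrase.split()
    words = text.split()
    n = len(target)
    return any(
        all(term_matches(q, w) for q, w in zip(target, words[i:i + n]))
        for i in range(len(words) - n + 1)
    )

# Lowercase query: capitalization in the page is ignored.
print(term_matches("paris", "Paris"))
# Mixed-case query: only the exact capitalization matches.
print(term_matches("pAris", "paris"))
# Quoted phrase: both words, adjacent, in this order.
print(phrase_matches("saerch engine", "My favourite Saerch Engine tips"))
print(phrase_matches("saerch engine", "engine saerch"))
```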




                     Searching
• 3) NARROW DOWN [ AND | & | + ] and
  ELIMINATE MERCILESSLY [ AND NOT | ! | - ]
  Narrow your searches by linking your search terms with
  AND or &, or simply use the plus sign [+]. The search
  engine will find only those pages that contain all of your
  search terms. Similarly, exclude pages that are not
  relevant to your search by preceding the search term
  with AND NOT or !, or simply use the minus sign [-]. Thus
    +"search engines" +hints +tips +techniques -tits -sex -"make money"
  is better than the simpler
    +"search engines" +hints +tips +techniques
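
To make the +/- logic concrete, here is a minimal Python sketch of a filter that keeps a page only if it contains every "+" term and none of the "-" terms. The helper names are hypothetical, and the matching is a plain case-insensitive substring test, which is cruder than what a real engine does.

```python
import re

# A signed term: "+" or "-" followed by a quoted phrase or a single word.
TERM = re.compile(r'(?:^|\s)([+-])("[^"]+"|\S+)')

def parse_query(query):
    """Split a +/- query into (required, excluded) lists of lowercase terms."""
    required, excluded = [], []
    for sign, term in TERM.findall(query):
        term = term.strip('"').lower()
        (required if sign == "+" else excluded).append(term)
    return required, excluded

def page_matches(text, query):
    """Keep a page only if it has every required term and no excluded term."""
    required, excluded = parse_query(query)
    text = text.lower()
    return all(t in text for t in required) and not any(t in text for t in excluded)

QUERY = '+"search engines" +hints +tips -"make money"'
print(parse_query(QUERY))
print(page_matches("Hints and tips for search engines", QUERY))
print(page_matches("Search engines: hints, tips, and how to make money", QUERY))
```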

                     Searching
• 4) DOWNSIDE OF THE BOOLEAN operators
  It's often difficult to specify exactly what you want to
  include or exclude. You can also get unexpected results
  if you are not careful about your use of operators and
  parentheses. For example, the search seeking OR
  searching AND finding is the same as the search seeking
  OR (searching AND finding). Both queries will find
  documents that contain both searching and finding,
  together with documents that contain the word seeking.
  However, the query (seeking OR searching) AND
  finding is not the same. It will find documents
  containing the word finding and, in the same document,
  either seeking or searching. Be careful with the boolean
  operators!
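
The precedence trap can be checked with Python sets, where & (AND) also binds more tightly than | (OR). The four tiny "documents" below are invented for the illustration.

```python
# Each document is reduced to the set of words it contains (made-up data).
docs = {
    "a": {"seeking"},
    "b": {"searching", "finding"},
    "c": {"seeking", "finding"},
    "d": {"searching"},
}

def has(word):
    """Names of all documents containing `word`."""
    return {name for name, words in docs.items() if word in words}

seeking, searching, finding = has("seeking"), has("searching"), has("finding")

# Without parentheses, AND is evaluated before OR.
implicit = seeking | searching & finding
explicit = seeking | (searching & finding)
grouped = (seeking | searching) & finding

print(sorted(implicit))  # same documents as `explicit`
print(sorted(explicit))
print(sorted(grouped))   # a different, narrower result
```

Document "a" contains only the word seeking, so it is in the first two results but not in the grouped one.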

                        Searching
• 5) "PECULIAR" strings
  You should always strive to use differentiating keywords when
  searching the web. Words that are commonly used will not help you
  much. Extremely common words like articles and prepositions are
  so worthless that they are completely ignored. Try to use words
  which underline the peculiarity of your target. Even common words,
  when combined with boolean qualifiers, can be very effective. You
  must identify the main concepts in your topic and determine any
  synonyms, alternate spellings, or variant word forms for the
  concepts. Remember that the more "peculiar" a word, the more
  useful it will be in sharpening your search.
  +title:"search strateg*" +hints +tips
  In this case we put the "search strateg*" string (which
  already has an elevated PEC) in the title: keyword.
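
One rough way to judge how "peculiar" a word is: count how many documents of a sample collection contain it. The fewer, the better the keyword. The Python sketch below uses plain document frequency for this; that measure is an assumption made for illustration, not the definition of PEC used in these slides, and the sample texts are invented.

```python
from collections import Counter

def document_frequency(documents):
    """For each word, count the number of documents that contain it."""
    counts = Counter()
    for text in documents:
        counts.update(set(text.lower().split()))
    return counts

def most_peculiar(candidates, documents):
    """Order candidate keywords from rarest to most common."""
    df = document_frequency(documents)
    return sorted(candidates, key=lambda word: df[word.lower()])

SAMPLE_DOCS = [
    "the search engine",
    "the search strategy",
    "the engine room",
]
print(most_peculiar(["the", "search", "strategy"], SAMPLE_DOCS))
```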



                            Searching
• 6) ASTERISK[*]
  Note also the use of the asterisk [*] in the previous example: it
  MUST be used after at least 3 characters, it is valid for up to 5
  characters or as an element of a phrase.
  For Altavista:
    – Asterisk (*): After 3 specified characters will search for matches in up to
      5 trailing letters.

    – Question Mark (?): After 3 specified characters will match exactly one
      more character.

    – Double Asterisk (**) More flexible as it will search for matches for an
      unlimited number of trailing characters.

   You also have the ability to use the wildcards interchangeably and
   more than once in the same search string.
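
The three wildcard rules listed for Altavista can be expressed as regular expressions. The Python sketch below is a simplification: it handles a single wildcard at the end of a term only, treats "up to 5" as zero to five trailing characters, and uses hypothetical function names.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a trailing **, * or ? wildcard into a compiled regex."""
    # Check "**" before "*" so the longer wildcard wins.
    for suffix, tail in (("**", r"\w*"), ("*", r"\w{0,5}"), ("?", r"\w")):
        if pattern.endswith(suffix):
            stem = pattern[: -len(suffix)]
            break
    else:
        stem, tail = pattern, ""
    if tail and len(stem) < 3:
        raise ValueError("a wildcard needs at least 3 leading characters")
    return re.compile(re.escape(stem) + tail, re.IGNORECASE)

def wildcard_match(pattern, word):
    """True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None

for pattern in ("strateg*", "strateg**", "strateg?"):
    hits = [w for w in ("strategy", "strategies", "strategically")
            if wildcard_match(pattern, w)]
    print(pattern, hits)
```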


                    Searching
• 7) STOP WORDS
  Stop words are words such as "and", "the" and "or", which
  search engines exclude from their searches to make them
  more effective. These terms are excluded because they
  are either extremely common or used by the
  search engine for performing more specialized searches.
  Just think about how many documents on the Web
  contain the word "the" and you'll understand how
  important a good stop-word list is for every search
  engine.
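
In Python, stop-word removal comes down to a set lookup. The list below is a tiny illustrative sample, not the list of any actual engine.

```python
# Illustrative stop-word list; real engines use their own, longer lists.
STOP_WORDS = {"a", "an", "and", "of", "or", "the", "to"}

def strip_stop_words(query):
    """Return the query terms that remain after dropping stop words."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]

print(strip_stop_words("The history of the web"))
```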



                               Errors you encounter
•   400 - Bad request
    What does it mean?
    There's something wrong with the URL you typed. Maybe the server you're contacting doesn't recognize the document you're asking for, maybe it
    doesn't exist, or maybe you're not authorized to access it.
    What can you do about it?
    Check the URL. Pay special attention to uppercase and lowercase letters, colons, and slashes. Here's a tip: one style convention many sites observe is to
    slap initial capital letters on directory names but not filenames. If you get this message repeatedly, maybe the note you copied the URL from mixed up
    its uppercase and lowercase.

•   401 - Unauthorized
    What does it mean?
    You're probably accessing a site that's protected and you're not on the host's preferred guest list or you typed the password incorrectly. Some sites also
    put a block on domain types--if you're not from a .gov or .edu domain, for example, you may not be able to gain access.
    What can you do about it?
    If you're sure you're allowed in, try again, and this time look at the keyboard when you type. Passwords are often case-sensitive, so if you've got your
    Caps Lock on, take it off. If you're trying to break in, we don't want to know, but the odds are stacked against you.

•   403 - Forbidden
    What does it mean?
    You may not be allowed to access this document, probably because it's either blocked to your domain or it's password-protected.
    What can you do about it?
    If you know the password, try again, carefully. If you don't know the password but think you're eligible for one, contact the site's Webmaster and ask for
    it.

•   404 - Not found
    What does it mean?
    The server that hosts the site can't find the HTML document at the end of the URL. It may be a simple case of a mistyped URL, but it may also mean that
    the document doesn't exist anymore.
    What can you do about it?
    Try going one level up (deleting the last part of the URL to the nearest slash) to see if the site is live. If it is, check if there are links to the document you're
    looking for. Failing that, delete the last slash and type .html (or .shtml) instead, and see what that gives you.

•   503 - Service unavailable
    What does it mean?
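    The server is temporarily unable to handle your request, usually because it is overloaded or down for maintenance. Similar symptoms can appear when
    your access provider's server is down, your company's gateway (the connection between the LAN and the Internet) is broken, or your own system
    isn't working.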
    What can you do about it?
    This is usually an easy one: wait a minute and try again. If the error persists, identify the culprit (access provider, gateway, or your system) by process of
    elimination.

•   Bad file request
    What does it mean?
    Your browser supports forms complete with data-entry fields and drop-down lists, but not the form you're trying to access. Perhaps there's an error or
    unsupported feature in the form.
    What can you do about it?
    Send email to the Webmaster and try the form again some other day.
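
For the numbered codes above, a short Python sketch can pair each status with the advice given on this slide, and generate the "one level up" URLs suggested for a 404. The standard-library `http.HTTPStatus` enum supplies the official reason phrases; the `ADVICE` table, the function names, and the example URL are assumptions made for the illustration.

```python
from http import HTTPStatus
from urllib.parse import urlsplit, urlunsplit

# Advice paraphrased from the slide; the table itself is illustrative.
ADVICE = {
    HTTPStatus.BAD_REQUEST: "check the URL for typos",
    HTTPStatus.UNAUTHORIZED: "re-enter your password carefully",
    HTTPStatus.FORBIDDEN: "ask the webmaster for access",
    HTTPStatus.NOT_FOUND: "go one level up and look for a link",
    HTTPStatus.SERVICE_UNAVAILABLE: "wait a minute and try again",
}

def explain(code):
    """Return 'code phrase: advice' for a numeric HTTP status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}: {ADVICE.get(status, 'no advice listed')}"

def parent_urls(url):
    """Yield the URL with one more trailing path segment removed each time."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    while segments:
        segments.pop()
        path = "/" + "/".join(segments) + ("/" if segments else "")
        yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))

print(explain(404))
for candidate in parent_urls("http://www.example.com/Docs/Guides/page.html"):
    print(candidate)
```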
                             Errors you encounter
•   Cannot add form submission result to bookmark list
    What does it mean?
    You've just entered a search request and tried to save the result as a bookmark. Though it may appear as a discrete address, the result isn't a legitimate
    URL, so you can't add it to your bookmark list.
    What can you do about it?
    Try saving the result page as an HTML page on your hard disk. Use the Save As command then add the saved page to your bookmark list. Depending
    on the CGI script behind the query, you may or may not be successful. But it's worth a try.

•   Connection refused by host
    What does it mean?
    You may not be allowed to access this document, probably because it's either blocked to your domain or it's password-protected.
    What can you do about it?
    If you know the password, try again, carefully. If you don't know the password but think you're eligible for one, contact the site's Webmaster and ask for
    it.

•   Failed DNS lookup
    What does it mean?
    The domain name system can't translate the URL to a valid Internet address. This is either a harmless blip or the result of a mistyped URL (specifically, a
    mistyped host name).
    What can you do about it?
    Blips in DNS lookup are common, and often you can rectify this by clicking the Reload button. If that doesn't work, check your typing of the URL
    carefully. If the problem persists, try again after an hour or so.

•   File contains no data
    What does it mean?
    The site you've accessed is the right one, but there are no Web page documents on it. You may have stumbled upon this site just as updated versions are
    being uploaded.
    What can you do about it?
    Try the URL again, carefully. If that doesn't help, try again in an hour.

•   Helper application not found
    What does it mean?
    Your browser doesn't recognize a file at the Web or Net site you're visiting. Most browsers can be extended using helper applications (or viewers) to
    read files they don't otherwise recognize. These files aren't necessarily graphics--they can be sound files, movie clips, or ZIP or SIT archive files you're
    trying to download.
    What can you do about it?
    The dialog box that carries this message will usually give a clue about the file type that's missing. (You may see some gibberish about octet streams, but
    after that you'll probably see some reference to graphic-TIFF, which gives it away.) Look at CNET's Survival Kits for your computing platform (Mac, PC,
    or Unix) for viewers for the most common file types. Then follow your browser's instructions for assigning a viewer for each file format you wish to
    view online.

•   Host unavailable
    What does it mean?
    The machine that hosts this site is probably down for maintenance.
    What can you do about it?
    If at first you don't succeed, hit Refresh or Reload again and again. But wait a while between refreshes.
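
The "Failed DNS lookup" case above can be reproduced from a script. This Python sketch wraps the standard `socket.gethostbyname` call and reports a failed lookup as `None`; the `resolver` parameter is a hypothetical hook that lets you swap in a different lookup function, and the host name is a placeholder.

```python
import socket

def resolve(hostname, resolver=socket.gethostbyname):
    """Return the IP address for hostname, or None if the DNS lookup fails."""
    try:
        return resolver(hostname)
    except socket.gaierror:
        # The name could not be translated to an address:
        # a mistyped host name or a temporary DNS blip.
        return None

address = resolve("www.example.com")
print(address if address else "Failed DNS lookup: check the host name, then retry later")
```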
                            Errors you encounter
•   Host unknown
    What does it mean?
    The server may be down for maintenance, or you may have lost the connection (your modem disconnected, or your company's T1 line is choking).
    What can you do about it?
    Hit the Reload button first. This is often a blip in the Net. Then check the URL for typos (and don't forget case-sensitivity). Then make sure you're
    connected by hitting Reload, which will re-establish connections in many cases.

•   Network connection was refused by the server
    What does it mean?
    The server is probably too busy to handle one more user, but it's not configured to generate its own message, so this generic message shows up instead.
    What can you do about it?
    As always, try and try again. If that doesn't work, wait as long as you can. Then try again.

•   NNTP server error
    What does it mean?
    You're trying to log on to a Usenet newsgroup, but you can't get to it. The Usenet server is something that's made available by your Internet service
    provider, so it may be that this newsgroup isn't available at all.
    What can you do about it?
    Make sure you've typed the URL correctly. If that doesn't help, try again later. If the problem persists, contact your access provider and give them a piece
    of your mind.

•   Permission denied
    What does it mean?
    You're trying to upload a file to an ftp site, and the site's administrator doesn't want you to. Alternatively, you're using the wrong syntax when trying to
    get a file. Or maybe the site is currently too busy to handle your upload.
    What can you do about it?
    First check that you used the correct syntax. Then try again later. If the problem persists, send email to the Webmaster and ask how you can upload a file
    to that site.

•   Too many connections--try again later
    What does it mean?
    This is another variation on the rush-hour error message. You've picked the wrong time to call, that's all.
    What can you do about it?
    Do as it says--try again later, or keep hitting the Refresh button until you succeed.

•   Too many users
    What does it mean?
    No ftp site has unlimited access: physical connections or administrator policy allocate a number of anonymous users to a given site. When that number
    is exceeded, all who try to log on receive this message.
    What can you do about it?
    Just keep trying until you get lucky. However, on a busy site (like Netscape's the week after a big announcement) or one with very limited access rights,
    you may be out of luck. If so, check to see whether the site has mirrors, and try one of those.

•   Unable to locate host
    What does it mean?
    The server may be down for maintenance, or you may have lost the connection (your modem disconnected or your company's T1 line is choking).
    What can you do about it?
    Hit the Reload button first. This is often a blip in the Net. Check the URL for typos (and don't forget case-sensitivity), then make sure you're connected
    by hitting Reload, which will re-establish connections in many cases.
                      Errors you encounter
•   Unable to locate the server
    What does it mean?
    You have either mistyped the URL, or the server doesn't exist (you may have outdated information).
    What can you do about it?
    Your mission, should you choose to accept it: enter the URL again, looking at the keyboard as you type. No luck?
    Check with your source to verify that the URL is correct.

•   Viewer not found
    What does it mean?
    Your browser doesn't recognize a file at the Web or Net site you're visiting. Viewable files aren't necessarily
    graphics--they can be sound files, movie clips, ZIP or SIT archive files, and so on. If it's not a GIF or JPEG file, your
    browser may not know what it is.
    What can you do about it?
    The dialog box that carries this message will usually give a clue about the file type that's missing. (You may see
    some gibberish about octet streams, but after that you'll probably see some reference to graphic-TIFF, which gives
    it away.) Look at CNET's Survival Kits for your computing platform (Mac, PC, or Unix) for viewers for the most
    common file types. Then follow your browser's instructions for assigning a viewer for each file format you wish to
    view online.

•   You can't log on as an anonymous user
    What does it mean?
    This message covers a multitude of sins. Some ftp sites allow people who aren't members, some don't. Others may
    allow nonmembers, but limit the number of visitors. Another possibility is that your browser doesn't support
    anonymous ftp access. The way most browsers handle this is to submit "anonymous" as the user ID and your
    email address as the password. The America Online browser is one of the few that don't do this.
    What can you do about it?
    Either try again later, after the rush hour, or enter your user ID and password manually (using ftp software such as
    WS-FTP). Remember: your ID is anonymous and your password is your (bogus, I hope, for your sake) email address.




                 Project #1
                     Part II
• Building the Zen of awareness




Cat burglars in the museum after dark
• Our Target
  – http://gallica.bnf.fr/



"70,000 digitized documents and more intuitive
    navigation: this new version of Gallica
is the most significant update since
     the server was created in October 1997."



http://gallica.bnf.fr/




Find Me In the Museum




Become One With Your Environment
• Relax
• If you focus you are trying too hard and you will
  miss what is happening.
• Try for 15 minutes then take a break and do
  something else while your mind explores the
  problem…laterally
• What kind of things is this browser loading?
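
One way to answer the last question is to list every resource and link an HTML page refers to. The Python sketch below uses the standard `html.parser` module on an invented sample page; the class name, the attribute list, and the sample markup are assumptions, and nothing here is specific to the target site.

```python
from html.parser import HTMLParser

class ResourceLister(HTMLParser):
    """Collect (tag, URL) pairs for attributes that reference other files."""
    ATTRS = {"src", "href", "background"}

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.ATTRS and value:
                self.resources.append((tag, value))

def list_resources(html):
    """Return every referenced resource or link found in the HTML text."""
    parser = ResourceLister()
    parser.feed(html)
    return parser.resources

# Invented sample page
SAMPLE = (
    '<html><body background="bg.gif"><img src="a.jpg">'
    '<a href="b.htm">b</a><script src="c.js"></script></body></html>'
)
for tag, url in list_resources(SAMPLE):
    print(tag, url)
```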



An Outside Linker?
    Can you find it?




http://gallica.bnf.fr/Fonds_Mosaiques/




             Hints to Find Me
• 07720489
• Scripts can be used to acquire the target




              What to Submit
• Either the photo (*.jpg)
• Or a short (~900 word) essay on:
  – How you think the museum is organized (diagrams
    would be helpful)
  – Why you were unable to find the picture
  – Your thoughts on the 'outside linker': where it is,
    why you can't find it, what it is used for
  – Other interesting observations you may have
    collected



								