  CSCI 572: Information Retrieval and Search Engines: Summer 2010

        Prof. Chris A. Mattmann
                          The Class
• Will give you a complete treatment of the area of search engines and
  information retrieval
   – The fundamental building blocks of the web and search engines
       • The Search Engine Architecture proposed by Brin/Page
       • Understanding algorithms for ranking pages
       • Understanding technologies for characterizing, downloading,
         parsing, indexing, searching and disseminating web content
   – Advanced topics in search engines such as BigData and distributed
• Will equip you with the necessary skills to design complex,
  real-world search engines
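
To make "algorithms for ranking pages" concrete, here is a minimal sketch of link-based ranking by power iteration, in one common normalized formulation of PageRank. The three-page link graph, the damping value of 0.85, and the iteration count are illustrative assumptions, not material from the course, and the sketch assumes every link target also appears as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's damped rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: A and C both link to B, and B links back to A.
links = {"A": ["B"], "B": ["A"], "C": ["B"]}
ranks = pagerank(links)
```

In this toy graph B collects links from two pages, so it ends up ranked above A, while C, which nothing links to, ranks last.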

   May-20-10                  CS572-Summer2010                 CAM-2
              General class information
• Lecture, but…
   – You can participate
   – You should participate
   – You will participate, that is, if you want to do well :)
• Breakdown of points
   – 20% participation
   – 40% research paper presentation
   – 40% course project

              General class information
• Syllabus/website:
   – Visit it often, as the schedule may change!
   – This is where all of your course project info and
     presentation info will be posted
   – This site will point you to required reading (research
     papers), and to lectures that you can download before

                   What we’ll cover
• Theory
   – Understanding of basic information retrieval
   – Search engine querying
   – Search engine ranking
   – Architecture of search engines and technologies
   – Design Patterns
• Practice
   – Modern search engine technologies from Apache

               Course Presentation
• Each week, we’ll read a few research papers on
  search engines
• For the first part of the course (5 weeks), I’ll
  lecture on the general topics that the research
  papers cover
   – The search engine architecture: fetching, parsing,
     indexing, querying, distributed computation, etc.
• For the last part of the course (~5 weeks), each one
  of you will present on one of the research papers
  we covered in the first 5 weeks
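
The parsing, indexing, and querying stages named above can be sketched in a few lines. This is a toy in-memory inverted index with boolean AND queries, not how Lucene or Nutch implement these stages. The sample documents and the function names (parse, build_index, query) are hypothetical, and fetching is replaced by a hard-coded dictionary.

```python
import re
from collections import defaultdict


def parse(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in parse(text):
            index[term].add(doc_id)
    return index


def query(index, q):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = parse(q)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical stand-in for fetched pages; a real crawler would download these.
docs = {
    "d1": "Search engines index web pages",
    "d2": "Crawlers fetch web content",
    "d3": "Ranking orders search results",
}
index = build_index(docs)
```

Calling query(index, "search engines") intersects the posting sets for "search" and "engines", so only documents containing both terms are returned.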
                    Course Presentation
• What I’m looking for (~20 minutes of presentation, with ~5
  minutes of questions at the end)
   – You understood the paper
   – Discussion of related work and background
   – Discussion of why I should care about the topic
          • And more importantly why your fellow classmates should care
   – Relation of your paper to the lecture slides I gave on the topic
   – Simple summarization and description of the algorithm and/or
     technology introduced in the paper
   – What were the results/contributions/conclusions of the paper
   – Your evaluation of Pros of the paper
   – Your evaluation of Cons of the paper
                   Course Presentation
• What I’m NOT looking for
   – Plagiarism
   – Repetition
   – Cutting/Pasting out of the paper
   – Regurgitation
   – You to follow the EXACT set of bullets that I gave on the prior slide
• You should be looking to be innovative – show the class and
  me that you really understood what was in the paper
   – Treat it like a conference presentation

                      Course Project
• You will get to leverage one or a combination of several
  Apache software technologies
   – Nutch, Tika, Lucene, Solr, Hadoop, HBase, Hive, Cassandra, etc.
• You will make a significant contribution to one or more of
  the above communities
• Deliverables
   – A 2-page project proposal
   – A 2-page mid-term project report
   – Source code and final demonstration to me at end of class

                          Course Project
• Deliverables
   – Your project proposal should include:
          • Demonstration that you’ve researched your particular idea with
            pointers to issue trackers and mailing lists
          • Objectives section
          • Approach section
          • Identification of deliverables section
          • Timeline/Schedule
   – Your mid-term report should include:
          • Current status
          • Blockers to completion
          • Planned mitigation to blockers

•   Graduated with my Ph.D. in Computer
    Science from USC in 2007
     – Advisor: Dr. Nenad Medvidovic
•   Was a student at USC from 1998-2007
     – B.S., Computer Science 2001
     – M.S., Computer Science 2003
•   My research interests
     – The intersection of software
       architectures and large-scale data
     – Software connector selection
     – Bayesian decision theory
     – Reinforcement learning
     – Search Engines

• Quick lecture on characterizing the web
• Read the papers linked from the syllabus
• Be ready for next Tuesday as this is a 10-week
  course and we are going to dive in
