Expert Search Presentation

					   ExpertSearch
    Data Sciences Summer Institute
University of Illinois at Urbana-Champaign
                  July 1, 2011




                                             1
Expert Search Group




                      2
   Expert Search Goal
 Expert Search is a search engine
that returns a list of people who are
experts in a particular area of study
  given a paper abstract or list of
                topics.
                              Expert #1

 Paper         Expert         Expert #2
Abstract       Search
                              Expert #3
or Topic
                              Expert #4
                                     3
Expert Search Current Use
  Generate a list of professors that
should be invited to a talk on campus.


               Summary of Talk



               List of people
                  to e-mail

                                   4
Presentation Overview




                        5
                     Data Crawling
____________________________________________________________________




                                                                6
   Data Crawling Team
    Oluwadare Ibiyemi
     Fitzroy Nembhard
     Froswell Wallace
Thapanapong Rukkanchanunt


                        7
     Data Crawling

13609 Documents

           4445 Experts

  987 MB of Text
         247 Programs of Study

                            8
      Data Crawling


            crawl   obtain




retrieve                     save
           store


                                    9
            Data Crawling
• Get homepage URL from UIUC phonebook

• Use search engine to obtain the URL if it is not
  in UIUC phonebook

• Crawl homepage

• Terminate the search at 300 HTML pages

• Collect and process PDF files

• Store the number of documents, homepage
  link, and text crawled from "homepage"
                                              10
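The crawling steps above can be sketched roughly as follows. This is a minimal illustration, not the team's crawler: the `fetch` callable, the same-host restriction, and the names `crawl_homepage` and `LinkParser` are assumptions made for the example. Only the 300-page limit and the separate collection of PDF links come from the slide.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

MAX_HTML_PAGES = 300  # page limit stated on the slide


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_homepage(start_url, fetch, max_pages=MAX_HTML_PAGES):
    """Breadth-first crawl from start_url.

    fetch is a caller-supplied (hypothetical) function mapping a URL to its
    HTML text, or None on failure. Staying on the start URL's host is an
    assumption of this sketch. Returns (html_pages, pdf_urls).
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    pdf_urls = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # drop fragments
            if link in seen or urlparse(link).netloc != host:
                continue
            seen.add(link)
            if link.lower().endswith(".pdf"):
                pdf_urls.append(link)  # collected for later processing
            else:
                queue.append(link)
    return pages, pdf_urls
```

Passing `fetch` in as a parameter keeps the sketch testable without network access; a real crawler would also need politeness delays and error handling.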
Data Crawling




                11
Data Crawling: Tools




                       12
               Data Classification
_____________________________________________________________




                                                                13
Classification & Extraction

       Kendra Clay
        Eunki Kim
       Victoria Ko
   Bekah Van Maanen


                          14
Classification & Extraction


Classification


    Extraction
                              15
      Classification Goal
 The task for classification is to determine
whether the URL listed for the expert is his
            or her homepage.


                  HTML Text Files


                  Classified Homepages

                                         16
   Classification: Labeling
• Implemented supervised learning
• Collected text from crawled data
• Manually labeled 1300 web pages




                                 17
   Classification: Learning
Create directories for testing, training,
          and unlabeled data.

                         data



         train                  test       unlabeled



  plus           minus   plus          minus

                                               18
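The directory tree on this slide could be created with a few lines such as the following. This is a sketch only; the function name `make_data_dirs` and the choice of a caller-supplied root folder are assumptions, while the folder names follow the slide.

```python
from pathlib import Path


def make_data_dirs(root):
    """Create data/{train,test}/{plus,minus} and data/unlabeled under root."""
    data = Path(root) / "data"
    created = []
    for split in ("train", "test"):
        for label in ("plus", "minus"):  # plus = homepage, minus = not homepage
            directory = data / split / label
            directory.mkdir(parents=True, exist_ok=True)
            created.append(directory)
    unlabeled = data / "unlabeled"
    unlabeled.mkdir(parents=True, exist_ok=True)
    created.append(unlabeled)
    return created
```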
   Classification: Classification
 • Learning Algorithm: Sparse Network Learner
 • Feature: Bag of Words (bigram)
 • 1063 files classified as homepages by
   classifier
 • Accuracy: 87.333%

                       Precision        Recall
Not homepage (minus)     90.000         84.867
Homepage (plus)          85.000         90.667
                                              19
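To make the feature and the reported metrics concrete, here is a small sketch of bigram bag-of-words extraction and of per-class precision and recall. It does not reproduce the Sparse Network Learner itself; the whitespace tokenization and the function names are assumptions for illustration.

```python
from collections import Counter


def bigram_features(text):
    """Bag of adjacent word pairs (bigrams) with their counts."""
    tokens = text.lower().split()  # simplistic tokenization (assumption)
    return Counter(zip(tokens, tokens[1:]))


def precision_recall(gold, predicted, label):
    """Precision and recall for one class label ('plus' or 'minus')."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    predicted_pos = sum(1 for p in predicted if p == label)
    actual_pos = sum(1 for g in gold if g == label)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


def accuracy(gold, predicted):
    """Fraction of pages whose predicted label matches the manual label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)
```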
Classification: Example




                          20
Classification




                 21
Classification: Example




                          22
Classification & Extraction


Classification


    Extraction
                              23
    Information Extraction
• Information (Keyword) Extraction
• Two types: 1. Homepages / 2. Papers
• Output used by Information Retrieval to
  match search query to an expert


                 Expert Text Files


                 Expert Interests
                                       24
               Extraction Task


Expert Text Files (1. HTML code, 2. Parsed Text Files)
   -> Use methods or apply rules
   -> Extract Interests
   -> Expert Interest Text File
                                                       25
  Extraction Task: Challenges
• Various formats of homepages
• Need to set rules to deal with the various
  cases




                                     26
         Extraction Task: Rules
• Step 1: Extraction rules
- Get a big chunk of information
- Example of tokens: Research Areas, Interests,
    Specialization, Areas of Expertise, Field of Study

• Step 2: Iteration rules
 - Find what format it is and refine found information
 - Example of formats: List, Comma, Table, Paragraph, Link


• Repeat Step 1 and Step 2
   until the right part of the information is found
                                                         27
             Extraction Task: Example




Webpage: http://abe.illinois.edu/faculty/M_Hirschi
Page sections: Profile | Research Areas | Courses | Education | Publications

Step 1. Find "Research Areas"
Step 2. Determine the format. Here it is a "List" with <ul>,
        so apply the iteration rule with <li>.

Extracted items:
  Soil erosion and sediment control
  Water quality and management
                                                                              28
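The two-step rule idea can be illustrated with a short sketch. It assumes the page is available as an HTML string and handles only the "List" and "Comma" formats. The regular expressions and the function names are illustrative choices, not the team's actual rules; the heading tokens are the ones listed on the rules slide.

```python
import re
from html import unescape

# Heading tokens from the rules slide.
TOKENS = ["Research Areas", "Interests", "Specialization",
          "Areas of Expertise", "Field of Study"]
TAG = re.compile(r"<[^>]+>")


def strip_tags(fragment):
    """Remove markup and collapse whitespace."""
    return " ".join(unescape(TAG.sub(" ", fragment)).split())


def extract_interests(html):
    # Step 1 (extraction rule): cut a big chunk starting at the first token.
    lowered = html.lower()
    hits = [lowered.find(token.lower()) for token in TOKENS]
    hits = [pos for pos in hits if pos >= 0]
    if not hits:
        return []
    chunk = html[min(hits):]

    # Step 2 (iteration rule): detect the format and refine the chunk.
    ul = re.search(r"<ul\b[^>]*>(.*?)</ul>", chunk, flags=re.I | re.S)
    if ul:  # "List" format: iterate over <li> items
        items = re.findall(r"<li\b[^>]*>(.*?)</li>", ul.group(1), flags=re.I | re.S)
        return [strip_tags(item) for item in items if strip_tags(item)]
    para = re.search(r"<p\b[^>]*>(.*?)</p>", chunk, flags=re.I | re.S)
    if para:  # "Comma" format: split the first paragraph on commas
        return [part.strip() for part in strip_tags(para.group(1)).split(",") if part.strip()]
    return []
```

Table, paragraph, and link formats would each need their own iteration rule in the same style.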
   Extraction Task: Challenges
o Papers include some non-word text
 (e.g., mathematical notation) that may
 be incorrectly identified as keywords
   Solution: take abstracts from papers
o How long should a keyword phrase
 be to be useful in associating it with an
 expert?
   Must define a maximum length
o How can we identify keywords?
    Part-of-speech tags and noun phrases from papers/PDFs
                                          29
                    Extraction Task: Tools

  Illinois Chunker




Above: part of abstract from Pictorial Structures for Object Recognition by P. Felzenszwalb and D. Huttenlocher




  • Noun phrases (NPs) are candidate keywords
  • Calculate a weight for each candidate keyword
  • Take the 10 highest-weighted noun phrases

                                                                                                      30
       Extraction Task: Tools
• Rapid Automatic Keyword Extraction (RAKE)

• Frequency: freq(w) = total number of occurrences
  of word w in the document.

• Degree: deg(w) = (total number of individual
  occurrences of w in the document) + (length of
  each noun phrase w appears in)

• Word score: s(w) = deg(w) / freq(w)

• NP score: np_s(p) = s(w1) + s(w2) + ... + s(wk),
  where p = (w1 w2 ... wk) is a noun phrase and
  s(wi) is the individual word score for word wi
                                             31
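The scoring on this slide can be sketched in a few lines. The slide's wording of the degree is ambiguous, so this sketch assumes the usual RAKE convention: the degree of a word is the summed length of the candidate phrases that contain it (each occurrence of the word plus the words it co-occurs with). The input format, a list of noun phrases already split into words, and the function name are also assumptions.

```python
from collections import defaultdict


def rake_scores(phrases, top_n=10):
    """Score candidate noun phrases; return the top_n as (phrase, score)."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Assumed RAKE convention: add the length of the containing phrase.
            degree[word] += len(phrase)

    # Word score: s(w) = deg(w) / freq(w)
    word_score = {word: degree[word] / freq[word] for word in freq}

    # Noun-phrase score: sum of the scores of its words.
    phrase_score = {}
    for phrase in phrases:
        phrase_score[" ".join(phrase)] = sum(word_score[word] for word in phrase)

    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(phrase_score.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

The default `top_n=10` mirrors the "top 10" cut-off mentioned for the chunker output.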
Classification & Extraction




                              32
                   Topic Modeling
_____________________________________________________________




                                                                33
Topic Modeling Team

    Pradip Karki
   Sam Somuah




                      34
         Topic Modeling
• Goal: To discover latent topics in the
  “bag of words” associated with each expert
• Process: Latent Dirichlet Allocation
  with Gibbs sampling


                  Expert Text Files

                 Distribution of words over
                 topics, topics over experts
                                        35
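For readers unfamiliar with the process, a compact collapsed Gibbs sampler for Latent Dirichlet Allocation might look like the sketch below. It is a textbook-style illustration, not the tool the team used. Documents are assumed to be lists of integer word IDs (one document per expert), and the hyperparameters `alpha` and `beta`, the iteration count, and the function name are illustrative defaults.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Return (theta, phi): topics-over-documents and words-over-topics."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]            # counts n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # counts n(k, w)
    topic_total = [0] * num_topics                          # counts n(k)
    assign = []

    # Random initial topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment, then resample it.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    theta = [[(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
              for t in range(num_topics)] for d, doc in enumerate(docs)]
    phi = [[(topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
            for w in range(vocab_size)] for t in range(num_topics)]
    return theta, phi
```

Here `theta` corresponds to the distribution of topics over experts and `phi` to the distribution of words over topics.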
 Topic Modeling: Motivation
Challenges:
• Large number of documents
• Experts have multiple areas of
  expertise.
Topic Modeling:
• A generative model
• Reduce dimensionality by mapping to a
  limited number of topics
• "Hidden" topics can be discovered
  without the need for labeling.
                                        36
          Topic Modeling
 The Information Retrieval group uses the
  probabilities of the topics, and of the words
  associated with each topic, to retrieve
             relevant results.




                                          37
                    Topic Modeling


Expert Text Files




                                     38
     Topic Modeling: Output
Expert-Topic
ExpertID       TopicID     Prob
3301           87          0.551046848
3301           173         0.127817630
3301           199         0.065532870
3301           176         0.049024193

 Topic-Word
TopicID        Word        Prob
87             data        0.177620899
87             mine        0.014229041
87             algorithm   0.012556624
87             pattern     0.124705841
                                         39
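One plausible way to use these two tables together is to marginalize over topics: P(word | expert) = sum over topics of P(topic | expert) x P(word | topic). The slides do not say how the probabilities are actually combined, so the sketch below is an assumption. It uses only the rows shown above, and the dictionary layout and function name are illustrative.

```python
# Rows copied from the Expert-Topic and Topic-Word excerpts on the slide.
EXPERT_TOPIC = {
    (3301, 87): 0.551046848,
    (3301, 173): 0.127817630,
    (3301, 199): 0.065532870,
    (3301, 176): 0.049024193,
}
TOPIC_WORD = {
    (87, "data"): 0.177620899,
    (87, "mine"): 0.014229041,
    (87, "algorithm"): 0.012556624,
    (87, "pattern"): 0.124705841,
}


def word_given_expert(expert_id, word, expert_topic=EXPERT_TOPIC, topic_word=TOPIC_WORD):
    """Sum P(topic | expert) * P(word | topic) over the expert's topics.

    Topic-word pairs missing from the excerpt contribute 0 here.
    """
    return sum(
        p_topic * topic_word.get((topic_id, word), 0.0)
        for (eid, topic_id), p_topic in expert_topic.items()
        if eid == expert_id
    )
```

With only the excerpt loaded, the score for "data" and expert 3301 comes entirely from topic 87, since the word rows for the other topics are not shown.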
Topic Modeling: Tools




                        40
          Information Retrieval
_____________________________________________________________




                                                                41
Information Retrieval Team

      Sean Massung
          Fei Wu




                        42
     Information Retrieval
Given a user's query, the IR component
   acts as a search engine, ranking
     experts based on relevancy.




               List of experts
               ordered by relevancy

                                      43
          Information Retrieval
 System Flow
 [Diagram: exchange between the UI and the IR component. Labels:
  HTTP POST Request, HTTP GET with Key, Abstract or query, Key,
  Key/List, List, Crawl data, Database]
                                                   44
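The diagram labels suggest a two-step exchange: an HTTP POST carries the abstract or query and yields a key, and a later HTTP GET with that key yields the ranked list. The sketch below mimics that reading in memory, without a real HTTP server. The class name, the method names, and the use of a dictionary in place of the database are all hypothetical.

```python
import uuid


class IRService:
    """In-memory stand-in for the key-based request flow (assumed reading)."""

    def __init__(self, ranker):
        self._ranker = ranker  # function: query text -> ranked list of experts
        self._results = {}     # stands in for the Key/List database table

    def post_query(self, text):
        """Analogue of the HTTP POST: rank, store the list, return a key."""
        key = uuid.uuid4().hex
        self._results[key] = self._ranker(text)
        return key

    def get_results(self, key):
        """Analogue of the HTTP GET with key: return the stored list, if any."""
        return self._results.get(key)
```

Storing the list under a key lets the UI fetch results separately from submitting the query, which is presumably why the flow uses two requests.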
45
Information Retrieval




                        46
             Results
• As expected, longer queries
  produce more accurate results
• LM method is more accurate for
  short queries, whereas the TM
  method performs well on longer
  queries
• Overall, we have good expert recall

                                   47
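The slides do not define the LM method. Assuming it refers to a unigram query-likelihood language model (an assumption, not confirmed by the source), ranking could be sketched as below, with Jelinek-Mercer smoothing against the whole collection. The function name, the whitespace tokenization, and the smoothing weight `lam` are illustrative.

```python
import math
from collections import Counter


def rank_experts(query, expert_texts, lam=0.5):
    """Rank expert IDs by smoothed log-likelihood of the query.

    expert_texts maps expert ID -> text; lam must be in (0, 1].
    """
    query_words = query.lower().split()
    docs = {eid: Counter(text.lower().split()) for eid, text in expert_texts.items()}
    collection = Counter()
    for counts in docs.values():
        collection.update(counts)
    collection_len = sum(collection.values())

    scores = {}
    for eid, counts in docs.items():
        doc_len = sum(counts.values())
        score = 0.0
        for word in query_words:
            if collection[word] == 0:
                continue  # word unseen anywhere: carries no ranking signal
            p_doc = counts[word] / doc_len if doc_len else 0.0
            p_coll = collection[word] / collection_len
            score += math.log((1 - lam) * p_doc + lam * p_coll)
        scores[eid] = score
    return sorted(scores, key=scores.get, reverse=True)
```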
Information Retrieval




                        48
                 User Interface
__________________________________________________________




                                                             49
User Interface Team

  Fitzroy Nembhard

   Jerone Dunbar




                      50
        User Interface
The user interface allows anyone to
  search for experts using paper
abstracts, keywords, or department.

              Abstract, Keywords
              or Department

              Ranked Results

                                   51
                   User Interface
 System Flow
 [Diagram: exchange between the user interface and the IR SYSTEM. Labels:
  HTTP POST Request, HTTP GET with Key, Abstract or query, Key,
  Key/List, List, Crawl data, Database]
                                                       52
User Interface




                 53
User Interface - Ranked List




                           54
User Interface




                 55
   Expert Word Cloud




Above: Screenshot of the word cloud for Jiawei Han
                                                  56
   Distribution of Experts




Above: Distribution of experts across a geographical area, with
the top-ranking experts in red
                                                  57
User Interface: Tools




                        58
                  Summary
• Mobile application
• Web application
• DHS applications
  o   Preparedness
  o   Response and recovery
        - Given a particular event, such as a natural
          disaster, search for experts who can help
          with the situation.
        - Cuts across various sectors of the economy.
        - For individual use in emergencies, such as
          urgent medical conditions.
  o   Each component of the Expert Search engine
      can still be improved to further enhance its
      capabilities.
                                                59
Questions ?




              60

				