Session 12

LBSC 690

Information Technology

Agenda

• The search process



• Information retrieval



• Recommender systems



• Evaluation

Information “Retrieval”

• Find something that you want

– The information need may or may not be explicit



• Known item search

– Find the class home page



• Answer seeking

– Is Lexington or Louisville the capital of Kentucky?



• Directed exploration

– Who makes videoconferencing systems?

Information Retrieval Paradigm

[Diagram. Labels: Search, Browse, Query, Select, Examine, Document, Document Delivery]

Supporting the Search Process

[Diagram. Labels above the pipeline: IR System; Predict, Nominate, Choose]

Source Selection → Query Formulation → (Query) → Search → (Ranked List) → Selection → (Document) → Examination → (Document) → Delivery

Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection

Supporting the Search Process

[Diagram. Label above the pipeline: IR System]

Source Selection → Query Formulation → (Query) → Search → (Ranked List) → Selection → (Document) → Examination → (Document) → Delivery

Acquisition → (Collection) → Indexing → (Index) → Search

Human-Machine Synergy

• Machines are good at:

– Doing simple things accurately and quickly

– Scaling to larger collections in sublinear time



• People are better at:

– Accurately recognizing what they are looking for

– Evaluating intangibles such as “quality”



• Both are pretty bad at:

– Mapping consistently between words and concepts

Search Component Model

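[Diagram]

Query side: Information Need → Query Formulation → Query → Query Processing (Representation Function) → Query Representation

Document side: Document → Document Processing (Representation Function) → Document Representation

Comparison Function (Query Representation, Document Representation) → Retrieval Status Value

Human Judgment (Information Need, Document) → Utility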

Ways of Finding Text

• Searching metadata

– Using controlled or uncontrolled vocabularies





• Free text

– Characterize documents by the words they contain





• Social filtering

– Exchange and interpret personal ratings

“Exact Match” Retrieval

• Find all documents with some characteristic

– Indexed as “Presidents -- United States”

– Containing the words “Clinton” and “Peso”

– Read by my boss





• A set of documents is returned

– Hopefully, not too many or too few

– Usually listed in date or alphabetical order
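
As a rough illustration of the "containing the words" case above, the following Python sketch builds a small inverted index and intersects the posting sets for each query word. The three documents and the function names build_index and match_all are invented for this example; they are not from the lecture.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,;:!?\"'")].add(doc_id)
    return index

def match_all(index, words):
    """Return the ids of documents containing every query word (Boolean AND)."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

# Hypothetical three-document collection, invented for illustration.
docs = {
    "d1": "Clinton discussed the peso crisis.",
    "d2": "Clinton visited Kentucky.",
    "d3": "The peso fell again.",
}
index = build_index(docs)
result = sorted(match_all(index, ["Clinton", "Peso"]))
print(result)  # ['d1']
```

The result is an unranked set: a document either has the characteristic or it does not, which is why the order of presentation has to come from something else, such as date.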

Ranked Retrieval

• Put most useful documents near top of a list

– Possibly useful documents go lower in the list



• Users can read down as far as they like

– Based on what they read, time available, ...



• Provides useful results from weak queries

– Untrained users find exact match harder to use

Similarity-Based Retrieval

• Assume “most useful” = most similar to query



• Weight terms based on two criteria:

– Repeated words are good cues to meaning

– Rarely used words make searches more selective



• Compare weights with query

– Add up the weights for each query term

– Put the documents with the highest total first

Simple Example: Counting Words





Query: recall and fallout measures for information retrieval

Documents:

1: Nuclear fallout contaminated Texas.
2: Information retrieval is interesting.
3: Information retrieval is complicated.

Term counts:

Term          Doc 1  Doc 2  Doc 3  Query
complicated     0      0      1      0
contaminated    1      0      0      0
fallout         1      0      0      1
information     0      1      1      1
interesting     0      1      0      0
nuclear         1      0      0      0
retrieval       0      1      1      1
Texas           1      0      0      0
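
The counting example on this slide can be reproduced with a short Python sketch. It assumes the simplest possible scoring, where every term has weight 1 and a document's score is the sum of its counts for the query's terms. The helper name term_counts is invented here.

```python
from collections import Counter

def term_counts(text):
    """Lowercase the text, strip trailing punctuation, and count each word."""
    return Counter(w.strip(".,").lower() for w in text.split())

documents = {
    1: "Nuclear fallout contaminated Texas.",
    2: "Information retrieval is interesting.",
    3: "Information retrieval is complicated.",
}
query = "recall and fallout measures for information retrieval"

doc_counts = {doc_id: term_counts(text) for doc_id, text in documents.items()}
query_terms = term_counts(query)

# Score = sum of document counts over the query's terms (all weights equal to 1).
scores = {
    doc_id: sum(counts[t] for t in query_terms)
    for doc_id, counts in doc_counts.items()
}
# Highest score first; ties broken by document number.
ranked = sorted(scores, key=lambda d: (-scores[d], d))
print(scores)  # {1: 1, 2: 2, 3: 2}
print(ranked)  # [2, 3, 1]
```

Documents 2 and 3 tie because plain counting cannot tell them apart for this query, which motivates the question of which terms to emphasize.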

Discussion Point:

Which Terms to Emphasize?

• Major factors

– Uncommon terms are more selective

– Repeated terms provide evidence of meaning



• Adjustments

– Give more weight to terms in certain positions

• Title, first paragraph, etc.

– Give less weight to each term in longer documents

– Ignore documents that try to “spam” the index

• Invisible text, excessive use of the “meta” field, …

“Okapi” Term Weights

$$w_{i,j} = \frac{TF_{i,j}}{1.5\,\frac{L_i}{\bar{L}} + TF_{i,j} + 0.5} \times \log\left(\frac{N - DF_j + 0.5}{DF_j + 0.5}\right)$$

The first factor is the TF component; the log factor is the IDF component.

[Figure, left panel: Okapi TF (0.0 to 1.0) versus raw TF (0 to 25), with one curve each for L/L̄ = 0.5, 1.0, and 2.0. Right panel: IDF (4.4 to 6.0) versus raw DF (0 to 25), comparing Classic and Okapi IDF.]
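
The Okapi weight on this slide can be computed with a few lines of Python. This is a sketch under assumptions: the parameter names are invented here, the constants 1.5 and 0.5 are taken from the slide, and the natural logarithm is used because the slide does not state a base.

```python
import math

def okapi_weight(tf, df, n_docs, doc_len, avg_len):
    """Weight of one term in one document.

    tf: occurrences of the term in the document
    df: number of documents containing the term
    n_docs: number of documents in the collection
    doc_len, avg_len: this document's length and the average document length
    """
    # TF component: grows with tf but levels off, and shrinks for long documents.
    tf_part = tf / (1.5 * (doc_len / avg_len) + tf + 0.5)
    # IDF component: larger for terms that appear in few documents.
    idf_part = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf_part * idf_part

# Hypothetical numbers: a 1000-document collection, average-length document.
rare = okapi_weight(tf=2, df=10, n_docs=1000, doc_len=100, avg_len=100)
common = okapi_weight(tf=2, df=400, n_docs=1000, doc_len=100, avg_len=100)
print(rare > common)  # True
```

With these inputs the TF component is 2 / 4 = 0.5 in both calls, so the difference comes entirely from the IDF component.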

Index Quality

• Crawl quality

– Comprehensiveness, dead links, duplicate detection

• Document analysis

– Frames, metadata, imperfect HTML, …

• Document extension

– Anchor text, source authority, category, language, …

• Document restriction (ephemeral text suppression)

– Banner ads, keyword spam, …

Indexing Anchor Text

• A type of “document expansion”

– Terms near links describe content of the target



• Works even when you can’t index content

– Image retrieval, uncrawled links, …
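
A minimal Python sketch of the idea, assuming links are available as (source, target, anchor text) triples. The URLs, page text, and the function name expand_with_anchor_text are hypothetical.

```python
from collections import defaultdict

def expand_with_anchor_text(pages, links):
    """Return target URL -> index terms: the page's own text plus incoming anchor text."""
    terms = defaultdict(list)
    for url, text in pages.items():
        terms[url].extend(text.lower().split())
    for _source, target, anchor in links:
        # Anchor words are credited to the page the link points to.
        terms[target].extend(anchor.lower().split())
    return dict(terms)

# Hypothetical data: one crawled page, and one image whose content cannot be indexed.
pages = {"http://example.org/a": "welcome"}
links = [
    ("http://example.org/a", "http://example.org/photo.jpg", "sunset over Paris"),
    ("http://example.org/b", "http://example.org/a", "class home page"),
]
expanded = expand_with_anchor_text(pages, links)
print(expanded["http://example.org/photo.jpg"])  # ['sunset', 'over', 'paris']
```

The image has no text of its own, yet it becomes findable through the words other pages use when linking to it.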

Queries on the Web (1999)

• Low query construction effort

– 2.35 (often imprecise) terms per query

– 20% use operators

– 22% are subsequently modified



• Low browsing effort

– Only 15% view more than one page

– Most look only “above the fold”

• One study showed that 10% don’t know how to scroll!

Types of User Needs

• Informational (30-40% of AltaVista queries)

– What is a quark?

• Navigational

– Find the home page of United Airlines

• Transactional

– Data: What is the weather in Paris?

– Shopping: Who sells a Vaio Z505RX?

– Proprietary: Obtain a journal article

Searching Other Languages

[Diagram]

Query Formulation → (Query) → Query Translation → (Translated Query) → Search → (Ranked List) → Selection → (Document) → Examination → (Document) → Use

Feedback loop: Query Reformulation

Supporting resources: English Definitions, Translated “Headlines”, MT

Speech Retrieval Architecture



[Diagram]

Speech side: Speech Recognition → Boundary Tagging → Content Tagging

Search side: Query Formulation → Automatic Search → Interactive Selection

Rating-Based Recommendation

• Use ratings to describe objects

– Personal recommendations, peer review, …





• Beyond topicality:

– Accuracy, coherence, depth, novelty, style, …





• Has been applied to many modalities

– Books, Usenet news, movies, music, jokes, beer, …

Using Positive Information

Columns: Small World, Space Mtn, Mad Tea Pty, Dumbo, Speedway, Cntry Bear

Joe D A B D ? ?

Ellen A F D F

Mickey A A A A A A

Goofy D A C

John A C A C A

Ben F A F

Nathan D A A

Using Negative Information

Columns: Small World, Space Mtn, Mad Tea Pty, Dumbo, Speedway, Cntry Bear

Joe D A B D ? ?

Ellen A F D F

Mickey A A A A A A

Goofy D A C

John A C A C A

Ben F A F

Nathan D A A
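
One common way to use both kinds of evidence is correlation-weighted prediction: raters who agree with you pull the prediction toward their ratings, and raters who consistently disagree push it the other way. The Python sketch below is an assumption about how this could be done, not the method from the lecture. The raters, rides, and grades are invented, and the letter-to-number mapping is hypothetical.

```python
import math
from statistics import mean

# Hypothetical mapping from letter grades to points.
GRADE = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def to_points(grades):
    return {item: GRADE[g] for item, g in grades.items()}

def pearson(a, b):
    """Correlation between two raters over the items both have rated."""
    common = [k for k in a if k in b]
    if len(common) < 2:
        return 0.0
    xs = [a[k] for k in common]
    ys = [b[k] for k in common]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def predict(ratings, user, item):
    """Predict a rating as the user's mean plus correlation-weighted deviations."""
    target = ratings[user]
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        w = pearson(target, theirs)  # negative for raters who disagree
        num += w * (theirs[item] - mean(theirs.values()))
        den += abs(w)
    base = mean(target.values())
    return base if den == 0 else base + num / den

# Invented ratings: rater2 agrees with rater1, rater3 disagrees.
ratings = {
    "rater1": to_points({"ride_x": "D", "ride_y": "A"}),
    "rater2": to_points({"ride_x": "D", "ride_y": "A", "ride_z": "A"}),
    "rater3": to_points({"ride_x": "A", "ride_y": "D", "ride_z": "F"}),
}
predicted = predict(ratings, "rater1", "ride_z")
print(round(predicted, 2))  # 3.83
```

Here rater3's low grade for ride_z raises the prediction for rater1, because the two have opposite tastes on the rides they both rated.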

Problems with Explicit Ratings

• Cognitive load on users -- people don’t like to provide ratings

• Rating sparsity -- a number of raters is needed to make recommendations

• No way to detect new items that have not been rated by any user

Implicit Evidence for Ratings

Columns: Segment, Object, Class

Examine: View, Select

Retain: Bookmark, Save, Purchase, Subscribe, Print, Delete

Reference: Cite, Quote, Link, Cut&Paste, Reply, Forward

Interpret: Rate, Annotate, Publish, Organize

Click Streams

• Browsing histories are easily captured

– Send all links to a central site

– Record from and to pages and user’s cookie

– Redirect the browser to the desired page



• Reading time is correlated with interest

– Can be used to build individual profiles

– Used to target advertising by doubleclick.com
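
The three capture steps above can be sketched in Python. Everything named here is hypothetical: the tracker address, the cookie value, and the functions tracked_link and handle_click. A real deployment would answer with an HTTP redirect rather than return a string.

```python
from urllib.parse import urlencode, urlparse, parse_qs

TRACKER = "http://tracker.example/go"  # hypothetical central site

def tracked_link(from_page, to_page):
    """Rewrite a link so that the click goes to the central site first."""
    return TRACKER + "?" + urlencode({"from": from_page, "to": to_page})

def handle_click(url, cookie, log):
    """Record the from page, to page, and cookie, then return the redirect target."""
    params = parse_qs(urlparse(url).query)
    from_page = params["from"][0]
    to_page = params["to"][0]
    log.append((cookie, from_page, to_page))
    return to_page  # the browser would be redirected here

log = []
url = tracked_link("http://a.example/", "http://b.example/x?y=1")
target = handle_click(url, "user42", log)
print(target)  # http://b.example/x?y=1
```

Reading time could then be estimated from the gap between consecutive log entries for the same cookie, assuming timestamps are also recorded.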

Estimating Authority from Links

[Diagram: link graph with pages labeled Hub and Authority]
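
The hub and authority labels suggest an iterative scheme in the style of Kleinberg's HITS, where good hubs point to good authorities and good authorities are pointed to by good hubs. That reading is an assumption, and the Python sketch below uses an invented four-page link graph.

```python
import math

def hits(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: sum(hub[s] for s, ts in links.items() if p in ts) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graph: one page links to two targets, another links to one of them.
links = {"hub_page": ["a1", "a2"], "other_page": ["a1"]}
hub, auth = hits(links)
print(max(auth, key=auth.get))  # a1
print(max(hub, key=hub.get))    # hub_page
```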

Information Retrieval Types









Source: Ayse Goker

Hands On: Try Some Search Engines

• Web Pages (using spatial layout)

– http://kartoo.com/

• Images (based on image similarity)

– http://elib.cs.berkeley.edu/photos/blobworld/

• Multimedia (based on metadata)

– http://singingfish.com

• Movies (based on recommendations)

– http://www.movielens.umn.edu

• Grey literature (based on citations)

– http://citeseer.ist.psu.edu/

Evaluation

• What can be measured that reflects the searcher’s ability to use a system? (Cleverdon, 1966)

– Coverage of Information

– Form of Presentation

– Effort required/Ease of Use

– Time and Space Efficiency

– Effectiveness

• Recall

• Precision

Measures of Effectiveness





[Venn diagram: the Retrieved set and the Relevant set overlap]

$$\text{Precision} = \frac{|Ret \cap Rel|}{|Ret|}$$

$$\text{Recall} = \frac{|Ret \cap Rel|}{|Rel|}$$
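
Both measures follow directly from set operations. The Python sketch below uses invented document ids, and returning 0.0 for an empty set is a convention chosen here, not something the slide specifies.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set and a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    overlap = len(retrieved & relevant)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 3 relevant, 2 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"})
print(p)  # 0.5
```

Here precision is 2/4 and recall is 2/3: half of what came back was relevant, and two thirds of what was relevant came back.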

Precision-Recall Curves

[Figure: precision (vertical axis, 0 to 1) plotted against recall (horizontal axis, 0 to 1)]



Source: Ellen Voorhees, NIST
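
The points on such a curve can be computed by walking down a ranked list and recording recall and precision each time a relevant document is found. This Python sketch assumes that procedure; the ranking and the relevance judgments are invented.

```python
def pr_points(ranking, relevant):
    """Return (recall, precision) at each rank where a relevant document appears."""
    relevant = set(relevant)
    points = []
    found = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    return points

# Hypothetical ranked list; d9 is relevant but never retrieved.
ranking = ["d3", "d1", "d7", "d2", "d5"]
points = pr_points(ranking, {"d3", "d7", "d9"})
for recall, precision in points:
    print(round(recall, 2), round(precision, 2))
```

Because d9 is never retrieved, recall stops at 2/3, and precision falls from 1.0 to 2/3 as the reader moves down the list.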

Affective Evaluation

• Measure stickiness through frequency of use

– Non-comparative, long-term

• Key factors (from cognitive psychology):

– Worst experience

– Best experience

– Most recent experience

• Highly variable effectiveness is undesirable

– Bad experiences are particularly memorable

Other Web Search Quality Factors

• Spam suppression

– “Adversarial information retrieval”

– Every source of evidence has been spammed

• Text, queries, links, access patterns, …





• “Family filter” accuracy

– Link analysis can be very helpful

Summary

• Search is a process engaged in by people



• Human-machine synergy is the key



• Content and behavior offer useful evidence



• Evaluation must consider many factors

