Slides - Max-Planck-Institut für Informatik

Database Researchers:

Plumbers or Thinkers?



Gerhard Weikum

Max Planck Institute for Informatics

http://www.mpi-inf.mpg.de/~weikum/

Acknowledgements

Personal Motivation

ACM SIGMOD Record Interview:

Gerhard Weikum Speaks Out on Why We Should Go for the Grand Challenges,

Why SQL Is Too Powerful, the Myth of Precision,

How to Have a Big Research Group in Germany, and More

Q: Someone suggested that …

you may be being seduced by the

dark side and end up doing AI.

Are you heading in this direction?



A: Mike Stonebraker used to say “This problem is AI-complete”,

meaning the problem was beyond any hope of solution, was science fiction,

was on the same level as “Scotty, beam me up!”.

Now, as I grow older, I think this attitude is wrong.



7. World Memex: Build a system that given a text corpus,

can answer questions about the text and summarize the text

as precisely and quickly as a human expert in that field.

Do the same for music, images, art, and cinema.

This is a demanding task.

It is probably AI Complete, but is an excellent goal …

(Jim Gray: What‘s Next? A Dozen Information-Technology Research Goals, 1999)

Take-Home Message

1970 | 1980 | 1990 | 2000 | 2010 | 2020

B-tree code | B-tree code | parallel B-tree code | distr. B-tree code | manycore B-tree code | megacore B-tree code

index mgt. | multidim. index mgt. | parallel index mgt. | distr. index mgt. | cloud index mgt. | supercloud index mgt.

trans. proc. | query optim. | auto-admin | custom stores & engines | scalable real-time analytics | automatic data integration

universal relation model | parallel DB sys. | knowledge base | Semantic Web | Deep QA | Turing test

Cool Problem: Semantic Queries on Web









www.google.com/squared/


Cool Problem: Deep QA in NL

William Wilkinson's "An Account of the

Principalities of Wallachia and Moldavia"

inspired this author's most famous novel

This town is known as "Sin City" & its

downtown is "Glitter Gulch"

As of 2010, this is the only

former Yugoslav republic in the EU

99 cents got me a 4-pack of Ytterlig

coasters from this Swedish chain



question classification & decomposition

knowledge back-ends (YAGO)

D. Ferrucci et al.: Building Watson: An Overview of the

DeepQA Project. AI Magazine, Fall 2010.

www.ibm.com/innovation/us/watson/index.htm

Cool Problem: Machine Reading

It’s about the disappearance forty years ago of Harriet Vanger, a young scion of one of the wealthiest families in Sweden, and about her uncle, determined to know the truth about what he believes was her murder.

Blomkvist visits Henrik Vanger at his estate on the tiny island of Hedeby. The old man draws Blomkvist in by promising solid evidence against Wennerström. Blomkvist agrees to spend a year writing the Vanger family history as a cover for the real assignment: the disappearance of Vanger's niece Harriet some 40 years earlier. Hedeby is home to several generations of Vangers, all part owners in Vanger Enterprises. Blomkvist becomes acquainted with the members of the extended Vanger family, most of whom resent his presence. He does, however, start a short lived affair with Cecilia, the niece of Henrik. After discovering that Salander has hacked into his computer, he persuades her to assist him with research. They eventually become lovers, but Blomkvist has trouble getting close to Lisbeth who treats virtually everyone she meets with hostility. Ultimately the two discover that Harriet's brother Martin, CEO of Vanger Industries, is secretly a serial killer.

A 24-year-old computer hacker sporting an assortment of tattoos and body piercings supports herself by doing deep background investigations for Dragan Armansky, who, in turn, worries that Lisbeth Salander is “the perfect victim for anyone who wished her ill.”

Relation labels annotated on the text: same, owns, uncleOf, hires, enemyOf, affairWith, headOf

O. Etzioni, M. Banko, M.J. Cafarella: Machine Reading, AAAI ’06

T. Mitchell et al.: Populating the Semantic Web by Macro-Reading Internet Text, ISWC’09

Outline



✓ Intro: Motivation & Cool Problems



From Data Mining to Knowledge Harvesting



From Snapshots to Eternity

From Record Linkage to NL Disambiguation



Wrap-up









...

Goal: Turn Web into Knowledge Base

Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009









comprehensive DB of human knowledge

• everything that Wikipedia knows

• everything machine-readable

• capturing entities, classes, relationships

Approach: Harvesting Facts from Web

YAGO2: 10 Mio. entities, 500 000 classes, 300 Mio. facts for 100 relations, 100 languages, 95% accuracy

Politician | Political Party
Angela Merkel | CDU
Karl-Theodor zu Guttenberg | CDU
Christoph Hartmann | FDP

PoliticalParty | Spokesperson
CDU | Philipp Wachholz
Die Grünen | Claudia Roth

Politician | Position |
Angela Merkel | Chancellor | Germany
Karl-Theodor zu Guttenberg | Minister of Defense | Germany
Christoph Hartmann | Minister of Economy | Saarland

Company | AcquiredCompany
Google | YouTube
Yahoo | Overture
Facebook | FriendFeed
Software AG | IDS Scheer
…

Company | CEO
Google | Eric Schmidt
…

Movie | ReportedRevenue
Avatar | $ 2,718,444,933
The Reader | $ 108,709,522

Actor | Award
Christoph Waltz | Oscar
Sandra Bullock | Oscar
Sandra Bullock | Golden Raspberry
…

SUMO, YAGO-NAGA, IWP, Cyc, TextRunner, WikiTax2WordNet, ReadTheWeb

Knowledge in a KB

• facts / assertions: bornIn (GretaGarbo, Stockholm),

hasWon (GretaGarbo, AcademyAward),

playedRole (GretaGarbo, MataHari), livedIn (GretaGarbo, Klosters), …

• taxonomic: instanceOf (GretaGarbo, actress),

subclassOf (actress, artist), …

• lexical / terminology: means (“Big Apple”, NewYorkCity),

means (“Apple”, AppleComputerCorporation),

means (“MS”, Microsoft), means (“MS”, MultipleSclerosis) …

• common-sense properties:

apples are green, red, juicy, sweet, sour … - but not fast, smart …

balls are round, smooth, slippery … - but not square, funny …

• common-sense axioms:

∀x: human(x) ⇒ male(x) ∨ female(x)

∀x: (male(x) ⇒ ¬female(x)) ∧ (female(x) ⇒ ¬male(x))

∀x: animal(x) ⇒ (hasLegs(x) ⇒ isEven(numberOfLegs(x))) …

• procedural: how to fix/install/prepare/remove …

• epistemic / beliefs: believes (Ptolemy, shape(Earth, disc)),

believes (Copernicus, shape(Earth, sphere)) …
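
The factual, taxonomic, and lexical entries listed on this slide can be pictured as relation-subject-object triples. The following is a minimal sketch in Python, assuming a plain in-memory set of tuples; the helper names `objects`, `superclasses`, `is_instance_of`, and `meanings` are hypothetical and not part of YAGO or any other system named in the talk.

```python
# Toy knowledge base: each fact is a (relation, subject, object) triple.
# The facts come from the slide; the helper functions are illustrative only.
FACTS = {
    ("bornIn", "GretaGarbo", "Stockholm"),
    ("hasWon", "GretaGarbo", "AcademyAward"),
    ("instanceOf", "GretaGarbo", "actress"),
    ("subclassOf", "actress", "artist"),
    ("means", "Big Apple", "NewYorkCity"),
    ("means", "MS", "Microsoft"),
    ("means", "MS", "MultipleSclerosis"),
}


def objects(relation, subject):
    """All objects o such that relation(subject, o) is in the KB."""
    return {o for (r, s, o) in FACTS if r == relation and s == subject}


def superclasses(cls):
    """All classes reachable from cls via subclassOf (transitive closure)."""
    seen, stack = set(), [cls]
    while stack:
        for parent in objects("subclassOf", stack.pop()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_instance_of(entity, cls):
    """True if entity is a direct or inherited instance of cls."""
    direct = objects("instanceOf", entity)
    return cls in direct or any(cls in superclasses(d) for d in direct)


def meanings(name):
    """Candidate entities for an ambiguous name."""
    return objects("means", name)
```

Here `is_instance_of("GretaGarbo", "artist")` holds only through the subclassOf link, and `meanings("MS")` returns both candidates, which is the ambiguity the later disambiguation slides address.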

...

Knowledge for Intelligence

• entity recognition & disambiguation

• understanding natural language & speech

• knowledge services & reasoning for semantic apps

(e.g. deep QA)

• semantic search: precise answers to advanced queries

(by scientists, students, journalists, analysts, etc.)



Swedish king‘s wife when Greta Garbo died?

FIFA 2010 finalists who played in a Champions League final?

Politicians who are also scientists?

Relationships between

Max Planck, Angela Merkel, Jim Gray, and the Dalai Lama?

Enzymes that inhibit HIV?

Influenza drugs for teens with high blood pressure?

...

KB Growth, Dynamics, Life-Cycle

Great mileage from semistructured data:

(infoboxes, category systems, tables, lists, etc.)



YAGO, Dbpedia, Freebase, Trueknowledge, etc.:

Billions of facts about tens of millions of entities,

for thousands of relations



But: "To know that we know what we know, and

that we do not know what we do not know,

that is true knowledge."

Confucius,

551-479 BC

Most new & interesting facts/statements are in:

news, blogs, forums, tabloids,

essays, books, scientific papers, …

→ knowledge harvesting from natural-language text!

→ KB needs continuous updates & long-term mgt.!

French Marriage Problem









facts in KB:

married (Hillary, Bill)
married (Carla, Nicolas)
married (Angelina, Brad)

new facts or fact candidates:

married (Cecilia, Nicolas)
married (Carla, Benjamin)
married (Carla, Mick)
married (Michelle, Barack)
married (Yoko, John)
married (Kate, Leonardo)
married (Carla, Sofie)
married (Larry, Google)



1) for recall: pattern-based harvesting

2) for precision: consistency reasoning

Pattern-Based Harvesting

(Hearst 92, Brin 98, Agichtein 00, Etzioni 04, …)

Facts & Fact Candidates:

(Hillary, Bill), (Carla, Nicolas), (Angelina, Brad), (Victoria, David), (Yoko, John), (Kate, Pete), (Carla, Benjamin), (Larry, Google), …

Patterns:

X and her husband Y
X and Y on their honeymoon
X and Y and their children
X has been dating with Y
X loves Y
…

• good for recall

• noisy, drifting

• not robust enough for high precision
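
The fact-pattern loop on this slide can be sketched as follows. This is a rough illustration in Python, not the method of any of the cited papers: the corpus sentences, the `NAME` regular expression, and the function names `induce_patterns` and `apply_patterns` are all assumptions made for the example.

```python
import re

# Seed facts already in the KB (taken from the slide).
SEEDS = {("Hillary", "Bill"), ("Carla", "Nicolas")}

# Tiny hypothetical corpus; real input would be Web text.
CORPUS = [
    "Hillary and her husband Bill attended the dinner.",
    "Carla and her husband Nicolas visited Berlin.",
    "Carla loves Nicolas.",
    "Victoria and her husband David moved to Los Angeles.",
    "Larry loves Google.",
    "Yoko loves John.",
]

NAME = r"([A-Z][a-z]+)"  # crude stand-in for named-entity detection


def induce_patterns(corpus, seeds):
    """Turn sentences that mention a seed pair into 'X <infix> Y' patterns."""
    patterns = set()
    for sentence in corpus:
        for x, y in seeds:
            m = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
            if m:
                patterns.add(m.group(1))  # the text between X and Y
    return patterns


def apply_patterns(corpus, patterns):
    """Collect every (X, Y) pair that occurs with one of the patterns."""
    candidates = set()
    for sentence in corpus:
        for infix in patterns:
            for m in re.finditer(NAME + re.escape(infix) + NAME, sentence):
                candidates.add((m.group(1), m.group(2)))
    return candidates


patterns = induce_patterns(CORPUS, SEEDS)
candidates = apply_patterns(CORPUS, patterns) - SEEDS
```

With this toy corpus the pattern " loves " is induced from a true seed and then also produces (Larry, Google), which illustrates the slide's point that patterns help recall but are noisy.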

Reasoning about Fact Candidates

Use consistency constraints to prune false candidates

FOL rules (restricted):

spouse(x,y) ∧ diff(y,z) ⇒ ¬spouse(x,z)
spouse(x,y) ∧ diff(w,x) ⇒ ¬spouse(w,y)
spouse(x,y) ⇒ f(x)
spouse(x,y) ⇒ m(y)
spouse(x,y) ⇒ (f(x) ∧ m(y)) ∨ (m(x) ∧ f(y))

ground atoms:

spouse(Hillary,Bill), spouse(Carla,Nicolas), spouse(Cecilia,Nicolas), spouse(Carla,Ben), spouse(Carla,Mick), spouse(Carla,Sofie)
f(Hillary), f(Carla), f(Cecilia), f(Sofie)
m(Bill), m(Nicolas), m(Ben), m(Mick)

Rules reveal inconsistencies

Find consistent subset(s) of atoms (“possible world(s)”, “the truth”)

Rules can be weighted

(e.g. by fraction of ground atoms that satisfy a rule)

→ uncertain / probabilistic data

→ compute prob. distr. of subset of atoms being the truth
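
To make the pruning step concrete, here is a small sketch in Python that checks the slide's ground atoms against the uniqueness and gender rules and enumerates the largest consistent subsets. The exhaustive search and the names `consistent` and `largest_consistent_subsets` are assumptions for illustration; it only works for a handful of atoms.

```python
from itertools import combinations

# Candidate spouse atoms and gender atoms from the slide.
CANDIDATES = [
    ("Hillary", "Bill"), ("Carla", "Nicolas"), ("Cecilia", "Nicolas"),
    ("Carla", "Ben"), ("Carla", "Mick"), ("Carla", "Sofie"),
]
FEMALE = {"Hillary", "Carla", "Cecilia", "Sofie"}
MALE = {"Bill", "Nicolas", "Ben", "Mick"}


def consistent(atoms):
    """Check a set of spouse(x, y) atoms against the rules."""
    for x, y in atoms:
        # spouse(x,y) => f(x)  and  spouse(x,y) => m(y)
        if x not in FEMALE or y not in MALE:
            return False
    for (x1, y1), (x2, y2) in combinations(atoms, 2):
        # at most one spouse per person, in either argument position
        if x1 == x2 or y1 == y2:
            return False
    return True


def largest_consistent_subsets(candidates):
    """Exhaustively find all consistent subsets of maximum size."""
    for size in range(len(candidates), 0, -1):
        found = [set(c) for c in combinations(candidates, size) if consistent(c)]
        if found:
            return found
    return [set()]


worlds = largest_consistent_subsets(CANDIDATES)
```

Several subsets tie at the maximum size, so unweighted rules alone do not single out one "truth"; this is where rule weights come in.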

Markov Logic Networks (MLN‘s)

(M. Richardson / P. Domingos 2006)

Map logical constraints & fact candidates

into probabilistic graph model: Markov Random Field (MRF)

s(x,y) ∧ diff(y,z) ⇒ ¬s(x,z)    s(x,y) ⇒ f(x)    f(x) ⇒ ¬m(x)
s(x,y) ∧ diff(w,x) ⇒ ¬s(w,y)    s(x,y) ⇒ m(y)    m(x) ⇒ ¬f(x)

ground atoms: s(Carla,Nicolas), s(Cecilia,Nicolas), s(Carla,Ben), s(Carla,Sofie), …

Grounding: Literal → Boolean Var, Literal → binary RV

¬s(Ca,Nic) ∨ ¬s(Ce,Nic)
¬s(Ca,Nic) ∨ ¬s(Ca,Ben)    ¬s(Ca,Nic) ∨ m(Nic)
¬s(Ca,Nic) ∨ ¬s(Ca,So)     ¬s(Ce,Nic) ∨ m(Nic)
¬s(Ca,Ben) ∨ ¬s(Ca,So)     ¬s(Ca,Ben) ∨ m(Ben)
                           ¬s(Ca,So) ∨ m(So)


MRF nodes: s(Ce,Nic), s(Ca,Nic), s(Ca,Ben), s(Ca,So), m(Nic), m(Ben), m(So), …

RVs coupled by MRF edge if they appear in same clause

MRF assumption: P[Xi|X1..Xn]=P[Xi|N(Xi)]

joint distribution has product form over all cliques

Variety of algorithms for joint inference:

Gibbs sampling, other MCMC, belief propagation, randomized MaxSat, …
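
For a feel of what the product-form distribution looks like, the sketch below scores every possible world of three ground atoms by exp(sum of weights of satisfied clauses) and normalizes. It is an illustration only: the clause weights are hypothetical, and exact enumeration is used here instead of the sampling and propagation methods named on the slide, which are what one would need at realistic sizes.

```python
from itertools import product
from math import exp

VARS = ["s(Ca,Nic)", "s(Ce,Nic)", "s(Ca,Ben)"]

# Each clause: (weight, [(variable, wanted truth value), ...]).
# A clause is satisfied if at least one literal has its wanted value.
# All weights are made up for illustration.
CLAUSES = [
    (2.0, [("s(Ca,Nic)", False), ("s(Ce,Nic)", False)]),  # not both spouses of Nic
    (2.0, [("s(Ca,Nic)", False), ("s(Ca,Ben)", False)]),  # Ca has at most one spouse
    (1.5, [("s(Ca,Nic)", True)]),                          # evidence for each candidate
    (1.0, [("s(Ce,Nic)", True)]),
    (0.5, [("s(Ca,Ben)", True)]),
]


def score(world):
    """Unnormalized probability: exp of the total weight of satisfied clauses."""
    total = sum(w for w, lits in CLAUSES
                if any(world[v] == val for v, val in lits))
    return exp(total)


def marginals():
    """Exact marginal probability of each atom by enumerating all worlds."""
    worlds = [dict(zip(VARS, vals))
              for vals in product([False, True], repeat=len(VARS))]
    z = sum(score(w) for w in worlds)  # partition function
    return {v: sum(score(w) for w in worlds if w[v]) / z for v in VARS}


probs = marginals()
```

Enumeration grows as 2^n in the number of ground atoms, which is why approximate inference is the practical route.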

Reasoning for KB Growth: Direct Route

(F. Suchanek et al.: WWW‘09)

facts in KB:

married (Hillary, Bill)
married (Carla, Nicolas)
married (Angelina, Brad)

new fact candidates:

married (Cecilia, Nicolas)
married (Carla, Benjamin)
married (Carla, Mick)
married (Carla, Sofie)
married (Larry, Google)

patterns:

X and her husband Y
X and Y and their children
X has been dating with Y
X loves Y

Direct approach:

1. facts are true; fact candidates & patterns → hypotheses;
grounded constraints → clauses with hypotheses as vars

2. type signatures of relations greatly reduce #clauses

3. cast into Weighted Max-Sat with weights from pattern stats;
customized approximation algorithm

unifies: fact cand consistency, pattern goodness, entity disambig.

www.mpi-inf.mpg.de/yago-naga/sofie/
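
The casting into Weighted Max-Sat can be illustrated on a toy instance. The sketch below is not SOFIE's algorithm (the slide mentions a customized approximation algorithm); it simply tries every assignment. The variable names, the clause weights, and the (Victoria, David) candidate are hypothetical choices for the example.

```python
from itertools import product

# Hypotheses (Boolean variables): two pattern hypotheses, two fact candidates.
VARS = [
    "expresses(husband)", "expresses(loves)",
    "married(Victoria,David)", "married(Carla,Benjamin)",
]

# Weighted clauses: (weight, [(variable, wanted truth value), ...]).
# Weights are made up; in the slide they come from pattern statistics.
CLAUSES = [
    (3.0, [("expresses(husband)", True)]),   # pattern often seen with KB facts
    (1.0, [("expresses(loves)", True)]),     # pattern rarely seen with KB facts
    # pattern occurrence with a pair: expresses(p) => married(x, y)
    (2.0, [("expresses(husband)", False), ("married(Victoria,David)", True)]),
    (2.0, [("expresses(loves)", False), ("married(Carla,Benjamin)", True)]),
    # married(Carla, Nicolas) is in the KB, so Carla has no second spouse
    (10.0, [("married(Carla,Benjamin)", False)]),
]


def satisfied_weight(assignment):
    """Total weight of the clauses satisfied by a truth assignment."""
    return sum(w for w, lits in CLAUSES
               if any(assignment[v] == val for v, val in lits))


def weighted_max_sat():
    """Exhaustive search over all assignments (fine for a toy instance only)."""
    assignments = (dict(zip(VARS, vals))
                   for vals in product([False, True], repeat=len(VARS)))
    best = max(assignments, key=satisfied_weight)
    return best, satisfied_weight(best)


best, weight = weighted_max_sat()
```

In the optimum the well-supported pattern is accepted together with its new fact, while the weak pattern is rejected because accepting it would force a fact that violates the functional constraint. A candidate such as (Larry, Google) would not even produce a clause once type signatures are checked.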

Facts & Patterns Consistency with SOFIE

(F. Suchanek et al.: WWW’09)

constraints to connect facts, fact candidates, patterns

pattern-fact duality:

occurs(p,x,y) ∧ expresses(p,R) ∧ type(x)=dom(R) ∧ type(y)=rng(R) ⇒ R(x,y)

occurs(p,x,y) ∧ R(x,y) ∧ type(x)=dom(R) ∧ type(y)=rng(R) ⇒ expresses(p,R)

name(-in-context)-to-entity mapping:

¬means(n,e1) ∨ ¬means(n,e2) ∨ …

functional dependencies:

spouse(X,Y): X → Y, Y → X

relation properties:

asymmetry, transitivity, acyclicity, …

type constraints, inclusion dependencies:

spouse ⊆ Person × Person
capitalOfCountry ⊆ cityOfCountry

domain-specific constraints:

bornInYear(x) + 10years ≤ graduatedInYear(x)

hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < …

… 95% accuracy, >95% coverage, in one night

1) recall: gather temporal scopes for base facts

2) precision: reason on mutual consistency









consistency constraints are potentially helpful:

• functional dependencies: husband, time → wife

• inclusion dependencies: marriedPerson ⊆ adultPerson

• age/time/gender restrictions: birthdate + Δ < marriage < divorce
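
The age/time restriction in the last bullet is simple enough to check directly. The sketch below works on years only and assumes a minimum-age offset of 16 years; that value, the tuple layout, and the function name `violations` are hypothetical, since the slide does not specify them.

```python
MIN_MARRIAGE_AGE = 16  # hypothetical offset in years; not given on the slide


def violations(birth_year, marriages):
    """Return the spouses whose marriage breaks birth + offset < marriage < divorce.

    `marriages` is a list of (spouse, marriage_year, divorce_year or None).
    """
    bad = []
    for spouse, married, divorced in marriages:
        too_young = not (birth_year + MIN_MARRIAGE_AGE < married)
        wrong_order = divorced is not None and not (married < divorced)
        if too_young or wrong_order:
            bad.append(spouse)
    return bad
```

A candidate that fails such a check can be dropped, or down-weighted, before the more expensive joint reasoning runs.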

Difficult Dating

(Even More Difficult) Implicit Dating

explicit dates vs.

implicit dates relative to other dates

(Even More Difficult) Relative Dating

vague dates

relative dates









narrative text

relative order

Framework for T-Fact Extraction

(M. Theobald et al.: MUD’10, Y. Wang et al.: EDBT’10)





1) represent temporal scopes of facts

in the presence of incompleteness and uncertainty





2) gather & filter candidates for t-facts:

extract base facts R(e1, e2) first; then

focus on sentences with e1, e2 and date or temporal phrase



3) aggregate & reconcile evidence from observations





4) reason on joint constraints about facts and time scopes

Joint Reasoning on Facts and T-Facts

(M. Dylla et al.: BTW’11)



Combine & reconcile t-scopes across different facts

constraint:

marriedTo (m) is an injective function at any given point

∀ X, Y, Z, T1, T2:

m(X,Y) ∧ m(X,Z) ∧ validTime(m(X,Y),T1) ∧ validTime(m(X,Z),T2)

⇒ ¬overlaps(T1, T2)

after grounding:

m(Carla, Nicolas) ∧ m(Cecilia, Nicolas) ⇒ ¬overlaps([2008,2010], [1996,2007])

overlaps is false: m(Ca,Nic) and m(Ce,Nic) can both hold

m(Carla, Nicolas) ∧ m(Carla, Benjamin) ⇒ ¬overlaps([2008,2010], [2009,2011])

overlaps is true: m(Ca,Nic) and m(Ca,Ben) conflict


Joint Reasoning on Facts and T-Facts

t-facts with valid-time scopes and evidence weights:

m(Ca, Mi)   [2004,2008]   30
m(Ca, Ben)  [2009,2011]   10
m(Ce, Nic)  [1996,2007]   100
m(Ca, Nic)  [2008,2010]   80
m(Ce, Mi)   [1998,2005]   20

Conflict graph:

Find maximal independent set:

subset of nodes w/o adjacent pairs, with (evidence-)weighted nodes
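
The conflict graph and the independent-set step can be sketched in a few lines of Python. The intervals are the ones on the slide; the pairing of evidence weights with nodes is a reading of the figure and should be treated as an assumption, as should the closed-interval overlap test and the exhaustive search, which only scales to a handful of t-facts.

```python
from itertools import combinations

# t-facts: (x, y) -> (valid-time interval, evidence weight).
# Weights as they appear to be attached to the nodes in the figure.
T_FACTS = {
    ("Ca", "Mi"): ((2004, 2008), 30),
    ("Ca", "Ben"): ((2009, 2011), 10),
    ("Ca", "Nic"): ((2008, 2010), 80),
    ("Ce", "Nic"): ((1996, 2007), 100),
    ("Ce", "Mi"): ((1998, 2005), 20),
}


def overlaps(a, b):
    """Closed intervals overlap if each starts no later than the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]


def conflict_edges(t_facts):
    """Two marriages conflict if they share a person and their scopes overlap."""
    edges = set()
    for f, g in combinations(t_facts, 2):
        shares_person = f[0] == g[0] or f[1] == g[1]
        if shares_person and overlaps(t_facts[f][0], t_facts[g][0]):
            edges.add(frozenset((f, g)))
    return edges


def max_weight_independent_set(t_facts):
    """Exhaustive search for the heaviest conflict-free subset of t-facts."""
    edges = conflict_edges(t_facts)
    facts = list(t_facts)
    best, best_weight = set(), 0
    for size in range(1, len(facts) + 1):
        for subset in combinations(facts, size):
            if any(frozenset(pair) in edges for pair in combinations(subset, 2)):
                continue  # contains two adjacent nodes
            weight = sum(t_facts[f][1] for f in subset)
            if weight > best_weight:
                best, best_weight = set(subset), weight
    return best, best_weight


chosen, total = max_weight_independent_set(T_FACTS)
```

Under these assumptions the two heavily supported t-facts about Nicolas survive together, because their scopes do not overlap, and the three weaker candidates are pruned.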

Joint Reasoning on Facts and T-Facts

alternative approach:

split t-scopes and reason on

consistency of t-fact partitions

(figure: timeline of the t-scopes of m(Ca, Mi), m(Ca, Ben), m(Ca, Nic), m(Ce, Nic), m(Ce, Mi) along the time axis)

Outline



✓ Intro: Motivation & Cool Problems

✓ From Data Mining to Knowledge Harvesting

✓ From Snapshots to Eternity

From Record Linkage to NL Disambiguation



Wrap-up

Record Linkage (Entity Resolution)

record 1 | record 2 | record 3 | … | record N
Susan B. Davidson | O.P. Buneman | P. Baumann | … | Y. Davidson
Peter Buneman | S. Davison | S. Davidson | … | Sean Penn
Yi Chen | Y. Chen | Cheng Y. | … | S. Chen
University of Pennsylvania | U Penn | Penn State | … | Penn Station
Issues in … | Issues in … | Issues in … | … | Issues in …
Int. Conf. on Very Large Data Bases | VLDB Conf. | PVLDB | … | XLDB Conference



Find equivalence classes of entities, and records, based on:

• similarity of values (edit distance, n-gram overlap, etc.)

• joint agreement of linkage

→ similarity joins, grouping/clustering, collective learning, etc.
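
The value-similarity part of record linkage can be sketched as follows: Levenshtein edit distance, Jaccard overlap of character n-grams, and a naive grouping that links any two names within a distance threshold. The threshold, the use of Jaccard for "n-gram overlap", and the function names are assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def ngrams(s: str, n: int = 2) -> set:
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_overlap(a: str, b: str, n: int = 2) -> float:
    """Jaccard coefficient of the character n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if (ga or gb) else 1.0


def link(names, max_dist=1):
    """Group names whose edit distance is within max_dist (union-find)."""
    parent = {x: x for x in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if edit_distance(a, b) <= max_dist:
                parent[find(a)] = find(b)
    groups = {}
    for x in names:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


groups = link(["S. Davidson", "S. Davison", "Y. Davidson", "Sean Penn"])
```

Note that string similarity alone also merges "Y. Davidson" with "S. Davidson", which may well be a different person; this is why the slide lists joint agreement across the other fields as a second criterion.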





Halbert L. Dunn: Record Linkage. American Journal of Public Health. 1946

H.B. Newcombe et al.: Automatic Linkage of Vital Records. Science, 1959.

Linked Data: Record Linkage at Web Scale









Source: Christian Bizer, Tom Heath, Tim Berners-Lee, Michael Hausenblas,

WWW 2010 Workshop on Linked Data on the Web

linkeddata.org

Linked Data: Record Linkage at Web Scale

yago/wordnet:Artist 109812338



yago/wordnet:Movie 106613686

yago/wikicategory:SwedishFilmDirectors





imdb.com/title/tt0050986/



dbpedia.org/resource/Ingmar_Bergman



dbpedia.org/resource/Woody_Allen



dbpedia.org/resource/David_Lynch

dbpedia.org/resource/Uppsala









rdf.freebase.com/ns/Uppsala

data.nytimes.com/lynch_david_per

data.nytimes.com/uppsala_sweden_geo

quotationsbook.com/author/4561

sws.geonames.org/2666199/

N 59° 51' 30'' E 17° 38' 43''

need referential data quality for Linked Data:

automatic & dynamic !

Named-Entity Disambiguation in Text



Harry fought with you know who. He defeats the dark lord.





Candidate entities: Dirty Harry, Harry Potter, Prince Harry of England, The Who (band), Lord Voldemort





Three NLP tasks:

1) named-entity detection: segment & label by HMM or CRF

(e.g. Stanford NER tagger)

2) co-reference resolution: link to preceding NP

(trained classifier over linguistic features)

3) named-entity disambiguation:

map each mention (name) to canonical entity (entry in KB)

Mentions, Meanings, Mappings

Text with mentions:

Agnetha, Björn, Benny, and Anni-Frid were Sweden's most successful pop music group. Their greatest hits were Waterloo and Mamma Mia.

Candidate entities:

Agnetha Qvarnström, Agnetha Fältskog, Benny Goodman, Benny Andersson, Battle of Waterloo, Waterloo Station, Waterloo (song)

KB:

Agnetha means Agnetha Fältskog
Agnetha means Agnetha Munther
Agnetha means Agnetha Qvarnström
Björn means Björn Borg
Björn means Björn Ulvaeus
Björn means Björn the Viking
Benny means Benny Goodman
Benny means Benny Andersson
Waterloo means Battle of Waterloo
Waterloo means Waterloo (Ontario)
Waterloo means Waterloo Station
Waterloo means Waterloo (song)

Mention-Entity Graph

weighted undirected graph with two types of nodes

Mentions (text):

Agnetha, Björn, Benny, and Anni-Frid were Sweden's most successful pop music group. Their greatest hits were Waterloo and Mamma Mia.

Candidate entities (KB+Stats):

Agnetha Q., Agnetha F., Benny G., Benny A., B. Waterloo, Waterloo St., Waterloo (s)

Edge weights:

Popularity (m,e):
• freq(m,e|m)
• length(e)
• #links(e)

Similarity (m,e):
• cos/Dice/KL (context(m), context(e))

Coherence (e,e'):
• dist(types)
• overlap(links)
• overlap (anchor words)

Entity context used for coherence:

Agnetha F.
• types: Swedish female singers, people from Jönköping, singers, musicians
• links: http://.../wiki/ABBA, http://.../wiki/Anni-Frid_Lyngstad, http://.../wiki/Jönköping, http://.../wiki/Eurovision_Song_Con
• anchor words: pop group ABBA, best-selling music artist in history, Melodifestivalen, The Winner Takes It All

Benny A.
• types: Swedish songwriters, people from Stockholm, composers, musicians
• links: http://.../wiki/ABBA, http://.../wiki/Anni-Frid_Lyngstad, http://.../wiki/Mamma_Mia!, http://.../wiki/Agnetha_Fältskog
• anchor words: pop group ABBA, Grammy Award nomination, Melodifestivalen, Mamma Mia!

Waterloo (s)
• types: ABBA songs, #1 chart singles, songs, artifacts
• links: http://.../wiki/ABBA, http://.../wiki/Eurovision_Song_Con, http://.../wiki/Mamma_Mia!
• anchor words: Agnetha Fältskog, Benny Andersson, number-one single in Norway, Mamma Mia!
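
Two of the edge-weight features can be sketched directly: cosine similarity between bag-of-words contexts for mention-entity edges, and link overlap for entity-entity coherence. In this Python sketch the link sets for the ABBA-related entities follow the slide (with the truncated title kept as shown), while the links for Benny G., the choice of Jaccard as the overlap measure, and the function names are assumptions.

```python
from collections import Counter
from math import sqrt

# Outgoing links per candidate entity. The Benny G. entry is hypothetical.
LINKS = {
    "Agnetha F.": {"ABBA", "Anni-Frid_Lyngstad", "Jönköping", "Eurovision_Song_Con"},
    "Benny A.": {"ABBA", "Anni-Frid_Lyngstad", "Mamma_Mia!", "Agnetha_Fältskog"},
    "Waterloo (s)": {"ABBA", "Eurovision_Song_Con", "Mamma_Mia!"},
    "Benny G.": {"Swing_music", "Clarinet"},
}


def cosine(words_a, words_b):
    """Cosine similarity of two bag-of-words contexts."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def coherence(e1, e2):
    """Link overlap between two entities, measured as a Jaccard coefficient."""
    l1, l2 = LINKS[e1], LINKS[e2]
    return len(l1 & l2) / len(l1 | l2) if (l1 or l2) else 0.0
```

With these sets, `coherence("Agnetha F.", "Benny A.")` is positive while `coherence("Benny G.", "Benny A.")` is zero, so the coherence edges pull the joint interpretation towards the ABBA members.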

Different Approaches

Combine Popularity, Similarity, and Coherence Features

(Cucerzan: EMNLP‘07, Milne/Witten: CIKM‘08):

• for sim (context(m), context(e)):

consider surrounding mentions

and their candidate entities

• use their types, links, anchors

as features of context(m)

• set m-e edge weights accordingly

• use greedy methods for solution



Collective Learning with Prob. Factor Graphs

(Chakrabarti et al.: KDD‘09):

• model P[m|e] by similarity and P[e1|e2] by coherence

• consider likelihood of P[m1 … mk | e1 … ek]

• factorize by all m-e pairs and e1-e2 pairs

• use hill-climbing, LP, etc. for solution

Graph Algorithms with Online DB



(figure: mention-entity graph annotated with edge weights)





• Build mention-entity graph and compute edge weights

from knowledge and statistics in online DB

• Compute dense subgraph (e.g., high edge weight) such that:

each m is connected to exactly one e (or at most one e)
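
As an illustration of the selection step, the sketch below picks exactly one entity per mention so that the sum of the chosen mention-entity weights and the coherence weights among the chosen entities is as large as possible. All weights are hypothetical, and the exhaustive search over combinations is a stand-in: the slide does not say which dense-subgraph algorithm is used, and enumeration only works for a few mentions.

```python
from itertools import product

# Hypothetical mention-entity edge weights (popularity/similarity).
ME_WEIGHT = {
    "Benny": {"Benny G.": 0.6, "Benny A.": 0.4},
    "Waterloo": {"B. Waterloo": 0.5, "Waterloo St.": 0.2, "Waterloo (s)": 0.3},
}

# Hypothetical entity-entity edge weights (coherence); missing pairs count as 0.
EE_WEIGHT = {frozenset(("Benny A.", "Waterloo (s)")): 0.9}


def disambiguate(me, ee):
    """Choose one entity per mention, maximizing total edge weight."""
    mentions = list(me)
    best, best_score = None, float("-inf")
    for choice in product(*(me[m] for m in mentions)):
        score = sum(me[m][e] for m, e in zip(mentions, choice))
        score += sum(ee.get(frozenset((a, b)), 0.0)
                     for i, a in enumerate(choice) for b in choice[i + 1:])
        if score > best_score:
            best, best_score = dict(zip(mentions, choice)), score
    return best


mapping = disambiguate(ME_WEIGHT, EE_WEIGHT)
```

Taken alone, each mention would go to its most popular candidate (Benny G., B. Waterloo); the coherence edge between Benny A. and Waterloo (s) flips both decisions.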

Online Disambiguation (Prototype)

Outline



✓ Intro: Motivation & Cool Problems

✓ From Data Mining to Knowledge Harvesting

✓ From Snapshots to Eternity

✓ From Record Linkage to NL Disambiguation



Wrap-up







...

Research Opportunities

Knowledge Harvesting from Text

• recall & precision by patterns & reasoning

• efficiency & scalability

• soft constraints, hard constraints, richer logics, …

• discovery of new relation types (open IE)



Temporal Knowledge

• capture uncertain / incomplete temporal scopes of facts

• joint reasoning on base-facts and time-scopes

• long-term life-cycle of KB maintenance



Named Entity Disambiguation in NL

• near-human accuracy, using popularity, similarity, coherence

• efficient algorithms and real-time response

• automatic sameAs for Linked Data at Web-scale

Overall Take-Home

AI-complete problems:

knowledge harvesting, semantic search,

deep QA, machine reading

• exciting times, major progress

• many data-centric sub-problems

DB community = data-centric research

• storing & managing data was yesterday

• tomorrow is: analyzing, distilling, making sense of data

(turning data into knowledge)



Tapping into natural language crucial for:

• knowledge mostly produced in news, books, papers

• smartphone UI for ad-hoc real-time QA and KDD









...

Thank You !









“The plumber and Michelangelo used marble from the same quarry, but what each saw in the marble made the difference between a sink and a brilliant sculpture.” (Bob Kall)

“Not only is there no God, but try finding a plumber on a weekend.” (Woody Allen)


