
A Metric Cache for Similarity Search
Fabrizio Falchi, Claudio Lucchese, Salvatore Orlando, Fausto Rabitti, Raffaele Perego

Similarity Search in Databases
o Objects are "unknown"; only distances between them are "well known"
o Metric space assumption:
  • Identity
  • Symmetry
  • Triangle inequality
o Distance functions include:
  • Minkowski distances, edit distance, Jaccard distance, ...
o Applications include:
  • images, 3D shapes, medical data, text, DNA sequences, graphs, etc.
o Metric-space indexing works better than multidimensional indexing.
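As a quick illustration of the metric-space assumption, here is a minimal Python sketch (ours, not from the slides; the helper names are hypothetical) that checks the three axioms for a Minkowski distance with p = 2 on a small point sample:

    import itertools
    import math

    def euclidean(a, b):
        # Minkowski distance with p = 2; edit or Jaccard distance would also qualify.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def check_metric_axioms(d, points, eps=1e-9):
        # Identity and symmetry, checked over all pairs of sample points.
        for a, b in itertools.product(points, repeat=2):
            assert (d(a, b) < eps) == (a == b)    # identity: d(a, b) = 0 iff a = b
            assert abs(d(a, b) - d(b, a)) < eps   # symmetry: d(a, b) = d(b, a)
        # Triangle inequality, checked over all triples.
        for a, b, c in itertools.product(points, repeat=3):
            assert d(a, c) <= d(a, b) + d(b, c) + eps

    check_metric_axioms(euclidean, [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)])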

Distributed Similarity Search System

[Diagram: top-k queries reach the front-end of the CBIR system, which forwards them to a parallel & distributed CBIR system made of n units, each holding an index of MM objects (Unit 1, Unit 2, ..., Unit n).]

Search cost is close to O(|DB|)!

Distributed & Cached Similarity Search System

[Diagram: the same architecture, with a Metric Cache inserted between the stream of top-k queries and the front-end of the CBIR system.]

What's different in a Metric Cache?

o The cache stores result objects, not only result pointers
  • e.g., documents vs. document ids
o The cache is a peculiar sample of the whole dataset:
  • the set of objects most recently seen by the users (= the most interesting?!)
o Claim: an interesting object may be used to answer a query approximately if it is sufficiently similar to the query.

What's different in a Metric Cache?

…and…
o Queries may be approximate!
o [Zobel et al., CIVR '07]: at least 8% of the images on the web are near-duplicates, most of them due to cropping, contrast adjustment, etc.
o Requirement: the system must be robust w.r.t. near-duplicate queries.

What's different in a Metric Cache?

[Diagram: queries q1, q2 and q3 reach the Metric Cache in front of the CBIR system; queries close enough to cached results get an approximate answer directly from the cache, while the others are forwarded to the parallel & distributed CBIR system for an exact answer.]

Two algorithms: RCache vs. QCache

RCache(q, k):
  • if q ∈ Cache: return R                               [cache hit]
  • R = Cache.knn(q, k)
  • if quality(R) > threshold: return R                  [approximate hit]
  • else: R = DB.knn(q, k); Cache.add(q, R); return R    [cache miss]

In case of an approximate hit, the cached query q′ closest to q is marked as used. When the cache is full, the least recently used query and its results are evicted.
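Below is a minimal, runnable Python sketch of the RCache logic above. It is our reading of the pseudocode, not the authors' implementation: the linear-scan knn helper, the OrderedDict-based LRU bookkeeping, and the quality/threshold parameters are illustrative assumptions.

    import heapq
    from collections import OrderedDict

    def knn(q, objects, k, dist):
        # Linear-scan k-nearest-neighbors: the k objects closest to q.
        return heapq.nsmallest(k, objects, key=lambda x: dist(q, x))

    class RCache:
        # Caches (query, result list) pairs and evicts in LRU order.
        def __init__(self, dist, quality, threshold, capacity):
            self.dist, self.quality = dist, quality
            self.threshold, self.capacity = threshold, capacity
            self.entries = OrderedDict()  # query -> cached results, LRU order

        def search(self, q, k, db):
            if q in self.entries:                        # cache hit
                self.entries.move_to_end(q)
                return self.entries[q]
            pool = [x for res in self.entries.values() for x in res]
            R = knn(q, pool, k, self.dist)               # kNN over all cached results
            if R and self.quality(q, R) > self.threshold:
                # approximate hit: mark the closest cached query as used
                q_star = min(self.entries, key=lambda p: self.dist(q, p))
                self.entries.move_to_end(q_star)
                return R
            R = db.knn(q, k)                             # cache miss: ask the database
            self.entries[q] = R
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)         # evict the LRU query
            return R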

Costs of RCache vs. QCache

RCache(q, k):
  • cache hit: hash table access, O(1)
  • approximate hit: search among all the cached result objects, O(|Cache|)
  • cache miss: search among all the objects in the database, O(|DB|)

|Cache| is the number of cached objects; |DB| is the size of the database.

Two algorithms: RCache vs. QCache

QCache(q, k):
  • if q ∈ Cache: return R                               [cache hit]
  • Q* = Cache.knn(q, k′)    (the k′ cached queries closest to q)
  • R* = {results of the queries in Q*}
  • R = R*.knn(q, k)
  • if quality(R) > threshold: return R                  [approximate hit]
  • else: R = DB.knn(q, k); Cache.add(q, R); return R    [cache miss]

As for RCache, on an approximate hit the cached query closest to q is marked as used, and the least recently used query and its results are evicted.
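A matching sketch of QCache under the same assumptions, reusing knn and RCache from the previous sketch. The only change is that the candidate pool is restricted to the results of the k′ cached queries closest to q, instead of scanning all cached objects:

    class QCache(RCache):
        # Like RCache, but searches the cached queries first.
        def __init__(self, dist, quality, threshold, capacity, k_prime):
            super().__init__(dist, quality, threshold, capacity)
            self.k_prime = k_prime  # number of nearest cached queries to use

        def search(self, q, k, db):
            if q in self.entries:                        # cache hit
                self.entries.move_to_end(q)
                return self.entries[q]
            # the k' cached queries closest to q, then kNN among their results only
            Q_star = knn(q, list(self.entries), self.k_prime, self.dist)
            pool = [x for p in Q_star for x in self.entries[p]]
            R = knn(q, pool, k, self.dist)
            if R and self.quality(q, R) > self.threshold:
                self.entries.move_to_end(Q_star[0])      # closest query marked as used
                return R                                 # approximate hit
            R = db.knn(q, k)                             # cache miss
            self.entries[q] = R
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)         # evict the LRU query
            return R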

Costs of RCache vs. QCache

RCache(q, k):
  • cache hit: hash table access, O(1)
  • approximate hit: search among all the cached result objects, O(|Cache|)
  • cache miss: search among all the objects in the database, O(|DB|)

QCache(q, k):
  • cache hit: hash table access, O(1)
  • approximate hit: search among the cached query objects, O(|Cache|/k)
  • cache miss: search among all the objects in the database, O(|DB|)

Supposing k results are stored for each query, a cache holding |Cache| objects contains |Cache|/k queries; |DB| is the size of the database.

Approximation & Guarantees

[Diagram: cached query q* with the radius r* of its cached result; new query q at distance d(q, q*) from q*; the safe range s is drawn around q, inside the result ball of q*.]

o Let the safe range be s = r* − d(q, q*).
o The k* cached objects that fall within distance s of q are the true top-k* results of the new query.
o Every cached query may provide some additional guarantee.
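The guarantee is a one-line consequence of the triangle inequality (our sketch of the argument, using the notation above): any object x not among the cached results of q* satisfies d(q*, x) ≥ r*, and since d(q*, x) ≤ d(q*, q) + d(q, x),

    d(q, x) ≥ d(q*, x) − d(q, q*) ≥ r* − d(q, q*) = s

so every cached object within distance s of q is closer to q than any object outside the cached result, i.e. it belongs to the true top results of q.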

Experimental setup

o A collection of 1,000,000 images downloaded from Flickr:
  • we extracted 5 MPEG-7 descriptors, which were used to measure similarity.
o A query log of 100,000 images:
  • a random sample with replacement, based on image views
  • 20% training, 80% testing
o k = 20, k′ = 10
o Quality function: safe range s ≥ 0

Hit ratio

Throughput

Approximation quality I

o RES compares the sum of the distances of the returned objects with that of the exact kNN:

  RES(q, R) = ( Σ_{x ∈ R} d(q, x) / Σ_{y ∈ KNN(q,k)} d(q, y) ) − 1

o REM compares the maximum distance (the radius of the result):

  REM(q, R) = ( max_{x ∈ R} d(q, x) / max_{y ∈ KNN(q,k)} d(q, y) ) − 1

Approximation quality II

What we also did ...
o Take queries from a different collection.
o Inject duplicates into the query log.
o Use an expectation of RES as the quality measure.

Approximation quality III

o Bug: RES = 0.07
o Portrait: RES = 0.09
o Sunset: RES = 0.12


Acknowledgements
o The European Project SAPIR
• http://www.sapir.eu

o P. Zezula and his colleagues for the M-Tree implementation
• http://mufin.fi.muni.cz/tiki-index.php

o The dataset used is available at
• http://cophir.isti.cnr.it

Thank you.

Backup slides

