
# Duplicate Data Detection Presentation


Duplicate Data Detection

Abdur Chowdhury

Abdur Chowdhury, Slide:1

Outline
1. What is duplicate data
2. Similarity Metrics
   1. Resemblance
   2. Cosine
3. Efficiency
4. Fuzzy Hashing
   1. Shingling
   2. I-Match
5. Lexical Randomization
6. Conclusions
Abdur Chowdhury, Slide:2

Exact Duplicate
http://www.sigmod.org/dblp/db/indices/a-tree/c/Chowdhury:Abdur.html
http://www.acm.org/sigs/sigmod/dblp/db/indices/a-tree/c/Chowdhury:Abdur.html

Abdur Chowdhury, Slide:3

Exact Duplicate – Web Results

Abdur Chowdhury, Slide:4

Exact Duplicates - Hashing
• Exact duplicate detection:
  – Calculate a unique hash value for each document.
  – Check each document for duplication by looking up its hash in either an in-memory hash table or a persistent lookup system.
• Common hash functions include MD2, MD5, and SHA.
  – These functions have three desirable properties:
    • They can be calculated on data / documents of arbitrary length.
    • They are easy to compute.
    • They have a very low probability of hash collisions.
Abdur Chowdhury, Slide:5
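The lookup scheme above can be sketched in a few lines of Python, using SHA-1 from `hashlib`; the document set here is hypothetical:

```python
import hashlib

def exact_duplicates(docs):
    """Flag documents whose digest has already been seen (a minimal sketch)."""
    seen = {}    # digest -> doc_id of first occurrence
    dupes = []
    for doc_id, text in docs.items():
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            dupes.append((doc_id, seen[digest]))  # exact duplicate pair
        else:
            seen[digest] = doc_id
    return dupes

docs = {
    "d1": "the quick brown fox",
    "d2": "the quick brown fox",   # byte-for-byte copy of d1
    "d3": "the quick brown fox!",  # one extra character -> different hash
}
print(exact_duplicates(docs))      # [('d2', 'd1')]
```

Note that `d3` is not flagged: a single changed byte yields a completely different digest, which is exactly the brittleness the next slide describes.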

Exact Duplicates Problems
• What if the layout changes but the content is the same?
• What if the content is slightly modified?
  – Time stamps, # of visitors, spelling corrections, personalized messages (Welcome “Abdur”), title changes, etc.
  – Exact hashing methods fail.
Abdur Chowdhury, Slide:6

Near Duplicates

Abdur Chowdhury, Slide:7

Near Duplicates
• Research question:
  – At what point are documents no longer duplicates?
    • Title, formatting, and spelling changes are easy for a human to judge as the same document.
    • Content changes, additional paragraphs, or small sentence changes are difficult to determine and agree upon.
• Applications
  – Web {Search, Crawling, etc…}
  – Publishing with data feeds
  – Email {Spam duplication}
Abdur Chowdhury, Slide:8

Outline
1. What is duplicate data
2. Similarity Metrics
   1. Resemblance
   2. Cosine
3. Efficiency
4. Fuzzy Hashing
   1. Shingling
   2. I-Match
5. Lexical Randomization
6. Conclusions
Abdur Chowdhury, Slide:9

Similarity
• If we can measure the similarity of two documents to each other, then:
  – If it exceeds some threshold, they are the same.
• So all we need is:
  – A similarity function to compare two documents
  – Agreement on a threshold above which two documents are the same
Abdur Chowdhury, Slide:10

Cosine
• IR techniques for query-to-document similarity can be used for D1-to-D2 similarity.
  – The cosine measure computes the vector distance between two objects.
  – Terms are used as features of the document in N-dimensional space.
    • N = number of features in the collection
  – Magnitude can also be incorporated.
Abdur Chowdhury, Slide:11
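The cosine measure over term-frequency vectors can be sketched as follows; the example strings are hypothetical:

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Cosine similarity of two documents over term-frequency vectors."""
    v1, v2 = Counter(d1.split()), Counter(d2.split())
    # Dot product only needs the terms the two vectors share.
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

# Personalized greeting differs; everything else matches.
print(cosine("welcome abdur to the site",
             "welcome guest to the site"))  # 0.8
```

With a threshold of, say, 0.7, these two pages would be flagged as the same despite the personalized message, which exact hashing could not do.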

Resemblance
• Document plagiarism research used a set function R.
• Instead of using terms as features, shingles are used.
  – A shingle is a set of X consecutive terms hashed to a new feature.
  – D = {A B C A B}, X = 2 → {A B}, {B C}, {C A}, {A B} → H1, H2, H3, H1
Abdur Chowdhury, Slide:12
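The slide's shingling example can be run directly. Python's built-in `hash` stands in for the slide's H1, H2, …, and the resemblance R is computed here as the ratio of shared to total shingles (the usual set formulation):

```python
def shingles(text, x=2):
    """Hash every run of x consecutive terms into a feature.
    'A B C A B' with x=2 gives {A B},{B C},{C A},{A B} -> 3 unique hashes."""
    terms = text.split()
    return {hash(tuple(terms[i:i + x])) for i in range(len(terms) - x + 1)}

def resemblance(d1, d2, x=2):
    """R = |S1 & S2| / |S1 | S2| over the shingle sets."""
    s1, s2 = shingles(d1, x), shingles(d2, x)
    return len(s1 & s2) / len(s1 | s2)

print(resemblance("A B C A B", "A B C A C"))  # 0.75
```

Swapping the last term changes only one shingle out of four unique ones, so the documents still resemble each other strongly.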

Outline
1. What is duplicate data
2. Similarity Metrics
   1. Resemblance
   2. Cosine
3. Efficiency
4. Fuzzy Hashing
   1. Shingling
   2. I-Match
5. Lexical Randomization
6. Conclusions
Abdur Chowdhury, Slide:13

Efficiency
• Efficiency: how fast can I find the answer?
• Applications
  – Publishing with data feeds
    • RSS and news feeds in the hundreds of thousands a day
  – Web {Search, Crawling, etc…}: billions of documents
  – Email {Spam detection, Storage Reduction}
    • > 1 billion spam messages a day
Abdur Chowdhury, Slide:14

Hashing / Signatures
• Exact duplicate detection:
  – Calculate a unique hash value for each document.
  – Check each document for duplication by looking up its hash in either an in-memory hash table or a persistent lookup system.
• Complexity: O(n)
  – Can scale to large collections
  – Too brittle for most applications
Abdur Chowdhury, Slide:15

Cosine / Resemblance
• Algorithm 1
  – For each document X in the collection:
    • For each document Y in the collection:
    • If (X != Y), compare X to Y via Cosine or Resemblance.
      – If the score is > 0 and > threshold, flag as duplicate.
  – O(n²)

• Algorithm 2
  – For each document in the collection:
    • For each feature in the document:
      – Find all other documents with that feature (use some type of inverted index).
      – Calculate similarity on the fly.
      – Only compare documents to others that have some feature in common.
      – IDF filtering and stop words can speed up the process.
  – O(n²)
Abdur Chowdhury, Slide:16
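Algorithm 2's inverted-index pruning can be sketched as follows. The document contents are hypothetical, and the IDF filtering and on-the-fly similarity computation are omitted; only the candidate-pair generation is shown:

```python
from collections import Counter, defaultdict

def candidate_pairs(docs):
    """Count shared features per document pair via an inverted index,
    so documents with nothing in common are never compared at all."""
    index = defaultdict(set)              # term -> ids of docs containing it
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index[term].add(doc_id)
    shared = Counter()                    # (d1, d2) -> number of shared terms
    for ids in index.values():
        ids = sorted(ids)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                shared[(a, b)] += 1
    return shared

docs = {"d1": "a b c", "d2": "a b d", "d3": "x y z"}
print(candidate_pairs(docs))  # only (d1, d2) is ever considered
```

The similarity function then runs only on pairs in `shared`, which is where the practical speedup over the all-pairs Algorithm 1 comes from, even though the worst case is still O(n²).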

Efficiency
• Hashing
– Good performance but cannot handle near duplicates

• Similarity Functions
– Bad performance but can handle fuzziness

Abdur Chowdhury, Slide:17

Outline
1. What is duplicate data
2. Similarity Metrics
   1. Resemblance
   2. Cosine
3. Efficiency
4. Fuzzy Hashing
   1. Shingling
   2. I-Match
5. Lexical Randomization
6. Conclusions
Abdur Chowdhury, Slide:18

Fuzzy Hashing
• Goal:
  – Hashing speed
  – Fuzzy duplication detection

• Generating a single fingerprint per document is attractive when dealing with massive document repositories
Abdur Chowdhury, Slide:19

Shingling
• DSC
  – Calculate all possible shingles in the collection in pass 1.
  – Filter out common shingles.
  – Apply either Algorithm 1 or 2 from the prior section to detect duplicates.
• DSC-SS
  – Keep mod(25) shingles for each document.
  – Hash sets of shingles into SS (super shingles).
  – Any two documents with an SS in common are considered duplicates.
Abdur Chowdhury, Slide:20
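A sketch of the DSC-SS idea, under stated assumptions: the slide's "mod(25)" is read here as keeping shingle hashes divisible by 25, and since the deck gives no super-shingle group size, `group=4` is an invented parameter:

```python
def super_shingles(shingle_hashes, keep_mod=25, group=4):
    """DSC-SS sketch: keep only shingles whose hash is 0 mod keep_mod,
    then hash fixed-size groups of the survivors into super shingles (SS).
    keep_mod follows the slide's mod(25); group=4 is an assumed value."""
    kept = sorted(h for h in shingle_hashes if h % keep_mod == 0)
    # Each consecutive group of kept shingles becomes one super shingle.
    return {hash(tuple(kept[i:i + group]))
            for i in range(0, len(kept) - group + 1, group)}

def dscss_duplicates(ss1, ss2):
    """Any two documents sharing a super shingle are considered duplicates."""
    return bool(ss1 & ss2)

doc1 = super_shingles([0, 25, 50, 75, 3, 7])
doc2 = super_shingles([0, 25, 50, 75, 9])
print(dscss_duplicates(doc1, doc2))  # True: they share a super shingle
```

Because duplication reduces to set membership on a handful of SS values per document, detection becomes a hash lookup rather than a pairwise shingle comparison.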

Shingling Optimizations Observations
• D1 being a duplicate of D2 does not imply D2 is a duplicate of D1.
• Document similarity does not need a similarity threshold.
• Short documents do not provide enough data to build an SS.
• O(n), but still requires multiple passes over the collection.
• No real reasoning behind the feature-space reduction optimizations beyond speed.
Abdur Chowdhury, Slide:21

I-Match
• Need an approach with the following characteristics:
  1. Run-time performance of O(n)
  2. More reasoning behind document feature selection
• I-Match builds a signature for each document based on terms.
  – Rare terms are removed.
  – Common terms are removed.
  – Theory: the information conveyed by a document's rarest and most common terms is not likely to help find near duplicates.
• Heaps’ law: http://en.wikipedia.org/wiki/Heaps%27_law
• Zipf’s law: http://www.cs.unc.edu/~vivek/home/stenopedia/zipf/

Abdur Chowdhury, Slide:22

I-Match Algorithm
1. Get document.
2. Parse document into a token stream, removing format tags.
3. Using term thresholds (idf), retain only significant tokens.
4. Insert relevant tokens into an ascending-ordered tree of unique tokens.
5. Loop through the tokens and add each to the digest. Upon completion, a document digest is defined.
6. The tuple (doc_id, SHA1 digest) is inserted into a data structure keyed by digest.
7. If there is a collision of digest values, then the documents are similar.

• D1 being a duplicate of D2 implies D2 is a duplicate of D1.
• Some justification for feature selection
• Single pass over the data
• Really short documents???
• Language detection
Abdur Chowdhury, Slide:23
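Steps 2–7 can be sketched as follows. The lexicon here is hypothetical (a real one comes from the collection's idf statistics), and the ascending-ordered tree is replaced by a plain sorted set, which yields the same canonical token order:

```python
import hashlib

def imatch_signature(text, lexicon):
    """I-Match sketch: keep only tokens found in the idf-filtered lexicon,
    order the unique survivors, and digest them with SHA1."""
    tokens = sorted(set(text.lower().split()) & lexicon)
    return hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()

# Hypothetical lexicon of significant mid-idf terms.
lexicon = {"duplicate", "detection", "signature", "shingle"}
s1 = imatch_signature("duplicate detection signature stuff from 2004", lexicon)
s2 = imatch_signature("signature detection duplicate things from 2005", lexicon)
print(s1 == s2)  # True: only insignificant tokens differ
```

Since each document yields exactly one digest, duplicate lookup is a single hash-table probe, giving the O(n) single-pass behavior the slide claims.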

Outline
1. What is duplicate data
2. Similarity Metrics
   1. Resemblance
   2. Cosine
3. Efficiency
4. Fuzzy Hashing
   1. Shingling
   2. I-Match
5. Lexical Randomization
6. Conclusions
Abdur Chowdhury, Slide:24

I-Match Research Areas
• A document signature is equivalent to the intersection between the terms contained in the document and the I-Match lexicon.
  – Modification of a single word can lead to a signature change: signature fragility.
  – Can signature fragility be overcome?
• The term-distribution statistics of a large document collection are used to identify terms likely to be useful in determining document uniqueness.
  – Such terms comprise the I-Match lexicon.
  – Should the local collection or an external collection be used?
Abdur Chowdhury, Slide:25

I-Match w/ Randomized Lexicons
[Diagram: each document goes through feature extraction; the extracted features are intersected with several lexicons; the concatenated common features from each intersection are hashed, producing one document signature per lexicon.]

Abdur Chowdhury, Slide:26

Lexicon Randomization
• avis, awarness, awhile, axon, bagel, baghdad, bandmember, bang, bangalore, bangs, berry, bipedal, capcity, capelin, cardiac, … etc.

Abdur Chowdhury, Slide:27
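A sketch of the randomization idea, under assumptions: the number of extra lexicons `k`, the drop fraction, and the seed are illustrative values not taken from the deck, and the I-Match signature from the earlier slides is reduced to a sorted-intersection SHA1:

```python
import hashlib
import random

def randomized_lexicons(lexicon, k=4, drop=0.3, seed=42):
    """Build k extra lexicons by randomly dropping a fraction of terms
    from the original lexicon."""
    rng = random.Random(seed)
    lexicons = [set(lexicon)]                  # the original lexicon first
    for _ in range(k):
        lexicons.append({t for t in sorted(lexicon) if rng.random() > drop})
    return lexicons

def signatures(text, lexicons):
    """One I-Match signature per lexicon; documents whose signature lists
    overlap anywhere are treated as near duplicates."""
    toks = set(text.lower().split())
    return [hashlib.sha1(" ".join(sorted(toks & lex)).encode()).hexdigest()
            for lex in lexicons]
```

A perturbed word changes only the signatures of the lexicons that contain it; the remaining signatures still collide, which is why stability grows with more randomized lexicons in the perturbation curves that follow.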


Perturbation curve
[Figure: signature stability probability (0–1) vs. insertion/deletion count (1–8) for bag-2, bag-5, and bag-10 configurations.]

Abdur Chowdhury, Slide:30

I-Match Randomization Experiments
• A test collection was created:
  – Random sample (5%) of web documents from the 1.6M-document WT10G collection
  – Cosine similarity with T > 0.9 used to create the base truth (15 days of CPU time, ~80k documents)
• Experiments to compare effectiveness:
  – Use the same collection or a representative language collection
  – Improvements with more randomized sets
Abdur Chowdhury, Slide:31

Impact of Randomization and Collection Statistics
[Figure: randomized I-Match near-duplicate recall on WT10G (≈0.35–0.7) vs. number of extra lexicons used (0–10), for lex-sgml and lex-wt10g lexicons.]

Abdur Chowdhury, Slide:32

Outline
1. What is duplicate data
2. Similarity Metrics
   1. Resemblance
   2. Cosine
3. Efficiency
4. Fuzzy Hashing
   1. Shingling
   2. I-Match
5. Lexical Randomization
6. Conclusions
Abdur Chowdhury, Slide:33

Conclusions
• Near-duplicate data is a problem for many applications.
• Traditional similarity approaches have significant performance penalties.
• I-Match overcomes most performance problems and provides good near-duplicate detection.
• Lexicon randomization can significantly improve the effectiveness of I-Match.
Abdur Chowdhury, Slide:34
