Duplicate Data Detection Presentation

Shared by: David Wierd
Posted: 9/4/2008
Duplicate Data Detection
Abdur Chowdhury

Outline
1. What is duplicate data
2. Similarity Metrics
   1. Resemblance
   2. Cosine
3. Efficiency
4. Fuzzy Hashing
   1. Shingling
   2. I-Match
5. Lexical Randomization
6. Conclusions

Exact Duplicate
http://www.sigmod.org/dblp/db/indices/a-tree/c/Chowdhury:Abdur.html
http://www.acm.org/sigs/sigmod/dblp/db/indices/a-tree/c/Chowdhury:Abdur.html

Exact Duplicate – Web Results

Exact Duplicates - Hashing
• Exact duplicate detection:
  – Calculate a unique hash value for each document.
  – Each document is examined for duplication by looking up the value (hash) in either an in-memory hash or persistent lookup system.
• Several common hash functions used are MD2, MD5, or SHA.
  – These functions have three desirable properties:
    • can be calculated on arbitrary data / document lengths
    • easy to compute
    • have very low probabilities of hash collisions

Exact Duplicates Problems
• What if the layout changes but the content is the same?
• What if the content is slightly modified?
  – Time stamps, # of visitors, spelling corrections, personalized messages (Welcome “Abdur”), title change, etc.
  – Exact hashing methods fail

Near Duplicates
• Research Question?
  – At what point are documents no longer duplicates?
    • Title, formatting, and spelling changes: easy for a human to say they are the same
    • Content changes, additional paragraphs, or small sentence changes: difficult to determine and agree upon
• Applications
  – Web {Search, Crawling, etc…}
  – Publishing with data feeds
  – Email {Spam duplication}
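The exact-duplicate hashing scheme from the "Exact Duplicates - Hashing" slide can be sketched roughly as below. This is a minimal illustration, not the presenter's code: the function name `find_exact_duplicates`, the in-memory dict, the choice of SHA-1, and the three sample documents are all assumptions made for the example.

```python
import hashlib

def find_exact_duplicates(docs):
    """Map each duplicate doc_id to the first doc_id seen with identical content."""
    seen = {}        # digest -> doc_id of the first document with that digest
    duplicates = {}  # doc_id -> doc_id of the document it duplicates
    for doc_id, text in docs.items():
        # One hash per document; the lookup below is the "in-memory hash".
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates[doc_id] = seen[digest]
        else:
            seen[digest] = doc_id
    return duplicates

# Hypothetical sample data: "c" differs from "a" only in the personalized greeting.
docs = {
    "a": "Welcome Abdur. Duplicate data detection.",
    "b": "Welcome Abdur. Duplicate data detection.",
    "c": "Welcome David. Duplicate data detection.",
}
exact_duplicates = find_exact_duplicates(docs)
print(exact_duplicates)  # {'b': 'a'}
```

Document "c" is not flagged even though only one word changed, which is the brittleness the "Exact Duplicates Problems" slide describes.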
Similarity
• If we can measure the similarity of two documents to each other then:
  – If they exceed some threshold they are the same
• So all we need:
  – Similarity function to compare two documents
  – Agreement on a threshold that two documents are the same

Cosine
• IR techniques for query to document similarity can be used for D1 to D2
  – Cosine measures the vector distance between two objects
  – Terms used as features of the document in N-dimensional space
    • N = number of features in collection
  – Magnitude can also be incorporated

Resemblance
• Document plagiarism research used a set function R
• Instead of using terms as features, shingles are used
  – Shingle: a set of X consecutive terms hashed to a new feature
  – D = {A B C A B}, X = 2
  – {A B}, {B C}, {C A}, {A B}
  – H1, H2, H3, H1

Efficiency
• Efficiency – how fast can I find the answer?
• Applications
  – Publishing with data feeds
    • RSS and news feeds in the hundreds of thousands a day
  – Web {Search, Crawling, etc…}
    • Billions of documents
  – Email {Spam detection, Storage Reduction}
    • > 1 Billion spam messages a day

Hashing / Signatures
• Exact duplicate detection:
  – Calculate a unique hash value for each document.
  – Each document is examined for duplication by looking up the value (hash) in either an in-memory hash or persistent lookup system
• Complexity - O(n)
  – Can scale to large collections
  – Too brittle for most applications

Cosine / Resemblance
• Algorithm 1
  – For each document in collection {X}
    • For each document in collection {Y}
      • If (X != Y) compare X to Y via Cosine or Resemblance
        – If similarity > 0 and > threshold, flag as duplicate
  – O(n²)
• Algorithm 2
  – For each document in collection
    • For each feature in document
      – Find all other documents with feature (use some type of inverted index)
      – Calculate similarity on the fly
      – Only compare documents to others if some feature is in common
      – IDF filtering and stop words can speed up the process
  – O(n²)

Efficiency
• Hashing
  – Good performance but cannot handle near duplicates
• Similarity Functions
  – Bad performance but can handle fuzziness
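The two similarity functions and the inverted-index variant (Algorithm 2) can be sketched as follows. This is an illustrative sketch under several assumptions: resemblance is taken to be the usual set-overlap ratio |A ∩ B| / |A ∪ B| over shingle sets, cosine uses raw term counts, shingles are kept as tuples rather than hashed, and the helper names and sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def shingles(text, x=2):
    """Set of shingles: every run of x consecutive terms (kept as tuples here;
    a real system would hash each tuple to a fixed-size feature)."""
    terms = text.split()
    return {tuple(terms[i:i + x]) for i in range(len(terms) - x + 1)}

def resemblance(a, b, x=2):
    """Overlap of the two shingle sets: |A & B| / |A | B| (assumed definition)."""
    sa, sb = shingles(a, x), shingles(b, x)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    """Cosine of the angle between the raw term-count vectors of two texts."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def near_duplicates(docs, threshold=0.9):
    """Algorithm 2: only compare documents that share at least one term."""
    index = defaultdict(set)  # inverted index: term -> ids of documents containing it
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index[term].add(doc_id)
    pairs = set()
    for doc_id, text in docs.items():
        candidates = set()
        for term in set(text.split()):
            candidates |= index[term]
        candidates.discard(doc_id)  # never compare a document with itself
        for other in candidates:
            if cosine(text, docs[other]) > threshold:
                pairs.add(tuple(sorted((doc_id, other))))
    return pairs

# Hypothetical sample collection.
docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumps over the lazy dog today",
    "d3": "completely unrelated text about hashing",
}
print(shingles("A B C A B"))  # the slide's example: three distinct shingles
print(near_duplicates(docs))
```

The candidate set shrinks when documents share few terms, but a common term still pulls in most of the collection, so the worst case stays quadratic, as the slide notes. Filtering stop words or low-IDF terms before indexing is the speed-up the slide mentions and is omitted here.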
Fuzzy Hashing
• Goal:
  – Hashing speed
  – Fuzzy duplication detection
• Generating a single fingerprint per document is attractive when dealing with massive document repositories

Shingling
• DSC
  – Calculate all possible shingles in collection in pass 1
  – Filter out common shingles
  – Apply either algorithm 1 or 2 from prior section to detect duplicates
• DSC-SS
  – Keep mod(25) shingles for each document
  – Hash sets of shingles into SS (super shingles)
  – Any 2 documents with a SS in common are considered duplicates

Shingling Optimizations Observations
• D1 is a duplicate of D2 does not imply D2 is a duplicate of D1
• Document similarity does not need a similarity threshold
• Short documents do not provide enough data to build a SS
• O(n), but still requires multiple passes over the collection
• No real reasoning behind the feature space reduction optimizations beyond speed

I-Match
• Need an approach with the following characteristics
  1. Run time performance of O(n)
  2. More reasoning behind document feature selection
• I-Match builds a signature for each document based on terms
• Rare terms are removed
• Common terms are removed
• Theory: the term information conveyed in rare and common terms is not likely to help find near duplicates
Heaps’ Law: http://en.wikipedia.org/wiki/Heaps%27_law
Zipf’s Law: http://www.cs.unc.edu/~vivek/home/stenopedia/zipf/

I-Match Algorithm
1. Get document.
2. Parse document into a token stream, removing format tags.
3. Using term thresholds (idf), retain only significant tokens.
4. Insert relevant tokens into ascending ordered tree of unique tokens.
5. Loop through tokens and add each to the digest. Upon completion a document digest is defined.
6. The tuple (doc_id, SHA1 Digest) is inserted into a data structure keyed on digest.
7. If there is a collision of digest values then the documents are similar.
• D1 is a duplicate of D2 implies D2 is a duplicate of D1
• Some justification for feature selection
• Single pass over the data
• Really short documents???
• Language Detection

I-Match Research Areas
• A document signature is equivalent to the intersection between terms contained in the document and the I-Match lexicon.
  – Modification of a single word can lead to signature change (signature fragility).
  – Can signature fragility be overcome?
• The term distribution statistics of a large document collection are used to identify terms likely to be useful in determining document uniqueness.
  – Such terms comprise the I-Match lexicon.
  – Should the local collection or an external collection be used?

I-Match w/ Randomized Lexicons
[Diagram: Document → Feature extraction → Intersect with Lexicon → Concatenated common features → HASH → Document signature; repeated with several lexicons to yield multiple document signatures]

Lexicon Randomization
[Figure: example lexicon word list shown in two columns: avis, awarness, awhile, axon, bagel, baghdad, bandmember, bang, bangalore, bangs, berry, bipedal, capcity, capelin, cardiac, … etc …]
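The I-Match signature and the randomized-lexicon extension might be sketched as below. Several things here are assumptions rather than details from the slides: the lexicon is passed in ready-made instead of being derived from idf thresholds, tokenization is a plain lowercase whitespace split with no tag removal, each extra lexicon is produced by randomly dropping a fixed fraction of the original lexicon's terms, and two documents are treated as near duplicates when any one pair of signatures matches. All function names and sample data are hypothetical.

```python
import hashlib
import random

def imatch_signature(text, lexicon):
    """SHA-1 over the sorted unique tokens that also appear in the lexicon."""
    tokens = sorted(set(text.lower().split()) & lexicon)
    if not tokens:
        return None  # no significant tokens survive, so no signature
    digest = hashlib.sha1()
    for token in tokens:
        digest.update(token.encode("utf-8"))
        digest.update(b"\0")  # separator keeps token boundaries unambiguous
    return digest.hexdigest()

def randomized_lexicons(lexicon, extra=3, drop=0.3, seed=0):
    """Original lexicon plus `extra` copies, each missing a random `drop` fraction."""
    rng = random.Random(seed)
    ordered = sorted(lexicon)
    keep = int(len(ordered) * (1 - drop))
    return [set(lexicon)] + [set(rng.sample(ordered, keep)) for _ in range(extra)]

def signatures(text, lexicons):
    """One signature per lexicon."""
    return [imatch_signature(text, lex) for lex in lexicons]

def is_near_duplicate(a, b, lexicons):
    """Assumed decision rule: a match under any single lexicon is enough."""
    return any(sa is not None and sa == sb
               for sa, sb in zip(signatures(a, lexicons), signatures(b, lexicons)))

# Hypothetical lexicon and documents.
lexicon = {"duplicate", "detection", "hashing", "lexicon", "signature", "shingle"}
a = "duplicate detection with hashing and a signature"
b = "hashing the signature for duplicate detection"
c = "duplicate detection with hashing and a shingle signature"

lexicons = randomized_lexicons(lexicon, extra=3)
print(imatch_signature(a, lexicon) == imatch_signature(b, lexicon))  # True
print(imatch_signature(a, lexicon) == imatch_signature(c, lexicon))  # False
print(is_near_duplicate(a, c, lexicons))
```

Documents `a` and `b` share a signature because word order and out-of-lexicon words are ignored. Document `c` adds one lexicon term and so breaks the single-lexicon signature, which is the fragility the "I-Match Research Areas" slide raises. Under the randomized lexicons, `a` and `c` can still match whenever the added term happens to be missing from one of the extra lexicons.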

Perturbation curve
[Figure: signature stability probability (y-axis, 0 to 1) versus insertion/deletion count (x-axis, 1 to 8), with curves for bag-2, bag-5, and bag-10]

I-Match Randomization Experiments
• Test collection was created
  – Random sample (5%) of web documents from the WT10G 1.6M document collection
  – Cosine similarity T > .9 used to create the base truth (15 days of CPU time, ~80k documents)
• Experiments to compare the effectiveness
  – Use same collection or representative language collection
  – Improvements with more randomized sets

Impact of Randomization and Collection Statistics
[Figure: Randomized I-Match recall for WT10G. Near-duplicate recall (y-axis, 0.35 to 0.7) versus number of extra lexicons used (x-axis, 0 to 10), with curves for lex-sgml and lex-wt10g]

Conclusions
• Near Duplicate Data is a problem for many applications
• Traditional similarity approaches have significant performance penalties
• I-Match overcomes most performance problems and provides good near duplicate data detection
• The use of lexicon randomization can significantly improve the effectiveness of I-Match
