Survey on Algorithms of Edit Distance Based Approximate String Matching
Tian Song, Tsinghua University, Mar. 2006

Basic concepts
Survey of the research
Approximate single string matching
Approximate multiple string matching
Conclusions

Tian Song:Survey on Algorithms of Edit Distance Based Approximate String Matching

Basic concepts

Approximate String Matching
Given:
a finite alphabet Σ
a short pattern string p = p1p2…pm
a long text string t = t1t2…tn
an integer k >= 0
Find all approximate occurrences of p in t, that is, all substrings of t whose distance to p is at most k.


Distance
The minimum number of operations needed to make two strings equal

Hamming distance
Operations: substitution

Levenshtein distance (Edit Distance)
Operations: deletion, insertion, substitution

Insertion distance
Operations: insertion
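
As a minimal sketch of the first two definitions above, the following Python functions compute the Hamming distance and the Levenshtein (edit) distance. The function names are illustrative and not from the slides; the edit distance uses the textbook row-by-row dynamic programming recurrence.

```python
def hamming(a, b):
    """Number of substitutions needed to turn a into b (equal lengths only)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of deletions, insertions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))      # distance from "" to each prefix of b
    for i, x in enumerate(a, start=1):
        curr = [i]                      # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete x
                            curr[j - 1] + 1,            # insert y
                            prev[j - 1] + (x != y)))    # substitute, or match for free
        prev = curr
    return prev[-1]
```

For example, `hamming("karolin", "kathrin")` counts three mismatching positions, and `levenshtein("kitten", "sitting")` needs two substitutions and one insertion.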

Edit distance based Approximate String Matching
Classification: single string, multiple strings
Approximate Single String Matching
Given:
a finite alphabet Σ
a pattern string p of length m
a text string t of length n
an integer k >= 0
Find the positions in t at which a substring of t occurs whose edit distance to p is at most k.
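
The single-pattern problem can be sketched with the classical column-wise dynamic programming search, in which an occurrence may start anywhere in the text at no cost. The name `approx_search` is hypothetical, and this sketch reports 0-based end positions of occurrences rather than start positions.

```python
def approx_search(pattern, text, k):
    """Return the 0-based positions j such that some substring of text
    ending at j is within edit distance k of pattern."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any suffix
    # of the text read so far; col[0] stays 0 because a match may start anywhere.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        diag = col[0]                   # value of col[i-1] for the previous text position
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(old + 1,                           # extra text character
                         col[i - 1] + 1,                    # missing pattern character
                         diag + (pattern[i - 1] != c))      # substitution or match
            diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

With `pattern = "abc"` and `text = "xxabcxx"`, the call with `k = 0` reports only the exact occurrence, while `k = 1` also reports the neighbouring end positions reached by dropping or adding one character.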

Approximate Multiple String Matching
Given:
a finite alphabet Σ
r pattern strings p1, p2, ..., pr
a text string t of length n
an integer ki for each pattern pi: k1, k2, ..., kr
Find the positions in t at which a substring of t occurs whose edit distance to some pattern pi is at most ki.

Online search and offline search
Online
No structure is built on the text. Filtering is the main technique for skipping parts of the text.

Offline
A structure is built on the text in advance to speed up searching. This structure is called an index.

Survey of the research

First
In 1974, Robert A. Wagner and Michael J. Fischer posed this problem and gave the original dynamic programming algorithm.

Overview
Nine researchers from five universities are covered. Together they account for almost all the research on this topic.

Esko Ukkonen
Professor at the University of Helsinki. Starting in 1981, he originated the automata algorithms and improved the DP algorithms. After 2001, he focused on research into two-dimensional approximate matching and protein matching.

Udi Manber
Professor at the University of Arizona. Starting in 1985, he proposed a multiple approximate string matching algorithm and implemented the Glimpse software for file system searching and the agrep software, which supports approximate string matching. After 2002, he joined Amazon.com and became a VP.

Sun Wu
Ph.D. from the University of Arizona. Associate professor at Chung Cheng University. The title of his Ph.D. thesis is “Approximate Pattern Matching and its Applications”. He developed the Glimpse and agrep software. No papers have been published since 2002.

Robert Muth
Ph.D. from the University of Arizona. He was the first to propose the MultiHash algorithm for multiple approximate string matching. After graduating in 1999, he joined Intel to work on compiler optimization research.

Eugene W. Myers
Starting in 1981, he proposed algorithms for approximate regular expression matching as well as bit-parallel algorithms such as BPM. He was formerly a professor at the University of Arizona. After 2002, he became a professor at the University of California, Berkeley, doing research on protein matching.

U. Vishkin
Professor at the University of Maryland. Starting in 1989, he proposed parallel algorithms for approximate string matching. After 1995, he focused on the explicit multi-threading parallel architecture.

Gad M. Landau
Professor at the University of Haifa. Starting in 1986, he worked on related research. He now focuses on molecular biology.

Ricardo Baeza-Yates
Professor at the University of Chile. Starting in 1989, he proposed several algorithms for exact string matching and approximate string matching. He is still doing research on this topic.

Gonzalo Navarro
Ph.D., associate professor at the University of Chile. Since 1995, he has mainly focused on approximate string matching. He proposed several algorithms for approximate string matching, such as MultiPEX and MultiBP. After 2001, he proposed the l-grams algorithm, which is average-case optimal.

Approximate single string matching

Overview

Dynamic Programming

Automata

Bit parallel
BPR, BPD, BPM

Filtering
Idea:
It may be easier to tell that a text position does not match than to tell that it does.

Filtering: PEX, ABNDM
Verification: hierarchical

Approximate multiple string matching

Overview
All algorithms are based on the idea of filtering. All of them need at least three phases:
Preprocessing phase
Scanning phase
Verification phase

MultiHash (for k=1)
Basic idea: filter the text by exact matching against the prefixes of all patterns. Proposed in 1996.
Preprocessing phase:
Cut all patterns to an equal length by taking their prefixes of size r. Then, for each truncated pattern A, form every string obtained from A by one deletion. Compute the hash values of these strings and store, for each, a pointer to the original full-length pattern in the hash table.
Scanning phase
Look at the text r characters at a time. For this subtext, construct the r strings that result from deleting one character at a time. Check whether any of the resulting strings appears in the hash table. If so, pass the position on to the verification phase.

Verification phase
Call an approximate single string matching algorithm to verify.
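
The three MultiHash phases for k=1 can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: a Python dict stands in for the hash table, r is taken as the length of the shortest pattern, all function names are hypothetical, and occurrences cut short by the end of the text are not handled.

```python
def deletions(s):
    """All strings obtained from s by deleting exactly one character."""
    return {s[:i] + s[i + 1:] for i in range(len(s))}


def levenshtein(a, b):
    """Edit distance, used here only for the verification phase."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def build_table(patterns, r):
    """Preprocessing: map each one-deletion variant of a pattern prefix to its patterns."""
    table = {}
    for p in patterns:
        for d in deletions(p[:r]):
            table.setdefault(d, set()).add(p)
    return table


def multihash_search(patterns, text):
    """Return (start, pattern) pairs where pattern occurs at start with at most 1 error."""
    r = min(len(p) for p in patterns)   # assumption: r = shortest pattern length
    table = build_table(patterns, r)
    hits = set()
    for i in range(len(text) - r + 1):
        # Scanning: delete one character of the window at a time and look it up.
        candidates = set()
        for d in deletions(text[i:i + r]):
            candidates |= table.get(d, set())
        # Verification: an occurrence with one error has length m-1, m or m+1.
        for p in candidates:
            m = len(p)
            if any(levenshtein(p, text[i:i + n]) <= 1 for n in (m - 1, m, m + 1)):
                hits.add((i, p))
    return hits
```

Deleting one character from both the pattern prefix and the text window covers a substitution (delete the mismatching position in both), an insertion and a deletion, which is why an exact lookup of the shortened strings is enough as a filter for k=1.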

Improvement
Two-level hashing
Idea: use locality to improve hashing.
First-level hash: a bitmap.
Second-level hash: the original pointers to the patterns.
More than 3-fold improvement.

MultiBP
Basic idea: filter the text by superimposing the patterns' automata. First proposed in 1996.
Preprocessing phase:
Truncate all patterns to the same length. Construct each pattern's automaton. Superimpose all the automata by OR. Example with two patterns: patt / wait

Scanning phase
Use the superimposed automaton to filter the text.

Verification phase
If the automaton accepts a subtext, verify that subtext with approximate single string matching.
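
To illustrate superimposition with the two patterns patt and wait, the sketch below merges the patterns position by position into character classes and searches for the merged pattern. As an assumption made for brevity, the merged pattern is searched with plain dynamic programming instead of a bit-parallel automaton simulation; the function names are hypothetical.

```python
def superimpose(patterns):
    """Truncate to the shortest length and merge position-wise into character classes."""
    m = min(len(p) for p in patterns)
    return [frozenset(p[i] for p in patterns) for i in range(m)]


def class_search(classes, text, k):
    """End positions where the superimposed pattern matches with at most k errors.
    A text character matches position i if it belongs to classes[i]."""
    m = len(classes)
    col = list(range(m + 1))            # col[0] stays 0: a match may start anywhere
    ends = []
    for j, c in enumerate(text):
        diag = col[0]
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(old + 1, col[i - 1] + 1, diag + (c not in classes[i - 1]))
            diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

Superimposing patt and wait gives the classes [pw][a][ti][t]. The merged pattern also accepts strings such as "pait" that match neither pattern, which is why every accepted subtext still has to go through the verification phase.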


MultiPEX
Basic idea: exact partitioning extended to multiple patterns. First proposed in 1997.
Preprocessing phase:
Cut each pattern into k+1 pieces, giving r(k+1) pieces that are all searched for in parallel.

Scanning phase
Scan for all pieces using exact multiple string matching.

Verification phase:
When a piece is found in the text, we use a classical algorithm to verify its pattern in the candidate area.

Improvement
Hierarchical verification
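
The partition-and-verify scheme can be sketched as below. It relies on the fact that k errors spread over k+1 pieces must leave at least one piece unaltered. As simplifying assumptions, each piece is located with `str.find` instead of a true multi-pattern exact matcher, verification is flat rather than hierarchical, and all names are hypothetical.

```python
def split_pieces(pattern, k):
    """Cut pattern into k+1 contiguous pieces; returns (offset, piece) pairs."""
    m = len(pattern)
    assert m >= k + 1, "every piece must be non-empty"
    bounds = [i * m // (k + 1) for i in range(k + 2)]
    return [(bounds[i], pattern[bounds[i]:bounds[i + 1]]) for i in range(k + 1)]


def ends_within_k(pattern, text, k):
    """Classical DP search: end positions of substrings within distance k of pattern."""
    m = len(pattern)
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        diag = col[0]
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(old + 1, col[i - 1] + 1, diag + (pattern[i - 1] != c))
            diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


def multipex_search(patterns, text, k):
    """Return (pattern index, end position) pairs for all occurrences with <= k errors."""
    matches = set()
    for idx, p in enumerate(patterns):
        m = len(p)
        for offset, piece in split_pieces(p, k):
            pos = text.find(piece)
            while pos != -1:
                # Candidate area: the occurrence must lie within k positions
                # of where the whole pattern would sit around this piece.
                lo = max(0, pos - offset - k)
                hi = min(len(text), pos - offset + m + k)
                for e in ends_within_k(p, text[lo:hi], k):
                    matches.add((idx, lo + e))
                pos = text.find(piece, pos + 1)
    return matches
```

Only the small candidate areas around piece occurrences are verified, so the expensive dynamic programming step is skipped for most of the text when pieces are rare.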


l-grams
Basic idea: use l-grams to shift the text window. First proposed in 2004.
l-gram: any string of length l over the alphabet.
Preprocessing phase:
Build a table D: Σ^l -> N that gives, for each possible l-gram, the minimum number of differences needed to match the l-gram inside any of the patterns.
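
A minimal sketch of building the table D, assuming that "matching inside a pattern" means the smallest edit distance between the l-gram and any substring of the pattern. The names `min_diff_inside` and `build_D` are hypothetical, and the table is enumerated naively over all |Σ|^l strings.

```python
from itertools import product


def min_diff_inside(gram, pattern):
    """Smallest edit distance between gram and any substring of pattern."""
    g = len(gram)
    col = list(range(g + 1))            # col[0] stays 0: the substring may start anywhere
    best = col[g]                       # cost against the empty substring
    for c in pattern:
        diag = col[0]
        for i in range(1, g + 1):
            old = col[i]
            col[i] = min(old + 1, col[i - 1] + 1, diag + (gram[i - 1] != c))
            diag = old
        best = min(best, col[g])        # the substring may end anywhere
    return best


def build_D(patterns, alphabet, l):
    """Table D: every l-gram over alphabet -> minimum differences inside any pattern."""
    table = {}
    for letters in product(alphabet, repeat=l):
        gram = "".join(letters)
        table[gram] = min(min_diff_inside(gram, p) for p in patterns)
    return table
```

For the single pattern "abab" over the alphabet {a, b} with l = 2, the l-grams "ab" and "ba" occur exactly and get value 0, while "aa" and "bb" each need one difference.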

Scanning phase
Slide a window of length m-k over the text. Read successive l-grams backwards from the end of the window. Any occurrence starting at the beginning of the window must fully contain those l-grams. Accumulate the D values of the l-grams read so far into a sum Mu. If Mu exceeds k, the window can safely be shifted. Otherwise, verify the text area.
Verification phase
Scan the text area T[i+1, i+m+k] for each of the r patterns. Then shift the window by one position and resume scanning.
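
The scanning and verification phases can be sketched together as follows, under several assumptions: all patterns have the same length m, positions are 0-based, D values are computed on demand and memoized instead of being tabulated in advance, and a failed window is shifted to just past the first character of the last l-gram read. All names are hypothetical.

```python
def min_diff_inside(gram, pattern):
    """Smallest edit distance between gram and any substring of pattern."""
    g = len(gram)
    col = list(range(g + 1))
    best = col[g]
    for c in pattern:
        diag = col[0]
        for i in range(1, g + 1):
            old = col[i]
            col[i] = min(old + 1, col[i - 1] + 1, diag + (gram[i - 1] != c))
            diag = old
        best = min(best, col[g])
    return best


def prefix_within_k(pattern, area, k):
    """True if some prefix of area is within edit distance k of pattern."""
    prev = list(range(len(area) + 1))
    for i, x in enumerate(pattern, start=1):
        curr = [i]
        for j, y in enumerate(area, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return min(prev) <= k               # prev[j] = distance(pattern, area[:j])


def lgram_search(patterns, text, k, l):
    """Return (start, pattern index) pairs for occurrences with at most k errors."""
    m = len(patterns[0])
    assert all(len(p) == m for p in patterns) and 0 <= k < m
    w = m - k                           # window length
    memo = {}

    def d(gram):                        # lazily filled table D
        if gram not in memo:
            memo[gram] = min(min_diff_inside(gram, p) for p in patterns)
        return memo[gram]

    found = set()
    s = 0
    while s + w <= len(text):
        total, end, shift = 0, s + w, None
        while end - l >= s:             # read l-grams backwards from the window end
            total += d(text[end - l:end])
            end -= l                    # end is now the start of the l-gram just read
            if total > k:               # no occurrence can start in text[s..end]
                shift = end - s + 1
                break
        if shift is None:               # window survived the filter: verify it
            area = text[s:s + m + k]
            for idx, p in enumerate(patterns):
                if prefix_within_k(p, area, k):
                    found.add((s, idx))
            shift = 1
        s += shift
    return found
```

The shift is safe because any occurrence starting between the window start and the last l-gram read has length at least m-k, so it would contain every l-gram read, and their accumulated differences already exceed k.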


[Figure: l-grams, MultiPEX, MultiBP]

Conclusions

Further thoughts
Jumping is the main method for fast matching.
The abstraction and representation of the pattern set determine the cost of jumping.
The cost consists of memory, cache behavior, and the time interval between jumps.
The key problem of approximate string matching is the abstraction and representation of the pattern set.

Abstraction and representation of the pattern set
Special arrangement of patterns
N-byte prefix
N-byte suffix
Fixed length
Abstract of substring
…

Thanks
