Dictionary Matching and Indexing with Edits and Don’t Cares

Post on 06-Feb-2016

Richard Cole, NYU

Lee-Ad Gottlieb, NYU

Moshe Lewenstein, Bar-Ilan

Pattern Matching

Various problems of the following flavor:

Preprocess a text t, or a collection of strings d1, …, dx,

so that given a query string p, all matches with the text can be found quickly.

Indexing

Dictionary queries

Dictionary matching

All-to-all matching

Dictionary queries. Dictionary: Bate, Beat, Boat, Boot. Query: Beta.

Dictionary matching. Dictionary: Bate, Beat, Boat, Boot. Text: The fish beat my boot.

Text indexing. Text: abracadabra. Query: ra (two occurrences).

All-to-all matching. Bate, Beat, Boat, Boot against bat, boots, be.
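As a point of reference, the exact-match versions of three of these problems can be written naively, with no preprocessing at all. This is only a baseline sketch: the function names are made up for illustration, and matching is case-insensitive, which is an assumption read off the slide examples.

```python
def dictionary_query(dictionary, p):
    """Return True if the query string p is itself a dictionary word."""
    return p.lower() in {d.lower() for d in dictionary}


def dictionary_matching(dictionary, text):
    """Return (position, word) for every dictionary word occurring in text."""
    text = text.lower()
    hits = []
    for d in dictionary:
        w = d.lower()
        start = text.find(w)
        while start != -1:
            hits.append((start, d))
            start = text.find(w, start + 1)
    return sorted(hits)


def text_index_query(text, p):
    """Return every position at which p occurs in text."""
    return [i for i in range(len(text) - len(p) + 1) if text.startswith(p, i)]
```

The structures discussed in the rest of the talk answer the same questions after preprocessing, so that query time depends on the pattern length rather than on the size of the text or dictionary.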

Previous Work

Dictionary Queries

[Figure: trie of the dictionary words Bate, Beat, Boat, Boot, searched with the query Beta.]
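A trie answers an exact dictionary query by walking one edge per query character. The sketch below assumes a nested-dict trie, lowercases all strings, and uses the key "$" as an end-of-word marker; these are illustrative choices, not details from the slides.

```python
def build_trie(words):
    """Build a nested-dict trie; node['$'] holds the word that ends there."""
    root = {}
    for w in words:
        node = root
        for ch in w.lower():
            node = node.setdefault(ch, {})
        node["$"] = w  # assumes '$' never occurs inside a word
    return root


def trie_lookup(root, p):
    """Return the stored word equal to p (ignoring case), or None."""
    node = root
    for ch in p.lower():
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$")
```

With the slide's dictionary, looking up Beta follows b, e and then fails at t, so the query takes time proportional to the pattern length.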

Suffix Tree

Text Indexing

[Figure: suffix tree with edges labeled g and o; the slide shows the strings Oogogoogogogoggogogg and oogo.]
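Text indexing can be illustrated with a sorted list of suffixes in place of a suffix tree. This is a stand-in, not the structure from the slides: it uses a plain suffix array built by sorting, copies every suffix (quadratic space), and the function names are hypothetical.

```python
from bisect import bisect_left


def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, p):
    """Return all positions where p occurs, via binary search over suffixes."""
    suffixes = [text[i:] for i in suffix_array]  # materialised for clarity only
    pos = bisect_left(suffixes, p)  # first suffix that is >= p
    hits = []
    # All suffixes that start with p form one contiguous block beginning here.
    while pos < len(suffixes) and suffixes[pos].startswith(p):
        hits.append(suffix_array[pos])
        pos += 1
    return sorted(hits)
```

A suffix tree gives the same answers, with every occurrence of the pattern sitting in the subtree below the node reached by spelling out the pattern.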

Approximate Matches

Wildcards (don’t cares): Boat, Bo*t

Substitutions: Boat, Boot

Edits (insertions and deletions): Boat, B_at
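The three notions of approximate match can be pinned down with small predicates. This is a sketch with hypothetical function names. The edit distance shown is the standard Levenshtein distance, which also counts substitutions; that is an assumption, since the slide lists only insertions and deletions.

```python
def wildcard_match(pattern, word):
    """True if word equals pattern with each '*' standing for one character."""
    return len(pattern) == len(word) and all(
        pc == "*" or pc == wc for pc, wc in zip(pattern, word)
    )


def substitutions(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("substitution distance needs equal lengths")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]
```

In the talk's terms, k bounds the number of wildcards, substitutions, or edits allowed in a match.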

Previous Work – Best Results

Indexing and Dictionary Matching (edits): Buchsbaum, Goodrich, Westbrook.

k = 1: p log log n + occ query time, n log n space

Dictionary Queries (substitutions): Brodal, Gasieniec.

k = 1: p + occ query time, n space

Previous Work – Basic Intuition

Build a suffix tree for abracadabra (prefixes):

abracadab abracada abracad abraca abrac abra abr ab a

And for abracadabra (reversed suffixes):

a ar arb arba arbad arbada arbadac arbadaca arbadacar

Query: abrac*dabra
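The intuition can be sketched for a single don't care: match the part before the * reading forwards, match the part after it reading backwards, and keep the positions where the two meet around one skipped character. The code below is a brute-force illustration of that idea under the assumption of exactly one * in the pattern. It scans the text directly instead of using the two suffix trees, and the function name is hypothetical.

```python
def one_wildcard_matches(text, pattern):
    """Start positions where pattern (containing exactly one '*') matches text."""
    left, right = pattern.split("*")  # assumes exactly one '*'
    n = len(text)
    # Positions just past an occurrence of the left part (forward direction).
    ends_of_left = {
        i + len(left) for i in range(n - len(left) + 1) if text.startswith(left, i)
    }
    # Occurrences of the right part, found by matching its reversal in the
    # reversed text and mapping each hit back to a forward start position.
    rev_text, rev_right = text[::-1], right[::-1]
    starts_of_right = {
        n - j - len(right)
        for j in range(n - len(right) + 1)
        if rev_text.startswith(rev_right, j)
    }
    # The wildcard occupies position e, so the right part must start at e + 1.
    return sorted(e - len(left) for e in ends_of_left if e + 1 in starts_of_right)
```

The earlier structures make the meeting step efficient. Here it is a plain set intersection.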

New Results

Indexing, Dictionary Queries, Dictionary Matching

Substitutions, $k < \log n$:

query time $p + (c_1 \log n)^k \log\log n / k! + \mathit{occ}$

space $n (c_2 \log n)^k / k!$

Edits, $k < \log n$:

query time $p + (c_3 \log n)^k \log\log n / k! + 3^k \mathit{occ}$

space $n (c_4 \log n)^k / k!$

Wildcards in pattern, $k < \log n$:

query time $p + 2^k \log\log n / k! + \mathit{occ}$

space $n + (k + \log n)^k / k!$

Dictionary Wildcard Queries

Three data structures for dictionary wildcard queries

Naïve: $O(n)$ space, $kp$ query time

Less-naïve: $O(n^{1+k})$ space, $p$ query time

New data structure: $O(n \log^k n)$ space, $2^k p$ query time

Naïve Approach

[Figure: animation of a search for the query string *it in a trie over the letters f, a, r, t, i, p, n, y, s.]

Query string: *it

Query time: $kp$
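A minimal sketch of the naïve approach: keep the plain trie and, at a wildcard, try every child in turn. The word list is a guess at the trie drawn on the slide, not taken from the source, and the nested-dict trie with a "$" end marker is an illustrative choice.

```python
def build_trie(words):
    """Nested-dict trie; node['$'] stores the word ending at that node."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w  # assumes '$' and '*' never occur inside a word
    return root


def naive_wildcard_search(node, pattern):
    """Return all words matching pattern, where '*' matches any one character."""
    if not pattern:
        return [node["$"]] if "$" in node else []
    ch, rest = pattern[0], pattern[1:]
    if ch == "*":
        hits = []
        for c, child in node.items():  # branch into every child
            if c != "$":
                hits += naive_wildcard_search(child, rest)
        return hits
    if ch in node:
        return naive_wildcard_search(node[ch], rest)
    return []
```

The structure uses no extra space, but each wildcard multiplies the number of trie paths the search may have to follow.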

Less-Naïve Approach

[Figure: the same trie extended with wildcard (*) subtrees; animation of the search for *it following a * edge.]

Query string: *it

Query time: $p$

Space: $O(n^{1+k})$
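One way to read the less-naïve approach is that every node gets a * edge leading to a merged copy of all of its children's subtrees, so a query never branches. The sketch below follows that reading, which is an assumption. The node layout, the function names, and the word list are hypothetical.

```python
def build(pairs, k):
    """pairs: (remaining suffix, original word); k: wildcards still supported."""
    node = {"end": [w for s, w in pairs if s == ""], "children": {}, "wild": None}
    groups = {}
    for s, w in pairs:
        if s:
            groups.setdefault(s[0], []).append((s[1:], w))
    for ch, grp in groups.items():
        node["children"][ch] = build(grp, k)
    if k > 0 and groups:
        # The '*' edge leads to one trie that merges ALL children's subtrees.
        merged = [item for grp in groups.values() for item in grp]
        node["wild"] = build(merged, k - 1)
    return node


def build_index(words, k):
    return build([(w, w) for w in words], k)


def query(node, pattern):
    """Follow exactly one edge per pattern character (at most k wildcards)."""
    for ch in pattern:
        node = node["wild"] if ch == "*" else node["children"].get(ch)
        if node is None:
            return []
    return list(node["end"])
```

The query is a single walk of length p, but each level of wildcard support copies whole subtrees again, which is where the large space bound comes from.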

New Approach

[Figure: the trie with a smaller wildcard (*) subtree than in the less-naïve approach; animation of the search for *it.]

Query string: *it

Query time: $2^k p$
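A sketch of how such a structure might look, assuming the wildcard subtree at each node merges every child except the heaviest one, so that a wildcard in the query splits the search in two: one branch into the heaviest child, one into the wildcard subtree. Weight is taken to be the number of words below a child. The node layout, the names, and the word list are hypothetical.

```python
def build(pairs, k):
    """pairs: (remaining suffix, original word); k: wildcards still supported."""
    node = {"end": [w for s, w in pairs if s == ""],
            "children": {}, "heavy": None, "wild": None}
    groups = {}
    for s, w in pairs:
        if s:
            groups.setdefault(s[0], []).append((s[1:], w))
    for ch, grp in groups.items():
        node["children"][ch] = build(grp, k)
    if groups:
        heavy = max(groups, key=lambda c: len(groups[c]))  # heaviest child
        node["heavy"] = heavy
        light = [item for c, grp in groups.items() if c != heavy for item in grp]
        if k > 0 and light:
            # The wildcard subtree merges every child EXCEPT the heaviest.
            node["wild"] = build(light, k - 1)
    return node


def build_index(words, k):
    return build([(w, w) for w in words], k)


def query(node, pattern):
    """All words matching pattern; valid for patterns with at most k wildcards."""
    if node is None:
        return []
    if not pattern:
        return list(node["end"])
    ch, rest = pattern[0], pattern[1:]
    if ch != "*":
        return query(node["children"].get(ch), rest)
    hits = []
    if node["heavy"] is not None:
        hits += query(node["children"][node["heavy"]], rest)  # branch 1
    hits += query(node["wild"], rest)                         # branch 2
    return hits
```

Each wildcard doubles the number of walks at most, which matches the $2^k p$ query time, and leaving the heaviest child out is what keeps the copies small.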

Space Analysis

Create a wildcard subtree at each node in the original trie. The heaviest child is not in the wildcard tree.

Look at any leaf of the trie. How many of its ancestors were not the heaviest child?

At most $\log_2 n$. So it appears in at most $\log n$ wildcard trees.

Space: $n \log n$ for one wildcard, $n \log^k n$ for $k$ wildcards.
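The counting argument can be checked on a small example. Each step to a child that is not the heaviest at least halves the number of words below, so a word's path can take at most $\log_2 n$ such steps. The sketch below counts those steps per word. The function name and the word list used to try it are hypothetical, and weight is taken to be the number of words in a subtree.

```python
import math


def light_depths(words):
    """Map each word to the number of non-heaviest steps on its trie path."""
    out = {}

    def walk(group, depth, light):
        # group: the words that share their first `depth` characters
        buckets = {}
        for w in group:
            if len(w) == depth:
                out[w] = light
            else:
                buckets.setdefault(w[depth], []).append(w)
        if not buckets:
            return
        heavy = max(buckets, key=lambda c: len(buckets[c]))  # heaviest child
        for c, sub in buckets.items():
            walk(sub, depth + 1, light + (c != heavy))

    walk(sorted(set(words)), 0, 0)
    return out


words = ["far", "fat", "fit", "pin", "pay", "sit"]
depths = light_depths(words)
# Every word takes at most log2(n) light steps.
assert max(depths.values()) <= math.log2(len(words))
```

Since a word is copied into a wildcard tree only at ancestors where its path takes a light step, the total size of the first level of wildcard trees is bounded by n log n.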

Edit Distance

Wildcard search is (algorithmically) the simplest type of approximate search.

What issues come up when dealing with substitutions, insertions and deletions?

Substitution Search

[Figure: animation of a search for the query string aab, allowing a substitution, in a trie over the letters a and b.]

Query string: aab
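Without any extra structure, a substitution search has to branch: at each trie node it may follow the matching edge for free or any other edge at the cost of one substitution. A minimal sketch, with a hypothetical word list over the letters a and b and an illustrative nested-dict trie:

```python
def build_trie(words):
    """Nested-dict trie; the key '$' marks the end of a word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True  # assumes '$' never occurs inside a word
    return root


def substitution_search(node, p, k, prefix=""):
    """Return all words of length len(p) within k substitutions of p."""
    if not p:
        return [prefix] if "$" in node else []
    hits = []
    for ch, child in node.items():
        if ch == "$":
            continue
        cost = 0 if ch == p[0] else 1  # following a different edge costs one
        if cost <= k:
            hits += substitution_search(child, p[1:], k - cost, prefix + ch)
    return hits
```

The substitution trees shown next are a way to avoid this branching by precomputing where the search may continue after a mismatch.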

Substitution Tree

[Figure: the trie over the letters a and b with a substitution tree attached.]

Query string: aab

Deletion Tree

[Figure: a trie over the letters a, b and c with a deletion tree attached.]

Insertion Tree

[Figure: the same trie with an insertion tree attached.]

Grouping

[Figure: animation in which the substitution trees of the trie are grouped together.]

Query string: aab

Analysis

Can’t merge along all possible paths of the original trie: too expensive.

Merge along centroid paths. Centroid paths always follow the heaviest child.

Any path from root to leaf traverses at most log n centroid paths.
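A centroid path decomposition can be sketched on an explicit tree: start at the root, keep stepping to the heaviest child, and start a new path at every child that was passed over. The tree representation (a dict from node to list of children), the weight definition (number of leaves below a node), and the function name are assumptions made for illustration.

```python
def centroid_paths(children, root):
    """Split a rooted tree into paths that always follow the heaviest child."""
    weight = {}

    def count_leaves(v):
        kids = children.get(v, [])
        weight[v] = 1 if not kids else sum(count_leaves(c) for c in kids)
        return weight[v]

    count_leaves(root)
    paths, starts = [], [root]
    while starts:
        v = starts.pop()
        path = [v]
        while children.get(v):
            heavy = max(children[v], key=lambda c: weight[c])
            # Every other child becomes the top of its own centroid path.
            starts.extend(c for c in children[v] if c != heavy)
            v = heavy
            path.append(v)
        paths.append(path)
    return paths


tree = {"r": ["a", "b"], "a": ["c", "d", "e"], "c": ["f", "g"]}
paths = centroid_paths(tree, "r")
```

Leaving a centroid path means stepping to a child with at most half the weight, which is why a root-to-leaf walk changes paths at most log n times.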

Grouping

Suppose a search reached up to the 7th edge with no substitutions…

…then we search only three substitution trees.

Space increase: log n factor
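One plausible reading of this example, stated here as an assumption because the slide does not spell out the grouping, is that the substitution trees along a path are merged in blocks whose sizes are powers of two, as in a balanced binary tree over the edges. A prefix of 7 edges is then covered by blocks of 4, 2 and 1 edges. The function name is hypothetical.

```python
def prefix_groups(m):
    """Cover edges 1..m with power-of-two blocks, largest first."""
    groups, start = [], 1
    for bit in reversed(range(m.bit_length())):
        size = 1 << bit
        if m & size:  # this power of two is part of m's binary expansion
            groups.append((start, start + size - 1))
            start += size
    return groups
```

Under this reading `prefix_groups(7)` yields three blocks, and each edge belongs to one block per level of the balanced tree, which would account for the log n factor in space.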

Analysis

[Figure: a root-to-leaf path crossing centroid paths with weights w1, w2, w3, w4.]

log n searches on each centroid path

Total number of searches: $\log n \cdot \log n = \log^2 n$

Analysis

For k = 1: for each centroid path traversed, log n substitution subtree searches. A path to a leaf traverses at most log n centroid paths.

That gives $\log^2 n$ searches, or $\log n$ searches using balanced grouping.

More generally: $\log^k n$ searches. Using a Y-fast trie, each search takes $\log\log n$ time.

Total: $\log^k n \log\log n$

More Rigorous Analysis

Balanced Search Tree

Weight Balanced Search Tree: $O(\log(W/w))$ levels

For a segment of a centroid path whose top has weight W and bottom has weight w, we do about $\log(W/w)$ searches.

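Analysis

[Figure: a root-to-leaf path crossing centroid paths with weights w1, w2, w3, w4.]

$\log(w_1/w_2)$ searches

$\log(w_2/w_3)$ searches

$\log(w_3/w_4)$ searches

Total number of searches: $\log(w_1/w_2) + \log(w_2/w_3) + \log(w_3/w_4) = \log(w_1/w_4)$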
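Written out, the sum telescopes because each ratio splits into a difference of logarithms. The final bound assumes, beyond what the slide states, that the top weight satisfies $w_1 \le n$ and the bottom weight satisfies $w_4 \ge 1$.

```latex
\begin{align*}
\sum_{i=1}^{3} \log\frac{w_i}{w_{i+1}}
  &= (\log w_1 - \log w_2) + (\log w_2 - \log w_3) + (\log w_3 - \log w_4) \\
  &= \log w_1 - \log w_4 \\
  &= \log\frac{w_1}{w_4} \;\le\; \log n .
\end{align*}
```

So the weight-balanced grouping brings the count for k = 1 down from $\log^2 n$ to $\log n$ searches along a whole root-to-leaf path.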

More Rigorous Analysis

Time for one match: $\log^k n \log\log n / k!$

Space: $n (c \log n)^k / k!$ for some constant $c$

Open Problem

Dynamic search structure. Requires a less strict notion of “centroid path”?