Post on 07-Mar-2016
Finite State Automata in
Dawid Weiss
20+ years of coding: 10 years assembly only
Academia & Research: PhD in Information Retrieval, PUT
Open source: Carrot2, HPPC, Lucene, …
Industry & Business: Carrot Search s.c.
Talk outline
State machines (automata): FSAs, DFAs, FSTs and other XXXs.
Use cases in Lucene and Solr: Suggester. FuzzySearch. Index.
No API details: Still @experimental.
(Non)? Deterministic Finite State (Automata|Machines)
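HashSet
hash → slot → value
0x29384d34 → lucene
0xde3e3354 → lucid
0x00000666 → lucifer
FSA (deterministic)
[Diagram: automaton accepting lucene, lucid and lucifer; all three share the path l-u-c, and lucid and lucifer also share the arc i]
exists(sequence)
floor(prefix)
ceil(prefix)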
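The ordered operations listed above are what a hash set cannot offer. A minimal sketch of `exists` and `ceil`, assuming a plain (non-minimized) trie of nested dictionaries as a stand-in for the automaton; the names `build_trie`, `exists`, `ceil` and the `END` marker are illustrative only and are not Lucene API. `floor` would be symmetric and is omitted.

```python
END = ""  # hypothetical end-of-sequence marker; sorts before every real label


def build_trie(words):
    """Build a trie of nested dicts; a state is accepting if it holds END."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def exists(root, sequence):
    """True if the exact sequence is stored."""
    node = root
    for ch in sequence:
        node = node.get(ch)
        if node is None:
            return False
    return END in node


def _smallest(node):
    """Lexicographically smallest stored suffix below a non-empty state."""
    label = min(node)
    return "" if label == END else label + _smallest(node[label])


def ceil(node, key):
    """Smallest stored sequence that is >= key, or None if there is none."""
    if not key:
        return _smallest(node) if node else None
    head, tail = key[0], key[1:]
    if head in node:
        found = ceil(node[head], tail)
        if found is not None:
            return head + found
    # No match through `head`: take the next larger outgoing label, if any.
    for label in sorted(node):
        if label != END and label > head:
            return label + _smallest(node[label])
    return None


root = build_trie(["lucene", "lucid", "lucifer"])
print(exists(root, "lucid"), exists(root, "luc"))
print(ceil(root, "luci"), ceil(root, "lucie"), ceil(root, "lucifez"))
```

Both operations follow arcs label by label, so their cost depends on the key length and the branching along the path, not on the number of stored words.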
[Diagrams: three automata over the letters b, k, i, l]
deterministic, non-minimal
deterministic, minimal
non-deterministic, non-minimal
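(Sorted)Map
lucene → 1
lucid → 2
lucifer → 666
FST (transducer)
[Diagram 1: outputs on the last arc of each word: e|1 for lucene, d|2 for lucid, r|666 for lucifer]
[Diagram 2: outputs moved toward the root: l|1, i|1, f|664. The value of a word is the sum of the outputs on its path: lucene = 1, lucid = 1 + 1 = 2, lucifer = 1 + 1 + 664 = 666]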
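The transducer idea can be sketched as follows: outputs are attached to arcs, and a lookup adds them up along the path. This is a minimal sketch assuming non-negative integer outputs and a plain trie (no state sharing); `Node`, `build_fst` and `lookup` are hypothetical names, not the Lucene FST API.

```python
class Node:
    def __init__(self):
        self.arcs = {}            # label -> [output, Node]
        self.final_output = None  # None means "not an accepting state"


def _min_below(node):
    """Smallest value stored at or below this node."""
    candidates = [] if node.final_output is None else [node.final_output]
    candidates += [_min_below(child) for _, child in node.arcs.values()]
    return min(candidates)


def _push(node, emitted):
    """Move outputs toward the root; `emitted` is the sum on the path so far."""
    if node.final_output is not None:
        node.final_output -= emitted
    for arc in node.arcs.values():
        child = arc[1]
        smallest = _min_below(child)
        arc[0] = smallest - emitted
        _push(child, smallest)


def build_fst(mapping):
    root = Node()
    for word, value in mapping.items():
        node = root
        for ch in word:
            if ch not in node.arcs:
                node.arcs[ch] = [0, Node()]
            node = node.arcs[ch][1]
        node.final_output = value
    _push(root, 0)
    return root


def lookup(root, word):
    """Sum the arc outputs along the path; None if the word is not stored."""
    node, total = root, 0
    for ch in word:
        if ch not in node.arcs:
            return None
        output, node = node.arcs[ch]
        total += output
    return None if node.final_output is None else total + node.final_output


fst = build_fst({"lucene": 1, "lucid": 2, "lucifer": 666})
print(lookup(fst, "lucene"), lookup(fst, "lucid"), lookup(fst, "lucifer"))
```

With the three entries from the slide, the sketch ends up with the same arc outputs as the second diagram: 1 on `l`, 1 on `i` and 664 on `f`.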
NFSAs and Regular expressions
Determinization: states explosion, not always possible
Backtracking: recursion explosion
[Diagrams: NFA fragments for a, e1e2, e+, e*, e?]
Pattern: a?ⁿaⁿ
n = 3 → a?a?a?aaa
Source: Russ Cox, Regular Expression Matching Can Be Simple And Fast (re2).
[Plot: time in ms (axis 0 to 35000) of matching aⁿ against the pattern a?ⁿaⁿ, depending on n (axis 0 to 30). Java 1.6, modern hardware.]
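The blow-up in the plot can be reproduced in miniature. The sketch below hard-codes the pattern family a?ⁿaⁿ (it is not a general regular expression engine) and compares a naive backtracking matcher, which counts its recursive calls, with a simulation that tracks the set of reachable NFA states. Function names are illustrative.

```python
def backtrack_steps(n, text):
    """Match a?^n a^n by backtracking; return (matched, number of calls)."""
    steps = 0

    def match(i, pos):
        nonlocal steps
        steps += 1
        if i == 2 * n:
            return pos == len(text)
        has_a = pos < len(text) and text[pos] == "a"
        if i < n:
            # Optional 'a': try to consume it first, then try to skip it.
            if has_a and match(i + 1, pos + 1):
                return True
            return match(i + 1, pos)
        # Mandatory 'a'.
        return has_a and match(i + 1, pos + 1)

    return match(0, 0), steps


def epsilon_closure(states, n):
    """States 0..n-1 may skip their optional 'a' without consuming input."""
    closed = set()
    for s in states:
        closed.add(s)
        if s < n:
            closed.update(range(s + 1, n + 1))
    return closed


def nfa_match(n, text):
    """Match a?^n a^n by tracking all reachable states at once."""
    current = epsilon_closure({0}, n)
    for ch in text:
        if ch != "a":
            return False
        current = epsilon_closure({s + 1 for s in current if s < 2 * n}, n)
        if not current:
            return False
    return 2 * n in current


for n in (4, 8, 12):
    matched, steps = backtrack_steps(n, "a" * n)
    print(n, matched, steps, nfa_match(n, "a" * n))
```

The backtracking matcher only succeeds after trying every combination of the optional parts, so its call count at least doubles with each increment of n, while the state-set simulation does at most 2n + 1 states of work per input character.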
Linear-time, minimal, deterministic FSA construction
Linear algorithm from sorted input: by Daciuk, Mihov, et al.
Active path: states that still can change
States dictionary: nodes that will never change
1) find the common prefix with the active path (AP)
2) freeze the rest of the AP
3) add the suffix → new AP
[Animation: the three steps applied to the sorted input lucene, lucid, lucifer. After lucene the AP is l-u-c-e-n-e. Adding lucid keeps the common prefix l-u-c, freezes e-n-e and appends i-d. Adding lucifer keeps l-u-c-i, freezes d and appends f-e-r.]
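The three steps can be sketched in a few lines. This is a simplified illustration of incremental construction from sorted input in the spirit of Daciuk, Mihov, et al., not the code used in Lucene; `State`, `build_minimal_fsa`, `accepts` and `count_states` are hypothetical names, and the register (the "states dictionary") is a plain Python dict.

```python
class State:
    def __init__(self):
        self.final = False
        self.arcs = {}  # label -> State, filled in sorted label order

    def signature(self):
        # Targets are already frozen (canonical), so identity comparison
        # of targets is enough to detect equivalent states.
        return (self.final,
                tuple((label, id(target)) for label, target in self.arcs.items()))


def build_minimal_fsa(sorted_words):
    root = State()
    register = {}   # signature -> frozen state that will never change
    path = [root]   # active path: path[k] is the state after previous[:k]
    previous = ""

    def freeze(keep):
        """Freeze active-path states deeper than `keep`, deepest first."""
        while len(path) > keep + 1:
            state = path.pop()
            label = previous[len(path) - 1]
            path[-1].arcs[label] = register.setdefault(state.signature(), state)

    for word in sorted_words:
        if word < previous:
            raise ValueError("input must be sorted")
        # 1) common prefix with the active path
        prefix = 0
        while (prefix < min(len(word), len(previous))
               and word[prefix] == previous[prefix]):
            prefix += 1
        # 2) freeze the rest of the active path
        freeze(prefix)
        # 3) add the suffix, which becomes the new active path
        for label in word[prefix:]:
            state = State()
            path[-1].arcs[label] = state
            path.append(state)
        path[-1].final = True
        previous = word
    freeze(0)
    return root


def accepts(root, word):
    state = root
    for label in word:
        state = state.arcs.get(label)
        if state is None:
            return False
    return state.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.arcs.values())
    return len(seen)


fsa = build_minimal_fsa(["lucene", "lucid", "lucifer"])
print(count_states(fsa), accepts(fsa, "lucid"), accepts(fsa, "luci"))
```

Freezing looks each state up by its signature, so a state that is equivalent to an already frozen one is replaced rather than stored again; for bill and kill the two branches collapse into a single shared chain.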
FS(A|T)s in (Lucene|Solr)
Automata in Lucene|Solr
org.apache.lucene.util.automaton.*: partial port of brics, FuzzyQuery, AutomatonTermsEnum
org.apache.lucene.util.automaton.fst.FST: FSA and FSTs from sorted data, suggester, indexes
org.apache.lucene.util.automaton.fst.*
FSA representation
Arc-based, not state-based: Moore vs. Mealy. Compact vs. intuitive
Next-state chaining: requires unusual tricks during construction
Everything in a byte[]: traversals-ready, memory-efficient
Dual transition storage format: lookup by bsearch or linear scan
Input: abc, bd, bde.
[Diagram: automaton for the input, with states s1 to s5 and arcs a, b, c, d, e]
[Diagram: the arcs serialized into an array with flags and target pointers: cFL bL eFL dL a bL]
[Diagram: a second variant of the array, with an N flag on the last arc: cFL bL eFL dL a bLN]
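A flat, arc-based layout can be illustrated with a hand-built byte array for the input abc, bd, bde. The three-byte arc format, the offsets and the flag values below are assumptions made for this sketch and do not reproduce Lucene's actual encoding; F and L are read here as "final" and "last arc of the state", and next-state chaining (the N flag) is left out.

```python
FINAL = 0x01  # following this arc completes an accepted sequence
LAST = 0x02   # this is the last arc of its state
NO_TARGET = 0  # offset 0 is reserved: the arc leads to no further state


def arc(label, flags, target):
    """Hypothetical fixed-size arc record: [flags, label, target offset]."""
    return bytes([flags, ord(label), target])


data = bytearray([0])                       # offset 0: reserved
data += arc("c", FINAL | LAST, NO_TARGET)   # offset 1: state after "ab"
data += arc("b", LAST, 1)                   # offset 4: state after "a"
data += arc("e", FINAL | LAST, NO_TARGET)   # offset 7: state after "bd"
data += arc("d", FINAL | LAST, 7)           # offset 10: state after "b"
data += arc("a", 0, 4)                      # offset 13: root, first arc
data += arc("b", LAST, 10)                  # offset 16: root, last arc
ROOT = 13


def accepts(data, root, word):
    """Follow arcs by linear scan; a state is the offset of its first arc."""
    state, final = root, False
    for ch in word:
        if state == NO_TARGET:
            return False
        offset = state
        while True:
            flags, label, target = data[offset], data[offset + 1], data[offset + 2]
            if label == ord(ch):
                break
            if flags & LAST:
                return False
            offset += 3
        final = bool(flags & FINAL)
        state = target
    return final


print([w for w in ("abc", "bd", "bde", "ab", "b", "abcd") if accepts(data, ROOT, w)])
```

Because finality lives on the arc rather than on the state, the arc for d can both accept bd and lead on to the state holding e, and the whole automaton is traversable directly from the byte array without building objects.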
Input               Input size            Compressed size (MB)
                    MB     Terms          Lucene   morf.   gzip
Wikipedia t.index   481    38 092 045     258      164     149
Polish infl.        162    3 672 200      3.1      1.7     15.4
Use Cases: Solr's Autocomplete
Solr's Suggesters
Design choices: sort order (alpha, score), prefix vs. spelling, boost exact matches?
Weights: term→weight, lookup(term, onlyMorePopular)
org.apache.solr.spelling.suggest.Lookup: JaspellLookup, TSTLookup, FSTLookup
Take 1
flour|3, four|4, fourier|3, furious|2
→ fou*
[Diagram: automaton over the terms, with each weight stored after a | separator at the end of the term]
Find prefix.
Depth-in traversal for completions.
PQ on score|alpha
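Take 1 can be sketched as: walk the prefix, traverse everything below it depth-first, then rank the collected completions by score and, for ties, alphabetically. The sketch assumes a trie of nested dicts in place of the automaton and that terms never contain the `|` separator; `build_trie` and `suggest` are illustrative names, not the Solr `Lookup` API.

```python
import heapq

WEIGHT = "|"  # hypothetical separator label; its target is the term's weight


def build_trie(weighted_terms):
    root = {}
    for term, weight in weighted_terms.items():
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[WEIGHT] = weight
    return root


def suggest(root, prefix, n):
    # Find prefix.
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    # Traverse the whole subtree depth-first, collecting every completion.
    found = []
    stack = [(node, prefix)]
    while stack:
        current, text = stack.pop()
        for label, target in current.items():
            if label == WEIGHT:
                found.append((text, target))
            else:
                stack.append((target, text + label))
    # Priority queue on score (descending), then alphabetical order.
    return heapq.nsmallest(n, found, key=lambda tw: (-tw[1], tw[0]))


trie = build_trie({"flour": 3, "four": 4, "fourier": 3, "furious": 2})
print(suggest(trie, "fou", 10))
print(suggest(trie, "f", 2))
```

The weakness of this variant is visible in the code: every completion under the prefix is visited before ranking, so short prefixes over a large dictionary are expensive.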
Take 2
2furious, 3flour, 3fourier, 4four
→ fou*
[Diagram: automaton with one root arc per weight (2, 3, 4), each leading to the terms stored with that weight]
From score roots, until N collected:
Find prefix.
Depth-in traversal for completions, stop if N collected.
Find/boost exact match.
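Take 2 prepends the discretized weight to each term, so the lookup can start at the highest-scoring root and stop as soon as N completions are collected. In the sketch below a sorted list with binary search stands in for the automaton, and weights are assumed to be single digits 0 to 9; `build_index` and `suggest` are illustrative names.

```python
import bisect


def build_index(weighted_terms):
    """Keys are '<weight><term>', e.g. '4four'; weights must be 0..9 here."""
    return sorted(f"{weight}{term}" for term, weight in weighted_terms.items())


def suggest(index, prefix, n):
    if n <= 0:
        return []
    results = []
    # From the highest score root down, until n completions are collected.
    for bucket in range(9, -1, -1):
        key = f"{bucket}{prefix}"
        pos = bisect.bisect_left(index, key)
        while pos < len(index) and index[pos].startswith(key):
            results.append(index[pos][1:])  # strip the weight digit
            if len(results) == n:
                return results
            pos += 1
    return results


index = build_index({"flour": 3, "four": 4, "fourier": 3, "furious": 2})
print(suggest(index, "fou", 1))
print(suggest(index, "f", 3))
```

Within a bucket the completions come out in alphabetical order, and lower-scoring buckets are never touched once enough results have been found.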
Take 3 (infixes)
2furious
5urious|furious
5rious|furious
5ious|furious
5ous|furious
5us|furious
5s|furious
3flour
…
[Diagram: automaton over the weighted terms and their infix entries, with root arcs for weights 2 to 7]
Constant time lookups!
Regardless of the terms dictionary size.
Regardless of prefix length.
Exact matches only.
Static snapshot (not incremental).
Discretized weights.
Top50KWiki.utf8, 676 KB, 50 000 terms
                    Jaspell      TST          FST
RAM [B]             7 869 415    7 914 524    300 175
queries per second, tpq
PREFIX [100-200]    458          966          742
PREFIX [6-9]        330          228          659
PREFIX [2-4]        126          29           501
Summary
Summary and Conclusions
Automata: compact, powerful, efficient data structure
Lucene/Solr benefits: behind the scenes, but spreading: index, queries, suggesters
API in Lucene… is being shaped right now, still @experimental
Acknowledgement
Michael McCandless
Robert Muir
committer: .+