Date post: | 14-Aug-2015 |
Upload: | yoshinari-fujinuma |
Kuromoji FST
2015/06/25
Yoshinari Fujinuma
Overview
• Motivation
• Building FST
• How freezing works
• How equivalent state detection works
• Compiled FST and Virtual Machine
Motivation
• Efficient key-value store for dictionary lookup during tokenization
• String -> integer
• int -> token info
Why FST and not Trie?
• Finite State Transducer (FST) = Finite State Automaton + Output
• A trie merges only common prefixes; an FST can merge common suffixes too
• e.g. “can”, “cats”, “dogs”
Overview of how the build works
List of sorted words, list of integers
→ FST Builder → Object-based FST
→ FST Compiler → Compiled FST
How Building / Compiling works
• Two variables are the key
• previous word (prev)
• current word (current)
1. Skip the common prefix of prev and current
2. Make arcs to temp states for the rest of current
3. Freeze (finalize) the states on the suffix of prev that differs from current
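The three steps can be traced without building any states at all. The following is a minimal sketch, not the builder from the repository: the class `BuildTrace` and its methods are hypothetical names, and it only reports, for each (prev, current) pair, how many characters are skipped, which characters get new arcs, and which suffix of prev would be frozen. An empty word is appended as the final "current" so that everything still unfrozen is frozen at the end.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the actual FST builder.
final class BuildTrace {

    /** Step 1: length of the common prefix of prev and current. */
    static int commonPrefixLength(String prev, String current) {
        int limit = Math.min(prev.length(), current.length());
        int i = 0;
        while (i < limit && prev.charAt(i) == current.charAt(i)) {
            i++;
        }
        return i;
    }

    /** Describes steps 1 to 3 for every consecutive (prev, current) pair. */
    static List<String> trace(List<String> sortedWords) {
        List<String> inputs = new ArrayList<>(sortedWords);
        inputs.add(""); // sentinel: an empty current word freezes whatever is left
        List<String> steps = new ArrayList<>();
        String prev = "";
        for (String current : inputs) {
            int shared = commonPrefixLength(prev, current);
            String newArcs = current.substring(shared); // step 2: arcs to temp states
            String frozen = prev.substring(shared);     // step 3: differing suffix of prev
            steps.add("skip=" + shared + " add=" + newArcs + " freeze=" + frozen);
            prev = current;
        }
        return steps;
    }
}
```

For the sorted input cat, cats, catx, the third step of the trace reads `skip=3 add=x freeze=s`: the prefix "cat" is shared, an arc for "x" is added, and the state reached by "s" can be frozen.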
Toy example
• cat -> 0
• cats -> 1
• catx -> 2
Initializing states
• Initialization
Frozen states: (empty)
Temp states: (empty)
Freezing states
• prev word = “”, current word = cat
Frozen states: (empty)
Temp states: c/0 → a/0 → t/0
Add Arc to suffix
• prev word = cat, current word = cats
Frozen states: (empty)
Temp states: c/0 → a/0 → t/0 → s/1
Freeze differing suffix
• prev word = cats, current word = catx
Frozen states: (empty)
Temp states: c/0 → a/0 → t/0 → (s/1 | x/2)
HashCode 1
Merge Equivalent states
• prev word = catx, current word = “”
Frozen states: (empty)
Temp states: c/0 → a/0 → t/0 → (s/1 | x/2)
Freezing states
• prev word = catx, current word = “”
Frozen states: c/0 → a/0 → t/0 → (s/1 | x/2)
Temp states: (empty)
Equivalent state detection
• We want to merge equivalent states!
• Key-value store using HashMap
• Key: State.hashCode()
• Value: State object
• Collisions are resolved by chaining
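A registry of this kind could look roughly like the sketch below. It is an assumption-laden illustration rather than the project's code: `EquivalenceRegistry` and `findOrAdd` are hypothetical names, and the registry is generic so that it can be shown without a full `State` class. The map key is the candidate's `hashCode()`, and each value is a chain (list) of objects that share that hash; `equals` decides whether a candidate is already present in the chain.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: hash code -> chain of objects with that hash code.
final class EquivalenceRegistry<T> {
    private final Map<Integer, List<T>> buckets = new HashMap<>();

    /** Returns the already registered equivalent object, or registers the candidate. */
    T findOrAdd(T candidate) {
        List<T> chain = buckets.computeIfAbsent(candidate.hashCode(), k -> new ArrayList<>());
        for (T existing : chain) {
            if (existing.equals(candidate)) {
                return existing; // equivalent object found: reuse it (merge)
            }
        }
        chain.add(candidate); // same hash but not equivalent: chained in the same bucket
        return candidate;
    }

    /** Number of distinct hash codes seen so far. */
    int bucketCount() {
        return buckets.size();
    }

    /** Number of distinct (non-equivalent) objects registered. */
    int size() {
        int total = 0;
        for (List<T> chain : buckets.values()) {
            total += chain.size();
        }
        return total;
    }
}
```

The Java strings "Aa" and "BB" have the same hash code, so registering both exercises the chaining path: they land in one bucket but remain two separate entries.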
Arc Equivalence
• Same transition character
• Same destination state
• Same output
(Figure: two arcs, both labeled c/0)
State Equivalence
• The sets of outgoing arcs are equivalent
• Both states are of the same type
(Figure: two states, each with an outgoing arc labeled c/0)
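The two equivalence rules map naturally onto `equals` and `hashCode`. The sketch below is one possible reading, not the classes from the repository: the field names are hypothetical, "type of state" is assumed to mean final versus non-final, and "same destination state" is assumed to be an identity check, which works if destination states have already been frozen and merged before their parents are compared. Arcs are compared in list order, which assumes they were added in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of state equivalence.
final class State {
    final boolean isFinal; // assumed meaning of "type of state"
    final List<Arc> arcs = new ArrayList<>();

    State(boolean isFinal) {
        this.isFinal = isFinal;
    }

    State addArc(char label, int output, State target) {
        arcs.add(new Arc(label, output, target));
        return this;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof State)) return false;
        State other = (State) o;
        // Same type of state and equivalent outgoing arcs.
        return isFinal == other.isFinal && arcs.equals(other.arcs);
    }

    @Override
    public int hashCode() {
        return Objects.hash(isFinal, arcs);
    }
}

// Hypothetical sketch of arc equivalence.
final class Arc {
    final char label;
    final int output;
    final State target;

    Arc(char label, int output, State target) {
        this.label = label;
        this.output = output;
        this.target = target;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Arc)) return false;
        Arc other = (Arc) o;
        // Same transition character, same output, same destination object.
        return label == other.label && output == other.output && target == other.target;
    }

    @Override
    public int hashCode() {
        // Identity hash of the target keeps hashing shallow (no recursion into the graph).
        return Objects.hash(label, output, System.identityHashCode(target));
    }
}
```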
How Compiled FST works
• The compiler generates a “Program”
• Running the Program = looking up a word in the dictionary
• The Program runs on a Virtual Machine which we implemented
Compiled FST = “Program”
Word (e.g. “cat”) → Virtual Machine → integer if the word exists in the dictionary, OR -1 if not
Program
• List of instructions, 11 bytes each
• Operation code (Op code)
• Match or Accept, Match, Fail
Op code: 1 byte
Transition char: 2 bytes
Output: 4 bytes
Target address: 4 bytes
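The 11-byte layout can be illustrated with `java.nio.ByteBuffer`. This is a sketch under assumptions: the field order follows the order listed above, the byte order is `ByteBuffer`'s default (big-endian), and the class `InstructionCodec` and the numeric op code values are hypothetical, since the slides do not specify them.

```java
import java.nio.ByteBuffer;

// Hypothetical codec for one 11-byte instruction: op code, transition char, output, target.
final class InstructionCodec {
    static final int SIZE = 1 + 2 + 4 + 4; // 11 bytes per instruction

    static byte[] encode(byte opCode, char transition, int output, int target) {
        return ByteBuffer.allocate(SIZE)
                .put(opCode)         // 1 byte
                .putChar(transition) // 2 bytes
                .putInt(output)      // 4 bytes
                .putInt(target)      // 4 bytes
                .array();
    }

    // Accessors read the fields of instruction number `index` inside a program.
    static byte opCode(byte[] program, int index) {
        return program[index * SIZE];
    }

    static char transition(byte[] program, int index) {
        return ByteBuffer.wrap(program).getChar(index * SIZE + 1);
    }

    static int output(byte[] program, int index) {
        return ByteBuffer.wrap(program).getInt(index * SIZE + 3);
    }

    static int target(byte[] program, int index) {
        return ByteBuffer.wrap(program).getInt(index * SIZE + 7);
    }
}
```

Because every instruction has the same size, instruction `n` starts at byte offset `n * 11`, so a target address can be stored as a plain instruction index.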
Match
• Transition to a given address
• Accumulator += output

Addr  Op code  Char  Output  Target
0     Fail     None  None    None
1     M / A    x     2       0
2     M / A    s     1       0
3     Fail
4     Match    t     0       2
….
Fail
• Stop running the Program and return -1
• e.g. “tss”
Match or Accept
• If the current character is the final character of the word, stop running the Program and return the accumulator
• Otherwise, behave like Match
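Instructions vs. Arcs
• What instructions represent: each Match or M / A instruction encodes one arc (transition char / output)
• Instruction 4 (Match t 0 2) is the arc t/0
• Instructions 2 (M / A s 1 0) and 1 (M / A x 2 0) are the arcs s/1 and x/2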
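Virtual Machine running backwards

Addr  Op code  Char  Output  Target
0     Fail     None  None    None
1     M / A    x     2       0
2     M / A    s     1       0
3     Fail
4     Match    t     0       2
5     Fail
6     Match    a     0       4
7     Fail
8     Match    c     0       6

• States are frozen starting from the suffixes, so the arc for the first character (c) ends up as the last instruction and the Program is run from the end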
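To make the control flow concrete, here is a small interpreter for the toy program. It is a sketch under stated assumptions, not the Virtual Machine from the repository: the names `Instruction`, `ToyVm`, and `lookup` are hypothetical; execution is assumed to start at the last instruction; a non-matching instruction is assumed to fall through to the next lower address until a Fail is reached; and a plain Match on the final character is assumed to mean "not in the dictionary". One deliberate deviation: the `t` instruction is encoded as Match or Accept so that "cat" itself is accepted, whereas the table on the slide lists it as Match.

```java
// Hypothetical sketch of a VM for the toy program (cat -> 0, cats -> 1, catx -> 2).
final class Instruction {
    enum Op { FAIL, MATCH, MATCH_OR_ACCEPT }

    final Op op;
    final char ch;
    final int output;
    final int target;

    Instruction(Op op, char ch, int output, int target) {
        this.op = op;
        this.ch = ch;
        this.output = output;
        this.target = target;
    }

    static Instruction fail() {
        return new Instruction(Op.FAIL, '\0', 0, 0);
    }
}

final class ToyVm {
    /** Returns the accumulated output for the word, or -1 if it is not in the program. */
    static int lookup(Instruction[] program, String word) {
        if (word.isEmpty()) {
            return -1;
        }
        int pc = program.length - 1; // assumed entry point: the last instruction
        int accumulator = 0;
        int pos = 0;
        while (true) {
            Instruction ins = program[pc];
            if (ins.op == Instruction.Op.FAIL) {
                return -1; // no arc of the current state matched
            }
            if (ins.ch != word.charAt(pos)) {
                pc--; // linear search: try the next arc of the same state
                continue;
            }
            accumulator += ins.output;
            if (pos == word.length() - 1) {
                // Final character: only Match or Accept may end the lookup successfully.
                return ins.op == Instruction.Op.MATCH_OR_ACCEPT ? accumulator : -1;
            }
            pos++;
            pc = ins.target; // transition to the first arc of the next state
        }
    }

    /** The toy program, with t encoded as Match or Accept (an assumption, see lead-in). */
    static Instruction[] toyProgram() {
        return new Instruction[] {
            Instruction.fail(),                                           // 0
            new Instruction(Instruction.Op.MATCH_OR_ACCEPT, 'x', 2, 0),   // 1
            new Instruction(Instruction.Op.MATCH_OR_ACCEPT, 's', 1, 0),   // 2
            Instruction.fail(),                                           // 3
            new Instruction(Instruction.Op.MATCH_OR_ACCEPT, 't', 0, 2),   // 4
            Instruction.fail(),                                           // 5
            new Instruction(Instruction.Op.MATCH, 'a', 0, 4),             // 6
            Instruction.fail(),                                           // 7
            new Instruction(Instruction.Op.MATCH, 'c', 0, 6),             // 8
        };
    }
}
```

Looking up "catx" walks 8 → 6 → 4 → 2, finds that `s` does not match, steps down to instruction 1, matches `x`, and returns the accumulator 2. Looking up "tss" fails immediately: `c` at address 8 does not match, and address 7 is a Fail.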
Use of Cache
• The lookup for the next state is done by linear search
• The number of outgoing arcs from the start state is large
• Therefore, we cache those outgoing arcs
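The slides do not say how the cache is structured. One plausible shape, sketched below with hypothetical names (`StartStateCache`, `lookup`, `linearSearch`), is a table indexed directly by the character value that stores the instruction address of the start state's arc for that character, with -1 meaning "no such arc". This trades a fixed 65,536-entry `int` array for a constant-time first step.

```java
import java.util.Arrays;

// Hypothetical sketch: direct-indexed cache for the outgoing arcs of the start state.
final class StartStateCache {
    private final int[] addressByChar = new int[Character.MAX_VALUE + 1];

    /** labels[i] is the transition char of an arc, addresses[i] its instruction address. */
    StartStateCache(char[] labels, int[] addresses) {
        Arrays.fill(addressByChar, -1); // -1 marks "no outgoing arc for this char"
        for (int i = 0; i < labels.length; i++) {
            addressByChar[labels[i]] = addresses[i];
        }
    }

    /** Constant-time replacement for the linear search on the start state. */
    int lookup(char c) {
        return addressByChar[c];
    }

    /** The uncached behaviour, for comparison: scan the arcs one by one. */
    static int linearSearch(char[] labels, int[] addresses, char c) {
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] == c) {
                return addresses[i];
            }
        }
        return -1;
    }
}
```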
Summary
• An FST is theoretically more compact than a trie
• Implemented an FST Builder which builds
• Object-based FST
• Compiled FST, a compact form
• Uses a Virtual Machine to run the compiled Program (= look up a word)
References
• Direct Construction of Minimal Acyclic Subsequential Transducers, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
• Smaller representation of finite-state automata, http://www.sciencedirect.com/science/article/pii/S0304397512003787
• Blog post by Ikawa-san, http://qiita.com/ikawaha/items/be95304a803020e1b2d1
• This code is available at https://github.com/atilika/fst