Discover Elixir Compiler Internals
By creating a Morphological Parser
@xiamx (Meng Xuan Xia)
About me
A fan of diverse research topics: Distributed Systems, Machine Learning, Natural Language Processing and Formal Verification.
Open Source contributor.
Blog: http://cs.mcgill.ca/~mxia3
GitHub: https://github.com/xiamx
Morpho*&^ical Parser ?
Morphological parsing and lemmatization
Want to build a chatbot? Need to match on text.
We use different forms of a word, such as: organize, organizes, and organizing
…
Cuz grammar!
Lemmatization gives words a common form
the boy's cars are different colors
the boy car be differ color
Simple Lemmatizer in Elixir (string pattern matching)
def lemma("organiz" <> suffix) do
  # Match the literal root, then normalize the suffix.
  "organiz" <> case suffix do
    "e" -> "e"
    "es" -> "e"
    "ed" -> "e"
    "ing" -> "e"
  end
end
Compilation pipeline:
Elixir source code
  -> (Parsing) -> Elixir AST
  -> (Macro expansion) -> Expanded Elixir AST
  -> Erlang Abstract Format
  -> Elixir_compiler_X BEAM assembly
How Elixir represents code internally (AST)
{:def, [context: Elixir, import: Kernel],
 [{:lemma, [context: Elixir],
   [{:<>, [context: Elixir, import: Kernel],
     ["organiz", {:suffix, [], Elixir}]}]},
  [do: {:<>, [context: Elixir, import: Kernel],
    ["organiz",
     {:case, [],
      [{:suffix, [], Elixir},
       [do: [{:->, [], [["e"], "e"]},
             {:->, [], [["es"], "e"]},
             {:->, [], [["ed"], "e"]},
             {:->, [], [["ing"], "e"]}]]]}]}]]}
def lemma("organiz" <> suffix) do
  "organiz" <> case suffix do
    "e" -> "e"
    "es" -> "e"
    "ed" -> "e"
    "ing" -> "e"
  end
end
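To inspect this representation yourself, `quote` returns the AST of any expression as plain Elixir data. The following is a minimal sketch using only the standard library; the quoted fragment is just the function's pattern, chosen for brevity, and the variable names are illustrative.

```elixir
# Quote a small expression to obtain its AST as plain Elixir data.
ast =
  quote do
    "organiz" <> suffix
  end

# Every call node is a three-element tuple: {name, metadata, arguments}.
{operator, _metadata, [prefix, {variable, _, _}]} = ast

IO.inspect(operator)
IO.inspect(prefix)
IO.inspect(variable)

# Macro.to_string/1 turns the AST back into source text.
source = Macro.to_string(ast)
IO.puts(source)
```

The metadata lists differ between Elixir versions, so the sketch ignores them and only destructures the parts that are stable.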
Problem: we have 50 000 words or more
Solution:
Write a macro that generates 50 000 lemma/1 function clauses.
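# `roots` must be a literal list of strings at the call site,
# because the macro receives AST rather than run-time values.
defmacro generate_lemma(roots) do
  for root <- roots do
    quote do
      def lemma(unquote(root) <> suffix) do
        unquote(root) <> case suffix do
          "s" -> ""
          "ed" -> ""
          "ing" -> ""
        end
      end
    end
  end
end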
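A sketch of how such a macro might be used from another module. The module names `LemmaGenSketch` and `VerbsSketch`, the three roots, and the fallback clause are all assumptions made for illustration; the sketch also uses a guard instead of the `case` above and wraps the generated clauses in a single block.

```elixir
defmodule LemmaGenSketch do
  # Returns one `def lemma/1` clause per root, wrapped in a single block.
  defmacro generate_lemma(roots) do
    clauses =
      for root <- roots do
        quote do
          def lemma(unquote(root) <> suffix) when suffix in ["", "s", "ed", "ing"] do
            unquote(root)
          end
        end
      end

    {:__block__, [], clauses}
  end
end

defmodule VerbsSketch do
  require LemmaGenSketch

  # The roots are passed as a literal list so the macro can iterate over them.
  LemmaGenSketch.generate_lemma(["walk", "talk", "jump"])

  # Fallback for words that match no generated clause.
  def lemma(word), do: word
end

IO.inspect(VerbsSketch.lemma("walking"))
```

Each root becomes its own function clause in the compiled module, which is exactly why the approach gets expensive as the word list grows.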
Compilation pipeline:
Elixir source code
  -> (Parsing) -> Elixir AST
  -> (Macro expansion) -> Expanded Elixir AST
  -> Erlang Abstract Format
  -> Elixir_compiler_X BEAM assembly
Problem: AST explosion, compilation cost skyrockets
{:def, [context: Elixir, import: Kernel],
 [{:lemma, [context: Elixir],
   [{:<>, [context: Elixir, import: Kernel],
     ["organiz", {:suffix, [], Elixir}]}]},
  [do: {:<>, [context: Elixir, import: Kernel],
    ["organiz",
     {:case, [],
      [{:suffix, [], Elixir},
       [do: [{:->, [], [["e"], "e"]},
             {:->, [], [["es"], "e"]},
             {:->, [], [["ed"], "e"]},
             {:->, [], [["ing"], "e"]}]]]}]}]]}
Macro expansion generates this block 100 000 times
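To get a feel for the growth, one can count AST nodes directly. This is a rough sketch, not a measurement of real compilation cost: the helper names `clause_for` and `count_nodes` and the synthetic roots are assumptions, and node count is only a proxy for compiler work.

```elixir
# Build the quoted clause that a generator would emit for one root.
clause_for = fn root ->
  quote do
    def lemma(unquote(root) <> suffix) do
      unquote(root) <> case suffix do
        "s" -> ""
        "ed" -> ""
        "ing" -> ""
      end
    end
  end
end

# Count AST nodes by walking the tree with Macro.prewalk/3.
count_nodes = fn ast ->
  {_ast, count} = Macro.prewalk(ast, 0, fn node, acc -> {node, acc + 1} end)
  count
end

per_clause = count_nodes.(clause_for.("organiz"))

# Synthetic roots standing in for a real word list.
roots = for n <- 1..1_000, do: "root#{n}"
total = roots |> Enum.map(clause_for) |> Enum.map(count_nodes) |> Enum.sum()

IO.puts("nodes per clause: #{per_clause}")
IO.puts("nodes for 1000 clauses: #{total}")
```

Every root carries the full clause structure with it, so the tree the compiler has to process grows by a whole clause per word.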
Problem: AST explosion, compilation cost skyrockets
Alternative data structure: Finite State Transducer (FST)
[Figure: FST for Witch(-es) and Wizard(-s)]
How does one build an FST in Elixir?
Using gen_fst's rule/2 DSL
@type fst_rule :: String.t | {String.t, String.t}
@spec rule(fst, [fst_rule]) :: fst
GenFST.rule(fst, rule)
Defines a transducing rule and adds it to the fst (which is just a plain graph).
A transducing rule is a list of String.t | {String.t, String.t}.
For example: rule fst, ["organiz", {"es", "e"}] means output "organiz" verbatim and transform "es" into "e". If a finite state transducer built with this rule is fed the string "organizes", the output is "organize".
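The semantics of a single rule can be sketched in a few lines of plain Elixir. This is not how gen_fst is implemented (it builds a graph); `RuleSketch` and `apply_rule/2` are hypothetical names used only to make the rule format concrete.

```elixir
defmodule RuleSketch do
  # Applies one rule to an input string.
  # Returns {:ok, output} when the whole input is consumed, :error otherwise.
  def apply_rule(rule, input), do: walk(rule, input, "")

  defp walk([], "", output), do: {:ok, output}
  defp walk([], _leftover, _output), do: :error

  defp walk([segment | rest], input, output) do
    # A plain string is copied verbatim; a tuple rewrites `from` into `to`.
    {from, to} =
      case segment do
        {from, to} -> {from, to}
        text when is_binary(text) -> {text, text}
      end

    if String.starts_with?(input, from) do
      size = byte_size(from)
      remaining = binary_part(input, size, byte_size(input) - size)
      walk(rest, remaining, output <> to)
    else
      :error
    end
  end
end

IO.inspect(RuleSketch.apply_rule(["organiz", {"es", "e"}], "organizes"))
```

An input that does not follow the rule, such as "organizing" here, yields `:error` instead of a partial result.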
GenFST.rule/2 example
require GenFST
fst = GenFST.new
fst = fst |> GenFST.rule(["organiz", {"e", "e"}])
fst = fst |> GenFST.rule(["organiz", {"es", "e"}])
fst = fst |> GenFST.rule(["organiz", {"ing", "e"}])
# … other rules
[Figure: FST built from the rules above. Edge organiz : organiz, followed by the edges e : e, es : e and ing : e]
Difference between GenFST and Pattern Matching
Data structure
  GenFST: Stores only data related to the lemmatization problem
  Pattern Matching: Stores Elixir semantics in addition to lemmatization data
Matching algorithm
  GenFST: Traversal on a trie-like data structure (FST graph)
  Pattern Matching: Uses the highly optimized BEAM VM binary pattern matching
Build time vs. number of lemmas
  GenFST: Linear scaling
  Pattern Matching: Non-linear scaling
Lemmatization speed*
  GenFST: Constant
  Pattern Matching: Constant
*assuming avg word length << number of lemmas
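The "traversal on a trie-like data structure" entry can be illustrated with a small sketch built from nested maps. It is a simplified stand-in, not gen_fst's actual graph: `TrieFstSketch` and its functions are hypothetical, and it assumes that no two rules assign different outputs to the same input path.

```elixir
defmodule TrieFstSketch do
  # A state holds outgoing edges (input grapheme => {output, child}) and a
  # flag saying whether a rule may end here.
  def new, do: %{edges: %{}, final: false}

  def rule(fst, segments), do: insert(fst, Enum.flat_map(segments, &pairs/1))

  def parse(fst, word), do: traverse(fst, String.graphemes(word), [])

  # Verbatim text: every grapheme outputs itself.
  defp pairs(text) when is_binary(text) do
    for g <- String.graphemes(text), do: {g, g}
  end

  # {from, to}: the first input grapheme emits `to`, the rest emit nothing.
  defp pairs({from, to}) do
    [first | rest] = String.graphemes(from)
    [{first, to} | Enum.map(rest, &{&1, ""})]
  end

  defp insert(state, []), do: %{state | final: true}

  defp insert(state, [{input, output} | rest]) do
    {_old_output, child} = Map.get(state.edges, input, {output, new()})
    %{state | edges: Map.put(state.edges, input, {output, insert(child, rest)})}
  end

  defp traverse(%{final: true}, [], acc), do: {:ok, acc |> Enum.reverse() |> Enum.join()}
  defp traverse(_state, [], _acc), do: :error

  defp traverse(state, [g | rest], acc) do
    case state.edges do
      %{^g => {output, child}} -> traverse(child, rest, [output | acc])
      _ -> :error
    end
  end
end

fst =
  TrieFstSketch.new()
  |> TrieFstSketch.rule(["organiz", {"e", "e"}])
  |> TrieFstSketch.rule(["organiz", {"es", "e"}])
  |> TrieFstSketch.rule(["organiz", {"ing", "e"}])

IO.inspect(TrieFstSketch.parse(fst, "organizing"))
```

A lookup takes one step per grapheme of the input word regardless of how many rules were inserted, and the three rules share the "organiz" prefix instead of repeating it.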
Building FST at run-time vs at compile-time
defmodule Lemma.Benchmark.ParserDynamic do
  @moduledoc """
  Parser is generated at run-time with `new/0`.
  """
  import Lemma.MorphParserGenerator

  def new do
    fst = GenFST.new
      |> generate_rules(Lemma.En.Verbs.all, Lemma.En.Rules.verbs)
    IO.puts("Rules generated")
    fst
  end

  def parse(fst, word) do
    GenFST.parse(fst, word)
  end
end

defmodule Lemma.Benchmark.Parser do
  @moduledoc """
  Parser is generated at compile-time as a module attribute.
  """
  import Lemma.MorphParserGenerator

  @fst GenFST.new
  @fst generate_rules(@fst, Lemma.En.Verbs.all, Lemma.En.Rules.verbs)
  IO.puts("Rules generated")

  def parse(word) do
    GenFST.parse(@fst, word)
  end
end
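The same contrast can be reproduced without gen_fst. In this sketch a plain map stands in for the FST; the module names and the synthetic word list are assumptions made for illustration only.

```elixir
defmodule RunTimeSketch do
  # Hypothetical stand-in for an FST: a map built on every call to new/0.
  def new, do: Map.new(1..1_000, fn n -> {"word#{n}s", "word#{n}"} end)

  # The caller has to keep the table around and pass it in.
  def parse(table, word), do: Map.get(table, word)
end

defmodule CompileTimeSketch do
  # Evaluated once while the module compiles; the resulting map is embedded
  # in the compiled module as a literal.
  @table Map.new(1..1_000, fn n -> {"word#{n}s", "word#{n}"} end)

  def parse(word), do: Map.get(@table, word)
end

table = RunTimeSketch.new()
IO.inspect(RunTimeSketch.parse(table, "word42s"))
IO.inspect(CompileTimeSketch.parse("word42s"))
```

Both versions return the same answers; the difference is when the structure is built and where it lives afterwards.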
Performance benchmark
Name              ips        average     deviation   median
Compiled parser   16.25 K    61.55 μs    13.87%      62.0 μs
Dynamic parser    4.39 K     228.91 μs   13.93%      250.00 μs

Comparison:
Compiled parser   16.25 K
Dynamic parser    4.39 K - 3.70x slower
Why does building the FST at compile time make run-time lemmatization faster?
Compilation pipeline:
Elixir source code
  -> (Parsing) -> Elixir AST
  -> (Macro expansion) -> Expanded Elixir AST
  -> Erlang Abstract Format
  -> Elixir_compiler_X BEAM assembly
When a structure is known at compile-time it is cached into a special cache area in the BEAM file, and when loaded those values are marked as do not need to garbage collect, and it can make a lot of other assumptions like not needing to copy them between processes and instead just pointing to it everywhere, which does indeed give quite a speed boost.
--OvermindDL1
Can we parallelize* lemmatization?
*parallelizing the dynamically loaded version
Naive parallelization using Task.async/await
paragraph_of_words
|> Enum.map(fn word -> Task.async(fn -> Lemma.parse(fst, word) end) end)
|> Enum.map(&Task.await/1)
Naive parallelism doesn’t work
fst is a 500 MB graph data structure
In an 80-word sentence, fst gets copied 80 times, once for each task
We run out of memory very quickly!
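Going by the explanation quoted earlier, a structure compiled into a module as a literal does not need to be copied between processes. One possible way around the copying is therefore to have each task call a compile-time parser rather than capture fst in a closure. The sketch below uses a small map as a hypothetical stand-in for the FST and has not been tried against the real 500 MB structure.

```elixir
defmodule SharedParserSketch do
  # Stand-in for a compile-time FST: the map becomes a literal of the module.
  @table Map.new(1..1_000, fn n -> {"word#{n}s", "word#{n}"} end)

  # Unknown words are returned unchanged.
  def parse(word), do: Map.get(@table, word, word)
end

words = ["word1s", "word2s", "unknown"]

# Each task receives only one word; the table is referenced from the module
# instead of being captured in the task's closure.
lemmas =
  words
  |> Task.async_stream(&SharedParserSketch.parse/1)
  |> Enum.map(fn {:ok, lemma} -> lemma end)

IO.inspect(lemmas)
```

`Task.async_stream/2` also caps the number of concurrent tasks at the number of schedulers by default, which avoids spawning one process per word at once.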
Recap
Compilation pipeline:
Elixir source code
  -> (Parsing) -> Elixir AST
  -> (Macro expansion) -> Expanded Elixir AST
  -> Erlang Abstract Format
  -> Elixir_compiler_X BEAM assembly
Questions ?
More on Elixir Compiler:
● https://medium.com/@fxn/how-does-elixir-compile-execute-code-c1b36c9ec8cf
● https://elixirforum.com/t/getting-each-stage-of-elixirs-compilation-all-the-way-to-the-beam-bytecode/1873
More on Lemma: https://github.com/xiamx/lemma
More on GenFST: https://github.com/xiamx/gen_fst