Discover Elixir Compiler Internals

By creating a Morphological Parser

@xiamx (Meng Xuan Xia)

About me

A fan of diverse research topics: distributed systems, machine learning, natural language processing, and formal verification.

Open Source contributor.

Blog: http://cs.mcgill.ca/~mxia3

Github: https://github.com/xiamx

http://cs.mcgill.ca/~mxia3/pdfs/UCORE14.pdf

http://www.aclweb.org/anthology/W/W16/W16-58.pdf#page=83

https://github.com/xiamx/awesome-sentiment-analysis

https://www.slideshare.net/xiamxqt/case-study-formal-verification-of-the-brain-fuck-scheduler-41880642

Morpho*&^ical Parser ?

Morphological parsing and lemmatization

Want to build a chatbot? Need to match on text.

We use different forms of a word, such as: organize, organizes, and organizing

…

Cuz grammar!

Lemmatization gives words a common form

the boy's cars are different colors

the boy car be differ color

Simple Lemmatizer in Elixir (string pattern matching)

def lemma("organiz" <> suffix) do
  # Match the literal prefix "organiz", then map each known suffix onto "e".
  "organiz" <>
    case suffix do
      "e" -> "e"
      "es" -> "e"
      "ed" -> "e"
      "ing" -> "e"
    end
end
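
A self-contained sketch of this string-pattern-matching approach follows. The module name `SimpleLemma`, the `other -> other` branch, and the catch-all clause are assumptions added for illustration; they are not part of the slide.

```elixir
defmodule SimpleLemma do
  # Hypothetical module name, used only for this illustration.

  # The binary pattern "organiz" <> suffix matches any word that starts with
  # "organiz" and binds the remainder of the word to `suffix`.
  def lemma("organiz" <> suffix) do
    "organiz" <>
      case suffix do
        "e" -> "e"
        "es" -> "e"
        "ed" -> "e"
        "ing" -> "e"
        # Assumption: unknown suffixes are kept unchanged.
        other -> other
      end
  end

  # Assumption: words without a known root are returned as-is.
  def lemma(word), do: word
end

IO.inspect(SimpleLemma.lemma("organizes"))
IO.inspect(SimpleLemma.lemma("organizing"))
IO.inspect(SimpleLemma.lemma("cat"))
```

Every additional root needs another hand-written clause, which is what motivates generating the clauses automatically.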

Elixir source code
  -> (Parsing) -> Elixir AST
  -> (Macro expansion) -> Expanded Elixir AST
  -> Erlang Abstract Format
  -> Elixir_compiler_X BEAM assembly
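
The first two stages of this pipeline can be poked at from a script or IEx. The snippet below is a minimal sketch using standard library calls (`Code.string_to_quoted/1`, `Macro.expand_once/2`); the source string `"if ready, do: :go"` is an arbitrary example chosen here, not something from the talk.

```elixir
# Parsing: source text -> Elixir AST.
{:ok, ast} = Code.string_to_quoted("if ready, do: :go")
IO.inspect(ast, label: "parsed")

# Macro expansion: `if` is a macro defined in Kernel, so a single expansion
# step rewrites the call in terms of the `case` special form.
expanded = Macro.expand_once(ast, __ENV__)
IO.puts(Macro.to_string(expanded))
```

The later stages (Erlang Abstract Format and BEAM assembly) are produced by the compiler itself and are not shown here.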

How Elixir represents code internally (AST)

{:def, [context: Elixir, import: Kernel],
 [{:lemma, [context: Elixir],
   [{:<>, [context: Elixir, import: Kernel],
     ["organiz", {:suffix, [], Elixir}]}]},
  [do: {:<>, [context: Elixir, import: Kernel],
    ["organiz",
     {:case, [],
      [{:suffix, [], Elixir},
       [do: [{:->, [], [["e"], "e"]},
             {:->, [], [["es"], "e"]},
             {:->, [], [["ed"], "e"]},
             {:->, [], [["ing"], "e"]}]]]}]}]]}

def lemma("organiz" <> suffix) do
  "organiz" <>
    case suffix do
      "e" -> "e"
      "es" -> "e"
      "ed" -> "e"
      "ing" -> "e"
    end
end
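
An AST of this shape can be obtained with `quote` and taken apart with ordinary pattern matching. The following is a minimal sketch; the one-line function body is a shortened stand-in chosen for brevity, and the exact metadata lists vary between Elixir versions, so they are matched with `_`.

```elixir
# `quote` returns the AST of the code it wraps instead of running it.
ast =
  quote do
    def lemma("organiz" <> suffix), do: "organiz" <> suffix
  end

# Every call node is a three-element tuple: {name, metadata, arguments}.
{:def, _meta, [head, [do: body]]} = ast
{:lemma, _, [pattern]} = head
{:<>, _, ["organiz", {:suffix, _, _}]} = pattern
{:<>, _, ["organiz", {:suffix, _, _}]} = body

# Macro.to_string/1 turns the AST back into source text.
IO.puts(Macro.to_string(ast))
```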

Problem: we have 50 000 or more words

Solution:

Write a macro to generate 50 000 lemma/1 function clauses.

defmacro generate_lemma(roots) do
  for root <- roots do
    quote do
      def lemma(unquote(root) <> suffix) do
        unquote(root) <>
          case suffix do
            "s" -> ""
            "ed" -> ""
            "ing" -> ""
          end
      end
    end
  end
end
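
To see the macro approach end to end, here is a minimal runnable sketch. The module names `LemmaMacros` and `GeneratedLemma`, the two example roots, the extra `"" -> ""` branch, and the catch-all clause are assumptions for illustration. The generated clauses are wrapped in a `__block__` node here, which is one conventional way to return several definitions from a macro, and the macro is given a literal list so that `roots` is available as plain data at expansion time.

```elixir
defmodule LemmaMacros do
  # Hypothetical helper module mirroring the generate_lemma macro.
  defmacro generate_lemma(roots) do
    clauses =
      for root <- roots do
        quote do
          def lemma(unquote(root) <> suffix) do
            unquote(root) <>
              case suffix do
                # Assumption: the bare root maps to itself.
                "" -> ""
                "s" -> ""
                "ed" -> ""
                "ing" -> ""
              end
          end
        end
      end

    # Return all generated clauses as a single block of definitions.
    {:__block__, [], clauses}
  end
end

defmodule GeneratedLemma do
  require LemmaMacros

  # Expands into one lemma/1 clause per root at compile time.
  LemmaMacros.generate_lemma(["walk", "jump"])

  # Assumption: words with no known root are returned unchanged.
  def lemma(word), do: word
end

IO.inspect(GeneratedLemma.lemma("walked"))
IO.inspect(GeneratedLemma.lemma("jumping"))
```

As on the slide, a known root followed by an unlisted suffix (for example "walkie") raises a `CaseClauseError`, because the `case` has no fallback branch.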

Elixir source code
  -> (Parsing) -> Elixir AST
  -> (Macro expansion) -> Expanded Elixir AST

Problem: AST explosion, compilation cost skyrockets

{:def, [context: Elixir, import: Kernel],
 [{:lemma, [context: Elixir],
   [{:<>, [context: Elixir, import: Kernel],
     ["organiz", {:suffix, [], Elixir}]}]},
  [do: {:<>, [context: Elixir, import: Kernel],
    ["organiz",
     {:case, [],
      [{:suffix, [], Elixir},
       [do: [{:->, [], [["e"], "e"]},
             {:->, [], [["es"], "e"]},
             {:->, [], [["ed"], "e"]},
             {:->, [], [["ing"], "e"]}]]]}]}]]}

Macro expansion generates a block like this 100 000 times

Alternate data structure: Finite State Transducer

Witch(-es)

Wizard(-s)

How does one build an FST in Elixir?

Using gen_fst’s rule/2 DSL

@type fst_rule :: String.t | {String.t, String.t}

@spec rule(fst, [fst_rule]) :: fst

GenFST.rule(fst, rule)

Define a transducing rule, adding it to the fst (which is just a plain graph).

A transducing rule is a list of String.t | {String.t, String.t}.

For example, rule fst, ["organiz", {"es", "e"}] means: output "organiz" verbatim and transform "es" into "e". If a finite state transducer built with this rule is fed the string "organizes", the output will be "organize".

https://github.com/xiamx/gen_fst
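
To make the rule semantics concrete, here is a deliberately tiny transducer written from scratch. It is a sketch under stated assumptions, not the gen_fst implementation: the module name `TinyFST`, its struct layout, and the helper names are all hypothetical. Rules share states when they start with the same segments, which is what makes the graph trie-like.

```elixir
defmodule TinyFST do
  # Hypothetical, simplified transducer for illustration only.
  # edges: %{state => [{input, output, next_state}]}; state 0 is the start.
  defstruct edges: %{}, finals: MapSet.new(), next_id: 1

  def new, do: %TinyFST{}

  # A rule is a list of segments: a plain string is copied verbatim,
  # an {input, output} pair rewrites `input` to `output`.
  def rule(fst, segments) do
    {fst, last} =
      Enum.reduce(segments, {fst, 0}, fn seg, {fst, state} ->
        {input, output} = if is_binary(seg), do: {seg, seg}, else: seg
        add_edge(fst, state, input, output)
      end)

    %{fst | finals: MapSet.put(fst.finals, last)}
  end

  # Reuse an identical edge if one exists, so common prefixes are shared.
  defp add_edge(fst, state, input, output) do
    existing = Map.get(fst.edges, state, [])

    case Enum.find(existing, fn {i, o, _} -> i == input and o == output end) do
      {_, _, target} ->
        {fst, target}

      nil ->
        target = fst.next_id
        edges = Map.put(fst.edges, state, [{input, output, target} | existing])
        {%{fst | edges: edges, next_id: target + 1}, target}
    end
  end

  # Returns the transduced string, or nil when no path accepts the word.
  def parse(fst, word), do: walk(fst, 0, word, "")

  defp walk(fst, state, "", acc) do
    if MapSet.member?(fst.finals, state), do: acc, else: nil
  end

  defp walk(fst, state, rest, acc) do
    fst.edges
    |> Map.get(state, [])
    |> Enum.find_value(fn {input, output, target} ->
      if String.starts_with?(rest, input) do
        walk(fst, target, String.replace_prefix(rest, input, ""), acc <> output)
      end
    end)
  end
end

fst =
  TinyFST.new()
  |> TinyFST.rule(["organiz", {"es", "e"}])
  |> TinyFST.rule(["organiz", {"ing", "e"}])

IO.inspect(TinyFST.parse(fst, "organizes"))
```

Both rules share the single "organiz" edge out of the start state, so adding a new suffix for an existing root only adds one edge and one state.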

GenFST.rule/2 example

require GenFST

# Each rule is a list; a rewrite step is an {input, output} tuple.
fst = GenFST.new
fst = fst |> GenFST.rule(["organiz", {"e", "e"}])
fst = fst |> GenFST.rule(["organiz", {"es", "e"}])
fst = fst |> GenFST.rule(["organiz", {"ing", "e"}])
# … other rules

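[Diagram: the resulting FST. From the start state (s), an edge labeled organiz : organiz leads to a state with three outgoing edges labeled e : e, es : e, and ing : e, each ending in an end state (e).]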

Difference between GenFST and Pattern Matching

| | GenFST | Pattern Matching |
|---|---|---|
| Data structure | Stores only data related to the lemmatization problem | Stores Elixir semantics in addition to lemmatization data |
| Matching algorithm | Traversal of a trie-like data structure (FST graph) | Binary pattern matching, highly optimized by the BEAM VM |
| Building time vs. number of lemmas | Linear scaling | Non-linear scaling |
| Lemmatization speed | Constant* | Constant* |

*assuming average word length << number of lemmas

Building FST at run-time vs at compile-time

defmodule Lemma.Benchmark.ParserDynamic do
  @moduledoc """
  Parser is generated at run-time with `new/0`.
  """
  import Lemma.MorphParserGenerator

  def new do
    fst =
      GenFST.new
      |> generate_rules(Lemma.En.Verbs.all, Lemma.En.Rules.verbs)

    IO.puts("Rules generated")
    fst
  end

  def parse(fst, word) do
    GenFST.parse(fst, word)
  end
end

defmodule Lemma.Benchmark.Parser do
  @moduledoc """
  Parser is generated at compile-time as a module attribute
  """
  import Lemma.MorphParserGenerator

  @fst GenFST.new
  @fst generate_rules(@fst, Lemma.En.Verbs.all, Lemma.En.Rules.verbs)
  IO.puts("Rules generated")

  def parse(word) do
    GenFST.parse(@fst, word)
  end
end
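
The same compile-time versus run-time distinction can be reproduced without gen_fst. In the sketch below a plain map stands in for the FST; the module names `CompileTimeTable` and `RunTimeTable` and the generated "word"/"lemma" entries are hypothetical placeholders.

```elixir
defmodule CompileTimeTable do
  # The expression after @table runs once, while the module is compiled.
  # Using @table inside a function embeds the finished map as a literal.
  @table Map.new(1..1000, fn n -> {"word#{n}", "lemma#{n}"} end)

  def lookup(word), do: Map.get(@table, word)
end

defmodule RunTimeTable do
  # Here the map is built on every call to new/0 and must be passed around.
  def new, do: Map.new(1..1000, fn n -> {"word#{n}", "lemma#{n}"} end)

  def lookup(table, word), do: Map.get(table, word)
end

IO.inspect(CompileTimeTable.lookup("word42"))

table = RunTimeTable.new()
IO.inspect(RunTimeTable.lookup(table, "word42"))
```

Both lookups return the same value; the difference is when the table is built and where it lives afterwards.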

Performance benchmark

Name                  ips      average    deviation    median
Compiled parser    16.25 K    61.55 us     13.87%      62.0 us
Dynamic parser      4.39 K   228.91 us     13.93%      250.00 us

Comparison:
Compiled parser    16.25 K
Dynamic parser      4.39 K - 3.70x slower

Why does building the FST at compile-time make run-time lemmatization faster?

When a structure is known at compile-time it is cached into a special cache area in the BEAM file, and when loaded those values are marked as do not need to garbage collect, and it can make a lot of other assumptions like not needing to copy them between processes and instead just pointing to it everywhere, which does indeed give quite a speed boost.

--OvermindDL1

Can we parallelize* lemmatization?

*parallelizing the dynamically loaded version

Naive parallelization using Task.async/await

paragraph_of_words
|> Enum.map(&(Task.async(fn -> Lemma.parse(fst, &1) end)))
|> Enum.map(&Task.await/1)

Naive parallelism doesn’t work

fst is a 500 MB graph data structure

In an 80-word sentence, fst gets copied 80 times, once for each task

We run out of memory very quickly!
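
One possible way around the copying, which is not from the talk, is to keep the large term in Erlang's `:persistent_term` storage (OTP 21.2 or later) and look it up inside each task, so the task closure captures only the word. The sketch below uses a plain map as a stand-in for the FST; the key `:demo_fst` and the generated entries are hypothetical.

```elixir
# Stand-in for the large FST: any big term works for this illustration.
big_table = Map.new(1..10_000, fn n -> {"word#{n}", "lemma#{n}"} end)

# Stored once; terms read with :persistent_term.get/1 are not copied
# onto the heap of the calling process.
:persistent_term.put(:demo_fst, big_table)

lemmas =
  ["word1", "word2", "word3"]
  |> Enum.map(fn word ->
    Task.async(fn ->
      # Looked up inside the task, so the closure carries only `word`.
      :persistent_term.get(:demo_fst) |> Map.get(word)
    end)
  end)
  |> Enum.map(&Task.await/1)

IO.inspect(lemmas)
```

Updating or erasing a persistent term is expensive, so this only suits a structure that is built once and then read many times.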

Recap

Elixir source code
  -> (Parsing) -> Elixir AST
  -> (Macro expansion) -> Expanded Elixir AST

Questions?

More on Elixir Compiler:

● https://medium.com/@fxn/how-does-elixir-compile-execute-code-c1b36c9ec8cf

● https://elixirforum.com/t/getting-each-stage-of-elixirs-compilation-all-the-way-to-the-beam-bytecode/1873

More on Lemma: https://github.com/xiamx/lemma

More on GenFST: https://github.com/xiamx/gen_fst

Date post:	07-Feb-2022