
Discover Elixir Compiler Internals

Date post: 07-Feb-2022
Page 1: Discover Elixir Compiler Internals

Discover Elixir Compiler Internals

By creating a Morphological Parser

@xiamx (Meng Xuan Xia)

Page 2: Discover Elixir Compiler Internals

About me

A fan of diversified research topics: Distributed System, Machine Learning, Natural Language Processing and Formal Verification.

Open Source contributor.

Blog: http://cs.mcgill.ca/~mxia3

Github: https://github.com/xiamx

Page 3: Discover Elixir Compiler Internals

Morpho*&^ical Parser ?

Page 4: Discover Elixir Compiler Internals

Morphological parsing and lemmatization

Want to build a chatbot? Need to match on text.

We use different forms of a word, such as: organize, organizes, and organizing

Cuz grammar!

Page 5: Discover Elixir Compiler Internals

Lemmatization gives words a common form

the boy's cars are different colors

the boy car be differ color

Page 6: Discover Elixir Compiler Internals

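Simple Lemmatizer in Elixir (string pattern matching)

# Match the shared stem as a literal prefix; `suffix` captures the rest.
def lemma("organiz" <> suffix) do
  # Every recognized inflection maps back to the lemma "organize".
  "organiz" <> case suffix do
    "e" -> "e"
    "es" -> "e"
    "ed" -> "e"
    "ing" -> "e"
  end
end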
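A minimal, self-contained sketch of how such a function could be wrapped in a module and called. The module name `TinyLemmatizer`, the guard-based formulation and the catch-all clause are assumptions added for illustration; they are not from the slides.

```elixir
defmodule TinyLemmatizer do
  # Hypothetical module name, used only for this illustration.
  # Binary pattern matching: "organiz" must be a literal prefix,
  # and `suffix` captures the rest of the word.
  def lemma("organiz" <> suffix) when suffix in ["e", "es", "ed", "ing"] do
    "organize"
  end

  # Assumed fallback: words we know nothing about are returned unchanged.
  def lemma(word) when is_binary(word), do: word
end

IO.inspect(Enum.map(["organizes", "organizing", "boy"], &TinyLemmatizer.lemma/1))
```

Without the fallback clause, an unknown word would raise a FunctionClauseError, which is one reason a real lemmatizer needs a clause for every known stem.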

Page 7: Discover Elixir Compiler Internals

Elixir source code
-> (Parsing) -> Elixir AST
-> (Macro expansion) -> Expanded Elixir AST
-> Erlang Abstract Format
-> Elixir_compiler_X BEAM assembly
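The first two stages of this pipeline can be observed from a script. The sketch below is only an illustration: the source string and the variable names are made up, and it uses `Code.string_to_quoted!/1` for parsing and `Macro.expand_once/2` for a single macro-expansion step.

```elixir
# Stage 1 (parsing): source text -> Elixir AST.
source = "if ready, do: :go"
ast = Code.string_to_quoted!(source)
IO.inspect(ast, label: "parsed")

# Stage 2 (macro expansion): `if` is a macro defined in Kernel.
# Expanding it once rewrites the node into a `case` expression.
expanded = Macro.expand_once(ast, __ENV__)
IO.inspect(expanded, label: "expanded")

# Macro.to_string/1 renders an AST back as source text for easier reading.
IO.puts(Macro.to_string(expanded))
```

The later stages (Erlang Abstract Format and BEAM assembly) are not shown here.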

Page 8: Discover Elixir Compiler Internals

How Elixir represents code internally (AST)

{:def, [context: Elixir, import: Kernel],
 [{:lemma, [context: Elixir],
   [{:<>, [context: Elixir, import: Kernel],
     ["organiz", {:suffix, [], Elixir}]}]},
  [do: {:<>, [context: Elixir, import: Kernel],
     ["organiz",
      {:case, [],
       [{:suffix, [], Elixir},
        [do: [{:->, [], [["e"], "e"]},
              {:->, [], [["es"], "e"]},
              {:->, [], [["ed"], "e"]},
              {:->, [], [["ing"], "e"]}]]]}]}]]}

def lemma("organiz" <> suffix) do
  "organiz" <> case suffix do
    "e" -> "e"
    "es" -> "e"
    "ed" -> "e"
    "ing" -> "e"
  end
end
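An AST like this can be produced and inspected with `quote`. The following is a small sketch on a single expression rather than the whole function; the exact metadata printed may differ between Elixir versions.

```elixir
# `quote` returns the AST of an expression instead of evaluating it.
ast = quote do: "organiz" <> suffix
IO.inspect(ast)

# Every non-literal node has the shape {name, metadata, arguments}.
{name, _metadata, [left, right]} = ast
IO.inspect(name, label: "operator")
IO.inspect(left, label: "left operand (a literal, stored as-is)")
IO.inspect(right, label: "right operand (a variable node)")

# Macro.to_string/1 converts the AST back into source text.
IO.puts(Macro.to_string(ast))
```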

Page 9: Discover Elixir Compiler Internals

Problem: we have 50,000 or more words

Solution:

Write a macro to generate 50,000 lemma/1 functions.

defmacro generate_lemma(roots) do
  # One quoted `def` clause per root; the macro returns the whole list.
  for root <- roots do
    quote do
      def lemma(unquote(root) <> suffix) do
        # Strip the inflection: the root itself is the lemma.
        unquote(root) <> case suffix do
          "s" -> ""
          "ed" -> ""
          "ing" -> ""
        end
      end
    end
  end
end
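To show how such a macro might be used, here is a self-contained sketch with two roots instead of 50,000. The module names `LemmaGen` and `TinyLemma` are hypothetical, and the body is simplified to a guard instead of the `case` from the slide.

```elixir
defmodule LemmaGen do
  # Hypothetical helper module holding the generator macro.
  defmacro generate_lemma(roots) do
    # `roots` arrives as AST; a literal list of strings is its own AST,
    # so it can be iterated directly at expansion time.
    for root <- roots do
      quote do
        def lemma(unquote(root) <> suffix) when suffix in ["", "s", "ed", "ing"] do
          unquote(root)
        end
      end
    end
  end
end

defmodule TinyLemma do
  require LemmaGen

  # Expands into one lemma/1 clause per root at compile time.
  LemmaGen.generate_lemma(["walk", "jump"])
end

IO.inspect(TinyLemma.lemma("walked"))
IO.inspect(TinyLemma.lemma("jumps"))
```

Each additional root adds another full function clause to the expanded module, which is what drives the compilation cost discussed next.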

Page 10: Discover Elixir Compiler Internals

Elixir source code
-> (Parsing) -> Elixir AST
-> (Macro expansion) -> Expanded Elixir AST
-> Erlang Abstract Format
-> Elixir_compiler_X BEAM assembly

Page 11: Discover Elixir Compiler Internals

Problem: AST explosion, compilation cost skyrockets

{:def, [context: Elixir, import: Kernel],
 [{:lemma, [context: Elixir],
   [{:<>, [context: Elixir, import: Kernel],
     ["organiz", {:suffix, [], Elixir}]}]},
  [do: {:<>, [context: Elixir, import: Kernel],
     ["organiz",
      {:case, [],
       [{:suffix, [], Elixir},
        [do: [{:->, [], [["e"], "e"]},
              {:->, [], [["es"], "e"]},
              {:->, [], [["ed"], "e"]},
              {:->, [], [["ing"], "e"]}]]]}]}]]}

Macro expansion generates this block 100 000 times
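A rough way to get a feel for the size of the expanded AST is to count the nodes of one generated clause and multiply. This is only a sketch: the node count is whatever `Macro.prewalk/3` visits, and the multiplier 50_000 is taken from the word count mentioned earlier, not a measurement of compile time.

```elixir
# Quote one clause of the kind the macro generates.
clause =
  quote do
    def lemma("organiz" <> suffix) do
      "organiz" <>
        case suffix do
          "e" -> "e"
          "es" -> "e"
          "ed" -> "e"
          "ing" -> "e"
        end
    end
  end

# Walk the AST and count every node visited.
{_ast, nodes_per_clause} =
  Macro.prewalk(clause, 0, fn node, count -> {node, count + 1} end)

IO.puts("AST nodes in one clause: #{nodes_per_clause}")
IO.puts("AST nodes for 50_000 such clauses: #{nodes_per_clause * 50_000}")
```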

Page 12: Discover Elixir Compiler Internals

Problem: AST explosion, compilation cost skyrockets

Page 13: Discover Elixir Compiler Internals

Alternative data structure: Finite State Transducer

[Figure: FST for the words Witch(-es) and Wizard(-s)]

Page 14: Discover Elixir Compiler Internals

How does one build an FST in Elixir?

Page 15: Discover Elixir Compiler Internals

Using gen_fst’s rule/2 DSL

@type fst_rule :: String.t | {String.t, String.t}

@spec rule(fst, [fst_rule]) :: fst

GenFST.rule(fst, rule)

Defines a transducing rule and adds it to the fst (which is just a plain graph).

A transducing rule is a list of String.t | {String.t, String.t}.

For example, rule fst, ["organiz", {"es", "e"}] outputs "organiz" verbatim and transforms "es" into "e". If a finite state transducer built with this rule is fed the string "organizes", the output is "organize".
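To make the rule semantics concrete, here is a toy transducer written from scratch. It is not GenFST's implementation: the module `TinyFST`, its nested-map representation and its return values are assumptions for illustration only. It assumes every `{input, output}` pair has a non-empty input, and when two rules share an edge the output of the first rule is kept.

```elixir
defmodule TinyFST do
  # A state is a map: grapheme => {output, next_state}, plus final: true
  # on states where a rule ends.
  def new, do: %{}

  # Adds one rule, e.g. ["organiz", {"es", "e"}].
  def rule(fst, segments) do
    insert(fst, Enum.flat_map(segments, &segment_pairs/1))
  end

  # {"es", "e"}: emit the output on the first grapheme, nothing afterwards.
  defp segment_pairs({input, output}) do
    [first | rest] = String.graphemes(input)
    [{first, output} | Enum.map(rest, &{&1, ""})]
  end

  # Plain string: every grapheme is copied verbatim.
  defp segment_pairs(text), do: text |> String.graphemes() |> Enum.map(&{&1, &1})

  defp insert(state, []), do: Map.put(state, :final, true)

  defp insert(state, [{char, out} | rest]) do
    # Reuse an existing edge if present (first rule wins on its output).
    {out, next} = Map.get(state, char, {out, %{}})
    Map.put(state, char, {out, insert(next, rest)})
  end

  # Feeds the word through the transducer, collecting edge outputs.
  def parse(fst, word), do: walk(fst, String.graphemes(word), [])

  defp walk(%{final: true}, [], acc), do: {:ok, acc |> Enum.reverse() |> Enum.join()}
  defp walk(_state, [], _acc), do: :error

  defp walk(state, [char | rest], acc) do
    case state do
      %{^char => {out, next}} -> walk(next, rest, [out | acc])
      _ -> :error
    end
  end
end

fst =
  TinyFST.new()
  |> TinyFST.rule(["organiz", {"e", "e"}])
  |> TinyFST.rule(["organiz", {"es", "e"}])
  |> TinyFST.rule(["organiz", {"ing", "e"}])

IO.inspect(TinyFST.parse(fst, "organizes"))
```

Because all three rules share the "organiz" prefix, they share the same path through the graph and only branch at the suffix, which is the property that keeps the structure compact compared with one function clause per word.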

Page 16: Discover Elixir Compiler Internals

GenFST.rule/2 example

require GenFST

fst = GenFST.new
fst = fst |> GenFST.rule(["organiz", {"e", "e"}])
fst = fst |> GenFST.rule(["organiz", {"es", "e"}])
fst = fst |> GenFST.rule(["organiz", {"ing", "e"}])
# … other rules

[Figure: the resulting FST. An edge labeled "organiz : organiz" is followed by edges labeled "e : e", "es : e" and "ing : e".]

Page 17: Discover Elixir Compiler Internals

Difference between GenFST and Pattern Matching

Data structure
  GenFST: stores only data related to the lemmatization problem
  Pattern Matching: stores Elixir semantics in addition to lemmatization data

Matching algorithm
  GenFST: traversal of a trie-like data structure (FST graph)
  Pattern Matching: uses the highly optimized BEAM VM binary pattern matching

Building time vs. number of lemmas
  GenFST: linear scaling
  Pattern Matching: non-linear scaling

Lemmatization speed*
  GenFST: constant
  Pattern Matching: constant

*assuming average word length << number of lemmas

Page 18: Discover Elixir Compiler Internals

Building the FST at run time vs. at compile time

defmodule Lemma.Benchmark.ParserDynamic do
  @moduledoc """
  Parser is generated at run-time with `new/0`.
  """

  import Lemma.MorphParserGenerator

  def new do
    fst =
      GenFST.new
      |> generate_rules(Lemma.En.Verbs.all, Lemma.En.Rules.verbs)

    IO.puts("Rules generated")
    fst
  end

  def parse(fst, word) do
    GenFST.parse(fst, word)
  end
end

defmodule Lemma.Benchmark.Parser do
  @moduledoc """
  Parser is generated at compile-time as a module attribute
  """

  import Lemma.MorphParserGenerator

  @fst GenFST.new
  @fst generate_rules(@fst, Lemma.En.Verbs.all, Lemma.En.Rules.verbs)
  IO.puts("Rules generated")

  def parse(word) do
    GenFST.parse(@fst, word)
  end
end

Page 19: Discover Elixir Compiler Internals

Performance benchmark

Name              ips        average     deviation   median
Compiled parser   16.25 K    61.55 μs    13.87%      62.0 μs
Dynamic parser    4.39 K     228.91 μs   13.93%      250.00 μs

Comparison:
Compiled parser   16.25 K
Dynamic parser    4.39 K - 3.70x slower

Page 20: Discover Elixir Compiler Internals

Why does building the FST at compile time make run-time lemmatization faster?

Page 21: Discover Elixir Compiler Internals

Elixir source code
-> (Parsing) -> Elixir AST
-> (Macro expansion) -> Expanded Elixir AST
-> Erlang Abstract Format
-> Elixir_compiler_X BEAM assembly

Page 22: Discover Elixir Compiler Internals

Why does building the FST at compile time make run-time lemmatization faster?

When a structure is known at compile-time it is cached into a special cache area in the BEAM file, and when loaded those values are marked as do not need to garbage collect, and it can make a lot of other assumptions like not needing to copy them between processes and instead just pointing to it everywhere, which does indeed give quite a speed boost.

--OvermindDL1
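The difference between the two approaches can be sketched with a small lookup table standing in for the FST. The module names `RuntimeTable` and `CompiledTable` and the table contents are made up for illustration; the comments restate the explanation quoted above rather than anything measured here.

```elixir
defmodule RuntimeTable do
  # Built each time new/0 is called; the resulting map is ordinary
  # process data and has to be passed around explicitly.
  def new, do: Map.new(1..1_000, &{&1, &1 * &1})

  def lookup(table, key), do: Map.fetch!(table, key)
end

defmodule CompiledTable do
  # Evaluated once, while this module is being compiled. The value is
  # embedded in the compiled module as a literal, so at run time
  # lookup/1 refers to that constant instead of rebuilding the map.
  @table Map.new(1..1_000, &{&1, &1 * &1})

  def lookup(key), do: Map.fetch!(@table, key)
end

table = RuntimeTable.new()
IO.inspect(RuntimeTable.lookup(table, 12))
IO.inspect(CompiledTable.lookup(12))
```

Both calls return the same value; what differs is where the table lives and when the cost of building it is paid.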

Page 23: Discover Elixir Compiler Internals

Can we parallelize* lemmatization?

*parallelizing the dynamically loaded version

Page 24: Discover Elixir Compiler Internals

Naive parallelization using Task.async/await

paragraph_of_words
|> Enum.map(fn word -> Task.async(fn -> Lemma.parse(fst, word) end) end)
|> Enum.map(&Task.await/1)

Page 25: Discover Elixir Compiler Internals

Naive parallelism doesn't work

fst is a 500 MB graph data structure

In an 80-word sentence, fst gets copied 80 times, once for each task

We run out of memory very quickly!
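One possible mitigation, not taken from the talk, is to bound how many tasks exist at the same time, so that only a few copies of the captured data are alive at once. The sketch below uses `Task.async_stream/3` with its `max_concurrency` option; the small map `table` is a hypothetical stand-in for the large fst, and `parse` stands in for the real parser call.

```elixir
# Stand-in for the large FST (an assumption for this sketch).
table = Map.new(["organizes", "organizing"], &{&1, "organize"})

# Stand-in for the parser; the closure captures `table`, so each task
# process that runs it receives its own copy of the captured data.
parse = fn word -> Map.get(table, word, word) end

words = ["organizes", "things", "organizing"]

lemmas =
  words
  # At most two tasks run at any moment, instead of one per word.
  |> Task.async_stream(parse, max_concurrency: 2)
  |> Enum.map(fn {:ok, lemma} -> lemma end)

IO.inspect(lemmas)
```

This limits peak memory rather than removing the copying; the compile-time parser shown earlier avoids passing the fst around altogether.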

Page 26: Discover Elixir Compiler Internals

Recap

Page 27: Discover Elixir Compiler Internals

Elixir source code
-> (Parsing) -> Elixir AST
-> (Macro expansion) -> Expanded Elixir AST
-> Erlang Abstract Format
-> Elixir_compiler_X BEAM assembly

Page 28: Discover Elixir Compiler Internals

Questions ?

More on Elixir Compiler:

● https://medium.com/@fxn/how-does-elixir-compile-execute-code-c1b36c9ec8cf

● https://elixirforum.com/t/getting-each-stage-of-elixirs-compilation-all-the-way-to-the-beam-bytecode/1873

More on Lemma: https://github.com/xiamx/lemma

More on GenFST: https://github.com/xiamx/gen_fst

