Spec + onyx

Post on 21-Feb-2017

Spec + Onyx: an experience report

@sbelak simon@goopti.com

Onyx: a masterless, cloud scale, fault tolerant, high performance distributed computation system

… written entirely in Clojure

Onyx at

• In production for almost a year

• ETL

• online machine learning

• offline (batch) machine learning

• ad-hoc analysis

Self-service infrastructure for data scientists

1. Onyx at a glance

2. How Onyx rewired my brain

3. Building on top of spec

Onyx at a glance

Job = workflow + flow conditions + catalogue

Workflow:

[[:input :processing-1]
 [:input :processing-2]
 [:processing-1 :output-1]
 [:processing-2 :output-2]]

Flow conditions:

[{:flow/from :input-stream
  :flow/to [:process-adults]
  :flow/predicate :my.ns/adult?
  :flow/doc "Emits segment if an adult."}]

Catalogue:

[{:onyx/name :add-5
  :onyx/fn :my/adder
  :onyx/type :function
  :my/n 5
  :onyx/params [:my/n]
  :onyx/batch-size batch-size}

 {:onyx/name :in
  :onyx/plugin :onyx.plugin.core-async/input
  :onyx/type :input
  :onyx/medium :core.async
  :onyx/batch-size batch-size
  :onyx/max-peers 1
  :onyx/doc "Reads segments from a core.async channel"}

 {:onyx/name :out
  :onyx/plugin :onyx.plugin.core-async/output
  :onyx/type :output
  :onyx/medium :core.async
  :onyx/batch-size batch-size
  :onyx/max-peers 1
  :onyx/doc "Writes segments to a core.async channel"}]
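The flow condition above names a predicate, :my.ns/adult?, but the slides do not show its body. A minimal sketch of what it might look like follows. Onyx's documentation describes flow predicates as taking four arguments (the event map, the incoming segment, the segment the task produced, and all segments produced from it). The :age key and the threshold of 18 are assumptions made here for illustration.

```clojure
;; Hypothetical predicate for :flow/predicate :my.ns/adult?.
;; Only new-segment is inspected; the other arguments are ignored.
;; The :age key and the cutoff of 18 are assumptions, not from the talk.
(defn adult? [event old-segment new-segment all-new-segments]
  (>= (:age new-segment 0) 18))

;; The predicate is an ordinary function, so it can be called directly.
(adult? {} {} {:age 30} [])
(adult? {} {} {:age 12} [])
```

Segments for which the predicate returns a truthy value are routed to the tasks listed under :flow/to.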

Catalogue

Vanilla Clojure function:

(defn adder [n {:keys [x] :as segment}]
  (assoc segment :x (+ n x)))
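To make the link between the :add-5 catalogue entry and the adder function concrete, here is a small sketch in plain Clojure, with no Onyx dependency. The apply-task helper is hypothetical: it only imitates the idea that the keys listed under :onyx/params are looked up on the entry and passed ahead of the segment. It is not how Onyx itself is implemented.

```clojure
(defn adder [n {:keys [x] :as segment}]
  (assoc segment :x (+ n x)))

;; Catalogue entry from the slide, without the batch size.
(def add-5
  {:onyx/name   :add-5
   :onyx/fn     :my/adder
   :onyx/type   :function
   :my/n        5
   :onyx/params [:my/n]})

;; Hypothetical helper: resolve each key in :onyx/params against the
;; entry and call f with those values followed by the segment.
(defn apply-task [f entry segment]
  (apply f (concat (map entry (:onyx/params entry)) [segment])))

(apply-task adder add-5 {:x 1})
;; => {:x 6}
```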

Plugins (I/O): seq, async, Kafka, Datomic, SQL, …

parameter

self-documenting

Computation entirely described with data

Data is code!
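Because the workflow is an ordinary vector of edges, it can be inspected with the core sequence functions. The sketch below uses the workflow from the earlier slide. The downstream and sinks helpers are hypothetical names introduced here and are not part of Onyx.

```clojure
(def workflow
  [[:input :processing-1]
   [:input :processing-2]
   [:processing-1 :output-1]
   [:processing-2 :output-2]])

;; Map each task to the set of tasks it feeds.
(defn downstream [workflow]
  (reduce (fn [acc [from to]]
            (update acc from (fnil conj #{}) to))
          {}
          workflow))

;; Tasks that never appear as the source of an edge are the sinks.
(defn sinks [workflow]
  (let [sources (set (map first workflow))]
    (set (remove sources (map second workflow)))))

(downstream workflow)
;; => {:input #{:processing-1 :processing-2}
;;     :processing-1 #{:output-1}
;;     :processing-2 #{:output-2}}

(sinks workflow)
;; => #{:output-1 :output-2}
```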

Everything can be run locally!

Testing without mocking
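At the smallest scale, a task is a plain function over segments (maps), so it can be tested with clojure.test and literal data. This sketch covers only that function-level case, reusing the adder from the catalogue slide. Running a whole job locally, for example through the core.async plugins, is not shown here.

```clojure
(require '[clojure.test :refer [deftest is run-tests]])

(defn adder [n {:keys [x] :as segment}]
  (assoc segment :x (+ n x)))

;; Segments are plain maps, so the test needs no mocks or fixtures.
(deftest adder-test
  (is (= [{:x 5 :id 1} {:x 6 :id 2}]
         (map (partial adder 5) [{:x 0 :id 1} {:x 1 :id 2}]))))

(run-tests)
```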

How Onyx rewired my brain

It’s not about scaling, but clean architecture

My go-to architecture

[Diagram: DB, Events, Kafka, and three Onyx jobs]

Persist all messages to S3 (time travel!)

Decomplect everything

Computation graphs

Building on top of spec

Queryable data descriptions

• s/registry, s/form

• Build a graph (Datomic)
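A short sketch of what querying spec descriptions can look like, using s/registry and s/form from clojure.spec.alpha (the talk predates the rename, when the namespace was clojure.spec). The :event/... specs and the specs-in-ns helper are hypothetical examples, not taken from the talk. The Datomic graph step is not shown.

```clojure
(require '[clojure.spec.alpha :as s])

;; Hypothetical event specs, for illustration only.
(s/def :event/user-id pos-int?)
(s/def :event/amount number?)
(s/def :event/purchase (s/keys :req [:event/user-id :event/amount]))

;; s/form returns the spec definition as plain data (a list or symbol).
(s/form :event/purchase)

;; s/registry returns a map from spec name to spec, so it can be
;; filtered like any other map. Here: all keyword specs in one namespace.
(defn specs-in-ns [ns-name]
  (->> (keys (s/registry))
       (filter #(and (keyword? %) (= ns-name (namespace %))))
       set))

(specs-in-ns "event")
```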

Interact with your type system!

Code is data!

Case study: autogenerating materialised views

[Diagram: Events, External data, Kafka, Materialised views]

Automatic view generation

• Event & attribute ontology
  • Manual (via spec)
  • Inferred

• Statistical analysis (seasonality detection, outlier removal, …)

[Diagram: three Onyx jobs]

Automatic view generation

1. Walk spec registry

2. Apply rules

   1. Define new view (spec)

   2. Trigger Onyx job that creates the view
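The steps above can be sketched in plain Clojure. Everything specific here is an assumption for illustration: the :event/... specs, the single rule ("every required numeric attribute of an event gets a sum view"), and the :view/... keys. The talk does not show its actual rules, and the final step of triggering an Onyx job is left out.

```clojure
(require '[clojure.spec.alpha :as s])

;; Hypothetical event specs.
(s/def :event/user-id pos-int?)
(s/def :event/amount number?)
(s/def :event/purchase (s/keys :req [:event/user-id :event/amount]))

;; Pull the :req vector out of a form like (clojure.spec.alpha/keys :req [...]).
(defn required-keys [form]
  (when (and (seq? form) (= 'clojure.spec.alpha/keys (first form)))
    (:req (apply hash-map (rest form)))))

;; Hypothetical rule condition: the attribute is specced as number?.
(defn numeric-attr? [k]
  (and (keyword? k)
       (some? (s/get-spec k))
       (= 'clojure.core/number? (s/form k))))

;; 1. Walk the registry (restricted to one namespace), 2. apply the rule,
;; and describe each resulting view as data.
(defn derive-views [ns-name]
  (for [k (keys (s/registry))
        :when (and (keyword? k) (= ns-name (namespace k)))
        attr (required-keys (s/form k))
        :when (numeric-attr? attr)]
    {:view/name      (keyword "view" (str (name k) "-" (name attr) "-sum"))
     :view/source    k
     :view/attribute attr
     :view/aggregate :sum}))

(derive-views "event")
```

Each resulting map is a data description of a view. In the setup the talk describes, such a description would presumably be turned into a spec and an Onyx job, which this sketch does not attempt.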

Code is data or data is code?

Takeaways

Onyx is production ready

Everything should be live and interactive

Computation graphs are a great way to structure data processing code

Queryable data and computation descriptions supercharge interactive development and are a great building block for automation

Questions

@sbelak

simon@goopti.com