Date post: | 21-Feb-2017 |
Category: | Data & Analytics |
Upload: | simon-belak |
Onyx: a masterless, cloud scale, fault tolerant, high performance distributed computation system
… written entirely in Clojure
Onyx at
• In production for almost a year
• ETL
• online machine learning
• offline (batch) machine learning
• ad-hoc analysis
Self-service infrastructure for data scientists
1. Onyx at a glance
2. How Onyx rewired my brain
3. Building on top of spec
Onyx at a glance
Job = workflow + flow conditions + catalogue
Workflow
[[:input :processing-1]
 [:input :processing-2]
 [:processing-1 :output-1]
 [:processing-2 :output-2]]
Flow conditions
[{:flow/from :input-stream
  :flow/to [:process-adults]
  :flow/predicate :my.ns/adult?
  :flow/doc "Emits segment if an adult."}]
Catalogue
[{:onyx/name :add-5
  :onyx/fn :my/adder
  :onyx/type :function
  :my/n 5
  :onyx/params [:my/n]
  :onyx/batch-size batch-size}
 {:onyx/name :in
  :onyx/plugin :onyx.plugin.core-async/input
  :onyx/type :input
  :onyx/medium :core.async
  :onyx/batch-size batch-size
  :onyx/max-peers 1
  :onyx/doc "Reads segments from a core.async channel"}
 {:onyx/name :out
  :onyx/plugin :onyx.plugin.core-async/output
  :onyx/type :output
  :onyx/medium :core.async
  :onyx/batch-size batch-size
  :onyx/max-peers 1
  :onyx/doc "Writes segments to a core.async channel"}]
Catalogue, annotated
• :onyx/fn names a vanilla Clojure function:
(defn adder [n {:keys [x] :as segment}]
  (assoc segment :x (+ n x)))
• Plugins (I/O): seq, async, Kafka, Datomic, SQL, …
• Parameters: :my/n 5, listed in :onyx/params
• Self-documenting: :onyx/doc
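The parameter annotation on the catalogue entry can be made concrete with a small sketch. It assumes the behaviour suggested by the slide: the keys listed under :onyx/params are looked up in the entry and their values are passed to the function ahead of the segment. The helper bind-params is hypothetical and is not part of the Onyx API; the batch size is an arbitrary value.

```clojure
(defn adder
  "The vanilla function from the slide: adds n to the :x value of a segment."
  [n {:keys [x] :as segment}]
  (assoc segment :x (+ n x)))

(def add-5-entry
  {:onyx/name :add-5
   :onyx/fn :my/adder
   :onyx/type :function
   :my/n 5
   :onyx/params [:my/n]
   :onyx/batch-size 10}) ; arbitrary value chosen for this sketch

(defn bind-params
  "Hypothetical helper (not Onyx API): looks up every key listed under
  :onyx/params in the entry and partially applies f to those values."
  [entry f]
  (apply partial f (map entry (:onyx/params entry))))

(def add-5 (bind-params add-5-entry adder))

(add-5 {:x 1}) ;; => {:x 6}
```

Because the parameter lives in the entry rather than in the function, changing :my/n changes the behaviour of the task without touching any code.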
Computation entirely described with data
data is code!
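Since a job is plain data, it can be inspected with ordinary Clojure functions before anything is submitted. The following is a minimal sketch under assumptions: the catalogue entries are trimmed down to two keys, and the helpers workflow-tasks, missing-entries and downstream are hypothetical names invented for illustration, not Onyx functions.

```clojure
(require '[clojure.set :as set])

(def workflow
  [[:in :add-5]
   [:add-5 :out]])

;; Trimmed-down entries; a real catalogue carries more keys per task.
(def catalogue
  [{:onyx/name :in    :onyx/type :input}
   {:onyx/name :add-5 :onyx/type :function}
   {:onyx/name :out   :onyx/type :output}])

(def job {:workflow workflow :catalog catalogue})

(defn workflow-tasks
  "All task names mentioned anywhere in the workflow edges."
  [workflow]
  (set (mapcat identity workflow)))

(defn missing-entries
  "Tasks used in the workflow that have no catalogue entry."
  [{:keys [workflow catalog]}]
  (set/difference (workflow-tasks workflow)
                  (set (map :onyx/name catalog))))

(defn downstream
  "Tasks that receive segments directly from the given task."
  [workflow task]
  (set (for [[from to] workflow :when (= from task)] to)))

(missing-entries job)    ;; => #{}
(downstream workflow :in) ;; => #{:add-5}
```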
Everything can be run locally!
Testing without mocking
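One way to read "testing without mocking": task functions are plain functions from segment to segment, so they can be exercised with clojure.test and literal maps. This sketch tests only the adder function from the catalogue slide; it does not start an Onyx environment, and the test names are made up for illustration.

```clojure
(require '[clojure.test :refer [deftest is run-tests]])

(defn adder [n {:keys [x] :as segment}]
  (assoc segment :x (+ n x)))

(deftest adder-test
  ;; A segment is just a map, so no fixtures or mocks are needed.
  (is (= {:x 6} (adder 5 {:x 1})))
  ;; Keys other than :x pass through untouched.
  (is (= {:x 6 :id 42} (adder 5 {:x 1 :id 42}))))

(deftest batch-test
  ;; A batch of segments is an ordinary sequence.
  (is (= [{:x 5} {:x 6} {:x 7}]
         (map (partial adder 5) [{:x 0} {:x 1} {:x 2}]))))

(run-tests)
```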
How Onyx rewired my brain
It’s not about scaling, but clean architecture
My go-to architecture
[Diagram labels: DB, Events, Kafka, Onyx, Onyx, Onyx]
Persist all messages to S3
(time travel!)
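The time travel idea can be sketched in a few lines: if every message is kept, the state at any past moment is a fold over the messages up to that moment. The log below is an in-memory stand-in for messages read back from S3; the event shapes and the names message-log, apply-event and state-at are all hypothetical.

```clojure
(def message-log
  "Hypothetical persisted messages, oldest first."
  [{:ts 1 :type :signup   :user "ana"}
   {:ts 2 :type :purchase :user "ana" :amount 30}
   {:ts 3 :type :purchase :user "ana" :amount 12}])

(defn apply-event
  "Folds a single message into the accumulated state."
  [state {:keys [type user amount]}]
  (case type
    :signup   (assoc state user {:total 0})
    :purchase (update-in state [user :total] + amount)
    state)) ; unknown message types leave the state unchanged

(defn state-at
  "Rebuilds the state as it was at timestamp ts by replaying the log."
  [log ts]
  (reduce apply-event {} (take-while #(<= (:ts %) ts) log)))

(state-at message-log 2) ;; => {"ana" {:total 30}}
(state-at message-log 3) ;; => {"ana" {:total 42}}
```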
Decomplect everything
Computation graphs
Building on top of spec
Queryable data descriptions
• s/registry, s/form
• Build a graph (Datomic)
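A minimal sketch of querying spec descriptions, assuming Clojure 1.9 or later, where the library lives in clojure.spec.alpha (at the time of the talk the namespace was clojure.spec). The :event/... specs and the helpers referenced-keys and spec-graph are made up for illustration, and the graph is held in a plain map rather than in Datomic.

```clojure
(require '[clojure.spec.alpha :as s])

;; Hypothetical specs for illustration.
(s/def :event/user-id pos-int?)
(s/def :event/amount number?)
(s/def :event/purchase (s/keys :req [:event/user-id :event/amount]))

;; s/form returns the spec as data.
(s/form :event/purchase)
;; => (clojure.spec.alpha/keys :req [:event/user-id :event/amount])

(defn referenced-keys
  "Namespaced keywords that appear anywhere in a spec form."
  [form]
  (->> (tree-seq coll? seq form)
       (filter qualified-keyword?)
       set))

(defn spec-graph
  "Map of spec name -> set of spec names it refers to, restricted to
  keyword specs in the given namespace."
  [ns-name]
  (into {}
        (for [[k spec] (s/registry)
              :when (and (keyword? k) (= ns-name (namespace k)))]
          [k (referenced-keys (s/form spec))])))

(spec-graph "event")
;; => {:event/user-id #{}, :event/amount #{},
;;     :event/purchase #{:event/user-id :event/amount}}
```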
Interact with your type system!
code is data!
Case study: autogenerating materialised views
[Diagram labels: Events, External data, Kafka, Onyx, Onyx, Onyx, Automatic view generation, Materialised views]
Automatic view generation
• Event & attribute ontology
  • Manual (via spec)
  • Inferred
• Statistical analysis (seasonality detection, outlier removal, …)
Automatic view generation
1. Walk spec registry
2. Apply rules
   1. Define new view (spec)
   2. Trigger Onyx job that creates the view
⤾
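The loop above can be sketched roughly as follows. Everything here is an assumption made for illustration: the rule format, the view description, and trigger-job!, which only records a job map in an atom instead of submitting it to Onyx. Registering the new view as a spec, which would close the loop, is left out.

```clojure
(require '[clojure.spec.alpha :as s])

;; Hypothetical event specs.
(s/def :event/user-id pos-int?)
(s/def :event/amount number?)

(def rules
  "Hypothetical rules: a predicate on the spec form plus the kind of view
  to generate when it matches."
  [{:applies? (fn [form] (= form 'clojure.core/number?))
    :view :sum}])

(defn proposed-views
  "Steps 1 and 2: walk the registry and apply every rule to every spec
  in the given namespace."
  [ns-name]
  (for [[k spec] (s/registry)
        :when (and (keyword? k) (= ns-name (namespace k)))
        {:keys [applies? view]} rules
        :when (applies? (s/form spec))]
    {:view/name   (keyword "view" (str (name k) "-" (name view)))
     :view/source k
     :view/kind   view}))

(def submitted-jobs (atom []))

(defn trigger-job!
  "Stand-in for job submission: records the job data that would create
  the view."
  [{:view/keys [name]}]
  (swap! submitted-jobs conj {:workflow [[:in name] [name :out]]}))

(run! trigger-job! (proposed-views "event"))

@submitted-jobs
;; => [{:workflow [[:in :view/amount-sum] [:view/amount-sum :out]]}]
```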
Code is data or
data is code?
Takeaways
Onyx is production ready
Everything should be live and interactive
Computation graphs are a great way to structure data processing code
Queryable data and computation descriptions supercharge interactive development and are a great building block for automation
viebel.github.io/klipse/examples/onyx.html
onyxplatform.org
onyxplatform.org/jekyll/update/2017/02/08/Pyroclast-Preview-Simulation.html