Date post: | 23-Jun-2015 |
Category: | Technology |
Upload: | chris-hausler |
Data @ Zendesk
Clojure, Cascalog, Hadoops and Datas
Web
Data
But…
● There is (too?) much of it
● It's spread out
● Optimised for other stuff
We has Data!
Lower barrier to entry for analytics
What we want from our data
Add value for our customers
Understandable & Concise
not
Open Source
What we want from our solution
Extensible
Customisable
Headphones
We settled on
(def cascalog "Pretty")

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

;; Emits one ?word tuple for each token in the line.
(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

;; Word count: read lines from `in`, split them into words,
;; count the occurrences of each word and write the pairs to `out`.
(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
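For readers without a Hadoop setup, the following is a minimal sketch of what the query above computes, written in plain Clojure over an in-memory collection. The names `split-line` and `word-count` are illustrative and not from the slides; the regex is the one from the slide.

```clojure
(require '[clojure.string :as s])

;; Same regex as the slide: split on brackets, backslashes,
;; parentheses, commas, periods and whitespace.
(defn split-line [line]
  (s/split line #"[\[\]\\\(\),.)\s]+"))

;; In-memory equivalent of the query: one word per token,
;; grouped by word, with a count per group.
(defn word-count [lines]
  (frequencies (mapcat split-line lines)))

(def example-counts
  (word-count ["the rain in spain" "the plain"]))
```

The Cascalog version expresses the same grouping implicitly: because `?word` is an output variable and `c/count` is an aggregator, the count is taken per distinct word.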
It was a journey, we learnt lots
● Taps & Sinks
● Group By, Aggregation & Filters
● Joins & Function Calls
Cascalog Basics in Gorilla
(def review-scores (repeatedly 5000 rand))

(defn grab-score [x] {:score [x]})

; BAD - stack overflow
(def combine-score (partial merge-with concat))
; BETTER - no stack overflow, but wait for GC
(def combine-score (partial merge-with (comp doall concat)))
; BEST - snappy fast
(def combine-score (partial merge-with into))

(defparallelagg bucket-scores
  :init-var #'grab-score
  :combine-var #'combine-score)

(defn median-scores [bucketed-scores]
  {:median-score (median (:score bucketed-scores))})

(??<- [?median-score]
      (review-scores :> ?score)
      (bucket-scores :< ?score :> ?bucketed-scores)
      (median-scores :< ?bucketed-scores :> ?median-score))
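The slides call a `median` function without showing it. A minimal sketch of one possible definition follows; it is an assumption, not the authors' implementation, and it returns the mean of the two middle values for even-sized inputs.

```clojure
;; Hypothetical helper: the slides use `median` but do not define it.
(defn median [xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        mid    (quot n 2)]
    (cond
      (zero? n) nil                       ; no scores, no median
      (odd? n)  (nth sorted mid)          ; single middle value
      :else     (/ (+ (nth sorted (dec mid))
                      (nth sorted mid))
                   2.0))))                ; mean of the two middle values
```

Sorting needs every score in one place, which is presumably why the aggregator first collects all scores into a single `:score` vector before `median-scores` runs.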
Learnings
Lazy sequences are not always your friend
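A minimal plain-Clojure sketch of the problem behind this learning, reusing `grab-score` from the slides. The names `combine-lazy`, `combine-eager`, `scores` and `eager-result`, and the input size, are illustrative. Each `concat` call wraps the previous result in another lazy sequence, so reducing over many values builds a deeply nested chain that is only unwound when the result is first read; `into` appends to a vector immediately and leaves nothing deferred.

```clojure
(defn grab-score [x] {:score [x]})

;; Lazy combiner: every merge adds one more layer of unrealised concat.
;; Realising the result of many such merges can overflow the stack.
(def combine-lazy (partial merge-with concat))

;; Eager combiner: every merge conj-es directly onto a vector.
(def combine-eager (partial merge-with into))

(def scores (range 100000))

;; Safe at this size: the :score value is a fully built vector.
(def eager-result
  (reduce combine-eager (map grab-score scores)))

;; Not run here: with default JVM stack settings, reading the :score
;; value of the following is expected to throw StackOverflowError.
;; (reduce combine-lazy (map grab-score scores))
```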
Midje for Testing. And why it’s good
The Result
Bonus!
Clojure from Python (for prettier graphs)