Clojure Reducers / clj-syd Aug 2012

Post on 01-Nov-2014

description

Talk given at the Sydney Clojure User group, August 2012

transcript

Reducers
A library and model for collection processing in Clojure

Leonardo Borges
@leonardo_borges
http://www.leonardoborges.com
http://www.thoughtworks.com

Thursday, 30 August 12

...in 20 mins or less

Reducers huh? Here’s the gist

You get parallel versions of reduce, map and filter

Ta-da! I’m done!

and well under my 20 min limit :)

Alright, alright I’m kidding

How do reducers make parallelism possible?

• JVM’s Fork/Join framework
• Reduction Transformers

Before we start - this is bleeding edge stuff

Java requirements

• Fork/Join framework
  • Java 7 [1] or
  • Java 6 + the JSR166 jar [2]

Clojure requirements

• 1.5.0-* (this is still MASTER on github [3] as of 30/08/2012)

[1] - http://jdk7.java.net/
[2] - http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166.jar
[3] - https://github.com/clojure/clojure

The Fork/Join Framework

• Based on divide and conquer
• Work stealing algorithm
• Uses deques - double ended queues
• Progressively divides the workload into tasks, up to a threshold
• Once a worker has finished one task, it pops another one from its deque
• After at least two tasks have finished, results can be combined/joined
• Idle workers can pop tasks from the deques of workers which fall behind
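
The steps above can be sketched directly against the JVM's Fork/Join classes. This is only a minimal sketch, assuming Java 7 or later: `fj-task`, `fj-sum`, `parallel-sum`, `pool` and the threshold of 100 are illustrative names and values, not something taken from the talk or from the reducers library.

```clojure
(import '(java.util.concurrent Callable ForkJoinPool ForkJoinTask))

;; Wrap a no-argument Clojure fn in a ForkJoinTask.
;; The ^Callable hint selects the adapt(Callable) overload.
(defn fj-task [^Callable f]
  (ForkJoinTask/adapt f))

;; Divide and conquer: halve the vector until a chunk is small enough,
;; reduce each chunk sequentially, then combine the partial sums.
;; Assumes threshold >= 1 and that it runs inside a ForkJoinPool.
(defn fj-sum [v threshold]
  (if (<= (count v) threshold)
    (reduce + 0 v)
    (let [mid        (quot (count v) 2)
          ;; fork: push the right half onto this worker's deque,
          ;; where an idle worker may steal it
          right-task (.fork ^ForkJoinTask
                            (fj-task #(fj-sum (subvec v mid) threshold)))
          ;; meanwhile, compute the left half on the current worker
          left-sum   (fj-sum (subvec v 0 mid) threshold)]
      ;; join: wait for the right half and combine both results
      (+ left-sum (.join ^ForkJoinTask right-task)))))

(def pool (ForkJoinPool.))

(defn parallel-sum [v threshold]
  (.invoke ^ForkJoinPool pool
           ^ForkJoinTask (fj-task #(fj-sum v threshold))))

(parallel-sum (vec (range 1000)) 100)
;; => 499500
```

The threshold controls how far the workload is halved: a larger value means fewer, bigger tasks and less combining overhead.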

Text is boring

Fork/Join algorithm - simplified view

(animated diagram with two workers; captions in order of appearance)

• Workload is put in “deques”
• ...and progressively halved
• ...up to a configured threshold
• Worker 1 and Worker 2 process their tasks and combine partial results
• Idle workers can “steal” items from other workers
• Final result

Let’s talk about Reducers

Motivations

• Performance
  • via less allocation
  • via parallelism (leverage Fork/Join)

Issues

• Lists and Seqs are sequential
• map / filter imply order

A closer look at what map does

• Recursion
• Order
• Laziness (not shown)
• Consumes List
• Builds List

;; a naive map implementation
(defn map [f coll]
  (if (seq coll)
    (cons (f (first coll)) (map f (rest coll)))
    '()))

Oh, and it also applies the function to each item before putting the result into the new list

This is what mapping means!
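
The “Laziness (not shown)” bullet can be illustrated with a small variation of the naive implementation. This is a sketch only: `lazy-map` is a made-up name (chosen so it does not shadow `clojure.core/map`), and the real `map` does more, such as handling chunked seqs.

```clojure
;; Same recursion, order and list building as the naive version,
;; but wrapped in lazy-seq so each cons cell is only built on demand.
(defn lazy-map [f coll]
  (lazy-seq
    (when-let [s (seq coll)]
      (cons (f (first s)) (lazy-map f (rest s))))))

;; Works on an infinite input because only three items are realised.
(take 3 (lazy-map inc (range)))
;; => (1 2 3)
```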

Reduction Transformers

• Idea is to build map / filter on top of reduce to break from sequentiality
• map / filter then builds nothing and consumes nothing
• It changes what reduce means to the collection by transforming the reducing functions

What map is really all about

(defn mapping [f]
  (fn [f1]
    (fn [result input]
      (f1 result (f input)))))
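
The same shape works for filter. The sketch below is an illustration rather than a slide from the talk: `filtering` is a name introduced here, and `mapping` is repeated so the block runs on its own.

```clojure
(defn mapping [f]
  (fn [f1]
    (fn [result input]
      (f1 result (f input)))))

;; Like mapping, but only passes an input on to the wrapped
;; reducing function f1 when the predicate holds.
(defn filtering [pred]
  (fn [f1]
    (fn [result input]
      (if (pred input)
        (f1 result input)
        result))))

;; Transformers compose by wrapping: keep the even inputs,
;; increment them, then add them up.
(reduce ((filtering even?) ((mapping inc) +)) 0 [1 2 3 4])
;; => 8
```

No intermediate collection is built: each input flows through the filter, the increment and the addition in a single pass.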

But wait! If map doesn’t consume the list any longer, who does?

• reduce does!
• Since Clojure 1.4 reduce lets the collection reduce itself (through the CollReduce / CollFold protocols)
• Think of what this means for tree-like structures such as vectors
• This is key to leveraging the Fork/Join framework
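
To make “the collection reduces itself” concrete, here is a small sketch of a custom type that implements `CollReduce` from `clojure.core.protocols`. The `Countdown` type is invented for illustration, and the sketch deliberately ignores early termination.

```clojure
(require '[clojure.core.protocols :as p])

;; A made-up collection representing n, n-1, ..., 1.
;; It never builds a list: it knows how to reduce itself.
(deftype Countdown [n]
  p/CollReduce
  ;; No initial value: seed with (f), which assumes f has a
  ;; zero-argument arity (as + and conj do).
  (coll-reduce [this f]
    (p/coll-reduce this f (f)))
  (coll-reduce [_ f init]
    (loop [acc init
           i   n]
      (if (pos? i)
        (recur (f acc i) (dec i))
        acc))))

;; reduce delegates to the collection's own coll-reduce.
(reduce + 0 (Countdown. 4))
;; => 10

(reduce conj [] (Countdown. 3))
;; => [3 2 1]
```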

Now we can use mapping to create reducing functions

(reduce ((mapping inc) +) 0 [1 2 3 4])
;; 14

;; ((mapping inc) +) is equivalent to:
(fn [result input] (+ result (inc input)))

(reduce ((mapping inc) conj) [] [1 2 3 4])
;; [2 3 4 5]

;; ((mapping inc) conj) is equivalent to:
(fn [result input] (conj result (inc input)))

But it feels awkward to use it in this form
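
For comparison, this is roughly how the same computations read with the reducers library, which hides the transformer plumbing behind `r/map` and `r/filter`. The sketch assumes Clojure 1.5 or later, where `clojure.core.reducers` is available.

```clojure
(require '[clojure.core.reducers :as r])

;; r/map returns a reducible "recipe", not a new collection;
;; the work happens when reduce (or into) consumes it.
(reduce + 0 (r/map inc [1 2 3 4]))
;; => 14

(into [] (r/map inc [1 2 3 4]))
;; => [2 3 4 5]

;; Transformations compose without intermediate collections.
(into [] (r/filter even? (r/map inc [1 2 3 4])))
;; => [2 4]
```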

What do we have so far?

• Performance has been improved due to fewer allocations
• No intermediary lists need to be built (see Haskell’s StreamFusion [4])
• However reduce is still sequential

[4] - http://bit.ly/streamFusion

Enter fold

• Takes the sequentiality out of foldl, foldr and reduce
• Potentially parallel (falls back to standard reduce otherwise)
• Reduce/Combine strategy (think Fork/Join Framework)
• Segments the collection
• Runs multiple reduces in parallel
• Uses a combining function to join/reduce results

(defn fold [combinef reducef coll] ...)
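
A small usage sketch of `r/fold` may help here, assuming Clojure 1.5 or later. The names `numbers` and `count-even` and the partition size of 100 are chosen for illustration only.

```clojure
(require '[clojure.core.reducers :as r])

(def numbers (vec (range 1000)))

;; Reducing function: runs sequentially inside each segment.
(defn count-even [acc x]
  (if (even? x) (inc acc) acc))

;; + is the combining function: (+) seeds each segment with 0
;; and (+ a b) joins the partial counts.
(r/fold + count-even numbers)
;; => 500

;; An optional first argument sets the segment size
;; below which fold stops splitting.
(r/fold 100 + count-even numbers)
;; => 500

;; A lazy seq is not foldable in parallel, so fold
;; falls back to a plain reduce.
(r/fold + count-even (range 1000))
;; => 500
```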

The combining function is a monoid

• An associative binary function with an identity element
• All the following functions are equivalent monoids

;; the built-in +
(+ 2 3) ; 5
(+)     ; 0

(defn my-+
  ([] 0)
  ([a b] (+ a b)))

(my-+ 2 3) ; 5
(my-+)     ; 0

(require '[clojure.core.reducers :as r])

(def my-+ (r/monoid + (fn [] 0)))

(my-+ 2 3) ; 5
(my-+)     ; 0
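
Monoids are not limited to numbers. As a sketch, vectors with `into` as the operation and the empty vector as the identity also qualify; `cat-vectors` and `evens` are names made up for this example.

```clojure
(require '[clojure.core.reducers :as r])

;; (cat-vectors)     => []   (identity element)
;; (cat-vectors a b) => a followed by b
(def cat-vectors (r/monoid into vector))

(cat-vectors)
;; => []

(cat-vectors [1 2] [3 4])
;; => [1 2 3 4]

;; Used as the combining function, it joins the per-segment vectors
;; left to right, so the original order is kept.
(def evens
  (r/fold cat-vectors conj (r/filter even? (vec (range 2000)))))

(= evens (vec (filter even? (range 2000))))
;; => true
```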

fold by examples

;; all examples assume the reducers library is available as r
(ns reducers-playground.core
  (:require [clojure.core.reducers :as r]))

fold by examples: increment all even positive integers up to 10 million and add them all up

;; these were taken from Rich’s reducers talk
(def my-vector (into [] (range 10000000)))

(time (reduce + (map inc (filter even? my-vector))))
;; 500 msecs

(time (reduce + (r/map inc (r/filter even? my-vector))))
;; 260 msecs

(time (r/fold + (r/map inc (r/filter even? my-vector))))
;; 130 msecs

fold by examples: standard word count

(def wiki-dump (slurp "subset-wiki-dump50")) ; 50 MB

(defn count-words [text]
  (reduce
    (fn [memo word]
      (assoc memo word (inc (get memo word 0))))
    {}
    (map #(.toLowerCase %) (into [] (re-seq #"\w+" text)))))

(time (count-words wiki-dump))
;; 45 secs

fold by examples: parallel word count

(def wiki-dump (slurp "subset-wiki-dump50")) ; 50 MB

(defn p-count-words [text]
  (r/fold
    ;; combining fn
    (r/monoid
      (partial merge-with +) ; will be called at the leaves to merge the partial computations
      hash-map)              ; will be called with no arguments to provide a seed value
    (fn [memo word]
      (assoc memo word (inc (get memo word 0))))
    (r/map #(.toLowerCase %) (into [] (re-seq #"\w+" text)))))

(time (p-count-words wiki-dump))
;; 30 secs
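
Without the Wikipedia dump at hand, the two word counts can still be compared on a small in-memory input. This is a sketch under assumptions: `sample-words`, `add-word` and `merge-counts` are names introduced here, and the input is tiny, so it says nothing about timings.

```clojure
(require '[clojure.core.reducers :as r])

;; 9000 words, enough for fold to split the vector into segments.
(def sample-words
  (into [] (re-seq #"\w+"
                   (apply str (repeat 1000 "the quick brown fox jumps over the lazy dog ")))))

;; Reducing function shared by both versions.
(defn add-word [memo word]
  (assoc memo word (inc (get memo word 0))))

;; Combining function: merges per-segment maps, seeded with an empty map.
(def merge-counts (r/monoid (partial merge-with +) hash-map))

(def sequential-counts (reduce add-word {} sample-words))
(def parallel-counts   (r/fold merge-counts add-word sample-words))

(= sequential-counts parallel-counts)
;; => true

(get parallel-counts "the")
;; => 2000
```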

fold by examples: Load 100k records into PostgreSQL

(import '(java.io BufferedReader FileReader))

(def records (into [] (line-seq (BufferedReader. (FileReader. "dump.txt")))))

(time
  (doseq [record records]
    (let [tokens (clojure.string/split record #"\t")]
      (insert users/users
        (values {:account-id (nth tokens 0)
                 ...})))))
;; 90 secs

fold by examples: Load 100k records into PostgreSQL in parallel

(time
  (r/fold +
    (r/map
      (fn [record]
        (let [tokens (clojure.string/split record #"\t")]
          (do
            (insert users/users
              (values {:account-id (nth tokens 0)
                       ...}))
            1)))
      records)))
;; 50 secs
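
The pattern in this example (do a side effect per record, return 1, and let fold add up the ones) can be tried without a database. In the sketch below an atom stands in for the table; `fake-table`, `insert-record!` and `sample-records` are hypothetical stand-ins, not the code used in the talk.

```clojure
(require '[clojure.core.reducers :as r]
         '[clojure.string :as str])

;; Stand-in for the users table.
(def fake-table (atom []))

;; Stand-in for the insert: stores one row and returns 1
;; so that fold can count the inserted records.
(defn insert-record! [record]
  (let [tokens (str/split record #"\t")]
    (swap! fake-table conj {:account-id (nth tokens 0)})
    1))

;; 2000 tab-separated lines, e.g. "42\tuser-42".
(def sample-records
  (mapv #(str % "\tuser-" %) (range 2000)))

;; + is both the reducing and the combining function here.
(def inserted-count
  (r/fold + (r/map insert-record! sample-records)))

inserted-count
;; => 2000

(count @fake-table)
;; => 2000
```

Rows arrive in no particular order, since segments run concurrently. Whether a real database layer tolerates inserts from several threads at once depends on how it manages connections.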

When to use it

• Exploring decision trees
• Image processing
• As a building block for bigger, distributed systems such as Datomic and Cascalog (maybe around parallel aggregators)
• Basically any list intensive program

But the tools are available to anyone so be creative!

Resources

• The Anatomy of a Reducer - http://bit.ly/anatomyReducers
• Rich’s announcement post on Reducers - http://bit.ly/reducersANN
• Rich Hickey - Reducers - EuroClojure 2012 - http://bit.ly/reducersVideo (this presentation was heavily inspired by this video)
• The Source on github - http://bit.ly/reducersCore

Thanks!

Questions?
