Date post: | 11-Apr-2017 |
Category: |
Software |
Author: | chris-fregly |
View: | 1,347 times |
Download: | 5 times |
Click to edit Master text styles
Click to edit Master text styles
After Dark Real-time Advanced Analytics, Machine Learning, Graph Analytics, Text NLP, and Recommendations
Barcelona Spark Meetup
Oct 20th, 2015
Chris FreglyPrincipal Data Solutions Engineer
IBM Spark Technology Center** Were Hiring!! Nice People Only, Please. **
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Who Am I?
2
Streaming Data EngineerNetflix Open Source Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions EngineerIBM Technology Center
Meetup OrganizerAdvanced Apache Meetup
Book AuthorAdvanced (2016)
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Advanced Apache Spark MeetupTotal Spark Experts: ~1350+ in 3 mos! #4 most active Spark Meetup in the world! Main Goals Dig deep into the Spark & extended-Spark codebase Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc
Surface and share the patterns and idioms of these well-designed, distributed, big data components
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark4
Core
Spark Streaming
real-time Spark SQL structured data
MLlib machine learning
GraphX graph
analytics
BlinkDB approx queries
What is Spark?
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Spark Deployments In Production
5
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Tools of the Talk
6
Redis Docker Cassandra MLlib, GraphX Parquet, JSON Apache Zeppelin Spark Streaming, Kafka Spark SQL, DataFrames Spark JDBC/ODBC Hive ThriftServer ElasticSearch, Logstash, Kibana (ELK)
and
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
SMACK Stack!
7
S park (Data Processing) M esos (Cluster Manager) A kka (Actors) C assandra (NoSQL) K afka (Streaming)
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Themes of this Talk
Parallelism Performance Streaming Approximations Similarity Measures Recommendations
8
and
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Goals of Spark After Dark Generate high-quality recommendations
Demonstrate Spark high-level libraries
Spark Streaming -> Kafka, Approximates
Spark SQL -> DataFrames, Cassandra
GraphX -> PageRank, Shortest Path
MLlib -> Matrix Factor, Word2Vec
9
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Popular Dating Sites
10
Click to edit Master text styles
Click to edit Master text stylesParallelism
11
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
My First Experience With ParallelismBrady Bunch circa 1980 Season 5, Episode 18: Two Petes in a Pod
12
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Parallel Algorithm: O(log n)
13
O(log n)
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Non-Parallel Algorithm: O(n)
14
O(n)
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Spark is Parallel!
15
Click to edit Master text styles
Click to edit Master text stylesPerformance
16
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Spark Beats Hadoop @ 100 TB GraySort
17
On-disk only 28,000 partitions No in-memory caching
(2014) (2013) (2014)
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Improved Shuffle and Network Layer Sort-based shuffle
Minimize OS resources
Switched to async Netty
Keep CPUs hot
Reuse byte buffers to minimize GC
Use epoll for I/O to stay in kernel space 18
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Project Tungsten: CPU and Memory More JVM bytecode generation, JIT optimize
CPU-cache-aware data structs and algos -->
Custom memory management Serializers Performance New HashMap
19
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
DataFrames and Catalyst Optimizer
20
20
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
Please Use DataFrames!
--> -->
JVM bytecode generation
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Columnar Storage Format
21
Skip whole chunks with min-max heuristicsstored in each chunk (sorted data only)
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Parquet File FormatBased on Google Dremel
Implemented by Twitter and Cloudera
Columnar storage format
Optimized for fast columnar aggregations
Tight compression
Supports pushdowns
Nested, self-describing, evolving schema22
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Types of Compression Run Length Encoding: Repeated data Dictionary Encoding: Fixed set of values
Delta, Prefix Encoding: Sorted data
23
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Types of Query Optimizations Column, Partition Pruning Row, Predicate Pushdown
SELECT b FROM table WHERE a in [a2,a3]
24
Click to edit Master text styles
Click to edit Master text stylesStreaming
25
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Direct Kafka Streaming KafkaRDD No single Receiver, no Write Ahead Log (WAL) Workers pull from Kafka in parallel Each KafkaRDD partition stores relevant offsets Upon Worker Node failure, rebuild from offsets Optimizes happy path by avoiding the WAL
26
At least once delivery guarantee
Click to edit Master text styles
Click to edit Master text stylesApproximations
27
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Count Min Sketch Approximate counters
Better than HashMap
Low, fixed memory Known error bounds Large num of counters From Twitters Algebird Streaming example in Spark codebase
28
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
HyperLogLog Approximate cardinality
Approx count distinct!
From Twitters Algebird!
Low memory
1.5KB @ 2% error, 10^9 elements !
Streaming example in Spark codebase
RDD: countApproxDistinctByKey()29
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Monte Carlo SimulationsFrom Manhattan Project (A-bomb) Simulate movement of neutrons
Law of Large Numbers (LLN) Average of results of many trials Converge on expected value
SparkPi example in Spark codebase
Pi ~ (# red dots /
# total dots * 4)
30
Click to edit Master text styles
Click to edit Master text stylesRecommendations
31
Click to edit Master text styles
Click to edit Master text stylesInteractive Demo!
32
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Audience Participation Needed!
33
Navigate to sparkafterdark.com
Click 3 actresses and 3 actors
->You are here
->
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Types of RecommendationsNon-personalized Cold Start No preference or behavior data for user, yetPersonalized User-Item Similarity Items that others with similar prefs have liked
Item-Item Similarity Items similar to your previously-liked items
34
Click to edit Master text styles
Click to edit Master text stylesNon-Personalized Recommendations
35
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Summary Statistics and Aggregations Top Users by Like Count
I might like users with the highest sum aggregation of likes overall.
SparkSQL + DataFrame = Aggregations
36
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Graph Analytics Top Influencers by Like Graph
I might like users who have the highest probability of me liking them randomly while walking the like graph.
GraphX: PageRank
37
Click to edit Master text styles
Click to edit Master text stylesDemo!Spark SQL/DataFrames + GraphX/PageRank
38
Click to edit Master text styles
Click to edit Master text stylesSimilarities
39
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Types of SimilarityEuclidean: linear measure Magnitude bias Cosine: angle measure Adjust for magnitude bias Jaccard: (intersection / union) Popularity bias Log Likelihood Adjust for popularity bias
40
Ali Matei Reynold Patrick AndyKimberly 1 1 1 1Leslie 1 1!Meredith 1 1 1Lisa 1 1 1Holden 1 1 1 1 1
z!
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
All-Pairs Similarity ComparisonCompare everything to everything aka. pair-wise similarity or similarity join Nave shuffle: O(m*n^2); m=rows, n=cols
Minimize shuffle through approximations! Reduce m (rows) Sampling and bucketing Reduce n (cols) Remove most frequent value (ie.0) Principle Component Analysis
41
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Reduce m: DIMSUM SamplingDimension Independent Matrix Square Using MR Remove rows with low similarity probability MLlib: RowMatrix.columnSimilarities()
Twitter: 40% efficiency gain over Cosine Similarity
42
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Reduce m: LSH BucketingLocality Sensitive Hashing Split m into b buckets Use similarity hash algorithm Requires pre-processing of data Compare bucket contents in parallel Converts O(m*n^2) -> O(m*n/b*b^2);
m=rows, n=cols, b=buckets
ie. 500k x 500k matrix
O(1.25e17) -> O(1.25e13); b=50
github.com/mrsqueeze/spark-hash43
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Reduce n: Remove Most Frequent ValueEliminate most-frequent valueRepresent other values with (index,value) pairsConverts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz
Click to edit Master text styles
Click to edit Master text stylesPersonalized Recommendations
45
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Recommendation TerminologyUser User seeking recommendations Item
Item that has been liked or rated Feedback Explicit: like, rating Implicit: search, click, hover, view, scroll Feature Engineering
Dimension reduction
46
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Collaborative Filtering Personalized Recs Like behavior of similar users
I like the same people that you like. What other people did you like that I havent seen? MLlib: Matrix Factorization, User-Item Similarity
47
Click to edit Master text styles
Click to edit Master text stylesDemo!Spark SQL/DataFrames + MLlib/Alternating Least Squares
48
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Text-based Personalized Recs (1/3) Similar profiles to meOur profiles have similar, unique k-skip n-grams. We might like each other. MLlib: Word2Vec, TF/IDF, Doc Similarity
49
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Text Based Personalized Recs (2/3)
50
Similar profiles from my past likesYour profile shares a similar feature vector space to others that Ive liked. I might like you. MLlib: Word2Vec, TF/IDF, Doc Similarity
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Text-based Personalized Recs (3/3) Relevant, High-Value Emails
Your initial email has similar named entities to my profile.
I might like you just for making the effort. MLlib: Word2Vec, TF/IDF, Entity Recognition
51
^ Her Email < My Profile
Click to edit Master text styles
Click to edit Master text stylesThe Future of Recommendations!
52
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Facial Recognition Eigenfaces
Your face looks similar to others that Ive liked. I might like you.
MLlib: RowMatrix, PCA, Item-Item Similarity
53 Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Natural Language Processing: Convo Bot NLP and DecisionTrees
If your responses to my trite opening lines are positive, I may read your profile. MLlib: TF/IDF, DecisionTree,
Sentiment Analysis
54
Positive Negative
Click to edit Master text styles
Click to edit Master text styles
55
Maintaining the Spark!
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Recommendations for Couples Pathways of Similarity
I want Mad Max. You want Message In a Bottle. Lets find something in between to watch tonight.
MLlib: RowMatrix, Item-Item Similarity GraphX: Nearest Neighbors, Shortest Path
similar similar plots ->
Click to edit Master text styles
Click to edit Master text stylesFinal Recommendation!
57
Click to edit Master text styles
Click to edit Master text styles
spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark
Get Off the Computer & Meet People! Thank you!!
Chris Fregly @cfregly IBM Spark Tech Center San Francisco, CA, USA
Relevant Links advancedspark.com
Signup for the book and meetup! github.com/fluxcapacitor/pipeline
Clone all code used today! hub.docker.com/r/fluxcapacitor/pipeline
Run all demos presented today!
58
Image courtesy of http://www.duchess-france.org/
Click to edit Master text styles
Click to edit Master text styles
Power of data. Simplicity of design. Speed of innovation.
IBM Spark