
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project...

Date post: 11-Apr-2017
Author: chris-fregly
Transcript
  • After Dark Real-time Advanced Analytics, Machine Learning, Graph Analytics, Text NLP, and Recommendations

    Barcelona Spark Meetup

    Oct 20th, 2015

    Chris Fregly, Principal Data Solutions Engineer

    IBM Spark Technology Center

    ** We're Hiring!! Nice People Only, Please. **

  • Who Am I?

    spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

    Streaming Data Engineer, Netflix Open Source Committer

    Data Solutions Engineer

    Apache Contributor

    Principal Data Solutions Engineer, IBM Spark Technology Center

    Meetup Organizer, Advanced Apache Spark Meetup

    Book Author, Advanced (2016)

  • Advanced Apache Spark Meetup

    Total Spark Experts: ~1350+ in 3 mos!

    #4 most active Spark Meetup in the world!

    Main Goals

    Dig deep into the Spark & extended-Spark codebase

    Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc.

    Surface and share the patterns and idioms of these well-designed, distributed, big data components

  • What is Spark?

    Core

    Spark Streaming: real-time

    Spark SQL: structured data

    MLlib: machine learning

    GraphX: graph analytics

    BlinkDB: approx queries

  • Spark Deployments In Production

  • Tools of the Talk

    Redis

    Docker

    Cassandra

    MLlib, GraphX

    Parquet, JSON

    Apache Zeppelin

    Spark Streaming, Kafka

    Spark SQL, DataFrames

    Spark JDBC/ODBC Hive ThriftServer

    ElasticSearch, Logstash, Kibana (ELK)

  • SMACK Stack!

    Spark (Data Processing)

    Mesos (Cluster Manager)

    Akka (Actors)

    Cassandra (NoSQL)

    Kafka (Streaming)

  • Themes of this Talk

    Parallelism

    Performance

    Streaming

    Approximations

    Similarity Measures

    Recommendations

  • Goals of Spark After Dark

    Generate high-quality recommendations

    Demonstrate Spark high-level libraries

    Spark Streaming -> Kafka, Approximations

    Spark SQL -> DataFrames, Cassandra

    GraphX -> PageRank, Shortest Path

    MLlib -> Matrix Factorization, Word2Vec

  • Popular Dating Sites

  • Parallelism

  • My First Experience With Parallelism

    Brady Bunch circa 1980

    Season 5, Episode 18: Two Petes in a Pod

  • Parallel Algorithm: O(log n)

  • Non-Parallel Algorithm: O(n)
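
    The O(log n) versus O(n) contrast on these two slides can be sketched with a pairwise (tree) reduction. This is an illustrative sketch only, not code from the talk: `tree_sum` and `sequential_sum` are hypothetical helpers, and the "rounds" count assumes enough workers to combine every pair of a round at the same time.

```python
def tree_sum(values):
    """Sum values by combining neighbours pairwise; returns (total, rounds)."""
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Every pair in a round is independent of the others, so with
        # enough workers a whole round costs one parallel step.
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return (level[0] if level else 0), rounds


def sequential_sum(values):
    """Sum values one at a time; returns (total, steps)."""
    total, steps = 0, 0
    for value in values:
        total += value
        steps += 1
    return total, steps


data = list(range(1, 17))     # 16 values
print(tree_sum(data))         # (136, 4): log2(16) = 4 rounds
print(sequential_sum(data))   # (136, 16): one step per value
```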

  • Spark is Parallel!

  • Performance

  • Spark Beats Hadoop @ 100 TB GraySort

    On-disk only

    28,000 partitions

    No in-memory caching

  • Improved Shuffle and Network Layer

    Sort-based shuffle

    Minimize OS resources

    Switched to async Netty

    Keep CPUs hot

    Reuse byte buffers to minimize GC

    Use epoll for I/O to stay in kernel space

  • Project Tungsten: CPU and Memory

    More JVM bytecode generation, JIT optimize

    CPU-cache-aware data structs and algos

    Custom memory management

    Serializers

    Performance

    New HashMap

  • DataFrames and Catalyst Optimizer

    https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

    Please Use DataFrames!

    JVM bytecode generation

  • Columnar Storage Format

    Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
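
    A minimal sketch of the min-max chunk skipping idea, assuming a sorted integer column split into fixed-size chunks. The `Chunk` class and `scan_equal` function are hypothetical names for illustration and are not Parquet or Spark APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    values: List[int]

    def __post_init__(self):
        # Min and max are computed once at write time and stored with the chunk.
        self.min = min(self.values)
        self.max = max(self.values)


def scan_equal(chunks, target):
    """Return (matching values, number of chunks actually read)."""
    hits, chunks_read = [], 0
    for chunk in chunks:
        if target < chunk.min or target > chunk.max:
            continue  # the stored range proves the chunk cannot contain target
        chunks_read += 1
        hits.extend(v for v in chunk.values if v == target)
    return hits, chunks_read


sorted_column = list(range(100))
chunks = [Chunk(sorted_column[i:i + 25]) for i in range(0, 100, 25)]
hits, chunks_read = scan_equal(chunks, 42)
print(hits, chunks_read)  # [42] 1: three of the four chunks are skipped
```

    On unsorted data every chunk tends to span the full value range, so few chunks can be skipped, which is why the slide restricts the heuristic to sorted data.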

  • Parquet File Format

    Based on Google Dremel

    Implemented by Twitter and Cloudera

    Columnar storage format

    Optimized for fast columnar aggregations

    Tight compression

    Supports pushdowns

    Nested, self-describing, evolving schema

  • Types of Compression

    Run Length Encoding: Repeated data

    Dictionary Encoding: Fixed set of values

    Delta, Prefix Encoding: Sorted data
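
    The three encodings named on this slide can be sketched in a few lines each. These are simplified illustrations written for this transcript, not the encoders Parquet actually ships; the function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = sorted(set(values))
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in values]


def delta_encode(values):
    """Store the first value, then only the difference to the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


print(run_length_encode(list("aaabbc")))
# [('a', 3), ('b', 2), ('c', 1)]
print(dictionary_encode(["ES", "FR", "ES", "US", "ES"]))
# (['ES', 'FR', 'US'], [0, 1, 0, 2, 0])
print(delta_encode([1000, 1003, 1004, 1010]))
# [1000, 3, 1, 6]
```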

  • Types of Query Optimizations

    Column, Partition Pruning

    Row, Predicate Pushdown

    SELECT b FROM table WHERE a IN (a2, a3)
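
    A toy illustration of how the query on this slide benefits from column pruning and predicate pushdown, assuming an in-memory columnar table with hypothetical columns a, b, and c. The `select` function is invented for this sketch; it is not how Catalyst is implemented.

```python
# Columnar layout: each column is stored (and can be read) on its own.
table = {
    "a": ["a1", "a2", "a3", "a4"],
    "b": ["b1", "b2", "b3", "b4"],
    "c": ["c1", "c2", "c3", "c4"],
}


def select(table, project, filter_column, allowed):
    """Evaluate SELECT project FROM table WHERE filter_column IN allowed."""
    # Column pruning: only the filter and projected columns are touched.
    columns_read = {filter_column, project}
    # Predicate pushdown: the filter runs at the storage layer, so only
    # matching row positions are materialized from the projected column.
    matching_rows = [i for i, v in enumerate(table[filter_column]) if v in allowed]
    result = [table[project][i] for i in matching_rows]
    return result, columns_read


result, columns_read = select(table, "b", "a", {"a2", "a3"})
print(result)                # ['b2', 'b3']
print(sorted(columns_read))  # ['a', 'b'], column c is never read
```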

  • Streaming

  • Direct Kafka Streaming

    KafkaRDD

    No single Receiver, no Write Ahead Log (WAL)

    Workers pull from Kafka in parallel

    Each KafkaRDD partition stores relevant offsets

    Upon Worker Node failure, rebuild from offsets

    Optimizes happy path by avoiding the WAL

    At least once delivery guarantee
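
    The recovery idea above (each partition remembers its offsets, so lost work can be re-read from the log) can be simulated without Kafka. In this sketch the log is a plain dictionary and `OffsetRange` and `read_range` are illustrative stand-ins, not the Spark or Kafka client APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRange:
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def read_range(log, offset_range):
    """Deterministically read the messages described by an offset range."""
    messages = log[(offset_range.topic, offset_range.partition)]
    return messages[offset_range.from_offset:offset_range.until_offset]


# Simulated durable log: (topic, partition) -> ordered messages.
log = {
    ("likes", 0): ["m0", "m1", "m2", "m3"],
    ("likes", 1): ["n0", "n1", "n2"],
}

# One micro-batch: each partition of the batch only stores its offsets.
batch = [OffsetRange("likes", 0, 1, 3), OffsetRange("likes", 1, 0, 2)]

first_attempt = [read_range(log, r) for r in batch]
# After a simulated worker failure the same ranges are simply read again.
replayed = [read_range(log, r) for r in batch]
print(first_attempt == replayed)  # True
```

    Because a replay re-delivers the same messages, downstream output can see them more than once unless writes are idempotent, which is consistent with the at-least-once guarantee stated on the slide.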

  • Approximations

  • Count Min Sketch

    Approximate counters

    Better than HashMap

    Low, fixed memory

    Known error bounds

    Large num of counters

    From Twitter's Algebird

    Streaming example in Spark codebase
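
    A minimal Count-Min Sketch, written from the textbook description rather than taken from Algebird. The depth and width defaults and the SHA-256 based row hashes are assumptions chosen for readability, not tuned parameters.

```python
import hashlib


class CountMinSketch:
    def __init__(self, depth=4, width=1024):
        self.depth, self.width = depth, width
        # Fixed memory: depth * width counters, however many keys are seen.
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One bucket per row, derived from a row-salted hash of the item.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest estimate and never undercounts.
        return min(self.table[row][col] for row, col in self._buckets(item))


sketch = CountMinSketch()
for _ in range(5):
    sketch.add("ali")
sketch.add("matei", 2)
print(sketch.estimate("ali"), sketch.estimate("matei"))
```

    The estimate is always at least the true count and at most the total number of additions, which is the "known error bounds" property in its simplest form.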

  • HyperLogLog

    Approximate cardinality

    Approx count distinct!

    From Twitter's Algebird!

    Low memory

    1.5KB @ 2% error, 10^9 elements!

    Streaming example in Spark codebase

    RDD: countApproxDistinctByKey()
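
    A compact HyperLogLog sketch following the standard published estimator, not the Algebird or Spark implementation. It assumes 2^11 = 2048 registers: at 6 bits each that is about 1.5 KB, with a standard error of roughly 1.04 / sqrt(2048), or about 2.3%, which lines up with the figures on the slide. The Python list used here does not pack registers into 6 bits.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=11):
        self.p = p
        self.m = 1 << p                               # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)    # bias constant, m >= 128

    def add(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        x = int.from_bytes(digest[:8], "big")          # 64-bit hash
        index = x >> (64 - self.p)                     # top p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")   # duplicates do not change the estimate
print(round(hll.count()))
```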

  • Monte Carlo Simulations

    From Manhattan Project (A-bomb)

    Simulate movement of neutrons

    Law of Large Numbers (LLN)

    Average of results of many trials

    Converge on expected value

    SparkPi example in Spark codebase

    Pi ~ (# red dots / # total dots * 4)
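
    The Pi formula above can be tried in plain, single-process Python. This is a sketch in the spirit of the SparkPi example, not that example itself: "red dots" are taken to be random points in the unit square that fall inside the quarter circle, and the seed is an arbitrary choice for repeatability.

```python
import random


def estimate_pi(num_samples, seed=0):
    """Estimate Pi as 4 * (points inside the quarter circle / total points)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:   # the dot landed inside the quarter circle
            inside += 1
    return 4.0 * inside / num_samples


# Law of Large Numbers: more trials should pull the estimate toward Pi.
for n in (100, 10_000, 200_000):
    print(n, estimate_pi(n))
```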

  • Recommendations

  • Interactive Demo!

  • Audience Participation Needed!

    Navigate to sparkafterdark.com

    Click 3 actresses and 3 actors

    [Diagram label: You are here]

  • Types of Recommendations

    Non-personalized

    Cold Start: No preference or behavior data for user, yet

    Personalized

    User-Item Similarity: Items that others with similar prefs have liked

    Item-Item Similarity: Items similar to your previously-liked items

  • Non-Personalized Recommendations

  • Summary Statistics and Aggregations

    Top Users by Like Count

    I might like users with the highest sum aggregation of likes overall.

    SparkSQL + DataFrame = Aggregations

  • Graph Analytics

    Top Influencers by Like Graph

    I might like users who have the highest probability of me liking them randomly while walking the like graph.

    GraphX: PageRank
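
    The random-walk intuition on this slide can be sketched as a small power-iteration PageRank. This is a plain-Python illustration, not GraphX's implementation; the like graph, the damping factor of 0.85, and the iteration count are all assumptions made up for the example.

```python
def pagerank(likes, damping=0.85, iterations=50):
    """likes maps each user to the list of users they liked."""
    nodes = sorted(set(likes) | {t for targets in likes.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Random jump: with probability (1 - damping) restart at a random user.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = likes.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A user with no outgoing likes spreads rank evenly.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical like graph: an edge means "liked".
likes = {
    "kimberly": ["holden"],
    "leslie": ["holden", "lisa"],
    "lisa": ["holden"],
    "holden": ["lisa"],
}
ranks = pagerank(likes)
print(max(ranks, key=ranks.get))  # holden
```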

  • Demo! Spark SQL/DataFrames + GraphX/PageRank

  • Similarities

  • Types of Similarity

    Euclidean: linear measure. Magnitude bias.

    Cosine: angle measure. Adjusts for magnitude bias.

    Jaccard: (intersection / union). Popularity bias.

    Log Likelihood: adjusts for popularity bias.

    [Like matrix. Columns: Ali, Matei, Reynold, Patrick, Andy. Rows: Kimberly (4 likes), Leslie (2 likes), Meredith (3 likes), Lisa (3 likes), Holden (5 likes).]
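
    Three of the measures above, sketched on binary like vectors. The vectors are hypothetical: the per-column positions of the likes in the slide's matrix are not recoverable, so `leslie` and `holden` below only reuse the names, and `scaled` is an invented vector used to show the magnitude effect.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    liked_a = {i for i, x in enumerate(a) if x}
    liked_b = {i for i, y in enumerate(b) if y}
    union = liked_a | liked_b
    return len(liked_a & liked_b) / len(union) if union else 0.0


leslie = [1, 1, 0, 0, 0]   # hypothetical like vector
holden = [1, 1, 1, 1, 1]   # liked everyone
scaled = [2, 2, 0, 0, 0]   # same direction as leslie, twice the magnitude

# Magnitude bias: Euclidean distance separates leslie from scaled,
# while cosine treats them as pointing the same way.
print(euclidean_distance(leslie, scaled))   # about 1.414
print(cosine_similarity(leslie, scaled))    # about 1.0
# Jaccard only looks at which items were liked: 2 shared out of 5 in the union.
print(jaccard_similarity(leslie, holden))   # 0.4
```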

  • All-Pairs Similarity Comparison

    Compare everything to everything

    a.k.a. pair-wise similarity or similarity join

    Naive shuffle: O(m*n^2); m=rows, n=cols

    Minimize shuffle through approximations!

    Reduce m (rows): Sampling and bucketing

    Reduce n (cols): Remove most frequent value (i.e. 0), Principal Component Analysis

  • Reduce m: DIMSUM Sampling

    Dimension Independent Matrix Square Using MR

    Remove rows with low similarity probability

    MLlib: RowMatrix.columnSimilarities()

    Twitter: 40% efficiency gain over Cosine Similarity

  • Reduce m: LSH Bucketing

    Locality Sensitive Hashing

    Split m into b buckets

    Use similarity hash algorithm

    Requires pre-processing of data

    Compare bucket contents in parallel

    Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets

    i.e. 500k x 500k matrix: O(1.25e17) -> O(1.25e13); b=50

    github.com/mrsqueeze/spark-hash
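
    The slide's numbers can be checked directly. This sketch only evaluates the two cost expressions exactly as written on the slide (left to right); it does not implement LSH, and the function names are made up for the check.

```python
def naive_cost(m, n):
    """All-pairs cost as written on the slide: m * n^2."""
    return m * n ** 2


def bucketed_cost(m, n, b):
    """Bucketed cost as written on the slide: m * n / b * b^2."""
    return m * n / b * b ** 2


m = n = 500_000   # 500k x 500k matrix
b = 50            # number of buckets

print(f"{naive_cost(m, n):.3g}")                               # 1.25e+17
print(f"{bucketed_cost(m, n, b):.3g}")                         # 1.25e+13
print(f"{naive_cost(m, n) / bucketed_cost(m, n, b):.0f}x fewer operations")  # 10000x
```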

  • Reduce n: Remove Most Frequent Value

    Eliminate most-frequent value

    Represent other values with (index,value) pairs

    Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz
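
    A short sketch of the (index,value) representation, assuming the most frequent value is 0 as in the earlier slide. `to_sparse` and `sparse_dot` are hypothetical helpers; the dot product shown is only correct when the dropped value is 0.

```python
def to_sparse(row, most_frequent=0):
    """Keep only the entries that differ from the most frequent value."""
    return [(i, v) for i, v in enumerate(row) if v != most_frequent]


def sparse_dot(a, b):
    """Dot product over sparse rows; assumes the dropped value was 0."""
    b_lookup = dict(b)
    # Work is proportional to the number of stored entries, not the row width.
    return sum(v * b_lookup[i] for i, v in a if i in b_lookup)


row1 = [0, 0, 1, 0, 0, 1, 0, 0]
row2 = [0, 0, 1, 0, 1, 0, 0, 0]

sparse1, sparse2 = to_sparse(row1), to_sparse(row2)
print(sparse1)                       # [(2, 1), (5, 1)]
print(sparse2)                       # [(2, 1), (4, 1)]
print(sparse_dot(sparse1, sparse2))  # 1
```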

  • Personalized Recommendations

  • Recommendation Terminology

    User: User seeking recommendations

    Item: Item that has been liked or rated

    Feedback

    Explicit: like, rating

    Implicit: search, click, hover, view, scroll

    Feature Engineering: Dimension reduction

  • Collaborative Filtering Personalized Recs

    Like behavior of similar users

    I like the same people that you like. What other people did you like that I haven't seen?

    MLlib: Matrix Factorization, User-Item Similarity
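
    The "what did you like that I haven't seen?" idea can be sketched with set overlap. This is a deliberately simple neighbourhood-style stand-in, not the matrix factorization (ALS) that MLlib provides and the following demo uses; the users and likes are invented.

```python
def recommend(likes, me):
    """Rank unseen items by how much their fans' tastes overlap with mine."""
    mine = likes[me]
    scores = {}
    for other, theirs in likes.items():
        if other == me:
            continue
        overlap = len(mine & theirs)   # how many likes we share
        if overlap == 0:
            continue                   # no shared taste, ignore this user
        for item in theirs - mine:     # things they liked that I have not seen
            scores[item] = scores.get(item, 0) + overlap
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))


likes = {
    "me": {"ali", "matei"},
    "you": {"ali", "matei", "reynold"},
    "them": {"ali", "patrick"},
    "other": {"andy"},
}
print(recommend(likes, "me"))  # ['reynold', 'patrick']
```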

  • Demo! Spark SQL/DataFrames + MLlib/Alternating Least Squares

  • Text-based Personalized Recs (1/3)

    Similar profiles to me

    Our profiles have similar, unique k-skip n-grams. We might like each other.

    MLlib: Word2Vec, TF/IDF, Doc Similarity

  • Text-based Personalized Recs (2/3)

    Similar profiles from my past likes

    Your profile shares a similar feature vector space to others that I've liked. I might like you.

    MLlib: Word2Vec, TF/IDF, Doc Similarity

  • Text-based Personalized Recs (3/3)

    Relevant, High-Value Emails

    Your initial email has similar named entities to my profile.

    I might like you just for making the effort.

    MLlib: Word2Vec, TF/IDF, Entity Recognition

    [Image labels: Her Email, My Profile]

  • The Future of Recommendations!

  • Facial Recognition

    Eigenfaces

    Your face looks similar to others that I've liked. I might like you.

    MLlib: RowMatrix, PCA, Item-Item Similarity

    Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

  • Natural Language Processing: Convo Bot

    NLP and DecisionTrees

    If your responses to my trite opening lines are positive, I may read your profile.

    MLlib: TF/IDF, DecisionTree, Sentiment Analysis

    [Image labels: Positive, Negative]

  • Maintaining the Spark!

  • Recommendations for Couples

    Pathways of Similarity

    I want Mad Max. You want Message In a Bottle. Let's find something in between to watch tonight.

    MLlib: RowMatrix, Item-Item Similarity

    GraphX: Nearest Neighbors, Shortest Path

    [Diagram label: similar plots]

  • Final Recommendation!

  • Get Off the Computer & Meet People!

    Thank you!!

    Chris Fregly @cfregly

    IBM Spark Tech Center

    San Francisco, CA, USA

    Relevant Links

    advancedspark.com: Sign up for the book and meetup!

    github.com/fluxcapacitor/pipeline: Clone all code used today!

    hub.docker.com/r/fluxcapacitor/pipeline: Run all demos presented today!

    Image courtesy of http://www.duchess-france.org/
