Date posted: 02-Jul-2015
Category: Data & Analytics
Clickstream & Social Media Analysis
Use cases and examples using Apache Spark
Michael Cutler @ TUMRA – November 2014
Hello
About Me
• Early adopter of Hadoop
• Spoke at Hadoop World on machine learning
• Twitter: @cotdp
TUMRA
We use Data Science and Big Data technology to help ecommerce companies understand their customers and increase sales.
This Talk
• Slides are on Slideshare
• Code examples are on Github
• Twitter: @tumra

1. Background
2. Introducing Apache Spark
3. Examples
1. Background
Clickstream & Social Media Analysis
Generalised Approach
[Diagram. Actors: You, People. Sources: Web Site, Mobile/Tablet App, Social Network. Stages: Data Collection, Data Processing, Reporting & Analysis. Data forms: Events, Files, Tables.]
How has this approach evolved?
Rapidly reducing the ‘time to insight’
pre-Historic Hadoop
• Proprietary & Expensive
• Slow & Constrained
• Time to Insight: 48+ hours
2008 - Hadoop
• Open-source & Inexpensive
• Flexible but complex to use
• Time to Insight: hours
2014 - Spark
• Batch, Streaming & Interactive
• Fast & Easy to use
• Time to Insight: minutes
Weaving a story from a string of activities
Understanding the shopper's journey
Timeline: Day #0, Day #7, Day #10, Day #13, Day #17
Events, in order, starting on Day #0:
• PPC long-tail keyword
• PPC brand keyword & signed up for email
• Opened email newsletter on iPad
• PPC brand keyword
• Add To Cart
• Order Placed
Shopper Journey
Understanding the shopper's journey
[Diagram. Axis: Time. Roles: Consumer, Shopper. Stages: Need, Research, Consideration, Purchase.]
It’s all about People & Products
Not just boring log files!
Activity & Interactions
Turn low-level events like “Page Views” into something meaningful
e.g. <Person1234> <viewed-a> <Product:Camera>
Bought a …
Gauging Interest
Measuring the degree of interest a Person has in a Product
e.g. are 10 views for a certain Product a good or bad thing?
Affinities
Either inferred from other people's activities, or from Product similarity
Properties
Both people and products have properties,
e.g. <Person1234> <is:gender> <Female>
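The idea of turning low-level events into person/verb/product statements and then gauging interest can be sketched in plain Scala. This is a minimal illustration, not code from the talk: the `Interaction` case class, the verb names other than `viewed-a`, and the weights are all assumptions chosen for the example.

```scala
// Hypothetical model: names and weights are illustrative, not from the talk.
final case class Interaction(person: String, verb: String, product: String)

object InterestScore {
  // Assumed weights: a purchase signals more interest than a page view.
  val weights: Map[String, Double] = Map(
    "viewed-a"      -> 1.0,
    "added-to-cart" -> 5.0,
    "bought-a"      -> 10.0
  )

  // Sum the weights of all interactions for each (person, product) pair.
  // Verbs without a known weight contribute nothing.
  def score(events: Seq[Interaction]): Map[(String, String), Double] =
    events
      .groupBy(e => (e.person, e.product))
      .map { case (key, es) =>
        key -> es.map(e => weights.getOrElse(e.verb, 0.0)).sum
      }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Interaction("Person1234", "viewed-a", "Product:Camera"),
      Interaction("Person1234", "viewed-a", "Product:Camera"),
      Interaction("Person1234", "added-to-cart", "Product:Camera"),
      Interaction("Person5678", "viewed-a", "Product:Camera")
    )
    score(events).foreach { case ((person, product), s) =>
      println(s"$person -> $product: $s")
    }
  }
}
```

A raw score like this does not by itself say whether 10 views is good or bad; it would need to be compared against what is typical for that product.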
People & Product Interactions
e.g. “Michael” “bought a” “Americano” “Starbucks, Shoreditch”
Source: Snowplow Analytics
That sounds like a Graph …
Use graphs to understand user intent
Interest Graph Visualisation
• Collect user activity data in real-time, not just clicks but mouse-overs, images, video, social.
• Algorithms identify products, categories and brands a particular person is interested in.
• Cluster users into ‘neighborhoods’ to infer what to show to existing and future visitors.
This visualization illustrates just 1% of 6 weeks of visitor activity data. Blue data points are People, Orange data points are Products.
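One simple way to group users into ‘neighborhoods’ is to compare the sets of products they interacted with. The sketch below uses Jaccard similarity over a person/product edge list. The talk does not say which algorithm was used, so the technique, the names and the sample data are all assumptions.

```scala
object InterestGraph {
  // An edge of the bipartite interest graph: (person, product).
  type Edge = (String, String)

  // Collect the set of products each person has interacted with.
  def productsByPerson(edges: Seq[Edge]): Map[String, Set[String]] =
    edges.groupBy(_._1).map { case (person, es) => person -> es.map(_._2).toSet }

  // Jaccard similarity: shared products divided by all products of either person.
  def jaccard(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else a.intersect(b).size.toDouble / a.union(b).size

  // Everyone whose similarity to `person` meets the threshold is a neighbor.
  def neighbors(edges: Seq[Edge], person: String, threshold: Double): Set[String] = {
    val byPerson = productsByPerson(edges)
    val mine = byPerson.getOrElse(person, Set.empty[String])
    byPerson.collect {
      case (other, products) if other != person && jaccard(mine, products) >= threshold =>
        other
    }.toSet
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample data.
    val edges = Seq(
      "alice" -> "camera", "alice" -> "lens",
      "bob"   -> "camera", "bob"   -> "lens", "bob" -> "tripod",
      "carol" -> "kettle"
    )
    println(neighbors(edges, "alice", 0.5))
  }
}
```

Products seen by a person's neighbors but not yet by that person are natural candidates for what to show next.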
2. Introducing Apache Spark
Three reasons Apache Spark is awesome!
Apart from “no more Java Map/Reduce code!!!”
Fast
• In-memory Caching
• DAG execution optimisation
• Easy to use in Scala, Java, Python
Smart
• Machine Learning baked in
• Graph algorithms
• Interactive Shell
Flexible
• Query from Spark SQL
• Streaming
• Batch (file based)
Apache Spark
Architecture Overview
[Diagram. Components: Apache ZooKeeper, Hadoop Filesystem (HDFS), Yarn / Mesos (optional).]
Apache Spark
Coexists with your existing Hadoop Infrastructure
[Diagram. Components: Apache ZooKeeper, Hadoop Filesystem (HDFS), Map / Reduce, Apache Hive etc., Yarn / Mesos.]
Apache Spark can …
Simple example of Spark SQL used from Scala
Go from a SQL query … to a trained machine learning model in three lines of code.
Source: Databricks
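The slide's own code is not reproduced here. As a rough sketch of the same idea, the following goes from a SQL query to a trained k-means model. It assumes Spark 2.2 or later (with MLlib) on the classpath and uses the later `SparkSession` API rather than the `SQLContext` API of the time. The table name, columns and sample data are invented for illustration.

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession

object SqlToModel {
  def trainModel(spark: SparkSession): KMeansModel = {
    import spark.implicits._

    // Made-up sample data: (visits, spend) per customer.
    Seq((1.0, 10.0), (2.0, 12.0), (20.0, 300.0), (22.0, 310.0))
      .toDF("visits", "spend")
      .createOrReplaceTempView("customers")

    // 1. Run a SQL query.
    val rows = spark.sql("SELECT visits, spend FROM customers")
    // 2. Turn each row into a feature vector.
    val features = rows.rdd
      .map(r => Vectors.dense(r.getDouble(0), r.getDouble(1)))
      .cache()
    // 3. Train a k-means model with k = 2 and at most 20 iterations.
    KMeans.train(features, 2, 20)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sql-to-model")
      .getOrCreate()
    trainModel(spark).clusterCenters.foreach(println)
    spark.stop()
  }
}
```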
3. Examples
Example Architecture
Coexists with your existing Hadoop Infrastructure
[Diagram. Components: Apache Kafka, Analytics Jobs, Apache ZooKeeper, Hadoop Filesystem (HDFS), NoSQL Store (Cassandra), Reporting Dashboard.]
Social Media Analysis
Converting a low-level event into a meaningful high-level interaction
• A user-interaction from the Facebook firehose, received as a real-time stream of JSON
• Streamed into Apache Kafka, also stored in SequenceFiles
• Modeled into Scala Case Class:
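A hypothetical case class for such an event might look like the sketch below. The field names are assumptions made for illustration; they are neither the real Facebook payload nor the case class used in the talk.

```scala
// Hypothetical shape of a social interaction. Field names are assumptions.
final case class SocialInteraction(
  userId: String,   // who performed the action
  action: String,   // e.g. "like", "comment", "share"
  targetId: String, // the page, post or product acted upon
  timestamp: Long   // epoch milliseconds
)

object SocialInteractionDemo {
  // Render the event as a high-level "<who> <did> <what>" statement.
  def describe(i: SocialInteraction): String =
    s"<${i.userId}> <${i.action}> <${i.targetId}>"

  def main(args: Array[String]): Unit = {
    val event = SocialInteraction("Person1234", "like", "Product:Camera", 1415000000000L)
    println(describe(event))
  }
}
```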
Example - Spark (Scala)
Using the Spark (Scala) interface to analyze the data
• Parse JSON
• Extract interesting attributes
• ‘Reduce by Key’ to sum the result
• Print results
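The steps above can be sketched as follows. This is not the talk's code: it assumes Spark 2.2 or later on the classpath, uses `SparkSession` rather than the API current in 2014, and assumes each JSON record carries a string field named `action`.

```scala
import org.apache.spark.sql.SparkSession

object ActionCounts {
  // Count events per action. The "action" field name is an assumption.
  def countByAction(spark: SparkSession, jsonLines: Seq[String]): Map[String, Int] = {
    import spark.implicits._
    spark.read.json(jsonLines.toDS())     // parse JSON
      .select("action")                   // extract the interesting attribute
      .rdd
      .map(row => (row.getString(0), 1))  // one (key, 1) pair per event
      .reduceByKey(_ + _)                 // 'reduce by key' to sum the result
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("action-counts")
      .getOrCreate()
    // Made-up sample events.
    val lines = Seq(
      """{"userId": "u1", "action": "like"}""",
      """{"userId": "u2", "action": "like"}""",
      """{"userId": "u1", "action": "share"}"""
    )
    countByAction(spark, lines).foreach(println) // print results
    spark.stop()
  }
}
```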
Example - Spark SQL
Using the Spark SQL interface to analyze the data
• Parse JSON
• Extract interesting attributes, transform into Case Classes
• ‘Register as table’
• Execute SQL, print results
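A minimal sketch of these steps, again assuming Spark 2.2 or later and the `SparkSession` API. The `InteractionRow` case class, the field names and the table name `interactions` are illustrative assumptions, and `createOrReplaceTempView` stands in for the ‘register as table’ step.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class. Field names are assumptions for this sketch.
final case class InteractionRow(userId: String, action: String)

object SqlCounts {
  def topActions(spark: SparkSession, jsonLines: Seq[String]): Seq[(String, Long)] = {
    import spark.implicits._

    // Parse JSON and transform the rows into case classes.
    val interactions = spark.read.json(jsonLines.toDS())
      .select("userId", "action")
      .as[InteractionRow]

    // 'Register as table' so the data can be queried with SQL.
    interactions.createOrReplaceTempView("interactions")

    // Execute SQL: count events per action, most frequent first.
    spark.sql(
      "SELECT action, COUNT(*) AS n FROM interactions GROUP BY action ORDER BY n DESC, action"
    ).collect().map(r => (r.getString(0), r.getLong(1))).toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sql-counts")
      .getOrCreate()
    // Made-up sample events.
    val lines = Seq(
      """{"userId": "u1", "action": "like"}""",
      """{"userId": "u2", "action": "like"}""",
      """{"userId": "u1", "action": "share"}"""
    )
    topActions(spark, lines).foreach(println) // print results
    spark.stop()
  }
}
```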
Want to play with awesome tech and data?
We’re hiring! [email protected]
Data Engineer
Scala, functional programming, Hadoop, NoSQL
Sales & Marketing
Experience with SaaS and ecommerce sales