Post on 11-Apr-2017
transcript
Many definitions
Very large volume with low density of information
Three V’s: Velocity Variety Volume
Social interactions and web activity data
Amongst many others...
GB TB PB
Compute Storage Big Data
Unconstrained data growth
95% of the 1.2 zettabytes of data in the digital universe is unstructured
70% of this is user-generated content
Unstructured data growth is explosive, with an estimated compound annual growth rate (CAGR) of 62% from 2008 to 2012.
Source: IDC
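To put the 62% CAGR figure in perspective, the compounding can be worked out directly. This is only a sketch: treating 2008 to 2012 as four annual compounding periods is an assumption, and `growth_multiple` is a hypothetical helper name.

```python
def growth_multiple(cagr: float, periods: int) -> float:
    """Total growth factor after compounding `cagr` once per period."""
    return (1.0 + cagr) ** periods

# Assumption: 2008 -> 2012 counts as 4 annual compounding periods.
multiple = growth_multiple(0.62, 4)
print(f"Growth over the period: about {multiple:.1f}x")
```

Under that assumption, unstructured data volume grows roughly 6.9-fold over the period.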
Web sites: blogs, reviews, emails, pictures
Social graphs: Facebook, LinkedIn, contacts
Application server logs: web sites, games
Sensor data: weather, water, smart grids
Images/videos: traffic, security cameras
Twitter: 50m tweets/day, 1,400% growth per year
Where does it come from?
Why now?
Mobile connected world (more people using, easier to collect)
More aspects of data (variety, depth, location, frequency)
Possible to understand (not just answer specific questions)
http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/
Data has gravity
…and inertia at volume…
…easier to move applications to the data
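A rough calculation shows why data has inertia at volume. The sketch below estimates how long a bulk transfer would take; the 1 Gbit/s link, the assumption that it is fully and continuously used, and the helper name `transfer_days` are all illustrative, not figures from the talk.

```python
def transfer_days(size_bytes: float, link_bits_per_sec: float) -> float:
    """Days needed to move `size_bytes` over a link at full, sustained throughput."""
    seconds = size_bytes * 8 / link_bits_per_sec
    return seconds / 86_400  # seconds per day

PB = 10**15    # decimal petabyte, in bytes
GBPS = 10**9   # hypothetical 1 Gbit/s link, assumed fully utilized

print(f"1 PB over 1 Gbit/s: about {transfer_days(PB, GBPS):.0f} days")
```

Under these assumptions a single petabyte needs about three months on the wire, whereas application code is typically small enough to move in seconds.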
Personal: Very large dataset seeks strong & consistent compute for short-term relationship, possibly longer. GSOH a plus. aws.amazon.com
Bring compute capacity to the data
Who is your customer really?
What do people really like?
What is happening socially with your products?
How do people really use your products?
Lesson 1: don’t leave your Amazon account logged in at home
Lesson 2: use the data you have to drive proactive marketing
Elastic MapReduce
Code
Name node
Output: S3 + SimpleDB
S3 + DynamoDB
Elastic cluster
HDFS
Queries + BI via JDBC, Pig, Hive
Input data
Very large click log (e.g. TBs)
Lots of actions by John Smith
Split the log into many small pieces
Process in an EMR cluster
Aggregate the results from all the nodes
What John Smith did
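The split, process and aggregate flow for the click log can be sketched in a few lines. This is a local, single-process illustration of the map/reduce pattern, not EMR code: the tab-separated log format, the sample records and the function names are all assumptions made for the example.

```python
from collections import Counter
from functools import reduce

def split_log(lines, piece_size):
    """Split the log into many small pieces of at most `piece_size` lines."""
    return [lines[i:i + piece_size] for i in range(0, len(lines), piece_size)]

def map_piece(piece):
    """Map step: count (user, action) pairs within one piece.

    Assumes each line is "user<TAB>action" (hypothetical format).
    """
    counts = Counter()
    for line in piece:
        user, action = line.split("\t", 1)
        counts[(user, action)] += 1
    return counts

def aggregate(partials):
    """Reduce step: merge the per-piece counts from all the nodes."""
    return reduce(lambda a, b: a + b, partials, Counter())

# Tiny stand-in for a very large click log.
log = [
    "John Smith\tview",
    "Jane Doe\tview",
    "John Smith\tsearch",
    "John Smith\tview",
    "Jane Doe\tpurchase",
]

pieces = split_log(log, piece_size=2)
# On a cluster each piece would be handled by a different node; here they run in turn.
partials = [map_piece(p) for p in pieces]
totals = aggregate(partials)

# "What John Smith did": filter the aggregated counts for one user.
john = {action: n for (user, action), n in totals.items() if user == "John Smith"}
print(john)
```

Because each piece is counted independently and the partial counts are simply added together, the work parallelizes across as many nodes as there are pieces.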
Features powered by Amazon Elastic MapReduce:
People Who Viewed this Also Viewed
Review highlights
Auto complete as you type on search
Search spelling suggestions
Top searches
Ads
200 Elastic MapReduce jobs per day, processing 3 TB of data
Data Analytics
3.5 billion records
71 million unique cookies
1.7 million targeted ads required per day
Execute batch processing on data sets ranging in size from dozens of gigabytes to terabytes
Building in-house infrastructure to analyze these click stream datasets requires investment in expensive “headroom” to handle peak demand.
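The cost of headroom can be illustrated with a small comparison of node-hours. The hourly demand profile below is entirely made up for the example, and the function names are hypothetical; only the shape of the comparison matters.

```python
def fixed_node_hours(hourly_demand):
    """In-house cluster: sized for the peak hour and kept running every hour."""
    return max(hourly_demand) * len(hourly_demand)

def elastic_node_hours(hourly_demand):
    """Elastic cluster: sized to the demand of each individual hour."""
    return sum(hourly_demand)

# Hypothetical demand in nodes per hour: quiet for 20 hours, then a 4-hour batch peak.
demand = [2] * 20 + [50] * 4

fixed = fixed_node_hours(demand)
elastic = elastic_node_hours(demand)
print(f"fixed: {fixed} node-hours, elastic: {elastic} node-hours")
```

With these invented numbers the fixed cluster provisions 1200 node-hours a day against 240 node-hours of actual demand; the difference is the headroom.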
“Our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before”
Targeted Ad
User recently purchased a sports movie and is searching for video games (1.7 million per day)