Agenda
08:30 AM Breakfast
09:00 AM Introduction and Strengths of Technologies
10:00 AM Start an EMR Cluster
10:15 AM Break + set up query tool
10:30 AM Hadoop hands-on
10:55 AM Break
11:10 AM Redshift hands-on
11:40 AM Operationalizing your code
12:00 PM Adjourn
12/6/2014 2
DataKitchen Leadership
Chris Bergh (Executive Chef)
Gil Benghiat (VP Product)
Eric Estabrooks (VP Cloud and Data Services)
Software development origins and executive experience delivering enterprise software focused on Marketing and Health Care sectors.
Deep Analytic Experience: Spent past decade solving the analytic data preparation problem
New Approach To Data Preparation and Production: focused on the Analysts
This creates an expectation gap
[Diagram: Business Customer Expectation — Analyze, Communicate; Analyst Reality — Prepare Data, Analyze, Communicate]
The business does not think that Analysts are preparing data
(Analysts don’t want to prepare data)
What Analyst Really Want: An Integrated Data Set Ready For Analysis
With: Autonomy & Agility
Without: All the Work & Anxiety
Experience of Audience
• Who considers themselves
• Analyst
• Data scientist
• Programmer / Scripter
• On the Business side
• Who knows SQL – can write a simple SELECT?
• Who had an AWS account before today?
What Is Apache Hadoop?
• Software framework
• Large-scale data processing
• Runs on a network of commodity hardware
• Handles hardware failures automatically
http://hadoop.apache.org/
What is Hadoop good for?
• Problems that are huge (batch), but not hard, and can be run in parallel over immutable data
• NOT OLTP (e.g. backend to e-commerce site)
• Providing a Map Reduce framework
You can write map reduce jobs in your favorite language
Streaming Interface
• Lets you specify mappers and reducers
• Supports:
  • Java
  • Python
  • Ruby
  • Unix shell
  • R
  • Any executable
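As a sketch of how the streaming interface works, here is a minimal word-count mapper and reducer in Python. The script name and the `hadoop jar` invocation in the comment are illustrative; Hadoop Streaming feeds input on stdin and sorts mapper output by key before the reduce stage.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming word count (illustrative invocation):
#   hadoop jar hadoop-streaming.jar \
#       -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#       -input in/ -output out/
import sys

def mapper(lines):
    # Emit one tab-separated (word, 1) pair per word.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(lines):
    # Hadoop sorts mapper output by key, so equal words arrive adjacent.
    current, count = None, 0
    for line in lines:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, count)
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield "%s\t%d" % (current, count)

if __name__ == "__main__" and sys.argv[1:]:
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(sys.stdin):
        print(out)
```

The same pattern works in any language that can read stdin and write stdout, which is the point of the streaming interface.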
Map Reduce “generators”
• Scripts compile down to map reduce jobs
• Pig
• Hive
Applications that lend themselves to map reduce
• Word Count
• PDF Generation (NY Times 11,000,000 articles)
• Analysis of stock market historical data (ROI and standard deviation)
• Geographical Data (Finding intersections, rendering map files)
• Log file querying and analysis
• Statistical machine translation
• Spam detection
• Analyzing Tweets
Another use: some people use a Hadoop cluster as a “data lake”
• Store all your raw data
• Cook it on demand
Pig
• Pig Latin - the scripting language
• Grunt – Shell for executing Pig Commands
http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009
This is what it would be in Java
Hive
You write SQL! Well, almost: it is HiveQL
SELECT user.*
FROM user
WHERE user.active = 1;
[Diagram: SQL Workbench sends queries to the cluster over JDBC; a Hive script runs on the cluster]
The first hands-on session will focus on this.
In Amazon, the common workflow for batch processing starts and ends with s3.
Impala
• Uses a SQL dialect very similar to HiveQL
• Runs 10-100x faster
• Runs in memory so it does not scale up as well
• Great for developing your code on a small data set
• Can use interactively with Tableau and other BI tools
• Some batch jobs run faster on Impala than Hive
What is EMR?
• Hadoop offered by Amazon
• EMR = Elastic Map Reduce
• Amazon does almost all of the work to create a cluster
Three ways to pay for EMR
• On Demand - highest price, by the hour, no commitment
• m1.small $0.055 per Hour
• i2.8xlarge $7.09 per hour
• (29 different machine options)
• Reservation - 1- and 3-year terms (No, All, & Partial Upfront)
• Spot - lowest price, machine can be taken away
Do I leave my cluster up all the time?
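To make that question concrete, here is a back-of-the-envelope comparison in Python, using the example m1.small on-demand rate from the slide ($0.055/hour); the node count and job length are illustrative, and rates change, so check current AWS pricing.

```python
# Rough monthly cost: always-on cluster vs. transient (start/stop) cluster.
HOURLY_RATE = 0.055   # USD per node-hour (example m1.small rate from the slide)
NODES = 10            # illustrative cluster size

def monthly_cost(hours_per_day, days=30, nodes=NODES, rate=HOURLY_RATE):
    """Cost of running `nodes` machines `hours_per_day` hours a day for a month."""
    return hours_per_day * days * nodes * rate

always_on = monthly_cost(24)   # cluster never shut down
transient = monthly_cost(3)    # started only for a 3-hour nightly batch job
print("always-on: $%.2f, transient: $%.2f" % (always_on, transient))
# → always-on: $396.00, transient: $49.50
```

For batch workloads the transient pattern is usually the win, which is why the workflow starts and ends with s3 rather than with the cluster's own disks.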
What Is Redshift?
• Columnar database
• Great for reads
• Scale by adding machines
• Two ways to pay
• On Demand
• Reservation
• Good for SQL-based ETL too
Redshift Machine Options (on demand prices)
Petabyte scale
Remember: Amazon charges for s3 storage too
Redshift usage pattern
• Load data to s3 first
• Use BI tools to send in SQL
• Amazon Redshift is based on PostgreSQL
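A sketch of the load-from-s3 step: Redshift's COPY command pulls data from s3 in parallel, and because Redshift speaks the PostgreSQL wire protocol, any PostgreSQL driver (psycopg2, for example) can send it. The table name, bucket, and IAM role ARN below are hypothetical placeholders.

```python
def copy_from_s3(table, s3_path, iam_role, fmt="CSV"):
    """Build a Redshift COPY statement that loads `table` from an s3 prefix."""
    return (
        "COPY %s FROM '%s' IAM_ROLE '%s' FORMAT AS %s;"
        % (table, s3_path, iam_role, fmt)
    )

sql = copy_from_s3(
    "events",                                        # hypothetical target table
    "s3://my-bucket/events/2014-12-06/",             # hypothetical s3 prefix
    "arn:aws:iam::123456789012:role/redshift-load",  # example role ARN
)
print(sql)
# A PostgreSQL driver would then execute `sql` against the cluster endpoint.
```

Loading through s3 rather than row-by-row INSERTs is the usage pattern the slide describes: COPY is the fast path into the columnar store.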
The second hands-on session will focus on this.
[Diagram: SQL Workbench sends queries to Redshift over JDBC]
Should I use Redshift or EMR?
Redshift for
• Structured data
• Interactive queries
• Speed
Hadoop for
• Data format flexibility
• Computation flexibility
• Super Big Data
• Try both
• Compare costs
• If it works in Redshift, start there
Recap
• Started a Hadoop cluster via the AWS Console (Web UI)
• Loaded Data
• Wrote some queries
• Same for Redshift
Eventually, you will do this for real and have a script that has value.
Now what?
To run your data job you need to …
• Wait for the new data to arrive
• Move it to s3
• Start a cluster
• Load the data
• Run your SQL scripts
• Wait for it to finish
• Shut down your cluster
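The steps above can be automated; with boto3, for instance, you would pass a parameter dict like the one this sketch builds to `boto3.client("emr").run_job_flow(**params)`. All names, instance counts, and s3 paths here are illustrative placeholders.

```python
# Sketch: parameters for a transient EMR cluster that runs one Hive step
# and shuts itself down (all names and paths are placeholders).
def build_cluster_params(name, log_uri, script_uri):
    return {
        "Name": name,
        "LogUri": log_uri,
        "Instances": {
            "InstanceCount": 3,
            "MasterInstanceType": "m1.small",
            "SlaveInstanceType": "m1.small",
            # Terminate when the last step finishes, so nobody
            # has to remember to shut the cluster down.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "nightly-hive-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script",
                         "--args", "-f", script_uri],
            },
        }],
    }

params = build_cluster_params(
    "nightly-batch", "s3://my-bucket/logs/", "s3://my-bucket/job.hql")
# With boto3: boto3.client("emr").run_job_flow(**params)
```

Scripting the checklist removes the "wait for it to finish" and "shut down your cluster" steps from a human's to-do list; what it does not remove is the hoping, which is the next slide.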
And hope …
• The new data is in the right format
• Assumptions you made during development are still true
• Someone did not mess up your code with an “easy change”
• The new data transfers run successfully
• A table you depend on has been updated correctly
• The new data has not been truncated by the source
• No data quality issues with the source data
Wouldn’t it be great to turn your hopes into tests?
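One way to do that: run a few assertions against the new data before spending cluster time on it. A minimal sketch, where the column names, thresholds, and sample feed are illustrative placeholders:

```python
import csv, io

# Pre-flight checks on an incoming feed: verify format and assumptions
# (column layout, truncation, missing keys) before processing starts.
EXPECTED_COLUMNS = ["user_id", "event", "timestamp"]
MIN_ROWS = 2  # guards against "the source truncated the feed"

def check_feed(raw_text):
    """Return a list of problems found in the feed; empty means safe to run."""
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    errors = []
    if rows and list(rows[0].keys()) != EXPECTED_COLUMNS:
        errors.append("unexpected columns: %s" % list(rows[0].keys()))
    if len(rows) < MIN_ROWS:
        errors.append("feed looks truncated: only %d row(s)" % len(rows))
    if any(not r["user_id"] for r in rows):
        errors.append("empty user_id values")
    return errors

feed = "user_id,event,timestamp\n1,click,2014-12-06\n2,view,2014-12-06\n"
print(check_feed(feed))  # → [] : safe to start the cluster
```

Each "hope" on the list above can become one such check, run automatically before the cluster is ever started.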
DataKitchen: We produce the data
SQL, tests, and the checklist go into a Recipe.
Your data are Ingredients.
The results are Servings.
DataKitchen brings reality in line with expectations
[Diagram: the expectation-gap picture from earlier, with a third row added: With DataKitchen — Prepare Data, Analyze, Communicate — bringing the Analyst's reality in line with the business customer's expectation]
The story of our first Recipe
With DataKitchen, we got 75% of our time back!
… and we don’t have to remember to shut down our cluster.