Copyright © 2012, SAS Institute Inc. All rights reserved.
SAS DATA LOADER FOR HADOOP: CUSTOMER CHALLENGES AND SOLUTION BENEFITS
TASS – SEPTEMBER 2015
JAMES WAITE
SAS DATA LOADER FOR HADOOP: AGENDA
What Is Hadoop?
Big Data Challenges
Hadoop Challenges
Data Loader for Hadoop
Demo
Additional Resources
WHAT IS HADOOP?
HADOOP WHAT IT PROVIDES
Open-source Software
• Free to download, use and contribute to
Framework
• All program elements, connections, etc. are provided by the software
Massive Storage
• Framework breaks big data into blocks, which are stored on clusters of
commodity hardware
Processing Power
• Concurrently processes large amounts of data using multiple low-cost
computers
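The "Massive Storage" point above can be made concrete with a little arithmetic. The sketch below is illustrative only: the 128 MB block size and replication factor of 3 are assumed values (common HDFS defaults, not figures from this deck), and the class and method names are hypothetical.

```java
class BlockMath {
    // Number of blocks needed to hold a file: ceiling division of file size by block size.
    public static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw bytes stored across the cluster when every block is kept in `replication` copies.
    public static long rawBytesStored(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // assumed block size: 128 MB
        long oneGiB = 1024L * 1024 * 1024;     // example file: 1 GiB
        int replication = 3;                   // assumed replication factor

        System.out.println("blocks: " + blockCount(oneGiB, blockSize));          // blocks: 8
        System.out.println("raw bytes: " + rawBytesStored(oneGiB, replication)); // three times the file size
    }
}
```

Under these assumptions a 1 GiB file is split into 8 blocks, and each block can be processed on a different node in parallel.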
HADOOP WHAT IT OFFERS
Computing Power
• Distributed computing
Flexibility
• No need to preprocess data
Fault Tolerance
• Processing failover, data redundancy
Low Cost
• Open source, runs on commodity hardware
Scalability
• Add nodes as needed, with little administration
TERMINOLOGY TRADITIONAL
Primary Key
Index
Table
Normalize
Foreign Key
Relationship
Constraint
RDBMS
SQL
Database
Schema
TERMINOLOGY HADOOP
Hadoop
Pig
Block
Hive
Cloudera
NameNode
YARN
Cluster
JobTracker
HDFS
DataNode
MapReduce
TERMINOLOGY “IT’S ALL GREEK” TO ME (MOST)!
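Είναι όλα τα ελληνικά μου. (It's all Greek to me.)
Παραδεισένιο νησί. (Paradise island.)
Όμορφη αρχιτεκτονική. (Beautiful architecture.)
Ο Θεός της βροντής. (The god of thunder.)
Τραγωδία. (Tragedy.)
Ολυμπιακοί Αγώνες. (Olympic Games.)
Γιαούρτι. (Yogurt.)
Ελληνορωμαϊκή. (Greco-Roman.)
Μεγάλοι της λογοτεχνίας και της φιλοσοφίας. (Greats of literature and philosophy.)
Σαλάτα. (Salad.)
Αρχαίοι ναοί. (Ancient temples.)
Μεσογείου. (Mediterranean.)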
BIG DATA DRIVERS AND CUSTOMER CHALLENGES
CHALLENGE HADOOP SKILLS SHORTAGE
Performing even the simplest tasks in
Hadoop typically requires mastering
disparate tools and writing hundreds of
lines of code.
Fact: Only a limited number of users
have the necessary Hadoop skills:
• MapReduce
• Pig Latin
• HiveQL
• HDFS
• Sqoop and Oozie
SAS & INTEL STUDY: HADOOP ADOPTION & CHALLENGES
Research summary: SAS and Intel asked more than 300 IT managers from the largest
companies in Denmark, Finland, Norway and Sweden about the adoption of Big Data
analytics and Hadoop.
http://nordichadoopsurvey.com
Results & Key Findings
Primary reason for considering Hadoop
• 60% cited advanced analytics, data discovery, or use as an analytical lab
• 22% would like to speed up processing
Adoption / Obstacles
• 35% cited “Resources and Competencies”
HADOOP BIG DATA CHALLENGES
Source: Gartner (Sep 2014), Big Data Investment Grows but Deployments Remain Scarce in 2014 By Nick Heudecker, Lisa Kart
CHALLENGE HADOOP SKILLS SHORTAGE
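/* Sort by subject ID so BY-group processing can flag first and last records. */
proc sort data=dsn out=temp;
  by usubjid;
run;
/* Subjects that occur exactly once: a lone record is both first and last in its BY group. */
data unique;
  set temp;
  by usubjid;
  if first.usubjid and last.usubjid;
run;
/* One record per subject: keep the first record of each BY group. */
data nodups;
  set temp;
  by usubjid;
  if first.usubjid;
run;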
CHALLENGE HADOOP SKILLS SHORTAGE
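// Package matches the class name passed to the `hadoop jar` command.
package org.myorg;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class CalculateDistinct {
  // Mapper: emits (line, 1) for every input line.
  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text("");
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      word.set(value.toString());
      output.collect(word, one);
    }
  }
  // Reducer: counts the values for each distinct key and emits (key, count).
  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
        Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += 1;
        values.next();
      }
      output.collect(key, new IntWritable(sum));
    }
  }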
CHALLENGE HADOOP SKILLS SHORTAGE (cont’d)
  // Driver: configures and submits the job (input path in args[0], output path in args[1]).
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CalculateDistinct.class);
    conf.setJobName("Calculate Distinct");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
javac -classpath hadoop-0.20.1-dev-core.jar -d CalculateDistinct/ CalculateDistinct.java
jar -cvf CalculateDistinct.jar -C CalculateDistinct/ .
hadoop jar CalculateDistinct.jar org.myorg.CalculateDistinct /user/john/in/abc.txt /user/john/out
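To see what the job above computes without a cluster, the following plain-Java sketch mimics its logic on an in-memory list. It is an illustration under stated assumptions, not part of the original deck: the class name `DistinctCountSketch` and the sample subject IDs are hypothetical, and no Hadoop API is used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DistinctCountSketch {
    // Mirrors the MapReduce job: the "map" step treats each line as a key with value 1,
    // and the "reduce" step counts how many values each distinct key received.
    public static Map<String, Integer> countDistinct(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(line, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical input: one subject ID per line, with one duplicate.
        List<String> lines = Arrays.asList("S001", "S002", "S001", "S003");
        Map<String, Integer> counts = countDistinct(lines);
        System.out.println(counts);        // {S001=2, S002=1, S003=1}
        System.out.println(counts.size()); // 3
    }
}
```

The number of keys in the result is the number of distinct values; keys with a count of 1 correspond to records that occur only once.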
CHALLENGE HADOOP SKILLS
The skill sets required to leverage the many benefits of a Hadoop-driven data
environment are substantial and often require training in many areas.
http://hortonworks.com/training/class/applying-data-science-using-apache-hadoop/
http://university.cloudera.com/instructor-led-training/introduction-to-data-science---building-recommender-systems
CHALLENGE USER TOOLS ARE NOT BIG DATA ENABLED
Big data brings new requirements:
• Access to HDFS
• Parallel Loads
• New Native file types
• Knowledge of file structures
• New languages & code
• Need to transform data In-cluster
User tools are not engineered to process
data inside Hadoop.
• Tools are not optimized for Hadoop
• Users move data out of Hadoop to do
data management and data quality
• This requires more processing time
• Data is duplicated and more storage is
required
• Users do not use the Hadoop platform
as it was designed
SOLUTION SAS & HADOOP
SAS has worked closely with the industry leaders in Hadoop development and has
developed tools and solutions that make it easier to use SAS with Hadoop.
Rather than asking users to adapt to entirely new languages to leverage Hadoop,
SAS has adapted traditional SAS routines and procedures to leverage Hadoop,
the end result being “SAS users can stay in SAS”.
• SAS/ACCESS Interface to Hadoop
• DS2 Programming
• SAS Data Loader
SOLUTION SAS & HADOOP
Rather than asking users to adapt to entirely new languages to leverage Hadoop,
SAS has adapted traditional SAS routines and procedures to leverage Hadoop,
the end result being “SAS users can stay in SAS”.
DS2 Programming: Essentials
https://support.sas.com/edu/schedules.html?id=1798&ctry=CA
DS2 Programming Essentials with Hadoop
https://support.sas.com/edu/schedules.html?id=2468&ctry=CA
THE KEY CHALLENGE: CLOSING THE GAPS IN THE DATA TO DECISION LIFECYCLE
[Diagram labels: TIME TO DECISION; VALUE CAPTURED; roles: Business Manager, IT Systems / Management, Data Scientist / Statistician, Business Analyst]
Gaps highlighted:
• Hadoop Skill Shortage
• User Tools are Not Hadoop Enabled
BIG DATA MANAGEMENT: ANALYSTS TAKE
Recommendation
“Use self-service interactive data preparation tools to enhance analyst productivity.” and
“improve the quality of data”
– Gartner, “Data Preparation Is Not an Afterthought”
THE KEY CHALLENGE: CLOSING THE GAPS IN THE DATA TO DECISION LIFECYCLE
[Diagram labels: TIME TO DECISION; VALUE CAPTURED; roles: Business Manager, IT Systems / Management, Data Scientist / Statistician, Business Analyst]
Gap highlighted:
• Hadoop Skill Shortage
MARKET TRENDS SELF-SERVICE DATA PREPARATION
Typically, data preparation is 70-80% of the work involved in any analytic project. That number increases as complexities of the data environment increase.
The rise of self-service data-preparation tools … is putting data management directly into the hands of analysts
SAS Data Loader for Hadoop showcases the
company's solid engineering talent and
reputation for building high-quality software
SAS DATA LOADER FOR HADOOP
SOLUTION OVERVIEW
SAS DATA LOADER FOR HADOOP: KEY FEATURES
Point-and-click UI designed for self-service data preparation
Leverage existing skills to prepare data on Hadoop as used on other data sources
Consistency & reuse: apply existing DQ standards on Hadoop data
Familiar toolset for the end-to-end analytical lifecycle
Purpose-Built to run on Hadoop, keeps it simple and focused
Enables parallel data movement and data quality tasks without writing code
Loads data to the SAS LASR Analytic Server
Big Compute: Moves the processing to the data
SAS DATA LOADER FOR HADOOP…
A “purpose-built”, easy-to-use data management solution
that specifically addresses acquiring, structuring, cleaning
and transforming data inside Hadoop
SAS Data Loader for Hadoop is a smart approach,
turning the Hadoop environment into a productive
environment where barriers are removed and data is
accessible and usable
SAS DATA LOADER ENABLES ORGANIZATIONS TO…
• Manage data inside Hadoop
• Reduce complexity of Hadoop
• Accelerate business user adoption
CAPABILITIES - SAS DATA LOADER FOR HADOOP
1 ACQUIRE DATA / DISCOVER DATA
Access data, move it into Hadoop, and assess the data structure and content
• Copy Data to Hadoop
• Profile Data
• Identification Analysis
• Query
2 TRANSFORM DATA
Select data of interest, manipulate it, and structure it into the data format desired
• Query
• Select Columns
• Apply Filters
• Map Columns
• Sort / Order
• Calculate Columns
• Transpose data
• Aggregate
• Transform data
3 CLEANSE DATA
Put data into a consistent format
• Validate
• Parse
• Standardize
4 INTEGRATE DATA
Combine datasets, including data that has no common key, remove duplicate data,
and create new data points through aggregation
• Join
• Create Match codes
• Sort & De-duplicate
• Aggregate
• Run a SAS program
5 DELIVER DATA
Load datasets into the SAS LASR in-memory analytic server, create new Hadoop tables,
and deliver data to other databases and apps
• Load SAS LASR
• Create tables
• Create views
• Copy from Hadoop
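The idea behind combining data that has no common key can be sketched with a deliberately naive match code. This is an assumption-laden illustration, not SAS's match-code algorithm: the normalization rule (uppercase, keep only letters and digits), the class `MatchCodeSketch` and the sample company names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class MatchCodeSketch {
    // Hypothetical match code: uppercase the value and drop everything except A-Z and 0-9,
    // so that spelling variants of the same name collapse to the same code.
    public static String matchCode(String value) {
        return value.toUpperCase(Locale.ROOT).replaceAll("[^A-Z0-9]", "");
    }

    // Groups values by match code; any group with more than one member is a
    // candidate set of duplicates (or of records to join across tables).
    public static Map<String, List<String>> groupByMatchCode(List<String> values) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String v : values) {
            groups.computeIfAbsent(matchCode(v), k -> new ArrayList<>()).add(v);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Acme Corp.", "ACME corp", "Globex");
        System.out.println(groupByMatchCode(names));
        // {ACMECORP=[Acme Corp., ACME corp], GLOBEX=[Globex]}
    }
}
```

A production match code would typically also handle abbreviations, word order and phonetic variants; the sketch only shows why a derived key lets records be standardized, de-duplicated and joined.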
INTRODUCING SAS DATA LOADER FOR HADOOP
Self-service big
data preparation
for business users
Certified by Hortonworks and Cloudera
PRIMARY AUDIENCE: WHO IS SAS DATA LOADER DESIGNED FOR?
• Business Users
• Data Analysts, Data Scientists, Statisticians
• Data Management Specialists
SAS DATA LOADER FOR HADOOP: BENEFITS
• Users of all skill levels can manage data in Hadoop
• Users can manipulate Hadoop data to fit their specific needs
• No need to write code
• Increases worker productivity and improves data quality
• Leverages the Hadoop cluster, including:
  • Parallel processing
  • Minimal data movement
• Enables reuse of skills you already have
• Unlocks and accesses many types of data
ADDITIONAL RESOURCES
FOR MORE INFORMATION
• Learn more about SAS Data Loader for Hadoop
• SAS Data Loader for Hadoop
• Learn more about SAS Data Management:
• SAS Data Management
• Learn more about SAS Hadoop offerings:
• SAS Solutions for Hadoop
• Follow us on Twitter: @sasdatamgmt
• Like us on Facebook: SAS Software
TRAINING
• Big Data Matters Webinar Series:
• Big Data On-Demand Webinar Series
• SAS Training:
• Introduction to SAS and Hadoop
• DS2 Programming Essentials with Hadoop
• Data Science: Building Recommender Systems with SAS and
Hadoop
THANK YOU!
USE CASES
USERS BUSINESS USERS
Activities:
• Self-service access to data
• Query and manipulate data
• Copy data to/from Hadoop
• Load data into SAS LASR
USERS DATA ANALYSTS, DATA SCIENTISTS, & STATISTICIANS
Activities:
• Create an analytics-ready dataset
• Discover new data sources
• Transform and manipulate data
• Optional: Write SAS DS2 code
• Load data into SAS LASR server
[Diagram labels: Event data; Customer data; Log files; Data Preparation; Analytics-ready dataset]
USERS DATA MANAGEMENT SPECIALISTS
Activities:
• Apply enterprise data management practices to Hadoop
• Manage data with discipline inside Hadoop
• Reuse data quality standards inside Hadoop
• Copy data to/from Hadoop
• Optimize SAS code to run in Hadoop
• Learn from Hadoop data discoveries
• Apply knowledge gained in enterprise environment
BUSINESS USER USE CASE: SELF-SERVICE BIG DATA ON-BOARDING, EXPLORATION AND DISCOVERY
Activities:
• User copies data from a data source into Hadoop
• User profiles the table to learn the structure/content of the data
• User queries the data and creates a new table specific to their needs
• User loads the new table into SAS LASR
[Diagram label: SAS® LASR ANALYTIC SERVER]
BUSINESS USER USE CASE: CONTINUES EXPLORATION AND DISCOVERY USING SAS VA…
[Diagram labels: SAS Data Loader for Hadoop; SAS Visual Analytics]
DATA SCIENTIST USE CASE: BIG DATA PREPARATION FOR ADVANCED ANALYTICS
Activities:
• User accesses a previously run profile report showing table information
• User defines a new table
• Creates new columns using calculations
• Pivots / transposes the table
• Uses functions to aggregate variables
• Writes a SAS DS2 program to append records with a calculated score
• Sorts the data and applies filters
• User then loads the table into SAS LASR
THANK YOU!
www.SAS.com