© 2015 IBM Corporation
Hardware improvements through the years
• CPU speeds:
– 1990: 44 MIPS at 40 MHz
– 2000: 3,561 MIPS at 1.2 GHz
– 2010: 147,600 MIPS at 3.3 GHz
• RAM:
– 1990: 640KB conventional memory (256KB extended memory recommended)
– 2000: 64MB
– 2010: 8-32GB (and more)
• Disk capacity:
– 1990: 20MB; 2000: 1GB; 2010: 1TB
• Disk latency (speed of reads and writes): little improvement in the last 7-10 years, currently around 70-80MB/sec
• How long does it take to read 1TB of data (at 80MB/sec)?
– 1 disk: 3.4 hours
– 10 disks: 20 min
– 100 disks: 2 min
– 1,000 disks: 12 sec
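These read times follow from simple arithmetic. A quick sanity check in Python (assuming decimal units, so 1 TB = 1,000,000 MB, and the 80 MB/sec figure above):

```python
def read_time_seconds(total_mb, disks, mb_per_sec=80.0):
    """Time to read total_mb of data spread evenly across `disks` disks read in parallel."""
    return total_mb / disks / mb_per_sec

one_tb_mb = 1_000_000  # 1 TB expressed in MB (decimal units)

print(read_time_seconds(one_tb_mb, 1) / 3600)   # ~3.5 hours
print(read_time_seconds(one_tb_mb, 10) / 60)    # ~21 minutes
print(read_time_seconds(one_tb_mb, 100) / 60)   # ~2 minutes
print(read_time_seconds(one_tb_mb, 1000))       # ~12.5 seconds
```

The slide's figures round these results; the point is that read time shrinks linearly with the number of disks reading in parallel.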
Parallel data processing is the answer!
• It has been with us for a while:
– GRID computing: spreads the processing load
– Distributed workloads: hard-to-manage applications, overhead on the developer
– Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)
About the IBM Open Platform with Apache Hadoop
• Flexible, enterprise-class support for processing large volumes of data
– Supports a wide variety of data (structured, unstructured, semi-structured)
– Supports a variety of popular APIs (industry-standard SQL, MapReduce, …)
• Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
– CPU + local disks = "node"
– Nodes can be combined into clusters
– New nodes can be added as needed without changing:
• data formats
• how data is loaded
• how jobs are written
Design principles of Hadoop
• A new way of storing and processing data: let the system handle most issues automatically:
– Failures
– Scalability
– Reduced communication
– Distribute data and processing power to where the data is
– Make parallelism part of the operating system
– Designed for heterogeneous commodity hardware
• Bring processing to the data!
• Hadoop = HDFS + MapReduce infrastructure
• Optimized to handle massive amounts of data through parallelism
• Reliability provided through replication
What is the Hadoop Distributed File System (HDFS)?
• Driving principles:
– Data is stored across the entire cluster (multiple nodes)
– Programs are brought to the data, not the data to the programs
– Follows the divide-and-conquer paradigm
• Data is stored across the entire cluster (the distributed file system)
– The entire cluster participates in the file system
– Blocks of a single file are distributed across the cluster
– A given block is typically also replicated for resiliency
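The block distribution and replication described above can be sketched with a toy placement function. This is illustration only: real HDFS placement is rack-aware and far more sophisticated, and all names here are made up:

```python
from itertools import cycle

def place_blocks(num_blocks, nodes, replication=3):
    """Toy round-robin placement of block replicas onto nodes.
    Illustrates the idea that each block of a file lives on several
    nodes; real HDFS uses a rack-aware placement policy instead."""
    placement = {}
    node_cycle = cycle(nodes)
    for block in range(1, num_blocks + 1):
        replicas = []
        # pick distinct nodes until we have enough replicas
        while len(replicas) < min(replication, len(nodes)):
            node = next(node_cycle)
            if node not in replicas:
                replicas.append(node)
        placement[block] = replicas
    return placement

layout = place_blocks(num_blocks=4, nodes=["node1", "node2", "node3"])
for block, replicas in layout.items():
    print(f"block {block}: stored on {replicas}")
```

With three nodes and a replication factor of 3, every block ends up on every node; with more nodes, replicas spread out, so losing any single node never loses a block.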
[Figure: a logical file (a stream of bits) is split into blocks 1-4; each block is stored, along with its replicas, on different nodes of the cluster.]
Introduction to MapReduce
A MapReduce application runs in three phases:
1. Map phase: break the job into small parts, distributed as map tasks to the cluster's data nodes
2. Shuffle: transfer the interim output for final processing
3. Reduce phase: boil all output down to a single result set, returned to the client

The classic word-count example (mapper and reducer, completed from the standard Hadoop example):

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text val, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> val, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) {
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

• Scalable to thousands of nodes and petabytes of data
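The map/shuffle/reduce phases can be simulated in miniature. A single-process Python sketch of the same word-count job, purely illustrative (no Hadoop involved):

```python
from collections import defaultdict

def map_phase(records):
    """Map: tokenize each input record and emit (word, 1) pairs."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

records = ["to be or not to be", "to do is to be"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)  # {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'do': 1, 'is': 1}
```

In Hadoop the map tasks run on many nodes at once and the shuffle moves data across the network; the logic per phase, however, is exactly this simple.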
Benefits
• Enterprise platform: reliability, resiliency, security
• Unlimited scale: multiple data sources, multiple applications, multiple users
• Wide range of data formats: files, semi-structured data, databases
Hadoop MapReduce challenges (cons)
• Ease of development: need deep Java skills; few abstractions available for analysts
• In-memory performance: no in-memory framework; application tasks write to disk with each cycle
• Combined workflows: only suitable for batch workloads; rigid processing model
What is Spark?
1. An Apache Foundation open source project, not a product
2. An in-memory compute engine that works with data, not a data store
3. Enables highly iterative analysis on large volumes of data at scale
4. Radically simplifies the process of developing intelligent apps fueled by data
5. A unified environment for data scientists, developers, and data engineers
Open source project
• 2002: MapReduce @ Google
• 2004: MapReduce paper
• 2006: Hadoop @ Yahoo!
• 2008: Hadoop Summit
• 2010: Spark paper
• 2014: Apache Spark becomes a top-level project
• 2014: 1.2.0 release in December
• 2015: 1.3.0 release in March
• 2015: 1.4.0 release in June

• Spark is HOT!!!
• Most active project in the Hadoop ecosystem
• One of the top 3 most active Apache projects
• Databricks was founded by the creators of Spark from UC Berkeley's AMPLab
• Activity for 6 months in 2014 (from Matei Zaharia, 2014 Spark Summit)
Resilient Distributed Dataset: definition
• An RDD is a distributed collection of Scala/Python/Java objects of the same type:
– RDD of strings, integers, …
– RDD of (key, value) pairs
– RDD of Java/Python/Scala class objects
[Figure: three RDDs (RDD1, RDD2, RDD3), each split into partitions held in memory across three slave nodes ("Spark RDD: in-memory distribution"), contrasted with file blocks on disk ("HDFS: on-disk distribution").]
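The idea of a typed collection split into partitions can be modeled in a few lines of Python. `ToyRDD` here is a hypothetical stand-in for illustration; it is not Spark's API:

```python
class ToyRDD:
    """Illustrative stand-in for an RDD: a collection of same-typed
    elements split into partitions that could live on different nodes."""
    def __init__(self, partitions):
        self.partitions = partitions  # list of lists of elements

    def map(self, fn):
        # each partition can be transformed independently, in parallel
        return ToyRDD([[fn(x) for x in part] for part in self.partitions])

    def collect(self):
        # gather all partitions back into one local list
        return [x for part in self.partitions for x in part]

# an "RDD" of (key, value) pairs spread over three partitions
pairs = ToyRDD([[("a", 1), ("b", 2)], [("c", 3)], [("d", 4)]])
doubled = pairs.map(lambda kv: (kv[0], kv[1] * 2))
print(doubled.collect())  # [('a', 2), ('b', 4), ('c', 6), ('d', 8)]
```

The partition is the unit of parallelism: in real Spark, each partition of `pairs` would be mapped by a separate task, potentially on a different node.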
Spark Programming Model
Resilient Distributed Dataset: operations
• Operations on RDDs (datasets):
– Transformations
– Actions
• Transformations use lazy evaluation
– Executed only if an action requires them
• An application consists of a directed acyclic graph (DAG)
– Each action results in a separate batch job
– Parallelism is determined by the number of RDD partitions
[Figure: a chain RDD1 -> RDD2 -> RDD3; action Act1 triggers Job-1 and action Act2 triggers Job-2.]
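Lazy transformations followed by an eager action can be mimicked with Python generators, which likewise compute nothing until a consumer demands results. An illustrative sketch, not Spark code:

```python
data = range(10)

# "Transformations": generators are lazy -- no element has been
# computed at this point, only the pipeline has been described
squared = (x * x for x in data)
evens = (x for x in squared if x % 2 == 0)

# "Action": sum() pulls elements through the whole pipeline,
# forcing the computation to actually run
total = sum(evens)
print(total)  # 0 + 4 + 16 + 36 + 64 = 120
```

This is exactly the benefit lazy evaluation gives Spark: the engine sees the whole chain before running anything, so it can plan one job per action instead of materializing every intermediate dataset.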
What happens when an action is executed?

// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

With one driver and three workers (holding blocks 1, 2, and 3 of the file), the first action (the "mysql" count) proceeds as follows:
1. The data is partitioned into different blocks across the workers.
2. The driver sends the code to be executed on each block.
3. Each worker reads its HDFS block.
4. Each worker processes its data and caches the result.
5. Each worker sends its data back to the driver.
For the second action (the "php" count), the cache pays off:
6. Each worker processes the data from its cache; no HDFS read is needed.
7. Each worker sends its result back to the driver.
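The same pipeline can be mimicked in plain Python to show why caching the intermediate `messages` result pays off. The log lines here are hypothetical sample data:

```python
log_lines = [
    "ERROR\tmysql connection refused",
    "INFO\tstartup complete",
    "ERROR\tphp fatal error",
    "ERROR\tmysql timeout",
]

# "Transformations": filter ERROR lines, then keep the message text
errors = [line for line in log_lines if line.startswith("ERROR")]
messages = [line.split("\t")[1] for line in errors]  # the "cached" result

# Two "actions" reuse the materialized messages list instead of
# re-reading and re-filtering the raw log for each count
mysql_count = sum("mysql" in m for m in messages)
php_count = sum("php" in m for m in messages)
print(mysql_count, php_count)  # 2 1
```

Without `cache()`, Spark would recompute `messages` from HDFS for each action; with it, the second count reads only the in-memory result, as in step 6 above.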
Spark Libraries
Apache Spark core, plus: Spark SQL, Spark Streaming, GraphX, MLlib, SparkR
• Extensions to the core Spark API (Python, Scala, Java)
• Improvements made to the core are passed on to these libraries
• Little overhead to use them with the Spark core
One unified environment for three roles:
• Data Scientist ("the thinker")
• App Developer ("the builder")
• Data Engineer ("the convincer")
Shared tooling across roles: Spark SQL, MLlib, and the Scala API.
Spark Empowers More People to Accelerate Insight
A typical workflow spans all three roles (data engineer, data scientist, app developer): business use case, understanding the data attributes, data cleaning, machine learning, analysis of accuracy, and finally integration with the other business applications.
Spark Advantages
• In-memory performance
• Ease of development
– Easier APIs
– Python, Scala, Java
– Resilient Distributed Datasets
• Combined workflows: unified processing for
– Batch
– Interactive
– Iterative algorithms
– Micro-batch
Example of the Hadoop Ecosystem
• ZooKeeper (coordination)
• Flume (data collection)
• HDFS or GPFS (distributed file system)
• YARN (resource manager)
• Pig (ETL)
• Oozie (workflow)
• Hive (data warehouse SQL)
• SystemT (text analytics)
• Big R & ML (statistical analysis)
• Ambari (monitoring)
• HBase (interactive storage)
• MapReduce v1 and MapReduce v2 (processing frameworks)
• Spark (processing framework)
Note: some components (the blue boxes on the original slide) are only available with the IBM BigInsights product.
IBM Announces Major Commitment to Advance Apache® Spark™
⦁ "…the most significant open source project of the next decade…"
Announcing: our commitment to Spark
⦁ Open source SystemML
⦁ Educate one million data professionals
⦁ Establish a Spark Technology Center
⦁ Founding member of AMPLab
⦁ Contributing to the core
We are Contributing SystemML
Our largest contribution to open source since Linux.
⦁ SystemML unifies the fractured machine-learning environments
⦁ Gives the core Spark ecosystem a complete set of DML (Declarative Machine Learning) capabilities
⦁ Allows a data scientist to focus on the algorithm, not the implementation
⦁ Improves time to value for data science teams
⦁ Establishes a de facto standard for reusable machine-learning routines
Educate 1 Million Data Scientists and Data Engineers
Our investment to grow skills:
⦁ Big Data University MOOC
⦁ Spark Fundamentals I and II
⦁ Advanced Spark Development series
⦁ Foundational Methodology for Data Science
⦁ Partnerships with Databricks, AMPLab, DataCamp, and MetiStream
Spark Technology Center
Our goal is to be the #1 Spark contributor and adopter:
⦁ Inspire the use of Spark to solve business problems
⦁ Encourage adoption through open and free educational assets
⦁ Demonstrate real-world solutions to identify opportunities
⦁ Use what we learn to improve Spark and its applications
Our Partner Ecosystem
Our Use of Spark at IBM
⦁ More than 30 IBM Research initiatives
⦁ 100 applications incubated in 10 days
⦁ 3,500 researchers and developers committed to Spark
Now:
⦁ IBM Open Platform with Apache Hadoop
⦁ IBM InfoSphere Streams
⦁ IBM Platform Computing
Targeted for later in the year:
⦁ Apache Spark as a Service on IBM Bluemix (in beta)
⦁ IBM Watson Analytics
⦁ SPSS Modeler & Analytics Server
⦁ IBM DataWorks
⦁ IBM PureData Systems with Fluid Query
⦁ IBM Commerce