The Modern Analytics Architecture
Making Big Data Useful
Joseph D’Antoni, Solutions Architect
Anexinet
May 7-9, 2014 | San Jose, CA
Joey D’Antoni
• Joey has over 15 years of experience with a wide variety of data platforms, in both Fortune 50 companies and smaller organizations
• He is a frequent speaker on database administration, big data, and career management
• He is the co-president of the Philadelphia SQL Server User’s Group
• He wants you to make sure you can restore your data
Agenda
• Data Warehouses: how did we get here?
• Big Data: Hadoop and more
• Modern Analytic Tools
• Building Our New Architecture
Data Warehouses—A History
• Data warehousing had its origins in the 1970s, when A.C. Nielsen provided clients with data marts
• In 1988, Barry Devlin and Paul Murphy (IBM) published “An Architecture for a Business and Information System”
• In 1996—Ralph Kimball published “The Data Warehouse Toolkit” which showcased models for OLAP style modelling
Data Warehouse Models
• Star Schema
• Advantage is that the DW is easier to use
• Facts and dimensions allow queries to perform faster
• Loading and ETL become more complicated
• Structure changes are very expensive
Dimensional Model
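As a rough illustration of the fact-and-dimension idea above, here is a minimal sketch using Python's built-in sqlite3 module. The table names (fact_sales, dim_product, dim_date), columns, and sample rows are all hypothetical and chosen only for the example; a real warehouse would have many more dimensions and far larger fact tables.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO dim_date VALUES (10, 2013), (11, 2014);
INSERT INTO fact_sales VALUES (1, 10, 20.0), (1, 11, 30.0), (2, 11, 50.0);
""")


def sales_by_category_and_year(connection):
    """Join the fact table to its dimensions and aggregate the measure."""
    query = """
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_date AS d ON d.date_id = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
    """
    return connection.execute(query).fetchall()


rows = sales_by_category_and_year(conn)
```

Every analytic query follows the same shape: start at the fact table, join out to whichever dimensions are needed, then group. That regularity is what makes the model easier to use, and the ETL that has to populate the keys correctly is where the extra loading complexity comes from.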
Data Warehouse Model
• Tables are grouped by subject area (consumer, finance, products)
• Tables are linked by joins
• Very easy to add information into the database
• Queries are harder to write, and joins can be very expensive performance-wise
Normalization
Data Warehousing Challenges
• Data Quality
• ETL
• Performance and Scalability
• Costs: Licensing and Hardware
Data Quality
Extract, Transform, Load (ETL) Process
[Diagram labels: “Some Database”, “Some Process Your Business Doesn’t Care About”]
Credit: Buck Woody, Microsoft
Performance and Scalability
• Given the volume of data, DW queries can be very slow
• We use techniques like data compression to make them faster
• CPU used to be the bottleneck; now it tends to be storage
Costs
• Data warehouses need large servers
• Database systems are licensed by the size of the server (per core)
• Data warehouses need a lot of fast storage
• Large volumes of fast storage (SANs) are expensive
Traditional Solutions
Classic Data Analysis
Data Warehouse & BI Solutions
ETL
…Uses Just a Subset
Common Technical Themes
There are a lot of “big data” solutions, but most of them have a lot in common
• Built-in HA/DR through multiple copies of the data
• Designed for analytics processing more than OLTP
• Derived from open source solutions
• Designed around local storage and commodity hardware
Components of Modern Architecture
• Hadoop (and its ecosystem)
• EDW
• Analytics Engine
• Visualization Engine
Big Data Workflow for Combined Data and Analytics
[Figure: workflow diagram with five columns: Data, Acquire, Organize, Analyze, Decide]
• Data types: Structured, Semi-Structured, Unstructured
• Data: Master and Reference; Transactions; Machine Generated (Logs); Web; Text, Image, Audio, Video
• Acquire: DBMS (OLTP); Files; NoSQL (Key Value Data Store); HDFS
• Organize: ETL/ELT; Change Data Capture; Real-Time; Message-Based; Hadoop MR
• Analyze: ODS; Data Warehouse; Streaming (CEP Engine); In-Database Analytics
• Decide (Analytics): Reporting and dashboards; Alerting and recommendations; EPM, Social Apps; Text analytics and search; Advanced analytics; Interactive discovery
• Hardware: Big Data Cluster; High Speed Network; RDBMS Cluster; In-Memory Analytics
Source—Gartner, Credit Suisse, 8/12
Are We Leaving the RDBMS?
[Chart: CPUs, annotated with “Hadoop Project Starts” and “Exadata Launched”]
Costs—Big Data versus Data Warehouse
[Chart: “Hadoop and Data Warehouse Costs”, comparing Hadoop and Data Warehouse across Server, Storage, Licensing, and Total; vertical axis from $0 to $350,000]
• For the same cost you can build a 15-node Hadoop cluster
• The Hadoop cluster would have 3840 GB of RAM versus the 1024 GB in the DW server
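The memory comparison above works out as follows. This is only back-of-the-envelope arithmetic on the figures quoted on the slide; the variable names are made up for the example, and the per-node figure assumes RAM is spread evenly across the 15 nodes.

```python
# Figures taken from the slide.
HADOOP_NODES = 15
HADOOP_TOTAL_RAM_GB = 3840
DW_RAM_GB = 1024

# Assumption: memory is divided evenly across the Hadoop nodes.
ram_per_node_gb = HADOOP_TOTAL_RAM_GB / HADOOP_NODES

# How many times more aggregate RAM the cluster has than the single DW server.
ram_ratio = HADOOP_TOTAL_RAM_GB / DW_RAM_GB

print(f"RAM per Hadoop node: {ram_per_node_gb:.0f} GB")
print(f"Cluster RAM vs DW server: {ram_ratio:.2f}x")
```

Under that assumption each node carries 256 GB, and the cluster as a whole has 3.75 times the memory of the data warehouse server.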
Enter the Yellow Elephant
Hadoop
• Hadoop is the leading Big Data platform (ecosystem)
• Created by Doug Cutting and Mike Cafarella, and developed heavily at Yahoo
• Scales horizontally (2-socket x86 servers in massive clusters)
• Uses big, slow, local storage
• Extremely fault-tolerant
• In a nutshell: it’s a distributed file system (3 copies of data in the cluster) and a programming framework called MapReduce
Introducing Hadoop
• Host 1: Name Node
• Host 2: Secondary Name Node
• Host 3: Data Node
• Host 4: Data Node
• Host 5: Data Node
• Host 6: Data Node
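To make the "3 copies of data" idea concrete for a layout like this one, here is a toy sketch of replica placement across the four data nodes. It is an illustration only: the round-robin policy, the host names, and the block IDs are assumptions for the example, and real HDFS uses its own rack-aware placement policy managed by the name node.

```python
from itertools import cycle, islice

# Hypothetical names for the four data nodes in the diagram.
DATA_NODES = ["host3", "host4", "host5", "host6"]
REPLICATION = 3  # three copies of each block, as on the slide


def place_blocks(block_ids, data_nodes=DATA_NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct data nodes, round-robin."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    placement = {}
    for i, block_id in enumerate(block_ids):
        start = i % len(data_nodes)
        ring = islice(cycle(data_nodes), start, start + replication)
        placement[block_id] = list(ring)
    return placement


def surviving_blocks(placement, failed_node):
    """Return the blocks that still have a replica after one node fails."""
    return [
        block_id
        for block_id, nodes in placement.items()
        if any(node != failed_node for node in nodes)
    ]


placement = place_blocks(["blk_1", "blk_19", "blk_105"])
still_readable = surviving_blocks(placement, "host5")
```

With three replicas on distinct nodes, losing any single data node leaves every block readable, which is the basis of the fault tolerance claimed for the platform.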
How Map Reduce Works
• Automatic parallelism
• Fault tolerance
Map Phase
Input file: foo.log (HDFS Block 1, HDFS Block 19, HDFS Block 105)
1) Read splits into records
• Split 1: K:0 V…
• Split 2: K:123 V…
• Split 3: K:332 V…, K:368 V…
2) Run Map
• Map Task 1: K:INFO V…
• Map Task 2: K:INFO V:1, K:WARN V:1
• Map Task 3: K:Debug V:1, K:INFO V:1
3) Write and Sort Output
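The map phase above can be imitated in a few lines of plain Python. This is a single-process sketch of the idea, not Hadoop's API: the function names and the sample log lines are hypothetical, the log level is assumed to be the first token on each line, and the final reduce step is added here to complete the picture even though the slide only shows the map side.

```python
from collections import defaultdict


def read_records(split_text, start_offset=0):
    """Step 1: turn a split into (byte offset, line) records."""
    offset = start_offset
    for line in split_text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def map_task(records):
    """Step 2: emit (log level, 1) for every non-empty record."""
    for _offset, line in records:
        if line.strip():
            yield line.split()[0], 1


def shuffle_and_sort(mapped_pairs):
    """Step 3: group values by key and sort the keys."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_task(key, values):
    """Sum the counts for one key (reduce side, not shown on the slide)."""
    return key, sum(values)


def run_job(splits):
    """Run one map task per split, then shuffle, sort, and reduce."""
    mapped = []
    for split in splits:
        mapped.extend(map_task(read_records(split)))
    return dict(reduce_task(key, values) for key, values in shuffle_and_sort(mapped))


# Hypothetical contents of three splits of foo.log.
splits = [
    "INFO service started\n",
    "INFO request received\nWARN slow response\n",
    "DEBUG cache miss\nINFO request done\n",
]
counts = run_job(splits)
```

In a real cluster each map task runs on a node that holds the corresponding HDFS block, which is where the automatic parallelism comes from; here the tasks simply run one after another.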
Hadoop Ecosystem
HDFS
MapReduce
Note: This is only a subset of ecosystem!
YARN
Spark and Shark
• Hadoop 2 Enhancements
• Spark is in-memory
• Shark integrates Spark with Hive
Hadoop Architectural Decisions
• Distribution
• Components
• Support
• Cloud vs On-Premises
Choosing Your Hadoop Distribution
Hadoop Vendors
Technology: Hadoop Distributions
• Apache: Completely open source software for distributed clusters and map/reduce
• Cloudera: Industry-leading commercial distribution, good management tools
• Hortonworks: Open source distribution, Apache compatible
• MapR: Multiple enhancements to Apache Hadoop (rewrite of HDFS), high performance, enterprise ready
• Pivotal HD: EMC spinoff with strong financial backing; a full high-performance RDBMS (with BI connectors) on top of Hadoop
Cloud vs On-Premises
Cloud
• Short-term use
• Rapid scale
• Test use cases
• Pay as you go
• Internet data source
On-Premises
• Large long-term implementations
• Well-known workloads
• Shared clusters
• Large initial investment
Analytics Engine
Analytics
• Hadoop was not fast
• Full scans of files
• So how do we rapidly analyze data?
Columnar Databases
• Microsoft SQL Server (2012 & 2014)
• PDW
• HP Vertica
• HBase
• ParAccel
• InfiniDB
• EMC Greenplum
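As a rough sketch of why column-oriented storage suits analytic queries, the example below holds the same data in a row layout and a column layout and totals one field from each. It is a toy model in plain Python, not how any of the products listed above are implemented; the table contents and function names are hypothetical.

```python
# Hypothetical order data held row by row, as an OLTP-style store would.
rows = [
    {"order_id": 1, "region": "East", "amount": 20.0},
    {"order_id": 2, "region": "West", "amount": 30.0},
    {"order_id": 3, "region": "East", "amount": 50.0},
]


def to_columns(row_store):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in row_store:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_from_rows(row_store, column):
    """Row layout: every full row is visited to pick out one field."""
    return sum(row[column] for row in row_store)


def total_from_columns(column_store, column):
    """Column layout: only the requested column is touched."""
    return sum(column_store[column])


columns = to_columns(rows)
row_total = total_from_rows(rows, "amount")
column_total = total_from_columns(columns, "amount")
```

Both layouts give the same answer, but the column layout reads only the values it needs. Values within one column also tend to be similar to each other, which is part of why columnar engines usually compress well.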
In-Memory Databases
• SQL Server 2014
• SAP HANA
• Oracle TimesTen
• VoltDB
• Apache Spark
Analytics Tools Past and Present
Data Visualization
Tools for Data Visualization
• Excel (Power View and Power Map)
• Tableau
• Qlik
• Platfora
• Pentaho
Bringing This All Together
Power Query (Excel)
Q & A ?
Thank you for attending this session and the PASS Business Analytics Conference 2014