Managing Big Data
With the Microsoft Parallel Data Warehouse Appliance
Chris Campbell
Analytics Platform Practice Lead
About Me
Business Insights. Delivered.
BlueGranite provides end-to-end business analytics solutions.
Enable the organization to store and analyze large volumes of structured and non-structured data with optimized systems that can scale to meet demand.
Help your team understand past performance and prescribe actions through interactive dashboards, reports and predictive analysis.
Keep data in the hands of your decision makers wherever they are with interactive solutions on today’s mobile devices.
What’s all this about Big Data?
“Every day, the amount of data eBay processes adds up to an astonishing 50 petabytes”
“Walmart handles more than 1 million customer transactions every hour”
“Facebook handles 50 billion photos from its user base”
“The volume of business data worldwide, across all companies, doubles every 1.2 years”
“the data flow from all four LHC experiments represents 25 petabytes annual rate”
“as of 2012, every day 2.5 quintillion (2.5×10^18) bytes of data were created”
The Data Explosion
– Wikipedia
Big Data Defined
Transactional / Relational (Megabytes)
• Line of Business
User / Customer Generated (Gigabytes)
• Spreadsheets
• Documents
• Text Files
External / Public (Terabytes)
• Demographics
• Weather
• Government
• Marketing
Streaming / Social / Machine Generated (Petabytes)
• Server logs
• Clickstreams
• Sensor Output
• Manufacturing
• Images and Video
• Medical Equipment
• Test Results
• Social, Social, Social!
The four “V”s
Volume
Variety
Value
Velocity
Walmart handles over 1 million customer transactions every hour
eBay processes up to 50PB a day
LHC experiments generate 25PB annually
“Most firms estimate they are only analyzing 12% of the data they already have” … “In addition, it’s often impossible to judge what data is valuable and what isn’t”
– Forrester Research, The Forrester Wave™: Big Data Hadoop Solutions, Q1 2014
Why Big Data Matters
Traditional Data Warehouse Architecture
Source System Layer
Data Integration Layer
Warehouse Layer
Analytics Layer
Which products sell better when it rains?
What demographic makes up our product’s primary customer base?
Can I prevent failure modes?
Are my employees engaging in fraud?
What will a patient’s outcome likely be?
What is being said about our customer service?
Unstructured Data – The New Problem
Social
• Twitter
• Vine
• Blogs
• Comments
• Likes
• Surveys
Streaming
• Server Logs
• Manufacturing Equipment
• Alerts
• Sensor Data
• Medical Instruments
• Test Results
• Diagnostics
• Search
Semi-Structured
• Spreadsheets
• Documents
• Drawings
• Text
• XML
• Images and Video
• Gene Sequences
• Drug Interactions
A Modern Approach
The Data Lake
Parallel Data Warehouse
Scale Up or Scale Out?
Two types of architectures
• SMP – Scale Up: Scalability Decreases as Cost Increases
• MPP – Scale Out: Capacity and Performance Scale Linearly with Cost
Scale Up (SMP) vs. Scale Out (MPP)
16 Cores
• 1 x HP DL360 = $17,430.00 MSRP: 16 Cores (2 x Intel Xeon E5-2690 @ 2.9 GHz, 20 MB), 256 GB Memory (16 x 16GB PC3-12800R)
32 Cores
• Scale Up: 1 x HP DL560 = $36,487.00 MSRP: 32 Cores (4 x Intel Xeon E5-4650 @ 2.7 GHz, 20 MB), 256 GB Memory (16 x 16GB PC3-12800R)
• Scale Out: 2 x HP DL360 = $34,860.00 MSRP: 32 Cores (2 x 2 x Intel Xeon E5-2690 @ 2.9 GHz, 20 MB), 512 GB Memory (2 x 16 x 16GB PC3-12800R)
64 Cores
• Scale Up: 1 x HP DL980 = $121,353.00 MSRP: 64 Cores (8 x Intel Xeon E7-2380 @ 2.13 GHz, 24 MB), 1 TB Memory (64 x 16GB PC3-10600R LV)
• Scale Out: 4 x HP DL360 = $69,720.00 MSRP: 64 Cores (4 x 2 x Intel Xeon E5-2690 @ 2.9 GHz, 20 MB), 1 TB Memory (4 x 16 x 16GB PC3-12800R)
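The pricing comparison above comes down to cost per core. The following is a minimal sketch of that arithmetic, using only the MSRP figures and core counts quoted on the slides; the function name and the tuple layout are illustrative choices, not part of the original deck.

```python
# Prices and core counts are copied from the slides; labels are illustrative.
CONFIGS = [
    # (label, approach, unit_price_usd, units, total_cores)
    ("1 x HP DL360", "baseline", 17430.00, 1, 16),
    ("1 x HP DL560", "scale up", 36487.00, 1, 32),
    ("2 x HP DL360", "scale out", 17430.00, 2, 32),
    ("1 x HP DL980", "scale up", 121353.00, 1, 64),
    ("4 x HP DL360", "scale out", 17430.00, 4, 64),
]


def cost_per_core(unit_price_usd, units, total_cores):
    """Return the MSRP cost per core for a configuration."""
    return unit_price_usd * units / total_cores


for label, approach, price, units, cores in CONFIGS:
    print(f"{label:<14} {approach:<9} {cores:>3} cores  "
          f"${cost_per_core(price, units, cores):,.2f} per core")
```

On these numbers the scale-out configurations stay at roughly $1,089 per core at every size, while the 64-core scale-up server comes to roughly $1,896 per core.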
Scale Up (SMP) vs. Scale Out (MPP)
[Chart: Cost against Performance, with one curve for Scale Out and one for Scale Up]
SMP vs. MPP ROI
[Chart: one series for SMP and one for MPP]
Appliance Architecture
Ethernet Switch
Ethernet Switch
Infiniband Switch
Infiniband Switch
HA Server
Scale Unit Server
Scale Unit Server
Scale Unit Storage
Scale Unit Server
Scale Unit Server
Scale Unit Storage
Scale Unit Server
Scale Unit Server
Scale Unit Storage
Scale Unit Server
Scale Unit Server
Scale Unit Storage
HP PDW Architecture
• Designed to be fault tolerant from the ground up
Quarter Rack
• 2 Active Compute Servers
• 32 Cores
• 512 GB Memory
• 15 TB Uncompressed Storage
7 Full Racks
• 56 Active Compute Servers
• 896 Cores
• 14.3 TB Memory
• 1.2 PB Uncompressed Storage
Ethernet Switch
Ethernet Switch
Infiniband Switch
Infiniband Switch
Control Server
HA Server
Scale Unit Server
Scale Unit Server
Scale Unit Storage
Scale Unit Server
Scale Unit Server
Scale Unit Storage
Scale Unit Server
Scale Unit Server
Scale Unit Storage
Scale Unit Server
Scale Unit Server
Scale Unit Storage
Each Node
• 16 Cores
• 256 GB Memory
• 7.5 TB Uncompressed Storage
Full Rack
• 8 Active Compute Servers
• 128 Cores
• 2 TB Memory
• 60 TB Uncompressed Storage
Dell/Quanta PDW Architecture
• Designed to be fault tolerant from the ground up
Third Rack
• 3 Active Compute Servers
• 48 Cores
• 768 GB Memory
• 15 TB Uncompressed Storage
6 Full Racks
• 54 Active Compute Servers
• 864 Cores
• 13.8 TB Memory
• 1.2 PB Uncompressed Storage
Ethernet Switch
Ethernet Switch
Infiniband Switch
Infiniband Switch
Control Node
HA Server
Scale Unit Server
Scale Unit Server
Scale Unit Storage
Scale Unit Server
Scale Unit Server
Scale Unit Server
Scale Unit Storage
Scale Unit Server
Scale Unit Server
Scale Unit Server
Scale Unit Storage
Scale Unit Server
Each Node
• 16 Cores
• 256 GB Memory
• 7.5 TB Uncompressed Storage
Full Rack
• 9 Active Compute Servers
• 144 Cores
• 2.3 TB Memory
• 67.5 TB Uncompressed Storage
HP App System for PDW
Dell Parallel Data Warehouse Appliance
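The full-rack figures for both vendors follow from multiplying the per-node specification by the number of active compute servers. The sketch below shows that multiplication, assuming the per-node values quoted on the slides; `NodeSpec` and `rack_totals` are hypothetical names introduced here. Some slide figures, such as the 1.2 PB maximum, do not follow from this straight multiplication and presumably rest on a basis the slides do not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node figures as quoted on the slides."""
    cores: int = 16
    memory_gb: int = 256
    storage_tb: float = 7.5


def rack_totals(active_compute_servers, node=NodeSpec()):
    """Multiply the per-node figures by the number of active compute servers."""
    return {
        "cores": active_compute_servers * node.cores,
        "memory_gb": active_compute_servers * node.memory_gb,
        "storage_tb": active_compute_servers * node.storage_tb,
    }


print("HP full rack (8 servers):  ", rack_totals(8))
print("Dell full rack (9 servers):", rack_totals(9))
```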
Virtualized Architecture Overview
• General Details
  • Hosts and guests run Windows Server 2012 Standard
  • Fabric and workload contained in Hyper-V virtual machines
  • PDW Agent runs on all hosts and all VMs
  • Windows Storage Spaces handles mirroring and spares
• PDW Workload Details
  • SQL Server 2012 Enterprise Edition (PDW build)
[Diagram: Host 1 through Host 4 connected by IB & Ethernet, with direct-attached SAS to a JBOD; virtual machines labeled CTL, MAD, AD, VMM, Compute 1 and Compute 2]
Failover Functionality
• Cluster Shared Volumes:
  • CSV allows all nodes to access the LUNs on the JBOD as long as at least one of the hosts attached to the JBOD is active
  • Leverages SMB3 protocol
• Failover Details:
  • One cluster across the whole appliance
  • VMs are automatically migrated on host failure
  • Affinity and anti-affinity maps enforce rules
  • Failback continues to be through CSS
  • Leverages Windows Failover Cluster Manager
• Adding Passive Unit increases HA capacity:
  • Allow another VM to fail without disabling the appliance
  • All hosts connected to a single JBOD cannot failover
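The Cluster Shared Volumes rule above can be written as a small availability check. This is only a sketch of the stated rule (a JBOD stays reachable while at least one attached host is active); the wiring in `JBOD_ATTACHMENTS` and both function names are hypothetical and do not describe a real appliance layout or any Windows API.

```python
# Hypothetical wiring: two hosts direct-attached to each JBOD.
JBOD_ATTACHMENTS = {
    "JBOD A": ["Host 1", "Host 2"],
    "JBOD B": ["Host 3", "Host 4"],
}


def jbod_accessible(attached_hosts, active_hosts):
    """A JBOD's LUNs stay reachable while at least one attached host is active."""
    return any(host in active_hosts for host in attached_hosts)


def appliance_storage_ok(active_hosts, attachments=JBOD_ATTACHMENTS):
    """Storage is healthy only if every JBOD is still reachable."""
    return all(jbod_accessible(hosts, active_hosts)
               for hosts in attachments.values())


# One host lost per JBOD: every LUN is still reachable.
print(appliance_storage_ok({"Host 1", "Host 3"}))
# Both hosts on JBOD B lost: its LUNs are unreachable.
print(appliance_storage_ok({"Host 1", "Host 2"}))
```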
Data Storage
Built for Star Schemas
[Star schema diagram: Fact Sales at the center, joined to Dim Date, Dim Customer, Dim Product and Dim Store]
Two Kinds of Tables in a Data Warehouse
• Dimensions – What we report “by”
• Facts – What we report “on”
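To make the "by" and "on" distinction concrete, here is a minimal sketch that reports on a fact measure by a dimension attribute. It uses Python's built-in `sqlite3` module as a stand-in database, not PDW; the table names echo the slide, while the column names and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimStore (StoreKey INTEGER PRIMARY KEY, StoreName TEXT);
CREATE TABLE FactSales (StoreKey INTEGER, SalesAmount REAL);
INSERT INTO DimStore VALUES (1, 'North'), (2, 'South');
INSERT INTO FactSales VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")

# Report "on" SalesAmount (the fact) "by" StoreName (the dimension).
rows = conn.execute("""
SELECT d.StoreName, SUM(f.SalesAmount)
FROM FactSales AS f
JOIN DimStore AS d ON d.StoreKey = f.StoreKey
GROUP BY d.StoreName
ORDER BY d.StoreName
""").fetchall()

print(rows)
```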
Replicated Tables
[Diagram: CTL and compute nodes (Node 1 … Node 4), each compute node holding a Table Copy]
Distributed Tables
[Diagram: CTL and compute nodes (Node 1 … Node 4); 1,600 records split into sixteen ranges of 100, Records 0-100 through Records 1500-1600, spread across the nodes]
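The slide draws a distributed table as ranges of records spread over the nodes. A minimal sketch of the idea follows, assuming rows are placed by hashing a distribution key; CRC32 is a stand-in chosen because it is stable across runs, not the hash PDW uses, and `node_for` and `distribute` are hypothetical names.

```python
import zlib


def node_for(key, node_count):
    """Map a distribution-key value to a node index with a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % node_count


def distribute(keys, node_count):
    """Assign every key to exactly one node."""
    placement = {node: [] for node in range(node_count)}
    for key in keys:
        placement[node_for(key, node_count)].append(key)
    return placement


placement = distribute(range(1600), node_count=4)
for node, keys in placement.items():
    print(f"Node {node + 1}: {len(keys)} records")
```

Each record lands on exactly one node, so a scan of the whole table can run on all nodes in parallel, with each node reading only its own share.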
Join Compatibility
Node 3
• Replicated Dim Date: Years 1990-2015
• Distributed Dim Customer: Customers A-I
• Distributed Fact Sales: Sales For 2012, Customers A-Z
Node 4
• Replicated Dim Date: Years 1990-2015
• Distributed Dim Customer: Customers J-S
• Distributed Fact Sales: Sales For 2013, Customers A-Z
Node 5
• Replicated Dim Date: Years 1990-2015
• Distributed Dim Customer: Customers T-Z
• Distributed Fact Sales: Sales For 2014, Customers A-Z
All three nodes
• Distributed Dim Customer: Customers A-Z
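The join-compatibility layout above can be imitated in a few lines. This sketch rests on one reading of the slide: sales are distributed by year and customers by name range, so a node-local join misses customers held elsewhere until every node has the full A-Z customer set. All names and rows below are hypothetical.

```python
# Hypothetical miniature of the slide: (customer, year, amount) per node.
fact_sales = {
    "Node 3": [("Adams", 2012, 10), ("Smith", 2012, 20), ("Young", 2012, 30)],
    "Node 4": [("Adams", 2013, 40), ("Jones", 2013, 50)],
    "Node 5": [("Turner", 2014, 60), ("Baker", 2014, 70)],
}
dim_customer_distributed = {
    "Node 3": {"Adams", "Baker"},    # Customers A-I
    "Node 4": {"Jones", "Smith"},    # Customers J-S
    "Node 5": {"Turner", "Young"},   # Customers T-Z
}


def local_join(fact_by_node, dim_by_node):
    """Join each node's sales only against the customer rows held on that node."""
    return {
        node: [row for row in rows if row[0] in dim_by_node[node]]
        for node, rows in fact_by_node.items()
    }


def replicate(dim_by_node):
    """Give every node the full customer set (Customers A-Z)."""
    everyone = set().union(*dim_by_node.values())
    return {node: everyone for node in dim_by_node}


def joined_rows(result):
    return sum(len(rows) for rows in result.values())


total = sum(len(rows) for rows in fact_sales.values())
before = joined_rows(local_join(fact_sales, dim_customer_distributed))
after = joined_rows(local_join(fact_sales, replicate(dim_customer_distributed)))
print(f"{before} of {total} sales rows join locally before, {after} after")
```

The replicated Dim Date never has this problem, because every node already holds all of its rows.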
Skew
[Diagram: CTL and compute nodes (Node 1 … Node 4); sixteen monthly slices of sales, Jan 2013 Sales through Apr 2014 Sales, spread across the nodes]
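Skew appears when the distribution key sends an uneven share of rows to some nodes. The sketch below measures it as the largest node load divided by the average load. The data is invented (a December peak), modulo placement stands in for a real hash, and `skew_ratio` and `load_by_node` are hypothetical helpers.

```python
from collections import Counter


def skew_ratio(rows_per_node):
    """Largest node load divided by the average load; 1.0 means perfectly even."""
    loads = list(rows_per_node.values())
    return max(loads) / (sum(loads) / len(loads))


def load_by_node(keys, node_count, place):
    """Count how many rows each node receives under a placement function."""
    counts = Counter({node: 0 for node in range(node_count)})
    for key in keys:
        counts[place(key, node_count)] += 1
    return dict(counts)


# Hypothetical data: December carries far more sales rows than other months.
sales_months = [12] * 700 + [m for m in range(1, 12) for _ in range(100)]
order_ids = range(len(sales_months))


def modulo(key, node_count):
    return key % node_count


by_month = load_by_node(sales_months, 4, modulo)
by_order = load_by_node(order_ids, 4, modulo)

print("distributed by month:   ", by_month, "skew", skew_ratio(by_month))
print("distributed by order id:", by_order, "skew", skew_ratio(by_order))
```

A query is only as fast as its busiest node, so a high-cardinality, evenly spread key is usually the safer choice of distribution column.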
xVelocity Clustered Columnstore
[Diagram: table columns Customer, Sales, Country, Supplier, Products]
• Updateable and clustered xVelocity columnstore
• Clustered Columnstore can save up to 91% in storage usage
• Memory-optimized for next-generation performance
• Updateable to support bulk and/or trickle loading
• Reduced maintenance by minimizing indexes
• All PDW data types are supported
xVelocity Clustered Columnstore
• Table consists of column store and row store
• “Tuple mover” converts data into columnar format once a segment is full (1M rows)
• INSERT
  • Always lands in the delta store
• DELETE
  • Logical; does not physically remove the row until REBUILD is performed
• UPDATE
  • Logical DELETE followed by INSERT
• BULK INSERT
  • If the batch is > 100k rows, loads directly into the columnstore
• SELECT
  • Unifies data from the column and row stores
[Diagram: columns C1 to C6 in the ColumnStore and in the Delta (row) store, linked by the tuple mover]
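The behaviours listed for the clustered columnstore can be modelled with a toy class. This is a sketch of the rules as the slide states them, not of SQL Server internals: the thresholds default to the slide's round numbers, rows are assumed to be dicts with an `id` key, and every name here is hypothetical. REBUILD is left out, so logically deleted rows simply stay hidden.

```python
import itertools


class ColumnstoreSketch:
    """Toy model: a row-wise delta store plus columnar segments."""

    def __init__(self, segment_rows=1_000_000, bulk_threshold=100_000):
        self.segment_rows = segment_rows      # slide: segment is full at 1M rows
        self.bulk_threshold = bulk_threshold  # slide: batches > 100k skip the delta store
        self.delta_store = []                 # list of (row_id, row) pairs
        self.segments = []                    # {"rids": [...], "columns": {name: [...]}}
        self.deleted = set()                  # logically deleted row ids
        self._next_rid = itertools.count()

    def _tag(self, rows):
        return [(next(self._next_rid), dict(row)) for row in rows]

    def _compress(self, tagged):
        names = tagged[0][1].keys()
        return {"rids": [rid for rid, _ in tagged],
                "columns": {n: [row[n] for _, row in tagged] for n in names}}

    def insert(self, row):
        self.delta_store.extend(self._tag([row]))   # always lands in the delta store

    def bulk_insert(self, rows):
        tagged = self._tag(rows)
        if len(tagged) > self.bulk_threshold:
            self.segments.append(self._compress(tagged))  # straight to columnstore
        else:
            self.delta_store.extend(tagged)

    def tuple_mover(self):
        """Convert each full chunk of the delta store into a columnar segment."""
        while len(self.delta_store) >= self.segment_rows:
            full = self.delta_store[:self.segment_rows]
            self.delta_store = self.delta_store[self.segment_rows:]
            self.segments.append(self._compress(full))

    def _scan(self):
        for seg in self.segments:
            for i, rid in enumerate(seg["rids"]):
                yield rid, {n: vals[i] for n, vals in seg["columns"].items()}
        yield from self.delta_store

    def select(self):
        """Unify column and row stores, hiding logically deleted rows."""
        return [row for rid, row in self._scan() if rid not in self.deleted]

    def delete(self, key):
        for rid, row in list(self._scan()):
            if row["id"] == key:
                self.deleted.add(rid)          # logical only; nothing is removed

    def update(self, key, new_row):
        self.delete(key)                       # logical DELETE followed by INSERT
        self.insert(new_row)


table = ColumnstoreSketch(segment_rows=4, bulk_threshold=3)
table.bulk_insert([{"id": i, "qty": i * 10} for i in range(4)])  # 4 > 3 rows
table.insert({"id": 100, "qty": 1})
table.update(1, {"id": 1, "qty": 999})
table.delete(2)
print(len(table.segments), len(table.delta_store), len(table.select()))
```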
Polybase
Query Across PDW and Hadoop with Polybase
[Diagram: Data Scientists, BI Users and DB Admins send T-SQL to the Enhanced PDW query engine and receive Results. The engine spans PDW V2 (relational data: traditional schema-based DW applications) and Hadoop (non-relational data: Social Apps, Sensor & RFID, Mobile Apps, Web Apps)]
Polybase
• Allows for T-SQL Queries against HDFS Data
• Parallelization Affinity Between PDW and Hadoop
• Supports multiple flavors of Hadoop
  • HDInsight
  • Hortonworks
  • Cloudera
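Polybase itself is driven through T-SQL. The Python sketch below only imitates the underlying idea of applying a column list to delimited text at read time and joining the result with relational rows. The in-memory file, the customer table and both function names are hypothetical stand-ins, not Polybase or Hadoop APIs.

```python
import csv
import io

# Stand-in for a delimited text file stored in HDFS (hypothetical contents).
HDFS_FILE = io.StringIO(
    "1|click|2014-03-01\n"
    "2|view|2014-03-01\n"
    "1|view|2014-03-02\n"
)

# Stand-in for a relational table held in PDW (hypothetical contents).
CUSTOMERS = {1: "Contoso", 2: "Fabrikam"}


def read_external(file_obj, columns, delimiter="|"):
    """Apply a column list to delimited text as it is read."""
    for fields in csv.reader(file_obj, delimiter=delimiter):
        yield dict(zip(columns, fields))


def join_events_to_customers(events, customers):
    """Join the non-relational events to the relational customer rows."""
    for event in events:
        name = customers.get(int(event["customer_id"]))
        if name is not None:
            yield name, event["action"], event["event_date"]


events = read_external(HDFS_FILE, ["customer_id", "action", "event_date"])
results = list(join_events_to_customers(events, CUSTOMERS))
for row in results:
    print(row)
```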
[Diagram: a PDW rack (2 Ethernet switches, 2 Infiniband switches, Control Node, HA Server, 4 scale units of 2 servers and 1 storage each) beside a Hadoop rack (2 Ethernet switches, a Hadoop Name Node, 16 Hadoop Nodes)]
PDW In Action