BASLE BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA
HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Polybase challenges Hive: relational access to non-relational HDFS
Olaf Nimz
Agenda
Proposed marriage between SQL Server and Hadoop
Building Bridges to HDFS
Distributed query processing
Sensible Hybrid Scenarios
Take Home Message
1. Access to non-relational world is easier with Polybase
T-SQL only
Unstructured data still complex, e.g. nested JSON structures
2. Hybrid solutions
Fact Extractor - IoT
Staging Area for DWH – keep entire history
Dirty data source files
Near real-time
3. Scenarios
Swiss Air - Flight Logs
Swisscom - Call Data Records
Archiving (c)old DWH Facts
Polybase
Requirements
– Java (64-bit Oracle JRE 7 Update 51 or later)
– Azure storage account or Hadoop (not HDInsight)
> Hortonworks Data Platform (HDP 1.3, 2.0 – 2.3)
> Cloudera CDH (4.3, 5.1 – 5.5)
Installation Check
– SELECT SERVERPROPERTY ('IsPolybaseInstalled'); returns 1?
Configure Hadoop connectivity for the external data source
– EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7; RECONFIGURE;
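To confirm which connectivity level is actually active, the configured and running values can be compared. This is a minimal sketch that assumes Polybase is installed on the instance; a restart of the SQL Server and PolyBase services is typically needed before the new value is in use.

```sql
-- value        = the configured setting (after sp_configure / RECONFIGURE)
-- value_in_use = the setting the running instance currently uses
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = 'hadoop connectivity';
```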
Data Movement Services
Feature                                          SQL Server 2016   Azure SQL Data Warehouse   APS Appliance - PDW
Query Hadoop data with Transact-SQL              yes               no                         yes
Query Azure blob storage with Transact-SQL       yes               yes                        yes
Import data from Hadoop                          yes               no                         yes
Import data from Azure blob storage              yes               yes                        yes
Export data to Hadoop                            yes               no                         yes
Export data to Azure blob storage                yes               yes                        yes
Run PolyBase queries from Microsoft's BI tools   yes               yes                        yes
Push down query computations to Hadoop           yes               no                         yes
Objects for Polybase
2015 © Trivadis
Define external objects
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'S0me!nfo';
CREATE DATABASE SCOPED CREDENTIAL HadoopUser
WITH IDENTITY = '<hadoop_user_name>', SECRET = '<hadoop_password>';
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH ( TYPE = HADOOP,
LOCATION = 'hdfs://10.xxx.xx.xxx:xxxx',
RESOURCE_MANAGER_LOCATION = '10.xxx.xx.xxx:xxxx',
CREDENTIAL = HadoopUser );
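The same pattern applies when the target is Azure blob storage instead of a Hadoop cluster. The sketch below is an illustration only: the names AzureStorageCredential and AzureStorage and the <container>, <account> and key placeholders are assumptions, not part of the original deck. It also assumes a master key already exists in the database.

```sql
-- Hypothetical credential name; the secret is the storage account key.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user',   -- an arbitrary string; the key in SECRET does the authentication
     SECRET   = '<azure_storage_account_key>';

-- Hypothetical data source name; the wasbs:// location points at a blob container.
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH ( TYPE = HADOOP,
       LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net',
       CREDENTIAL = AzureStorageCredential );
```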
Define external objects
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE) );
CREATE EXTERNAL TABLE [dbo].[CarSensor_Data] (
[SensorKey] int NOT NULL, [CustomerKey] int NOT NULL,
[GeographyKey] int NULL, [Speed] float NOT NULL,
[YearMeasured] int NOT NULL )
WITH ( LOCATION = '/Demo/',
DATA_SOURCE = HadoopCluster,
FILE_FORMAT = TextFileFormat );
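For dirty source files, an external table can also be declared with reject options so that a handful of malformed rows does not abort every query. A minimal sketch, assuming the same data source and file format as above; the table name CarSensor_Data_Tolerant and the threshold of 100 are made up for illustration.

```sql
-- Hypothetical variant of the external table that tolerates some bad rows.
CREATE EXTERNAL TABLE [dbo].[CarSensor_Data_Tolerant] (
    [SensorKey]    int   NOT NULL,
    [CustomerKey]  int   NOT NULL,
    [GeographyKey] int   NULL,
    [Speed]        float NOT NULL,
    [YearMeasured] int   NOT NULL )
WITH ( LOCATION     = '/Demo/',
       DATA_SOURCE  = HadoopCluster,
       FILE_FORMAT  = TextFileFormat,
       REJECT_TYPE  = VALUE,   -- threshold is an absolute row count, not a percentage
       REJECT_VALUE = 100 );   -- illustrative threshold: the query fails once rejects reach this limit
```

Rows that cannot be converted to the declared column types count as rejected; with REJECT_VALUE = 0 a single bad row fails the query.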
Query external data
SELECT DISTINCT Insured_Customers.FirstName
, Insured_Customers.LastName
, Insured_Customers.YearlyIncome
, CarSensor_Data.Speed
FROM Insured_Customers
, CarSensor_Data -- comma join; the WHERE clause below turns it into an inner join
WHERE Insured_Customers.CustomerKey = CarSensor_Data.CustomerKey
AND CarSensor_Data.Speed > 35
ORDER BY CarSensor_Data.Speed DESC
OPTION (FORCE EXTERNALPUSHDOWN);
-- or OPTION (DISABLE EXTERNALPUSHDOWN)
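The same query can be written with an explicit join, which makes it easier to compare the two hints side by side. This is a sketch rather than the author's original statement; the aliases c and s are introduced here for readability.

```sql
-- Same result set as the comma-join version, but with pushdown switched off:
-- the rows are pulled from HDFS and the filter and join run on the SQL Server side.
SELECT DISTINCT c.FirstName
     , c.LastName
     , c.YearlyIncome
     , s.Speed
FROM Insured_Customers AS c
INNER JOIN CarSensor_Data AS s
        ON c.CustomerKey = s.CustomerKey
WHERE s.Speed > 35
ORDER BY s.Speed DESC
OPTION (DISABLE EXTERNALPUSHDOWN);
```

Running both variants and comparing elapsed times is a simple way to judge whether pushing the computation to Hadoop pays off for a given data volume.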
Export Data to Hadoop
Move cold data to Hadoop/Blob while keeping it queryable via an external table:
CREATE EXTERNAL TABLE [dbo].[FastCustomers2009] ( … );
INSERT INTO [dbo].[FastCustomers2009]
SELECT *
FROM Insured_Customers T1
JOIN CarSensor_Data T2
ON (T1.CustomerKey = T2.CustomerKey)
WHERE T2.YearMeasured = 2009
AND T2.Speed > 40;
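Writing into an external table only works once export is enabled on the instance. A minimal sketch of that prerequisite, followed by a simple check that the exported rows can be read back through the external table:

```sql
-- INSERT into an external table is rejected until this option is switched on.
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;

-- After the INSERT has run, the archived rows stay queryable via the external table.
SELECT COUNT(*) AS ExportedRows
FROM [dbo].[FastCustomers2009];
```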
Polybase
Objects in SSMS
Dynamic Management Views
Monitor and troubleshoot PolyBase queries using the DMVs.
longest running queries
longest running step of the distributed query
execution progress of the longest running step
- of a SQL step
- XML remote query plan
- of a DMS step
Find information about external DMS operations
- View the PolyBase query plan
- XML remote query plan (node properties)
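The troubleshooting steps listed above can be sketched with the PolyBase DMVs as follows. The execution id 'QID1234' and the step indexes are placeholders; in practice they are taken from the result of the preceding query.

```sql
-- 1. Longest running PolyBase queries.
SELECT r.execution_id, t.text, r.status, r.total_elapsed_time
FROM sys.dm_exec_distributed_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
ORDER BY r.total_elapsed_time DESC;

-- 2. Longest running step of one distributed query (placeholder execution id).
SELECT execution_id, step_index, operation_type, location_type,
       status, total_elapsed_time, command
FROM sys.dm_exec_distributed_request_steps
WHERE execution_id = 'QID1234'
ORDER BY total_elapsed_time DESC;

-- 3a. Execution progress if that step is a SQL step.
SELECT *
FROM sys.dm_exec_distributed_sql_requests
WHERE execution_id = 'QID1234' AND step_index = 1;

-- 3b. Execution progress if that step is a DMS step.
SELECT *
FROM sys.dm_exec_dms_workers
WHERE execution_id = 'QID1234' AND step_index = 2;

-- 4. External DMS operations (reads from / writes to the external source).
SELECT *
FROM sys.dm_exec_external_work
WHERE execution_id = 'QID1234'
ORDER BY total_elapsed_time DESC;
```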
JSON Format
Parse JSON text and read or modify values.
Transform arrays of JSON objects into table format.
Use any Transact-SQL query on the converted JSON objects.
Format the results of Transact-SQL queries in JSON format.
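A short sketch of these capabilities with the built-in JSON functions of SQL Server 2016. The variable @person and its content are made up for illustration; Insured_Customers is the table used in the earlier query slides.

```sql
DECLARE @person nvarchar(max) = N'{ "name": "John", "skills": [ "SQL", "C#" ] }';

-- Parse JSON text and read a scalar value.
SELECT JSON_VALUE(@person, '$.name') AS Name;

-- Modify a value; JSON_MODIFY returns the updated JSON text.
SET @person = JSON_MODIFY(@person, '$.name', 'Jane');

-- Append an element to an array.
SET @person = JSON_MODIFY(@person, 'append $.skills', 'Azure');

-- Format the result of a Transact-SQL query as JSON.
SELECT FirstName, LastName, YearlyIncome
FROM Insured_Customers
FOR JSON PATH;
```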
JSON
Parse «unstructured» JSON cell content
stored in the jsonCol column:
[ { "name": "John", "skills": [ "SQL", "C#", "Azure" ] }, { "name": "Jane", "surname": "Doe" } ]
SELECT Name, Surname,
JSON_VALUE(jsonCol, '$.info.address.PostCode') as PostCode,
JSON_VALUE(jsonCol, '$.info.address."Address Line 1"') +' '+
JSON_VALUE(jsonCol, '$.info.address."Address Line 2"') as Address,
JSON_QUERY(jsonCol, '$.info.skills') as Skills
FROM PeopleCollection
WHERE ISJSON(jsonCol) > 0
AND JSON_VALUE(jsonCol, '$.info.address.town') = 'Belgrade'
AND Status = 'Active'
ORDER BY JSON_VALUE(jsonCol, '$.info.address.PostCode')
Convert «unstructured» JSON to table
DECLARE @json nvarchar(max);
SET @json = N'[
{ "id" : 2, "info": { "name": "John", "surname": "Smith" }, "age": 25 },
{ "id" : 5, "info": { "name": "Jane", "surname": "Smith" }, "dob": "2005-11-04T12:00:00" }
]'
SELECT *
FROM OPENJSON(@json)
WITH (id int 'strict $.id',
firstName nvarchar(50) '$.info.name', lastName nvarchar(50) '$.info.surname',
age int, dateOfBirth datetime2 '$.dob')
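Nested structures are where the table definition gets harder. One possible approach, sketched here with the sample document from the previous slide, is to expose the nested array as a JSON fragment and expand it with a second OPENJSON call. The variable name @people is an assumption, and OPENJSON requires database compatibility level 130 or higher.

```sql
DECLARE @people nvarchar(max) = N'[
  { "name": "John", "skills": [ "SQL", "C#", "Azure" ] },
  { "name": "Jane", "surname": "Doe" }
]';

-- First OPENJSON: one row per person, with the skills array kept as JSON text.
-- Second OPENJSON: one row per array element; OUTER APPLY keeps people without skills.
SELECT p.name, p.surname, s.[value] AS skill
FROM OPENJSON(@people)
     WITH ( name    nvarchar(50)  '$.name',
            surname nvarchar(50)  '$.surname',
            skills  nvarchar(max) '$.skills' AS JSON ) AS p
OUTER APPLY OPENJSON(p.skills) AS s;
```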
Performance Scaling
Take Home Message
1. Access to non-relational world is easier with Polybase
T-SQL only
Unstructured data still complex, e.g. nested JSON structures
2. Hybrid solutions
Fact Extractor - IoT
Staging Area for DWH – keep entire history
Dirty data source files
Near real-time
3. Scenarios
Swiss Air - Flight Logs
Swisscom - Call Data Records
Archiving (c)old DWH Facts
Outlook
Table definition remains challenging
Push down computation
Scale-out the SQL Server side
– e.g. using an idle failover instance
See the blog post with code examples
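Scaling out the SQL Server side could look roughly like the sketch below, which adds an instance as a compute node to a PolyBase scale-out group. The head node name 'HEADNODE01' and the instance name 'MSSQLSERVER' are hypothetical; 16450 is the default port of the DMS control channel.

```sql
-- Run on the instance that should become a compute node
-- (hypothetical head node and instance names).
EXEC sp_polybase_join_group N'HEADNODE01', 16450, N'MSSQLSERVER';

-- After restarting the PolyBase services on that instance,
-- run this on the head node to list the nodes in the group.
SELECT *
FROM sys.dm_exec_compute_nodes;
```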
THANK YOU. Trivadis AG
Olaf Nimz
Sägereistrasse 29
8152 Glattbrugg
Tel. +41-44-808 70 20
Fax +41-44-808 70 21
www.trivadis.com