2013 © Trivadis
BASEL BERN BRUGG LAUSANNE ZUERICH DUESSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA
HDInsight in Windows Azure
Marc Schöni
Meinrad Weiss
April 2014
04.03.2014 HDInsight in Windows Azure R 1.00
Introduction
HDInsight on Windows Azure
Big data solutions deal with complexities of:
VOLUME
(Size)
VARIETY
(Structure)
VELOCITY
(Speed)
Big Data
VALUE
Hadoop/HDInsight
Focus
HDInsight Versions on Azure
Component                V 1.6    V 2.1               V3 (Preview)
Apache Hadoop            1.0.3    1.2.0               2.2.0
Apache Hive              0.9.0    0.11.0              0.12.0
Apache Pig               0.9.3    0.11                0.12
Apache Sqoop             1.4.2    1.4.3               1.4.4
Apache Oozie             3.2.0    3.2.2               4.0.0
Apache HCatalog          0.4.1    Merged with Hive    Merged with Hive
Apache Templeton         0.1.4    Merged with Hive    Merged with Hive
Ambari                   -        API v1.0            API v1.0
SQL Server JDBC Driver   3        No Information      No Information
Source: http://www.windowsazure.com/en-us/documentation/articles/hdinsight-component-versioning/
Hadoop Zoo
Storage: Windows Azure Blob Storage, HDFS
Hadoop Filesystem Interface
Components by role: Query & Metadata, Data Movement, Workflow, Monitoring
Windows Azure HDInsight Service
Windows Azure
Blob Storage
Windows Azure HDInsight Service
Windows Azure HDInsight Service
Focus
Windows Azure Storage
Scalable, durable, and available
Access from anywhere at any time
Only pay for what the service uses
Use from Windows Azure Compute
Use from anywhere on the internet
Datacenters and Regions
Northern Europe, Western Europe, South Central US, West US, East US
• Higher durability
  • 3 local replicas in primary location
  • Local replicas – synchronously replicated
  • Common failures (disk, node, rack) – use local copies to recover
  • Major disasters – contact customer about potential data loss
• Reduced Price – 23-34% based on how much you store
• Turn off Geo for your storage account in portal
  • Non-critical data that can be recreated on major disasters
  • Application manages its own replica
  • Companies have limitations on geo locations
• Highest level of durability
  • 3 local replicas each in primary and secondary locations
  • Local replicas – synchronously replicated
  • Geo replica – asynchronously replicated
  • Common failures (disk, node, rack) – use local copies to recover
  • Major disasters – use geo replicated copy (400+ miles apart)
• Price remains the same as before
• Enabled by default
Blob Storage Concepts
• Store large amounts of unstructured text or binary data with the fastest read performance
• Highly scalable, durable, and available file system
• Blobs can be exposed publicly over HTTP
• Securely lock down permissions to blobs
Azure Blob storage
Setting up the Windows Azure storage account
Azure Portal
Setup new Storage
Move Data to Azure Blob Storage
Azure Blob storage
Set-AzureStorageBlobContent `
    -File "C:...\2011\Weather2011_H1_JustData.csv" `
    -Container $containername `
    -Blob "FlightDelay/.../2011/Weather2011_H1_JustData.csv" `
    -Context $context
PowerShell
Tool like CloudBerry
Drag&Drop
Windows Azure
HDInsight Service
Windows Azure HDInsight Service
Setting up the Windows Azure HDInsight cluster
Windows Azure HDInsight
Azure Blob storage
HDInsight Console
Setup new Cluster
Provision Cluster via PowerShell
# Create a new HDInsight cluster:
# build the cluster config, attach the default storage account and
# the Hive metastore database, then provision the cluster.
$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes $clusterNodes `
    | Set-AzureHDInsightDefaultStorage `
        -StorageAccountName "$storageAccountName_Default.blob.core.windows.net" `
        -StorageAccountKey $storageAccountKey_Default `
        -StorageContainerName $containerName_Default `
    | Add-AzureHDInsightMetastore `
        -SqlAzureServerName "$hiveSQLDatabaseServerName.database.windows.net" `
        -DatabaseName $hiveSQLDatabaseName `
        -Credential $hiveCreds `
        -MetastoreType HiveMetastore `
    | New-AzureHDInsightCluster `
        -Version "3.0" `
        -Name $clusterName `
        -Location $location `
        -Credential $clusterCreds
Map Reduce
Hadoop MapReduce
• Programming framework (library and runtime) for analyzing datasets stored in HDFS
• Composed of user-supplied Map and Reduce functions:
  • Map() - subdivide and conquer
  • Reduce() - combine and reduce cardinality
1. Divide a large problem into sub-problems.
2. Perform the same function on all sub-problems.
3. Combine the output from all sub-functions.
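The three steps above can be sketched in a few lines. This is a minimal single-process illustration in Python, not Hadoop code; the helper names (split_into_chunks, map_chunk, reduce_results) and the sample numbers are assumptions made only for this example.

```python
from functools import reduce
from typing import Iterable, List


def split_into_chunks(values: List[int], chunk_size: int) -> List[List[int]]:
    """Step 1: divide a large problem into sub-problems."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def map_chunk(chunk: List[int]) -> int:
    """Step 2: the same function runs on every sub-problem (here: a partial sum)."""
    return sum(chunk)


def reduce_results(partials: Iterable[int]) -> int:
    """Step 3: combine the output of all sub-problems into one result."""
    return reduce(lambda a, b: a + b, partials, 0)


values = list(range(1, 11))                # hypothetical input: 1..10
chunks = split_into_chunks(values, 4)      # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
partials = [map_chunk(c) for c in chunks]  # [10, 26, 19]
total = reduce_results(partials)           # 55
```

In a real cluster the chunks would be file blocks and each call to map_chunk would run on a different node; the structure of the computation stays the same.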
MapReduce
• Rapidly process vast amounts of data in parallel, on a large cluster of compute nodes
• Framework schedules and monitors tasks, and re-executes failed tasks
• Typically, both input and output are stored in file system
Map Phase: Mapper on DataNode 1, DataNode 2, and DataNode 3
Shuffle/Sort: Data is shuffled across the network and sorted
Reduce Phase: Reducer on DataNode 1 and DataNode 3
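The three phases can be imitated in memory. The following Python sketch assumes (station, windspeed) records and a per-station maximum as the job; both are illustrative choices, not taken from the slides. In Hadoop the shuffle moves data between nodes, whereas here it is only a sort followed by a group-by-key.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input splits, one list per mapper.
splits = [
    [("123", 31), ("124", 34)],
    [("123", 26), ("125", 46)],
    [("124", 29), ("125", 22)],
]


def mapper(records):
    """Map phase: emit one (key, value) pair per input record."""
    for station, windspeed in records:
        yield station, windspeed


def shuffle_and_sort(mapped):
    """Shuffle/sort: bring all values of one key together, ordered by key."""
    ordered = sorted(mapped, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reducer(key, values):
    """Reduce phase: collapse all values of one key into a single result."""
    return key, max(values)


mapped = [pair for split in splits for pair in mapper(split)]
result = [reducer(key, values) for key, values in shuffle_and_sort(mapped)]
# result == [("123", 31), ("124", 34), ("125", 46)]
```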
Layout Windspeed Calculation
Complete data set:
StationID  Date        Windspeed
123        22.01.2012  31
124        22.01.2012  34
125        22.01.2012  22
126        22.01.2012  12
123        23.01.2012  26
124        23.01.2012  29
125        23.01.2012  46
126        23.01.2012  12
Split across Compute Node 1 and Compute Node 2, one day per node:
StationID  Date        Windspeed
123        23.01.2012  26
124        23.01.2012  29
125        23.01.2012  46
126        23.01.2012  12

StationID  Date        Windspeed
123        22.01.2012  31
124        22.01.2012  34
125        22.01.2012  22
126        22.01.2012  12
Layout Windspeed Calculation
Split across Data Node 1 and Data Node 2, one day per node:
StationID  Date        Windspeed
123        23.01.2012  26
124        23.01.2012  29
125        23.01.2012  46
126        23.01.2012  12
Map output:
Key  Value
Max  46

StationID  Date        Windspeed
123        22.01.2012  31
124        22.01.2012  34
125        22.01.2012  22
126        22.01.2012  12
Map output:
Key  Value
Max  34

Reduce output:
Key  Value
Max  46
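The windspeed calculation on this slide can be replayed with the slide's own rows. This is a Python sketch of the data flow only; the function names map_max and reduce_max are invented for the illustration, and the two lists stand in for the two data nodes.

```python
# Rows as shown on the slide: (StationID, Date, Windspeed).
day_22 = [(123, "22.01.2012", 31), (124, "22.01.2012", 34),
          (125, "22.01.2012", 22), (126, "22.01.2012", 12)]
day_23 = [(123, "23.01.2012", 26), (124, "23.01.2012", 29),
          (125, "23.01.2012", 46), (126, "23.01.2012", 12)]


def map_max(rows):
    """Map: each node reduces its local rows to one ("Max", value) pair."""
    return ("Max", max(windspeed for _, _, windspeed in rows))


def reduce_max(pairs):
    """Reduce: all pairs share the key "Max", so keep the largest value."""
    return ("Max", max(value for _, value in pairs))


map_output = [map_max(day_22), map_max(day_23)]  # [("Max", 34), ("Max", 46)]
final = reduce_max(map_output)                   # ("Max", 46)
```

Because every mapper emits the same key, a single reducer receives both pairs and only two values cross the network instead of eight rows.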
Hadoop Streaming Process
HDInsight .NET Support for MapReduce
• “NuGet” Microsoft .NET MapReduce API for Hadoop
• Execute job through PowerShell
• Collect the result on HDFS or directly into WASB storage
Creating a C# Mapper Program
• Reads in weather data through stdin
• Calculates the max windspeed
• Outputs key-value pair to stdout
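The slides implement the mapper as a C# console application. The sketch below shows the same stdin/stdout contract in Python instead, purely as an illustration. It assumes whitespace-separated StationID, Date, and Windspeed columns as in the slide layout and a tab between key and value in the output; the function name run_mapper is hypothetical.

```python
import io


def run_mapper(stdin, stdout):
    """Read weather rows, track the largest windspeed, emit one key-value pair."""
    max_windspeed = None
    for line in stdin:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        try:
            windspeed = float(fields[2])
        except ValueError:
            continue  # skip the header row ("StationID Date Windspeed")
        if max_windspeed is None or windspeed > max_windspeed:
            max_windspeed = windspeed
    if max_windspeed is not None:
        # Hadoop Streaming treats the text before the first tab as the key.
        stdout.write(f"Max\t{max_windspeed:g}\n")


# Local dry run with the rows from the slide; a real streaming job would
# pass sys.stdin and sys.stdout to run_mapper instead.
sample = io.StringIO(
    "StationID Date Windspeed\n"
    "123 23.01.2012 26\n"
    "124 23.01.2012 29\n"
    "125 23.01.2012 46\n"
    "126 23.01.2012 12\n"
)
out = io.StringIO()
run_mapper(sample, out)
mapper_output = out.getvalue()  # "Max\t46\n"
```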
Creating a C# Mapper
• Simple Console application
• No special libraries needed
Input:
StationID  Date        Windspeed
123        23.01.2012  26
124        23.01.2012  29
125        23.01.2012  46
126        23.01.2012  12
Output:
Key  Value
Max  46
Demo: Creating a C# Reducer Program
• Reads in key-value pairs through stdin
• Calculates the max windspeed
• Outputs the results to stdout
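The reducer described above is a C# program in the demo. As an illustration of the same contract, here is a Python sketch that reads tab-separated key-value pairs and writes the maximum per key. The tab separator and the function name run_reducer are assumptions, not details taken from the slides.

```python
import io


def run_reducer(stdin, stdout):
    """Read "key<TAB>value" pairs and emit the largest value seen per key."""
    maxima = {}
    for line in stdin:
        key, sep, value = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # skip lines without a tab separator
        try:
            number = float(value)
        except ValueError:
            continue  # skip values that are not numeric
        if key not in maxima or number > maxima[key]:
            maxima[key] = number
    for key in sorted(maxima):
        stdout.write(f"{key}\t{maxima[key]:g}\n")


# Local dry run with the two mapper outputs shown on the slides; a real
# streaming job would pass sys.stdin and sys.stdout to run_reducer instead.
pairs = io.StringIO("Max\t34\nMax\t46\n")
out = io.StringIO()
run_reducer(pairs, out)
reducer_output = out.getvalue()  # "Max\t46\n"
```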
Creating a C# Reducer
• Same program structure as Mapper
Input:
Key  Value
Max  46
Max  34
Output:
Key  Value
Max  46
Microsoft .NET SDK for Hadoop on CodePlex
http://hadoopsdk.codeplex.com/
HDInsight Interactive JavaScript and Hive Consoles
http://www.windowsazure.com/en-us/manage/services/hdinsight/interactive-javascript-and-hive-consoles/
Hive & HiveQL (HQL)
Hive architecture
• Built on top of Hadoop to provide data management, querying, and analysis
• Access and query data through simple SQL-like statements, called Hive queries
• In short, Hive compiles, Hadoop executes
Create, load, and query Hive tables
• HiveQL includes data definition language, data import/export and data manipulation language statements
• See https://cwiki.apache.org/confluence/display/Hive/LanguageManual
Demo: Create and Load Hive Tables
Windows Azure HDInsight, Hive, PowerShell Console
Diagram labels: Hive table, Partitioned Hive table, Bucketed table, CASE statement, Table partitioning, Join, “Cluster by” clause, Query results
Hive: Create Table
Define data structure on top of HDFS Files
CREATE EXTERNAL TABLE Weather_Data
  (Station     INT
  ,Date        STRING
  ,Visibility  DOUBLE
  ,Windspeed   DOUBLE
  ,Latitude    DOUBLE
  ,Longitude   DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '$rootpart/FlightDelay/Weather/2012';
External table = data files are not bound to the schema (DROP TABLE will not delete the corresponding files)
Hive: Create Partitioned Table
CREATE EXTERNAL TABLE ExtUSWeatherTypedDataPartitioned
  (Station     INT
  ,Date        DATE
  ,Visibility  DOUBLE
  ,Windspeed   DOUBLE
  ,Latitude    DOUBLE
  ,Longitude   DOUBLE)
PARTITIONED BY (year string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '$rootpart/FlightDelay/ExtUSWeatherDataPartitioned';

ALTER TABLE ExtUSWeatherTypedDataPartitioned
ADD PARTITION (year='2011')
LOCATION '$rootpart/FlightDelay/ExtUSWeatherDataPartitioned/year=2011';
Partition ‘Key’ -> Data in /year=2011/
A view is a purely logical object with no associated storage
As in a regular relational database
Create View
CREATE VIEW StrongWind AS
SELECT
   Station
  ,Date
  ,Visibility
  ,Windspeed
  ,Latitude
  ,Longitude
FROM Weather_Data  -- FROM clause is missing on the slide; Weather_Data is assumed
WHERE Windspeed > 10;
Using the Hive ODBC driver
• Connector to HDInsight Hive available as part of HDInsight Hadoop clusters
• Enable business intelligence, analytics, and reporting on data in Hive
Hive SQL Datatypes and Hive SQL Semantics
HQL Date Function
Return type  Function                                        Description
int          year(string date)                               Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970
int          month(string date)                              Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11
int          day(string date), dayofmonth(date)              Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1
int          hour(string date)                               Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12
int          minute(string date)                             Returns the minute of the timestamp
int          second(string date)                             Returns the second of the timestamp
int          weekofyear(string date)                         Returns the week number of a timestamp string: weekofyear("1970-11-01 00:00:00") = 44, weekofyear("1970-11-01") = 44
int          datediff(string enddate, string startdate)      Returns the number of days from startdate to enddate: datediff('2009-03-01', '2009-02-27') = 2
string       date_add(string startdate, int days)            Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'
string       date_sub(string startdate, int days)            Subtracts a number of days from startdate: date_sub('2008-12-31', 1) = '2008-12-30'
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
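The example values in the table can be cross-checked outside Hive. The Python sketch below re-creates a few of the functions with the standard datetime module; the helpers only borrow the Hive function names and are not Hive code, and ISO 8601 week numbering is assumed for weekofyear.

```python
from datetime import datetime, timedelta


def parse(value: str) -> datetime:
    """Accept 'YYYY-MM-DD' or 'YYYY-MM-DD HH:MM:SS' strings."""
    fmt = "%Y-%m-%d %H:%M:%S" if " " in value else "%Y-%m-%d"
    return datetime.strptime(value, fmt)


def weekofyear(value: str) -> int:
    """Week number of the date (ISO 8601 numbering assumed)."""
    return parse(value).isocalendar()[1]


def datediff(enddate: str, startdate: str) -> int:
    """Number of days from startdate to enddate."""
    return (parse(enddate).date() - parse(startdate).date()).days


def date_add(startdate: str, days: int) -> str:
    """Add a number of days to startdate."""
    return (parse(startdate).date() + timedelta(days=days)).isoformat()


def date_sub(startdate: str, days: int) -> str:
    """Subtract a number of days from startdate."""
    return date_add(startdate, -days)


checks = {
    "year": parse("1970-01-01 00:00:00").year,      # 1970
    "month": parse("1970-11-01").month,             # 11
    "day": parse("1970-11-01 00:00:00").day,        # 1
    "hour": parse("2009-07-30 12:58:59").hour,      # 12
    "weekofyear": weekofyear("1970-11-01"),         # 44
    "datediff": datediff("2009-03-01", "2009-02-27"),  # 2
    "date_add": date_add("2008-12-31", 1),          # '2009-01-01'
    "date_sub": date_sub("2008-12-31", 1),          # '2008-12-30'
}
```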
Using the Hive ODBC driver
Microsoft Excel
PowerPivot
Hive
HDInsight PowerShell for Hive
http://hadoopsdk.codeplex.com/wikipage?title=Job%20Submission%20PowerShell%20cmdlets&referringTitle=Home
How to Connect Excel to Windows Azure HDInsight via HiveODBC
http://www.windowsazure.com/en-us/manage/services/hdinsight/connect-excel-with-hive-ODBC/
LINQ to Hive
• Creates and compiles LINQ queries to use against Hive data
• Translates C# or F# LINQ queries into HiveQL queries and executes them on the Hadoop cluster
Working with LINQ Queries
Hive table
HDInsight
Hive
LINQ to Hive
LINQ to Hive
http://hadoopsdk.codeplex.com/wikipage?title=LINQ%20to%20Hive&referringTitle=Home
Using the Hadoop .NET SDK with the HDInsight Service
http://www.windowsazure.com/en-us/manage/services/hdinsight/howto-net-libraries/
Sqoop
Using Sqoop to Move Data
• A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
Using SQOOP to Copy Data
SQL Database
Windows Server HDInsight
Azure Blob storage
Hive and Sqoop
PowerShell Console
Windows Azure SQL Database
Apache Sqoop Reference
http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html
Hadoop on Windows Azure - Working With Data
http://www.windowsazure.com/en-us/develop/net/tutorials/hadoop-and-data/
Programming
• Microsoft .NET SDK for Hadoop
– WebHDFS Client
– WebHCat
• Windows PowerShell Integration
Developer Friends
Microsoft .NET SDK For Hadoop
• .NET client libraries for Hadoop
• Write MapReduce in Visual Studio using C# or F#
• Debug against local data
Microsoft Visual Studio
Slave Nodes
SDK components
• MapReduce library
• LINQ to Hive client library
• WebClient library
  – WebHDFS client library
  – WebHCat client library
WebClient Libraries in .NET
• WebHDFS client library: works with files in HDFS and Windows Azure Blob storage
• WebHCat client library: manages the scheduling and execution of jobs in an HDInsight cluster
WebHDFS
• Scalable REST API
• Move files in and out and delete from HDFS
• Perform file and directory functions
WebHCat
• HDInsight job scheduling and execution
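The WebHDFS operations listed above map onto plain HTTP requests. The Python sketch below only builds the method and URL for a few operations of the Apache Hadoop WebHDFS REST API and sends nothing. The host name and the sample paths are hypothetical, and the slides do not state how an HDInsight cluster exposes this endpoint; the .NET client library wraps calls of this kind.

```python
from urllib.parse import quote, urlencode

# Hypothetical endpoint; 50070 is the usual NameNode HTTP port in Hadoop 1.x/2.x.
BASE_URL = "http://namenode.example.com:50070/webhdfs/v1"

# HTTP verb required by each WebHDFS operation used here.
HTTP_METHODS = {
    "LISTSTATUS": "GET",   # list a directory
    "OPEN": "GET",         # read a file
    "MKDIRS": "PUT",       # create a directory
    "RENAME": "PUT",       # move a file
    "DELETE": "DELETE",    # delete a file or directory
}


def webhdfs_request(op, path, **params):
    """Return the (HTTP method, URL) pair for one WebHDFS operation."""
    query = urlencode({"op": op, **params})
    return HTTP_METHODS[op], f"{BASE_URL}{quote(path)}?{query}"


listing = webhdfs_request("LISTSTATUS", "/FlightDelay")
move = webhdfs_request("RENAME", "/tmp/a.csv", destination="/FlightDelay/a.csv")
removal = webhdfs_request("DELETE", "/tmp/old", recursive="true")
```

Keeping the verb table next to the URL builder makes the main pitfall visible: the same path can be read, moved, or deleted depending only on the op parameter and the HTTP method.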
Creating a Hive Table Using WebHDFS Client
Windows Server HDInsight
.NET Application (WebHDFS)
.NET application (WebHDFS)
to interact with
HDInsight cluster
Hive table
Load data
Copy data from
base machine
to Azure Storage
Performing a Remote Job with WebHCat
.NET application (WebHCat)
to interact with Hive tables
Query the Hive data
using .NET code
Hive table
Windows Server HDInsight
.NET Application (WebHCat)
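Submitting a remote Hive job through WebHCat comes down to one HTTP POST against the templeton/v1/hive resource. The Python sketch below builds such a request without sending it. The cluster name, user, status directory, and query are hypothetical, and authentication is left out; the demo itself uses the .NET WebHCat client rather than raw HTTP.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

# Hypothetical cluster address.
CLUSTER_URL = "https://mycluster.azurehdinsight.net"


def build_hive_job_request(query, user, status_dir):
    """Build (but do not send) a WebHCat request that submits a Hive query."""
    body = urlencode({
        "user.name": user,        # account the job runs as
        "execute": query,         # HiveQL text to run
        "statusdir": status_dir,  # directory where WebHCat stores job output
    }).encode("utf-8")
    return Request(f"{CLUSTER_URL}/templeton/v1/hive", data=body, method="POST")


request = build_hive_job_request(
    "SELECT MAX(Windspeed) FROM Weather_Data", "admin", "/example/status"
)
form = parse_qs(request.data.decode("utf-8"))  # decoded form fields, for inspection
```

The call returns immediately with a job id when actually sent; the query result is then read from the status directory once the job has finished.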
Windows PowerShell Integration
• Manage an HDInsight cluster using a local management console
• PowerShell scripts to build projects, import data into HDFS, and run samples
• Repeatable management through scripting
Integrating PowerShell with HDInsight
Windows PowerShell
Create a Cluster
Run MapReduce Program
Delete the Cluster
Windows Server HDInsight
PowerShell Integration
Microsoft .NET SDK For Hadoop
http://hadoopsdk.codeplex.com/
Managing Your HDInsight Cluster with PowerShell
http://hadoopsdk.codeplex.com/wikipage?title=PowerShell%20Cmdlets%20for%20Cluster%20Management&referringTitle=Home
Questions?