WPC010 – Introduction to Azure Data Lake
Andrea Uggetti – Microsoft - @matusa69
Francesco Diaz – Insight - @francedit
Agenda
• Data Lake concepts
• Introduction to Azure Data Lake
• DEMO
• Q/A
Data Lake definition
“A data lake is a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms, usually object blobs or files.”
But..."The main challenge is not creating a data lake, but taking advantage of the opportunities it presents."
Azure Data Lake
A service that enables developers, data scientists and analysts to store data of any size, shape and speed, and to perform analysis and processing across platforms and languages.
Azure Data Lake Store
Data Lake Store
A highly scalable, distributed, parallel file system in the cloud specifically designed to work with
multiple analytic frameworks
ADL Store works with HDInsight, ADL Analytics, Machine Learning, Spark, R, and devices.
Data Lake Store – key concepts
• Store any data in its native format
  • Without schema or transformation
  • Unstructured, semi-structured and structured
• Hadoop file system (HDFS) for the cloud
  • Built on YARN/WebHDFS
  • Exposes a WebHDFS-compatible REST API
  • Support for file/folder objects and operations
  • Integrates with HDInsight, Hortonworks, Cloudera
  • Accessible to all HDFS-compliant projects (Spark, Storm, R, …)
Data Lake Store – key concepts
• Enterprise grade
  • Highly available: data is automatically replicated
• No limits to scale
  • Unlimited account sizes
  • From bytes to petabytes
• Optimized for analytic workloads
  • Large throughput
  • Optimized for parallel computation
  • Automatically optimizes for throughput
Data Lake Store – Management and data security
• Auditing
  • Audit logs for all operations
  • Stored in text format (JSON)
  • Logs accessible via U-SQL scripts
• Access control
  • POSIX-compliant access control lists (ACLs)
  • Built-in groups (Owner, Contributor, Reader)
  • Read, Write, Execute permissions
  • ACLs on files and folders
  • Integrated with Azure Active Directory
  • IP filtering
• Encryption at rest
  • Transparent server-side encryption
  • Azure-managed keys (Azure Key Vault)
  • Customer-managed keys
Data Lake Store – Ingress
Data can be ingested into Azure Data Lake Store from a variety of sources
Sources include: Azure Event Hubs, Apache Flume, the .NET SDK, the JavaScript CLI, the Azure portal, Azure PowerShell, Azure Data Factory, Apache Sqoop, the built-in copy service, Azure Storage blobs, Azure SQL Data Warehouse, Azure Table storage, Azure SQL Database, on-premises databases, server logs, and custom programs.
Data Lake Store – Egress
Data can be exported from Azure Data Lake Store into numerous targets/sinks
Targets/sinks include: Apache Flume, the .NET SDK, the JavaScript CLI, the Azure portal, Azure PowerShell, Azure Data Factory, Apache Sqoop, Azure SQL Data Warehouse, Azure Table storage, Azure SQL Database, on-premises databases, server logs, and custom programs.
Copying data with Data Lake
Sources: Azure SQL Database, Azure SQL Data Warehouse, Azure DocumentDB, Azure Data Lake Store, SQL Server, file system, Oracle database, MySQL database, DB2 database, Teradata database, Sybase database, PostgreSQL database, Azure Blob, Azure Table
Sinks: Azure SQL Database, Azure SQL Data Warehouse, Azure DocumentDB, Azure Data Lake Store, SQL Server, Azure Blob, Azure Table
Copying data with Data Lake
Azure Data Factory
• Copy Activity
ADL tool
• ADLCopy
Open Source Tools
• Apache Sqoop
• Runs on HDInsight clusters
• Transfer data between relational databases and ADLS
• DistCp
• Runs on HDInsight clusters
• Transfer data between Amazon S3, Azure Blob Storage and ADLS
Different ingestion methods depending on the source of the data
• Ad hoc data: Azure Portal; Azure PowerShell; Azure Cross-Platform CLI; Data Lake Tools for Visual Studio; Azure Data Factory; AdlCopy tool; DistCp running on an HDInsight cluster
• Streamed data: Azure Stream Analytics; Azure HDInsight Storm; EventProcessorHost
• Relational data: Azure Data Factory; Apache Sqoop
• Web server log data: Azure Cross-Platform CLI; PowerShell; Data Lake Store .NET SDK; Data Factory
• On-premises files: Data Factory; PowerShell + Azure Cross-Platform CLI; Data Lake Store .NET SDK
Copying data with Data Lake
ADLCopy
• Download http://aka.ms/downloadadlcopy
Syntax:
AdlCopy /Source <Blob or Data Lake Store source> /Dest <Data Lake Store destination> /SourceKey <Key for Blob account> /Account <Data Lake Analytics account> /Units <Number of Analytics units> /Pattern <RegEx pattern>
• Account: the Data Lake Analytics account used to copy GB/TB-scale data through Analytics
• Units: the Data Lake Analytics units used by AdlCopy; mandatory when /Account is specified
• Pattern: regular expression that indicates which blobs/files to copy
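An example invocation, with hypothetical storage and Data Lake account names (replace the key placeholder with the Blob account key):

AdlCopy /Source https://mystorageaccount.blob.core.windows.net/logs/ /Dest swebhdfs://myadlsaccount.azuredatalakestore.net/logs/ /SourceKey <storage-account-key> /Account myadlaaccount /Units 2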
Copying data with Data Lake
DEMO
Azure Data Lake Store
• Copy file
• Security overview
Data Lake Analytics: U-SQL
What is U-SQL?
U-SQL is a query language designed specifically for big data workloads. It combines a SQL-like declarative language with the extensibility and programmability provided by C#.
How U-SQL was born
• Based on Microsoft's experience with SCOPE (http://www.vldb.org/pvldb/1/1454166.pdf), plus T-SQL, ANSI SQL, Hive and C#
• The need: analyze massive data sets
The Map-Reduce model limitations
«The programmer provides a map function that performs grouping and a reduce function that performs aggregation. The underlying run-time system achieves parallelism by partitioning the data and processing different partitions concurrently using multiple machines.
However, this model has its own set of limitations. Users are forced to map their applications to the map-reduce model in order to achieve parallelism. For some applications this mapping is very unnatural. »
How U-SQL was born
• SCOPE and COSMOS (http://www.vldb.org/pvldb/1/1454166.pdf)
Data Lake Analytics – Why U-SQL
Integrates the strengths of SQL and big data programming
• Declarativity, optimizability and parallelizability of SQL
• Extensibility, expressiveness and familiarity of C#
U-SQL makes it easy to
• Process unstructured and structured data
• Combine declarative SQL with custom imperative code (C#)
• Run local and remote queries
Increase productivity and agility
• Using familiar tools (Visual Studio)
• Leverage your SQL/.NET skills
Data Lake Analytics – Usage scenarios
• Achieve the same programming experience in batch or interactive mode
• Prepare data for other users (LETS & share)
  • As unstructured data
  • As structured data
• Large-scale custom processing with custom code
• Supplement big data with high-value data from where it lives
• Schematizing unstructured data (Load-Extract-Transform-Store) for analysis
U-SQL Queries: General pattern
Read → Process → Store
• Read: EXTRACT data from Azure Data Lake, Azure Storage blobs or Azure SQL DB into rowsets
• Process: transform rowsets with SELECT … FROM … WHERE … expressions
• Store: OUTPUT rowsets to files in Azure Data Lake or Azure Storage blobs, or INSERT them into tables
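A minimal U-SQL sketch of the pattern, assuming a hypothetical input file and columns:

// Read: impose a schema on the file at extraction time (schema-on-read)
@searchlog =
    EXTRACT UserId int, Region string, Duration int
    FROM "/input/SearchLog.tsv"
    USING Extractors.Tsv();

// Process: transform the rowset with SQL-like expressions
@result =
    SELECT Region, SUM(Duration) AS TotalDuration
    FROM @searchlog
    GROUP BY Region;

// Store: write the result back to files in the lake
OUTPUT @result
TO "/output/TotalDurationByRegion.tsv"
USING Outputters.Tsv();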
Anatomy of a U-SQL query
REFERENCE ASSEMBLY WebLogExtASM;

@rs =
    EXTRACT UserID string,
            Start DateTime,
            End DateTime,
            Region string,
            SitesVisited string,
            PagesVisited string
    FROM "swebhdfs://Logs/WebLogRecords.txt"
    USING WebLogExtractor();

@result =
    SELECT UserID, (End.Subtract(Start)).TotalSeconds AS Duration
    FROM @rs
    ORDER BY Duration DESC FETCH 10;

OUTPUT @result
TO "swebhdfs://Logs/Results/top10.txt"
USING Outputters.Tsv();
Notes on the query:
• U-SQL types are the same as C# types
• The structure (schema) is first imposed when the data is extracted/read from the file (schema-on-read)
• The query returns the top 10 log records by Duration (End time minus Start time), sorted in descending order of Duration
• Input is read from a file in ADL; WebLogExtractor() is a custom function that reads the input file
• (End.Subtract(Start)).TotalSeconds is a C# expression
• Output is stored in a file in ADL; Outputters.Tsv() is a built-in function that writes the output in TSV format
• A rowset (@rs, @result) is conceptually like an intermediate table
Logical Expression Tree
• U-SQL does not assign the rowset to a variable; the variable is a name for the expression being assigned to it
• U-SQL does not execute each statement sequentially; it composes bigger and bigger statements until there is one large expression tree that can be optimized and executed (lambda expression composition) – see the sketch below
• Execution happens in parallel, with no intermediate state
• Intermediate results can be investigated using the declarative framework (e.g. writing to a file) or external workflows
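A minimal sketch of this composition, assuming a hypothetical input file and columns; neither rowset assignment runs on its own, the whole script becomes one expression tree that executes when the OUTPUT is reached:

@log =
    EXTRACT Region string, Duration int
    FROM "/input/SearchLog.tsv"
    USING Extractors.Tsv();

@perRegion =
    SELECT Region, SUM(Duration) AS TotalDuration
    FROM @log
    GROUP BY Region;

@busyRegions =
    SELECT Region, TotalDuration
    FROM @perRegion
    WHERE TotalDuration > 3600;   // composed with the statements above, not executed separately

OUTPUT @busyRegions
TO "/output/BusyRegions.tsv"
USING Outputters.Tsv();           // only here is the full tree optimized and run in parallel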
U-SQL data types

Category: Types
• Numeric: byte, byte?, sbyte, sbyte?, int, int?, uint, uint?, long, long?, ulong, ulong?, short, short?, ushort, ushort?, float, float?, double, double?, decimal, decimal?
• Text: char, char?, string
• Complex: SQL.MAP<K,V>, SQL.ARRAY<T>
• Temporal: DateTime, DateTime?
• Other: bool, bool?, Guid, Guid?, byte[]

Note: nullable types have to be declared with a question mark '?'
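A small sketch of a nullable column in a schema-on-read EXTRACT, assuming a hypothetical file and columns:

// Score is declared as int?, so it may hold null values; Id and Comment are non-nullable
@rows =
    EXTRACT Id int,
            Score int?,
            Comment string
    FROM "/input/scores.tsv"
    USING Extractors.Tsv();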
Reading files with custom formats
Use built-in extractors to read CSV and TSV files, or create custom extractors for different formats.

1. Implement the IExtractor interface

using Microsoft.SCOPE.Interfaces;
public class WebLogExtractor : IExtractor {
    public override IEnumerable<IRow> Extract(…) {
        …
    }
    …
}

2. Upload and register the assembly

CREATE ASSEMBLY WebLogExtAsm FROM @"/WebLogExtAsm.dll" WITH PERMISSION_SET = RESTRICTED;
CREATE EXTRACTOR WebLogExtractor EXTERNAL NAME WebLogExtractor;

3. Reference the assembly and use the extractor

REFERENCE ASSEMBLY WebLogExtAsm;
// now just use it like a built-in extractor
SELECT * FROM @"swebhdfs://Logs/WebRecords.txt" USING WebLogExtractor();
WebLogExtractor: Implementation
Custom extractor for the WebLogRecords.txt file, where each field value is enclosed in angle brackets:

using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

namespace Demo {
    [SqlUserDefinedExtractor]
    public class MyExtractor : IExtractor {
        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output) {
            string line;
            Regex itemRegex = new Regex(@"\<(.*?)\>", RegexOptions.Compiled);   // matches values wrapped in <...>
            string[] tokens = { "UserId", "Start", "End", "Region", "SitesVisited", "PagesVisited" };
            var reader = new StreamReader(input.BaseStream);
            while ((line = reader.ReadLine()) != null) {
                int i = 0;
                foreach (Match itemMatch in itemRegex.Matches(line)) {
                    output.Set<string>(tokens[i], itemMatch.Groups[1].Value);
                    i++;
                }
                yield return output.AsReadOnly();
            }
        }
    }
}
Tables
CREATE TABLE

CREATE TABLE T (
    col1 int,
    col2 string,
    col3 SQL.MAP<string,string>,
    INDEX idx CLUSTERED (col1 ASC) DISTRIBUTED BY HASH (col1)
);

• Structured data
• Built-in data types only (no UDTs)
• Clustered index (needs to be specified): row-oriented
• Fine-grained distribution (needs to be specified): HASH, DIRECT HASH, RANGE, ROUND ROBIN

CREATE TABLE AS SELECT

CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …;
CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT …;
CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT);

• Infers the schema from the query
• Still requires index and partitioning

ALTER TABLE ADD/DROP COLUMN

ALTER TABLE T ADD COLUMN eventName string;
ALTER TABLE T DROP COLUMN col3;
ALTER TABLE T ADD COLUMN result string, clientId string, payload int?;
ALTER TABLE T DROP COLUMN clientId, result;

• Metadata-only operation
• Existing rows will get:
  • Non-nullable types: the C# default value of the type (e.g., int will be 0)
  • Nullable types: null
Improve performance with TABLEs
Create a table for the WebLogRecords.txt data:

CREATE TABLE LogRecordsTable (
    UserId int,
    Start DateTime,
    End DateTime,
    Region string,
    INDEX idx CLUSTERED (Region ASC) PARTITIONED BY HASH (Region)
);

Populate the table, selecting only the required fields:

INSERT INTO LogRecordsTable
SELECT UserID, Start, End, Region
FROM @rs;

Run the query directly against the table:

@result =
    SELECT UserId, (End.Subtract(Start)).TotalSeconds AS Duration
    FROM LogRecordsTable
    ORDER BY Duration DESC FETCH 10;

OUTPUT @result
TO "adls://Logs/Results/Top10.Tsv"
USING Outputters.Tsv();

Query 4 – improves the performance of Query 1.
Azure Data Lake
U-SQL extensibility
Extend U-SQL with C#/.NET in Visual Studio
Built-in operators, functions, aggregates
C# expressions (in SELECT statements)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs) – see the sketch below
User-defined operators (UDOs)
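A minimal UDF sketch, assuming a hypothetical code-behind class and reusing the @rs rowset from the earlier web-log example; the C# method lives in the script's code-behind file and is called like any other C# expression:

// C# code-behind (e.g. Script.usql.cs) – hypothetical namespace, class and method names
namespace Demo
{
    public static class Formatting
    {
        // Trim and upper-case a region code; treat empty values as "UNKNOWN"
        public static string NormalizeRegion(string region)
        {
            return string.IsNullOrWhiteSpace(region) ? "UNKNOWN" : region.Trim().ToUpperInvariant();
        }
    }
}

// U-SQL: call the UDF inside a SELECT expression
@cleaned =
    SELECT UserID,
           Demo.Formatting.NormalizeRegion(Region) AS Region
    FROM @rs;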
Federated queries: Query data where it lives
Benefits
Avoid moving large amounts of data across the network between stores
Single view of data irrespective of physical location
Minimize data proliferation issues caused by maintaining multiple copies
Single query language for all data
Each data store maintains its own sovereignty
Design choices based on the need
Azure Data Lake Analytics sends U-SQL queries to, and receives results from, Azure Storage blobs, SQL Server in Azure VMs and Azure SQL DB.
Easily query data in multiple Azure data stores without moving it to a single store.
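A hedged sketch of a federated query against a hypothetical Azure SQL DB table; the data source, credential and table names are made up, the credential must already exist in the U-SQL catalog, and the exact options should be checked against the U-SQL reference (http://aka.ms/usql_reference):

// Register the remote Azure SQL DB as a data source (hypothetical names)
CREATE DATA SOURCE MySqlDbSource
FROM AZURESQLDB
WITH (
    PROVIDER_STRING = "Database=SalesDb;Trusted_Connection=False;Encrypt=True",
    CREDENTIAL = SalesDb.MyCred,
    REMOTABLE_TYPES = (bool, string, int, DateTime)
);

// Query the table where it lives; only the results move back to the lake
@customers =
    SELECT *
    FROM EXTERNAL MySqlDbSource LOCATION "dbo.Customers";

OUTPUT @customers
TO "/output/customers.tsv"
USING Outputters.Tsv();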
What can you do in the Azure Portal?
Author U-SQL scripts
Submit U-SQL jobs
Cancel running jobs
Provision users who can submit jobs
Visualize usage stats (compute hours)
Visualize job management chart
Create a new Data Lake Analytics account
DEMO
• Transform text files
• Load transformed text files into table
• Aggregate functions
• Produce an output table
• Job execution graph
• Job execution “progress” playback
Session Recap
• Data Lake concepts
• Introduction to Azure Data Lake
• DEMO
• Q/A
Additional resources – blogs and community pages
http://usql.io
http://blogs.msdn.com/b/visualstudio/
http://azure.microsoft.com/en-us/blog/topics/big-data/
https://channel9.msdn.com/Search?term=U-SQL#ch9Search
http://aka.ms/usql_reference
https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
The bible: https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide
http://aka.ms/adlfeedback
https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
http://stackoverflow.com/questions/tagged/u-sql
Q/A
THANK YOU!
[email protected]@matusa69
[email protected]@francedithttp://francescodiaz.azurewebsites.net
OverNetEducation contacts
OverNet Education – [email protected]
www.overneteducation.it
Tel. 02 365738
@overnete
www.facebook.com/OverNetEducation
www.linkedin.com/company/overnet-solutions
www.wpc2016.it