WPC010 – Introduction to Azure Data Lake
Andrea Uggetti – Microsoft - @matusa69
Francesco Diaz – Insight - @francedit
Agenda
• Data Lake concepts
• Introduction to Azure Data Lake
• DEMO
• Q/A
Data Lake definition
“A data lake is a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms, usually object blobs or files.”
But..."The main challenge is not creating a data lake, but taking advantage of the opportunities it presents."
Azure Data Lake
A service that enables developers, data scientists and analysts to store data of any size, shape and speed, and to perform analysis and processing across platforms and languages.
Azure Data Lake Store
Data Lake Store
A highly scalable, distributed, parallel file system in the cloud specifically designed to work with
multiple analytic frameworks
ADL Store works with HDInsight, ADL Analytics, Machine Learning, Spark, R, and devices.
Data Lake Store – key concepts
• Store any data in its native format
  • Without schema or transformation
  • Unstructured, semi-structured and structured
• Hadoop file system (HDFS) for the cloud
  • Built on YARN/WebHDFS
  • Exposes a WebHDFS-compatible REST API
  • Support for file/folder objects and operations
  • Integrates with HDInsight, Hortonworks, Cloudera
  • Accessible to all HDFS-compliant projects (Spark, Storm, R, …)
Data Lake Store – key concepts
• Enterprise grade
  • Highly available: data is automatically replicated
• No limits to scale
  • Unlimited account sizes
  • From bytes to petabytes
• Optimized for analytic workloads
  • Large throughput
  • Optimized for parallel computation
  • Automatically optimizes for throughput
Data Lake Store – Management and data security
• Auditing
  • Audit logs for all operations
  • Stored in text format (JSON)
  • Logs accessible via U-SQL scripts
• Access control
  • POSIX-compliant access control lists (ACLs)
  • Built-in groups (Owner, Contributor, Reader)
  • Read, Write, Execute permissions
  • ACLs on files and folders
  • Integrated with Azure Active Directory
  • IP filtering
• Encryption at rest
  • Transparent server-side encryption
  • Azure-managed keys (Azure Key Vault)
  • Customer-managed keys
Data Lake Store – Ingress
Data can be ingested into Azure Data Lake Store from a variety of sources
Sources include: Azure Event Hubs, Apache Flume, the .NET SDK, the JavaScript CLI, the Azure portal, Azure PowerShell, Azure Data Factory, Apache Sqoop, the built-in copy service, Azure Storage blobs, Azure SQL Data Warehouse, Azure Table storage, Azure SQL Database, on-premises databases, server logs, and custom programs.
Data Lake Store – Egress
Data can be exported from Azure Data Lake Store into numerous targets/sinks
Targets/sinks include: Apache Flume, the .NET SDK, the JavaScript CLI, the Azure portal, Azure PowerShell, Azure Data Factory, Apache Sqoop, Azure SQL Data Warehouse, Azure Table storage, Azure SQL Database, on-premises databases, server logs, and custom programs.
Copying data with Data Lake
Sources: Azure SQL Database, Azure SQL Data Warehouse, Azure DocumentDB, Azure Data Lake Store, SQL Server, file system, Oracle database, MySQL database, DB2 database, Teradata database, Sybase database, PostgreSQL database, Azure Blob, Azure Table
Sinks: Azure SQL Database, Azure SQL Data Warehouse, Azure DocumentDB, Azure Data Lake Store, SQL Server, Azure Blob, Azure Table
Copying data with Data Lake
Azure Data Factory
• Copy Activity
ADL tool
• ADLCopy
Open Source Tools
• Apache Sqoop
• Runs on HDInsight clusters
• Transfer data between relational databases and ADLS
• DistCp
• Runs on HDInsight clusters
• Transfer data between Amazon S3, Azure Blob Storage and ADLS
Different ingestion methods depending on the source of the data
• Ad hoc data: Azure Portal; Azure PowerShell; Azure Cross-Platform CLI; Data Lake Tools for Visual Studio; Azure Data Factory; AdlCopy tool; DistCp running on an HDInsight cluster
• Streamed data: Azure Stream Analytics; Azure HDInsight Storm; EventProcessorHost
• Relational data: Azure Data Factory; Apache Sqoop
• Web server log data: Azure Cross-Platform CLI; PowerShell; Data Lake Store .NET SDK; Data Factory
• On-premises files: Data Factory; PowerShell + Azure Cross-Platform CLI; Data Lake Store .NET SDK
Copying data with Data Lake
ADLCopy
• Download http://aka.ms/downloadadlcopy
Syntax:
AdlCopy /Source <Blob or Data Lake Store source> /Dest <Data Lake Store destination> /SourceKey <Key for Blob account> /Account <Data Lake Analytics account> /Units <Number of Analytics units> /Pattern <RegEx pattern>
• Account: the Data Lake Analytics account used to copy GB/TB-scale data through Analytics
• Units: the Data Lake Analytics units used by AdlCopy; mandatory when /Account is specified
• Pattern: regular expression that indicates which blobs/files to copy
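An example invocation, with hypothetical storage and Data Lake account names (replace the key placeholder with the Blob account key):

AdlCopy /Source https://mystorageaccount.blob.core.windows.net/logs/ /Dest swebhdfs://myadlsaccount.azuredatalakestore.net/logs/ /SourceKey <storage-account-key> /Account myadlaaccount /Units 2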
Copying data with Data Lake
DEMO
Azure Data Lake Store
• Copy file
• Security overview
Data Lake Analytics: U-SQL
What is U-SQL?
U-SQL is a query language designed specifically for big data workloads. It combines a SQL-like declarative language with the extensibility and programmability provided by C#.
How U-SQL was born
• Based on Microsoft's experience with SCOPE (http://www.vldb.org/pvldb/1/1454166.pdf), plus T-SQL, ANSI SQL, Hive and C#
• The need: analyze massive data sets
The Map-Reduce model limitations
«The programmer provides a map function that performs grouping and a reduce function that performs aggregation. The underlying run-time system achieves parallelism by partitioning the data and processing different partitions concurrently using multiple machines.
However, this model has its own set of limitations. Users are forced to map their applications to the map-reduce model in order to achieve parallelism. For some applications this mapping is very unnatural. »
How U-SQL was born
• SCOPE and COSMOS (http://www.vldb.org/pvldb/1/1454166.pdf)
Data Lake Analytics – Why U-SQL
Integrates the strengths of SQL and big data programming
• Declarativity, optimizability and parallelizability of SQL
• Extensibility, expressiveness and familiarity of C#
U-SQL makes it easy to
• Process unstructured and structured data
• Combine declarative SQL with custom imperative code (C#)
• Run local and remote queries
Increase productivity and agility
• Using familiar tools (Visual Studio)
• Leverage your SQL/.NET skills
Data Lake Analytics – Usage scenarios
• Achieve the same programming experience in batch or interactive mode
• Prepare data for other users (LETS & share)
  • As unstructured data
  • As structured data
• Large-scale custom processing with custom code
• Supplement big data with high-value data from where it lives
• Schematizing unstructured data (Load-Extract-Transform-Store) for analysis
U-SQL Queries: General pattern
Read → Process → Store
• Read: EXTRACT data from Azure Data Lake, Azure Storage blobs or Azure SQL DB into rowsets
• Process: transform rowsets with SELECT … FROM … WHERE … expressions
• Store: OUTPUT rowsets to files in Azure Data Lake or Azure Storage blobs, or INSERT them into tables
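A minimal U-SQL sketch of the pattern, assuming a hypothetical input file and columns:

// Read: impose a schema on the file at extraction time (schema-on-read)
@searchlog =
    EXTRACT UserId int, Region string, Duration int
    FROM "/input/SearchLog.tsv"
    USING Extractors.Tsv();

// Process: transform the rowset with SQL-like expressions
@result =
    SELECT Region, SUM(Duration) AS TotalDuration
    FROM @searchlog
    GROUP BY Region;

// Store: write the result back to files in the lake
OUTPUT @result
TO "/output/TotalDurationByRegion.tsv"
USING Outputters.Tsv();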
Anatomy of a U-SQL query
REFERENCE ASSEMBLY WebLogExtASM;

@rs =
    EXTRACT UserID string,
            Start DateTime,
            End DateTime,
            Region string,
            SitesVisited string,
            PagesVisited string
    FROM "swebhdfs://Logs/WebLogRecords.txt"
    USING WebLogExtractor();

@result =
    SELECT UserID, (End.Subtract(Start)).TotalSeconds AS Duration
    FROM @rs
    ORDER BY Duration DESC FETCH 10;

OUTPUT @result
TO "swebhdfs://Logs/Results/top10.txt"
USING Outputters.Tsv();
Notes on the query:
• U-SQL types are the same as C# types
• The structure (schema) is first imposed when the data is extracted/read from the file (schema-on-read)
• The query returns the top 10 log records by Duration (End time minus Start time), sorted in descending order of Duration
• Input is read from a file in ADL; WebLogExtractor() is a custom function that reads the input file
• (End.Subtract(Start)).TotalSeconds is a C# expression
• Output is stored in a file in ADL; Outputters.Tsv() is a built-in function that writes the output in TSV format
• A rowset (@rs, @result) is conceptually like an intermediate table
Logical Expression Tree
• U-SQL does not assign the rowset to a variable; the variable is a name for the expression being assigned to it
• U-SQL does not execute each statement sequentially; it composes bigger and bigger statements until there is one large expression tree that can be optimized and executed (lambda expression composition) – see the sketch below
• Execution happens in parallel, with no intermediate state
• Intermediate results can be investigated using the declarative framework (e.g. writing to a file) or external workflows
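A minimal sketch of this composition, assuming a hypothetical input file and columns; neither rowset assignment runs on its own, the whole script becomes one expression tree that executes when the OUTPUT is reached:

@log =
    EXTRACT Region string, Duration int
    FROM "/input/SearchLog.tsv"
    USING Extractors.Tsv();

@perRegion =
    SELECT Region, SUM(Duration) AS TotalDuration
    FROM @log
    GROUP BY Region;

@busyRegions =
    SELECT Region, TotalDuration
    FROM @perRegion
    WHERE TotalDuration > 3600;   // composed with the statements above, not executed separately

OUTPUT @busyRegions
TO "/output/BusyRegions.tsv"
USING Outputters.Tsv();           // only here is the full tree optimized and run in parallel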
U-SQL data types

Category: Types
• Numeric: byte, byte?, sbyte, sbyte?, int, int?, uint, uint?, long, long?, ulong, ulong?, short, short?, ushort, ushort?, float, float?, double, double?, decimal, decimal?
• Text: char, char?, string
• Complex: SQL.MAP<K,V>, SQL.ARRAY<T>
• Temporal: DateTime, DateTime?
• Other: bool, bool?, Guid, Guid?, byte[]

Note: nullable types have to be declared with a question mark '?'
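A small sketch of a nullable column in a schema-on-read EXTRACT, assuming a hypothetical file and columns:

// Score is declared as int?, so it may hold null values; Id and Comment are non-nullable
@rows =
    EXTRACT Id int,
            Score int?,
            Comment string
    FROM "/input/scores.tsv"
    USING Extractors.Tsv();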
Reading files with custom formats
Use built-in extractors to read CSV and TSV files, or create custom extractors for different formats.

1. Implement the IExtractor interface

using Microsoft.SCOPE.Interfaces;
public class WebLogExtractor : IExtractor {
    public override IEnumerable<IRow> Extract(…) {
        …
    }
    …
}

2. Upload and register the assembly

CREATE ASSEMBLY WebLogExtAsm FROM @"/WebLogExtAsm.dll" WITH PERMISSION_SET = RESTRICTED;
CREATE EXTRACTOR WebLogExtractor EXTERNAL NAME WebLogExtractor;

3. Reference the assembly and use the extractor

REFERENCE ASSEMBLY WebLogExtAsm;
// now just use it like a built-in extractor
SELECT * FROM @"swebhdfs://Logs/WebRecords.txt" USING WebLogExtractor();
WebLogExtractor: Implementation
Custom extractor for the WebLogRecords.txt file, where each field value is enclosed in angle brackets:

using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

namespace Demo {
    [SqlUserDefinedExtractor]
    public class MyExtractor : IExtractor {
        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output) {
            string line;
            Regex itemRegex = new Regex(@"\<(.*?)\>", RegexOptions.Compiled);   // matches values wrapped in <...>
            string[] tokens = { "UserId", "Start", "End", "Region", "SitesVisited", "PagesVisited" };
            var reader = new StreamReader(input.BaseStream);
            while ((line = reader.ReadLine()) != null) {
                int i = 0;
                foreach (Match itemMatch in itemRegex.Matches(line)) {
                    output.Set<string>(tokens[i], itemMatch.Groups[1].Value);
                    i++;
                }
                yield return output.AsReadOnly();
            }
        }
    }
}
Tables
CREATE TABLE

CREATE TABLE T (
    col1 int,
    col2 string,
    col3 SQL.MAP<string,string>,
    INDEX idx CLUSTERED (col1 ASC) DISTRIBUTED BY HASH (col1)
);

• Structured data
• Built-in data types only (no UDTs)
• Clustered index (needs to be specified): row-oriented
• Fine-grained distribution (needs to be specified): HASH, DIRECT HASH, RANGE, ROUND ROBIN

CREATE TABLE AS SELECT

CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …;
CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT …;
CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT);

• Infers the schema from the query
• Still requires index and partitioning

ALTER TABLE ADD/DROP COLUMN

ALTER TABLE T ADD COLUMN eventName string;
ALTER TABLE T DROP COLUMN col3;
ALTER TABLE T ADD COLUMN result string, clientId string, payload int?;
ALTER TABLE T DROP COLUMN clientId, result;

• Metadata-only operation
• Existing rows will get:
  • Non-nullable types: the C# default value of the type (e.g., int will be 0)
  • Nullable types: null
Improve performance with TABLEs
Create a table for the WebLogRecords.txt data:

CREATE TABLE LogRecordsTable (
    UserId int,
    Start DateTime,
    End DateTime,
    Region string,
    INDEX idx CLUSTERED (Region ASC) PARTITIONED BY HASH (Region)
);

Populate the table, selecting only the required fields:

INSERT INTO LogRecordsTable
SELECT UserID, Start, End, Region
FROM @rs;

Run the query directly against the table:

@result =
    SELECT UserId, (End.Subtract(Start)).TotalSeconds AS Duration
    FROM LogRecordsTable
    ORDER BY Duration DESC FETCH 10;

OUTPUT @result
TO "adls://Logs/Results/Top10.Tsv"
USING Outputters.Tsv();

Query 4 – improves the performance of Query 1.
Azure Data Lake
U-SQL extensibility
Extend U-SQL with C#/.NET in Visual Studio
Built-in operators, functions, aggregates
C# expressions (in SELECT statements)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs) – see the sketch below
User-defined operators (UDOs)
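A minimal UDF sketch, assuming a hypothetical code-behind class and reusing the @rs rowset from the earlier web-log example; the C# method lives in the script's code-behind file and is called like any other C# expression:

// C# code-behind (e.g. Script.usql.cs) – hypothetical namespace, class and method names
namespace Demo
{
    public static class Formatting
    {
        // Trim and upper-case a region code; treat empty values as "UNKNOWN"
        public static string NormalizeRegion(string region)
        {
            return string.IsNullOrWhiteSpace(region) ? "UNKNOWN" : region.Trim().ToUpperInvariant();
        }
    }
}

// U-SQL: call the UDF inside a SELECT expression
@cleaned =
    SELECT UserID,
           Demo.Formatting.NormalizeRegion(Region) AS Region
    FROM @rs;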
Federated queries: Query data where it lives
Benefits
Avoid moving large amounts of data across the network between stores
Single view of data irrespective of physical location
Minimize data proliferation issues caused by maintaining multiple copies
Single query language for all data
Each data store maintains its own sovereignty
Design choices based on the need
Azure Data Lake Analytics sends U-SQL queries to, and receives results from, Azure Storage blobs, SQL Server in Azure VMs and Azure SQL DB.
Easily query data in multiple Azure data stores without moving it to a single store.
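A hedged sketch of a federated query against a hypothetical Azure SQL DB table; the data source, credential and table names are made up, the credential must already exist in the U-SQL catalog, and the exact options should be checked against the U-SQL reference (http://aka.ms/usql_reference):

// Register the remote Azure SQL DB as a data source (hypothetical names)
CREATE DATA SOURCE MySqlDbSource
FROM AZURESQLDB
WITH (
    PROVIDER_STRING = "Database=SalesDb;Trusted_Connection=False;Encrypt=True",
    CREDENTIAL = SalesDb.MyCred,
    REMOTABLE_TYPES = (bool, string, int, DateTime)
);

// Query the table where it lives; only the results move back to the lake
@customers =
    SELECT *
    FROM EXTERNAL MySqlDbSource LOCATION "dbo.Customers";

OUTPUT @customers
TO "/output/customers.tsv"
USING Outputters.Tsv();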
What can you do in the Azure Portal?
Author U-SQL scripts
Submit U-SQL jobs
Cancel running jobs
Provision users who can submit jobs
Visualize usage stats (compute hours)
Visualize job management chart
Create a new Data Lake Analytics account
DEMO
• Transform text files
• Load transformed text files into table
• Aggregate functions
• Produce an output table
• Job execution graph
• Job execution “progress” playback
Session Recap
• Data Lake concepts
• Introduction to Azure Data Lake
• DEMO
• Q/A
Additional resources – blogs and community pages
http://usql.io
http://blogs.msdn.com/b/visualstudio/
http://azure.microsoft.com/en-us/blog/topics/big-data/
https://channel9.msdn.com/Search?term=U-SQL#ch9Search
http://aka.ms/usql_reference
https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
The bible: https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide
http://aka.ms/adlfeedback
https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
http://stackoverflow.com/questions/tagged/u-sql
Q/A
THANK YOU!
[email protected]@matusa69
[email protected]@francedithttp://francescodiaz.azurewebsites.net
OverNetEducation contacts
OverNet Education – [email protected]
www.overneteducation.it
Tel. 02 365738
@overnete
www.facebook.com/OverNetEducation
www.linkedin.com/company/overnet-solutions
www.wpc2016.it