Where is my Data? (In the Cloud)

Tamir Dresher

Senior Software Architect

May 19, 2014

About Me

• Software architect, consultant and instructor

• Software Engineering Lecturer @ Ruppin Academic Center

• Technology addict

• 10 years of experience

• .NET and Native Windows Programming

@tamir_dresher (http://www.twitter.com/tamir_dresher)

[email protected]

http://www.tamirdresher.com/

Agenda

• Storage

• Blob

• Azure SQL Server

• Azure Tables

• HDInsight

Storage

Where is my data Storage

Storage Prices

Types of information

Windows Azure Growing Global Presence

Data centers: North America, Europe, Asia Pacific

Storage SLA – 99.99% (at most 52.56 minutes of downtime per year)

http://azure.microsoft.com/en-us/support/legal/sla
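The 52.56-minute figure follows directly from the SLA percentage. A minimal sketch of the arithmetic, assuming a 365-day year; the function name AllowedDowntimeMinutesPerYear is hypothetical, and the snippet uses C# top-level statements:

```csharp
using System;
using System.Globalization;

// Minutes of downtime per (non-leap) year that an availability SLA still permits.
// The function name is illustrative and not part of any Azure API.
static double AllowedDowntimeMinutesPerYear(double availability)
{
    const double minutesPerYear = 365 * 24 * 60; // 525,600 minutes
    return (1.0 - availability) * minutesPerYear;
}

double downtime = AllowedDowntimeMinutesPerYear(0.9999);
Console.WriteLine(downtime.ToString("F2", CultureInfo.InvariantCulture) + " minutes per year");
```

For 99.99% availability this is 0.01% of 525,600 minutes, which is the 52.56 minutes quoted on the slide.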

AZURE BLOBS

What is a BLOB

• BLOB – Binary Large OBject

• Storage for any type of entity such as binary files and text documents

• Distributed File Service (DFS)

– Scalability and High availability

• A BLOB file is distributed across multiple servers and replicated at least 3 times

Blob Storage Concepts
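Blob storage is organized as account, then container, then blob, and that hierarchy shows up directly in a blob's address. A small sketch of how the address is composed, assuming the public endpoint format; the helper BuildBlobUri and the names "myaccount", "pictures" and "photo1.jpg" are hypothetical:

```csharp
using System;

// Composes the address of a blob from the three levels of the hierarchy:
// storage account -> container -> blob. BuildBlobUri is an illustrative helper,
// not part of the Azure Storage client library.
static Uri BuildBlobUri(string accountName, string containerName, string blobName)
{
    return new Uri($"https://{accountName}.blob.core.windows.net/{containerName}/{blobName}");
}

// Hypothetical account, container and blob names.
Uri blobUri = BuildBlobUri("myaccount", "pictures", "photo1.jpg");
Console.WriteLine(blobUri.AbsoluteUri);
```

The REST operations on a blob are issued against an address of this shape.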


Blob Operations

REST


DEMO: Creating a Blob

BLOBS

• Block blob - up to 200 GB in size

• Page blobs – up to 1 TB in size

• Total Account Capacity - 500 TB

• Pricing

– Storage capacity used

– Replication option (LRS, GRS, RA-GRS)

– Number of requests

– Data egress

– http://azure.microsoft.com/en-us/pricing/details/storage/


SQL AZURE

SQL Azure

• SQL Server in the cloud

• No administrative overheads

• High Availability

• Pay-as-you-grow pricing

• Familiar Development Model*

* Despite missing features and some limitations - http://msdn.microsoft.com/en-us/library/ff394115.aspx

DEMO: Creating and Using SQL Azure

SQL Azure – Pricing


Case Study - https://haveibeenpwned.com/

• http://www.troyhunt.com/2013/12/working-with-154-million-records-on.html

• How do I make querying 154 million email addresses as fast as possible?

• If I want 100GB of SQL Server and I want to hit it 10 million times, it’ll cost me $176 a month (now it's ~$20)

AZURE TABLES

Table Storage Concepts

Table Storage

• Not an RDBMS – no relationships between entities

– NoSQL

• An entity can have up to 255 properties and be up to 1MB in size

• Mandatory properties for every entity

– PartitionKey & RowKey (the only indexed properties)

  • Together they uniquely identify an entity

  • The same RowKey can be used in different partitions

  • Together they define the sort order

– Timestamp – optimistic concurrency
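The role of the two keys can be illustrated without the storage service at all. The sketch below is an in-memory stand-in (a SortedDictionary, not the Azure SDK) that assumes ordinal string ordering; it shows that the (PartitionKey, RowKey) pair must be unique, that a RowKey may repeat across partitions, and that entities come back ordered by PartitionKey and then RowKey. The entity values are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Order keys by PartitionKey first, then by RowKey (ordinal comparison is an assumption).
var byPartitionThenRow = Comparer<(string PartitionKey, string RowKey)>.Create((a, b) =>
{
    int byPartition = string.CompareOrdinal(a.PartitionKey, b.PartitionKey);
    return byPartition != 0 ? byPartition : string.CompareOrdinal(a.RowKey, b.RowKey);
});

// In-memory stand-in for a table: one entry per (PartitionKey, RowKey) pair.
var table = new SortedDictionary<(string PartitionKey, string RowKey), string>(byPartitionThenRow);

table.Add(("Smith", "Jeff"), "entity 1");
table.Add(("Harp", "Walter"), "entity 2");
table.Add(("Harp", "Jeff"), "entity 3"); // same RowKey as entity 1, different partition: allowed

// Re-using an existing (PartitionKey, RowKey) pair is rejected.
bool duplicateRejected = false;
try { table.Add(("Harp", "Walter"), "entity 4"); }
catch (ArgumentException) { duplicateRejected = true; }

// Entities enumerate sorted by PartitionKey, then RowKey.
foreach (var entry in table)
    Console.WriteLine($"{entry.Key.PartitionKey} / {entry.Key.RowKey} -> {entry.Value}");
Console.WriteLine($"Duplicate key rejected: {duplicateRejected}");
```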


No Fixed Schema


Table Object Model

• ITableEntity interface

– PartitionKey, RowKey, Timestamp, and ETag properties

– Implemented by TableEntity and DynamicTableEntity

// This class defines one additional property of integer type,
// since it derives from TableEntity it will be automatically
// serialized and deserialized.
public class SampleEntity : TableEntity
{
    public int SampleProperty { get; set; }
}


Sample – Inserting an Entity into a Table

// You will need the following using statements
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

// storageAccount is assumed to be an existing CloudStorageAccount,
// e.g. obtained with CloudStorageAccount.Parse(connectionString).
// Create the table client.
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
CloudTable peopleTable = tableClient.GetTableReference("people");
peopleTable.CreateIfNotExists();

// Create a new customer entity.
// CustomerEntity is assumed to derive from TableEntity and to use
// (lastName, firstName) as its PartitionKey and RowKey.
CustomerEntity customer1 = new CustomerEntity("Harp", "Walter");
customer1.Email = "[email protected]";
customer1.PhoneNumber = "425-555-0101";

// Create an operation to add the new customer to the people table.
TableOperation insertCustomer1 = TableOperation.Insert(customer1);

// Submit the operation to the table service.
peopleTable.Execute(insertCustomer1);


Retrieve

// Create the table client.
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
CloudTable peopleTable = tableClient.GetTableReference("people");

// Retrieve the entity with partition key of "Smith" and row key of "Jeff"
TableOperation retrieveJeffSmith =
    TableOperation.Retrieve<CustomerEntity>("Smith", "Jeff");

// Retrieve entity (Result is null when no such entity exists)
CustomerEntity specificEntity =
    (CustomerEntity)peopleTable.Execute(retrieveJeffSmith).Result;


Table Storage – Important Points

• Azure Tables can store TBs of data

• Table operations are fast

• Tables are distributed – the PartitionKey defines the partition

– A table might be stored in different partitions on different storage devices.


Pricing






• How do I make querying 154 million email addresses as fast as possible?

• [email protected] – the domain is the partition key and the alias is the row key

• If I want 100GB of storage and I want to hit it 10 million times, it’ll cost me $8 a month

• SQL Server would cost $176 a month - 22 times more expensive
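The key scheme from the case study can be sketched as a small helper that splits an address into its table keys. This is an illustration only: the function ToTableKeys, the sample address someone@example.com, and the lower-casing step are assumptions, not details taken from the case study.

```csharp
using System;

// Splits an email address into table keys: domain -> PartitionKey, alias -> RowKey.
// Lower-casing is an assumption made here so lookups are case-insensitive.
static (string PartitionKey, string RowKey) ToTableKeys(string email)
{
    int at = email.LastIndexOf('@');
    if (at <= 0 || at == email.Length - 1)
        throw new ArgumentException("Expected an address of the form alias@domain", nameof(email));

    string alias = email.Substring(0, at);
    string domain = email.Substring(at + 1);
    return (domain.ToLowerInvariant(), alias.ToLowerInvariant());
}

// Hypothetical address used only for illustration.
var keys = ToTableKeys("someone@example.com");
Console.WriteLine($"PartitionKey = {keys.PartitionKey}, RowKey = {keys.RowKey}");

// Monthly cost figures quoted on the slide: $176 (SQL Server) vs. $8 (Table storage).
decimal costRatio = 176m / 8m;
Console.WriteLine($"SQL Server is {costRatio} times more expensive");
```

With both keys known, a lookup is a single point query on (PartitionKey, RowKey), the only indexed properties, which is what makes this layout fast for checking one address.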



HDINSIGHT

Hadoop in the cloud

• Hadoop on Azure Cloud

• Some Facts:

– Bing ingests > 7 petabytes a month

– The Twitter community generates over 1 terabyte of tweets every day

– Cisco predicts that by 2013 annual internet traffic will reach 667 exabytes

Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp

MapReduce – The BigData Power

• Map – takes an input and outputs key-value pairs

(Key1, Value1), (Key2, Value2), …, (Keyn, Valuen)


• Reduce – takes the group of values for each key and produces a new group of values

Key1: [value1-1, value1-2, …] -> [new_value1-1, new_value1-2, …]

Key2: [value2-1, value2-2, …] -> [new_value2-1, new_value2-2, …]

…

Keyn: [valueN-1, valueN-2, …] -> [new_valueN-1, new_valueN-2, …]
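The two phases can be imitated in memory with LINQ to make the data flow concrete. This is only a single-process sketch, not how Hadoop executes a job: RunMapReduce is a hypothetical helper, and the sample job (counting how many friend lists mention each user, over lines of the form "user friend friend ...") is an illustration chosen for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Single-process imitation of MapReduce: map every line to key-value pairs,
// group the values per key (the "shuffle"), then reduce each group.
static List<KeyValuePair<string, string>> RunMapReduce(
    IEnumerable<string> inputLines,
    Func<string, IEnumerable<KeyValuePair<string, string>>> map,
    Func<string, IEnumerable<string>, IEnumerable<string>> reduce)
{
    return inputLines
        .SelectMany(line => map(line))
        .GroupBy(pair => pair.Key, pair => pair.Value)
        .OrderBy(group => group.Key, StringComparer.Ordinal)
        .SelectMany(group => reduce(group.Key, group),
                    (group, newValue) => new KeyValuePair<string, string>(group.Key, newValue))
        .ToList();
}

string[] lines =
{
    "U1 U2 U3 U4",
    "U2 U1 U3 U4 U5",
    "U3 U1 U2 U4 U5",
    "U4 U1 U2 U3 U5",
    "U5 U2 U3 U4",
};

// Map: emit (friend, "1") for every friend on the line.
// Reduce: count the values collected for each key.
var mentionCounts = RunMapReduce(
    lines,
    line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Skip(1)
                .Select(friend => new KeyValuePair<string, string>(friend, "1")),
    (key, values) => new[] { values.Count().ToString() });

foreach (var pair in mentionCounts)
    Console.WriteLine($"{pair.Key}: {pair.Value}");
```

The grouping step in the middle is what guarantees that the reducer sees all values for one key together.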


MapReduce - How Does It Work?

Finding common friends

• Facebook shows you how many common friends you have with someone

• Facebook had 1,310,000,000 active users with 130 friends on average (as of 01.01.2014)

• Calculating the mutual friends


Finding common friends

• We can represent a friend relationship as:

Someone -> [list of his/her friends]

• Note that a friend relationship is symmetrical

– if A is a friend of B then B is a friend of A

Example of Friends file

• U1 -> U2 U3 U4

• U2 -> U1 U3 U4 U5

• U3 -> U1 U2 U4 U5

• U4 -> U1 U2 U3 U5

• U5 -> U2 U3 U4

Designing our MapReduce job - Mapper

• Each line of the file is an input line to the Mapper

• The Mapper will output key-value pairs

• Key: (user, friend)

– Sorted, so the friend might come before the user

• Value: the user's list of friends

• Sorting the key helps the reducer: both entries for the same pair arrive under one key


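Mapper Example

Given the line:

U1 -> U2 U3 U4

Mapper output:

(U1 U2) -> U2 U3 U4
(U1 U3) -> U2 U3 U4
(U1 U4) -> U2 U3 U4

Given the line:

U2 -> U1 U3 U4 U5

Mapper output:

(U1 U2) -> U1 U3 U4 U5
(U2 U3) -> U1 U3 U4 U5
(U2 U4) -> U1 U3 U4 U5
(U2 U5) -> U1 U3 U4 U5

Mapper Example – final result

The remaining lines are mapped the same way:

U3 -> U1 U2 U4 U5

(U1 U3) -> U1 U2 U4 U5
(U2 U3) -> U1 U2 U4 U5
(U3 U4) -> U1 U2 U4 U5
(U3 U5) -> U1 U2 U4 U5

U4 -> U1 U2 U3 U5

(U1 U4) -> U1 U2 U3 U5
(U2 U4) -> U1 U2 U3 U5
(U3 U4) -> U1 U2 U3 U5
(U4 U5) -> U1 U2 U3 U5

U5 -> U2 U3 U4

(U2 U5) -> U2 U3 U4
(U3 U5) -> U2 U3 U4
(U4 U5) -> U2 U3 U4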

Designing our MapReduce job - Reducer

• The input for the reducer will be structured as:

(friend1, friend2) (friend1's friends) (friend2's friends)

• The reducer will find the intersection between the two lists

• Output:

(friend1, friend2) (intersection of friend1's and friend2's friends)


Reducer Example

Given the line | Reducer output
(U1 U2) -> (U1 U3 U4 U5) (U2 U3 U4) | (U1 U2) -> (U3 U4)
(U1 U3) -> (U1 U2 U4 U5) (U2 U3 U4) | (U1 U3) -> (U2 U4)
(U1 U4) -> (U1 U2 U3 U5) (U2 U3 U4) | (U1 U4) -> (U2 U3)
(U2 U3) -> (U1 U2 U4 U5) (U1 U3 U4 U5) | (U2 U3) -> (U1 U4 U5)
(U2 U4) -> (U1 U2 U3 U5) (U1 U3 U4 U5) | (U2 U4) -> (U1 U3 U5)
(U2 U5) -> (U1 U3 U4 U5) (U2 U3 U4) | (U2 U5) -> (U3 U4)
(U3 U4) -> (U1 U2 U3 U5) (U1 U2 U4 U5) | (U3 U4) -> (U1 U2 U5)
(U3 U5) -> (U1 U2 U4 U5) (U2 U3 U4) | (U3 U5) -> (U2 U4)
(U4 U5) -> (U1 U2 U3 U5) (U2 U3 U4) | (U4 U5) -> (U2 U3)
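The whole job can be checked on the five-line sample without a cluster. The sketch below runs the mapper and reducer logic in memory with LINQ; it assumes each input line is the user followed by his/her friends, separated by spaces (no arrow), and that every pair receives exactly two friend lists, which holds for symmetric data. The local functions Map and Reduce are stand-ins for the Hadoop classes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string[] friendsFile =
{
    "U1 U2 U3 U4",
    "U2 U1 U3 U4 U5",
    "U3 U1 U2 U4 U5",
    "U4 U1 U2 U3 U5",
    "U5 U2 U3 U4",
};

// Mapper: for every friend on the line emit (sorted "user friend" key, the user's friend list).
static IEnumerable<(string Key, string Value)> Map(string inputLine)
{
    var parts = inputLine.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    if (parts.Length == 0) yield break;

    var friends = parts.Skip(1).ToArray();
    foreach (var friend in friends)
    {
        var keyArr = new[] { parts[0], friend };
        Array.Sort(keyArr, StringComparer.Ordinal); // (U2, U1) and (U1, U2) give the same key
        yield return (string.Join(" ", keyArr), string.Join(" ", friends));
    }
}

// Reducer: intersect the two friend lists received for a pair.
static string Reduce(IEnumerable<string> friendLists)
{
    var lists = friendLists.Select(list => list.Split(' ')).ToList();
    return string.Join(" ", lists[0].Intersect(lists[1]));
}

// Map, group by key (the shuffle), then reduce each group.
var commonFriends = friendsFile
    .SelectMany(line => Map(line))
    .GroupBy(pair => pair.Key, pair => pair.Value)
    .ToDictionary(group => group.Key, group => Reduce(group));

foreach (var pair in commonFriends.OrderBy(p => p.Key, StringComparer.Ordinal))
    Console.WriteLine($"({pair.Key}) -> ({pair.Value})");
```

The nine pairs it produces match the reducer output listed in the example above.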

Creating C# MapReduce - Mapper

using System;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

public class CommonFriendsMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // Expected line format: the user followed by his/her friends, separated by spaces.
        var strings = inputLine.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (strings.Any())
        {
            var currentUser = strings[0];
            var friends = strings.Skip(1).ToArray();
            foreach (var friend in friends)
            {
                // Sort the pair so that (U1, U2) and (U2, U1) produce the same key.
                var keyArr = new[] { currentUser, friend };
                Array.Sort(keyArr);
                var key = String.Join(" ", keyArr);
                context.EmitKeyValue(key, string.Join(" ", friends));
            }
        }
    }
}

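Creating C# MapReduce - Reducer

using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

public class CommonFriendsReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> strings, ReducerCombinerContext context)
    {
        var friendsLists = strings
            .Select(friendList => friendList.Split(' '))
            .ToList();

        // A pair needs both friend lists; skip keys that received only one.
        if (friendsLists.Count < 2) return;

        var intersection = friendsLists[0].Intersect(friendsLists[1]);
        context.EmitKeyValue(key, string.Join(" ", intersection));
    }
}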

Creating C# MapReduce – Hadoop Job

HadoopJobConfiguration myConfig = new HadoopJobConfiguration();
myConfig.InputPath = "wasb:///example/data/friends/friends";
myConfig.OutputFolder = "wasb:///example/data/friends/output";

Environment.SetEnvironmentVariable("HADOOP_HOME", @"c:\hadoop");
Environment.SetEnvironmentVariable("Java_HOME", @"c:\hadoop\jvm");

// The cluster and storage arguments below are assumed to be defined elsewhere.
var hadoop = Hadoop.Connect(
    clusterUri, clusterUserName, hadoopUserName, clusterPassword,
    azureStorageAccount, azureStorageKey, azureStorageContainer,
    createContainerIfNotExist);

var jobResult = hadoop.MapReduceJob.Execute<CommonFriendsMapper, CommonFriendsReducer>(myConfig);

int exitCode = jobResult.Info.ExitCode; // 0 – success, otherwise – failure

Pricing


10 node cluster that will exist for 24 hours:

• Secure Gateway Node - free

• Head node - 15.36 USD per 24-hour day

• 1 data node - 7.68 USD per 24-hour day

• 10 data nodes - 76.80 USD per 24-hour day

• Total: 92.16 USD
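The total is simply the head node plus the data nodes, multiplied by the number of days the cluster exists. A minimal sketch using the per-day rates quoted on the slide (2014 figures, used here as default values); the function ClusterCostUsd is hypothetical:

```csharp
using System;
using System.Globalization;

// Cost of an HDInsight cluster for a number of days. The secure gateway node is free,
// so only the head node and the data nodes are billed. Default rates are the slide's figures.
static decimal ClusterCostUsd(int dataNodes, int days,
    decimal headNodePerDay = 15.36m, decimal dataNodePerDay = 7.68m)
{
    return days * (headNodePerDay + dataNodes * dataNodePerDay);
}

decimal oneDayTenNodes = ClusterCostUsd(dataNodes: 10, days: 1);
Console.WriteLine(oneDayTenNodes.ToString(CultureInfo.InvariantCulture) + " USD");
```

Because billing is per day of existence, deleting the cluster once a job finishes is the main lever on cost.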

WRAP UP

Comparing the alternatives

Storage Type | When Should You Use | Implications
BLOB | Unstructured data; files | Application logic responsibility; consider using HDInsight (Hadoop)
SQL Server | Structured relational data; ACID transactions; max 150GB (500GB in preview) | SQL DML+DDL; could affect scalability; BI abilities; reporting
Azure Tables | Structured data; loose schema; geo replication (high DR); auto sharding | OData, REST; application logic responsibility (multiple schemas)

What have we seen

• Azure Blobs

• Azure Tables


• HDInsight


What’s Next

• NoSQL – MongoDB, Cassandra, CouchDB, RavenDB

• Hadoop ecosystem – Hive, Pig, Sqoop, Mahout

• http://blogs.msdn.com/b/windowsazure/

• http://blogs.msdn.com/b/windowsazurestorage/

• http://blogs.msdn.com/b/bigdatasupport/


Presenter contact details

c: +972-52-4772946

t: @tamir_dresher (http://www.twitter.com/tamir_dresher)

e: [email protected]

b: TamirDresher.com

w: www.codevalue.net (http://www.codevalue.net/)

Date post:	12-Jul-2015
Category:	Software
Upload:	tamir-dresher