Introduction to Big Data and Hadoop on Windows Azure - Meetup

Date post: 11-Feb-2022
Wenming Ye Sr. Research Program Manager Microsoft Research Connections Twitter: @wenmingye

http://www.windowsazure.com/en-us/develop/nodejs/how-to-guides/command-line-tools/


Gallery Images Available

Microsoft
• Windows Server 2008 R2
• SQL Server Eval 2012
• Windows Server 2012
• BizTalk Server 2013 Beta

Open Source
• OpenSUSE 12.2
• CentOS 6.3
• Ubuntu 12.04/12.10
• SUSE Linux Enterprise Server 11 SP2


VM with persistent drive


Server Rack 1 Server Rack 2


Blobs, Disks, Tables and Queues

8.5 trillion stored objects

900K requests/sec on average (2.3+ trillion per month)


# Create container
from azure.storage import BlobService
blob_service = BlobService(account_name, account_key)
blob_service.create_container('taskcontainer')

# Upload (read the local file as bytes and store it as a block blob)
from azure.storage import BlobService
blob_service = BlobService(account_name, account_key)
blob_service.put_blob('taskcontainer', 'task1', open('task1-upload.txt', 'rb').read(), 'BlockBlob')

# Download
from azure.storage import BlobService
blob_service = BlobService(account_name, account_key)
blob = blob_service.get_blob('taskcontainer', 'task1')


Data centers


Account
• Container > Blobs: https://<account>.blob.core.windows.net/<container>
• Table > Entities: https://<account>.table.core.windows.net/<table>
• Queue > Messages: https://<account>.queue.core.windows.net/<queue>
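The URL patterns above can be sketched as a small helper. This is only an illustration: the function name `storage_endpoint` and the sample account and container names are hypothetical, and the helper does nothing more than fill in the pattern shown on the slide.

```python
# Hypothetical helper that fills in the endpoint pattern from the slide:
# https://<account>.<service>.core.windows.net/<resource>
SERVICES = ("blob", "table", "queue")


def storage_endpoint(account, service, resource):
    """Return the endpoint URL for a container, table, or queue."""
    if service not in SERVICES:
        raise ValueError("unknown service: %r" % (service,))
    return "https://{0}.{1}.core.windows.net/{2}".format(account, service, resource)


# 'myaccount' and 'taskcontainer' are made-up sample names.
print(storage_endpoint("myaccount", "blob", "taskcontainer"))
```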


Design Goals

• “Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency”, ACM Symposium on Operating Systems Principles (SOSP), Oct. 2011


Storage Location Service

Access blob storage via the URL: http://<account>.blob.core.windows.net/

Data access

Storage Stamp (two stamps shown, each with the same layers)
• LB (load balancer)
• Front-Ends
• Partition Layer
• DFS Layer, with intra-stamp replication

Inter-stamp (Geo) replication between the two stamps


Index

Partition Layer


• Load balancing does not move data around; it only reassigns which part of the index each partition server is responsible for
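The point about reassigning index ranges can be illustrated with a toy model. This is a sketch under stated assumptions, not the actual Windows Azure Storage implementation: the class `PartitionMap`, the server names, and the key boundaries are all hypothetical.

```python
import bisect


class PartitionMap(object):
    """Toy map from key ranges of an index to partition servers (illustrative only)."""

    def __init__(self, boundaries, servers):
        # boundaries[i] is the exclusive upper bound of range i;
        # the last range has no upper bound, so there is one more server than boundary.
        if len(servers) != len(boundaries) + 1:
            raise ValueError("need exactly one server per key range")
        self.boundaries = list(boundaries)
        self.servers = list(servers)

    def server_for(self, key):
        """Return the server currently responsible for the range containing key."""
        return self.servers[bisect.bisect_right(self.boundaries, key)]

    def reassign(self, range_index, new_server):
        """Hand a key range to another server; no stored data is copied or moved."""
        self.servers[range_index] = new_server


# Hypothetical layout: three key ranges served by PS1, PS2 and PS3.
pmap = PartitionMap(["h", "q"], ["PS1", "PS2", "PS3"])
print(pmap.server_for("kiwi"))
pmap.reassign(1, "PS3")  # offload the middle range from PS2 to PS3
print(pmap.server_for("kiwi"))
```

Only the assignment table changes in `reassign`; in this model the data itself would stay wherever the lower storage layer keeps it.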


and Queues (NEW)

Geo-replication pairs
• West Europe <-> North Europe
• South Central US <-> North Central US
• East Asia <-> South East Asia
• West US <-> East US


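Geo failover between West US and East US
• DNS lookup: http://account.blob.core.windows.net/ is resolved through Azure DNS
• Hostname to IP address: account.blob.core.windows.net -> West US
• Data access goes to West US, with geo-replication to East US
• Failover: update DNS so that the hostname points to East US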
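The failover flow in this diagram (DNS lookup, data access, then a DNS update on failover) can be modeled in a few lines. This is a toy sketch: the class `GeoDns` and its methods are hypothetical and only mimic the idea that clients keep using the same hostname while the record behind it changes.

```python
class GeoDns(object):
    """Toy DNS table mapping a storage hostname to its active region (illustrative only)."""

    def __init__(self):
        self.records = {}

    def register(self, hostname, primary, secondary):
        # The primary region serves traffic; the secondary holds the geo-replica.
        self.records[hostname] = {"active": primary, "standby": secondary}

    def resolve(self, hostname):
        """Return the region that currently serves this hostname."""
        return self.records[hostname]["active"]

    def failover(self, hostname):
        """Swap active and standby regions, as an 'Update DNS' step would."""
        record = self.records[hostname]
        record["active"], record["standby"] = record["standby"], record["active"]


dns = GeoDns()
dns.register("account.blob.core.windows.net", "West US", "East US")
print(dns.resolve("account.blob.core.windows.net"))
dns.failover("account.blob.core.windows.net")
print(dns.resolve("account.blob.core.windows.net"))
```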


Windows Azure Storage


[Chart: Average of TransactionCount (axis 660000 to 700000) and Average of TPS (axis 180 to 200)]

[Chart: Average of TransactionCount (axis 0 to 20000) and Average of TPS (axis 0 to 350), plotted from 6/24/2013 to 6/24/2013 1:00 in 3-minute steps]

JSON

http://www.nuget.org/packages/WindowsAzure.Storage


XL VM uploading 512 blobs of 256 MB each (total upload size = 128 GB)

• C=1, P=1 => Averaged ~ 13.2 MB/s

• C=1, P=30 => Averaged ~ 50.72 MB/s

• C=30, P=1 => Averaged ~ 96.64 MB/s

• Single TCP connection is bound by TCP rate control & RTT

• P=30 vs. C=30: Test completed almost twice as fast!

• Single Blob is bound by the limits of a single partition

• Accessing multiple blobs concurrently scales

[Chart: Time (s), axis 0 to 10000, for P=1, C=1; P=30, C=1; P=1…]
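The upload rates above translate into elapsed times with simple arithmetic. The sketch below assumes 1 GB = 1024 MB, so 512 blobs of 256 MB come to 128 GB; the function name `transfer_seconds` is hypothetical.

```python
def transfer_seconds(blob_count, blob_size_mb, throughput_mb_per_s):
    """Estimated wall-clock time to move blob_count blobs at a steady average rate."""
    return blob_count * blob_size_mb / float(throughput_mb_per_s)


# Average upload rates (MB/s) quoted on the slide.
upload_rates = [("C=1, P=1", 13.2), ("C=1, P=30", 50.72), ("C=30, P=1", 96.64)]

for label, rate in upload_rates:
    print("%s: about %.0f s" % (label, transfer_seconds(512, 256, rate)))

# Ratio of the P=30 run time to the C=30 run time.
ratio = transfer_seconds(512, 256, 50.72) / transfer_seconds(512, 256, 96.64)
print("P=30 takes %.2f times as long as C=30" % ratio)
```

The ratio comes out a little under 2, which agrees with the "almost twice as fast" remark, and the slowest run lands just under the 10000 s top of the chart axis.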


XL VM downloading 50 blobs of 256 MB each (total download size = 12.5 GB)

• C=1, P=1 => Averaged ~ 96 MB/s

• C=30, P=1 => Averaged ~ 130 MB/s

[Chart: Time (s), axis 0 to 140, for C=1, P=1 and C=30, P=1]
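Working on several blobs at once, which these tests credit for the higher throughput, can be sketched with a thread pool. The sketch assumes that C is the number of blobs transferred concurrently, which the slides do not define explicitly. `download_blob` is a hypothetical stand-in that returns a fake payload rather than calling any storage SDK.

```python
from concurrent.futures import ThreadPoolExecutor


def download_blob(name):
    """Hypothetical stand-in for a real blob download; returns (name, fake payload)."""
    return name, b"x" * 1024


def download_all(names, concurrency):
    """Download every blob in names, keeping up to `concurrency` blobs in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves input order and yields (name, payload) pairs.
        return dict(pool.map(download_blob, names))


# 50 made-up blob names, fetched with 30 worker threads (mirrors C=30).
blobs = download_all(["task%d" % i for i in range(50)], concurrency=30)
print(len(blobs))
```

In a real client each worker would call the storage library's download method; the pool size then plays the role of C in the benchmark.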


Velocity - Variety - Variability

ERP / CRM
• Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking

WEB 2.0
• Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations

Internet of things
• Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates

Volume
• Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)

Storage/GB
• 1980: $190,000
• 1990: $9,000
• 2000: $15
• 2010: $0.07


Big Data, BIG OPPORTUNITY

Big Data is a top priority for institutions

49% of CEOs and CIOs are planning big data projects

Software Growth (billions $)
• 2012: 1.8
• 2013: 2.5
• 2014: 3.4
• 2015: 4.6
• 34% compound annual growth rate (see note 2)

Services Growth (billions $)
• 2012: 2.7
• 2013: 3.9
• 2014: 5.1
• 2015: 6.5
• 39% compound annual growth rate (see note 2)

1. McKinsey&Company, McKinsey Global Survey Results, Minding Your Digital Business, 2012

2. IDC Market Analysis, Worldwide Big Data Technology and Services 2012–2015 Forecast , 2012


• How do I optimize my services based on patterns of weather and traffic?

• How do I build a recommendation engine?

• What’s the social sentiment of my product?

• How do I better predict future outcomes?


Distributed Storage (HDFS)

Query (Hive)

Distributed Processing (MapReduce)

ODBC

Legend
• Red = Core Hadoop
• Blue = Data processing
• Purple = Microsoft integration points and value adds
• Orange = Data Movement
• Green = Packages
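The MapReduce processing model named above can be illustrated with a single-process word count. This is a minimal sketch of the map, shuffle, and reduce steps in plain Python, not Hadoop code; every function name here is hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big Hadoop"]))
```

In a real cluster the map and reduce steps run on many nodes and the shuffle moves data between them; here all three run in one process to keep the shape of the model visible.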


HDFS API
• DFS (1 Data Node per Worker Role) and Compute Cluster: Name Node, Data Node, Data Node
• Azure Storage (ASV): Azure Blob Storage, with Front end, Partition Layer, Stream Layer


• Hive, Pig, Mahout, Cascading, Scalding, Scoobi, Pegasus…

• C#, F# Map/Reduce, LINQ to Hive, .NET management clients

• JavaScript Map/Reduce, Browser hosted console, Node.js management clients

• PowerShell, Cross Platform CLI tools


Deploying and Interacting With HDInsight Service

demo


|                     | Batch processing | Interactive analysis                       | Stream processing         |
|---------------------|------------------|--------------------------------------------|---------------------------|
| Query runtime       | Minutes to hours | Milliseconds to minutes                    | Never-ending              |
| Data volume         | TBs to PBs       | GBs to PBs                                 | Continuous stream         |
| Programming model   | MapReduce        | Queries                                    | DAG                       |
| Users               | Developers       | Analysts and developers                    | Developers                |
| Originating project | Google MapReduce | Google Dremel                              | Twitter Storm             |
| Open source project | Hadoop / Spark   | Drill / Shark / Impala / HBase / Cassandra | Storm / Apache S4 / Kafka |


http://www.windowsazure.com/en-us/develop/net/

http://blogs.msdn.com/b/windowsazurestorage/

http://blogs.msdn.com/b/windowsazurestorage/archive/2011/11/20/windows-azure-storage-a-highly-available-cloud-storage-service-with-strong-consistency.aspx


Windows Azure Python SDK

Windows Azure

How to use Service Management from Python

http://www.windowsazure.com/en-us/manage/linux/other-resources/command-line-tools/

http://research.microsoft.com/en-us/projects/azure/

