Introduction to Big Data and Hadoop on Windows Azure data and cloud at Microsoft

Post on 25-Sep-2020

Wenming Ye

Sr. Research Program Manager

Microsoft Research Connections

Twitter: @wenmingye

http://www.windowsazure.com/en-us/develop/nodejs/how-to-guides/command-line-tools/

Gallery Images Available

Microsoft

Windows Server 2008 R2

SQL Server Eval 2012

Windows Server 2012

BizTalk Server 2013 Beta

Open Source

OpenSUSE 12.2

CentOS 6.3

Ubuntu 12.04/12.10

SUSE Linux Enterprise Server 11 SP2

VM with persistent drive

Server Rack 1 Server Rack 2

Blobs, Disks, Tables and Queues

8.5 trillion stored objects

900K requests/sec on average (2.3+ trillion per month)
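As a rough sanity check on the two request figures above, a sustained average rate can be converted into a monthly total. The sketch below assumes a 30-day month; that length and the function name are assumptions, not details the slide gives.

```python
# Convert an average request rate into an approximate monthly total.
# Assumption: a 30-day month (the slide does not say which month length it used).
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_month(avg_per_second, days=30):
    """Return the total number of requests served over `days` days."""
    return avg_per_second * SECONDS_PER_DAY * days


monthly = requests_per_month(900_000)
print(f"{monthly / 1e12:.2f} trillion requests per month")
```

Under that assumption, 900K requests per second works out to a little over 2.3 trillion per month, which is consistent with the "2.3+ trillion" on the slide.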

# Create container
from azure.storage import BlobService
blob_service = BlobService(account_name, account_key)
blob_service.create_container('taskcontainer')

# Upload
from azure.storage import BlobService
blob_service = BlobService(account_name, account_key)
# open() replaces the Python 2-only file() builtin used on the slide
blob_service.put_blob('taskcontainer', 'task1', open('task1-upload.txt').read(), 'BlockBlob')

# Download
from azure.storage import BlobService
blob_service = BlobService(account_name, account_key)
blob = blob_service.get_blob('taskcontainer', 'task1')

Data centers

Account

Container Blobs

Table Entities

Queue Messages

https://<account>.blob.core.windows.net/<container>

https://<account>.table.core.windows.net/<table>

https://<account>.queue.core.windows.net/<queue>
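The three URL patterns above share one shape: the account name and the service type form the host name, and the container, table, or queue name forms the path. A minimal sketch of that pattern follows; the function name and the sample account name are hypothetical.

```python
# Build a storage endpoint URL following the patterns listed above.
# `storage_endpoint` and "myaccount" are illustrative names, not part of any SDK.
SERVICES = ("blob", "table", "queue")


def storage_endpoint(account, service, resource):
    """Return https://<account>.<service>.core.windows.net/<resource>."""
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service!r}")
    return f"https://{account}.{service}.core.windows.net/{resource}"


print(storage_endpoint("myaccount", "blob", "taskcontainer"))
print(storage_endpoint("myaccount", "table", "tasks"))
print(storage_endpoint("myaccount", "queue", "taskqueue"))
```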

Design Goals

• “Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency”, ACM Symposium on Operating Systems Principles (SOSP), Oct. 2011

[Figure: Windows Azure Storage architecture. Labels: Storage Location Service; data access; two storage stamps, each with a load balancer (LB), Front-Ends, a Partition Layer, a DFS Layer, and intra-stamp replication; inter-stamp (geo) replication between the stamps.]

Access blob storage via the URL: http://<account>.blob.core.windows.net/

Partition Layer

Index

• Does not move data around, only reassigns what part of the index a partition server is responsible for
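The bullet above can be illustrated with a toy range-partitioned index. This is a minimal sketch under stated assumptions: the class name, the split keys, and the server names are all hypothetical, and the real partition layer described in the SOSP paper is considerably more involved. The point it shows is that rebalancing changes only the key-range-to-server assignment, while the stored objects stay where they are.

```python
import bisect


class PartitionMap:
    """Toy index: maps sorted key ranges to partition servers (hypothetical names)."""

    def __init__(self, split_keys, servers):
        # split_keys[i] is the inclusive lower bound of range i; the first must be "".
        assert len(split_keys) == len(servers) and split_keys[0] == ""
        self.split_keys = list(split_keys)
        self.servers = list(servers)

    def server_for(self, key):
        """Return the server currently responsible for `key`."""
        i = bisect.bisect_right(self.split_keys, key) - 1
        return self.servers[i]

    def reassign(self, range_index, new_server):
        """Hand one key range to another server; no stored data is touched."""
        self.servers[range_index] = new_server


# Stand-in for the objects persisted in the DFS layer.
stored_objects = {"apple": b"...", "task1": b"..."}
before = dict(stored_objects)

index = PartitionMap(["", "h", "p"], ["PS1", "PS2", "PS3"])
print(index.server_for("task1"))  # served by the owner of the last range

index.reassign(2, "PS1")          # rebalance: the last range moves to PS1
print(index.server_for("task1"))
```

After the reassignment, lookups for keys in the last range go to a different server, yet `stored_objects` is unchanged.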

and Queues (NEW)

Geo-replication region pairs:

• North Europe and West Europe

• North Central US and South Central US

• East Asia and South East Asia

• East US and West US

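[Figure: geo-failover using Azure DNS. Labels: DNS lookup for http://account.blob.core.windows.net/; data access; geo-replication between West US and East US; on failover, update DNS.]

Hostname | IP Address
account.blob.core.windows.net | West US (East US after failover)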
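The failover flow on this slide comes down to a DNS record that is repointed from the primary location to its geo-replica. The sketch below models that with a plain dictionary. It uses location names in place of IP addresses, as the slide does; the function names and the pairing table are illustrative assumptions, not an Azure API.

```python
# Toy model of DNS-based geo-failover (illustrative only, not an Azure API).
# Locations stand in for IP addresses, as on the slide.
GEO_PAIR = {"West US": "East US", "East US": "West US"}


def resolve(dns_table, hostname):
    """DNS lookup: return the location currently serving `hostname`."""
    return dns_table[hostname]


def fail_over(dns_table, hostname):
    """Update DNS so `hostname` points at the geo-replica of its current location."""
    dns_table[hostname] = GEO_PAIR[dns_table[hostname]]
    return dns_table[hostname]


dns = {"account.blob.core.windows.net": "West US"}
print(resolve(dns, "account.blob.core.windows.net"))    # before failover
fail_over(dns, "account.blob.core.windows.net")
print(resolve(dns, "account.blob.core.windows.net"))    # after failover
```

Clients keep using the same hostname throughout; only the record behind it changes.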

Windows Azure Storage

[Figures: two charts of Average of TransactionCount and Average of TPS; time axis from 6/24/2013 0:00 to 6/24/2013 1:00 in 3-minute steps.]

JSON

http://www.nuget.org/packages/WindowsAzure.Storage

XL VM uploading 512 blobs of 256 MB each (total upload size = 128 GB)

• C=1, P=1 => Averaged ~ 13.2 MB/s

• C=1, P=30 => Averaged ~ 50.72 MB/s

• C=30, P=1 => Averaged ~ 96.64 MB/s

• Single TCP connection is bound by TCP rate control & RTT

• P=30 vs. C=30: Test completed almost twice as fast!

• Single Blob is bound by the limits of a single partition

• Accessing multiple blobs concurrently scales
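The "almost twice as fast" remark can be checked directly from the averaged throughput figures above. The sketch below converts each rate into an approximate completion time for the full upload, assuming 1 GB = 1024 MB (so that 512 blobs of 256 MB equal 128 GB) and assuming the averaged rate held for the whole run.

```python
# Estimate total upload time from the averaged throughput figures on the slide.
# Assumptions: 1 GB = 1024 MB, and the averaged rate held for the whole run.
BLOB_COUNT = 512
BLOB_SIZE_MB = 256
total_mb = BLOB_COUNT * BLOB_SIZE_MB  # 131072 MB, i.e. 128 GB

rates_mb_per_s = {
    "C=1, P=1": 13.2,
    "C=1, P=30": 50.72,
    "C=30, P=1": 96.64,
}


def transfer_seconds(size_mb, rate_mb_per_s):
    """Return the time needed to move `size_mb` at a steady `rate_mb_per_s`."""
    return size_mb / rate_mb_per_s


for label, rate in rates_mb_per_s.items():
    print(f"{label}: about {transfer_seconds(total_mb, rate):,.0f} s")

speedup = rates_mb_per_s["C=30, P=1"] / rates_mb_per_s["C=1, P=30"]
print(f"C=30 vs. P=30 speedup: {speedup:.2f}x")
```

The ratio comes out at roughly 1.9, which matches the slide's claim, and the slowest configuration lands just under 10,000 seconds.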

[Figure: upload time in seconds (axis 0 to 10,000) for P=1, C=1; P=30, C=1; and P=1…]

• XL VM Downloading 50, 256MB Blobs (Total download size = 12.5GB)

• C=1, P=1 => Averaged ~ 96 MB/s

• C=30, P=1 => Averaged ~ 130 MB/s
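The slide does not define C, but the figures read naturally as the number of blobs transferred concurrently. Under that reading, a C=30 download can be sketched with a thread pool. `download_all` and `fake_fetch` are hypothetical names; in real use the fetch function might wrap a call such as `blob_service.get_blob('taskcontainer', name)` from the earlier sample.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(blob_names, fetch_blob, concurrency):
    """Fetch every blob, keeping up to `concurrency` downloads in flight at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map returns results in the same order as blob_names.
        return dict(zip(blob_names, pool.map(fetch_blob, blob_names)))


def fake_fetch(name):
    """Stand-in for a real blob download; returns placeholder bytes."""
    return ("contents of " + name).encode()


blob_names = [f"task{i}" for i in range(1, 51)]    # 50 blobs, as on the slide
results = download_all(blob_names, fake_fetch, concurrency=30)
print(len(results), "blobs downloaded")
```

Threads suit this workload because each download spends most of its time waiting on the network rather than on the CPU.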

[Figure: download time in seconds (axis 0 to 140) for C=1, P=1 and C=30, P=1.]

Internet of things

• Audio / Video

• Log Files

• Text/Image

• Social Sentiment

• Data Market Feeds

• eGov Feeds

• Weather

• Wikis / Blogs

• Click Stream

• Sensors / RFID / Devices

• Spatial & GPS Coordinates

WEB 2.0

• Mobile

• Advertising

• Collaboration

• eCommerce

• Digital Marketing

• Search Marketing

• Web Logs

• Recommendations

ERP / CRM

• Sales Pipeline

• Payables

• Payroll

• Inventory

• Contacts

• Deal Tracking

Volume: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)

Velocity - Variety - Variability

Storage cost per GB: 1980: $190,000; 1990: $9,000; 2000: $15; 2010: $0.07

Categories along the chart: ERP / CRM, WEB 2.0, Internet of things

Big Data, BIG OPPORTUNITY

Big Data is a top priority for institutions

49% CEOs and CIOs are planning big data projects

Software Growth ($ billions)

2012: 1.8; 2013: 2.5; 2014: 3.4; 2015: 4.6

34% compound annual growth rate²

Services Growth ($ billions)

2012: 2.7; 2013: 3.9; 2014: 5.1; 2015: 6.5

39% compound annual growth rate²

1. McKinsey&Company, McKinsey Global Survey Results, Minding Your Digital Business, 2012

2. IDC Market Analysis, Worldwide Big Data Technology and Services 2012–2015 Forecast , 2012

How do I optimize my services based on patterns of weather and traffic?

How do I build a recommendation engine?

What’s the social sentiment of my product?

How do I better predict future outcomes?

Distributed Storage (HDFS)

Query (Hive)

Distributed Processing (MapReduce)

ODBC

Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages

[Figure: HDInsight storage architecture. Labels: HDFS API; DFS (1 Data Node per Worker Role) with a Name Node and Data Nodes; Compute Cluster; Azure Storage (ASV) on Azure Blob Storage, with Front end, Partition Layer, and Stream Layer.]

Hive, Pig, Mahout, Cascading, Scalding, Scoobi, Pegasus…

C#, F# Map/Reduce, LINQ to Hive, .NET management clients

JavaScript Map/Reduce, Browser hosted console, Node.js management clients

PowerShell, Cross Platform CLI tools

Deploying and Interacting With HDInsight Service

demo

 | Batch Processing | Interactive analysis | Stream processing
Query runtime | Minutes to hours | Milliseconds to minutes | Never-ending
Data volume | TBs to PBs | GBs to PBs | Continuous stream
Programming model | MapReduce | Queries | DAG
Users | Developers | Analysts and developers | Developers
Originating project | Google MapReduce | Google Dremel | Twitter Storm
Open source project | Hadoop / Spark | Drill / Shark / Impala / HBase / Cassandra | Storm / Apache S4 / Kafka
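To make the MapReduce programming model from the batch-processing column concrete, here is a minimal in-memory word count in plain Python. It is a single-process sketch of the map, shuffle, and reduce steps, not Hadoop or HDInsight code; the function names and input lines are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: combine the values for one key into a single result."""
    return word, sum(counts)


lines = ["big data on azure", "hadoop on azure"]
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
print(word_counts)
```

In a real cluster the map and reduce calls run on many nodes and the shuffle moves data across the network, but the three-step shape stays the same.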

http://www.windowsazure.com/en-us/develop/net/

http://blogs.msdn.com/b/windowsazurestorage/

http://blogs.msdn.com/b/windowsazurestorage/archive/2011/11/20/windows-azure-storage-a-highly-available-cloud-storage-service-with-strong-consistency.aspx

Windows Azure Python SDK

Windows Azure

How to use Service Management from Python

http://www.windowsazure.com/en-us/manage/linux/other-resources/command-line-tools/

http://research.microsoft.com/en-us/projects/azure/