Windows Azure HDInsight Service

Date post: 11-Nov-2014
Upload: neil-mackenzie
Description:
Describe the Hadoop features provided in Windows Azure HDInsight
Transcript
Page 1: Windows Azure HDInsight Service

NEIL MACKENZIE

Windows Azure HDInsight Service

Hadoop on Windows Azure

Page 2: Windows Azure HDInsight Service

Who Am I?

Neil Mackenzie
Windows Azure Architect @ Satory Global
Windows Azure MVP
Blog: http://convective.wordpress.com/
Twitter: @mknz
Book: Microsoft Windows Azure Development Cookbook

Page 3: Windows Azure HDInsight Service

Goals and Agenda

Goals
  Introduce Windows Azure HDInsight Service to the Windows Azure developer
  Introduce Windows Azure to the Hadoop user
  Not a tutorial on how to use Hadoop features

Agenda
  Big Data
  Windows Azure
  Windows Azure HDInsight Service

Page 4: Windows Azure HDInsight Service

Big Data

Problem: How do we create value from enormous amounts of low-value data?

Solution: Analyze it using a lot of commodity hardware.

Page 5: Windows Azure HDInsight Service

Three Vs of Big Data

Volume: How much data is there?

Variety: What are the sources of the data?

Velocity: How fast is the data being generated?

Page 6: Windows Azure HDInsight Service

MapReduce

Distributed computational model for data analysis.

Map function:
  Processes a key-value pair to generate intermediate key-value pairs.

Reduce function:
  Merges all intermediate values with the same intermediate key.

Map and reduce functions are allocated to many compute nodes, each working on data stored locally.

Raw MapReduce functions are written in Java.
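
The map and reduce roles described above can be sketched in a few lines of single-process Python. This is only an illustration of the model, not Hadoop code: the function names, the `run_mapreduce` helper, and the sample records are all hypothetical, and a real cluster would spread the map and reduce calls across many nodes.

```python
from collections import defaultdict

def map_flight(key, line):
    """Map: turn one input record into intermediate (origin, 1) pairs."""
    origin = line.split(",")[0]
    yield origin, 1

def reduce_counts(origin, values):
    """Reduce: merge all intermediate values that share the same key."""
    return origin, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle step: group intermediate values by their intermediate key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Each group is handed to the reduce function exactly once.
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

# Hypothetical input: (record offset, "origin,departure delay") pairs.
records = [(0, "SEA,12"), (1, "SFO,3"), (2, "SEA,0")]
result = run_mapreduce(records, map_flight, reduce_counts)
print(result)
```

The grouping step in the middle is what Hadoop performs between the two phases; the programmer supplies only the map and reduce functions.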

Page 7: Windows Azure HDInsight Service

Apache Hadoop

Modules:
  Hadoop Distributed File System (HDFS)
  MapReduce

Related projects:
  HBase – scalable, distributed database
  Hive – data warehouse infrastructure
  Mahout – scalable machine learning library
  Pig – high-level data-flow language

Other:
  Sqoop – import and export to relational database

Page 8: Windows Azure HDInsight Service

Windows Azure

Compute
  PaaS: Cloud Services, Windows Azure Web Sites
  IaaS: Virtual Machines

Storage
  Windows Azure Storage Service: blobs, tables, queues
  Windows Azure SQL Database
  IaaS: Microsoft SQL Server, MongoDB, Cassandra, etc.

Connectivity
  HTTP, TCP, UDP, Site-to-Site VPN

Administration
  Portal, Service Management API

Page 9: Windows Azure HDInsight Service

Windows Azure HDInsight Service

Components:
  Hadoop Core – v1.0.1
  HDFS & ASV
  Pig – v0.9.3
  Hive – v0.8.1
  Sqoop – v1.4.2
  Excel/Hive

Note: this was formerly known as Hadoop on Azure.

Page 10: Windows Azure HDInsight Service

Hadoop Administration

Portal
  http://www.hadooponazure.com
  Apply to join preview
  Create and manage Hadoop cluster
    3 nodes for 5 days
  Access the Interactive console
    Hive
      Invoke Hive statements
    JavaScript
      Invoke HDFS commands
      Invoke Hive & Pig statements

Page 11: Windows Azure HDInsight Service

Distributed File Systems

HDFS
  Contents deleted when cluster deleted

ASV
  Azure Storage Vault
  Data stored in Windows Azure Blob Storage
  Configured on Hadoop on Azure portal
  Contents survive deletion of Hadoop cluster
  Supports multi-level structure, e.g.:
    containername/input/file1
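
As a rough illustration of the multi-level structure, the sketch below splits an ASV location into its container and the blob name inside it. It assumes the `asv://container/path` form that appears in the Pig example of this deck; the helper name `split_asv_uri` is hypothetical. Blob storage itself has a flat namespace, so the "folders" are simply slashes inside the blob name.

```python
from urllib.parse import urlsplit

def split_asv_uri(uri):
    """Split an asv:// location into (container, blob name)."""
    parts = urlsplit(uri)
    if parts.scheme != "asv":
        raise ValueError(f"not an ASV location: {uri!r}")
    container = parts.netloc
    # The remainder is a single blob name; "/" only simulates directories.
    blob_name = parts.path.lstrip("/")
    return container, blob_name

container, blob_name = split_asv_uri("asv://flightdata/input/flightdata.txt")
print(container, blob_name)
```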

Page 12: Windows Azure HDInsight Service

Pig

Hadoop feature to perform data-flow operations:
  Execution environment
  Language: Pig Latin

Execution Environment
  Local in local JVM or distributed on Hadoop cluster

Pig Latin
  High-level language
  Describes data-flow operations
  Automatically invokes MapReduce jobs
  Much simpler than using MapReduce directly

Page 13: Windows Azure HDInsight Service

Pig Example

records = LOAD 'asv://flightdata/input/flightdata.txt'
    AS (year:int, month:int, day:int, carrier:chararray,
        origin:chararray, dest:chararray, depdelay:int, arrdelay:int);

modified_records = FOREACH records GENERATE origin, depdelay;

STORE modified_records INTO 'my_output' USING PigStorage(',');
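
For readers who do not know Pig Latin, the same LOAD, FOREACH ... GENERATE, STORE flow can be approximated in plain Python. This is a sketch under stated assumptions: the input is taken to be tab-delimited (the default for Pig's loader when no storage function is named), the sample rows are made up, and the helper names `load`, `project`, and `store` are hypothetical. Pig would compile the script into MapReduce jobs; this version runs in one process.

```python
import csv
import io

# Column names copied from the AS clause of the Pig script.
FIELDS = ["year", "month", "day", "carrier",
          "origin", "dest", "depdelay", "arrdelay"]

def load(text):
    """LOAD: parse tab-delimited lines into named fields."""
    for line in text.splitlines():
        if line.strip():
            yield dict(zip(FIELDS, line.split("\t")))

def project(records):
    """FOREACH ... GENERATE origin, depdelay: keep two columns."""
    for record in records:
        yield record["origin"], int(record["depdelay"])

def store(rows):
    """STORE ... PigStorage(','): write comma-separated output."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()

# Made-up sample rows, for illustration only.
sample = "2012\t1\t15\tAS\tSEA\tSFO\t12\t7\n2012\t1\t15\tUA\tSFO\tLAX\t-3\t0\n"
output = store(project(load(sample)))
print(output, end="")
```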

Page 14: Windows Azure HDInsight Service

Hive

Hadoop feature to perform data warehouse operations

HiveQL
  High-level, SQL-like language
  Supports equi-joins
  Schema on read, NOT schema on write
  Automatically invokes MapReduce jobs
  Much simpler than using MapReduce directly

Metadata store
  Contains descriptions of tables
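
"Schema on read" means the stored files are left untouched and the table description is applied only when a query reads them. The sketch below imitates that idea in Python; the schema list, the sample lines, and the `read_with_schema` helper are hypothetical and only loosely mirror Hive, where a value that does not fit the declared column type is typically returned as NULL rather than rejected at load time.

```python
# Raw lines are stored as-is; nothing validates them on write.
RAW_LINES = [
    "2012,1,15,AS,SEA,SFO,12,7",
    "2012,1,16,UA,SFO,LAX,late,0",  # "late" is not a valid integer
]

# Hypothetical table description, standing in for the metadata store.
SCHEMA = [("year", int), ("month", int), ("day", int), ("carrier", str),
          ("origin", str), ("dest", str), ("depdelay", int), ("arrdelay", int)]

def read_with_schema(line, schema):
    """Apply the schema at read time; unparseable values become None."""
    row = {}
    for (name, cast), raw in zip(schema, line.split(",")):
        try:
            row[name] = cast(raw)
        except ValueError:
            row[name] = None
    return row

rows = [read_with_schema(line, SCHEMA) for line in RAW_LINES]
print(rows[0]["depdelay"], rows[1]["depdelay"])
```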

Page 15: Windows Azure HDInsight Service

Hive Example

FROM flightdata_asv
INSERT OVERWRITE TABLE origin_counts
    SELECT origin, COUNT(*)
    GROUP BY origin
INSERT OVERWRITE TABLE dest_counts
    SELECT dest, COUNT(*)
    GROUP BY dest
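
The statement above names the source table once and feeds two aggregations from it. A minimal Python sketch of that idea follows, assuming the rows are available as dictionaries; the function name and the sample flights are hypothetical, and Hive would run this as MapReduce jobs rather than as an in-memory loop.

```python
from collections import Counter

def count_origins_and_dests(flights):
    """One pass over the data that fills both result 'tables'."""
    origin_counts = Counter()
    dest_counts = Counter()
    for flight in flights:
        origin_counts[flight["origin"]] += 1  # GROUP BY origin
        dest_counts[flight["dest"]] += 1      # GROUP BY dest
    return dict(origin_counts), dict(dest_counts)

# Made-up sample rows, for illustration only.
flights = [
    {"origin": "SEA", "dest": "SFO"},
    {"origin": "SEA", "dest": "LAX"},
    {"origin": "SFO", "dest": "LAX"},
]
origin_counts, dest_counts = count_origins_and_dests(flights)
print(origin_counts, dest_counts)
```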

Page 16: Windows Azure HDInsight Service

Sqoop

Feature allowing import and export from SQL databases
  Uses JDBC connector
  Works with Windows Azure SQL Database
  Table must exist before export

Page 17: Windows Azure HDInsight Service

Sqoop Example

Exporting a table:

sqoop.cmd export ^
  --connect "jdbc:sqlserver://sql_database_server.database.windows.net:1433;database=sql_database_instance;user=sqoop_login@sql_database_server;password=sqoop_login_password" ^
  --table sql_database_table ^
  --export-dir "/user/hive/warehouse/hive_table" ^
  --input-fields-terminated-by "\001"
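
When the export is scripted, it can help to assemble the argument list in code instead of editing one long line by hand. The sketch below is an assumption-laden illustration: the helper `build_sqoop_export_args` is hypothetical, it uses only the options shown on this slide, and the values passed in are the slide's placeholders rather than real credentials.

```python
def build_sqoop_export_args(server, database, login, password,
                            table, export_dir):
    """Build the argument list for the export shown on the slide."""
    connect = (
        f"jdbc:sqlserver://{server}.database.windows.net:1433;"
        f"database={database};user={login}@{server};password={password}"
    )
    return [
        "sqoop.cmd", "export",
        "--connect", connect,
        "--table", table,
        "--export-dir", export_dir,
        # Literal backslash-001, as on the slide; passed through to Sqoop.
        "--input-fields-terminated-by", "\\001",
    ]

args = build_sqoop_export_args(
    server="sql_database_server",
    database="sql_database_instance",
    login="sqoop_login",
    password="sqoop_login_password",
    table="sql_database_table",
    export_dir="/user/hive/warehouse/hive_table",
)
print(" ".join(args))
```

On a cluster node the list could be handed to `subprocess.run(args)`; passing a list avoids quoting problems with the semicolons in the connection string.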

Page 18: Windows Azure HDInsight Service

Excel and Hadoop on Azure

Example of Microsoft business intelligence strategy
  Expose Hadoop to existing tools

HiveODBC connector for Excel
  Create Hive queries from Excel
  Invoke them from Excel

Page 19: Windows Azure HDInsight Service

More Information

Sign up for preview:
  http://www.hadooponazure.com

Support:
  http://social.msdn.microsoft.com/Forums/en-US/hdinsight

Avkash Chauhan’s blog:
  http://blogs.msdn.com/b/avkashchauhan/archive/tags/hadoop

Roger Jennings’ blog:
  http://oakleafblog.blogspot.com/2012/04/using-data-in-windows-azure-blobs-with.html

Page 20: Windows Azure HDInsight Service

Summary

Hadoop:
  De facto solution to the Big Data problem

Windows Azure HDInsight Service
  Native Hadoop implementation
  Managed Hadoop service for Windows Azure
  Currently in preview
