Windows Azure HDInsight Service

Post on 11-Nov-2014

description

Describe the Hadoop features provided in Windows Azure HDInsight

transcript

NEIL MACKENZIE

Windows Azure HDInsight Service

Hadoop on Windows Azure

Who Am I?

Neil Mackenzie
Windows Azure Architect @ Satory Global

Windows Azure MVP
Blog: http://convective.wordpress.com/
Twitter: @mknz

Book: Microsoft Windows Azure Development Cookbook

Goals and Agenda

Goals:
  Introduce Windows Azure HDInsight Service to the Windows Azure developer
  Introduce Windows Azure to the Hadoop user
  Not a tutorial on how to use Hadoop features

Agenda:
  Big Data
  Windows Azure
  Windows Azure HDInsight Service

Big Data

Problem: How do we create value from enormous amounts of low-value data?

Solution: Analyze it using a lot of commodity hardware.

Three Vs of Big Data

Volume: How much data is there?

Variety: What are the sources of the data?

Velocity: How fast is the data being generated?

MapReduce

Distributed computational model for data analysis.

Map function: Processes a key-value pair to generate intermediate key-value pairs.

Reduce function: Merges all intermediate values with the same intermediate key.

Map and reduce functions are allocated to many compute nodes, each working on data stored locally.

Raw MapReduce functions are written in Java.
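
The map and reduce roles described above can be sketched in a few lines. This is only an in-memory illustration in Python, not Hadoop code (raw Hadoop MapReduce functions are written in Java, as noted). The word-count task, the sample lines, and the names map_fn, reduce_fn and run_mapreduce are assumptions made for the example.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take one (key, value) pair and emit intermediate (key, value) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share one intermediate key."""
    return key, sum(values)


def run_mapreduce(records):
    """Run map, group by intermediate key, then reduce, all in one process."""
    groups = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)  # stands in for the shuffle step
    return dict(reduce_fn(k, v) for k, v in groups.items())


# Hypothetical input: (line number, line text) pairs.
lines = [(0, "big data big value"), (1, "data on Hadoop")]
counts = run_mapreduce(lines)
print(counts)
```

On a real cluster the grouping step is performed by the framework across many nodes; here a dictionary plays that part.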

Apache Hadoop

Modules:
  Hadoop Distributed File System (HDFS)
  MapReduce

Related projects:
  HBase – scalable, distributed database
  Hive – data warehouse infrastructure
  Mahout – scalable machine learning library
  Pig – high-level data-flow language

Other:
  Sqoop – import and export to relational databases

Windows Azure

Compute:
  PaaS: Cloud Services, Windows Azure Web Sites
  IaaS: Virtual Machines

Storage:
  Windows Azure Storage Service: blobs, tables, queues
  Windows Azure SQL Database
  IaaS: Microsoft SQL Server, MongoDB, Cassandra, etc.

Connectivity:
  HTTP, TCP, UDP, Site-to-Site VPN

Administration:
  Portal, Service Management API

Windows Azure HDInsight Service

Components:
  Hadoop Core – v1.0.1
  HDFS & ASV
  Pig – v0.9.3
  Hive – v0.8.1
  Sqoop – v1.4.2
  Excel/Hive

Note: this was formerly known as Hadoop on Azure.

Hadoop Administration

Portal: http://www.hadooponazure.com
  Apply to join preview
  Create and manage Hadoop cluster
    3 nodes for 5 days
  Access the Interactive console
    Hive: Invoke Hive statements
    JavaScript: Invoke HDFS commands; invoke Hive & Pig statements

Distributed File Systems

HDFS:
  Contents deleted when cluster deleted

ASV (Azure Storage Vault):
  Data stored in Windows Azure Blob Storage
  Configured on Hadoop on Azure portal
  Contents survive deletion of Hadoop cluster
  Supports multi-level structure, e.g.: containername/input/file1
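
To make the multi-level structure concrete, here is a minimal Python sketch that splits an ASV path into a container name and a blob name. It assumes the first path segment names the blob container and the remainder is the blob name, which is how the containername/input/file1 example reads. The helper split_asv_path is hypothetical and is not part of any Hadoop or Azure library.

```python
from urllib.parse import urlsplit


def split_asv_path(uri):
    """Split an asv:// URI into (container, blob_name).

    Assumption: the first segment after asv:// is the blob container and
    everything after it is the blob name, with '/' giving the folder levels.
    """
    parts = urlsplit(uri)
    if parts.scheme != "asv":
        raise ValueError("not an ASV path: " + uri)
    return parts.netloc, parts.path.lstrip("/")


container, blob_name = split_asv_path("asv://flightdata/input/flightdata.txt")
print(container, blob_name)
```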

Pig

Hadoop feature to perform data-flow operations:
  Execution environment
  Language: Pig Latin

Execution environment:
  Local in local JVM or distributed on Hadoop cluster

Pig Latin:
  High-level language
  Describes data-flow operations
  Automatically invokes MapReduce jobs
  Much simpler than using MapReduce directly

Pig Example

records = LOAD 'asv://flightdata/input/flightdata.txt'
    AS (year:int, month:int, day:int, carrier:chararray, origin:chararray, dest:chararray, depdelay:int, arrdelay:int);

modified_records = FOREACH records GENERATE origin, depdelay;

STORE modified_records INTO 'my_output' USING PigStorage(',');
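
For readers new to Pig Latin, the same load, project, store data flow can be sketched in plain Python. This is an illustration of what the three statements do, not how Pig executes them. The sample rows are invented, the input is assumed to be tab-delimited (the default for PigStorage when LOAD names no delimiter), and the function names are hypothetical.

```python
import csv
import io

# Field names taken from the AS (...) schema in the Pig example.
FIELDS = ["year", "month", "day", "carrier", "origin", "dest", "depdelay", "arrdelay"]


def load(text):
    """LOAD: read tab-delimited rows and attach the field names."""
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        yield dict(zip(FIELDS, row))


def project(records):
    """FOREACH ... GENERATE origin, depdelay: keep only two fields."""
    for record in records:
        yield record["origin"], int(record["depdelay"])


def store(rows):
    """STORE ... USING PigStorage(','): write comma-separated lines."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Invented sample data, two flights.
sample = "2012\t1\t15\tAA\tSEA\tSFO\t12\t5\n2012\t1\t15\tUA\tSFO\tJFK\t-3\t10\n"
output = store(project(load(sample)))
print(output)
```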

Hive

Hadoop feature to perform data warehouse operations

HiveQL:
  High-level, SQL-like language
  Supports equi-joins
  Schema on read, not schema on write
  Automatically invokes MapReduce jobs
  Much simpler than using MapReduce directly

Metadata store:
  Contains descriptions of tables

Hive Example

FROM flightdata_asv

INSERT OVERWRITE TABLE origin_counts
    SELECT origin, COUNT(*) GROUP BY origin

INSERT OVERWRITE TABLE dest_counts
    SELECT dest, COUNT(*) GROUP BY dest
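
The multi-insert form above names the source table once and fills two output tables from it. A rough Python analogue, assuming a single pass over the rows, looks like this. The three sample flights are invented, and the code only mirrors the grouping logic, not Hive's execution plan.

```python
from collections import Counter

# Invented rows standing in for the flightdata_asv table.
flights = [
    {"origin": "SEA", "dest": "SFO"},
    {"origin": "SEA", "dest": "JFK"},
    {"origin": "SFO", "dest": "JFK"},
]

origin_counts = Counter()  # SELECT origin, COUNT(*) GROUP BY origin
dest_counts = Counter()    # SELECT dest, COUNT(*) GROUP BY dest

# One pass over the source feeds both aggregations.
for flight in flights:
    origin_counts[flight["origin"]] += 1
    dest_counts[flight["dest"]] += 1

print(dict(origin_counts))
print(dict(dest_counts))
```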

Sqoop

Feature allowing import from and export to SQL databases
Uses JDBC connector
Works with Windows Azure SQL Database
Table must exist before export

Sqoop Example

Exporting a table:

sqoop.cmd export --connect "jdbc:sqlserver://sql_database_server.database.windows.net:1433;database=sql_database_instance;user=sqoop_login@sql_database_server;password=sqoop_login_password" --table sql_database_table --export-dir "/user/hive/warehouse/hive_table" --input-fields-terminated-by "\001"
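
The "\001" passed to --input-fields-terminated-by is the Ctrl-A character, which Hive uses by default to separate fields in its warehouse files. A small Python sketch of how one such line splits into columns follows. The sample row and the helper name parse_hive_row are made up for illustration.

```python
# Ctrl-A (octal 001), Hive's default field delimiter.
HIVE_FIELD_DELIMITER = "\x01"


def parse_hive_row(line):
    """Split one line of a Hive warehouse file into its column values."""
    return line.rstrip("\n").split(HIVE_FIELD_DELIMITER)


# Invented row: an origin airport and a count.
row = parse_hive_row("SEA\x0142\n")
print(row)
```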

Excel and Hadoop on Azure

Example of Microsoft business intelligence strategy:
  Expose Hadoop to existing tools

Hive ODBC connector for Excel:
  Create Hive queries from Excel
  Invoke them from Excel

More Information

Sign up for preview: http://www.hadooponazure.com

Support: http://social.msdn.microsoft.com/Forums/en-US/hdinsight

Avkash Chauhan’s blog: http://blogs.msdn.com/b/avkashchauhan/archive/tags/hadoop

Roger Jennings’ blog: http://oakleafblog.blogspot.com/2012/04/using-data-in-windows-azure-blobs-with.html

Summary

Hadoop: De facto solution to the Big Data problem

Windows Azure HDInsight Service:
  Native Hadoop implementation
  Managed Hadoop service for Windows Azure
  Currently in preview