Sources: The Economist, Feb ‘10; IDC
By 2016 the new Large Synoptic Survey Telescope in Chile will acquire 140 terabytes in 5 days, more than Sloan acquired in 10 years.
In 2000 the Sloan Digital Sky Survey collected more data in its first week than had been collected in the entire prior history of astronomy.
The Large Hadron Collider at CERN generates 40 terabytes of data every second
Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp
Cisco predicts that by 2013 annual internet traffic will reach 667 exabytes.
The Twitter community generates over 1 terabyte of tweets every day
Bing ingests more than 7 petabytes a month.
1,800,000,000,000,000,000,000 bytes (1.8 ZB): the size of the Digital Universe in 2011.
Within 24 months the number of intelligent devices will exceed the number of traditional IT devices.
Sources: IDC Digital Universe Study 2011, Worldwide Big Data Technology and Services 2012–2015 Forecast
In 2015 nearly 20% of the information will be touched by the cloud.
Financial Services: modeling true risk, threat analysis, fraud detection, trade surveillance, credit scoring and analysis
Retail: point-of-sale transaction analysis, customer churn analysis, sentiment analysis
E-Commerce: recommendation engines, ad targeting, search quality, abuse and click-fraud detection
Telecommunications: customer churn prevention, network performance optimization, call detail record (CDR) analysis, analyzing network data to predict failure
So how does it work? Second, take the processing to the data.
// Map and Reduce functions in JavaScript
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
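The runtime wires these two functions into the cluster, but the data flow itself is easy to simulate locally. The sketch below runs the same word-count logic over two input "splits" in plain JavaScript; the helper names (`mapWords`, `reduceCounts`) are illustrative stand-ins for the framework's plumbing, not part of any Hadoop API.

```javascript
// Emit (word, 1) for every word in a line, as the map function above does.
function mapWords(value, emit) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      emit(words[i].toLowerCase(), 1);
    }
  }
}

// Group emitted pairs by key and sum the counts per key, which is what
// the reduce function does for each key's list of values.
function reduceCounts(pairs) {
  var totals = {};
  pairs.forEach(function (p) {
    totals[p.key] = (totals[p.key] || 0) + p.count;
  });
  return totals;
}

// Simulate a run over two input splits.
var emitted = [];
["Hello Hadoop", "hello world"].forEach(function (line) {
  mapWords(line, function (key, count) {
    emitted.push({ key: key, count: count });
  });
});
var result = reduceCounts(emitted);
// result: { hello: 2, hadoop: 1, world: 1 }
```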
Hadoop in detail
• Analysis of semi-structured and unstructured data distributed across a commodity cluster
• Based on Google’s MapReduce paper and the Google File System (GFS)
• Programs = a sequence of “map” and “reduce” tasks
• Simplifies writing distributed applications
• Highly fault tolerant: multiple copies of the data
• Moves computation close to the data
• Implemented in Java and optimized for Linux
            Traditional RDBMS           MapReduce
Data size   Gigabytes (terabytes)       Petabytes (exabytes)
Access      Interactive and batch       Batch
Updates     Read/write many times       Write once, read many times
Structure   Static schema               Dynamic schema
Integrity   High (ACID)                 Low
Scaling     Nonlinear                   Linear
DBA ratio   1:40                        1:3000
Hadoop Ecosystem
Hadoop = MapReduce + HDFS
• MapReduce (job scheduling/execution system)
• HDFS (Hadoop Distributed File System)
• HBase / Cassandra (columnar NoSQL databases)
• Pig (data flow)
• Hive (warehouse and data access)
• Oozie (workflow)
• Sqoop
• Flume
• Avro (serialization)
• ZooKeeper (coordination)
• Apache Mahout
• Karmasphere (development tool)
• Traditional BI tools on top
Hadoop + Microsoft
Our own distribution of Hadoop
• Submit changes back to the Apache Foundation
• Free to download
Optimized for Windows & Azure
• AD & System Center integration
• Hadoop as a service on Azure
Focus on .NET developers
• Integration with Visual Studio
• Support for C#
Differentiation through enterprise readiness
• Performance and scale
• High availability
• Ease of use
Why Hadoop as a service?
• Task-based billing
• Easy administration
• Zero install
• Support for a wide variety of job types: machine learning (Mahout), graph mining (Pegasus), Hive, Pig, Java, JavaScript, etc.
• Greatly simplified UI
UNIX pipes:
    cat [input_file] | [mapper] | sort | [reducer] > [output_file]

Hadoop Streaming:
    hadoop jar lib\hadoop-streaming.jar
        -input directory
        -output directory
        -mapper [any script or executable]
        -reducer [any script or executable]
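Under Hadoop Streaming the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout; the framework sorts mapper output by key before the reducer sees it. The sketch below mimics that contract for word count with in-memory arrays in place of real stdin/stdout; the function names are illustrative, not part of any Hadoop API.

```javascript
// Mapper: emit "word\t1" for every word, mirroring a mapper script's stdout.
function mapperLines(inputLines) {
  var out = [];
  inputLines.forEach(function (line) {
    line.split(/[^a-zA-Z]/).forEach(function (w) {
      if (w !== "") out.push(w.toLowerCase() + "\t1");
    });
  });
  return out;
}

// Reducer: because the framework sorts by key, all lines for one key
// arrive consecutively, so summing each run of identical keys suffices.
function reducerLines(sortedLines) {
  var out = [], currentKey = null, sum = 0;
  sortedLines.forEach(function (line) {
    var parts = line.split("\t");
    if (parts[0] !== currentKey) {
      if (currentKey !== null) out.push(currentKey + "\t" + sum);
      currentKey = parts[0];
      sum = 0;
    }
    sum += parseInt(parts[1], 10);
  });
  if (currentKey !== null) out.push(currentKey + "\t" + sum);
  return out;
}

// The framework's shuffle/sort step is simulated with a plain array sort.
var sorted = mapperLines(["to be or not to be"]).sort();
var counts = reducerLines(sorted);
// counts: ["be\t2", "not\t1", "or\t1", "to\t2"]
```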
Data Market integration
Key features and benefits:
• Integration with third-party data and services
• Mashing up of internal and public data sets
• Integration with Windows Azure Marketplace through OData
• Sharing of data and insights through Windows Azure Marketplace
Some other fancy stuff...
Microsoft Codename "Social Analytics"
Key features and benefits:
• Integration of social media data with business applications
• Integration with social media sites
• Models augmented with publicly available data from social media sites
• Stronger customer relationships
[Platform overview diagram: data management across non-relational, relational, multidimensional, and streaming sources; data enrichment via share and govern, discover and recommend, transform and clean, marketplace, and external data and services; analytics that are operational, self-service, mobile, predictive, real-time, and collaborative]
Reality check A.D. 2012
Stack: Microsoft BI tools on top of a 24 TB cube fed by the Hadoop distribution
Use case:
• Extremely large volume of unstructured web log analysis
• Ad hoc analysis of unstructured web logs to prototype patterns
• Hadoop data feeds the large 24 TB cube