
Big Data (Week 11)

Jason Albert
University of Pennsylvania
jasonalb@wharton.upenn.edu

PERSPECTIVES

What is Big Data?

"High-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation." (Gartner)

1 Terabyte = 1024 Gigabytes

1 Petabyte = 1024 Terabytes

1 Exabyte = 1024 Petabytes

1 Zettabyte = 1024 Exabytes

1 ZB = 1,099,511,627,776 GB, so 7.9 ZB ≈ 8,686,141,859,430 GB
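The unit arithmetic above can be checked in a few lines of Java (a throwaway sketch; the 7.9 ZB figure is simply the slide's example quantity):

```java
// Verify the slide's conversion: 1 ZB = 1024^4 GB, and 7.9 ZB in GB.
public class DataUnits {
    public static void main(String[] args) {
        long gbPerZb = 1L << 40;                        // 1024^4 = 1,099,511,627,776
        System.out.println(gbPerZb);                    // GB in one zettabyte
        System.out.printf("%.0f GB%n", 7.9 * gbPerZb);  // ~8,686,141,859,430 GB
    }
}
```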


How do we handle Big Data?

“MAD” Information Management is the approach:

Must be Magnetic, attracting all data sources

Must be Agile, accommodating new data easily and at a rapid pace

Must be Deep, providing sophisticated statistical methods over its data repository

Why is MAD a departure from the traditional Data Warehouse?


What is the Scope of the Solution?

An End-to-End Solution must be considered:

Consume: Volume, Velocity, Variety

Store: Gigabytes, Terabytes, Petabytes

Process: Cluster, Classify, Predict

Present: Visualize, Interact, Evaluate


Perspectives on Big Data

Does it handle Big Data? Volume, Velocity, Variety

Is it considered MAD? Magnetic, Agile, Deep

Is it an End-to-End Solution? Consume, Store, Process, Present


Options to Consider

Two promising options with low market penetration (Gartner)

MapReduce and alternatives

In-memory Computing

MAP REDUCE

Hadoop = MapReduce + HDFS

An open-source, batch-oriented, data-intensive, general-purpose framework for creating distributed applications that process big data (i.e., Volume, Velocity, Variety)

Hadoop Distributed File System (HDFS)

Data distributed and replicated over multiple systems

Block oriented

MapReduce

Map function processes input key/value pairs to generate intermediate key/value pairs

Reduce function merges the intermediate values associated with the same key

Facilitates parallel processing of multiple terabytes of data on large clusters of commodity platforms

Scale Out on commodity hardware: • Fully depreciated • Repurposed • Low Cost


MapReduce Workflow

1. Input data is distributed

2. Map tasks work on a split of data:

   Map(key, value):
       for each word x in value:
           output.collect(x, 1)

3. Mappers output intermediate data

4. Data is exchanged between nodes

5. Intermediate data with the same key goes to the same reducer:

   Reduce(keyword, listOfValues):
       sum = 0
       for each x in listOfValues:
           sum += x
       output.collect(keyword, sum)

6. Reducer output is stored

$ hadoop jar wordcount.jar WordCount /usr/input /usr/output

Example trace of steps 1-6:

1. Jack be nimble, Jack be quick, Jack jump over the candlestick.

2. (0, "Jack be nimble,") (15, "Jack be quick,") (28, "Jack jump over the candlestick.")

3. ("Jack", 1), ("be", 1), ("nimble,", 1), ("Jack", 1), ("be", 1), ("quick,", 1), ("Jack", 1), ("jump", 1), ("over", 1), ("the", 1), ("candlestick.", 1)

4. …

5. ("Jack", (1,1,1)), ("be", (1,1)), ("nimble,", (1)), ("quick,", (1)), ("jump", (1)), ("over", (1)), ("the", (1)), ("candlestick.", (1))

6. ("Jack", 3), ("be", 2), ("nimble,", 1), ("quick,", 1), ("jump", 1), ("over", 1), ("the", 1), ("candlestick.", 1)
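For reference, the full job these steps trace is the canonical Hadoop WordCount, shown here against the standard org.apache.hadoop.mapreduce API. This is the stock tutorial example rather than code from the original slides; package it as wordcount.jar and launch it with the hadoop jar command above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Steps 2-3: emit (word, 1) for every token in this mapper's input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Steps 5-6: all 1s for the same word arrive together; sum and store.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);   // e.g. ("Jack", 3), ("be", 2), ...
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```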


Scale-Out: MapReduce + HDFS


Case Study: Recommendations

1) 9 TB of W3C Extended Log File Format data

2) MapReduce program: sessionExtractor

Session             Person     Person
SDF92MGSLOK4M23K    B041Q3EV   N23KFMWE
ASD90K23MOLFWQIE    EM9IU67Y

Example: LinkedIn "People You May Know" application

Other applications:
• Behavior Analytics
• Risk & Fraud Analysis
• Social Network "Connectedness"
• Text Analysis
• Regressions (Financial)
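The slides do not include the sessionExtractor source, so the following is a hypothetical sketch of its map side: read W3C Extended Log File Format lines, skip directive lines, and emit (session, person) pairs like the table above. The field positions are assumptions for illustration; a real implementation would parse the log's #Fields directive.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical map side of sessionExtractor: one (sessionId, personId)
// pair per log line. Field positions below are illustrative assumptions.
public class SessionExtractorMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String record = line.toString();
        if (record.startsWith("#")) {
            return;                       // skip W3C directives (#Fields, #Date, ...)
        }
        String[] fields = record.split("\\s+");
        if (fields.length < 5) {
            return;                       // malformed line
        }
        String sessionId = fields[3];     // assumed position of the session cookie
        String personId  = fields[4];     // assumed position of the user identifier
        context.write(new Text(sessionId), new Text(personId));
    }
}
// A companion reducer would collect the person ids per session,
// yielding rows like the Session/Person table above.
```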


Supplemental Case Study

Product Sentiment Analysis over Time

Load one month of Twitter feeds and opinion-board posts onto HDFS

Process with the word-count pattern, counting positive and negative words associated with a product over time (see the sketch below)

This type of analysis is being done with some success

http://techcrunch.com/2012/05/18/study-twitter-sentiment-mirrored-facebooks-stock-price-today/
http://www.cs.ucr.edu/~vagelis/publications/wsdm2012-microblog-financial.pdf
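A hedged sketch of that positive/negative word count as a Hadoop mapper. The tiny word lists and the tab-separated "timestamp&lt;TAB&gt;text" input layout are assumptions for illustration; a real pipeline would use curated opinion lexicons.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits ("YYYY-MM-DD<TAB>pos", 1) or ("...<TAB>neg", 1) per matched word.
public class SentimentMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final Set<String> POSITIVE =
            new HashSet<>(Arrays.asList("great", "love", "excellent"));
    private static final Set<String> NEGATIVE =
            new HashSet<>(Arrays.asList("broken", "hate", "awful"));
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2); // timestamp, message
        if (parts.length < 2 || parts[0].length() < 10) {
            return;                                        // malformed line
        }
        String day = parts[0].substring(0, 10);            // e.g. "2013-03-24"
        for (String word : parts[1].toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(word)) context.write(new Text(day + "\tpos"), ONE);
            if (NEGATIVE.contains(word)) context.write(new Text(day + "\tneg"), ONE);
        }
    }
}
// The standard summing reducer (as in WordCount) then yields daily
// positive/negative counts, i.e. sentiment over time.
```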


MapReduce is Different

MapReduce handles processing differently:
• Distributed programming
• Fault tolerant

MapReduce handles modeling differently:
• Schema-less
• Oriented toward exploration and discovery

MapReduce handles data differently:
• Mostly unstructured data objects
• Vast number of attributes and data sources
• Data sources added and/or updated frequently
• Quality is unknown

External References:
http://developer.yahoo.com/hadoop/
http://code.google.com/edu/parallel/mapreduce-tutorial.html


MapReduce…

…does it handle Big Data?

…is it considered MAD?
• Magnetic
• Agile (MapReduce requires algorithm development)
• Deep

…is it an End-to-End Solution?

IN-MEMORY COMPUTING

In-Memory Computing

Overview:
• All relevant structured data held in memory
• Cache-aware memory organization (the current bottleneck sits between CPU and main memory)
• Data partitioning for parallel execution

[Diagram: in the current methodology, computation sits in the application stack above the database stack; in the future methodology, computation moves into the database stack.]

Current methodology: optimized for disk access on platforms with limited main memory and slow disk I/O.

Future methodology: leverages current innovations in hardware and software to move computation into the database.


In-Memory Workflow

In-memory computing applies a combination of:

Optimization: for query pruning and data distribution

Execution: SQL statement plans for computational parallelization

Stores: column store with partitioning/compression (5-30x ratio; see the sketch below)

Persistence: temporal tables and MVCC (multi-version concurrency control)
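To make the compression claim concrete, here is a minimal sketch of dictionary encoding, one of the techniques behind column-store compression. It is illustrative only; real engines layer bit-packing, run-length encoding, and MVCC on top.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One column of a column store: repeated string values are stored once,
// and each row holds only a small integer code into the dictionary.
public class DictionaryColumn {
    private final Map<String, Integer> dictionary = new HashMap<>();
    private final List<String> values = new ArrayList<>();   // code -> value
    private final List<Integer> column = new ArrayList<>();  // one code per row

    public void append(String value) {
        Integer code = dictionary.get(value);
        if (code == null) {
            code = values.size();
            dictionary.put(value, code);
            values.add(value);
        }
        column.add(code);        // store an int instead of the full string
    }

    public String get(int row) {
        return values.get(column.get(row));
    }

    public static void main(String[] args) {
        DictionaryColumn country = new DictionaryColumn();
        for (String c : new String[]{"US", "DE", "US", "US", "DE"}) {
            country.append(c);
        }
        System.out.println(country.get(2)); // "US", reconstructed from its code
    }
}
```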

Example hardware (http://ark.intel.com/): IBM x3850 X5 with QPI scaling or a MAX5 memory tray; 2, 3, or 4 TB RAM; 2-4 CPUs at 10 cores each; > 4 TB across 8x HDD


Scale-Out Strategy for In-Memory


Capturing and Presenting

Data Provisioning: in-memory DBMSs do not currently accommodate transactional workloads

Trigger replication copies new transactions to an in-memory DB, facilitating real-time operational analysis, planning, and simulation

Extraction using ETL (Extract, Transform, Load) tools, which support a large variety of external and internal source systems, handles other data sources in near real time but requires job scheduling

e.g. SAP HANA


Case Study: Sales Analysis

1) Load 1.1 Billion PoS records in < 1 sec

2) Identify Top Selling Categories (see the SQL sketch below)

3) Drill Down into a Category in < 1 sec

4) Plan/Actuals as Schema & Visualize

Link to Video: PoS from HANA using Business Objects Explorer
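A hedged sketch of what step 2 might look like as SQL over JDBC. The table and column names, host, and credentials are placeholders, not the demo's actual schema; running it would require the SAP HANA JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Rank categories by revenue over the PoS line items; the in-memory
// column store makes this scan-and-aggregate pattern fast.
public class TopCategories {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:sap://hana-host:30015", "user", "password"); // placeholders
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT category, SUM(amount) AS revenue " +
                 "FROM pos_line_items " +         // hypothetical table name
                 "GROUP BY category " +
                 "ORDER BY revenue DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getBigDecimal(2));
            }
        }
    }
}
```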


Examples of Performance Gains

Report on Product Dimensions (120 million line items):
• Standard ERP solution: several minutes on a pre-aggregated dataset; more for drilldown
• In-Memory: less than 1 second on line-item-level data; about a minute's delay for drilldown

Genome Analysis:
• Optimized Data Warehouse: sequence alignment 81 minutes + variant calling 65 minutes
• In-Memory: sequence alignment 15 minutes + variant calling 19.5 minutes (6.5 min estimated)
• Approximately 2 hours saved


In-Memory Computing…

…does it handle Big Data?

…is it considered MAD?
• Magnetic (unstructured data still requires pre-processing)
• Agile
• Deep (unsupervised and supervised methods)

…is it an End-to-End Solution?

HDFS + MAP REDUCE + IN-MEMORY

Case Study: Recommendations

1) 9 TB of W3C Extended Log File Format data

2) MapReduce program: sessionExtractor

Session             Product    Product
SDF92MGSLOK4M23K    B041Q3EV   N23KFMWE
ASD90K23MOLFWQIE    EM9IU67Y

Hadoop-HANA Connector (18M records)


Scale-Out: MapReduce + HDFS

Recall this slide as the Foundation


+ Case Study: Predictive Analysis

1) Add connection details to the Data Reader component

2) Retrieve records

3) Join 1.1B PoS records to Session Data

4) K-Means cluster the sessions and explore the outcome (a minimal K-Means sketch follows this list)

5) Write back to the database for persistence

6) Use to provide recommendations for future website visitors
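To make step 4 concrete, here is a minimal k-means sketch. Representing each session as a 2-dimensional feature vector is a hypothetical choice for illustration (e.g. session length and basket value); the predictive tooling used in the demo applies the same algorithm at scale.

```java
import java.util.Arrays;
import java.util.Random;

public class KMeans {
    // Returns a cluster index for every point after a fixed number of rounds.
    public static int[] cluster(double[][] points, int k, int iterations) {
        Random rnd = new Random(42);
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++) {
            centroids[i] = points[rnd.nextInt(points.length)].clone();
        }
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < iterations; iter++) {
            // Assignment step: each point joins its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = dist(points[p], centroids[c]);
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                assignment[p] = best;
            }
            // Update step: centroids move to the mean of their members.
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < points[p].length; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue;   // empty cluster: keep old centroid
                for (int d = 0; d < sums[c].length; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    private static double dist(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return sum;   // squared Euclidean distance suffices for comparison
    }

    public static void main(String[] args) {
        double[][] sessions = {{1, 2}, {1.5, 1.8}, {8, 8}, {9, 11}, {1, 0.6}, {9, 9}};
        System.out.println(Arrays.toString(cluster(sessions, 2, 10)));
    }
}
```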


Scale-Out Strategy for In-Memory

Recall this slide as the Foundation


Better together

…does it handle Big Data?

…is it considered MAD?

MapReduce enables Magnetism: it preprocesses unstructured data

In-Memory enables Agility: data provisioning via replication and extraction

Both MapReduce and In-Memory enable Deep analysis: during MapReduce preprocessing, and with unsupervised & supervised methods in-memory

…is it an End-to-End Solution?


SAP HANA + Intel Distribution of Hadoop

Announced February 27, 2013:

http://www.sap.com/corporate-en/news.epx?PressID=20498


MAD Improvement Focus

Transformative potential in five domains:
• U.S. Healthcare
• E.U. Public Sector Administration
• Retail
• Manufacturing
• Personal Location Data

Most significant constraint: a shortage of talent to take advantage of the insights gained from large datasets
• Deep analytical talent with technical skills in statistics to provide insights
• Data-savvy analysts to interpret, challenge, and base decisions on results
• Support personnel who develop, implement, and maintain the architecture

Source: McKinsey Global Institute, "Big data: The next frontier for innovation, competition, and productivity"

QUESTIONS?