
Data Analytics with MATLAB: Tackling the Challenges of Big Data

Page 1

© 2014 The MathWorks, Inc.

Data Analytics with MATLAB

Tackling the Challenges of Big Data

Francesca Perino

Application Engineering Team

How big is big? What characterises “big” data?

“Any collection of data sets so large and complex that it becomes difficult to process using … traditional data processing applications.”

Wikipedia

Page 2

MATLAB Application Development Landscape

Prototyping Programming Deployment

Page 3

MATLAB Application Development Landscape

Prototyping Programming Deployment

Page 4

Data Analytics with MATLAB

Tackling the Challenges of Big Data

Francesca Perino

Application Engineering Team

How big is big? What characterises “big” data?

“Any collection of data sets so large and complex that it becomes difficult to process using … traditional data processing applications.”

Wikipedia

Page 5

Data-Driven Decisions and Data-Driven Design

Measurement Devices Big Data

Compute Power

Page 6

Need is Across Many Application Areas

Signal Processing

Image Processing

Model-Based Design

System Design

Advanced driver assistance system

Hybrid and electric vehicles

Sound quality analysis

Engine Calibration

Data Analysis

Portfolio risk optimization

Page 7

Source: Information Warfare, Edward Waltz, 1998

Physical

Sensors

Data

Information

Knowledge

Action

Data Analytics in MATLAB: Moving up the Information Hierarchy

Page 8

Data Analytics in MATLAB: Moving up the Information Hierarchy

Data acquisition

Instruments

Imaging devices

Flat files, Excel, Web

Databases

Data warehouses

HDFS (Hadoop)

Physical

Sensors

Data

Information

Knowledge

Action

• Sensing

• Collecting

• Measurement

• Data Acquisition

OBSERVATION

Page 9

Data Analytics in MATLAB: Moving up the Information Hierarchy

Data Processing

Exploratory Analysis

Filtering

Physical

Sensors

Data

Information

Knowledge

Action

• Sensing

• Collecting

• Measurement

• Data Acquisition

• Preprocessing

• Calibration

• Filtering

• Data Reduction

ORGANIZATION

Page 10

Data Analytics in MATLAB: Moving up the Information Hierarchy

Machine Learning

Classification: Decision Tree, Ensemble Method, Neural Network, Support Vector Machine

Regression: Linear, Non-linear, Non-parametric

Prediction

[Figure: active power (per-unit) versus time (secs); NN prediction and measured values]

[Figure: MSE versus turbine number]

Physical

Sensors

Data

Information

Knowledge

Action

• Sensing

• Collecting

• Measurement

• Data Acquisition

• Preprocessing

• Calibration

• Filtering

• Data Reduction

• Analysis

• Visualization

• Modeling

• Prediction

UNDERSTANDING

Visualization

Page 11

Physical

Sensors

Data

Information

Knowledge

Action

Data Analytics in MATLAB: Moving up the Information Hierarchy

• Sensing

• Collecting

• Measurement

• Data Acquisition

• Preprocessing

• Calibration

• Filtering

• Data Reduction

• Analysis

• Visualization

• Modeling

• Prediction

• Reporting

• Apps

• Scalable Deployment

• Integration

Reports

MATLAB

Applications

Integration into Existing Systems

Excel

Feedback for Design and Operations

APPLICATION

Page 12

Large Data Analytics

Explore

Prototype

Data Share/Deploy

Work on the desktop

Scale capacity as needed

Scale

Page 13

Large Data Analytics on the Desktop

Explore

Prototype

Access Share/Deploy

Access big data from your desktop

Collections of Text Files: datastore

Databases: Database Toolbox

Binary Files: memmapfile

Scale

Page 14

Example: Airline Flight Distance

Data

– BTS/RITA Airline On-Time Statistics

– 123.5M records, 29 fields

Task

– Find the maximum distance travelled by commercial airlines based upon flight operations performance data

CSV Data: 22 files, 12 GB

Page 15

Standard Workflow (up to R2014a)

% Location: the yearly CSV files
files = {'1987.csv', '1988.csv', '1989.csv', '1990.csv', ...
    '1991.csv', '1992.csv', '1993.csv', '1994.csv', '1995.csv', ...
    '1996.csv', '1997.csv', '1998.csv', '1999.csv', '2000.csv', ...
    '2001.csv', '2002.csv', '2003.csv', '2004.csv', '2005.csv', ...
    '2006.csv', '2007.csv', '2008.csv'};
numfiles = numel(files);

% Format: skip all 29 fields except the one numeric field kept with %f
fmtspec = ['%*q%*q%*q%*q%*q%*q%*q%*q%*q%*q' ...
    '%*q%*q%*q%*q%*q%*q%*q%*q%f%*q' ...
    '%*q%*q%*q%*q%*q%*q%*q%*q%*q'];

maxdist = -Inf;
for i = 1 : numfiles
    % Read data (the remaining textscan options are elided on the slide)
    filei = fopen(files{i});
    data = textscan(filei, fmtspec, ...);
    fclose(filei);

    % Compute: maximum distance within this file
    maxi = max(data{:});

    % Combine: running maximum across all files
    maxdist = max(maxdist,maxi);
end

Page 16

New Workflow with datastore (in R2014b)

% Location
airdata = datastore('*.csv');

% Format
airdata.SelectedVariableNames = {'Distance'};
airdata.SelectedFormats = {'%f'};

maxdist = -Inf;
while hasdata(airdata)
    % Read data
    data = read(airdata);
    % Compute
    maxi = max(data.Distance);
    % Combine
    maxdist = max(maxdist, maxi);
end

Page 17

datastore

Easily specify data set

– Single text file (or collection of text files)

– Database (using Database Toolbox)

Preview data structure and format

Select data to import using column names

Incrementally read subsets of the data

airdata = datastore('*.csv');
airdata.SelectedVariableNames = {'Distance', 'ArrDelay'};
data = read(airdata);
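
The preview, column selection and incremental read steps listed on this slide can be tried on a small scale. The following is a minimal sketch, not the presenter's code: the two tiny CSV files, their contents, and the variable names `folder`, `firstRows` and `chunk` are made up for illustration, and `ReadSize = 2` is chosen only to force several reads.

```matlab
% Build two small CSV files in a temporary folder (hypothetical stand-ins
% for the airline data; the values are invented for this sketch)
folder = tempname;
mkdir(folder);
writetable(table([100; 250; 75], [5; -3; 12], ...
    'VariableNames', {'Distance', 'ArrDelay'}), fullfile(folder, 'part1.csv'));
writetable(table([900; 40], [0; 7], ...
    'VariableNames', {'Distance', 'ArrDelay'}), fullfile(folder, 'part2.csv'));

ds = datastore(fullfile(folder, '*.csv'));   % one datastore over both files
firstRows = preview(ds);                     % peek at the first rows and column names
ds.SelectedVariableNames = {'Distance'};     % import only this column
ds.ReadSize = 2;                             % at most two rows per call to read

maxdist = -Inf;
while hasdata(ds)
    chunk = read(ds);                        % table holding the next subset of rows
    maxdist = max(maxdist, max(chunk.Distance));
end

rmdir(folder, 's');                          % remove the temporary files
```

Because only one subset is held at a time, the same loop shape applies when the files are far larger than memory.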

Page 18

Large Data Analytics on the Desktop

Expand workspace

– 64-bit processor support: increased in-memory data set handling

Access portions of data too big to fit into memory

– Memory-mapped variables: huge binary file

– Datastore: huge text file or collections of text files

– Database: query portion of a big database table

Variety of programming constructs

– System objects: analyze streaming data

– MapReduce: process text files that won’t fit into memory

Increase analysis speed

– Parallel for loops: use with multicore/multi-process machines

– GPU arrays
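
As an illustration of the memory-mapped option for binary files, here is a minimal sketch using `memmapfile`. The file, its contents (the doubles 1 to 1000) and the variable names are assumptions made for the example; a real use case would map an existing file far larger than this.

```matlab
% Write a small binary file of doubles as a stand-in for a huge one
filename = [tempname '.bin'];
fid = fopen(filename, 'w');
fwrite(fid, 1:1000, 'double');
fclose(fid);

% Map the file as a vector of doubles; the data is accessed through m.Data
m = memmapfile(filename, 'Format', 'double');

% Index into the mapped data to work with just one portion of the file
block = m.Data(101:110);
blockSum = sum(block);

clear m            % release the mapping before deleting the file
delete(filename);
```

Indexing a range of `m.Data` is what lets an analysis touch one portion of the file at a time instead of reading it whole.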

Page 19

Scaled Large Data Analytics

Explore

Prototype

Access Share/Deploy

Axes: Complexity (Embarrassingly Parallel to Non-Partitionable); in-memory to out-of-memory

Distributed Memory: SPMD

Load, Analyze, Discard: datastore, parfor

Scale

Page 20

Example: Airline Delay Analysis

Data

– BTS/RITA Airline On-Time Statistics

– 123.5M records, 29 fields

Tasks

– Calculate delay patterns

– Visualize summaries

– Estimate & evaluate predictive models

Resources

– Amazon S3 data store

– Amazon EC2 cluster

Page 21

Airline Delay Analysis: Framework

Cluster/Grid/Cloud environment

1987 1988 1989 1990 1991 1992

Instructions

Reduced Data

Client

Page 22

Scaling Big Data Capacity

MATLAB supports a number of programming constructs for use with clusters

General compute clusters

– Parallel for loops: embarrassingly parallel algorithms

– SPMD: distributed processing

Hadoop clusters

– MapReduce: analyze data stored in the Hadoop Distributed File System
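
For the embarrassingly parallel case, a parallel for loop can compute one partial result per file and combine them afterwards. This is a minimal sketch under stated assumptions: `distancesPerFile` is invented stand-in data (one cell per "file") rather than the airline files, and the loop runs serially if Parallel Computing Toolbox is not available.

```matlab
% Hypothetical stand-in data: one cell per "file", holding that file's distances
distancesPerFile = {[100 250 75], [900 40], [310 620 15]};
numFiles = numel(distancesPerFile);

maxPerFile = zeros(1, numFiles);
parfor k = 1:numFiles
    % Iterations are independent of one another, so they can run on any
    % worker in any order
    maxPerFile(k) = max(distancesPerFile{k});
end

% Combine the per-file results on the client
maxdist = max(maxPerFile);
```

The loop body never reads a value written by another iteration, which is the property that makes the problem embarrassingly parallel.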

Page 23

Scaled Large Data Analytics

Explore

Prototype

Access Share/Deploy

Axes: Complexity (Embarrassingly Parallel to Non-Partitionable); in-memory to out-of-memory

MapReduce

Distributed Memory: SPMD

Load, Analyze, Discard: datastore, parfor

Scale

Page 24

mapreduce (in R2014b)

Data Store (12 records, shown in three groups of four):

1503 UA LAX -5 -10 2356
540 PS BUR 13 5 186
1920 DL BOS 10 32 1876
1840 DL SFO 0 13 568

272 US BWI 4 -2 359
784 PS SEA 7 3 176
796 PS LAX -2 2 237
1525 UA SFO 3 -5 1867

632 US SJC 2 -4 245
1610 UA MIA 60 34 1365
2032 DL EWR 10 16 789
2134 DL DFW -2 6 914

Map (the carrier code and the last column are extracted from each group; one intermediate key-value pair per carrier per group, holding the largest value for that carrier):

UA 2356, PS 186, DL 1876
US 359, PS 237, UA 1867
US 245, UA 1365, DL 914

Reduce
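
The grouping shown on this slide can be reproduced in a few lines. The sketch below emulates the map and reduce stages with `unique` and `accumarray` in plain MATLAB; it does not call `mapreduce` itself. The carrier codes and values are taken from the slide, while the chunk size of four and all variable names are assumptions made for illustration.

```matlab
% Carrier code and last-column value of the 12 example records on the slide
carrier  = {'UA';'PS';'DL';'DL';'US';'PS';'PS';'UA';'US';'UA';'DL';'DL'};
distance = [2356; 186; 1876; 568; 359; 176; 237; 1867; 245; 1365; 789; 914];

chunkSize = 4;      % assumption: the data arrives in three chunks of four records
interKeys = {};     % intermediate keys emitted by the map stage
interVals = [];     % intermediate values emitted by the map stage

% Map stage: for each chunk, emit one (carrier, maximum) pair per carrier
for first = 1:chunkSize:numel(distance)
    rows = first:min(first + chunkSize - 1, numel(distance));
    [keys, ~, idx] = unique(carrier(rows));
    vals = accumarray(idx, distance(rows), [], @max);
    interKeys = [interKeys; keys];  %#ok<AGROW>
    interVals = [interVals; vals];  %#ok<AGROW>
end

% Reduce stage: combine all intermediate values that share a key
[carriers, ~, idx] = unique(interKeys);
maxDistance = accumarray(idx, interVals, [], @max);
result = table(carriers, maxDistance);
```

The map stage produces the same nine intermediate pairs as the slide (in sorted order within each chunk), and the reduce stage leaves one maximum per carrier.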

Page 25

Workflow with mapreduce

% Data Store: specify and format the data
indata = datastore('*.csv');
indata.SelectedVariableNames = 'Distance';
indata.SelectedFormats = '%f';

% Map (function file mapfun.m)
function mapfun(data,~,intermed)
% Compute and save intermediate result
maxi = max(data.Distance);
add(intermed,'maxi',maxi);

% Reduce (function file reducefun.m)
function reducefun(~,intermed,output)
maxdist = -Inf;
while hasnext(intermed)
    maxi = getnext(intermed);
    % Combine intermediate results
    maxdist = max(maxdist,maxi);
end
add(output,'maxdist',maxdist);

% Run the map and reduce functions over the datastore
outdata = mapreduce(indata,@mapfun,@reducefun)

Page 26

mapreduce

Use the powerful MapReduce programming technique to analyze big data

– Multiple items (keys) to organize and process

– Intermediate results do not fit in memory

On the desktop

– Analyze big database tables (Database Toolbox)

– Increase compute capacity (Parallel Computing Toolbox)

– Access data on HDFS to develop algorithms for use on Hadoop

With Hadoop

– Run on Hadoop using MATLAB Distributed Computing Server

– Deploy applications and libraries for Hadoop using MATLAB Compiler

********************************

* MAPREDUCE PROGRESS *

********************************

Map 0% Reduce 0%

Map 20% Reduce 0%

Map 40% Reduce 0%

Map 60% Reduce 0%

Map 80% Reduce 0%

Map 100% Reduce 25%

Map 100% Reduce 50%

Map 100% Reduce 75%

Map 100% Reduce 100%

Page 27

Data Analytics Landscape

[Diagram: techniques plotted by algorithm complexity (SIMPLE to COMPLEX) against increasing data size (starting at SMALL)]

Algorithm categories: easily partitioned, independent tasks; iterative; all data needed in memory at once

Techniques: built-in numerical & statistical algorithms, vectorisation, parfor, gpuArray, spmd, distributed arrays, mapreduce

Annotation: More programming effort required

Page 28

Strengths of MATLAB for Large Data Analytics

Challenge: Getting started
MATLAB Solution: Easy access to data from your desktop
• Tools for accessing typical big data sets consisting of text or binary files, contained in database tables or stored on Hadoop

Challenge: Rapid data exploration
MATLAB Solution: All the tools to explore and visualize data
• Easy to try different methods
• Ideal environment for developing your own methods

Challenge: Development of scalable algorithms
MATLAB Solution: Work on the desktop and scale to clusters
• Tools for use in analyzing big data on your desktop, which scale for use on clusters, including Hadoop, if needed

Challenge: Use within business systems
MATLAB Solution: Ease of deployment and leveraging enterprise
• Push-button deployment into production including support for Hadoop

Page 29

Strengths of MATLAB for Large Data Analytics

Challenge: Getting started
MATLAB Solution: Easy access to data from your desktop
• Tools for accessing typical data sets held in text or binary files, Excel files, or database tables
• Data import from instruments

Challenge: Rapid data exploration
MATLAB Solution: All the tools to explore and visualize data
• Easy to try different methods
• Ideal environment for developing your own methods

Challenge: Development of scalable algorithms
MATLAB Solution: Work on the desktop and scale to clusters
• Tools for use in analyzing big data on your desktop, which scale for use on clusters, including cloud, if needed

Challenge: Use within business systems
MATLAB Solution: Ease of deployment and leveraging enterprise
• Push-button deployment into production framework

Page 30

MATLAB Application Development Landscape

Prototyping Programming Deployment

Page 31

MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand names may be trademarks or registered trademarks of their respective holders. © 2014 The MathWorks, Inc.

