Big Analytics Without Big Hassles 04/10/14 Webinar

Posted on 18-Dec-2014

description

Data scientists want to do fast, interactive exploratory analytics on all kinds of data without worrying about whether the data fits in memory, about parallelism, about force-fitting it into a table, or about pulling it out of a file and formatting it for math packages. They also want to use their favorite analytical language and have it scale transparently to Big Data volumes. Paradigm4 presents a webinar about SciDB, the open source array database with native scalable complex analytics, programmable from R and Python. Learn how SciDB enables you to:

•  Explore rich data sets interactively
•  Do complex math in-database, without being constrained by memory limitations
•  Perform fast multi-dimensional windowing, filtering, and aggregation
•  Offload large computations to a commodity hardware cluster, on-premise or in a cloud
•  Use R and Python to analyze SciDB arrays as if they were R or Python objects
•  Share data among users, with multi-user data integrity guarantees and version control

transcript

Big Analytics without Big Hassles

Bryan Lewis, Chief Data Scientist

Alex Poliakov, Solutions Architect

Paradigm4’s SciDB

MPP Database

Array data model

Complex analytics

Commodity clusters or cloud

R & Python

Big analytics without big hassles

© Paradigm4

Using WebEx

•  Ask questions using the Q&A window

•  This webinar is being recorded

•  Replays will be available

Agenda

1.  Brief Introduction to SciDB

2.  Demos

3.  Q & A

Paradigm4 develops & supports SciDB

Mike Stonebraker, CTO & Co-founder

•  MIT Professor
•  ISTC Big Data at MIT
•  Force behind many major advances in databases: Postgres, Vertica, Paradigm4, Illustra, VoltDB, Streambase, DataTamer

Presenters

Bryan Lewis, Chief Data Scientist
•  Applied Math Ph.D.
•  Founder Rocketcalc; RevolutionAnalytics
•  CRAN contributor

Alex Poliakov, Solutions Architect
•  Decade developing database internals (Netezza, Paradigm4)
•  Solutions: e-commerce, pharma/biotech, insurance, satellite imagery

Three pillars of SciDB

MPP Database

Array data model

Complex analytics

Commodity clusters or cloud

R & Python

Big analytics without big hassles

SciDB Powers NIH NCBI’s 1000 Genomes Project

Running 24 x 7 since Fall 2012

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/

Some commercial use cases

Pharma, Biotech, Healthcare
•  Join public & private data
•  Integrate many data sources
•  Scale up & speed up math

Quant Finance
•  Fast data window selection & scalable math

Image & Sensor Analytics
•  Integrate diverse data with different spatial and temporal resolutions

E-commerce
•  SVD on sparse matrices 50M x 50M
•  Powering recommendation engine

SciDB is a Database

•  ACID support means multiple users can simultaneously read / write / analyze data
•  Fast JOINs

[Diagram label: data in files]

Arrays are a natural data model

[Figure: two example arrays. Labels on the first: Sensor / car / phone, time, longitude, latitude, Event, other dimensions …. Labels on the second: Exchange, Stock_ID, Time, other dimensions ….]

Native Array DBs vs. Relational DBs

•  Data that are close together in the coordinate system are stored close to each other on disk
•  Important for ordered data and analysis

Array storage supports fast multi-dimensional SELECTs

Illustration credit: Andrei Pandre

SciDB does scalable complex analytics

•  No more ETL hassles to a separate math package
•  Data not constrained to fit in memory

Parallel linear algebra
Principal component analysis
Clustering
GLM
Machine learning
and more

Analyst-Friendly Interfaces

•  Program SciDB from R or Python
•  Naturally reference & manipulate data in SciDB
•  Large computations run on SciDB cluster
   –  Go beyond the scalability limitations of R & Python

We also support AQL and JDBC

Shared-Nothing Cluster Architecture

[Diagram labels]
Clients: R + SciDB-R, Python + SciDB-Py, JDBC, Web Browser
Cluster: SciDB Coordinator, SciDB 1, SciDB 2, SciDB …

•  K-replication for redundancy
•  Scale out horizontally

SciDB Arrays

Each cell in a SciDB array consists of a fixed number of typed attributes (variables). Here is an example cell with four attributes:

Price    Volume   Symbol   usec
450.61   150      “AAPL”   36013008713

SciDB Arrays

              Attributes
Dimension i   Price    Volume   Symbol   usec
1             450.61   150      “AAPL”   36013008713
2             450.73   200      “AAPL”   36013008915
3             450.84   10       “AAPL”   36013208113
4             36.57    75       “MSFT”   36019008713
5             36.20    100      “MSFT”   36003200113

A 1-D array looks like an R or Pandas data frame.

This picture shows five cells, each with four attributes.

SciDB Arrays

The same data “redimensioned” into a 2D array

                   Dimension Symbol
                   “AAPL”              “MSFT”
Dimension usec     Price    Volume     Price    Volume
36003200113                            36.20    100
36013008713        450.61   150
36013008915        450.73   200
36013208113        450.84   10
36019008713                            36.57    75
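As a rough illustration of what redimensioning does, the sketch below uses plain Python rather than SciDB's own operators or client libraries. It turns the five 1-D cells from the slide into a sparse 2-D mapping keyed by (usec, Symbol). The function name `redimension` and the dict-based representation are assumptions made for this example only.

```python
# Illustrative only: a plain-Python stand-in for the idea of redimensioning.
# The names below are hypothetical and are not part of any SciDB API.

# The five cells of the 1-D array: i -> (Price, Volume, Symbol, usec)
cells_1d = {
    1: (450.61, 150, "AAPL", 36013008713),
    2: (450.73, 200, "AAPL", 36013008915),
    3: (450.84, 10, "AAPL", 36013208113),
    4: (36.57, 75, "MSFT", 36019008713),
    5: (36.20, 100, "MSFT", 36003200113),
}


def redimension(cells):
    """Promote the usec and Symbol attributes to dimensions.

    Returns a sparse 2-D array as a dict keyed by (usec, symbol) whose
    values are the remaining attributes (price, volume).
    """
    array_2d = {}
    for price, volume, symbol, usec in cells.values():
        array_2d[(usec, symbol)] = (price, volume)
    return array_2d


array_2d = redimension(cells_1d)

# A two-dimensional select: all AAPL cells inside a usec range.
selected = sorted(
    (usec, price, volume)
    for (usec, symbol), (price, volume) in array_2d.items()
    if symbol == "AAPL" and 36013008000 <= usec <= 36013009000
)
print(selected)
```

Once usec and Symbol are coordinates rather than values, a range query becomes a selection on coordinates instead of a scan over attribute values.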

SciDB Array Schema

CREATE ARRAY Simple_Array
  < v1 : double, v2 : int64, v3 : string >
  [ I = 0:*, 5, 0, J = 0:9, 5, 0 ];

•  Attributes: v1, v2, v3
•  Dimensions: I, J
•  Dimension size: * is unbounded (I = 0:*)
•  Chunk size: 5 (second value for each dimension)
•  Chunk overlap: 0 (third value for each dimension)
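To make the chunk parameters concrete, here is a small Python sketch (not SciDB code) of how a cell coordinate could be mapped to its chunk under the schema above. It assumes regular chunks of 5 cells that start at each dimension's lower bound of 0. The helper name `chunk_origin` is hypothetical.

```python
# Illustrative sketch, not SciDB internals: regular chunking of a 2-D array
# whose dimensions I and J both start at 0 and use a chunk size of 5.

CHUNK_SIZE = {"I": 5, "J": 5}   # chunk size of each dimension
LOW = {"I": 0, "J": 0}          # lower bound of each dimension


def chunk_origin(i, j):
    """Return the coordinates of the first cell of the chunk holding (i, j)."""
    origin_i = LOW["I"] + ((i - LOW["I"]) // CHUNK_SIZE["I"]) * CHUNK_SIZE["I"]
    origin_j = LOW["J"] + ((j - LOW["J"]) // CHUNK_SIZE["J"]) * CHUNK_SIZE["J"]
    return (origin_i, origin_j)


print(chunk_origin(0, 0))    # (0, 0)
print(chunk_origin(7, 3))    # (5, 0)
print(chunk_origin(12, 9))   # (10, 5)
```

Because J runs from 0 to 9 with a chunk size of 5, it spans exactly two chunks, while the unbounded I dimension can keep adding chunks as data arrive.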

Arrays are distributed with overlap

Supports constant-time moving window aggregates and feature detection, even when data cross node boundaries

[Figure: grid of example cell values (0.01, 0.02, 0.50) illustrating chunks stored with overlap.]
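A minimal 1-D sketch of why overlap helps, written in plain Python rather than SciDB: if each chunk also stores copies of a few neighbouring cells, a moving-window aggregate near a chunk edge can be computed from that chunk alone. The chunk size, overlap, and function names here are assumptions chosen for illustration and do not describe SciDB's internal implementation.

```python
# Illustrative only: 1-D chunks with overlap, and a centered moving-window
# sum computed chunk by chunk without reading neighbouring chunks.

def split_with_overlap(values, chunk_size, overlap):
    """Split values into chunks; each chunk also carries up to `overlap`
    cells copied from its left and right neighbours."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        end = min(start + chunk_size, len(values))
        lo = max(0, start - overlap)
        hi = min(len(values), end + overlap)
        chunks.append({"start": start, "end": end, "lo": lo, "cells": values[lo:hi]})
    return chunks


def window_sums(chunks, radius):
    """Centered window sum (cell +/- radius) using only each chunk's own cells.

    Correct whenever radius <= the overlap used to build the chunks.
    """
    result = []
    for chunk in chunks:
        cells, lo = chunk["cells"], chunk["lo"]
        for pos in range(chunk["start"], chunk["end"]):
            left = max(pos - radius, lo)
            right = min(pos + radius, lo + len(cells) - 1)
            result.append(sum(cells[left - lo:right - lo + 1]))
    return result


data = [1, 2, 3, 4, 5, 6, 7, 8]
chunks = split_with_overlap(data, chunk_size=4, overlap=1)
print(window_sums(chunks, radius=1))
```

Each chunk could live on a different node; the window sums for cells 3 and 4, which sit on either side of the chunk boundary, are still computed without any communication between chunks.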

Live demonstrations

1)  Airline data
    •  Select
    •  Aggregate lateness
    •  Heatmap

2)  Netflix-like data
    •  SVD

3)  Zipcode (lat, long) and population by zipcode
    •  Join
    •  Compute distance-weighted population by zipcode
    •  Plot histogram

4)  Satellite and point-of-interest data
    •  Select region
    •  Regrid and plot
    •  Overlay another dataset: shopping mall locations

Demonstration Cluster

Running on a modest 4-node cluster. Each node has:
•  16 cores
•  128 GB RAM
•  4 x 1TB disks
•  Connected by 1Gbit Ethernet

Also runs on public clouds

Registration Poll Results

What mathematical and statistical computing software do you use? (n = 340)

R         42%
Other     20%
Python    17%
Excel     15%
MATLAB    6%

Please respond to live poll

Try It

Quick Start
•  scidb.org/forum
•  Download a VM or EC2 AMI

Community Edition                 Enterprise Edition
Open Source; Active forum         Commercial license
Unrestricted & fully scalable     Unrestricted & fully scalable
                                  More math functions
                                  Intel MKL support
                                  Failover & fault tolerance
                                  System management tools

Take Away: Less coding, more analysis

•  ACID database
•  Array data model
•  In-database complex math
•  Automatic scale-out & speed-up
•  Programmable from R and Python

www.paradigm4.com

© Paradigm4 Inc.

Questions?

Tell us about your application
•  info@paradigm4.com

Try our Quick Start
•  scidb.org/forum
•  Download a VM or EC2 AMI

www.paradigm4.com

Thanks for your interest!