Big Analytics Without Big Hassles 04/10/14 Webinar

transcript

Big Analytics without Big Hassles

Bryan Lewis, Chief Data Scientist

Alex Poliakov, Solutions Architect

Paradigm4’s SciDB

MPP Database

Array data model

Complex analytics

Commodity clusters or cloud

R & Python

Big analytics without big hassles

Using WebEx

•  Ask questions using the Q&A window

•  This webinar is being recorded

•  Replays will be available

Agenda

1.  Brief Introduction to SciDB

2.  Demos

3.  Q & A

Paradigm4 develops & supports SciDB

Force behind many major advances in databases

Postgres, Vertica, Paradigm4, Illustra, VoltDB, Streambase, DataTamer

Mike Stonebraker: CTO & Co-founder; MIT Professor; ISTC Big Data at MIT

Presenters

Bryan Lewis, Chief Data Scientist

•  Applied Math Ph.D.

•  Founder Rocketcalc; RevolutionAnalytics

•  CRAN contributor

Alex Poliakov, Solutions Architect

•  Decade developing database internals (Netezza, Paradigm4)

•  Solutions: e-commerce, pharma/biotech, insurance, satellite imagery

Three pillars of SciDB

MPP Database

Array data model

Complex analytics

Commodity clusters or cloud

R & Python

Big analytics without big hassles

SciDB Powers NIH NCBI’s 1000 Genomes Project

Running 24 x 7 since Fall 2012

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/

Some commercial use cases

Pharma, Biotech, Healthcare

•  Join public & private data

•  Integrate many data sources

•  Scale up & speed up math

Quant Finance

•  Fast data window selection & scalable math

Image & Sensor Analytics

•  Integrate diverse data with different spatial and temporal resolutions

E-commerce

•  SVD on sparse matrices 50M x 50M

•  Powering recommendation engine

SciDB is a Database

•  ACID support means multiple users can simultaneously read / write / analyze data

•  FAST JOINs

data in files

Arrays are a natural data model

longitude

other dimensions ….

latitude

Stock_ID

other dimensions ….

Native Array DBs vs. Relational DBs

Data that are close together in the coordinate system are stored close to each other on disk.

Important for ordered data and analysis

Array storage supports fast multi-dimensional SELECTs

Illustration credit: Andrei Pandre

SciDB does scalable complex analytics

•  No more ETL hassles to a separate math package

•  Data not constrained to fit in memory

•  Parallel linear algebra

•  Principal component analysis

•  Clustering

•  GLM

•  Machine learning and more

Analyst-Friendly Interfaces

•  Program SciDB from R or Python

•  Naturally reference & manipulate data in SciDB

•  Large computations run on the SciDB cluster

–  Go beyond the scalability limitations of R & Python

We also support AQL and JDBC

Shared-Nothing Cluster Architecture

SciDB Coordinator

SciDB …

SciDB 1

SciDB 2

R + SciDB-R

Python + SciDB-Py

Web Browser

K-replication for redundancy

Scale out horizontally

SciDB Arrays

Each cell in a SciDB array consists of a fixed number of typed attributes (variables). Here is an example cell with four attributes:

Price    Volume   Symbol   usec
450.61   150      “AAPL”   36013008713

SciDB Arrays

     Attributes
D    Price    Volume   Symbol   usec
1    450.61   150      “AAPL”   36013008713
2    450.73   200      “AAPL”   36013008915
3    450.84   10       “AAPL”   36013208113
4    36.57    75       “MSFT”   36019008713
5    36.20    100      “MSFT”   36003200113

A 1-D array looks like an R or Pandas data frame.

This picture shows five cells, each with four attributes.
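To make the data-frame analogy concrete, here is a minimal Python sketch of the five cells above. It is only an illustration in plain Python, not SciDB code; the names Cell, trades, and column are hypothetical and chosen for this example.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """One array cell holding the four typed attributes from the slide."""
    price: float
    volume: int
    symbol: str
    usec: int


# The dimension D (1..5) plays the role of a data frame's row index:
# each coordinate along D holds exactly one cell.
trades = {
    1: Cell(450.61, 150, "AAPL", 36013008713),
    2: Cell(450.73, 200, "AAPL", 36013008915),
    3: Cell(450.84, 10, "AAPL", 36013208113),
    4: Cell(36.57, 75, "MSFT", 36019008713),
    5: Cell(36.20, 100, "MSFT", 36003200113),
}


def column(array, attribute):
    """Return one attribute across all cells, ordered by dimension D."""
    return [getattr(array[d], attribute) for d in sorted(array)]


if __name__ == "__main__":
    print(column(trades, "symbol"))
```

Reading one attribute across all cells, as column does, corresponds to selecting a single column of a data frame.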

SciDB Arrays

The same data “redimensioned” into a 2D array

               Dimension Symbol
               “AAPL”              “MSFT”
usec           Price     Volume    Price    Volume
36003200113                        36.20    100
36013008713    450.61    150
36013008915    450.73    200
36013208113    450.84    10
36019008713                        36.57    75
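The redimensioning step can be sketched in plain Python as follows. This is an illustration of the idea only, not the SciDB operator itself: the function names redimension and select_window are hypothetical, and a dictionary keyed by (usec, symbol) stands in for a sparse 2D array.

```python
# (D, price, volume, symbol, usec) rows of the 1-D example array.
ROWS = [
    (1, 450.61, 150, "AAPL", 36013008713),
    (2, 450.73, 200, "AAPL", 36013008915),
    (3, 450.84, 10, "AAPL", 36013208113),
    (4, 36.57, 75, "MSFT", 36019008713),
    (5, 36.20, 100, "MSFT", 36003200113),
]


def redimension(rows):
    """Promote usec and symbol from attributes to dimensions.

    The result is sparse: only coordinates that hold data are stored.
    """
    grid = {}
    for _d, price, volume, symbol, usec in rows:
        grid[(usec, symbol)] = {"price": price, "volume": volume}
    return grid


def select_window(grid, symbol, usec_lo, usec_hi):
    """Return the usec coordinates of one symbol inside a time window."""
    return sorted(
        usec
        for (usec, sym) in grid
        if sym == symbol and usec_lo <= usec <= usec_hi
    )


if __name__ == "__main__":
    grid = redimension(ROWS)
    print(select_window(grid, "AAPL", 36013008000, 36013009000))
```

Once usec and symbol are coordinates rather than values, a time-window selection for one symbol becomes a range lookup on the dimensions.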

SciDB Array Schema

CREATE ARRAY Simple_Array < v1 : double, v2 : int64, v3 : string > [ I = 0:*, 5, 0, J = 0:9, 5, 0 ];

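Attributes: v1, v2, v3

Dimensions: I, J

Dimension size: * is unbounded

Chunk size: 5 (second value in each dimension specification)

Chunk overlap: 0 (third value in each dimension specification)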
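In the schema above, both I and J use a chunk size of 5, so the array is cut into 5 x 5 blocks. The sketch below shows, under that assumption, how a cell coordinate maps to the block that contains it. It is plain Python for illustration; the function names are hypothetical, and how SciDB assigns chunks to cluster instances is not modeled here.

```python
CHUNK_I = 5            # chunk size along dimension I (from the schema)
CHUNK_J = 5            # chunk size along dimension J (from the schema)
J_LOW, J_HIGH = 0, 9   # J is bounded; I runs from 0 with no upper bound


def chunk_of(i, j):
    """Return the (row, column) index of the chunk holding cell (i, j)."""
    if i < 0 or not (J_LOW <= j <= J_HIGH):
        raise IndexError(f"cell ({i}, {j}) is outside the array bounds")
    return (i // CHUNK_I, j // CHUNK_J)


def chunk_origin(i, j):
    """Return the coordinate of the first cell in the chunk holding (i, j)."""
    ci, cj = chunk_of(i, j)
    return (ci * CHUNK_I, cj * CHUNK_J)


if __name__ == "__main__":
    print(chunk_of(12, 7), chunk_origin(12, 7))
```

Because neighboring coordinates fall into the same chunk, a multi-dimensional range selection only needs to touch the chunks that intersect the requested region.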

Arrays are distributed with overlap

Supports constant-time moving-window aggregates and feature detection, even when data cross node boundaries

[Figure: grid of example cell values (0.01, 0.02, 0.50) illustrating an array distributed with overlap]
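The idea behind chunk overlap can be sketched in one dimension: each chunk also stores copies of a few border cells from its neighbors, so a moving-window aggregate can be computed from chunk-local data alone. The code below is a plain-Python illustration under that assumption, not SciDB's implementation; all function names are hypothetical and the sample values are only illustrative.

```python
def split_with_overlap(values, chunk_size, overlap):
    """Cut values into chunks, each padded with `overlap` neighbor cells.

    Returns (core_start, local_start, local_values) per chunk.
    """
    chunks = []
    for start in range(0, len(values), chunk_size):
        lo = max(0, start - overlap)
        hi = min(len(values), start + chunk_size + overlap)
        chunks.append((start, lo, values[lo:hi]))
    return chunks


def window_mean(values, center, radius):
    """Mean of the cells within `radius` of `center`, clipped at the edges."""
    lo = max(0, center - radius)
    hi = min(len(values), center + radius + 1)
    window = values[lo:hi]
    return sum(window) / len(window)


def chunked_window_means(values, chunk_size, radius):
    """Compute the moving mean chunk by chunk, using only local data."""
    out = []
    for start, lo, local in split_with_overlap(values, chunk_size, radius):
        core_end = min(len(values), start + chunk_size)
        for pos in range(start, core_end):
            # pos - lo is the position of the cell inside the padded chunk.
            out.append(window_mean(local, pos - lo, radius))
    return out


if __name__ == "__main__":
    data = [0.02, 0.01, 0.01, 0.01, 0.01, 0.50,
            0.01, 0.02, 0.01, 0.01, 0.01, 0.02]
    print(chunked_window_means(data, chunk_size=4, radius=1))
```

With the overlap set equal to the window radius, the per-chunk results match a computation over the whole array, which is why no data needs to be fetched from a neighboring chunk at the boundary.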

Live demonstrations

1)  Airline data

•  Select

•  Aggregate lateness

•  Heatmap

2)  Netflix-like data

•  SVD

3)  Zipcode (lat,long) and population by zipcode

•  Join

•  Compute distance-weighted population by zipcode

•  Plot histogram

4)  Satellite and point-of-interest data

•  Select region

•  Regrid and plot

•  Overlay another dataset: shopping mall locations

Demonstration Cluster

Running on a modest 4-node cluster

Each node has

•  16 cores

•  128 GB RAM

•  4 x 1TB disks

•  Connected by 1Gbit Ethernet

Also runs on public clouds

Registration Poll Results

What mathematical and statistical computing software do you use? (n = 340)

Excel    15%
MATLAB   6%
Other    20%
Python   17%
R        42%

Please respond to live poll

Try It

Quick Start

•  scidb.org/forum

•  Download a VM or EC2 AMI

Community Edition

•  Open Source; Active forum

•  Unrestricted & fully scalable

Enterprise Edition

•  Commercial license

•  Unrestricted & fully scalable

•  More math functions

•  Intel MKL support

•  Failover & fault tolerance

•  System management tools

Take Away: Less coding, more analysis

•  ACID database

•  Array data model

•  In-database complex math

•  Automatic scale-out & speed-up

•  Programmable from R and Python

www.paradigm4.com

Questions?

Tell us about your application

•  info@paradigm4.com

Try our Quick Start

•  scidb.org/forum

•  Download a VM or EC2 AMI

www.paradigm4.com

Thanks for your interest!
