Analytics in the cloud

Post on 16-Apr-2017

Analytics in the Cloud

Natalino Busa - Head of Data Science

2 Natalino Busa - @natbusa

Distributed Computing, Machine Learning

Statistics, Big/Fast Data, Streaming Computing

Head of Applied Data Science at Teradata

On most networks:

@natbusa

3 Natalino Busa - @natbusa

Let’s define Cloud Services

8 Natalino Busa - @natbusa

Analytics in the cloud: stacking layers

Bare Metal: Physical Machines

IAAS: Virtual Resources

CAAS: Containers

dPAAS: Datastores, Data Engines

iPAAS: Tools Integration, Flows & Processes

DAAAS: Data Analytics as a Service

- Watson Services
- Azure ML
- Google Cloud ML
- BigML

9 Natalino Busa - @natbusa

Analytics in the cloud: today’s talk

Bare Metal: Physical Machines

IAAS: Virtual Resources

CAAS: Containers

dPAAS: Datastores, Data Engines

iPAAS: Tools Integration, Flows & Processes

DAAAS: Data Analytics as a Service

10 Natalino Busa - @natbusa

“we live in an age of open source datacenters, so we can stack all these things together and we have open source from the ground to ceiling.”

Sam Ramji, CEO of Cloud Foundry

https://www.youtube.com/watch?v=7oCSFcUW-Qk

11 Natalino Busa - @natbusa

Containers vs VMs

12 Natalino Busa - @natbusa

Techs based on Containers

YARN

13 Natalino Busa - @natbusa

Containers as a Service

https://aws.amazon.com/ecs/

For example: Amazon ECS

14 Natalino Busa - @natbusa

CaaS: 6 offerings

https://www.linux.com/news/5-container-service-tools-you-should-know-about

- Project Magnum
- Amazon ECS
- Docker DataCenter
- Google Container Engine

15 Natalino Busa - @natbusa

Most new PaaS solutions are containerized

16 Natalino Busa - @natbusa

PaaS: Big Data SQL Queries

- Batch Oriented, Large Aggregations
- Interactive Queries, Data Exploration
- Interactive Queries, Machine Learning, Streaming: Micro-batching
- Interactive Queries, Machine Learning, Streaming: Event-driven
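
The difference between the two streaming styles named above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the function names `event_driven` and `micro_batches` are hypothetical, batches are cut by count rather than by time window, and no real streaming engine is involved.

```python
from typing import Callable, Iterable, Iterator, List


def event_driven(events: Iterable[int], handle: Callable[[int], None]) -> None:
    """Event-driven style: each event is handled as soon as it arrives."""
    for event in events:
        handle(event)


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Micro-batching style: events are grouped into small batches first."""
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the last, possibly smaller, batch.
        yield batch


events = [1, 2, 3, 4, 5]

seen: List[int] = []
event_driven(events, seen.append)

batches = list(micro_batches(events, batch_size=2))
```

In the event-driven case the handler sees one event at a time; in the micro-batching case downstream code sees small lists, which trades some latency for the ability to process each batch with batch-oriented operations.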

17 Natalino Busa - @natbusa

Advanced Analytics: models and algorithms

18 Natalino Busa - @natbusa

PaaS: Advanced Analytics

Graph analytics:

- Cluster items
- Extract similarities
- Detect patterns

19 Natalino Busa - @natbusa

PaaS: Advanced Analytics

Text analytics:

- Sentiment Analysis
- Language Detection
- Summarization
- Entity extraction

20 Natalino Busa - @natbusa

PaaS: Advanced Analytics

Machine Learning:

- Classification
- Regression
- Clustering
- Forecasting
- Anomaly detection

21 Natalino Busa - @natbusa

PaaS: Advanced Analytics

AI and Deep Learning:

- Unstructured Data
- Object Detection
- Natural Language Processing
- Video Summarization
- Speech Recognition

22 Natalino Busa - @natbusa

PaaS: Advanced Analytics

SQL + Graph + Text + Machine Learning + Voice/Image/Video

23 Natalino Busa - @natbusa

dPaaS: Machine (deep) Learning

... these are just a few examples ...

24 Natalino Busa - @natbusa

Analytics Everywhere

Public Cloud, Managed Cloud, Private Cloud, Private Infra

25 Natalino Busa - @natbusa

iPaaS: Components for Analytics in the Cloud

- SQL: Big Data, Data Warehousing
- NoSQL
- Machine Learning
- Object Stores
- Streaming Computing
- SQL: Relational, Transactional DB

31 Natalino Busa - @natbusa

iPaaS, dPaaS:

Object Stores:
- HDFS, GlusterFS, CephFS, NFS, Swift, Nova, Cassandra, Redis
- S3 (AWS), Storage (GCP), ...

SQL: Relational, Transactional DB:
- MySQL, PostgreSQL, MariaDB
- Oracle (AWS MP)

SQL: Big Data, Data Warehousing:
- Hive, Presto, Spark SQL, Impala
- Redshift (AWS), BigQuery (GCP), Big SQL (IBM)
- Teradata (AWS MP), SAP HANA (AWS MP), Vertica (AWS MP)

NoSQL:
- Cassandra, Redis, HBase, Accumulo, Neo4J, ElasticSearch, MongoDB, Couchbase
- BigTable (GCP), DynamoDB

Machine Learning:
- Spark ML, H2O, Flink, Aerosolve, Theano, TensorFlow, XGBoost
- Azure ML, AWS ML, Google ML, IBM Watson

Streaming Computing:
- Heron (Storm), NiFi, Spark Streaming, Flink, Kafka Streams, Logstash, StreamSQL
- Google DataFlow (GCP)

32 Natalino Busa - @natbusa

iPaaS: Selecting your Analytical Stack

Flexible. Powerful.
- Combinations for this example: 8 * 3 * 4 * 8 * 7 * 7 = 37632

Right tool for the right job:
- Fit for purpose
- Multi-Genre Analytics

Hard to maintain and upgrade:
- Extended skills and know-how
- Component upgrades must be compatible

Hard to configure:
- No matter if cloud, bare metal, or VMs
- Complex stacks with many tools and services
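
The combination count on this slide can be reproduced with a short calculation. The sketch below assumes the six factors count the self-hosted tools listed per category on the preceding components slide (object stores, relational SQL, big data SQL, NoSQL, machine learning, streaming); that reading is an interpretation, and the dictionary keys are illustrative names, not anything defined in the talk.

```python
from math import prod

# Self-hosted options per category, as listed on the components slide.
# The grouping into these six keys is an assumption made for illustration.
stack_options = {
    "object_stores": ["HDFS", "GlusterFS", "CephFS", "NFS",
                      "Swift", "Nova", "Cassandra", "Redis"],
    "relational_sql": ["MySQL", "PostgreSQL", "MariaDB"],
    "big_data_sql": ["Hive", "Presto", "Spark SQL", "Impala"],
    "nosql": ["Cassandra", "Redis", "HBase", "Accumulo",
              "Neo4J", "ElasticSearch", "MongoDB", "Couchbase"],
    "machine_learning": ["Spark ML", "H2O", "Flink", "Aerosolve",
                         "Theano", "TensorFlow", "XGBoost"],
    "streaming": ["Heron (Storm)", "NiFi", "Spark Streaming", "Flink",
                  "Kafka Streams", "Logstash", "StreamSQL"],
}

# Picking exactly one tool per category gives the product of the list sizes.
combinations = prod(len(tools) for tools in stack_options.values())
print(combinations)  # 37632
```

Adding a single extra option to any one category multiplies the total again, which is why the number of possible stacks grows so quickly.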

33 Natalino Busa - @natbusa

iPaaS: Deploy & Manage your own Analytics

How to simplify? Select a bundle!

34 Natalino Busa - @natbusa

iPaaS: bundled recipes & stacks

Select a recipe:
- Hortonworks Data Platform
- Cloudera Data Platform
- Reactive Platform
- Smack Stack
- Pancake Stack
- ELK Stack
- Select your own

35 Natalino Busa - @natbusa

iPaaS: my favorite analytical stacks

Stack            | Object Stores | NoSQL         | SQL: Big Data, Data Warehousing | Machine Learning       | Streaming Computing
All Hadoop (5)   | HDFS          | HBase         | Hive                            | Spark                  | Storm
Smack stack (2)  | Cassandra     | Cassandra     | Spark                           | Spark                  | Spark
Elastic (5)      | HDFS          | ElasticSearch | Hive                            | H2O                    | Kafka
Data Science (8) | HDFS          | ElasticSearch | Hive, Presto                    | Spark, H2O, TensorFlow | Flink
Real Time (2)    | Cassandra     | Cassandra     | Flink                           | Flink                  | Flink
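
The number in parentheses next to each stack appears to be the count of distinct tools that stack needs to cover all five component categories. That reading is an assumption, not something the slide states; the sketch below checks it against the stacks listed above, using an illustrative dictionary `favorite_stacks`.

```python
# One entry per tool slot, in the order: object store, NoSQL,
# big data SQL, machine learning, streaming (repeats are intentional).
favorite_stacks = {
    "All Hadoop": ["HDFS", "HBase", "Hive", "Spark", "Storm"],
    "Smack stack": ["Cassandra", "Cassandra", "Spark", "Spark", "Spark"],
    "Elastic": ["HDFS", "ElasticSearch", "Hive", "H2O", "Kafka"],
    "Data Science": ["HDFS", "ElasticSearch", "Hive", "Presto",
                     "Spark", "H2O", "TensorFlow", "Flink"],
    "Real Time": ["Cassandra", "Cassandra", "Flink", "Flink", "Flink"],
}

# Fewer distinct tools means fewer components to deploy and upgrade.
distinct_tools = {name: len(set(tools)) for name, tools in favorite_stacks.items()}

for name, count in distinct_tools.items():
    print(f"{name}: {count}")
```

Under this reading, the two-tool stacks trade breadth of specialised engines for a much smaller maintenance surface.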

36 Natalino Busa - @natbusa

dPaaS: Managed Analytics

This is hard! Can we access it as a service?

37 Natalino Busa - @natbusa

dPaaS: Managed Hadoop & Spark

HDInsight (Microsoft Azure): Hadoop, Spark, and R as services

Managed Spark Clusters, BigInsights (Hadoop) (IBM)

DataFlow and DataProc (GCP): Flink, Spark and Hadoop Clusters as a Service

EMR (AWS): Hadoop components a la carte

38 Natalino Busa - @natbusa

PaaS: Analytical clusters

Ephemeral:
- Create then Dispose
- Clusters are Short-Lived
- Data Exploration
- Isolated, Personal
- Simple Access Management
- Interactive Analytics

vs

Permanent:
- Clusters are Long-Lived
- Scheduled Operations
- Production ETL
- Co-Ordinated
- Complex Access Management
- Batch Analytics
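
The "create then dispose" lifecycle of an ephemeral cluster maps naturally onto a scoped resource. The sketch below is a minimal illustration only: `FakeCluster` and `ephemeral_cluster` are hypothetical names, and the class merely records what happens instead of calling any real cloud provider API.

```python
from contextlib import contextmanager
from typing import Iterator, List


class FakeCluster:
    """Hypothetical stand-in for a managed cluster; not a real cloud SDK."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log
        self.log.append(f"create {name}")

    def run(self, job: str) -> None:
        self.log.append(f"run {job} on {self.name}")

    def dispose(self) -> None:
        self.log.append(f"dispose {self.name}")


@contextmanager
def ephemeral_cluster(name: str, log: List[str]) -> Iterator[FakeCluster]:
    """Create a cluster for one piece of work and always dispose of it."""
    cluster = FakeCluster(name, log)
    try:
        yield cluster
    finally:
        # Disposal happens even if the exploration code raises.
        cluster.dispose()


log: List[str] = []
with ephemeral_cluster("exploration", log) as cluster:
    cluster.run("ad-hoc query")
```

A permanent cluster would instead be created once and shared by scheduled jobs, which is where coordinated operations and more complex access management come in.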

39 Natalino Busa - @natbusa

DAaaS: Microsoft’s Cortana and ML Studio

40 Natalino Busa - @natbusa

DAaaS: IBM Watson

41 Natalino Busa - @natbusa

DAaaS: Google ML and AI as a service

Cloud Computing for Deep Neural Networks > Train, Score, Data

AI and ML models for:

● Speech (audio)
● Language (text)
● Vision (images/video)

42 Natalino Busa - @natbusa

Summary

• Analytics in the Cloud:

The dawn of a new computing era

• iPaaS, dPaaS:

complexity vs flexibility, it’s a tradeoff

• Computing clusters:

Ephemeral and Persistent

43 Natalino Busa - @natbusa

Head of Applied Data Science at Teradata

Distributed Computing, Machine Learning

Statistics, Big/Fast Data, Streaming Computing

Linkedin and Twitter:

natbusa