
The Heterogeneous Data Lake

Date post: 16-Apr-2017
Transcript
Page 1: The Heterogeneous Data lake

DREMIO

The Heterogeneous Data Lake

Tomer Shiran, Co-Founder & CEO at Dremio
[email protected] | @tshiran
Hadoop Summit Europe 2016

April 13, 2016

Page 2: The Heterogeneous Data lake

Company Background

Jacques Nadeau, Founder & CTO
• Recognized SQL & NoSQL expert
• Apache Arrow & Drill PMC Chair
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)

Tomer Shiran, Founder & CEO
• MapR (VP Product); Microsoft; IBM Research
• Apache Drill Founder
• Carnegie Mellon, Technion

Julien Le Dem, Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)

Top Silicon Valley VCs

• Stealth data analytics startup
• Founded in 2015
• Led by experts in Big Data and open source

Page 3: The Heterogeneous Data lake

The Rise of Heterogeneous Data Infrastructure

[Timeline figure: 1980 to 2016]

Page 4: The Heterogeneous Data lake

Can’t Simply Connect a BI Tool…

• Too slow for interactive analysis

• Manual process to map data to relational model

• NoSQL data is often inconsistent and unclean (e.g., mixed types)
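The "mixed types" problem mentioned above can be made concrete with a small sketch. The records and field names below are made up for illustration; the function simply reports which fields hold more than one Python type across a set of JSON-like documents.

```python
from collections import defaultdict

def mixed_type_fields(records):
    """Return {field: sorted type names} for fields seen with more than one type."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {f: sorted(t) for f, t in seen.items() if len(t) > 1}

# Hypothetical documents, as they might come out of a schema-free store.
docs = [
    {"business_id": 123, "stars": 4.5},
    {"business_id": "abc-456", "stars": 4},
    {"business_id": 789, "stars": None},
]

report = mixed_type_fields(docs)
```

A relational BI tool expects one type per column, so either field in `report` would need cleaning or casting before it could be mapped to a table.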

Page 5: The Heterogeneous Data lake

Can’t Simply ETL the Data Into One System…

The Relational World

[Diagram: a DW and multiple RDBMS systems]

• ETL between similar systems
• SQL -> SQL
• Flat -> flat

• Small & slowly evolving data
• Even then, ETL was hard!

Today

[Diagram: a DW alongside S3, HDFS, Solr, Oracle, MongoDB, SQL Server, HBase and Elastic]

• ETL between very different systems
• Search -> SQL
• Complex -> flat

• Big & rapidly evolving data
• ETL is now much harder…

Page 6: The Heterogeneous Data lake

Page 7: The Heterogeneous Data lake

Towards a Heterogeneous Data Lake…

• A platform that enables data analysis across disparate data sources

• Storage-agnostic
– The data can live anywhere
– Join across disparate data sources
– Leverage the strengths of each data source
  • There’s a reason it was chosen to store that data…

• Client-agnostic
– Tableau, Qlik, Power BI, Excel, R, …

• Scalability & performance
– It’s the era of Big Data…

• Simple & complex analysis

Page 8: The Heterogeneous Data lake

Apache Arrow: Columnar In-Memory Execution

Arrow is backed by the lead developers of the major open source Big Data technologies

10-100x speedup on modern CPUs

High-performance sharing & interchange

High-speed Python and R integration

Apache Arrow is the new standard for columnar in-memory execution technology

Data Sources: Parquet, HBase, Kudu, Phoenix, Hadoop, Cassandra

Execution: Drill, Spark, Impala, Storm

Data Science: Pandas (Python), R, Ibis

Page 9: The Heterogeneous Data lake

Arrow Enables High Performance Interchange

Pre-Arrow

• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects

With Arrow

• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
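The contrast between the pre-Arrow and with-Arrow situations can be illustrated in plain Python. This is only an analogy, not Arrow itself: `pickle` stands in for a per-system serialization step, and a `memoryview` over an `array` stands in for two components agreeing on one in-memory layout.

```python
import pickle
from array import array

# A column of 64-bit integers held in one contiguous buffer.
column = array("q", range(5))

# "Pre-Arrow" analogy: the producer serializes, the consumer deserializes.
# The consumer ends up with its own copy of the data.
copied = pickle.loads(pickle.dumps(column))

# "With Arrow" analogy: both sides agree on the layout, so the consumer
# can read the producer's buffer directly, without a copy.
shared = memoryview(column)

column[0] = 42  # the producer updates a value in place
```

After the update, `shared[0]` reflects the new value because it reads the same memory, while `copied[0]` still holds the old one because serialization produced an independent copy.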

Page 10: The Heterogeneous Data lake

Arrow is Designed for CPU Efficiency

[Figure: Traditional Memory Buffer vs. Arrow Memory Buffer]

• Cache locality
• Super-scalar & vectorized operation
• Minimal structure overhead
• Constant value access
• Operate directly on columnar compressed data
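To make the layout difference concrete, here is a minimal sketch in plain Python (not Arrow's actual implementation). It stores the same made-up records row by row and column by column; the columnar form keeps each field in one contiguous, fixed-width buffer, which is what allows cache-friendly scans and constant-time access to the i-th value.

```python
from array import array

# Row-oriented: one object per record, fields interleaved in memory.
rows = [
    {"business_id": 1, "reviews": 3695},
    {"business_id": 2, "reviews": 3263},
    {"business_id": 3, "reviews": 3011},
]

# Column-oriented: one contiguous, fixed-width buffer per field.
columns = {
    "business_id": array("q", (r["business_id"] for r in rows)),
    "reviews": array("q", (r["reviews"] for r in rows)),
}

# Aggregating one field walks every record object in the row layout...
total_from_rows = sum(r["reviews"] for r in rows)
# ...but only a single dense buffer in the columnar layout.
total_from_columns = sum(columns["reviews"])

# Constant-time access to the i-th value of a column.
second_review_count = columns["reviews"][1]
```

Both totals are identical; the point of the columnar form is how the data is laid out for the CPU, not what result is computed.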

Page 11: The Heterogeneous Data lake

Apache Drill: A Storage-Agnostic Query Engine

1. Query any data source as if it’s a relational database
2. Join data from multiple data sources in a single query

[Diagram: Apache Drill sits between the clients (Tableau, Excel, Qlik, …; custom applications; CLI) and the data sources (HBase, MongoDB*, Elasticsearch*, RDBMS*, MapR, HDFS, NAS, local files, Amazon S3, Azure Blob Storage)]

* Currently being developed/enhanced

Page 12: The Heterogeneous Data lake

Omni-SQL (“SQL-on-Everything”)

Drill: Omni-SQL

Whereas the other engines we're discussing here create a relational database environment on top of Hadoop, Drill instead enables a SQL language interface to data in numerous formats, without requiring a formal schema to be declared. This enables plug-and-play discovery over a huge universe of data without prerequisites and preparation. So while Drill uses SQL, and can connect to Hadoop, calling it SQL-on-Hadoop kind of misses the point. A better name might be SQL-on-Everything, with very low setup requirements.

Page 13: The Heterogeneous Data lake

ARCHITECTURE

Page 14: The Heterogeneous Data lake

Everything Starts With a Drillbit…

• High performance query executor
• In-memory columnar execution
• Directly interacts with data, acquiring knowledge as it reads
• Built to leverage large amounts of memory
• Networked or not
• Exposes ODBC, JDBC, REST
• Built-in Web UI and CLI
• Extensible

Single process (daemon or CLI)

drillbit
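Since a drillbit exposes a REST interface, a query can be submitted with nothing more than an HTTP client. The sketch below only builds the request, so it runs without a server. It assumes a drillbit listening on `localhost:8047` (Drill's default web port) and the `/query.json` endpoint; the host and the query are placeholders to adapt.

```python
import json
import urllib.request

def build_drill_request(sql, base_url="http://localhost:8047"):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_drill_request("SELECT name FROM mongo.yelp.business LIMIT 1")

# To actually run the query against a live drillbit (assumption: one is running):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```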

Page 15: The Heterogeneous Data lake

Data Lake, More Like Data Maelstrom

[Diagram: data spread across three environments]
• Clustered Services: Hadoop Cluster (HDFS), HBase Cluster (HBase, HDFS), Elasticsearch Cluster (ES), MongoDB Cluster (MongoDB)
• Cloud Services: DynamoDB, Amazon S3
• Desktops: Linux, Mac, Windows

Page 16: The Heterogeneous Data lake

Run Drill Co-Located with the Data, or Not

[Diagram: Clustered Services (HDFS, HBase, ES, MongoDB), Cloud Services (DynamoDB, Amazon S3) and Desktops (Linux, Mac, Windows), with drillbit processes added throughout]

Page 17: The Heterogeneous Data lake

Extensible Datastore Architecture

Execution Engine

Storage Plugin API
• MongoDB Plugin
• HBase Plugin
• Hive Plugin
• Kudu Plugin
• Phoenix Plugin
• File Plugin
– FileSystem API: HDFS, S3, MapR-FS
– Format Plugin API: Parquet, JSON, CSV

Chapter 2: Connecting to Datastores
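The layering in this diagram (an execution engine that only talks to a storage plugin interface) can be sketched as follows. This is a toy model in Python, not Drill's actual Java plugin API: the class names, the `scan` method and the in-memory data are all hypothetical.

```python
from abc import ABC, abstractmethod

class StoragePlugin(ABC):
    """Hypothetical plugin interface: turn (workspace, table) into records."""
    @abstractmethod
    def scan(self, workspace, table):
        ...

class InMemoryPlugin(StoragePlugin):
    """Stand-in for a real datastore plugin, backed by nested dicts."""
    def __init__(self, data):
        self._data = data

    def scan(self, workspace, table):
        return iter(self._data[workspace][table])

class Engine:
    """The engine knows plugin names, never datastore specifics."""
    def __init__(self):
        self._plugins = {}

    def register(self, name, plugin):
        self._plugins[name] = plugin

    def scan(self, datastore, workspace, table):
        return list(self._plugins[datastore].scan(workspace, table))

engine = Engine()
engine.register("mongo", InMemoryPlugin({"yelp": {"business": [{"name": "Wicked Spoon"}]}}))
engine.register("dfs", InMemoryPlugin({"root": {"review.json": [{"stars": 5}]}}))

businesses = engine.scan("mongo", "yelp", "business")
```

Adding a new datastore in this model means registering one more `StoragePlugin` implementation; the engine code does not change.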

Page 18: The Heterogeneous Data lake

QUERYING DATA

Page 19: The Heterogeneous Data lake

Referencing a Table

SELECT * FROM production.website.users;

Datastore: production
Workspace: website
Table: users

Chapter 3: The Universal Namespace
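The three-part name can be split mechanically. The helper below is a hypothetical illustration (it is not part of Drill): it splits on dots but leaves backtick-quoted segments, such as file paths, intact.

```python
def split_table_reference(reference):
    """Split 'datastore.workspace.table', honoring backtick-quoted parts."""
    parts, current, quoted = [], [], False
    for ch in reference:
        if ch == "`":
            quoted = not quoted          # toggle quoting, drop the backtick
        elif ch == "." and not quoted:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    if quoted or len(parts) != 3:
        raise ValueError(f"expected datastore.workspace.table, got {reference!r}")
    return tuple(parts)

simple = split_table_reference("production.website.users")
quoted_path = split_table_reference("dfs.root.`/opt/tutorial/yelp/business.json`")
```

The second call shows why the quoting matters: the dot in `business.json` belongs to the table name and must not be treated as a separator.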

Page 20: The Heterogeneous Data lake

Run Your First Query

> SELECT name FROM mongo.yelp.business LIMIT 1;
+--------------------+
| name               |
+--------------------+
| Eric Goldberg, MD  |
+--------------------+

> SELECT name FROM dfs.root.`/opt/tutorial/yelp/business.json` LIMIT 1;
+--------------------+
| name               |
+--------------------+
| Eric Goldberg, MD  |
+--------------------+

Page 21: The Heterogeneous Data lake

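Namespaces & Tables

+-----------------------------+-----------+-------------------+
| Storage Plugin Type         | Workspace | Table             |
+-----------------------------+-----------+-------------------+
| mongo                       | Database  | Collection        |
| hive                        | Database  | Table             |
| hbase                       | Namespace | Table             |
| file (HDFS cluster, S3, …)  | Directory | File or directory |
| …                           | …         | …                 |
+-----------------------------+-----------+-------------------+

User defines these in the datastore configuration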

Page 22: The Heterogeneous Data lake

Joining Across Datastores is Easy!

> SELECT *
  FROM dfs.root.`yelp/review.json` r,
       mongo.yelp.business b
  WHERE r.business_id = b.business_id;

• dfs: alias to a specific file system (S3, HDFS, local, NAS)
• mongo: alias to a specific MongoDB cluster
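Conceptually, the query above is an equi-join between two unrelated record streams. The sketch below shows that idea with a simple hash join in Python; the two lists stand in for the JSON file and the MongoDB collection, and all records are invented for illustration. It says nothing about how Drill actually plans or executes the join.

```python
def hash_join(left, right, key):
    """Inner equi-join of two iterables of dicts on a shared key."""
    index = {}
    for row in right:                      # build side
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:                       # probe side
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined

# Stand-in for dfs.root.`yelp/review.json`
reviews = [
    {"business_id": "b1", "stars": 5},
    {"business_id": "b2", "stars": 3},
    {"business_id": "b1", "stars": 4},
    {"business_id": "b9", "stars": 1},     # no matching business
]
# Stand-in for mongo.yelp.business
businesses = [
    {"business_id": "b1", "name": "Mon Ami Gabi"},
    {"business_id": "b2", "name": "Wicked Spoon"},
]

result = hash_join(reviews, businesses, "business_id")
```

The review for `b9` is dropped because the join is an inner join, which matches the semantics of the WHERE clause in the query.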

Page 23: The Heterogeneous Data lake

What Business Has the Most Reviews on Yelp?

> SELECT b.name AS name, COUNT(*) AS reviews
  FROM dfs.yelp.`review.json` r,
       mongo.yelp.business b
  WHERE r.business_id = b.business_id
  GROUP BY b.business_id, b.name
  ORDER BY reviews DESC
  LIMIT 3;

+-------------------+----------+
| name              | reviews  |
+-------------------+----------+
| Mon Ami Gabi      | 3695     |
| Earl of Sandwich  | 3263     |
| Wicked Spoon      | 3011     |
+-------------------+----------+
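The GROUP BY, ORDER BY and LIMIT clauses of this query amount to a count-and-rank step over the joined rows. A minimal Python equivalent, using made-up rows rather than the real Yelp data, might look like this:

```python
from collections import Counter

# Hypothetical joined rows: one per review, carrying the business name.
joined_rows = [
    {"business_id": "b1", "name": "Mon Ami Gabi"},
    {"business_id": "b2", "name": "Wicked Spoon"},
    {"business_id": "b1", "name": "Mon Ami Gabi"},
    {"business_id": "b3", "name": "Earl of Sandwich"},
    {"business_id": "b1", "name": "Mon Ami Gabi"},
    {"business_id": "b3", "name": "Earl of Sandwich"},
]

# GROUP BY b.business_id, b.name with COUNT(*)
counts = Counter((row["business_id"], row["name"]) for row in joined_rows)

# ORDER BY reviews DESC LIMIT 3
top = [(name, n) for (_, name), n in counts.most_common(3)]
```

Grouping on the ID as well as the name keeps two different businesses that happen to share a name from being merged into one count.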

Page 24: The Heterogeneous Data lake

Native JSON Data Model

{
  "business_id": 123,
  "name": "McDonalds",
  "categories": ["restaurant", "fast food"],
  "attributes": {
    "family friendly": true,
    "fast": true,
    "romantic": false
  }
}

Access Arrays
SELECT categories[0]

Access Maps
WHERE t.attributes.romantic IS TRUE

Flatten Arrays
SELECT name, FLATTEN(categories)

Extract Keys
SELECT name, KVGEN(attributes)

Flatten Maps
SELECT name, FLATTEN(KVGEN(attributes))

Access Embedded JSON Blobs
SELECT d.address.state
FROM (SELECT CONVERT_FROM(t.data, 'JSON') d FROM t)
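For readers unfamiliar with these functions, the following Python sketch mimics what each expression returns for the sample record. The helper `kvgen` is written here for illustration and only approximates the behavior of Drill's KVGEN (a map becomes a list of key/value pairs); `json.loads` plays the role of CONVERT_FROM for an embedded JSON string. The `data` field and its contents are an invented example.

```python
import json

record = {
    "business_id": 123,
    "name": "McDonalds",
    "categories": ["restaurant", "fast food"],
    "attributes": {"family friendly": True, "fast": True, "romantic": False},
    # Hypothetical column holding JSON as a string, for the last example.
    "data": '{"address": {"state": "NV"}}',
}

def kvgen(mapping):
    """Turn a map into a list of {"key": ..., "value": ...} pairs."""
    return [{"key": k, "value": v} for k, v in mapping.items()]

first_category = record["categories"][0]            # categories[0]
is_romantic = record["attributes"]["romantic"]      # t.attributes.romantic
pairs = kvgen(record["attributes"])                 # KVGEN(attributes)
# FLATTEN(KVGEN(attributes)): one output row per key/value pair
flattened = [{"name": record["name"], "attribute": p} for p in pairs]
# CONVERT_FROM on the embedded blob, then d.address.state
state = json.loads(record["data"])["address"]["state"]
```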

Page 25: The Heterogeneous Data lake

Accessing Array Elements

> SELECT categories FROM business LIMIT 2;
+-------------------------------------------+
| categories                                |
+-------------------------------------------+
| ["American (Traditional)","Restaurants"]  |
| ["Chinese","Restaurants"]                 |
+-------------------------------------------+

> SELECT categories[0] FROM business LIMIT 2;
+-------------------------+
| EXPR$0                  |
+-------------------------+
| American (Traditional)  |
| Chinese                 |
+-------------------------+

Page 26: The Heterogeneous Data lake

FLATTEN

• FLATTEN converts a single record with an array field into multiple records
– One output record for each array element

• Non-FLATTENed fields are repeated in each of the output records

> SELECT categories FROM business LIMIT 2;
+-------------------------------------------+
| categories                                |
+-------------------------------------------+
| ["American (Traditional)","Restaurants"]  |
| ["Chinese","Restaurants"]                 |
+-------------------------------------------+

> SELECT FLATTEN(categories) FROM business LIMIT 4;
+-------------------------+
| EXPR$0                  |
+-------------------------+
| American (Traditional)  |
| Restaurants             |
| Chinese                 |
| Restaurants             |
+-------------------------+

Page 27: The Heterogeneous Data lake

DREMIO

Non-FLATTENed Fields are Repeated

> SELECT name, categories FROM business LIMIT 2;
+------------------------------+-------------------------------------------+
| name                         | categories                                |
+------------------------------+-------------------------------------------+
| Deforest Family Restaurant   | ["American (Traditional)","Restaurants"]  |
| Chang Jiang Chinese Kitchen  | ["Chinese","Restaurants"]                 |
+------------------------------+-------------------------------------------+

> SELECT name, FLATTEN(categories) FROM business LIMIT 4;
+------------------------------+-------------------------+
| name                         | EXPR$1                  |
+------------------------------+-------------------------+
| Deforest Family Restaurant   | American (Traditional)  |
| Deforest Family Restaurant   | Restaurants             |
| Chang Jiang Chinese Kitchen  | Chinese                 |
| Chang Jiang Chinese Kitchen  | Restaurants             |
+------------------------------+-------------------------+
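The repetition rule is easy to see in a small Python model of FLATTEN. This is an illustrative reimplementation over the two rows shown on the slide, not Drill's operator; the output column name `category` is chosen here for readability.

```python
def flatten(rows, array_field, output_field):
    """Emit one row per array element, repeating all other fields."""
    out = []
    for row in rows:
        rest = {k: v for k, v in row.items() if k != array_field}
        for element in row[array_field]:
            out.append({**rest, output_field: element})
    return out

business = [
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
    {"name": "Chang Jiang Chinese Kitchen",
     "categories": ["Chinese", "Restaurants"]},
]

flat = flatten(business, "categories", "category")
```

Two input rows with two categories each yield four output rows, and each business name appears once per category, as in the second result table.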

Page 28: The Heterogeneous Data lake

ODBC and JDBC

• Drill includes standard ODBC/JDBC drivers
– ODBC for native apps
– JDBC for Java apps

• User installs the driver on the client
– The same machine as the BI tool

• Driver communicates with Drill cluster(s)

• Make sure driver and cluster are compatible versions

[Diagram: Tableau with the Drill ODBC Driver on a client (e.g., laptop) and TIBCO Spotfire with the Drill JDBC Driver on a client, both connecting to the Drill Cluster]

Page 29: The Heterogeneous Data lake

DEMO TIME!

Page 30: The Heterogeneous Data lake

Thank You

• Learn about Apache Arrow
– Jacques Nadeau’s blog post: www.dremio.com/blog/Apache-Arrow/
– Apache Arrow website: arrow.apache.org

• Download Apache Drill: drill.apache.org

• Reach out to learn more about the Dremio private beta
– Email me: [email protected]
– Sign up on the site: www.dremio.com

Page 31: The Heterogeneous Data lake

APPENDIX

Page 32: The Heterogeneous Data lake

Page 33: The Heterogeneous Data lake

Questions

• User trends based on yelping_since (Mongo)

• Top business categories, with coloring by state

• Which businesses are gross? (Elastic<->Mongo)

• Which of those had the most website clicks?
– distinct(business_id) on elastic, mongo.business, hdfs.default.click

