The Heterogeneous Data Lake

DREMIO

Tomer Shiran, Co-Founder & CEO at [email protected] | @tshiran
Hadoop Summit Europe 2016

April 13, 2016

Company Background

• Stealth data analytics startup
• Founded in 2015
• Led by experts in Big Data and open source

Jacques Nadeau, Founder & CTO
• Recognized SQL & NoSQL expert
• Apache Arrow & Drill PMC Chair
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)

Tomer Shiran, Founder & CEO
• MapR (VP Product); Microsoft; IBM Research
• Apache Drill Founder
• Carnegie Mellon, Technion

Julien Le Dem, Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)

Top Silicon Valley VCs

The Rise of Heterogeneous Data Infrastructure

(Timeline figure: 1980 to 2016)

Can’t Simply Connect a BI Tool…

• Too slow for interactive analysis

• Manual process to map data to relational model

• NoSQL data is often inconsistent & unclean (e.g., mixed types)

Can’t Simply ETL the Data Into One System…

The Relational World
(Diagram: multiple RDBMS sources feeding a DW)
• ETL between similar systems
• SQL -> SQL
• Flat -> flat
• Small & slowly evolving data
• Even then, ETL was hard!

Today
(Diagram: Oracle, SQL Server, MongoDB, HBase, Elastic, Solr, HDFS and S3 feeding a DW)
• ETL between very different systems
• Search -> SQL
• Complex -> flat
• Big & rapidly evolving data
• ETL is now much harder…

Towards a Heterogeneous Data Lake…

• A platform that enables data analysis across disparate data sources

• Storage-agnostic
– The data can live anywhere
– Join across disparate data sources
– Leverage the strengths of each data source
  • There’s a reason it was chosen to store that data…

• Client-agnostic
– Tableau, Qlik, Power BI, Excel, R, …

• Scalability & performance
– It’s the era of Big Data…

• Simple & complex analysis

Apache Arrow: Columnar In-Memory Execution

Arrow is backed by the lead developers of the major open source Big Data technologies

10-100x speedup on modern CPUs

High-performance sharing & interchange

High-speed Python and R integration

Apache Arrow is the new standard for columnar in-memory execution technology

Data Sources: Parquet, HBase, Kudu, Phoenix, Hadoop, Cassandra

Execution: Drill, Spark, Impala, Storm

Data Science: Pandas (Python), R, Ibis

Arrow Enables High Performance Interchange

Pre-Arrow
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects

With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)

Arrow is Designed for CPU Efficiency

(Figure: Traditional Memory Buffer vs. Arrow Memory Buffer)

• Cache locality

• Super-scalar & vectorized operation

• Minimal structure overhead

• Constant value access

• Operate directly on columnar compressed data
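The layout difference behind these bullets can be sketched in a few lines of Python. This is only an illustration of row-oriented versus column-oriented storage, not the Arrow format or any Arrow API; the sample records and the function names are hypothetical.

```python
from array import array

# Hypothetical sample data: three records with two numeric fields.
rows = [
    {"business_id": 1, "stars": 4.5},
    {"business_id": 2, "stars": 3.0},
    {"business_id": 3, "stars": 5.0},
]


def to_columns(records):
    """Pivot row-oriented records into one contiguous typed buffer per field."""
    return {
        "business_id": array("q", (r["business_id"] for r in records)),
        "stars": array("d", (r["stars"] for r in records)),
    }


def average_stars_rowwise(records):
    # Visits every record object even though only one field is needed.
    return sum(r["stars"] for r in records) / len(records)


def average_stars_columnar(columns):
    # Scans a single contiguous buffer of 8-byte doubles.
    stars = columns["stars"]
    return sum(stars) / len(stars)


columns = to_columns(rows)
```

Both functions return the same average; the columnar version reads only the values it needs, stored back to back, which is the property the cache-locality and vectorization bullets rely on.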

Apache Drill: A Storage-Agnostic Query Engine

Clients: Tableau, Excel, Qlik, …; Custom Applications; CLI

Apache Drill

Data sources: HDFS, NAS, Local Files, Amazon S3, Azure Blob Storage, HBase, MapR, MongoDB*, Elasticsearch*, RDBMS*

* Currently being developed/enhanced

1. Query any data source as if it’s a relational database
2. Join data from multiple data sources in a single query

Omni-SQL (“SQL-on-Everything”)

Drill: Omni-SQL

“Whereas the other engines we're discussing here create a relational database environment on top of Hadoop, Drill instead enables a SQL language interface to data in numerous formats, without requiring a formal schema to be declared. This enables plug-and-play discovery over a huge universe of data without prerequisites and preparation. So while Drill uses SQL, and can connect to Hadoop, calling it SQL-on-Hadoop kind of misses the point. A better name might be SQL-on-Everything, with very low setup requirements.”

ARCHITECTURE

Everything Starts With a Drillbit…

• High performance query executor
• In-memory columnar execution
• Directly interacts with data, acquiring knowledge as it reads
• Built to leverage large amounts of memory
• Networked or not
• Exposes ODBC, JDBC, REST
• Built-in Web UI and CLI
• Extensible

drillbit: single process (daemon or CLI)
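As a rough sketch of the REST interface mentioned above, the snippet below builds (but does not send) an HTTP request carrying a SQL query. The host, the port 8047, the `/query.json` path and the `queryType`/`query` payload fields are assumptions based on Drill's default REST setup and should be checked against the documentation for your version; `build_query_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_query_request(sql, host="localhost", port=8047):
    """Build (but do not send) an HTTP POST carrying a SQL query to a drillbit.

    The endpoint path, port and payload shape are assumptions; verify them
    against the REST API documentation of the Drill version in use.
    """
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request("SELECT name FROM mongo.yelp.business LIMIT 1")
# To run it against a live drillbit: urllib.request.urlopen(request)
```

Nothing is sent until `urlopen` is called, so the sketch can be inspected without a running cluster.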

Data Lake, More Like Data Maelstrom

Clustered Services: Hadoop Cluster (HDFS nodes), HBase Cluster (HBase and HDFS nodes), Elasticsearch Cluster (ES nodes), MongoDB Cluster (MongoDB nodes)

Cloud Services: DynamoDB, Amazon S3

Desktops: Linux, Mac, Windows

Run Drill Co-Located with the Data, or Not

(Diagram: clustered services (HDFS, HBase, ES and MongoDB nodes), cloud services (DynamoDB, Amazon S3) and desktops (Linux, Mac, Windows), with drillbit processes placed among them)

Extensible Datastore Architecture

Chapter 2: Connecting to Datastores

Execution Engine

Storage Plugin API
• MongoDB Plugin
• HBase Plugin
• Hive Plugin
• Kudu Plugin
• Phoenix Plugin
• File Plugin
– FileSystem API: HDFS, S3, MapR-FS
– Format Plugin API: Parquet, JSON, CSV

QUERYING DATA

Referencing a Table

SELECT * FROM production.website.users;

Chapter 3: The Universal Namespace

Datastore: production | Workspace: website | Table: users

Run Your First Query

> SELECT name FROM mongo.yelp.business LIMIT 1;
+--------------------+
| name               |
+--------------------+
| Eric Goldberg, MD  |
+--------------------+

> SELECT name FROM dfs.root.`/opt/tutorial/yelp/business.json` LIMIT 1;
+--------------------+
| name               |
+--------------------+
| Eric Goldberg, MD  |
+--------------------+

Namespaces & Tables

Storage Plugin Type          Workspace    Table
mongo                        Database     Collection
hive                         Database     Table
hbase                        Namespace    Table
file (HDFS cluster, S3, …)   Directory    File or directory
…                            …            …

User defines these in the datastore configuration
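To make the datastore.workspace.table naming concrete, here is a small sketch that splits a table reference into its components, treating backtick-quoted parts (which may contain dots and slashes) as a single component. It is an illustrative helper with a hypothetical name, not Drill's actual parser.

```python
import re

# One path component: either a backtick-quoted string or a bare identifier.
_PART = re.compile(r"`([^`]*)`|([^.`]+)")


def split_table_reference(reference):
    """Split 'datastore.workspace.table' into its components.

    Backtick-quoted components may contain dots and slashes.
    Illustrative only; this is not how Drill itself parses names.
    """
    parts = []
    pos = 0
    while pos < len(reference):
        match = _PART.match(reference, pos)
        if match is None:
            raise ValueError(f"cannot parse {reference!r} at offset {pos}")
        quoted, bare = match.group(1), match.group(2)
        parts.append(quoted if quoted is not None else bare)
        pos = match.end()
        if pos < len(reference):
            if reference[pos] != ".":
                raise ValueError(f"expected '.' at offset {pos} in {reference!r}")
            pos += 1  # skip the separator
    return parts


mongo_parts = split_table_reference("mongo.yelp.business")
file_parts = split_table_reference("dfs.root.`/opt/tutorial/yelp/business.json`")
```

With the two references from the earlier slide, the first splits into a plugin, a database and a collection, and the second into a plugin, a workspace and a file path.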

Joining Across Datastores is Easy!

> SELECT *
  FROM dfs.root.`yelp/review.json` r,
       mongo.yelp.business b
  WHERE r.business_id = b.business_id;

dfs: alias to a specific file system (S3, HDFS, local, NAS)

mongo: alias to a specific MongoDB cluster
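What a cross-datastore join computes can be modeled with two in-memory collections standing in for the JSON file and the MongoDB collection. The sketch below uses a simple hash join; it makes no claim about how Drill actually plans or executes the query, and the sample rows are hypothetical.

```python
def hash_join(left, right, key):
    """Inner-join two iterables of dicts on a shared key using a hash table."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    for row in left:
        for match in index.get(row[key], []):
            # Merge the two records into one output row.
            yield {**match, **row}


# Hypothetical stand-ins for dfs.root.`yelp/review.json` and mongo.yelp.business.
reviews = [
    {"business_id": "b1", "stars": 5},
    {"business_id": "b2", "stars": 3},
    {"business_id": "b1", "stars": 4},
    {"business_id": "b9", "stars": 1},  # no matching business, dropped by the join
]
businesses = [
    {"business_id": "b1", "name": "Mon Ami Gabi"},
    {"business_id": "b2", "name": "Wicked Spoon"},
]

joined = list(hash_join(reviews, businesses, "business_id"))
```

Each review with a matching `business_id` yields one combined row, which mirrors the `WHERE r.business_id = b.business_id` condition in the query.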

What Business Has the Most Reviews on Yelp?

> SELECT b.name AS name, COUNT(*) AS reviews
  FROM dfs.yelp.`review.json` r,
       mongo.yelp.business b
  WHERE r.business_id = b.business_id
  GROUP BY b.business_id, b.name
  ORDER BY reviews DESC
  LIMIT 3;

+-------------------+----------+
| name              | reviews  |
+-------------------+----------+
| Mon Ami Gabi      | 3695     |
| Earl of Sandwich  | 3263     |
| Wicked Spoon      | 3011     |
+-------------------+----------+
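The join, GROUP BY, ORDER BY and LIMIT steps of this query can be mirrored in plain Python as a sanity check on what it returns. The data and the `top_reviewed` name are hypothetical, and this is a model of the query's result, not of Drill's execution.

```python
from collections import Counter


def top_reviewed(reviews, businesses, limit=3):
    """Count reviews per business and return the most-reviewed names."""
    # Join side: look up business names by id.
    names = {b["business_id"]: b["name"] for b in businesses}
    # GROUP BY business_id, COUNT(*), keeping only reviews that join.
    counts = Counter(r["business_id"] for r in reviews if r["business_id"] in names)
    # ORDER BY reviews DESC LIMIT n.
    return [(names[bid], n) for bid, n in counts.most_common(limit)]


# Hypothetical sample data.
businesses = [
    {"business_id": "b1", "name": "A"},
    {"business_id": "b2", "name": "B"},
    {"business_id": "b3", "name": "C"},
]
reviews = [{"business_id": bid} for bid in ["b1", "b2", "b1", "b3", "b2", "b1", "b9"]]

top_two = top_reviewed(reviews, businesses, limit=2)
```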

Native JSON Data Model

{
  "business_id": 123,
  "name": "McDonalds",
  "categories": ["restaurant", "fast food"],
  "attributes": {
    "family friendly": true,
    "fast": true,
    "romantic": false
  }
}

Access Arrays

SELECT categories[0]

Access Maps

WHERE t.attributes.romantic IS TRUE

Flatten Arrays

SELECT name, FLATTEN(categories)

Extract Keys

SELECT name, KVGEN(attributes)

Flatten Maps

SELECT name, FLATTEN(KVGEN(attributes))

Access Embedded JSON Blobs

SELECT d.address.state
FROM (SELECT CONVERT_FROM(t.data, 'JSON') d FROM t)
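A minimal Python model of KVGEN, assuming it turns a map into a list of key/value entries, may help when reading the "Extract Keys" and "Flatten Maps" examples. The `kvgen` function below is a hypothetical stand-in for illustration, not Drill code, and it reuses the sample record from this slide.

```python
def kvgen(mapping):
    """Model of KVGEN: turn a map into a list of {"key": ..., "value": ...} entries."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


# The sample record from the slide.
record = {
    "business_id": 123,
    "name": "McDonalds",
    "categories": ["restaurant", "fast food"],
    "attributes": {"family friendly": True, "fast": True, "romantic": False},
}

# SELECT name, KVGEN(attributes): one row, attributes as a list of entries.
pairs = kvgen(record["attributes"])

# SELECT name, FLATTEN(KVGEN(attributes)): one row per entry, name repeated.
flattened = [{"name": record["name"], "attribute": pair} for pair in pairs]
```

Once the map is a list, it can be flattened like any other array, which is why the two functions are combined in the "Flatten Maps" example.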

Accessing Array Elements

> SELECT categories FROM business LIMIT 2;
+-------------------------------------------+
| categories                                |
+-------------------------------------------+
| ["American (Traditional)","Restaurants"]  |
| ["Chinese","Restaurants"]                 |
+-------------------------------------------+

> SELECT categories[0] FROM business LIMIT 2;
+-------------------------+
| EXPR$0                  |
+-------------------------+
| American (Traditional)  |
| Chinese                 |
+-------------------------+

FLATTEN

• FLATTEN converts a single record with an array field into multiple records
– One output record for each array element

• Non-FLATTENed fields are repeated in each of the output records

> SELECT categories FROM business LIMIT 2;
+-------------------------------------------+
| categories                                |
+-------------------------------------------+
| ["American (Traditional)","Restaurants"]  |
| ["Chinese","Restaurants"]                 |
+-------------------------------------------+

> SELECT FLATTEN(categories) FROM business LIMIT 4;
+-------------------------+
| EXPR$0                  |
+-------------------------+
| American (Traditional)  |
| Restaurants             |
| Chinese                 |
| Restaurants             |
+-------------------------+

Non-FLATTENed Fields are Repeated

> SELECT name, categories FROM business LIMIT 2;
+------------------------------+-------------------------------------------+
| name                         | categories                                |
+------------------------------+-------------------------------------------+
| Deforest Family Restaurant   | ["American (Traditional)","Restaurants"]  |
| Chang Jiang Chinese Kitchen  | ["Chinese","Restaurants"]                 |
+------------------------------+-------------------------------------------+

> SELECT name, FLATTEN(categories) FROM business LIMIT 4;
+------------------------------+-------------------------+
| name                         | EXPR$1                  |
+------------------------------+-------------------------+
| Deforest Family Restaurant   | American (Traditional)  |
| Deforest Family Restaurant   | Restaurants             |
| Chang Jiang Chinese Kitchen  | Chinese                 |
| Chang Jiang Chinese Kitchen  | Restaurants             |
+------------------------------+-------------------------+
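The behavior shown in these two result sets can be modeled with a short generator: one output record per array element, with every other field copied unchanged. This is a sketch of the semantics described on the slides, using the same two sample businesses; it is not Drill's implementation.

```python
def flatten(records, array_field):
    """Emit one record per element of array_field, repeating the other fields."""
    for record in records:
        for element in record[array_field]:
            # Copy the record, replacing the array with a single element.
            yield {**record, array_field: element}


# The two businesses from the slide.
business = [
    {
        "name": "Deforest Family Restaurant",
        "categories": ["American (Traditional)", "Restaurants"],
    },
    {
        "name": "Chang Jiang Chinese Kitchen",
        "categories": ["Chinese", "Restaurants"],
    },
]

flattened = list(flatten(business, "categories"))
```

Two input records with two categories each produce four output records, matching the four rows in the second result set.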

ODBC and JDBC

• Drill includes standard ODBC/JDBC drivers
– ODBC for native apps
– JDBC for Java apps

• User installs the driver on the client
– The same machine as the BI tool

• Driver communicates with Drill cluster(s)

• Make sure driver and cluster are compatible versions

(Diagram: Tableau with the Drill ODBC Driver on a client (e.g., laptop) and TIBCO Spotfire with the Drill JDBC Driver on a client, each connecting to the Drill Cluster)

DEMO TIME!

Thank You

• Learn about Apache Arrow
– Jacques Nadeau’s blog post: www.dremio.com/blog/Apache-Arrow/
– Apache Arrow website: arrow.apache.org

• Download Apache Drill: drill.apache.org

• Reach out to learn more about the Dremio private beta
– Email me: [email protected]
– Sign up on the site: www.dremio.com

APPENDIX

Questions

• User trends based on yelping_since (Mongo)

• Top business categories, with coloring by state

• Which businesses are gross? (Elastic<->Mongo)

• Which of those had the most website clicks?
– distinct(business_id) on elastic, mongo.business, hdfs.default.click

Date post:	16-Apr-2017
Category:	Technology
Upload:	dataworks-summithadoop-summit