DREMIO
The Heterogeneous Data Lake
Tomer Shiran, Co-Founder & CEO at Dremio | @tshiran
Hadoop Summit Europe 2016
April 13, 2016
DREMIO
Company Background
Jacques Nadeau, Founder & CTO
• Recognized SQL & NoSQL expert
• Apache Arrow & Drill PMC Chair
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)
Tomer Shiran, Founder & CEO
• MapR (VP Product); Microsoft; IBM Research
• Apache Drill Founder
• Carnegie Mellon, Technion
Julien Le Dem, Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)
Top Silicon Valley VCs
• Stealth data analytics startup
• Founded in 2015
• Led by experts in Big Data and open source
DREMIO
The Rise of Heterogeneous Data Infrastructure
[Figure: timeline of data infrastructure from 1980 to 2016]
DREMIO
Can’t Simply Connect a BI Tool…
• Too slow for interactive analysis
• Manual process to map data to relational model
• NoSQL data often inconsistent & unclean (e.g., mixed types)
DREMIO
Can’t Simply ETL the Data Into One System…
The Relational World
[Diagram: a DW and multiple RDBMS sources]
• ETL between similar systems
• SQL -> SQL
• Flat -> flat
• Small & slowly evolving data
• Even then, ETL was hard!
Today
[Diagram: a DW and S3, HDFS, Solr, Oracle, MongoDB, SQL Server, HBase, Elastic sources]
• ETL between very different systems
• Search -> SQL
• Complex -> flat
• Big & rapidly evolving data
• ETL is now much harder…
DREMIO
Towards a Heterogeneous Data Lake…
• A platform that enables data analysis across disparate data sources
• Storage-agnostic
  – The data can live anywhere
  – Join across disparate data sources
  – Leverage the strengths of each data source
    • There’s a reason it was chosen to store that data…
• Client-agnostic
  – Tableau, Qlik, Power BI, Excel, R, …
• Scalability & performance
  – It’s the era of Big Data…
• Simple & complex analysis
DREMIO
Apache Arrow: Columnar In-Memory Execution
Arrow is backed by the lead developers of the major open source Big Data technologies
10-100x speedup on modern CPUs
High-performance sharing & interchange
High-speed Python and R integration
Apache Arrow is the new standard for columnar in-memory execution technology
Data Sources: Parquet, HBase, Kudu, Phoenix, Hadoop, Cassandra
Execution: Drill, Spark, Impala, Storm
Data Science: Pandas (Python), R, Ibis
DREMIO
Arrow Enables High Performance Interchange
Pre-Arrow
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
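The contrast above can be sketched in plain Python. This is only an illustration of the idea, not the Arrow API: JSON stands in for a system-specific wire format, and a `memoryview` over an `array` stands in for a shared in-memory format. The column values are sample numbers chosen for the sketch.

```python
import json
from array import array

# Hypothetical column of 64-bit integers owned by "system A".
column = array("q", [3695, 3263, 3011])

# Pre-Arrow style: A serializes, B parses the bytes into its own private copy.
wire = json.dumps(column.tolist())
copy_in_b = array("q", json.loads(wire))

# Shared-format style: B reads A's buffer through a view; no bytes are copied.
view_in_b = memoryview(column)

# A write by A is visible through B's view, but not in B's deserialized copy.
column[0] = 4000
shared_sees_update = view_in_b[0] == 4000
copy_sees_update = copy_in_b[0] == 4000
print(shared_sees_update, copy_sees_update)
```

The serialize/parse round trip is the work a common memory format is meant to avoid.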
DREMIO
Arrow is Designed for CPU Efficiency
Traditional Memory Buffer
Arrow Memory Buffer
• Cache locality
• Super-scalar & vectorized operation
• Minimal structure overhead
• Constant value access
• Operate directly on columnar compressed data
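A simplified sketch of a columnar layout, assuming one contiguous buffer per fixed-width field and an offsets buffer plus a data buffer for variable-length strings. It uses only the Python standard library rather than Arrow itself, and the sample rows are illustrative.

```python
from array import array

# Illustrative records (row-oriented form).
rows = [
    {"name": "Mon Ami Gabi", "reviews": 3695},
    {"name": "Earl of Sandwich", "reviews": 3263},
    {"name": "Wicked Spoon", "reviews": 3011},
]

# Fixed-width column: one contiguous buffer of 64-bit integers.
reviews = array("q", (r["reviews"] for r in rows))

# Variable-length column: a data buffer plus an offsets buffer.
data = bytearray()
offsets = array("i", [0])
for r in rows:
    data += r["name"].encode("utf-8")
    offsets.append(len(data))

def name_at(i: int) -> str:
    """Constant-time lookup: two offset reads and one slice."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

total_reviews = sum(reviews)  # scans a single contiguous buffer
print(name_at(1), total_reviews)
```

Scanning `reviews` touches only the bytes of that column, which is the cache-locality argument in miniature.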
DREMIO
Apache Drill: A Storage-Agnostic Query Engine
[Diagram]
Clients: Tableau, Excel, Qlik, …; Custom Applications; CLI
Apache Drill
Data sources: MongoDB*, HBase, Elasticsearch*, RDBMS*, MapR, HDFS, NAS, Local Files, Amazon S3, Azure Blob Storage
* Currently being developed/enhanced
1. Query any data source as if it’s a relational database
2. Join data from multiple data sources in a single query
DREMIO
Omni-SQL (“SQL-on-Everything”)
Drill: Omni-SQL
“Whereas the other engines we're discussing here create a relational database environment on top of Hadoop, Drill instead enables a SQL language interface to data in numerous formats, without requiring a formal schema to be declared. This enables plug-and-play discovery over a huge universe of data without prerequisites and preparation. So while Drill uses SQL, and can connect to Hadoop, calling it SQL-on-Hadoop kind of misses the point. A better name might be SQL-on-Everything, with very low setup requirements.”
DREMIO
ARCHITECTURE
DREMIO
Everything Starts With a Drillbit…
• High performance query executor
• In-memory columnar execution
• Directly interacts with data, acquiring knowledge as it reads
• Built to leverage large amounts of memory
• Networked or not
• Exposes ODBC, JDBC, REST
• Built-in Web UI and CLI
• Extensible
Single process (daemon or CLI)
drillbit
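As a hedged sketch of the REST option, the snippet below builds (but does not send) a query request. It assumes a drillbit reachable on `localhost` with the web port at 8047 and the `/query.json` endpoint; adjust both for a real deployment. `build_query_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Assumption: a local drillbit with its web/REST port at 8047.
DRILL_URL = "http://localhost:8047/query.json"

def build_query_request(sql: str, url: str = DRILL_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one SQL statement."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_query_request("SELECT name FROM mongo.yelp.business LIMIT 1")
print(request.get_method(), request.full_url)

# To run it against a live drillbit (not executed here):
#   with urllib.request.urlopen(request) as resp:
#       result = json.load(resp)
```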
DREMIO
Data Lake, More Like Data Maelstrom
[Diagram]
Clustered Services: Hadoop Cluster, HBase Cluster, Elasticsearch Cluster, MongoDB Cluster
Cloud Services: DynamoDB, Amazon S3
Desktops: Linux, Mac, Windows
DREMIO
Run Drill Co-Located with the Data, or Not
[Diagram: drillbit processes deployed throughout the environment]
Clustered Services: HDFS, HBase, ES, MongoDB
Cloud Services: DynamoDB, Amazon S3
Desktops: Linux, Mac, Windows
DREMIO
Extensible Datastore Architecture
Execution Engine
Storage Plugin API
• MongoDB Plugin
• HBase Plugin
• Hive Plugin
• Kudu Plugin
• Phoenix Plugin
• File Plugin
  – FileSystem API: HDFS, S3, MapR-FS
  – Format Plugin API: Parquet, JSON, CSV
Chapter 2: Connecting to Datastores
DREMIO
QUERYING DATA
DREMIO
Referencing a Table
SELECT * FROM production.website.users;
Chapter 3: The Universal Namespace
Datastore: production | Workspace: website | Table: users
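A minimal sketch of how such a three-part name breaks down, assuming exactly three parts and backtick quoting for parts that contain dots or slashes. `TableRef` and `parse_table_ref` are hypothetical names for illustration, not part of Drill.

```python
import re
from typing import NamedTuple

class TableRef(NamedTuple):
    datastore: str
    workspace: str
    table: str

# Matches either a backtick-quoted part (kept whole) or a bare part between dots.
_PART = re.compile(r"`([^`]*)`|([^.`]+)")

def parse_table_ref(name: str) -> TableRef:
    """Split 'datastore.workspace.table' into its three components."""
    parts = [quoted or bare for quoted, bare in _PART.findall(name)]
    if len(parts) != 3:
        raise ValueError(f"expected datastore.workspace.table, got {name!r}")
    return TableRef(*parts)

print(parse_table_ref("production.website.users"))
print(parse_table_ref("dfs.root.`/opt/tutorial/yelp/business.json`"))
```

The second call shows why quoting matters: a file path used as a table name contains dots that must not be treated as separators.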
DREMIO
Run Your First Query
> SELECT name FROM mongo.yelp.business LIMIT 1;
+--------------------+
| name               |
+--------------------+
| Eric Goldberg, MD  |
+--------------------+
> SELECT name FROM dfs.root.`/opt/tutorial/yelp/business.json` LIMIT 1;
+--------------------+
| name               |
+--------------------+
| Eric Goldberg, MD  |
+--------------------+
DREMIO
Namespaces & Tables
Storage Plugin Type        | Workspace | Table
mongo                      | Database  | Collection
hive                       | Database  | Table
hbase                      | Namespace | Table
file (HDFS cluster, S3, …) | Directory | File or directory
…                          | …         | …
User defines these in the datastore configuration
DREMIO
Joining Across Datastores is Easy!
> SELECT *
  FROM dfs.root.`yelp/review.json` r,
       mongo.yelp.business b
  WHERE r.business_id = b.business_id;
dfs: Alias to a specific file system (S3, HDFS, local, NAS)
mongo: Alias to a specific MongoDB cluster
DREMIO
What Business Has the Most Reviews on Yelp?
> SELECT b.name AS name, COUNT(*) AS reviews
  FROM dfs.yelp.`review.json` r,
       mongo.yelp.business b
  WHERE r.business_id = b.business_id
  GROUP BY b.business_id, b.name
  ORDER BY reviews DESC
  LIMIT 3;
+-------------------+----------+
| name              | reviews  |
+-------------------+----------+
| Mon Ami Gabi      | 3695     |
| Earl of Sandwich  | 3263     |
| Wicked Spoon      | 3011     |
+-------------------+----------+
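To make the query's semantics concrete, here is a small Python emulation of the join, grouping, ordering, and limit. The rows are made up for the sketch (tiny counts, invented IDs, and a hypothetical "Unreviewed Cafe"), and the hash-join strategy is one common approach, not a statement about how Drill plans this query.

```python
from collections import Counter

# Made-up stand-ins for the two sources.
reviews = [  # plays the role of the review file
    {"business_id": "b1"}, {"business_id": "b1"}, {"business_id": "b1"},
    {"business_id": "b2"}, {"business_id": "b2"},
    {"business_id": "b3"},
    {"business_id": "b9"},  # no matching business, dropped by the inner join
]
businesses = [  # plays the role of the MongoDB collection
    {"business_id": "b1", "name": "Mon Ami Gabi"},
    {"business_id": "b2", "name": "Earl of Sandwich"},
    {"business_id": "b3", "name": "Wicked Spoon"},
    {"business_id": "b4", "name": "Unreviewed Cafe"},
]

# Build side of a hash join: index businesses by the join key.
by_id = {b["business_id"]: b for b in businesses}

# Probe side, then GROUP BY (business_id, name) with COUNT(*).
counts = Counter(
    (r["business_id"], by_id[r["business_id"]]["name"])
    for r in reviews
    if r["business_id"] in by_id
)

# ORDER BY reviews DESC LIMIT 3.
top3 = [(name, n) for (_, name), n in counts.most_common(3)]
print(top3)
```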
DREMIO
Native JSON Data Model
{
  "business_id": 123,
  "name": "McDonalds",
  "categories": ["restaurant", "fast food"],
  "attributes": {
    "family friendly": true,
    "fast": true,
    "romantic": false
  }
}
Access Arrays
SELECT categories[0]
Access Maps
WHERE t.attributes.romantic IS TRUE
Flatten Arrays
SELECT name, FLATTEN(categories)
Extract Keys
SELECT name, KVGEN(attributes)
Flatten Maps
SELECT name, FLATTEN(KVGEN(attributes))
Access Embedded JSON Blobs
SELECT d.address.state
FROM (SELECT CONVERT_FROM(t.data, 'JSON') d FROM t)
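The operations on this slide can be emulated on the sample record in plain Python. This sketch only mirrors the semantics: `kvgen` is a hypothetical helper that turns a map into a list of key/value entries, and the embedded JSON blob (including the state "NV") is made up for illustration.

```python
import json

record = {
    "business_id": 123,
    "name": "McDonalds",
    "categories": ["restaurant", "fast food"],
    "attributes": {"family friendly": True, "fast": True, "romantic": False},
}

def kvgen(mapping: dict) -> list:
    """Emulates KVGEN: a map becomes an array of {key, value} entries."""
    return [{"key": k, "value": v} for k, v in mapping.items()]

first_category = record["categories"][0]                # categories[0]
is_romantic = record["attributes"]["romantic"] is True  # attributes.romantic IS TRUE
pairs = kvgen(record["attributes"])                     # KVGEN(attributes)

# FLATTEN(KVGEN(attributes)): one output record per key/value entry.
flat_pairs = [{"name": record["name"], "attribute": p} for p in pairs]

# Embedded JSON blob: parse the string, then navigate into it.
blob = '{"address": {"state": "NV"}}'  # made-up value
state = json.loads(blob)["address"]["state"]
print(first_category, is_romantic, len(flat_pairs), state)
```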
DREMIO
Accessing Array Elements
> SELECT categories FROM business LIMIT 2;
+-------------------------------------------+
| categories                                |
+-------------------------------------------+
| ["American (Traditional)","Restaurants"]  |
| ["Chinese","Restaurants"]                 |
+-------------------------------------------+
> SELECT categories[0] FROM business LIMIT 2;
+-------------------------+
| EXPR$0                  |
+-------------------------+
| American (Traditional)  |
| Chinese                 |
+-------------------------+
DREMIO
FLATTEN
• FLATTEN converts single record with array field into multiple records
  – One output record for each array element
• Non-FLATTENed fields are repeated in each of the output records
> SELECT categories FROM business LIMIT 2;
+-------------------------------------------+
| categories                                |
+-------------------------------------------+
| ["American (Traditional)","Restaurants"]  |
| ["Chinese","Restaurants"]                 |
+-------------------------------------------+
> SELECT FLATTEN(categories) FROM business LIMIT 4;
+-------------------------+
| EXPR$0                  |
+-------------------------+
| American (Traditional)  |
| Restaurants             |
| Chinese                 |
| Restaurants             |
+-------------------------+
DREMIO
Non-FLATTENed Fields are Repeated
> SELECT name, categories FROM business LIMIT 2;
+------------------------------+-------------------------------------------+
| name                         | categories                                |
+------------------------------+-------------------------------------------+
| Deforest Family Restaurant   | ["American (Traditional)","Restaurants"]  |
| Chang Jiang Chinese Kitchen  | ["Chinese","Restaurants"]                 |
+------------------------------+-------------------------------------------+
> SELECT name, FLATTEN(categories) FROM business LIMIT 4;
+------------------------------+-------------------------+
| name                         | EXPR$1                  |
+------------------------------+-------------------------+
| Deforest Family Restaurant   | American (Traditional)  |
| Deforest Family Restaurant   | Restaurants             |
| Chang Jiang Chinese Kitchen  | Chinese                 |
| Chang Jiang Chinese Kitchen  | Restaurants             |
+------------------------------+-------------------------+
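The FLATTEN behavior shown on this slide can be mirrored with a short Python generator. It is an emulation of the semantics on the slide's two records, not Drill's implementation; `flatten` is a hypothetical helper.

```python
from typing import Iterator

businesses = [
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
    {"name": "Chang Jiang Chinese Kitchen",
     "categories": ["Chinese", "Restaurants"]},
]

def flatten(records: list, array_field: str) -> Iterator[dict]:
    """Yield one record per array element, repeating the other fields."""
    for rec in records:
        for element in rec[array_field]:
            yield {**rec, array_field: element}

flattened = list(flatten(businesses, "categories"))
for row in flattened:
    print(row["name"], "|", row["categories"])
```

Two input records with two categories each produce four output rows, with `name` repeated in each.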
DREMIO
ODBC and JDBC
• Drill includes standard ODBC/JDBC drivers
  – ODBC for native apps
  – JDBC for Java apps
• User installs the driver on the client
  – The same machine as the BI tool
• Driver communicates with Drill cluster(s)
• Make sure driver and cluster are compatible versions
[Diagram: Tableau with the Drill ODBC Driver on a client (e.g., laptop) and TIBCO Spotfire with the Drill JDBC Driver on a client, each connecting to the Drill Cluster]
DREMIO
DEMO TIME!
DREMIO
Thank You
• Learn about Apache Arrow
  – Jacques Nadeau’s blog post: www.dremio.com/blog/Apache-Arrow/
  – Apache Arrow website: arrow.apache.org
• Download Apache Drill: drill.apache.org
• Reach out to learn more about the Dremio private beta
  – Email me: [email protected]
  – Sign up on the site: www.dremio.com
DREMIO
APPENDIX
DREMIO
Questions
• User trends based on yelping_since (Mongo)
• Top business categories, with coloring by state
• Which businesses are gross? (Elastic<->Mongo)
• Which of those had the most website clicks?
  – distinct(business_id) on elastic, mongo.business, hdfs.default.click