
Data Discovery At Databricks With Amundsen

Date post: 01-Oct-2021
Data Discovery At Databricks With Amundsen Tao Feng Tianru Zhou
Page 1: Data Discovery At Databricks With Amundsen

Data Discovery At Databricks With Amundsen

Tao Feng, Tianru Zhou

Page 2

Who

Tao Feng
▪ Engineer at Databricks
▪ Co-creator of Amundsen
▪ Apache Airflow PMC
▪ Previously worked at Lyft, LinkedIn, Oracle

Tianru Zhou
▪ Engineer at Databricks
▪ Previously worked at AWS Elasticsearch

Page 3

Data Discovery & Challenges

Page 4

Data-Driven Decisions

Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers

● Axiom: Good decisions are based on data

● Who needs Data? Anyone who wants to make good decisions

○ HR wants to ensure salaries are competitive with market

○ Politician wants to optimize campaign strategy

Page 5

Data-Driven Decisions

1. Data is Collected

2. Analyst Finds the Data

3. Analyst Understands the Data

4. Analyst Creates Report

5. Analyst Shares the Results

6. Someone Makes a Decision

Page 6

Data Discovery Not Productive

● Data Scientists spend up to 30% of their time in Data Discovery

● Data Discovery in itself provides little to no intrinsic value. Impactful work happens in Analysis.

● The answer to these problems is Metadata / Data Catalog

Page 7

Data Catalog to the rescue

• Ease of documentation and discoverability
  ‒ Single searchable portal
  ‒ Display dependencies / lineages between data entities (tables, dashboards)
• Help to answer questions like:
  ‒ Where can I find data about ___?
  ‒ What is the context about the data?
  ‒ Who are the owners that I can ask for access?
  ‒ How is the data created? Is the data trustworthy?
  ‒ How should I use the data? Any sample query, statistics around the column?
  ‒ How frequently does the data refresh?
  ‒ ...

Page 8

Introducing Amundsen

Page 9

What is Amundsen

• In a nutshell, Amundsen is an open-source data discovery and metadata platform for improving the productivity of data analysts, data scientists, and engineers when interacting with data.

• Amundsen is currently hosted at Linux Foundation Data & AI (formerly LFAI) as an incubation project with open governance and an RFC process. (e.g., blog post)

Page 10

Amundsen homepage

Page 11

Dataset detail page

Page 12

Lineage between dashboards and dataset

Page 13

Search for existing dashboards/reports

Page 14

Dashboard detail page

Page 15

Search for co-workers

Page 16

User Profile page

Page 17

Announcement page

• Plugin client for announcing new features or new datasets

Page 18

Central data quality issue portal

• Central portal for users to report data issues.

• Users can see all past issues as well.

• Users can request further context / descriptions from owners through the portal.

Page 19

Data Preview

• Supports data preview for datasets.

• Plugin client with different BI Viz tools (e.g., Apache Superset, BigQuery).

Page 20

Amundsen @ Databricks

Page 21

The Data and AI Company

5000+ customers across the globe

Lakehouse: one simple platform to unify all of your data, analytics, and AI workloads

ORIGINAL CREATORS

Page 22

Databricks Lakehouse

BI Reports & Dashboards

Data Science

Workspace

Machine Learning Lifecycle

Structured, Semi-Structured and Unstructured Data

DELTA ENGINE

Structured transaction layer

High performance query engine

Page 23

Internal dataset discovery at Databricks

● Statically maintained wiki page for golden tables of the central workspace

● Metadata easily becomes stale

● Amundsen to the rescue!

Page 24

Databricks Deployment

Page 25

Deployment (detailed)

VPN

Control plane

amundsen ns

Load balancer

amundsen-frontend

amundsen-search amundsen-metadata

neo4j

LB

Data plane

Databricks notebook

Databricks job service

Amazon RDS to store connections

Page 26

Development

Open source amundsen (git submodule)

Private changes

Private changes

Base layer

Layer m

Layer n

Page 27

Notebook version control

Databricks private repo

Databricks notebook

Generate & grant access token

Syncing changes

Page 28

Metadata surfaced in Amundsen

Lineage information
• Downstream/Upstream tables
• Downstream jobs
• Downstream users of the table
• Job that writes the table
• Writer of the table

Statistics
• Column stats
• Dataset frequent users

Extended information
• Delta table extended metadata
• Redash Dashboards
• Sample data

Page 29

Lineage information

Jobs that write the table

Writer of table

Main lineage info

Page 30

What is table lineage
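
One way to read table lineage, sketched under assumptions: if a workload (job or notebook run) reads table A and writes table B, then A is upstream of B. The event shape `(workload_id, table_name)` and the function names below are hypothetical and only illustrate the idea; they are not the schema or code used at Databricks.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Event = Tuple[str, str]  # (workload_id, table_name); hypothetical event shape
Edge = Tuple[str, str]   # (upstream_table, downstream_table)


def build_lineage(reads: Iterable[Event], writes: Iterable[Event]) -> Set[Edge]:
    """Link every table a workload read to every table the same workload wrote."""
    written_by: Dict[str, Set[str]] = defaultdict(set)
    for workload_id, table in writes:
        written_by[workload_id].add(table)

    edges: Set[Edge] = set()
    for workload_id, source in reads:
        for target in written_by.get(workload_id, ()):
            if source != target:  # skip a workload rewriting its own input
                edges.add((source, target))
    return edges


def downstream_of(table: str, edges: Set[Edge]) -> List[str]:
    """Tables directly derived from `table`."""
    return sorted(target for source, target in edges if source == table)


def upstream_of(table: str, edges: Set[Edge]) -> List[str]:
    """Tables that `table` is directly derived from."""
    return sorted(source for source, target in edges if target == table)
```

With reads `[("job-1", "raw.events")]` and writes `[("job-1", "silver.sessions")]`, the only edge is `raw.events` to `silver.sessions`.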

Page 31

How is the lineage table generated?

Raw lineage pipeline
• Usage_logs
• ReadEventTable (reads), WriteEventTable (writes)
• Insights_table: cleaning + workload aggregation
• Graph: Read <-> Workload <-> Write
• Raw lineage table, with raw table paths

Raw -> processed lineage
• dbfs:/user/hive/… → db.table: string processing
• Paths → View conversion: get Delta metadata (Describe Extended) + string processing + heuristics
• Mount point → Blob path: get mount points (dbutils.fs.mounts()) + string processing
• Processed lineage table, with table names
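
The string-processing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the warehouse layout `<root>/<db>.db/<table>` is the common Hive default and is assumed here (the slide elides the full path), and the mount list is passed in as plain `(mount_point, source)` pairs, which in a notebook could be built from `dbutils.fs.mounts()`. Both function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# ASSUMPTION: default Hive warehouse layout, <root>/<db>.db/<table>/...
# The slide only shows "dbfs:/user/hive/…", so adjust this to your workspace.
WAREHOUSE_ROOT = "dbfs:/user/hive/warehouse/"


def path_to_table(path: str) -> Optional[str]:
    """Map a warehouse path to 'db.table', or None if it is not under the root."""
    if not path.startswith(WAREHOUSE_ROOT):
        return None
    parts = path[len(WAREHOUSE_ROOT):].strip("/").split("/")
    if len(parts) >= 2 and parts[0].endswith(".db"):
        return f"{parts[0][:-3]}.{parts[1]}"
    if parts[0] and not parts[0].endswith(".db"):
        # Tables of the default database sit directly under the root.
        return f"default.{parts[0]}"
    return None


def resolve_mount(path: str, mounts: Sequence[Tuple[str, str]]) -> str:
    """Rewrite a mounted path to its blob-store path; the longest mount point wins."""
    bare = path[len("dbfs:"):] if path.startswith("dbfs:") else path
    best: Optional[Tuple[str, str]] = None
    for mount_point, source in mounts:
        prefix = mount_point.rstrip("/")
        if not prefix:  # skip the root mount "/"
            continue
        if bare == prefix or bare.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, source.rstrip("/"))
    if best is None:
        return path  # not under any mount point; leave unchanged
    return best[1] + bare[len(best[0]):]
```

In a Databricks notebook the mount list could be obtained with `[(m.mountPoint, m.source) for m in dbutils.fs.mounts()]`. Paths that neither function can resolve are where the view conversion and heuristics mentioned above would take over.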

Page 32

Statistics information

Column statistics for numeric data type

Frequent users

Raw usage data also comes from usage_logs table

Get column stats:

analyze table {db}.{table} compute statistics for columns col1, col2

describe extended {db}.{table} `{column name}`
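
A minimal sketch of generating those statements for a list of columns. The helper names are hypothetical; the SQL follows Spark SQL's `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS` and `DESCRIBE EXTENDED <table> <column>` forms. The code only builds strings, so it runs without a Spark session.

```python
from typing import List, Sequence


def _quote(identifier: str) -> str:
    """Backtick-quote an identifier, escaping embedded backticks."""
    return "`" + identifier.replace("`", "``") + "`"


def column_stats_statements(db: str, table: str, columns: Sequence[str]) -> List[str]:
    """Return one ANALYZE statement plus one DESCRIBE EXTENDED per column."""
    if not columns:
        raise ValueError("at least one column is required")
    full_name = f"{_quote(db)}.{_quote(table)}"
    column_list = ", ".join(_quote(c) for c in columns)
    statements = [
        f"ANALYZE TABLE {full_name} COMPUTE STATISTICS FOR COLUMNS {column_list}"
    ]
    statements += [f"DESCRIBE EXTENDED {full_name} {_quote(c)}" for c in columns]
    return statements
```

In a notebook, each statement could be passed to `spark.sql(...)`; the `DESCRIBE EXTENDED` results are expected to carry per-column values such as min, max, and null count, which can then be published as column stats.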

Page 33

Delta table extended metadata

For delta table, we can run:

describe detail table_name

For delta table view, we can run:

describe detail table_name

Extract extended metadata
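
As a rough sketch of the extraction step: `describe detail` on a Delta table returns a single row with fields such as `format`, `location`, `createdAt`, `lastModified`, `partitionColumns`, `numFiles`, and `sizeInBytes`. The function below, whose name and choice of fields are assumptions for illustration, flattens such a row (given as a plain mapping) into string key/value pairs suitable for display.

```python
from typing import Any, Dict, Mapping

# Fields to surface; this selection is an assumption for illustration.
DETAIL_FIELDS = (
    "format",
    "location",
    "createdAt",
    "lastModified",
    "partitionColumns",
    "numFiles",
    "sizeInBytes",
)


def extract_extended_metadata(detail_row: Mapping[str, Any]) -> Dict[str, str]:
    """Keep the selected fields, drop missing values, and stringify the rest."""
    metadata: Dict[str, str] = {}
    for field in DETAIL_FIELDS:
        value = detail_row.get(field)
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        metadata[field] = str(value)
    return metadata
```

In a notebook the input could come from something like `spark.sql("describe detail table_name").first().asDict()`.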

Page 34

Notebook structure

Open source delta_lake_metadata_extractor can be extended easily.

Page 35

Notebook structure

Step 1. Extract delta lake metadata + Publish to Neo4j

Step 2. Publish to Elasticsearch

Page 36

Notebook structure

Step 3. Cleanup stale data
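
The idea behind a stale-data cleanup can be sketched as follows: every record carries the tag of the run that last published it, and anything not touched by the current run is a removal candidate. Amundsen's databuilder ships its own staleness-removal task for this; the code below does not use that API. The function name, the in-memory record store, and the safety threshold are assumptions made for illustration.

```python
from typing import Dict, List


def find_stale(
    records: Dict[str, str],
    current_tag: str,
    max_stale_pct: float = 30.0,
) -> List[str]:
    """Return keys whose publish tag differs from the current run's tag.

    `records` maps a record key to the tag of the run that last wrote it.
    If more than `max_stale_pct` percent would be removed, refuse: a broken
    extraction run should not wipe most of the catalog.
    """
    stale = sorted(key for key, tag in records.items() if tag != current_tag)
    if records and 100.0 * len(stale) / len(records) > max_stale_pct:
        raise RuntimeError(
            f"{len(stale)} of {len(records)} records are stale; "
            f"exceeds the {max_stale_pct}% safety threshold"
        )
    return stale
```

The threshold is the main design choice: it trades a slightly staler catalog for protection against deleting valid metadata after a partial or failed extraction.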

Page 37

Redash dashboards

All redash dashboards that use this table

Page 38

Redash dashboards

View in redash

Copy button

Page 39

Sample data

Sample data tab

Example:

Page 40

WAU (weekly active users)

Page 41

Amundsen Open Source

Page 42

Amundsen Open Source

1500+ community members

2k+ stars for the repo

30+ companies using in production

Also among the top 20 most popular OSS data projects in 2021, based on a Data Council survey

Page 44

Summary

Page 45

Summary

● Solve data discovery challenges with Amundsen
● Integrate Amundsen with Databricks infrastructure
● Amundsen OSS adoption is growing significantly

