Data Discovery At Databricks With Amundsen

Tao Feng, Tianru Zhou

Tao Feng

▪ Engineer at Databricks

▪ Co-creator of Amundsen

▪ Apache Airflow PMC

▪ Previously worked at Lyft, LinkedIn, Oracle

Tianru Zhou

▪ Engineer at Databricks

▪ Previously worked at AWS Elasticsearch

Data Discovery & Challenges

Data-Driven Decisions

Analysts, Data Scientists, General Managers

Engineers, Experimenters, Product Managers

● Axiom: Good decisions are based on data

● Who needs Data? Anyone who wants to make good decisions

○ HR wants to ensure salaries are competitive with market

○ Politician wants to optimize campaign strategy

Data-Driven Decisions

1. Data is Collected

2. Analyst Finds the Data

3. Analyst Understands the Data

4. Analyst Creates Report

5. Analyst Shares the Results

6. Someone Makes a Decision

Data Discovery Not Productive

● Data Scientists spend up to 30% of their time in Data Discovery

● Data Discovery in itself provides little to no intrinsic value. Impactful work happens in Analysis.

● The answer to these problems is Metadata / Data Catalog

Data Catalog to the rescue

• Ease of documentation and discoverability

‒ Single searchable portal

‒ Display dependencies / lineages between data entities (tables, dashboards)

• Help to answer questions like:

‒ Where can I find data about ___?

‒ What is the context about the data?

‒ Who are the owners that I can ask for access?

‒ How is the data created? Is the data trustworthy?

‒ How should I use the data? Any sample query, statistics around the column?

‒ How frequently does the data refresh?

‒ ...

Introducing Amundsen

What is Amundsen

• In a nutshell, Amundsen is an open-source data discovery and metadata platform for improving the productivity of data analysts, data scientists, and engineers when interacting with data.

• Amundsen is currently hosted at the Linux Foundation's LF AI & Data (formerly LF AI) as an incubation project with open governance and an RFC process (e.g., blog post).

Amundsen homepage

Dataset detail page

Lineage between dashboards and dataset

Search for existing dashboards/reports

Dashboard detail page

Search for co-workers

User Profile page

Announcement page

• Plugin client to support new feature or new datasets

Central data quality issue portal

• Central portal for users to report data issues.

• Users could see all the past issues as well.

• Users could request further context / descriptions from owners through the portal.

Data Preview

• Supports data preview for datasets.

• Plugin client with different BI Viz tools (e.g., Apache Superset, BigQuery).

Amundsen @ Databricks

The Data and AI Company

5000+ customers across the globe

Lakehouse: one simple platform to unify all of your data, analytics, and AI workloads

ORIGINAL CREATORS

Databricks Lakehouse

BI Reports & Dashboards

Data Science Workspace

Machine Learning Lifecycle

Structured, Semi-Structured and Unstructured Data

DELTA ENGINE

Structured transaction layer

High performance query engine

Internal dataset discovery at Databricks

● Statically maintained wiki page for golden tables of the central workspace

● Metadata easily becomes stale

● Amundsen to the rescue!

Databricks Deployment

Deployment (detailed)

VPN

Control plane

amundsen ns

Load balancer

amundsen-frontend

amundsen-search

amundsen-metadata

Data plane

Databricks notebook

Databricks job service

Amazon RDS to store connections

Development

Open source amundsen (git submodule)

Private changes

Base layer

Layer m

Layer n

Notebook version control

Databricks private repo

Databricks notebook

Generate & grant access token

Syncing changes

Metadata surfaced in Amundsen

Lineage information

• Downstream/Upstream tables

• Downstream jobs

• Downstream users of the table

• Job that writes the table

• Writer of the table

Statistics

• Column stats

• Dataset frequent users

Extended information

• Delta table extended metadata

• Redash Dashboards

• Sample data

Lineage information

Jobs that write the table

Writer of table

Main lineage info

What is table lineage

How is the lineage table generated?

Raw lineage pipeline

• Usage_logs

• ReadEventTable (reads), WriteEventTable (writes)

• Insights_table: cleaning + workload aggregation

• Graph: Read <-> Workload <-> Write

• Raw Lineage table, with raw table paths

Raw -> processed lineage

• dbfs:/user/hive/… → db.table: string processing

• Paths → View conversion: get Delta metadata (Describe Extended) + string processing + heuristics

• Mount point → Blob path: get mount points (dbutils.fs.mounts()) + string processing

• Processed Lineage table, with table names
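The two stages above can be sketched in plain Python. This is only an illustration under assumptions: the event tuples, workload ids, bucket names, and both function names are hypothetical and are not taken from the Databricks pipeline. The mounts mapping stands in for the mount point and source pairs that dbutils.fs.mounts() reports.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def lineage_edges(
    reads: Iterable[Tuple[str, str]],
    writes: Iterable[Tuple[str, str]],
) -> Set[Tuple[str, str]]:
    """Join reads and writes on the workload id (Read <-> Workload <-> Write).

    Each event is a (workload_id, table_or_path) tuple (hypothetical shape).
    Returns a set of (upstream, downstream) pairs.
    """
    written_by_workload: Dict[str, List[str]] = {}
    for workload_id, target in writes:
        written_by_workload.setdefault(workload_id, []).append(target)

    edges: Set[Tuple[str, str]] = set()
    for workload_id, source in reads:
        for target in written_by_workload.get(workload_id, []):
            if source != target:  # skip a workload rewriting its own input
                edges.add((source, target))
    return edges


def resolve_mount(path: str, mounts: Dict[str, str]) -> str:
    """Rewrite a mounted path to its backing blob path by string processing.

    `mounts` maps mount point -> source. The longest matching mount point wins;
    a path outside every mount point is returned unchanged.
    """
    best_point: Optional[str] = None
    best_source: Optional[str] = None
    for mount_point, source in mounts.items():
        point = mount_point.rstrip("/")
        under_mount = path == point or path.startswith(point + "/")
        if under_mount and (best_point is None or len(point) > len(best_point)):
            best_point, best_source = point, source
    if best_point is None or best_source is None:
        return path
    return best_source.rstrip("/") + path[len(best_point):]
```

Matching on the mount point plus a trailing slash keeps a mount such as /mnt/raw from capturing an unrelated path like /mnt/rawdata.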

Statistics information

Column statistics for numeric data type

Frequent users

Raw usage data also comes from usage_logs table
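As a minimal sketch of how frequent users could be derived from raw usage data, the helper below counts reads per user for each table. The (user, table) event shape, the function name, and the top_n cutoff are assumptions for illustration; the slides do not describe the actual schema of usage_logs.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def frequent_users(
    read_events: Iterable[Tuple[str, str]],
    top_n: int = 3,
) -> Dict[str, List[Tuple[str, int]]]:
    """Return the top_n readers of each table with their read counts.

    Each event is a (user, table) tuple (hypothetical shape).
    """
    counts: Dict[str, Counter] = {}
    for user, table in read_events:
        counts.setdefault(table, Counter())[user] += 1
    return {table: users.most_common(top_n) for table, users in counts.items()}
```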

analyze table {table} compute statistics for columns col1, col2

describe extended {db}.{table} `{column name}` (get column stats)
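The two statements can be driven from a notebook roughly as follows. This is a sketch, not the speakers' code: the helper names are hypothetical, `spark` is assumed to be the notebook's SparkSession, and the result of DESCRIBE EXTENDED on a column is assumed to expose `info_name` and `info_value` fields.

```python
from typing import Any, Dict, List


def quote_ident(name: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def analyze_columns_sql(table: str, columns: List[str]) -> str:
    """Build the statement that computes column-level statistics."""
    cols = ", ".join(quote_ident(c) for c in columns)
    return f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {cols}"


def describe_column_sql(table: str, column: str) -> str:
    """Build the statement that reads back the statistics of one column."""
    return f"DESCRIBE EXTENDED {table} {quote_ident(column)}"


def column_stats(spark: Any, table: str, columns: List[str]) -> Dict[str, Dict[str, str]]:
    """Compute statistics, then collect them per column as name -> value."""
    spark.sql(analyze_columns_sql(table, columns))
    stats: Dict[str, Dict[str, str]] = {}
    for column in columns:
        rows = spark.sql(describe_column_sql(table, column)).collect()
        # Assumed result shape: one row per statistic (min, max, num_nulls, ...).
        stats[column] = {row["info_name"]: row["info_value"] for row in rows}
    return stats
```

Keeping the SQL builders separate from the Spark call makes the statements easy to inspect or log before they run against a large table.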

Delta table extended metadata

For a Delta table, we can run:

describe detail table_name

For a view on a Delta table, we can run:

describe detail table_name

Extract extended metadata
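As a rough sketch of the extraction step: DESCRIBE DETAIL on a Delta table returns a single row with fields such as format, location, partitionColumns, numFiles and sizeInBytes. The helper below flattens such a row into display-ready key/value strings. The function name, the chosen keys, and the sample usage are illustrative assumptions, not the code of the open source delta_lake_metadata_extractor.

```python
from typing import Any, Dict, Mapping, Sequence

# Fields of the DESCRIBE DETAIL row to surface (an illustrative selection).
DEFAULT_KEYS = ("format", "location", "partitionColumns", "numFiles",
                "sizeInBytes", "createdAt", "lastModified")


def detail_to_metadata(
    detail: Mapping[str, Any],
    keys: Sequence[str] = DEFAULT_KEYS,
) -> Dict[str, str]:
    """Flatten one DESCRIBE DETAIL row into string key/value pairs.

    Missing or empty values are skipped; list values are comma-joined.
    """
    metadata: Dict[str, str] = {}
    for key in keys:
        value = detail.get(key)
        if value is None or value == [] or value == {}:
            continue
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        metadata[key] = str(value)
    return metadata


# Hypothetical notebook usage, assuming `spark` is the active SparkSession:
# detail = spark.sql("describe detail db.table").first().asDict()
# extended = detail_to_metadata(detail)
```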

Notebook structure

Open source delta_lake_metadata_extractor can be extended easily.

Step 1. Extract Delta Lake metadata + publish to Neo4j

Step 2. Publish to Elasticsearch

Step 3. Clean up stale data
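Step 3 can be pictured with a small model. One common approach, assumed here rather than confirmed by the slides, is to stamp every record with a tag for the run that published it and then delete records whose tag is not the current one, refusing to proceed if too large a share would be removed. The function names, the tag format, and the threshold parameter are hypothetical; this is not the Amundsen databuilder API.

```python
from typing import Dict, List


def find_stale(records: Dict[str, str], current_tag: str) -> List[str]:
    """Return keys whose publish tag differs from the current run's tag."""
    return sorted(key for key, tag in records.items() if tag != current_tag)


def remove_stale(
    records: Dict[str, str],
    current_tag: str,
    max_stale_pct: float = 5.0,
) -> List[str]:
    """Delete stale records in place and return the removed keys.

    Raises ValueError when the stale share exceeds max_stale_pct, which
    guards against wiping the catalog after a failed or partial extraction.
    """
    stale = find_stale(records, current_tag)
    if records:
        stale_pct = 100.0 * len(stale) / len(records)
        if stale_pct > max_stale_pct:
            raise ValueError(
                f"{stale_pct:.1f}% of records are stale, above the "
                f"{max_stale_pct:.1f}% limit; refusing to delete"
            )
    for key in stale:
        del records[key]
    return stale
```

The threshold trades safety for convenience: a low limit blocks accidental mass deletion but needs a manual override after a legitimate large cleanup.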

Redash dashboards

All Redash dashboards that use this table

View in Redash

Copy button

Sample data

Sample data tab

Example:

Amundsen Open Source

Community members

Stars for the repo

Companies using in production

Also among the top 20 most popular OSS data projects in 2021, based on a Data Council survey

Notable RFCs / PRs

● AWS Neptune metadata datastore (RFC#13)

● MySQL metadata datastore (RFC#019, RFC#021, RFC#023)

● Lineage frontend and backend (RFC#025, RFC#032)

● ETL push model paradigm (PR)

● Other RFCs can be found here

Summary

● Solve data discovery challenges with Amundsen

● Integrate Amundsen with Databricks infrastructure

● Amundsen OSS adoption is growing significantly
