Data Discovery At Databricks With Amundsen
Tao Feng, Tianru Zhou
Who
Tao Feng
▪ Engineer at Databricks
▪ Co-creator of Amundsen
▪ Apache Airflow PMC
▪ Previously worked at Lyft, LinkedIn, Oracle
Tianru Zhou
▪ Engineer at Databricks
▪ Previously worked at AWS Elasticsearch
Data Discovery & Challenges
Data-Driven Decisions
Analysts, Data Scientists, General Managers
Engineers, Experimenters, Product Managers
● Axiom: Good decisions are based on data
● Who needs Data? Anyone who wants to make good decisions
○ HR wants to ensure salaries are competitive with market
○ Politician wants to optimize campaign strategy
Data-Driven Decisions
1. Data is Collected
2. Analyst Finds the Data
3. Analyst Understands the Data
4. Analyst Creates Report
5. Analyst Shares the Results
6. Someone Makes a Decision
Data Discovery Not Productive
● Data Scientists spend up to 30% of their
time in Data Discovery
● Data Discovery in itself provides little to
no intrinsic value. Impactful work
happens in Analysis.
● The answer to these problems is
Metadata / Data Catalog
Data Catalog to the rescue
• Ease of documentation and discoverability
  ‒ Single searchable portal
  ‒ Display dependencies / lineage between data entities (tables, dashboards)
• Helps answer questions like:
  ‒ Where can I find data about ___?
  ‒ What is the context of the data?
  ‒ Who are the owners that I can ask for access?
  ‒ How is the data created? Is the data trustworthy?
  ‒ How should I use the data? Are there sample queries or statistics for the columns?
  ‒ How frequently does the data refresh?
  ‒ ...
Introducing Amundsen
What is Amundsen
• In a nutshell, Amundsen is an open-source data discovery and metadata platform for improving the productivity of data analysts, data scientists, and engineers when interacting with data.
• Amundsen is currently hosted at Linux Foundation Data & AI (formerly LFAI) as an incubation project with open governance and an RFC process (e.g. blog post).
Amundsen homepage
Dataset detail page
Lineage between dashboards and dataset
Search for existing dashboards/reports
Dashboard detail page
Search for co-workers
User Profile page
Announcement page
• Plugin client to support new features or new datasets
Central data quality issue portal
• Central portal for users to report data issues.
• Users can also see all past issues.
• Users can request further context / descriptions from owners through the portal.
Data Preview
• Supports data preview for datasets.
• Plugin client for different BI / visualization tools (e.g. Apache Superset, BigQuery).
Amundsen @ Databricks
The Data and AI Company
5000+ customers across the globe
Lakehouse: one simple platform to unify all of your data, analytics, and AI workloads
ORIGINAL CREATORS
Databricks Lakehouse
BI Reports & Dashboards
Data Science
Workspace
Machine Learning Lifecycle
Structured, Semi-Structured and Unstructured Data
DELTA ENGINE
Structured transaction layer
High performance query engine
Internal dataset discovery at Databricks
● Statically maintained wiki page for golden tables of the central workspace
● Metadata easily becomes stale
● Amundsen to the rescue!
Databricks Deployment
Deployment (detailed)
VPN
Control plane
amundsen ns
Load balancer
amundsen-frontend
amundsen-search amundsen-metadata
neo4j
LB
Data plane
Databricks notebook
Databricks job service
Amazon RDS to store connections
Development
Open source amundsen (git submodule)
Private changes
Private changes
Base layer
Layer m
Layer n
Notebook version control
Databricks private repo
Databricks notebook
Generate & grant access token
Syncing changes
Metadata surfaced in Amundsen
Lineage information
• Downstream/Upstream tables
• Downstream jobs
• Downstream users of the table
• Job that writes the table
• Writer of the table
Statistics
• Column stats
• Dataset frequent users
Extended information
• Delta table extended metadata
• Redash Dashboards
• Sample data
Lineage information
Jobs that write the table
Writer of table
Main lineage info
What is table lineage
How is the lineage table generated?
Raw lineage pipeline
Usage_logs
ReadEventTable (reads)
WriteEventTable (writes)
Insights_table
Cleaning + workload aggregation
Graph: Read <-> Workload <-> Write
Raw -> processed lineage
Raw lineage table, with raw table paths
dbfs:/user/hive/… → db.table: string processing
Paths → view conversion: get Delta metadata (Describe Extended) + string processing + heuristics
Mount point → blob path: get mount points (dbutils.fs.mounts()) + string processing
Processed lineage table, with table names
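The two stages above can be sketched in plain Python. This is only an illustration under stated assumptions, not the pipeline used at Databricks: the event shape (workload id, path), the function names, the mount mapping, and the warehouse root `dbfs:/example/warehouse` with a `<db>.db/<table>` layout are all hypothetical. In a notebook, the mount mapping could presumably be built from `dbutils.fs.mounts()`, whose entries expose `mountPoint` and `source`.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Hypothetical event shape: (workload_id, raw_table_path).
Event = Tuple[str, str]
Edge = Tuple[str, str]


def build_edges(reads: Iterable[Event], writes: Iterable[Event]) -> Set[Edge]:
    """Join reads and writes on the workload id: (path read, path written)."""
    written_by_workload: Dict[str, List[str]] = {}
    for workload, path in writes:
        written_by_workload.setdefault(workload, []).append(path)
    edges: Set[Edge] = set()
    for workload, source in reads:
        for target in written_by_workload.get(workload, []):
            if source != target:  # ignore a workload rewriting its own input
                edges.add((source, target))
    return edges


def resolve_mount(path: str, mounts: Dict[str, str]) -> str:
    """Replace the longest matching mount point prefix with its blob source."""
    best = ""
    for mount_point in mounts:
        prefix = mount_point.rstrip("/") + "/"
        if path.startswith(prefix) and len(mount_point) > len(best):
            best = mount_point
    if not best:
        return path
    return mounts[best].rstrip("/") + path[len(best.rstrip("/")):]


def path_to_table_name(path: str, warehouse_root: str) -> Optional[str]:
    """Map <warehouse_root>/<db>.db/<table> to db.table; None if it does not fit."""
    prefix = warehouse_root.rstrip("/") + "/"
    if not path.startswith(prefix):
        return None
    parts = path[len(prefix):].strip("/").split("/")
    if len(parts) < 2 or not parts[0].endswith(".db"):
        return None
    return f"{parts[0][:-3]}.{parts[1]}"


def normalize(path: str, mounts: Dict[str, str], warehouse_root: str) -> str:
    """Prefer a table name; otherwise fall back to the mount-resolved path."""
    resolved = resolve_mount(path, mounts)
    return path_to_table_name(resolved, warehouse_root) or resolved


# Hypothetical inputs for illustration only.
ROOT = "dbfs:/example/warehouse"
MOUNTS = {"/mnt/lake": "s3a://example-bucket/lake"}
reads = [("job-1", "/mnt/lake/raw/events"), ("job-2", f"{ROOT}/sales.db/orders")]
writes = [("job-1", f"{ROOT}/sales.db/orders"), ("job-2", f"{ROOT}/sales.db/daily")]

raw_edges = build_edges(reads, writes)
processed_edges = {
    (normalize(s, MOUNTS, ROOT), normalize(t, MOUNTS, ROOT)) for s, t in raw_edges
}
```

The view conversion step is left out of the sketch because it depends on heuristics the slides do not spell out.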
Statistics information
Column statistics for numeric data type
Frequent users
Raw usage data also comes from usage_logs table
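As a rough illustration of how frequent users could be derived from read events, here is a minimal sketch. The event shape (user, table) and the function name are assumptions; the actual usage_logs schema is not shown in the slides.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def frequent_users(
    read_events: Iterable[Tuple[str, str]], top_n: int = 3
) -> Dict[str, List[Tuple[str, int]]]:
    """Count reads per (table, user) and keep the top_n readers of each table."""
    counts: Dict[str, Counter] = {}
    for user, table in read_events:
        counts.setdefault(table, Counter())[user] += 1
    return {table: c.most_common(top_n) for table, c in counts.items()}


# Hypothetical read events for illustration only.
events = [
    ("alice", "sales.orders"),
    ("bob", "sales.orders"),
    ("alice", "sales.orders"),
    ("carol", "sales.daily"),
]
top_readers = frequent_users(events, top_n=2)
```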
analyze table {table} compute statistics for columns col1, col2
describe extended {db}.{table} `{column name}`
Get column stats
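The two statements can be generated and their output parsed along the following lines. This is a sketch under assumptions: the helper names are hypothetical, and the parser assumes the column-level describe output arrives as (name, value) pairs with the string "NULL" marking a statistic that was not computed. In a notebook the statements would presumably be passed to `spark.sql(...)`.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Statistics of interest for numeric columns (an assumed selection).
WANTED_STATS = ("min", "max", "num_nulls", "distinct_count")


def analyze_statement(table: str, columns: List[str]) -> str:
    """Build the statement that computes column-level statistics."""
    return f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {', '.join(columns)}"


def describe_column_statement(table: str, column: str) -> str:
    """Build the statement that reads back the statistics of one column."""
    return f"DESCRIBE EXTENDED {table} `{column}`"


def parse_column_stats(rows: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, str]:
    """Keep only the wanted statistics that actually have a value."""
    stats: Dict[str, str] = {}
    for name, value in rows:
        if name in WANTED_STATS and value not in (None, "NULL"):
            stats[name] = value
    return stats


# Hypothetical describe output for illustration only.
sample_rows = [
    ("col_name", "price"),
    ("data_type", "double"),
    ("min", "0.5"),
    ("max", "99.0"),
    ("num_nulls", "NULL"),
    ("distinct_count", "42"),
]
price_stats = parse_column_stats(sample_rows)
```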
Delta table extended metadata
For a Delta table, we can run:
describe detail table_name
For a view over a Delta table, we can run:
describe extended table_name
Extract extended metadata
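For a Delta table, the single row returned by describe detail can be turned into key/value pairs for display. The sketch below assumes the row has already been converted to a dictionary (in a notebook, something like `spark.sql("describe detail table_name").first().asDict()`); the choice of fields and the function name are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Fields from the describe detail output that seem useful to surface
# (an assumed selection, not the authors' list).
DETAIL_FIELDS = (
    "format",
    "location",
    "createdAt",
    "lastModified",
    "partitionColumns",
    "numFiles",
    "sizeInBytes",
)


def detail_to_descriptions(detail: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Flatten the selected detail fields into (name, text) pairs."""
    pairs: List[Tuple[str, str]] = []
    for field in DETAIL_FIELDS:
        value = detail.get(field)
        if value is None or value == []:
            continue  # skip fields that are missing or empty
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(item) for item in value)
        pairs.append((field, str(value)))
    return pairs


# Hypothetical detail row for illustration only.
sample_detail = {
    "format": "delta",
    "location": "dbfs:/example/warehouse/sales.db/orders",
    "partitionColumns": ["ds"],
    "numFiles": 12,
    "sizeInBytes": 4096,
    "description": None,
}
descriptions = detail_to_descriptions(sample_detail)
```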
Notebook structure
Open source delta_lake_metadata_extractor can be extended easily.
Step 1. Extract delta lake metadata + Publish to Neo4j
Step 2. Publish to Elasticsearch
Step 3. Clean up stale data
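The three steps can be pictured with the following self-contained sketch. It does not use the Amundsen databuilder API: plain dictionaries stand in for Neo4j and Elasticsearch, and the tag-based cleanup (remove anything not stamped by the current run) is an assumption about how stale data might be detected. All names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def run_sync(
    extract: Callable[[], Iterable[Record]],
    graph: Dict[str, Record],
    index: Dict[str, Record],
    job_tag: str,
) -> List[str]:
    """Run the three notebook steps once and return the keys removed as stale."""
    # Step 1: extract metadata and publish it to the graph store,
    # stamping every record with the tag of this run.
    for record in extract():
        graph[record["key"]] = {**record, "published_tag": job_tag}

    # Step 2: publish the freshly tagged records to the search index.
    for key, record in graph.items():
        if record["published_tag"] == job_tag:
            index[key] = record

    # Step 3: anything not touched by this run is considered stale.
    stale = [key for key, rec in graph.items() if rec["published_tag"] != job_tag]
    for key in stale:
        del graph[key]
        index.pop(key, None)
    return stale


# Hypothetical stores and extractor for illustration only.
graph_store: Dict[str, Record] = {
    "sales.dropped": {"key": "sales.dropped", "published_tag": "run-1"},
}
search_index: Dict[str, Record] = {"sales.dropped": graph_store["sales.dropped"]}


def extract_tables() -> List[Record]:
    return [{"key": "sales.orders"}, {"key": "sales.daily"}]


removed = run_sync(extract_tables, graph_store, search_index, job_tag="run-2")
```

Running the cleanup last means a failed extraction leaves the previous metadata in place rather than wiping it.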
Redash dashboards
All redash dashboards that use this table
View in redash
Copy button
Sample data
Sample data tab
Example:
WAU
Amundsen Open Source
1500+
Community members
2k+
Stars for the repo
30+
Companies using in production
Also among the top 20 most popular OSS data projects in 2021, based on the Data Council survey
Notable RFCs / PRs
● AWS Neptune metadata datastore (RFC#13)
● MySQL metadata datastore (RFC#019, RFC#021, RFC#023)
● Lineage frontend and backend (RFC#025, RFC#032)
● ETL push model paradigm (PR)
● Other RFCs can be found here
Summary
● Solve data discovery challenges with Amundsen
● Integrate Amundsen with Databricks infrastructure
● Amundsen OSS adoption is growing significantly