
ASCEND - datacouncil.ai Council... · About Ascend Founded 2015 Team of 30 We

Date post: 20-Jul-2020
ASCEND Intelligent Orchestration: Data’s missing link Sean Knapp, Founder @ Ascend.io
Transcript
Page 1:

ASCEND
Intelligent Orchestration: Data’s missing link

Sean Knapp, Founder @ Ascend.io

Page 2:

Topics

● Quick Intro

● The State of Data Architectures

● Why Pipelines Suck

● Building a new Control Plane

● Making it Scale

● SaaS-ifying

● Future Topics

Page 3:

Quick Intro

About Ascend
● Founded 2015
● Team of 30
● We <3 Data Pipelines

About Me (Sean Knapp) 👋🏻
● 15 years building data platforms & teams
● Search Frontend TL @Google: first MapReduce in 2004
● Founder & CTO @Ooyala: 4B+ events/day
● Founder & CEO @Ascend: 1T+ events/day

Page 4:

Smarter Pipelines. Less Code.

Page 5:

Despite advancements in every other part of the data lifecycle...

Building. Pipelines. SUCKS.

Page 6:

The Current State of Data Architectures

[Diagram: data flows from sources (Interaction Data, Transaction Data, Data Replication) through pipelines to consumers (Analytics & BI, Machine Learning, Data Science/Adv. Analytics).]

● Pipeline steps: Collect, Normalize, Augment, Refine
● Data states: Raw, Clean, Enriched, Curated
● Consumption steps: Access, Model, Query, Publish
● Orchestrate: Airflow, Glue, Data Factory, Data Fusion
● Compute
● Store: S3, GCS
● Data Warehouse: Redshift, BigQuery

Pipelines: 90% of Time & Code

Page 7:

Why Pipelines Suck

Page 8:

Why Pipelines Suck

Databases & Warehouses

Where this...

    SELECT date, country, gender, SUM(clicks)
    FROM user_events
    WHERE date >= DATE_SUB(NOW(), INTERVAL 30 DAY)
    GROUP BY date, country, gender

...goes through the Database Engine to produce Results.

Database Engine: (1.5M lines of code) (25M lines of code)

Pipelines

...becomes 1,000s of lines of this...

● Monitoring For New Data
● Ingest & Reformat Data
● Profile & Partition Data
● Analyze Downstream Dependencies
● Incremental Processing & Updates
● Intermediate Persistence
● Data & Task Deduplication
● Spark Parameterization & Tuning
● Data Consistency & Integrity Checks
● Error-Handling, Classification & Recovery
● Data Lineage & Privacy Compliance
● Garbage Collection & Lifecycle Management

We are manually creating a query plan, for every stage, of every pipeline

Page 9:

Evolution of Pipeline Orchestration

                     1.0              2.0
Hosting Model        Roll-Your-Own    SaaS
Code Generation      Manual           Templatized
Interaction Model    Code             Code + GUI
Control System       Scheduler        Scheduler
Programming Model    Imperative       Imperative

Examples: Azure Data Factory, Google Data Fusion, AWS Glue

Page 10:

We looked for ideas in adjacent spaces...

Page 11:

Who here has used a Database?

Page 12:

Who here has heard of React?

Page 13:

Who here uses Kubernetes?

Page 14:

What do they all have in common?

They’re Declarative.

Page 15:

Declarative programming is a programming paradigm that expresses the logic of a computation without describing its control flow…

… [in an] attempt to minimize or eliminate side effects by describing what the program must accomplish…

… rather than describe how to accomplish it.
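
To make the distinction concrete, here is a small Python sketch (an illustration, not taken from the talk). The same aggregation is written twice: imperatively, with the loop and state spelled out, and declaratively, as SQL handed to SQLite through the standard `sqlite3` module. The table name `user_events` echoes the query shown earlier; the sample rows and function names are made up.

```python
import sqlite3

# Made-up sample rows: (country, clicks)
rows = [("US", 3), ("DE", 5), ("US", 4)]

def clicks_by_country_imperative(events):
    """Imperative: we write the loop and manage the running totals ourselves."""
    totals = {}
    for country, clicks in events:
        totals[country] = totals.get(country, 0) + clicks
    return totals

def clicks_by_country_declarative(events):
    """Declarative: we state the result we want; SQLite decides how to compute it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user_events (country TEXT, clicks INTEGER)")
    conn.executemany("INSERT INTO user_events VALUES (?, ?)", events)
    result = dict(conn.execute(
        "SELECT country, SUM(clicks) FROM user_events GROUP BY country"
    ))
    conn.close()
    return result
```

Both functions return the same totals for `rows`; only the second leaves the execution strategy to the engine.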

Page 16:

Declarative vs. Imperative

Page 17:

The desired outcome

Page 18:

What vs How

Declarative Imperative

Page 19:

The usual outcome

Declarative Imperative

Page 20:

Declarative vs. Imperative

Declarative

Pros
● Less Code (like... A LOT less)
● Faster Dev Cycles
● Adaptive to Changes
● Less Maintenance

Cons
● Domain specific
● Difficult to manually optimize
● Require annotations to override automated behaviors

Imperative

Pros
● Flexible
● High Levels of Control

Cons
● State Management
● Stale assumptions in code
● Manual Optimizations
● Integrity Checks
● Failure Management

Page 21:

“Imperative gives you the ability to do anything, and the responsibility to do everything.”

- Steven Parkes, CTO @ Ascend

Page 22:

Building a Control System for Declarative Pipelines

Page 23:

Our master plan...

Declarative Pipelines

Page 24:

What should a good control system do?

Page 25:

● Lineage Tracking
● Global optimization + local namespacing / sandboxing
● Storage Optimization
● Data-aware/Profiling
● Data Backfill
● Metadata Centric
● Dynamic Partitioning
● Functional (Immutable of Blocks)
● Static Type Checking
● SLA-driven Scheduling
● Graph Algorithms
● Pipeline Rewriting
● Pipeline Evolution
● Graph Rewriting
● Dynamic Scheduling (time + trigger + auto)
● Time series Optimizations
● Garbage Collection
● Data Consistency & Integrity Checks --> Data Repair
● Inter-transform Optimization
● Provenance
● Multi-Cloud + Region + Zone + etc.
● Configuration Management
● Declarative
● Dataflow Scheduling
● Static Type Influence
● Aggregate Data Types
● Semantically Transparent
● Asynchronous Scheduling
● SQL + Other Languages (Python & PySpark)
● Data & Task Deduplication
● Separation of Logic & Execution
● Hierarchical Components
● Functional Decomposition
● Error Classification & Intelligent Retries

Page 26:

Then we spent 1, 2, 3 years building it!

Page 27:

So... how does it work?

Page 28:

Separation of Logic & Control

● Logic Plane: User defined logic
● Control Plane: Dynamic task generation to achieve desired state
● Data Plane: Fully managed, portable cloud services

Page 29:

The Control System Answers

1. Is there anything I need to do?
2. What is the current state of my world?
3. What should it be?
4. What doesn’t match?
5. How do I “fix” it?
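
The five questions map naturally onto a reconciliation loop. The following is a minimal sketch of one pass of such a loop, not Ascend's implementation: `desired`, `actual`, and `fix` are hypothetical stand-ins for the declared logic, the observed state, and the work needed to close the gap.

```python
def reconcile(desired, actual, fix):
    """One pass of a declarative control loop.

    desired: dict mapping a key to the state it should have
    actual:  dict mapping a key to the state it currently has (mutated in place)
    fix:     callable(key, wanted_state) -> new state; stands in for real work
    Returns the list of keys that were repaired.
    """
    # Questions 2-4: compare the current state with the desired state.
    mismatched = [k for k, want in desired.items() if actual.get(k) != want]
    # Question 1: if nothing mismatches, there is nothing to do.
    # Question 5: "fix" each mismatch and record the new state.
    for key in mismatched:
        actual[key] = fix(key, desired[key])
    return mismatched
```

Running the pass again after a successful repair returns an empty list, which is the property that makes such a loop safe to repeat.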

Page 30:

1) Is there anything I need to do?

A Simple Analytics Pipeline

Logic Plane
● Source: files in S3
  ○ Location
  ○ Creds
  ○ Schema
● Transform (SQL)
  ○ Filter
  ○ Join / Enrich
● Transform (SQL)
  ○ Daily Aggregation
● Write to DB
  ○ Hostname
  ○ Creds
  ○ Update Rules

Page 31:

2) What is the current state of my world?

A Simple Analytics Pipeline

Logic Plane
● Source: files in S3
  ○ Location
  ○ Creds
  ○ Schema
● Transform (SQL)
  ○ Filter
  ○ Join / Enrich
● Transform (SQL)
  ○ Daily Aggregation
● Write to DB
  ○ Hostname
  ○ Creds
  ○ Update Rules

Control Plane

Data Plane
gs://ascend-io-dev-sean-dev-sean-record-fragments/21dd3c96_b039_4cc1_9566_2cfbb3ebe091/_ASCEND_METADATA.json
gs://ascend-io-dev-sean-dev-sean-record-fragments/21dd3c96_b039_4cc1_9566_2cfbb3ebe091/part-00000000.parquet
...
gs://ascend-io-dev-sean-dev-sean-record-fragments/21dd3c96_b039_4cc1_9566_2cfbb3ebe091/part-00000010.parquet
...

Fragments
● Stored [bucket]/[uuid]/
● Metadata
  ○ P-SHA
  ○ UUID
  ○ Data SHA
  ○ Data Profile
  ○ Job Statistics

P-SHA
● Partition-level SHA

    p_sha = data_sha  # for source fragments
    p_sha = sha(transform, *(u.p_sha for u in input_fragments))  # for all other fragments
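
The two P-SHA rules on this slide can be sketched in runnable Python. The slides do not say which hash function or encoding is used, so SHA-256 over length-prefixed UTF-8 strings is an assumption here, and the function names are hypothetical.

```python
import hashlib

def sha(*parts):
    """Hash an ordered sequence of strings into one hex digest (SHA-256 assumed)."""
    h = hashlib.sha256()
    for part in parts:
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()

def source_p_sha(data_sha):
    """For source fragments, the P-SHA is simply the SHA of the data."""
    return data_sha

def derived_p_sha(transform, input_p_shas):
    """For other fragments, hash the transform together with its input P-SHAs."""
    return sha(transform, *input_p_shas)
```

Because the transform text and every upstream P-SHA feed the hash, a change to either the logic or any input produces a new P-SHA downstream, while unchanged work keeps its old one.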

Page 32:

3) What should it be?

Control Plane

3) Generate Expected P-SHA Set
● For each Component, determine Partitions
● Analyze Transform
  → Map, Partial, Full Reduction?
● Load Upstream Data Profile
  → Determine Input Partitions
  → Calculate p_sha
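
Step 3 can be sketched as follows, under loudly stated assumptions: the labels `"map"`, `"partial"`, and `"full"`, the `group_key` callback, and the use of SHA-256 are all hypothetical, chosen only to illustrate how the kind of transform determines which input partitions feed each expected P-SHA.

```python
import hashlib
from collections import defaultdict

def sha(*parts):
    """Hash strings joined by a NUL separator (hash choice is an assumption)."""
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()

def expected_p_shas(transform, kind, upstream, group_key=None):
    """Return {output_partition: expected p_sha}.

    upstream:  dict of input partition name -> p_sha
    kind:      "map", "partial", or "full" (hypothetical labels)
    group_key: for "partial", maps an input partition name to its output partition
    """
    if kind == "map":
        # Each output partition depends on exactly one input partition.
        groups = {name: [name] for name in upstream}
    elif kind == "partial":
        # Each output partition depends on a group of input partitions.
        groups = defaultdict(list)
        for name in upstream:
            groups[group_key(name)].append(name)
    elif kind == "full":
        # A single output depends on every input partition.
        groups = {"all": list(upstream)}
    else:
        raise ValueError(f"unknown transform kind: {kind}")
    return {
        out: sha(transform, *(upstream[name] for name in sorted(inputs)))
        for out, inputs in groups.items()
    }
```

For example, a daily aggregation over hourly partitions would be a partial reduction with `group_key=lambda name: name[:10]`, so each day's expected P-SHA depends only on that day's hours.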

Page 33:

4) What doesn’t match?

Control Plane

4) Identify Missing P-SHAs
● Input P-SHAs missing?
  → Queue for update
● Self P-SHA missing?
  → Queue for update
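
A minimal sketch of step 4, assuming for simplicity that each component has a single expected P-SHA and that `committed` is the set of P-SHAs already stored. The names and data shapes are hypothetical. Inputs are checked before the component itself, so upstream work is queued first.

```python
from collections import deque

def queue_updates(component, expected, inputs, committed, queue=None):
    """Queue P-SHAs that still need computing, upstream before downstream.

    expected:  dict of component -> expected p_sha
    inputs:    dict of component -> list of upstream components
    committed: set of p_shas already present in the metadata store
    """
    if queue is None:
        queue = deque()
    # Input P-SHAs missing? Queue them (recursively) first.
    for upstream in inputs.get(component, []):
        queue_updates(upstream, expected, inputs, committed, queue)
    # Self P-SHA missing? Queue it after its inputs.
    p_sha = expected[component]
    if p_sha not in committed and p_sha not in queue:
        queue.append(p_sha)
    return queue
```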

Page 34:

5) How do I “fix” it?

Control Plane

5) Fix It
● Generate Spark Job
● Analyze input transform & data
● Dynamically generate Spark params
● Monitor job for success
● Commit new P-SHA
● … repeat ...

Page 35:

Scaling to 1B Partitions and 1T+ records per day

Page 36:

Scaling the Control Plane

● SHAs: lots and lots of SHAs (and SHAs of SHAs)
● Caching: lots and lots of caching
● Trees, not lists
  ○ Leverage time-series partitions
  ○ Aggregate metadata
  ○ SHAs at each node, not just leaves
● Be Lazy: Only do as much work as is useful right now
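
"Trees, not lists" with "SHAs at each node" resembles a Merkle tree over time-series partitions. The sketch below illustrates that general idea and is not Ascend's data structure: the nested-dict layout (day, then hour) and SHA-256 are assumptions. Comparing two trees skips any subtree whose node SHA matches, so unchanged days are never inspected hour by hour.

```python
import hashlib

def sha(*parts):
    """Hash strings joined by a NUL separator (hash choice is an assumption)."""
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()

def build(tree):
    """Turn a nested dict of leaf SHAs into (node_sha, children) tuples.

    A leaf is a string (its SHA); an inner node is a dict of name -> subtree.
    """
    if isinstance(tree, str):
        return (tree, {})
    children = {name: build(sub) for name, sub in sorted(tree.items())}
    node_sha = sha(*(f"{name}={child[0]}" for name, child in children.items()))
    return (node_sha, children)

def diff(a, b, path=()):
    """Yield paths of leaves that differ, skipping subtrees with equal SHAs."""
    if a is not None and b is not None and a[0] == b[0]:
        return  # identical subtree: no need to look inside
    a_children = a[1] if a else {}
    b_children = b[1] if b else {}
    if not a_children and not b_children:
        yield path  # a changed, added, or removed leaf
        return
    for name in sorted(set(a_children) | set(b_children)):
        yield from diff(a_children.get(name), b_children.get(name), path + (name,))
```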

Page 37:

Scaling the Data Plane

● Do less work
  ○ Intermediate Data Persistence
  ○ Data & Task De-duplication
● Do the right kind of work
  ○ Small file aggregation
  ○ Small job optimizations (local mode)
  ○ Specialized compute pools for different tasks
● Do it efficiently
  ○ Auto-Scaling Spark on Kubernetes w/ Spot/Preemptible Instances
  ○ Single-zone clusters (reduce network costs)
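
The slides do not describe how data and task de-duplication works. One plausible reading, given that a P-SHA identifies a transform applied to specific inputs, is to key persisted results by P-SHA and skip any task whose P-SHA is already present. The sketch below illustrates that assumption only; `cache` and `execute` are hypothetical stand-ins for intermediate storage and job execution.

```python
def run_deduplicated(tasks, cache, execute):
    """Run each (p_sha, task) pair at most once; reuse persisted results otherwise.

    tasks:   iterable of (p_sha, task) pairs
    cache:   dict of p_sha -> result, standing in for persisted intermediate data
    execute: callable(task) -> result, standing in for running a job
    Returns (results by p_sha, list of p_shas that were actually executed).
    """
    results = {}
    executed = []
    for p_sha, task in tasks:
        if p_sha not in cache:
            cache[p_sha] = execute(task)
            executed.append(p_sha)
        results[p_sha] = cache[p_sha]
    return results, executed
```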

Page 38:

SaaS-ifying the Control Plane

[Architecture diagram. Labels shown:]
● Control Plane (k8s): High Perf Workers, API, Authn, Authz, Event Notification, KubeSpark, Records, Redis, Scheduler, Worker (Supervise), nginx, Frontend
● Data Plane (k8s): Object Store, Spark Jobs (Driver, 3 × Executor)
● Clouds: AWS, GCP, Azure
● Also shown: Metadata Store, Load Balancer

Page 39:

What we didn’t discuss...

● Garbage Collection: background task
● Multi-cloud abstractions: k8s, MinIO
● Data repair: failure to retrieve data → delete p-sha → self-heal
● Part files: similarities & differences with other fragments
● Resource Management: capacity-aware Control Plane
● SLA Driven Scheduling & Job Prioritization: per-component priority + upstream inheritance
● Scaling Spark on Kubernetes
● Scaling to 100+ environments: terraform, templates, monitoring, & automation

Come ask us @ Office Hours!!! (4th floor in 476a)
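
The data repair bullet ("failure to retrieve data → delete p-sha → self-heal") can be sketched as below. This is an illustration under assumptions: plain dicts stand in for the metadata store and the object store, and the function name is hypothetical. Deleting the P-SHA makes the fragment look missing, so the next control loop pass regenerates it.

```python
def read_fragment(p_sha, metadata, object_store):
    """Return fragment data, or None after dropping an unreadable fragment's p_sha.

    metadata:     dict of p_sha -> storage key (stand-in for a metadata store)
    object_store: dict of storage key -> bytes (stand-in for S3/GCS)
    """
    key = metadata.get(p_sha)
    if key is None:
        return None  # never committed, or already scheduled for repair
    try:
        return object_store[key]
    except KeyError:
        # Failure to retrieve data: forget the p_sha so that the next control
        # loop pass sees it as missing and regenerates the fragment.
        del metadata[p_sha]
        return None
```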

Page 40:

Declarative programming is a paradigm that expresses the logic of a computation without describing its control flow…

… [in an] attempt to minimize or eliminate side effects by describing what the program must accomplish…

… rather than describe how to accomplish it.

Page 41:

tl;dr

Declarative
● The What
● Logic + Data → Tasks

Imperative
● The How
● State + Tasks → Data

Page 42:

Declarative Pipelines

Intelligent Control Plane

Less Code

Faster Dev

Fewer Breaks

==

PROFIT!

Page 43:

Smarter Pipelines. Less Code.

Page 44:

See You in Office Hours

● Right after this, 4th floor, 476a

● Ask me anything

● Meet our CTO, Steven Parkes

● Visit our booth at the Partner Spotlight gallery for a live demo & free swag

● Visit us @ www.ascend.io

