The Evolution of Big Data Platform @ Netflix
Eva Tse
July 22, 2015
Our biggest challenge is scale
Netflix Key Business Metrics
65+ million members
50 countries, 1000+ devices supported
10 billion hours / quarter
Global Expansion: 200 countries by end of 2016
Big Data Size
• Total ~20 PB DW on S3
• Read ~10% of DW daily
• Write ~10% of read data daily
• ~500 billion events daily
• ~350 active users
Our traditional BI stack is our competition
How do we meet the functionality bar and yet make it scale?
How do we make big data bite-size again?
Our North Star
• Infrastructure – no undifferentiated heavy lifting
• Architecture – scalable and sustainable
• Self-serve – ecosystem of tools
(Data pipeline diagram: cloud apps emit event data through Suro/Kafka and Ursula on a 15 min cadence; Cassandra dimension data flows through Aegisthus daily, reading SSTables; both pipelines land on AWS S3 in Parquet FF.)
(Platform stack diagram – Storage: AWS S3; Compute: Hadoop clusters; Services: Metacat (federated metadata service) and a federated execution service; Tools: Pig workflow visualization, data movement, data visualization, job/cluster perf visualization, data lineage, data quality.)

Processing needs served: analytics, ETL, interactive data exploration, interactive slice & dice, RT analytics & iterative/ML algorithms.
Evolving Big Data Processing Needs
Evolving Services/Tools Ecosystem
(Same platform diagram as above, extended with a Big Data API, an API Portal, and a Big Data Portal layered on top of the services and tools.)
AWS S3 as our DW Storage
• S3 as single source of truth (not HDFS)
• 11 9's durability and 4 9's availability
• Separate compute and storage
• Key enablement to:
  – multiple clusters
  – easy upgrade via red/black (r/b) deployment
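The "separate compute and storage" point can be made concrete with a small sketch: when S3 (not cluster-local HDFS) holds the warehouse, every cluster resolves table data from the same bucket layout, so clusters can be added or swapped red/black without moving any data. The bucket name and path layout below are illustrative assumptions, not Netflix's actual layout.

```python
# Hypothetical sketch: S3 as the single source of truth means any number
# of compute clusters derive identical data locations from shared config.
DW_BUCKET = "example-netflix-dw"  # assumed name, not the real bucket

def partition_keys(table: str, dateint: str, hours: range) -> list:
    """Build S3 key prefixes for one day of an hourly-partitioned table."""
    return [
        f"s3://{DW_BUCKET}/warehouse/{table}/dateint={dateint}/hour={h:02d}/"
        for h in hours
    ]

# Two independent clusters (e.g. old and new during a red/black upgrade)
# compute identical locations -- storage is shared, compute is disposable.
cluster_a_view = partition_keys("playback_events", "20150722", range(0, 3))
cluster_b_view = partition_keys("playback_events", "20150722", range(0, 3))
assert cluster_a_view == cluster_b_view
```

Because no data lives on the clusters themselves, "easy upgrade" reduces to pointing a freshly built cluster at the same bucket and retiring the old one.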
Evolution of Big Data Processing Systems
• Analytics
• Hive-QL is close to ANSI SQL syntax
• Hive metastore serves as single source of truth for big data metadata
• ETL
• Better language constructs for ETL
• Contributions since 0.11
• Customization:
  – Integration of Metacat with the Hive metastore
  – Integration with S3
• Interactive data exploration and experimentation
• Why we like Presto:
  – Integration with Hive metastore
  – Easy integration with S3
  – Works at petabyte scale
  – ANSI SQL for usability
  – Fast
• Our contributions:
  – S3 file system
  – Query optimizations
  – Complex types support
  – Parquet file format integration
  – Working on predicate pushdown
Parquet
• Columnar file format
• Supported across Hive, Pig, Presto, Spark
• Performance benefits across different processing engines
• Working on vectorized read, lazy load, and lazy materialization
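Why a columnar layout pays off across engines can be shown with a toy sketch (this is plain Python illustrating the idea, not Parquet itself): rows are stored column-by-column, so a query touching two of many columns reads only those columns, and a predicate can be evaluated on one column before any others are materialized, which is the essence of lazy load and predicate pushdown.

```python
# Illustrative sketch of a columnar layout; data values are made up.
rows = [
    {"title_id": 1, "country": "US", "hours": 4.0, "device": "tv"},
    {"title_id": 2, "country": "BR", "hours": 1.5, "device": "mobile"},
    {"title_id": 3, "country": "US", "hours": 2.5, "device": "tablet"},
]

# Row-major -> column-major ("columnar") representation.
columns = {k: [r[k] for r in rows] for k in rows[0]}

def scan(columns, select, predicate_col, predicate):
    """Read only the predicate column first, then materialize the selected
    columns for the surviving row indices (pushdown + lazy load)."""
    keep = [i for i, v in enumerate(columns[predicate_col]) if predicate(v)]
    return {c: [columns[c][i] for i in keep] for c in select}

result = scan(columns, select=["title_id", "hours"],
              predicate_col="country", predicate=lambda c: c == "US")
# result -> {"title_id": [1, 3], "hours": [4.0, 2.5]}
```

A real Parquet reader gets the same effect from column chunks and row-group statistics, with far less I/O than a row-oriented scan.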
• Interactive dashboard for slicing and dicing• Column-based in-memory data store for time series data• Serves a specific use case very well
• ETL, RT analytics, ML algorithms
• Why we like Spark:
  – Cohesive environment – batch and 'stream' processing
  – Multiple language support – Scala, Python
  – Performance benefits
  – Runs on top of YARN for multi-tenancy
  – Community momentum
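The "cohesive environment" bullet deserves one concrete illustration: the same transformation logic can serve both a batch job and a micro-batch 'stream'. The sketch below is plain Python standing in for the Spark RDD/DStream pattern; the function names are invented for the example.

```python
def sessionize(events):
    """Toy transform: count events per member (stands in for real ETL)."""
    counts = {}
    for member_id in events:
        counts[member_id] = counts.get(member_id, 0) + 1
    return counts

# Batch mode: one pass over a day of events.
batch_result = sessionize(["a", "b", "a", "c"])

# 'Stream' mode: the same function applied per micro-batch, then merged.
def merge(acc, delta):
    for k, v in delta.items():
        acc[k] = acc.get(k, 0) + v
    return acc

stream_result = {}
for micro_batch in [["a", "b"], ["a", "c"]]:
    stream_result = merge(stream_result, sessionize(micro_batch))

assert batch_result == stream_result  # same logic, two execution modes
```

Writing the transform once and reusing it in both modes is what removes the usual batch/streaming code duplication.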
Evolution of Services/Tools Ecosystem
• Federated execution engine
• Expose [your fave big data engine] as a service
• Flexible data model to support future job types
• Cluster configuration management
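A minimal sketch of what "federated execution" plus "cluster configuration management" means in practice: clients submit a job spec naming an engine, and the service picks a cluster from its configuration registry. All names and the dict-based spec here are illustrative assumptions, not the actual service's API.

```python
# Hypothetical cluster registry managed by the execution service.
CLUSTERS = {
    "prod-hadoop": {"engines": {"hive", "pig"}, "status": "up"},
    "adhoc-presto": {"engines": {"presto"}, "status": "up"},
}

def submit(job):
    """Route a job spec to the first healthy cluster supporting its engine.

    A flexible dict-based spec means a new job type (new engine) needs no
    schema change -- only a new cluster registration.
    """
    for name, cfg in CLUSTERS.items():
        if job["engine"] in cfg["engines"] and cfg["status"] == "up":
            return {"job_id": f"{name}-0001", "cluster": name}
    raise LookupError(f"no cluster for engine {job['engine']!r}")

placement = submit({"engine": "presto", "query": "SELECT 1"})
# placement["cluster"] -> "adhoc-presto"
```

Because clients only name an engine, the service can retarget jobs during cluster upgrades without client changes.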
Metacat
• Federated metadata catalog for the whole data platform
  – Proxy service to different metadata sources
• Data metrics, data usage, ownership, categorization, retention policy, …
• Common interface for tools to interact with metadata
• To be open sourced in 2015 on Netflix OSS
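The federation idea above can be sketched as one catalog facade proxying per-source connectors behind a common interface. Class and method names below are assumptions made for the example, not Metacat's real API.

```python
class HiveMetastoreSource:
    def get_table(self, name):
        return {"name": name, "source": "hive", "format": "parquet"}

class RdsSource:
    def get_table(self, name):
        return {"name": name, "source": "rds", "format": "innodb"}

class FederatedCatalog:
    """Single entry point: routes 'catalog/table' names to the underlying
    metadata source, then decorates the answer with platform-level
    metadata (ownership, retention, ...)."""
    def __init__(self, sources):
        self.sources = sources

    def get_table(self, qualified_name):
        catalog, table = qualified_name.split("/", 1)
        meta = self.sources[catalog].get_table(table)
        meta["retention_days"] = 90  # platform-wide policy; example value
        return meta

catalog = FederatedCatalog({"hive": HiveMetastoreSource(),
                            "rds": RdsSource()})
```

Tools then talk only to the facade, so adding a new metadata source never changes tool code, which is the point of the "common interface" bullet.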
Big Data API
• Integration layer for our ecosystem of tools and services
• Python library (called Kragle)
• Building block for our ETL workflow
• Building block for Big Data Portal

Big Data Portal
• One-stop shop for all big data related tools and services
• Built on top of Big Data API
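A hedged sketch of what an "integration layer" Python library looks like: one client object wraps the metadata and execution services, so ETL workflows and the portal build on the same calls. Every name here is illustrative – the real library is called Kragle, but its actual interface is not shown in the talk.

```python
class BigDataClient:
    """Integration-layer sketch: composes platform services as building
    blocks (the pattern described for the Big Data API)."""
    def __init__(self, metadata, executor):
        self.metadata = metadata    # e.g. federated metadata service
        self.executor = executor    # e.g. federated execution service

    def run_etl(self, source_table, sql):
        """Look up the table via metadata, then submit a job to run."""
        table = self.metadata.get_table(source_table)
        return self.executor.submit({"engine": "hive", "sql": sql,
                                     "input_format": table["format"]})

# Stub services so the sketch is self-contained.
class StubMetadata:
    def get_table(self, name):
        return {"name": name, "format": "parquet"}

class StubExecutor:
    def submit(self, job):
        return {"status": "submitted", "engine": job["engine"]}

client = BigDataClient(StubMetadata(), StubExecutor())
result = client.run_etl("dw/playback_events", "SELECT count(*) FROM t")
```

A portal UI and a scheduled ETL workflow calling the same `run_etl`-style primitives is what makes the API the shared building block for both.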
Open source is an integral part of our strategy to achieve scale
Big Data Processing Systems
Services/Tools Ecosystem
Why use Open Source?
• Collaborate with other internet-scale tech companies
• Uncharted area/scale; lock-in is not desirable
• Need the flexibility to achieve scalability
BUT…
• Lots of choices
• White-box approach
Why contribute back?
• Not IP or trade secrets
• Help shape the direction of projects
• Don't want to fork and diverge
• Attract top talent
Why contribute our own tool?
• Share our goodness• Set industry standard• Community can help evolve the tool
Is open source right for you?
Measuring big data - understanding data by usage
By Charles Smith, Netflix
Tomorrow @ 1:40-2:20pm