Enterprise Data Platform and the Role of Apache Spark
Seshu Adunuthula (@SeshuAd)
78% of GMB is fixed price
42% GMV VIA MOBILE
291M MOBILE APP DOWNLOADS GLOBALLY
1.3B LISTINGS CREATED VIA MOBILE EVERY WEEK
162M ACTIVE BUYERS
25M ACTIVE SELLERS
800M ACTIVE LISTINGS
8.8M NEW LISTINGS EVERY WEEK
Q4 2015
eBay Marketplace
Evolving from our Auction Roots
Items sold are new: 79%
Items are fixed price: 84%
Transactions offer free shipping: 63%
*Q3 2015 data
Data is eBay’s Most Important Asset
Data Sciences
Personalization: User propensity modeling based on 5 quarters of behavioral data. Cluster-based unstructured data. User (Badges, Activity Synopsis, Word Cloud), Deals Personalization.
Merchandising: Similar items - recommending similar items on key placements on site and mobile. Powered, among other things, by co-clicked items.
Structured Data: Deal discovery leverages a machine learning model.
Trust: Predictive machine learning models for fraud prevention, account takeover, loss prevention, bad buying experience prevention, and buyer/seller risk prediction.
Shipping: Delivery experience: building a model to predict more accurate and shorter delivery estimates.
BI & Analytical Apps
Search Backend: 950M items / ~15TB indexes in 2.5 hours on a daily basis. 2M item / 11TB index subsets generated near real-time in 3 minutes.
Search Sciences: Preparation of search ranking, spam, and recall factors or data (such as the phrase table and query-rewrite table), including many pipelines built on top of the search Scala platform.
Structured Data: Convert 800M item listings into product pages that are automatically curated and persisted.
Traffic – Paid Search: 10% efficiency lift for paid search (Amber model). Identify low-performing items (Google search).
Data Preparation: Bot detection, data transformation, and sessionization for user behavior and core data sets.
Experimentation: A/B testing for new features and user experiences released on the site.
Behavioral Data
Search
SPD: Provides a global B2C and C2C seller performance overview.
Nous, DNA: Provide self-service product experience analysis on product health, behavior analytics, and product domain-specific reporting.
System Services
Sherlock Monitoring: Application logs (CAL logs) are stored and processed on Hadoop.
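The Data Preparation item above mentions sessionization of user behavior data. As a rough illustration only, sessionization can be sketched as grouping each user's events by an inactivity gap. The function name `sessionize`, the `(user_id, timestamp)` event shape, and the 30-minute timeout are assumptions made for this sketch, not details from the slides or from eBay's pipelines.

```python
from collections import defaultdict

# Assumed inactivity timeout; the slides do not state the value eBay uses.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp_seconds) events into per-user sessions.

    A new session starts whenever the time between two consecutive
    events of the same user exceeds `gap` seconds.
    """
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for prev, cur in zip(stamps, stamps[1:]):
            if cur - prev > gap:
                user_sessions.append([cur])      # gap exceeded: open a new session
            else:
                user_sessions[-1].append(cur)    # still within the current session
        sessions[user_id] = user_sessions
    return sessions


if __name__ == "__main__":
    sample = [("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 100)]
    print(sessionize(sample))
```

At production scale the same grouping would typically run as a distributed job (for example on Spark), partitioned by user; the single-process version here only shows the gap rule.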
Data is eBay’s DNA
Enterprise Data Platform
Agile Data Warehouse: populated by batch; used by humans; how: sets of data
Data Streams: populated by streams; used by systems; how: sets of data
Data Services: populated by services; used by applications; how: specific calls
Enterprise Data Platform
Structured Data: SQL
Interactive Relational Analysis
Semi Structured Data: Java/Scala…
Batch and Ad-Hoc Algorithmic Analysis
Relational Analysis / Programmatic Exploration
EDW / Hadoop
Analysts/BU PM/Executives/Tools; Analysts/Scientists/Tools; Scientists/Engineers
Agile Data Warehouse
Simplify Access to DW
Cross Platform VDMs
Geo Distributed Caches
Data Virtualization
Apache Kylin: Extreme OLAP Engine
ANSI SQL on Hadoop
MOLAP Cubes
Interactive Query on Billions of Rows
Clients: 3rd Party App (Web App, Mobile…) via REST API; SQL-Based Tool (BI Tools: Tableau…) via JDBC/ODBC
Components: REST Server, Query Engine, Routing, Metadata, Cube Build Engine
Storage: Star Schema Data (Hadoop, Hive); Key Value Data (OLAP Cube in HBase)
Latency: Low Latency - Seconds; Mid Latency - Minutes
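The MOLAP idea on the Kylin slide (pre-build cubes from star schema data so that queries become key lookups) can be illustrated with a toy in-memory sketch. This is not Kylin's implementation or API: `build_cube`, the row layout, and the column names `site`, `category`, and `gmv` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate `measure` for every subset of `dimensions`.

    Returns a dict that maps a tuple of dimension names (one cuboid)
    to a dict of {dimension-value tuple: summed measure}.
    """
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


if __name__ == "__main__":
    # Hypothetical fact rows; values are made up for the example.
    rows = [
        {"site": "US", "category": "Phones", "gmv": 10.0},
        {"site": "US", "category": "Books", "gmv": 5.0},
        {"site": "DE", "category": "Phones", "gmv": 7.0},
    ]
    cube = build_cube(rows, ("site", "category"), "gmv")
    # Answering "GMV by site" is now a lookup instead of a scan.
    print(cube[("site",)])
```

The trade-off this shows is the usual MOLAP one: build time and storage grow with the number of dimension combinations, in exchange for low-latency reads.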
Data Streams Platform
Centralized Shared Data Streams
Real-time Representation of EDW Datasets
DW populated using the Streams Platform
Apache Kafka: Behavior, TXN, User, Risk
Streaming Apps
Stream Processing Clusters: Sandbox, Staging, Pool1, Pool2, Pool3
Open, Sample Streams; Needs Approval
Rheos App Manager: Configuration, Deployment, GitHub
Tora ETL
Connectors: Hadoop, Teradata
Q3 2015
Data Search & Discovery
Collaborative Analytics
Data Governance
Data Services
Spades – Spark Provisioning & Deployment Environment
• Execution environment tailored for Spark
• Governance of Big Data Apps with a PaaS layer
• Application life cycle spanning development to deployment
Data Pipelines
Behavioral Data Pipeline
Components: eBay Visitors; Application Servers; CAL; Central App Logging; Listeners; Ingest Servers; Singularity; Hadoop; Analytics Platform & Delivery; Analysts
Behavioral Data: A/B Testing
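The slides do not describe how users are assigned to experiments. One common approach, shown here purely as an assumption, is deterministic hash-based bucketing, so that a given user always sees the same variant of a given experiment. The function name `assign_variant` and the variant labels are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to one variant of an experiment.

    Hashing the experiment name together with the user id keeps the
    assignment stable across visits and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


if __name__ == "__main__":
    # Hypothetical user id and experiment name.
    print(assign_variant("user-123", "new-checkout-flow"))
```

Behavioral events can then be tagged with the assigned variant and aggregated per variant downstream.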
Transactional Data Pipelines
• Initial pattern
• More data available faster on Hadoop
• Leverage Hadoop SQL / Spark
• Subsequent pattern
• >10% datasets available on Hadoop
• New innovation avenues via open source
• Give humans more Teradata capacity
• Teradata data available no later than before
• <95% extract / daily batch
• >5% stream / frequent batch