Agenda
1. Oozie at Yahoo
2. Data Pipelines and Complex dependencies
3. Oozie unit testing
4. Spark Action
5. Future Work
Why Oozie?
• Out-of-box support for multiple job types: Java, shell, distcp, MapReduce (pipes, streaming), Pig, Hive, Spark
• Highly scalable
• High availability: hot-hot with rolling upgrades, load balanced
• Hue integration
[Diagram labels: Oozie, HBase, Pig, Hive, Spark, YARN, HDFS, Hue, HCatalog]
Scale at Yahoo
• Deployed on all clusters (production, non-production)
  • One instance per cluster
• 75 products / 2000+ projects
  • 255 monthly users
• 90,00 workflow jobs daily (June 2016, one busy cluster)
  • Between 1-8 actions; avg. 4 actions/workflow
  • Extreme use case: submit 100-200 workflow jobs per min
• 2,277 coordinator jobs daily (June 2016, one busy cluster)
  • Frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25%: < 15 min)
  • 99% of workflow jobs kicked from coordinator
• 97 bundle jobs daily (June 2016, one busy cluster)
Data Pipelines
• Advertisement: Ad Exchange, Ad Latency, Search Advertising
• Content: Content Management, Content Optimization, Content Personalization, Flickr Video
• Targeting: Audience Targeting, Behavioral Targeting, Partner Targeting, Retargeting, Web Targeting
• Email: Anti Spam, Content, Retargeting
• Data Intelligence: Research, Dashboards & Reports, Forecasting
• Data Management: Audience Pipeline
Use Case - Data pipeline
Oozie Coordinator
<coordinator-app name="MY_APP" frequency="1440" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
    <dataset name="input2" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-2/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="coordInput1" dataset="input1">
      <instance>${coord:current(0)}</instance>
    </data-in>
    <data-in name="coordInput2" dataset="input2">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
    </workflow>
  </action>
</coordinator-app>
Current limitations of the Oozie coordinator
• All datasets are required
• All instances are forced
• We can't combine datasets from multiple providers
• There is no way to assign priority among datasets
Complex dependencies
OOZIE-1976 : Specifying coordinator input datasets in more logical ways
Oozie Coordinator with input logic
<coordinator-app name="MY_APP" frequency="1440" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
    <dataset name="input2" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-2/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="coordInput1" dataset="input1">
      <instance>${coord:current(0)}</instance>
    </data-in>
    <data-in name="coordInput2" dataset="input2">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <input-logic>
    <or name="input1ORinput2">
      <data-in dataset="input1"/>
      <data-in dataset="input2"/>
    </or>
  </input-logic>
  …
BCP Support
Pull data from A or B. Specify the dataset as AorB. The action starts running as soon as either dataset A or B is available.
<input-logic>
  <or name="AorB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
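Once the AorB logic is satisfied, the workflow still needs to know which paths were resolved. The fragment below is a minimal sketch of one way to hand them over, assuming the input-logic name can be passed to coord:dataIn in the same way as a regular data-in name. The property name inputDir is hypothetical, and the app-path is reused from the coordinator example earlier in the deck.

```xml
<!-- Sketch only: inputDir is a hypothetical property name. -->
<action>
    <workflow>
        <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
        <configuration>
            <property>
                <name>inputDir</name>
                <!-- Assumed to resolve to the paths of whichever dataset
                     (A or B) satisfied the AorB input logic. -->
                <value>${coord:dataIn('AorB')}</value>
            </property>
        </configuration>
    </workflow>
</action>
```

The workflow can then read ${inputDir} without knowing which provider supplied the data.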
Minimum availability processing
Sometimes we want to process even if only partial data is available.
<input-logic>
  <data-in dataset="A" min="4"/>
</input-logic>
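The min attribute is easier to read next to the instance range it applies to. The sketch below assumes a data-in named A that spans five instances of a hypothetical dataset_A (the same range style used on the combine slide), and assumes min counts available instances within that range, so min="4" lets the action start when at least four of the five are present.

```xml
<!-- Sketch only: dataset_A and the -5..-1 range are illustrative assumptions. -->
<input-events>
    <data-in name="A" dataset="dataset_A">
        <start-instance>${coord:current(-5)}</start-instance>
        <end-instance>${coord:current(-1)}</end-instance>
    </data-in>
</input-events>
<input-logic>
    <!-- Start once at least 4 of the 5 instances above are available. -->
    <data-in dataset="A" min="4"/>
</input-logic>
```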
Optional feeds
Dataset B is optional. Oozie starts processing as soon as A is available, and includes the data from A plus whatever is available from B.
<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>
Priority Among Dataset Instances
A takes precedence over B, and B takes precedence over C.
<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>
Wait for primary
Sometimes we want to give preference to the primary data source and switch to the secondary only after waiting for a specific amount of time.
<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="120"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
Combining Datasets From Multiple Providers
The combine function first checks instances from A, then goes to B for whatever is missing in A.
<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>
MiniOozie
• MiniOozie: HCat, Pig, Hive, Spark
• MiniOozieClient: communicates with the Oozie server
Oozie unit YAML
name: TestCoordinator
job:
  properties:
    raw_logs_path: "/tmp/test/input"
    aggregated_logs_path: "/user/test/output"
    oozie.coord.application.path: src/test/resources/coordinator-test.xml
hdfs:
  touchz:
    - /tmp/test/input/2010/02/01/09/_SUCCESS
    - /tmp/test/input/2010/02/01/10/_SUCCESS
  mkdir:
    - /user/test/output
validations:
  validate_job:
    sleep: 6000
    coordinator_actions:
      - coordinator_action: "@2"
        not_status: WAITING
        nominal_time: 2010-02-01T11:00Z
Spark Action
Yahoo Confidential & Proprietary
• Oozie native support for Apache Spark jobs
• Introduced last year in Apache Oozie 4.2.0
Example
<spark xmlns="uri:oozie:spark-action:0.2">
  <master>yarn</master>
  <mode>cluster</mode>
  <name>Spark-FileCopy</name>
  <class>org.apache.oozie.example.SparkFileCopy</class>
  <jar>${nameNode}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
  <file>${nameNode}/${examplesRoot}/apps/spark/myfiles/somefile.txt</file>
  <archive>${nameNode}/${examplesRoot}/apps/spark/myfiles/someArchive.zip</archive>
  <spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>
  <arg>${nameNode}/${examplesRoot}/input-data/text/data.txt</arg>
  <arg>${nameNode}/${examplesRoot}/output-data/spark</arg>
</spark>
PySpark Example
Automatically sets up pyspark.zip and py4j-src.zip from Sharelib
<spark xmlns="uri:oozie:spark-action:0.2">
  <master>yarn</master>
  <mode>cluster</mode>
  <name>PySparkExample</name>
  <jar>${nameNode}/${examplesRoot}/apps/spark/lib/pi.py</jar>
  <spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>
</spark>
Modes supported
• In local and yarn-client modes, the driver runs inside the Oozie launcher itself, so any property meant for the driver must be prefixed with oozie.launcher.
• For example, raise oozie.launcher.mapreduce.map.memory.mb and oozie.launcher.mapreduce.map.java.opts to increase driver memory.
Master      Mode
local[*]    (none)
yarn        client
yarn        cluster
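The launcher-prefixed properties named above can be set in the action's configuration block. The fragment below is a minimal sketch for a yarn-client job. The two property names come from this slide, while the values 4096 and -Xmx3276m are illustrative assumptions, not recommendations.

```xml
<!-- Sketch only: place inside a <spark> action, before <master>.
     The values are illustrative assumptions. -->
<configuration>
    <property>
        <!-- Container memory for the launcher, which hosts the driver
             in local and yarn-client modes. -->
        <name>oozie.launcher.mapreduce.map.memory.mb</name>
        <value>4096</value>
    </property>
    <property>
        <!-- JVM heap for the launcher; kept below the container size. -->
        <name>oozie.launcher.mapreduce.map.java.opts</name>
        <value>-Xmx3276m</value>
    </property>
</configuration>
```

Keeping the heap somewhat below the container memory leaves headroom for non-heap JVM usage.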
Recent enhancements
• Support for PySpark jobs
• Show Spark Job URLs in Oozie UI under Child Jobs Tab
• Automatically include spark-defaults.conf from Sharelib
• Support for <file> and <archive>
• Faster job launch time
• Simplify setting up of classpath
• Avoid re-uploading jars for localization by reusing hdfs paths in mapreduce.job.cache.files
• A couple of bug fixes
Future Work
• Oozie unit testing framework
  • No unit tests now; directly tested by running in staging
• Coordinator dependency management
  • Better reprocessing
• Aperiodic processing
  • Managed through workarounds