Faster ETL Workflows using Apache Pig & Spark - Praveen Rachabattuni, Sigmoid (@praveenr019)
Transcript
Page 1: Faster ETL Workflows using Apache Pig & Spark

Faster ETL Workflows using Apache Pig & Spark

- Praveen Rachabattuni, Sigmoid

@praveenr019

Page 2: Faster ETL Workflows using Apache Pig & Spark

About me

Apache Pig committer and Pig on Spark project lead.

OUR CUSTOMERS

Page 3: Faster ETL Workflows using Apache Pig & Spark

Why Pig on Spark?

Spark already offers the Spark shell (Scala), Spark SQL, and the DataFrames API

Pig Latin is a more familiar language for developers and analysts, and easier to debug

Page 4: Faster ETL Workflows using Apache Pig & Spark

Why Pig on Spark? - contd.

Targeted users

Analysts

Projects with large Pig script codebases

Cost savings for organizations in training on new frameworks

Rich operator library

Page 5: Faster ETL Workflows using Apache Pig & Spark

How Spark plugs into Pig?

[Diagram] Pig on MapReduce: Logical Plan -> Physical Plan -> MR Plan -> MR Exec Engine

[Diagram] Pig on Spark: Logical Plan -> Physical Plan -> Conversions -> Spark Exec Engine
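The plans on either path can be printed for any script with Pig's EXPLAIN operator, which is a quick way to see what the chosen execution engine will run (whether the spark branch adds a Spark-plan section to the EXPLAIN output is not covered in these slides):

-- Print the logical, physical, and execution plans for a relation;
-- 'B' stands for any alias defined earlier in the script.
EXPLAIN B;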

Page 6: Faster ETL Workflows using Apache Pig & Spark

Operator Mapping

Pig Operator    Spark Operator
Load            newAPIHadoopFile
Store           saveAsNewAPIHadoopFile
Filter          filter transformation
GroupBy         groupBy & map
Join            cogroup
ForEach         mapPartitions
Sort            sortByKey + map
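Each row of this mapping is implemented by a converter class. Judging from the Load and Filter converters shown on pages 9 and 11, the common shape of a converter is roughly the following sketch; only the convert(...) signature is taken from those pages, while the interface name and generics here are assumptions:

// Sketch of a per-operator converter, inferred from the convert(...) methods
// on pages 9 and 11. "RDDConverter" is an illustrative name, not necessarily
// the one used in the Pig spark branch; RDD, Tuple and PhysicalOperator are
// the same Spark/Pig types used in the converter code on those pages.
public interface RDDConverter<T extends PhysicalOperator> {
    // Turn the RDDs produced by the operator's predecessors into the RDD
    // this operator produces.
    RDD<Tuple> convert(List<RDD<Tuple>> predecessorRdds, T physicalOperator)
            throws IOException;
}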

Page 7: Faster ETL Workflows using Apache Pig & Spark

How Spark plugs into Pig? - contd.

[Diagram] Pig on MapReduce: Logical Plan -> Physical Plan -> MR Plan -> MR Exec Engine

[Diagram] Pig on Spark: Logical Plan -> Physical Plan -> Conversions -> Spark Plan -> Spark Exec Engine

Page 8: Faster ETL Workflows using Apache Pig & Spark

Simple script

A = LOAD './wiki' USING PigStorage(' ') AS (hour:chararray, pcode:chararray, pagename:chararray, pageviews:chararray, pagebytes:chararray);
-- pageviews is loaded as chararray, so cast it before the numeric comparison
B = FILTER A BY (int)pageviews >= 50000;
DUMP B;

Input data:

en Main_Page 242332 4737756101
ak Italy 400 73160
en Main_Page 242332 4737756101
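For comparison, this is roughly what the same filter looks like written directly against Spark's Java API. It is a hand-written sketch, not what Pig on Spark generates; the field index and the assumption that pageviews always parses as a number are mine:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WikiFilter {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WikiFilter").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // LOAD './wiki' USING PigStorage(' ')
        JavaRDD<String> lines = sc.textFile("./wiki");

        // FILTER A BY (int)pageviews >= 50000
        // pageviews is assumed to be the 4th space-separated field (index 3),
        // matching the Pig schema above.
        JavaRDD<String> popular = lines.filter(line -> {
            String[] fields = line.split(" ");
            return fields.length > 3 && Long.parseLong(fields[3]) >= 50000L;
        });

        // DUMP B
        for (String row : popular.collect()) {
            System.out.println(row);
        }
        sc.stop();
    }
}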

Page 9: Faster ETL Workflows using Apache Pig & Spark

Load operator

@Override
public RDD<Tuple> convert(List<RDD<Tuple>> predecessorRdds, POLoad poLoad)
        throws IOException {
    JobConf loadJobConf = SparkUtil.newJobConf(pigContext);
    configureLoader(physicalPlan, poLoad, loadJobConf);

    RDD<Tuple2<Text, Tuple>> hadoopRDD = sparkContext.newAPIHadoopFile(
            poLoad.getLFile().getFileName(), PigInputFormatSpark.class,
            Text.class, Tuple.class, loadJobConf);

    // map to get just RDD<Tuple>
    return hadoopRDD.map(TO_TUPLE_FUNCTION, SparkUtil.getManifest(Tuple.class));
}

Page 10: Faster ETL Workflows using Apache Pig & Spark

Load operator (contd.)

private static class ToTupleFunction
        extends AbstractFunction1<Tuple2<Text, Tuple>, Tuple>
        implements Function1<Tuple2<Text, Tuple>, Tuple>, Serializable {

    @Override
    public Tuple apply(Tuple2<Text, Tuple> v1) {
        return v1._2();
    }
}

Page 11: Faster ETL Workflows using Apache Pig & Spark

Filter operator

@Override
public RDD<Tuple> convert(List<RDD<Tuple>> predecessors, POFilter physicalOperator) {
    SparkUtil.assertPredecessorSize(predecessors, physicalOperator, 1);
    RDD<Tuple> rdd = predecessors.get(0);
    FilterFunction filterFunction = new FilterFunction(physicalOperator);
    return rdd.filter(filterFunction);
}

Page 12: Faster ETL Workflows using Apache Pig & Spark

Filter operator (contd.)

private static class FilterFunction extends AbstractFunction1<Tuple, Object>
        implements Serializable {

    private POFilter poFilter;

    @Override
    public Boolean apply(Tuple v1) {
        Result result;
        try {
            poFilter.setInputs(null);
            poFilter.attachInput(v1);
            result = poFilter.getNextTuple();
            // keep the tuple only if the Pig filter reports a successful result
            return result.returnStatus == POStatus.STATUS_OK;
        } catch (ExecException e) {
            throw new RuntimeException("Couldn't filter tuple", e);
        }
    }
}
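Both converters follow the same pattern: take the predecessors' RDDs, wrap the Pig physical operator in a serializable Scala function, and apply a single Spark transformation. How the exec engine picks the converter for each operator is not shown in these slides; one plausible, purely illustrative wiring is a lookup keyed by the physical operator's class (RDDConverter is the sketch from page 6, and the converter variables here are hypothetical):

// Purely illustrative dispatch: map each physical-operator class to a converter.
// How converters are actually constructed and registered is not shown in the slides.
Map<Class<? extends PhysicalOperator>, RDDConverter<?>> converters = new HashMap<>();
converters.put(POLoad.class, loadConverter);     // holds the convert(...) from page 9
converters.put(POFilter.class, filterConverter); // holds the convert(...) from page 11

RDDConverter<?> converter = converters.get(physicalOperator.getClass());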

Page 13: Faster ETL Workflows using Apache Pig & Spark

Spark plan

The MR plan is structured around the MapReduce execution engine.

The Spark plan contains a sequence of transformations and is optimized for Spark.

Handing the logical plan over to Spark gives a much more optimized flow, since Spark is very good at this kind of optimization.

Page 14: Faster ETL Workflows using Apache Pig & Spark

Benchmark

Page 15: Faster ETL Workflows using Apache Pig & Spark

Setting up Pig on Spark

1. Get the code

git clone https://github.com/apache/pig -b spark

2. Build the project

ant -Dhadoopversion=23 jar (assumes a hadoop-2.x setup)

3. Set env variables

export HADOOP_USER_CLASSPATH_FIRST="true" && export SPARK_MASTER="local"

4. Start the Pig grunt shell

bin/pig -x spark
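Once the grunt shell comes up, a saved script can also be run non-interactively on the Spark engine (wiki.pig is a hypothetical file name holding the script from page 8):

bin/pig -x spark wiki.pig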

Page 16: Faster ETL Workflows using Apache Pig & Spark

Roadmap

Bring the Spark plan more in line with the Spark APIs

Reduce the heavy shuffle that currently impacts performance

Close the remaining functional differences with Pig on MapReduce

Page 17: Faster ETL Workflows using Apache Pig & Spark

Contributors

Praveen Rachabattuni (Sigmoid)

Liyun Zhang (Intel)

Xuefu Zhang (Cloudera)

Mohit Sabharwal (Cloudera)

Xianda (Intel)

Page 18: Faster ETL Workflows using Apache Pig & Spark

References

Apache Pig GitHub mirror

https://github.com/apache/pig/tree/spark

Umbrella JIRA for Pig on Spark

https://issues.apache.org/jira/browse/PIG-4059

Page 19: Faster ETL Workflows using Apache Pig & Spark

Thank you

Queries?

