Productionalizing a Spark application
Productionalizing an application on a frequently evolving framework like Spark
● Shashank L
● Big data consultant and trainer at datamantra.io
● www.shashankgowda.com
Agenda
● Financial analytics
● Requirements
● Architecture
● Initial solution
● RDD to Dataframe API
● Code quality and testing
● Architectural changes
● Future improvements
● Lookback
Financial Analytics
Financial analytics is used to predict the stock prices for a specific company using its historical
price information
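A minimal, hypothetical sketch of this use case (not the pipeline described in this deck): fitting a linear model that predicts the next day's close from the previous day's close with Spark MLlib. The path, the column names (symbol, date, close) and the Spark 2.x API are assumptions for illustration only.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val spark = SparkSession.builder().appName("stock-prediction-sketch").getOrCreate()

// Hypothetical historical price data: one row per symbol per day.
val prices = spark.read.parquet("s3a://example-bucket/stocks/daily.parquet")

// The previous day's close becomes the single feature for predicting today's close.
val byDate = Window.partitionBy("symbol").orderBy("date")
val withLag = prices
  .withColumn("prevClose", lag("close", 1).over(byDate))
  .na.drop(Seq("prevClose"))

val training = new VectorAssembler()
  .setInputCols(Array("prevClose"))
  .setOutputCol("features")
  .transform(withLag)
  .withColumnRenamed("close", "label")

val model = new LinearRegression().fit(training)
println(s"coefficients=${model.coefficients}, intercept=${model.intercept}")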
Architecture
Stocks data (daily basis) → SQL Server → ETL pipeline → HDFS → Data preprocessing → Data analytics → NoSQL → Frontend (dashboard)
Our team
● Data scientists
○ Coming up with the new magic
● Data engineers
○ Productionalizing the magic on large datasets
● Front end developer
○ Consumes the results to make them presentable to clients
Requirements
● Developers spread across geographies
● Variety of developers in the team
● Better code quality
● Better testing mechanisms
● Easier team expansion
● Less infrastructure maintenance overhead
● Use the latest libraries available
Iteration 1
Initial solution
Iteration 1
● Data scientists
○ Well versed with Python or SQL
○ Did their analysis in Python Pandas dataframe code
○ Analyses were tested only on small sets of data
● Data engineers
○ Used Spark 0.9
○ Ported the Python code to the Scala RDD API to scale the analysis to big data (see the sketch below)
○ Custom framework with the ability to read from and write to multiple sources (File, Hive table, S3, JDBC)
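A hedged sketch of the kind of RDD-API port described above, against the Spark 0.9-era API; the CSV layout (symbol,date,close), the paths and the "daily return per symbol" logic are illustrative stand-ins for the real analyses.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD operations on Spark 0.9

object DailyReturns {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("daily-returns-sketch"))

    // (symbol, (date, close)) pairs parsed from the raw CSV lines.
    val prices = sc.textFile("hdfs:///stocks/daily.csv")
      .map(_.split(","))
      .map(cols => (cols(0), (cols(1), cols(2).toDouble)))

    // Group each symbol's history, sort it by date and compute day-over-day returns.
    val returns = prices.groupByKey().flatMapValues { history =>
      val sorted = history.toSeq.sortBy(_._1)
      sorted.zip(sorted.drop(1)).map { case ((_, prev), (date, curr)) =>
        (date, (curr - prev) / prev)
      }
    }

    returns.saveAsTextFile("hdfs:///stocks/daily-returns")
    sc.stop()
  }
}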
Architecture
Stocks data (daily basis) → SQL Server → ETL pipeline → HDFS → Data preprocessing → Data analytics → NoSQL → Frontend (dashboard)
Data scientists: Analysis (Python); Data engineers: ETL pipeline, preprocessing and analytics in Spark
Challenges
● Framework challenges
○ Porting code from one language to another led to a lot of inaccuracies
○ Differences in the language constructs and APIs led to changes in code design
● Architectural challenges
○ Clusters used by the team were manually created and maintained
○ Intermediate data was saved in a text-based CSV format
Iteration 2
RDD API to Dataframe API
Iteration 2
● Upgrade to Spark 1.3
● Data scientists
○ The Dataframe API was introduced, a better-known interface for data scientists
○ The SQL API made it easier for data scientists to perform simple operations
○ Zeppelin for data scientists to prototype the analytical algorithms
● Data engineers
○ Moved the CSV-based intermediate format to Parquet (see the sketch below)
○ Amazon EMR based Hadoop cluster with Spark on it
Architecture
Stocks data → ETL → HDFS
Data engineering cluster: Data preprocessing → Data analytics → NoSQL → Dashboard
Data science cluster: Zeppelin notebooks with Data Analytics (PySpark)
Challenges
● Quality challenges
○ Productionalizing multiple analyses required expanding the Data engineering team
○ Team expansion induced code quality issues and bugs in the code
○ Unit tests for individual functionalities were not present
○ There was no review process for changes in the code
Iteration 3
Code quality and testing
Iteration 3
● Created unit test cases for all the analyses
● More readable test suite for the code using ScalaTest (http://www.scalatest.org/)
● Test cases for unit testing small functionalities, plus flow tests that exercise the full ETL flow on sampled data (see the flow-test sketch after the ScalaTest example)
● Review process for code changes through Github PRs
● Daily build in Jenkins to test the flow and functionalities
ScalaTest

import org.scalatest.{FlatSpec, Matchers}
import scala.collection.mutable.Stack

class ExampleSpec extends FlatSpec with Matchers {

  "A Stack" should "pop values in last-in-first-out order" in {
    val stack = new Stack[Int]
    stack.push(1)
    stack.push(2)
    stack.pop() should be (2)
    stack.pop() should be (1)
  }

  it should "throw NoSuchElementException if an empty stack is popped" in {
    val emptyStack = new Stack[Int]
    a [NoSuchElementException] should be thrownBy {
      emptyStack.pop()
    }
  }
}
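A hedged sketch of the flow-test idea on top of ScalaTest: run a small piece of the ETL against a local SparkContext on sampled data. The "daily return" transformation under test is a stand-in for the real pipeline code.

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers}

class EtlFlowSpec extends FlatSpec with Matchers with BeforeAndAfterAll {

  private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("flow-test"))
  }

  override def afterAll(): Unit = sc.stop()

  "The preprocessing flow" should "compute one return per pair of consecutive days" in {
    val sample = sc.parallelize(Seq(("ACME", "2016-01-01", 10.0), ("ACME", "2016-01-02", 11.0)))

    val returns = sample
      .map { case (symbol, date, close) => (symbol, (date, close)) }
      .groupByKey()
      .mapValues(_.toSeq.sortBy(_._1).sliding(2).map(w => (w(1)._2 - w(0)._2) / w(0)._2).toSeq)
      .collect()

    returns should have length 1
    returns.head._2.head should be (0.1 +- 1e-9)
  }
}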
Github PR
Challenges
● Architectural challenges
○ Cluster resources were a bottleneck for the teams
○ Amazon EMR clusters were not throwaway clusters, as data was stored in HDFS
○ Upgrading the Spark version on the cluster was difficult
○ Infrastructure to run scheduled jobs was missing, as Jenkins was not the best way to schedule jobs
○ Stability issues with Zeppelin
Iteration 4
Architectural changes
Iteration 4
● Moved the data storage from HDFS to S3 (see the sketch below)
● Moved to the Databricks cloud environment (https://databricks.com/product/databricks)
● Databricks cloud provides a notebook-based interface for writing Spark code in Scala, Java, Python and R
● Encouraged data scientists to use the Scala API
● Travis for deployment and testing
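A minimal sketch of the storage change, with an illustrative bucket name and the s3a scheme assumed (on Databricks cloud the credentials typically come from the cluster configuration): with S3 as the primary store, the data outlives any single cluster, which is what makes clusters throwaway.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("s3-storage-sketch"))
val sqlContext = new SQLContext(sc)

// Read the Parquet intermediate data directly from S3 instead of cluster-local HDFS.
val stocks = sqlContext.read.parquet("s3a://example-bucket/stocks/daily.parquet")

// ... preprocessing / analytics ...

// Write results back to S3; a later (or differently versioned) cluster can pick them up.
stocks.write.mode("overwrite").parquet("s3a://example-bucket/stocks/preprocessed.parquet")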
Databricks cloud
● Cluster config
○ Launch, configure, scale and terminate
● Jobs
○ Schedule complex workflows
● Notebooks
○ Explore, visualize and share
Improvements
● Data engineers
○ The cluster bottleneck was solved by creating multiple throwaway clusters when needed
○ No need to stick to one cluster for a long time, as the primary data storage was S3
○ Terminating clusters when not in use is cost efficient
○ Multiple clusters with different versions of Spark let users try out the latest Spark features
○ Less cluster maintenance and tuning overhead
Improvements
● Data engineers
○ Less turnaround time in understanding bottlenecks in the workflows
○ Databricks cloud Jobs can be used for scheduling workflows and daily runs
○ Travis enabled strict and immediate code testing
● Data scientists
○ Data scientists can easily share notebooks and analysis results with the team
○ Ability to write in multiple languages
Architecture
Stocks data → ETL → S3 → NoSQL → Dashboard
Databricks cloud: Jobs, a Data science cluster with a notebook (R/Python), and Data Engg clusters 1 and 2 with notebooks (Scala)
Challenges
● Framework challenges
○ The schema is static and doesn't change frequently
○ Dataframes don't have a static (compile-time) schema check
○ The pipeline fails in the middle of processing if there is any change in the data
○ The current window analysis uses Scala constructs to load a specific set of data into memory and run ML on top of it
○ Domain-object-based functions are currently called from inside UDFs (see the sketch below)
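A hedged illustration of the last two challenges, with hypothetical names (Quote, dailyBand): a domain-object function has to be re-wrapped as a UDF to run over a Dataframe, and a misspelled column name only fails at runtime because Dataframes carry no compile-time schema check.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, udf}

val sc = new SparkContext(new SparkConf().setAppName("udf-sketch").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

case class Quote(symbol: String, close: Double)

// Existing domain logic, written against the domain object rather than Row/Column.
def dailyBand(q: Quote): Double = q.close * 0.05

// To call it from the Dataframe API it must be re-exposed as a UDF over raw columns.
val dailyBandUdf = udf((symbol: String, close: Double) => dailyBand(Quote(symbol, close)))

val stocksDf = Seq(("ACME", 10.0), ("ACME", 10.4)).toDF("symbol", "close")
val withBand = stocksDf.withColumn("band", dailyBandUdf(col("symbol"), col("close")))

// Compiles fine, fails only when the job runs, because "clsoe" is not a column:
// stocksDf.withColumn("band", dailyBandUdf(col("symbol"), col("clsoe")))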
Iteration 5
Road ahead
Iteration 5 (Future iteration)
● Data engineers
○ Port the analyses from the Dataframe API to the Dataset API (in Spark 2.0)
○ With the Dataset API we get a static schema check
○ Use the existing domain-object-based functions directly
● Data scientists
○ Move from Scala window-based analysis to SparkSQL window analytics (see the sketch below)
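A hedged sketch of this planned direction, using Spark 2.0 syntax and hypothetical column names: a typed Dataset gives a compile-time schema check and lets existing domain functions run directly in map, and the hand-rolled Scala window logic becomes a SparkSQL window function.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

case class Quote(symbol: String, date: String, close: Double)

val spark = SparkSession.builder().appName("dataset-window-sketch").getOrCreate()
import spark.implicits._

// Typed Dataset: referring to a field that doesn't exist fails at compile time.
val quotes = Seq(
  Quote("ACME", "2016-01-01", 10.0),
  Quote("ACME", "2016-01-02", 10.4),
  Quote("ACME", "2016-01-03", 10.1)
).toDS()

// Domain-object functions apply directly, without wrapping them in a UDF.
val bands = quotes.map(q => q.close * 0.05)

// SparkSQL window analytics instead of loading each window into memory with Scala code.
val threeDayWindow = Window.partitionBy("symbol").orderBy("date").rowsBetween(-2, 0)
val withMovingAvg = quotes.withColumn("movingAvg", avg($"close").over(threeDayWindow))
withMovingAvg.show()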
Lookback
● Spark version
○ 0.9 → 1.6.0
● API
○ RDD → Dataframe → Dataset
● Deployment
○ EC2 → EMR → DB cloud
● Scheduling
○ Jenkins → DB cloud Jobs
● Language
○ Scala
Lookback
● Data format
○ Text → Parquet
● Storage
○ HDFS → S3
● Deployment
○ Jenkins → Travis
References
● http://go.databricks.com/databricks-community-edition-beta-waitlist
● https://databricks.com/blog/2014/07/14/databricks-cloud-making-big-data-easy.html
● http://shashankgowda.com/2016/02/20/introduction-to-dataset-api-in-spark.html
Thank you