Using Scalding for Data-Driven Product Development
Sasha Ovsankin, LinkedIn
Presented to Scala By The Bay, Aug 9, 2014
/summary
Data-Driven Product Development
Scalding = Hadoop + Scala
/data-driven
[Diagram: Your Amazing Service, Value, Data]
/data-driven/linkedin
[Diagram: “Online” World and “Offline” World (Hadoop). Labels: Web Applications, NoSQL Data Stores, Databases, Messaging, Message delivery, Tracking/logging, ETL, HDFS, Hadoop Jobs, Analytics, Data Products]
/linkedin/big-data/links
• “LinkedIn Big Data Ecosystem”
  – http://lnkd.in/big-data-ecosystem
• Grid Operations
  – http://lnkd.in/gridops2013
/scalding
http://github.com/twitter/scalding
• Scala-based DSL for Map/Reduce jobs
• Built on Cascading, a stable and mature Hadoop framework
• Uses an API similar to Scala collections:

    import com.twitter.scalding._

    // Counts how often each word occurs in the input text
    class WordCountJob(args : Args) extends Job(args) {
      TextLine( args("input") )                  // one record per input line
        .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
        .groupBy('word) { _.size }               // number of occurrences per word
        .write( Tsv( args("output") ) )          // tab-separated output
    }

• Succinct and powerful
• High level of abstraction
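To make the "API similar to Scala collections" point concrete, here is a minimal sketch of the same word count written against plain in-memory Scala collections, with no Hadoop or Scalding dependency. The object and method names (WordCountSketch, wordCount) are made up for illustration and are not part of Scalding.

```scala
// Plain Scala collections only; WordCountSketch is a hypothetical name.
object WordCountSketch {
  // Same shape as WordCountJob: split lines into words, group by word, count each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("""\s+"""))
      .filter(_.nonEmpty) // split yields an empty first token when a line starts with whitespace
      .groupBy(word => word)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

For example, WordCountSketch.wordCount(Seq("to be or", "not to be")) counts "to" and "be" twice each and "or" and "not" once each. The flatMap and groupBy steps line up one-to-one with the Scalding job; the difference is that the job runs them over files on a cluster rather than over a Seq in memory.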
/data-driven/problem/scaling
• Problem: Scaling
• Solution
  – Distributed processing
  – High-level description of algorithms
  – Functional programming
…/solution/scalding
../problem/complexity
• Problem: Complexity
• Solution
  – Consistent way of organizing data
    • Self-describing data formats (Avro)
    • File organization
  – Type safety
  – Modularization
…/solution/scalding
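As a rough illustration of the "type safety" and "modularization" points, the sketch below models a record as a case class and builds a small flow out of separately testable functions. Everything here is hypothetical: PageView, its fields, and the function names are invented for the example, and plain Scala collections stand in for real job inputs.

```scala
// Hypothetical example; none of these names come from the talk or from Scalding.
object TypedFlowSketch {
  // In practice the fields would be dictated by a schema (for example an Avro schema).
  final case class PageView(memberId: Long, page: String)

  // Each step is a small function with an explicit input and output type.
  def viewsPerPage(views: Seq[PageView]): Map[String, Int] =
    views.groupBy(_.page).map { case (page, vs) => page -> vs.size }

  // Highest counts first; ties are broken alphabetically by page name.
  def topPages(counts: Map[String, Int], n: Int): Seq[(String, Int)] =
    counts.toSeq.sortBy { case (page, count) => (-count, page) }.take(n)

  // Steps compose into a flow; wiring mismatched types together is rejected by the compiler.
  def report(views: Seq[PageView], n: Int): Seq[(String, Int)] =
    topPages(viewsPerPage(views), n)
}
```

Because each step has a declared type, a change to the record shape shows up as a compile error in every step that depends on it, and each function can be unit-tested on a handful of in-memory records.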
/linkedin/hadoop/practices
• All online data end up in HDFS
  – Avro encoding is standard
• Production Process
  – CI/Automatic Build
    • More info forthcoming
  – Production Review
  – Operations and Monitoring
    • More info at http://lnkd.in/gridops2013
• Result: Thousands of jobs running in production
  • More info at http://lnkd.in/big-data-ecosystem
../solution/scala/killer-argument
• Map & reduce are primitives

    scala> import scala.math.pow
    scala> (1 to 1000) map { pow(_, 2) } reduce { _ + _ }
    res20: Double = 3.338335E8
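One reason map and reduce suit distributed processing is that an associative reduce can be applied to chunks independently and the partial results combined afterwards. The sketch below shows this on the same sum of squares, using in-memory chunks as a stand-in for separate workers. The names (ChunkedSumSketch, chunked) are invented for the example, and Int arithmetic is used so that addition is exactly associative.

```scala
// Hypothetical example: chunks of a Seq play the role of data split across workers.
object ChunkedSumSketch {
  // Squares of 1 to 1000, computed with Int arithmetic (no overflow at this size).
  val squares: Seq[Int] = (1 to 1000).map(x => x * x)

  // Single pass over all elements.
  val direct: Int = squares.reduce(_ + _)

  // Reduce each chunk on its own, then reduce the partial sums.
  // chunkSize must be positive.
  def chunked(chunkSize: Int): Int =
    squares.grouped(chunkSize).map(_.reduce(_ + _)).reduce(_ + _)
}
```

ChunkedSumSketch.chunked(100) gives the same value as ChunkedSumSketch.direct, which is 333833500, whatever positive chunk size is chosen. That independence from how the data is split is what lets a framework decide the partitioning on its own.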
/linkedin/scalding/status
• Started >1 year ago
• Thousands of production LOC written in Scalding by our team
  – Pretty happy with readability, maintainability and tooling support
• Dozens of flows are currently in production, and counting
• Created Scalding user group
• Growing interest
• Learning:
  – Scala[Scalding] < Scala[ _ ]
/summary
Data-Driven Product Development
Scalding = Hadoop + Scala
/linkedin/join-us
• Work on unique and interesting problems
• Be part of a great engineering community
• Use the latest tools and technologies
• Help connect the world’s professionals to help them become more productive and successful
• We are looking for amazing people interested in Software Engineering and Data Science
  – http://linkedin.com/careers
Questions?