
Date post: 06-Feb-2018
Cloudera’s Investments in the Spark Ecosystem
Mike Olson | Founder and Chief Strategy Officer
© Cloudera, Inc. All rights reserved.


Our history with Spark
• On the radar since 2009 (Matei Zaharia and the RAD Lab)
• See my 2013 blog post (“MapReduce and Spark”)
• 1st vendor to ship and support Spark
• 6 contributors to Spark v1 (all other Hadoop vendors: zero)
• 2+ committers (all other Hadoop vendors: zero)
• Complemented by Intel’s substantial & early investment
• Working across the project:
  • Core, Streaming, Security, YARN with Yahoo!, MLlib
  • Sentry, Hive, Pig, Crunch, Dataflow on Spark
  • Cloudera Manager, training, PS (6+), UG, books, etc.
• Single largest commercial distributor of Spark (per Typesafe/Databricks survey)


Our position on Spark
• Cloudera is a member of, and aligned with, the global Spark community
• Spark will replace MapReduce as the general purpose Hadoop framework
• Tremendous community – 400 developers across 50 companies
• Hadoop ecosystem integration (native & 3rd party)
• Doesn’t mean MapReduce goes away – it will be the historical framework
• Spark is not just for data science / ML
• Spark does not replace special purpose frameworks
• One size does not fit all for SQL, Search, Graph, Stream


Why Spark Matters: Logistic Regression (data fits in memory)

[Chart: running time (s), 0 to 4000, versus number of iterations (1, 5, 10, 20, 30), for Hadoop and Spark]
• Hadoop: 110 s / iteration
• Spark: first iteration 80 s, further iterations 1 s
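
The per-iteration figures quoted on the slide are enough to reproduce the shape of the chart. The sketch below is a back-of-the-envelope calculation, not a benchmark: it assumes total running time is simply the first-iteration cost plus a constant cost for each later iteration, and the function name `total_runtime` is made up for illustration.

```python
def total_runtime(iterations, first_s, later_s):
    """Total seconds when the first iteration costs first_s and each later one later_s."""
    if iterations < 1:
        return 0
    return first_s + (iterations - 1) * later_s


# Per-iteration costs taken from the slide.
HADOOP_FIRST, HADOOP_LATER = 110, 110  # 110 s for every iteration
SPARK_FIRST, SPARK_LATER = 80, 1       # 80 s for the first, 1 s afterwards

# Iteration counts shown on the chart's horizontal axis.
for n in (1, 5, 10, 20, 30):
    hadoop = total_runtime(n, HADOOP_FIRST, HADOOP_LATER)
    spark = total_runtime(n, SPARK_FIRST, SPARK_LATER)
    print(f"{n:>2} iterations: Hadoop {hadoop:>4} s, Spark {spark:>3} s")
```

Under that assumption, 30 iterations come to 3300 s on Hadoop and 109 s on Spark, because only Spark's first pass pays the cost of loading the data into memory.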


In-Memory Datasets

Trends
• ½ price every 18 months
• 2x bandwidth every 3 years
The numbers get even more interesting with upcoming enhancements to the Intel architecture.

• 128 – 384 GB
• 12-24 cores
• 50 GB per sec

Memory an enabler for high performance big data applications
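
A rough worked example of what these numbers imply. It rests on two assumptions the slide does not state: that the three figures describe a single server (memory capacity, core count, memory bandwidth), and that both trends continue as smooth exponentials. All function names are hypothetical.

```python
def scan_seconds(memory_gb, bandwidth_gb_per_s):
    """Seconds to read memory_gb once at a sustained bandwidth_gb_per_s."""
    return memory_gb / bandwidth_gb_per_s


def relative_price(months, halving_months=18):
    """Price relative to today if it halves every halving_months (assumed trend)."""
    return 0.5 ** (months / halving_months)


def projected_bandwidth(years, base_gb_per_s=50.0, doubling_years=3):
    """Bandwidth in GB/s if it doubles every doubling_years (assumed trend)."""
    return base_gb_per_s * 2 ** (years / doubling_years)


# One full pass over the smallest and largest memory sizes on the slide.
for gb in (128, 384):
    print(f"{gb} GB at 50 GB/s: {scan_seconds(gb, 50):.2f} s per full scan")

# Where the two trends would lead after three years.
print(f"relative price after 36 months: {relative_price(36):.2f}")
print(f"bandwidth after 3 years: {projected_bandwidth(3):.0f} GB/s")
```

On these assumptions a full pass over a server's memory takes a few seconds at most, which is the kind of arithmetic that makes iterating over an in-memory dataset attractive.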


Delivering Spark in Cloudera Enterprise

Hadoop Integration
• Standard Hadoop data formats
• Runs under YARN in mixed clusters
• Security

Libraries
• MLlib – Machine Learning toolkit
• GraphX (alpha) – Graph analytics based on PowerGraph abstractions
• Spark Streaming – Near real-time analytics

Language support:
• SparkR (upcoming)
• Java 8
• PySpark and pandas interoperability
• DataFrame API
• Schema support in Spark’s APIs
• SQL support in Spark Streaming (upcoming)


Cloudera’s Spark Investments for 2015

Community leadership
• Partner of choice for companies doing Spark integration
• Increase our involvement in the community

Batch Tool of choice / Replace MR
• Complete Hive on Spark
• Complete Pig on Spark
• Oozie action for Spark (Oozie team)
• Improve Spark core shuffle primitives to be equivalent or better than MapReduce in all respects
• Integrate with Google DataFlow
• Support advanced features such as runtime DAG optimization

EDH Integration and cluster citizenship
• Automatic executor launch / destruction based on usage
• Validation of Parquet / Avro with Impala style usage
• Improved integration with HBase to simplify RDD creation against HBase tables
• ATS integration for Spark
• Container resizing with YARN support
• Tachyon alternative in HDFS (dependent on HDFS team priorities) + off-heap caching

Ease of development
• Provide EXPLAIN PLAN primitives at runtime and compile time
• Programmatic job submission interface
• Auto-compute partition model to simplify configuration space for users

Enterprise grade
• CM integration; AMon integration; tuning hints; validation
• Parallel split generation
• REST API for Spark History Server
• Security - Encryption: On-the-wire encryption, shuffle encryption
• Security - Navigator: Integration with Audit, Lineage (visible through Hive as well)
• Scale - Validate Spark at very large scale and improve scalability where issues are found
• Security - MR / Spark: RecordService for deeper Sentry integration
• Security - Authorization: Integrate Schema RDDs with Sentry
• Availability: Spark Streaming availability (mostly complete)

Data science tool of choice
• Hue app for Spark (a la Zeppelin, Databricks): Phase 1
• REST-based interface to Spark for Hue
• Oryx2 built on Spark for data science lifecycle management


Cloudera customer use cases – core Spark

Financial Services
• Use cases: Value-at-Risk calculations; ETL pipeline speed-up; Analyzing stock data for 20 years
• Replaces: Home grown applications

Genomics
• Use cases: Identify genes implicated in disease onset in full human genome
• Replaces: MySQL engine

Data services
• Use cases: Trend analysis using statistical methods on large data sets; Document classification (LDA); Fraud analytics
• Replaces: Netezza replacement; Net new

ERP
• Use cases: OCR and bill classification
• Replaces: Net new

Healthcare
• Use cases: Calculating Jaccard scores on health care data sets
• Replaces: Net new
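
For readers unfamiliar with the Jaccard score named in the healthcare use case: it is the size of the intersection of two sets divided by the size of their union. The sketch below is a plain-Python illustration of that definition only. It is not the customer's implementation, the record contents are invented placeholders, and returning 0.0 for two empty sets is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard score |A ∩ B| / |A ∪ B| for two collections of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical records: each is a set of made-up attribute labels.
record_1 = {"code_a", "code_b", "code_c"}
record_2 = {"code_b", "code_c", "code_d", "code_e"}

# 2 shared labels out of 5 distinct labels overall.
print(jaccard(record_1, record_2))
```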


Cloudera customer use cases – Streaming

Financial Services
• Use cases: On-line fraud detection
• Replaces: Net new

Many
• Use cases: Continuous ETL

Retail
• Use cases: On-line recommender systems; Inventory management
• Replaces: Custom apps


Why Cloudera?

Expertise
• Deep engineering investment – only distribution vendor with engineering contributions to Spark and actual technical know-how
• Field team, support, training and services with experience in many Spark use cases
• Driving roadmap for Spark

Experience
• Most customers running Spark across all distributions put together
• Range from few nodes to 800+ nodes
• Longest field presence – first vendor to support Spark, and still one of only two vendors with official support

Partnerships
• Intel partnership brings 15 Spark developers focused on Cloudera customer use cases
• Business relationship with Databricks to do joint development on Spark


Thank  you  [email protected]  @mikeolson  

