Madison Big Data Spark Share - Meetup · files.meetup.com/3315552/MadisonBigData_Spark_Share.pdf · ...

Date post: 20-May-2020
Transcript
Page 1: Madison Big Data Spark Share

1 © Cloudera, Inc. All rights reserved.

Cloudera Improvements in Apache Spark
Brian Baillod | Sales Engineer

Page 2:

Agenda

• Introduction
• Spark One Platform Initiative
• Spark Overview and Improvements
• Spark Proof of Concept
• Kudu and RecordService

Page 3:

Cloudera company snapshot

• Founded: 2008, by former employees of [company logos on the original slide]
• Employees Today: 900+ worldwide
• World Class Support: More than 75 24x7 global staff
• Cloudera University: Over 40,000 trained
• We help code Hadoop: Cloudera employees are leading developers & contributors to the complete Apache Hadoop ecosystem of projects
• We help fix Hadoop: Cloudera fixed 60% of all Hadoop JIRA bugs

Page 4:

Hadoop Adoption

Categories of Hadoop adoption

Figure: adoption tiers (Training, Services & Support, Subscription, Free/Developer) plotted by Big Data Maturity and Business Need.

• Training: 60% of Fortune 100 attended Cloudera training, over 40,000 trained since 2009
• Service & Support: 9/10 for support satisfaction, ability to solve technical issues #1 recommendation
• Subscription: Over 2x revenue of nearest competitor, 90% renewal rate
• Free/Developer: Over 2.5 million downloads

Page 5:

What is Spark

• Fast general-purpose processing engine for large data
• Provides APIs in Java, Scala and Python
• Includes an advanced DAG execution engine that supports in-memory computing
• Includes high-level tools like Spark SQL, MLlib, GraphX, and Spark Streaming
• Can run in a cluster, standalone, or local
• Latest version (as of this talk) is 1.5.1
• spark.apache.org
• LOTS of momentum

Page 6:

Cloudera One Platform Initiative

• Cloudera is doubling down on Spark
• Outlining a vision for the future
  • Kudu, RecordService, Auto-tuning, Security, Kafka integration
• Challenging other vendors to participate in Spark development

Page 7:

Cloudera's Engineering Commitment to Spark

Spark Committers by Hadoop Distribution*
  Cloudera      67%
  Intel         17%
  Hortonworks   17%
* IBM and MapR have 0 committers

Spark Patches by Hadoop Distribution
  Cloudera      370
  Hortonworks   4
  IBM           12
  MapR          1
  Intel         400

Page 8:

Spark will replace MapReduce
To become the standard execution engine for Hadoop

Page 9:

The Future of Data Processing on Hadoop
Spark complemented by specialized fit-for-purpose engines

• General Data Processing w/Spark: Fast Batch Processing, Machine Learning, and Stream Processing
• Analytic Database w/Impala: Low-Latency Massively Concurrent Queries
• Full-Text Search w/Solr: Querying textual data
• On-Disk Processing w/MapReduce: Jobs at extreme scale and extremely disk IO intensive

Shared:
• Data Storage
• Metadata
• Resource Management
• Administration
• Security
• Governance

Page 10:

Why is Cloudera leading this initiative?

• Cloudera was the first Hadoop vendor to ship and support Spark
• Spark is a fully integrated part of Cloudera's platform
  • Shared data, metadata, resource management, administration, security, and governance
• Cloudera is the first Hadoop vendor to offer Spark training
  • Trained more customers than any other vendor
• Cloudera has more Spark customers in production than all other companies combined

Page 11:

Spark Overview and Improvements

Page 12:

Apache Spark
Flexible, in-memory data processing for Hadoop

Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell

Flexible Extensible API
• APIs for different types of workloads:
  • Batch
  • Streaming
  • Machine Learning
  • Graph

Fast Batch & Stream Processing
• In-Memory processing and caching

Page 13:

Easy Development
High Productivity Language Support

• Native support for multiple languages with identical APIs
  • Scala, Java, Python
• Use of closures, iterations, and other common language constructs to minimize code
  • 2-5x less code

Python
    lines = sc.textFile(...)
    lines.filter(lambda s: "ERROR" in s).count()

Scala
    val lines = sc.textFile(...)
    lines.filter(s => s.contains("ERROR")).count()

Java
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      public Boolean call(String s) {
        return s.contains("ERROR");
      }
    }).count();

Page 14:

Python Or Scala?

• Use Python for prototyping
  • Spark Python API is slower than Scala
• Use Scala for development
  • Steep learning curve for functional programming

Page 15:

Easy Development
Use Interactively

• Interactive exploration of data for data scientists
  • No need to develop "applications"
• Developers can prototype application on live system

percolateur:spark srowen$ ./bin/spark-shell --master local[*]
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
...

scala> val words = sc.textFile("file:/usr/share/dict/words")
...
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

scala> words.count
...
res0: Long = 235886

scala>

Page 16:

Easy Development
Expressive API

• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save
• …
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip

Page 17:

Memory Management for Greater Performance

Trends:
• ½ price every 18 months
• 2x bandwidth every 3 years

Figure labels: 64-128GB RAM, 16 cores, 50 GB per second

Memory can be enabler for high performance big data applications

Page 18:

Spark Concepts

• RDD – Resilient Distributed Dataset
• Transformations
• Actions
• Caching
• DataFrames
• Spark Streaming
• Spark SQL
• Pluggable Spark

Page 19:

Resilient Distributed Dataset (RDD)

• Read-only partitioned collection of records
• Created through:
  • Transformation of data in storage
  • Transformation of RDDs
• Contains lineage to compute from storage
• Lazy materialization
• Users control persistence and partitioning

Page 20:

RDD Operations

• Transformations create new RDD from an existing one
• Actions run computation on RDD and return a value
• Transformations are lazy
• Actions materialize RDDs by computing transformations
• RDDs can be cached to avoid re-computing
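
The split between lazy transformations and eager actions can be illustrated without a cluster. What follows is a toy, single-process model in plain Python and not the Spark API: the class `ToyRDD`, its `computations` counter, and the sample log lines are all hypothetical names made up for this sketch.

```python
class ToyRDD:
    """Toy, single-process stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source          # callable returning the raw records
        self._steps = tuple(steps)     # recorded transformations, not yet run
        self._cache_enabled = False
        self._cached = None
        self.computations = 0          # how many times the pipeline really ran

    # Transformations: record the step, return a new ToyRDD, do no work.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._steps + (("filter", fn),))

    def cache(self):
        self._cache_enabled = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        records = list(self._source())
        for kind, fn in self._steps:
            if kind == "map":
                records = [fn(r) for r in records]
            else:
                records = [r for r in records if fn(r)]
        if self._cache_enabled:
            self._cached = records
        return records

    # Actions: trigger the computation and return a value.
    def count(self):
        return len(self._materialize())

    def first(self):
        return self._materialize()[0]


log = ["ERROR disk full", "INFO started", "ERROR timeout"]

errors = ToyRDD(lambda: log).filter(lambda s: s.startswith("ERROR"))
runs_before_action = errors.computations   # transformations alone ran nothing
first_count = errors.count()               # action: the pipeline runs
second_count = errors.count()              # no cache: the pipeline runs again

cached = ToyRDD(lambda: log).filter(lambda s: s.startswith("ERROR")).cache()
cached.count()
cached.count()                             # served from the cached records
```

In this sketch the uncached pipeline runs once per action, while the cached one runs once in total, which is the trade the last bullet describes.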

Page 21:

Example Operations

Transformations
• Map
• Filter
• Sample
• Join

Actions
• Reduce
• Count
• First, Take
• SaveAs

Page 22:

Fault-Tolerance

• RDDs contain lineage
• Lineage: Source location and list of transformations
• Lost partitions can be re-computed from source data

    msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                    .map(lambda s: s.split("\t")[2]))

Figure: HDFS File -> filter (func = startswith(…)) -> Filtered RDD -> map (func = split(...)) -> Mapped RDD
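
Lineage-based recovery can be sketched in plain Python. This is a simplified illustration under stated assumptions, not how Spark is implemented: the helper `build_partition`, the two-partition sample log, and the way a lost partition is simulated are all hypothetical.

```python
def build_partition(source_partition, lineage):
    """Replay the recorded transformations over one source partition."""
    records = list(source_partition)
    for kind, fn in lineage:
        if kind == "filter":
            records = [r for r in records if fn(r)]
        elif kind == "map":
            records = [fn(r) for r in records]
    return records


# Source data: two partitions of a tab-separated log (made-up sample lines).
source = [
    ["ERROR\t10:01\tdisk full", "INFO\t10:02\tstarted"],
    ["ERROR\t10:03\ttimeout", "WARN\t10:04\tslow"],
]

# Lineage: the same filter and map steps as the slide's example.
lineage = [
    ("filter", lambda s: s.startswith("ERROR")),
    ("map", lambda s: s.split("\t")[2]),
]

msgs = [build_partition(p, lineage) for p in source]

msgs[1] = None                                   # simulate losing partition 1
msgs[1] = build_partition(source[1], lineage)    # recompute only that partition
```

Only the lost partition is rebuilt, and only from its own slice of the source data plus the recorded steps; the surviving partition is left alone.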

Page 23:

Caching – Storage Levels

Different options provide tradeoffs between memory usage and CPU efficiency. Cache when using iterative algorithms.

• MEMORY_ONLY – most CPU efficient, data has to fit in memory
• MEMORY_ONLY_SER – more space efficient but still reasonably fast
• MEMORY_AND_DISK
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2…

Page 24:

Data Frames

• Distributed collection of rows organized into named columns
• Spark SQL's Data Source API can read and write Data Frames using a variety of formats
  • Hive, JSON, Parquet, HDFS
• Calling the DataFrame API can let you
  • Select the columns you want
  • Join data sources
  • Aggregate and Filter
• Spark 1.5 lets you access the Hive Metastore to read/write schemas directly.

 

Page 25:

Spark Streaming

What is it?
• Run continuous processing of data using Spark's core API
• Extends Spark concepts to fault-tolerant, transformable streams
• Adds "rolling window" operations
  • Example: Compute rolling averages or counts for data over last five minutes

Benefits:
• Same programming paradigm for streaming and batch
• Excellent throughput
• Scale easily to support large volumes of data ingest

Common Use Cases:
• "On-the-fly" ETL as data is ingested into Hadoop/HDFS
• Detect anomalous behavior and trigger alerts
• Continuous reporting of summary metrics for incoming data
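
The "rolling window" idea (counts or averages over the last five minutes) can be sketched in plain Python. This is a single-process illustration, not the Spark Streaming API, which applies windows to micro-batches on a cluster. The class `RollingWindow` and its methods are hypothetical, and the sketch assumes events arrive in timestamp order.

```python
from collections import deque


class RollingWindow:
    """Keep a running count and average over the last `window_seconds` seconds."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._events = deque()   # (timestamp, value) pairs, oldest first
        self._total = 0.0

    def add(self, timestamp, value):
        """Record one event, then drop everything that fell out of the window."""
        self._events.append((timestamp, value))
        self._total += value
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_value = self._events.popleft()
            self._total -= old_value

    def count(self):
        return len(self._events)

    def average(self):
        return self._total / len(self._events) if self._events else 0.0


window = RollingWindow(window_seconds=300)   # five minutes
for ts, value in [(0, 10), (100, 20), (299, 30)]:
    window.add(ts, value)
count_at_299, avg_at_299 = window.count(), window.average()

window.add(400, 40)                          # events at t=0 and t=100 expire
count_at_400, avg_at_400 = window.count(), window.average()
```

Keeping a running total means each new event costs only the evictions it triggers, rather than a rescan of the whole window.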

Page 26:

Spark Streaming Architectures

Figure (pipeline):
• Data Sources
• Ingest / Integration Layer
  • Flume
  • Kafka
• Spark Stream Processing: Data Prep, Aggregation / Scoring
• Transformed Results
  • HDFS: Spark Long-Term Analytics / Model Building
  • HBase: Real-Time Result Serving

Page 27:

Spark SQL
Machine Learning Applications

• Goal:
  • Spark/Java developers and data scientists can inline SQL into Spark apps
• Designed for:
  • Ease of development for Spark developers
  • Handful of concurrent Spark jobs
• Strengths:
  • Ease of embedding SQL into Java or Scala applications
  • SQL for common functionality in developer flow (e.g. aggregations, filters, samples)

Page 28:

Impala Remains Tool of Choice for Interactive SQL

Single User vs 10 User Response Time / Impala Times Faster
Time (in seconds); lower bars = better

Engine        Single User   10 Users   Impala times faster (single user / 10 users)
Impala        5             11
Spark SQL     25            120        5.0x / 10.6x
Presto        37            302        7.4x / 27.4x
Hive-on-Tez   77            202        15.4x / 18.3x

Page 29:

Pluggable Spark – replace MapReduce

Cloudera is leading community development to port components to Spark:

Stage 1
• Crunch on Spark
• Search on Spark

Stage 2
• Hive on Spark (beta)
• Spark on HBase (beta)

Stage 3
• Pig on Spark (alpha)
• Sqoop on Spark
• Spark on Kudu

Page 30:

Spark Customer Use Cases

Core Spark

Financial Services
• Portfolio Risk Analysis
• ETL Pipeline Speed-Up
• 20+ years of stock data

Health
• Identify disease-causing genes in the full human genome
• Calculate Jaccard scores on health care data sets

ERP
• Optical Character Recognition and Bill Classification

Data Services
• Trend analysis
• Document classification (LDA)
• Fraud analytics

1010

Spark Streaming

Financial Services
• Online Fraud Detection

Health
• Incident Prediction for Sepsis

Retail
• Online Recommendation Systems
• Real-Time Inventory Management

Ad Tech
• Real-Time Ad Performance Analysis

Page 31:

Doing the Math – Executors and Cores

Determine the optimal resource allocation for the Spark job

Don't exceed 5 cores per executor
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

Cluster: 4 Core + 4 Core + 4 Core + 4 Core = 16 Total Cores in Cluster

Core Allocation
• 1 Core for Application Master
• 15 Cores for Executors

Allocate Executors
• 1 Executor = 4 Cores: 15 Cores = 3 Executors with 4 Cores Each (leaves 3 Cores un-utilized)
• Other ratios may lead to better resource utilization
  • 1 Executor = 2 Cores: 15 Cores = 7 Executors with 2 Cores Each (leaves 1 Core un-utilized)
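
The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name `allocate_executors` and its parameters are hypothetical, and it models only core counts (one core reserved for the Application Master, at most five cores per executor), not memory or per-node placement.

```python
def allocate_executors(total_cores, cores_per_executor,
                       reserved_cores=1, max_cores_per_executor=5):
    """Return (executors, unused_cores) for a given executor size.

    reserved_cores models the core set aside for the Application Master;
    max_cores_per_executor is the slide's "don't exceed 5" guideline.
    """
    if not 1 <= cores_per_executor <= max_cores_per_executor:
        raise ValueError("cores_per_executor must be between 1 and %d"
                         % max_cores_per_executor)
    available = total_cores - reserved_cores
    executors = available // cores_per_executor
    unused = available - executors * cores_per_executor
    return executors, unused


# Compare every allowed executor size for the 16-core cluster on the slide.
options = {size: allocate_executors(16, size) for size in range(1, 6)}
for size, (executors, unused) in options.items():
    print(f"{size} cores/executor: {executors} executors, {unused} cores unused")
```

For the slide's two cases this gives 3 executors with 3 cores unused (4 cores each) and 7 executors with 1 core unused (2 cores each). Under this simplified model, sizes that divide 15 evenly leave no core idle.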

Page 32:

Kudu

Page 33:

Kudu Design Goals

• High throughput for big scans (columnar storage and replication)
  Goal: Within 2x of Parquet
• Low latency for short accesses (primary key indexes and quorum design)
  Goal: 1ms read/write on SSD
• Database-like semantics (initially single-row ACID)
• Relational data model
  • SQL query
  • "NoSQL" style scan/insert/update (Java client)

Page 34:

Kudu
Storage for Fast Analytics on Fast Data

• New updating column store for Hadoop
  • Simplifies the architecture for building analytic applications on changing data
  • Designed for fast analytic performance
  • Natively integrated with Hadoop
• Apache-licensed open source (intent to donate to ASF)
• Beta now available

Figure (platform stack):
• Workloads: Data Engineering, Data Discovery & Analytics, Data Apps
• Engines: Batch (Spark, Hive, Pig), Stream (Spark), SQL (Impala), Search (Solr), Model (Spark), Online (HBase)
• Unified Data Services: Resource Management – YARN, Security – Sentry
• Data Integration & Storage: Filesystem (HDFS), Relational (Kudu), NoSQL (HBase), Ingest – Sqoop, Flume, Kafka

Page 35:

Kudu Trade-Offs

• Random updates will be slower
  • HBase model allows random updates without incurring a disk seek
  • Kudu requires a key lookup before update, Bloom lookup before insert
• Single-row reads may be slower
  • Columnar design is optimized for scans
  • Future: may introduce "column groups" for applications where single-row access is more important

Page 36:

Resources

• Join the community: http://getkudu.io
• Download the Beta: cloudera.com/downloads
• Read the Whitepaper: getkudu.io/kudu.pdf

Page 37:

RecordService

Page 38:

Hadoop started out with zero security

• Didn't need it for the Silicon Valley applications
• Does need it for Corporate applications
• Cloudera is working on providing full-featured Spark Security

Page 39:

Comprehensive, Compliance-Ready Security
Authentication, Authorization, Audit, and Compliance

Perimeter
Guarding access to the cluster itself
Technical Concepts: Authentication, Network isolation

Access
Defining what users and applications can do with data
Technical Concepts: Permissions, Authorization

Visibility
Reporting on where data came from and how it's being used
Technical Concepts: Auditing, Lineage

Data
Protecting data in the cluster from unauthorized visibility
Technical Concepts: Encryption, Tokenization, Data masking

Components named on the slide: Cloudera Manager; Apache Sentry & RecordService; Cloudera Navigator; Navigator Encrypt & Key Trustee | Partners

Page 40:

Active Directory and Kerberos

Active Directory
• Manages Users, Groups, and Services
• Provides username / password authentication
• Group membership determines Service access

Kerberos
• Trusted and standard third-party
• Authenticated users receive "Tickets"
• "Tickets" gain access to Services

Figure (flow): User authenticates to AD (User [ssmith], Password [*****]) -> Authenticated user gets Kerberos Ticket -> Ticket grants access to Services (e.g. Impala)

Page 41:

Fine-Grained Access Control in HDFS Across All Hadoop Paths

Columns: Sensitive column visibility varies by role (Ex. credit card numbers)
• Managers: 1234 5678 1234 5678
• Call Center: XXXX XXXX XXXX 5678
• Analysts: XXXX XXXX XXXX XXXX
• Others: No access to credit card column

Rows: Different user groups need access to different records
• European privacy laws
• Government security clearance
• Financial information restrictions
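
The role-dependent column visibility above can be sketched as a small masking function. This is an illustration in plain Python, not RecordService or Sentry code: the function `mask_card` and the role identifiers (`manager`, `call_center`, `analyst`) are hypothetical names chosen to match the four cases on the slide.

```python
# Number of trailing digits each role may see (hypothetical role names).
VISIBLE_DIGITS = {
    "manager": None,      # None means the full number is visible
    "call_center": 4,
    "analyst": 0,
}


def mask_card(number, role):
    """Return the card number as `role` may see it, or None for no access."""
    if role not in VISIBLE_DIGITS:
        return None                       # "Others": no access to the column
    digits = number.replace(" ", "")
    visible = VISIBLE_DIGITS[role]
    if visible is None:
        visible = len(digits)
    hidden = len(digits) - visible
    masked = "X" * hidden + digits[hidden:]
    # Re-group into blocks of four, as on the slide.
    return " ".join(masked[i:i + 4] for i in range(0, len(masked), 4))


card = "1234 5678 1234 5678"
views = {role: mask_card(card, role)
         for role in ("manager", "call_center", "analyst", "intern")}
```

Returning `None` for an unknown role models "no access to the column" rather than a fully masked value, which keeps the two cases distinguishable for callers.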

Page 42:

RecordService
Unified Access Control Enforcement

• New high-performance security layer that centrally enforces access control policies across Hadoop
  • Complements Apache Sentry's unified policy definition
  • Row- and column-based security
  • Dynamic data masking
• Apache-licensed open source
• Beta now available

Figure (platform stack):
• Workloads: Data Engineering, Data Discovery & Analytics, Data Apps
• Engines: Batch (Spark, Hive, Pig), Stream (Spark), SQL (Impala), Search (Solr), Model (Spark), Online (HBase)
• Unified Data Services: Resource Management – YARN, Security – Sentry, RecordService
• Data Integration & Storage: Filesystem (HDFS), NoSQL (HBase), Ingest – Sqoop, Flume, Kafka

Page 43:

Fine-Grained HDFS Access without RecordService

Date/time              Accnt #      SSN           Asset   Trade   Country
09:33:11 16-Feb-2015   0234837823   238-23-9876   AAPL    Sell    US
11:33:01 16-Feb-2015   3947848494   329-44-9847   TBT     Buy     EU
14:12:34 16-Feb-2015   4848367383   123-56-2345   IBM     Sell    UK
09:22:03 16-Feb-2015   3485739384   585-11-2345   INTC    Buy     US
11:55:33 16-Feb-2015   3847598390   234-11-8765   F       Buy     US
10:22:55 16-Feb-2015   8765432176   344-22-9876   UA      Buy     UK
13:45:24 16-Feb-2015   3456789012   412-22-8765   AMZN    Sell    EU
09:03:44 16-Feb-2015   4857389329   123-44-5678   TMV     Buy     US
15:55:55 16-Feb-2015   4756983234   234-76-9274   MA      Buy     UK

Split the original file

Date/time              Accnt #      SSN           Asset   Trade   Country
14:12:34 16-Feb-2015   4848367383   123-56-2345   IBM     Sell    UK
10:22:55 16-Feb-2015   8765432176   344-22-9876   UA      Buy     UK
15:55:55 16-Feb-2015   4756983234   234-76-9274   MA      Buy     UK

Date/time              Accnt #      SSN           Asset   Trade   Country
11:33:01 16-Feb-2015   3947848494   329-44-9847   TBT     Buy     EU
13:45:24 16-Feb-2015   3456789012   412-22-8765   AMZN    Sell    EU

Date/time              Accnt #      SSN           Asset   Trade   Country
09:33:11 16-Feb-2015   0234837823   238-23-9876   AAPL    Sell    US
09:22:03 16-Feb-2015   3485739384   585-11-2345   INTC    Buy     US
11:55:33 16-Feb-2015   3847598390   234-11-8765   F       Buy     US
09:03:44 16-Feb-2015   4857389329   123-44-5678   TMV     Buy     US

Use HDFS permissions to limit access

Page 44:

Fine-Grained HDFS Access Control with RecordService

• Apply controls to the master data file
• Row, column, and sub-column (masking) controls
• Enforce these across all access paths

Date/time              Accnt #      SSN           Asset   Trade   Country
09:33:11 16-Feb-2015   0234837823   238-23-9876   AAPL    Sell    US
11:33:01 16-Feb-2015   3947848494   329-44-9847   TBT     Buy     EU
14:12:34 16-Feb-2015   4848367383   123-56-2345   IBM     Sell    EU
09:22:03 16-Feb-2015   3485739384   585-11-2345   INTC    Buy     US
11:55:33 16-Feb-2015   3847598390   234-11-8765   F       Buy     US
10:22:55 16-Feb-2015   8765432176   344-22-9876   UA      Buy     EU
13:45:24 16-Feb-2015   3456789012   412-22-8765   AMZN    Sell    EU

Annotations: Column-Level Controls, Row-Level Controls

Date/time              Accnt #      SSN           Asset   Trade   Country
09:33:11 16-Feb-2015   0234837823   238-23-9876   AAPL    Sell    US
11:33:01 16-Feb-2015   3947848494   329-44-9847   TBT     Buy     group2
14:12:34 16-Feb-2015   4848367383   123-56-2345   IBM     Sell    group3
09:22:03 16-Feb-2015   3485739384   585-11-2345   INTC    Buy     US
11:55:33 16-Feb-2015   3847598390   234-11-8765   F       Buy     US
10:22:55 16-Feb-2015   8765432176   344-22-9876   UA      Buy     group3
13:45:24 16-Feb-2015   3456789012   412-22-8765   AMZN    Sell    group2

Annotations: Column-Level Controls, Row-Level Controls

What U.S. Brokers See: SSN overlay XXX-XX (shown three times on the slide)
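
Applying row-level and column-level controls to one master data set, instead of splitting the file, can be sketched in plain Python. This is an illustration only and not the RecordService API: the policy table `POLICIES`, the role names, and the function `view_for` are hypothetical, and keeping the last four SSN digits after the `XXX-XX` mask is an assumption about what the slide's overlay means.

```python
# Four rows taken from the master table on the slide.
TRADES = [
    {"account": "0234837823", "ssn": "238-23-9876", "asset": "AAPL", "trade": "Sell", "country": "US"},
    {"account": "3947848494", "ssn": "329-44-9847", "asset": "TBT", "trade": "Buy", "country": "EU"},
    {"account": "4848367383", "ssn": "123-56-2345", "asset": "IBM", "trade": "Sell", "country": "EU"},
    {"account": "3485739384", "ssn": "585-11-2345", "asset": "INTC", "trade": "Buy", "country": "US"},
]

# Hypothetical policies: which rows a role may read, and whether SSNs are masked.
POLICIES = {
    "us_broker": {"countries": {"US"}, "mask_ssn": True},
    "eu_broker": {"countries": {"EU"}, "mask_ssn": True},
    "auditor": {"countries": {"US", "EU"}, "mask_ssn": False},
}


def view_for(role, rows, policies=POLICIES):
    """Return the rows `role` may see, with column masking applied."""
    policy = policies.get(role)
    if policy is None:
        return []                                   # unknown role: no access
    visible = []
    for row in rows:
        if row["country"] not in policy["countries"]:
            continue                                # row-level control
        shown = dict(row)                           # never modify the master data
        if policy["mask_ssn"]:
            shown["ssn"] = "XXX-XX-" + row["ssn"][-4:]   # column-level control
        visible.append(shown)
    return visible


us_view = view_for("us_broker", TRADES)
```

Because every role reads the same master rows through `view_for`, there is one copy of the data and one place where the policy lives, which is the contrast with the split-file approach on the previous slide.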

Page 45:

Spark Resources

• Learn Spark
  • Spark Cookbook – by Rishi Yadav
  • O'Reilly Advanced Analytics with Spark eBook (written by Clouderans)
  • Cloudera Developer Blog
  • cloudera.com/spark
• Get Trained
  • Cloudera Spark Training

