Page 1: Data Science

Dr. Ahmet Bulut, [email protected]

Oct 4, 2013, İstanbul

Data Science

Page 2: Data Science

Available, Fault-Tolerant, and Scalable

• High Availability (HA): service availability; can we incur no downtime?

• Fault Tolerance: tolerate failures and recover from them, e.g., software, hardware, and other failures.

• Scalability: going from 1 to 1,000,000,000,000 comfortably.

Page 3: Data Science

Towering for Civilization

Page 4: Data Science

[Diagram: 1 user → Website → App Server → DB]

Page 5: Data Science

[Diagram: 1,000 users → Website → Load Balancer → App Server 1, App Server 2 → DB]

Page 6: Data Science

[Diagram: 1,000 users, same setup; (1) a hardware failure takes out the DB]

Page 7: Data Science

[Diagram: (2) the DB is rebuilt on new hardware; recovery takes 45 mins]

Page 8: Data Science

[Diagram: 1,000,000 users → Website → Load Balancer 1, Load Balancer 2 → App Server 1 … App Server N → MasterDB, with a copy kept on SlaveDB]

Page 9: Data Science

[Diagram: 1,000,000 users, same setup; the master (AnaDB) ships transaction log files to SlaveDB; (1) a hardware failure hits the master]

Page 10: Data Science

[Diagram: 1,000,000 users; (2) SlaveDB is promoted to MasterDB in 2 mins]

Page 11: Data Science

[Diagram: 1,000,000 users; (3) a fresh copy of MasterDB brings up a new SlaveDB in 10 mins; backup is back to normal]

Page 12: Data Science

99.999% = 5.26 mins of downtime in a year!

99.99% = 4.32 mins of downtime in a month!

2 mins?
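
A quick check of those figures: a year has about 365.25 × 24 × 60 ≈ 525,960 minutes, and the 0.001% allowed by five nines is 525,960 × 0.00001 ≈ 5.26 minutes; a 30-day month has 43,200 minutes, and 0.01% of that is 4.32 minutes. So the 2-minute failover above fits a five-nines yearly budget only if it happens at most twice.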

Page 13: Data Science

100,000,000 Users

[Diagram: the big DB server is fronted by a clustered cache spanning the RAM of N machines (RAM 1 … RAM N)]
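
The deck doesn't say how a key finds its cache machine; a minimal sketch of the simplest placement scheme in Scala, with all names (CacheNode, nodes, nodeFor) hypothetical:

// Route each cache key to one of the N cache machines.
case class CacheNode(host: String)

val nodes = Vector(CacheNode("ram-1"), CacheNode("ram-2"), CacheNode("ram-3"))

def nodeFor(key: String): CacheNode = {
  // non-negative bucket even when hashCode is negative
  val bucket = ((key.hashCode % nodes.size) + nodes.size) % nodes.size
  nodes(bucket)
}

nodeFor("user:42")  // the same key always lands on the same node while N is fixed

Simple modulo placement remaps most keys whenever a node joins or leaves; production clustered caches typically use consistent hashing so that only about 1/N of the keys move.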

Page 14: Data Science

100,000,000 Users

[Diagram: clustered cache (RAM 1 … RAM N); a software upgrade rolls through the nodes]

Page 15: Data Science

100,000,000 Users

[Diagram: clustered cache (RAM 1 … RAM N); the software upgrade costs 0 mins of downtime]

Page 16: Data Science

Clustered  Cache

Page 17: Data Science

Clustered  Cache

Page 18: Data Science

Clustered  Cache

Page 19: Data Science

Towering for Civilization

Page 20: Data Science

Distributed  File  System

My  Precious!!!

Page 21: Data Science

No downtime?

[Map: İstanbul, İzmir, Ankara, Baku]

Page 22: Data Science

Army  of  machines  logging

Page 23: Data Science

• Query: Find the most issued web request!

• How would you compute it?

A simple sum over the incoming web requests...
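
As a single-machine sketch of that sum in plain Scala (the request list is hypothetical); the rest of the deck is about what to do when the log no longer fits on one machine:

// Count how often each request was issued and take the most frequent one.
val requests: Seq[String] = Seq(
  "/home", "/search?q=spark", "/home", "/login", "/home"
)

val (topRequest, hits) = requests
  .groupBy(identity)                             // url -> all its occurrences
  .map { case (url, occs) => url -> occs.size }  // url -> count
  .maxBy(_._2)                                   // the most issued request

println(s"$topRequest was issued $hits times")   // /home was issued 3 times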

Page 24: Data Science

What about recommending items?

• Collaborative Filtering.

• Easy, hard, XXL-hard?
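
The deck leaves the method open; as one hint of why this gets "XXL-hard", here is a minimal item-based collaborative-filtering sketch in plain Scala (all data and names hypothetical). Scoring every item against every other item is quadratic in the catalog size, which is exactly what stops scaling:

// Item-based CF sketch: each item is a vector of user ratings; recommend
// the item whose rating vector is most cosine-similar to a liked item.
val ratings: Map[String, Map[String, Double]] = Map(  // item -> (user -> rating)
  "bookA" -> Map("u1" -> 5.0, "u2" -> 3.0),
  "bookB" -> Map("u1" -> 4.0, "u2" -> 3.5),
  "bookC" -> Map("u2" -> 1.0, "u3" -> 4.0)
)

def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
  val dot = a.keySet.intersect(b.keySet).toSeq.map(u => a(u) * b(u)).sum
  def norm(m: Map[String, Double]) = math.sqrt(m.values.map(v => v * v).sum)
  dot / (norm(a) * norm(b))
}

// Naive nearest item to "bookA": an all-pairs scan, O(items^2) overall.
val mostSimilar = ratings
  .filter { case (item, _) => item != "bookA" }
  .map { case (item, vec) => item -> cosine(ratings("bookA"), vec) }
  .maxBy(_._2)._1   // "bookB"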

Page 25: Data Science

Extract, Transform, and Load (ETL)

[Diagram: App Server 1 and App Server 2 feed the DB]

Page 26: Data Science

Extract, Transform, and Load (ETL)

[Diagram: app servers 1, 2, 45, 77, 657, 1099, … all feed the one DB]

Page 27: Data Science

Working with data: small | big | extra big

• Business Operations: DBMS.

• Business Analytics: Data Warehouse.

• I want interactivity... I get Data Cubes!

• I want the most recent news...

• How recent, how often?

• Real time?

• Near real time?

Page 28: Data Science

Sooo?

• Things are looking good, except that we have:

• DON'T-WANT-SO-MANY database objects.

• Database objects such as

• tables,

• indices,

• views,

• logs.

Page 29: Data Science

Ship it!

• The traditional approach has been to ship data to where the queries will be issued.

• The new world order demands that we ship the "compute logic" to where the data is.

Page 30: Data Science

Ship the compute logic

[Diagram: the compute logic is copied out to every app server (App Server 77, …) instead of pulling their data in]

Page 31: Data Science

Map/Reduce  (M/R)  Framework

Page 32: Data Science

What does M/R give me?

• Fine-grained fault tolerance.

• A fine-grained, deterministic task model.

• Multi-tenancy.

• Elasticity.

Page 33: Data Science

M/R-based platforms

• Hadoop.

• Hive, Pig.

• Spark, Shark.

• ... (many others).

Page 34: Data Science

Towering for Civilization

Page 35: Data Science

Spark

[Diagram: Spark = Resilient Distributed Datasets + Parallel Operations]

Page 36: Data Science

Resilient Distributed Dataset (RDD)

• A read-only collection of objects, partitioned across a set of machines, that can be re-built if a partition is lost.

• RDDs can always be re-constructed in the face of node failures.

Page 37: Data Science

Resilient Distributed Dataset (RDD)

• RDDs can be constructed:

• from a file in a DFS, e.g., the Hadoop DFS (HDFS);

• by slicing a collection (an array) into multiple pieces through parallelization;

• by transforming an existing RDD, e.g., an RDD with elements of type A being mapped to an RDD with elements of type B;

• by persisting an existing RDD through the cache and save operations.

Each construction is sketched below.
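
A minimal sketch of those constructions against the classic Spark API; the SparkContext `sc` and the HDFS path are assumed:

// The four ways to obtain an RDD (classic Spark API; `sc` is a SparkContext
// and the HDFS path is a placeholder).
val fromFile  = sc.textFile("hdfs://...")        // from a file in HDFS
val sliced    = sc.parallelize(Seq(1, 2, 3, 4))  // slicing a collection via parallelization
val doubled   = sliced.map(_ * 2)                // transforming an existing RDD
val asStrings = doubled.map(n => s"n=$n")        // elements of type Int mapped to type String (A -> B)
val pinned    = asStrings.cache()                // persisting through cache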

Page 38: Data Science

Parallel Operations

• reduce: combines data elements using an associative function to produce a result at the driver.

• collect: sends all elements of the dataset to the driver.

• foreach: passes each data element through a UDF.

All three are sketched below.
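
Continuing with the same assumed `sc`, the three operations side by side:

// The three parallel operations on a small parallelized dataset.
val nums = sc.parallelize(1 to 10)

val total = nums.reduce(_ + _)     // reduce: associative combine, result at the driver
val all   = nums.collect()         // collect: every element shipped to the driver
nums.foreach(n => println(n))      // foreach: a UDF applied to each element on the workers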

Page 39: Data Science

Spark

• Let's count the lines containing errors in a large log file stored in HDFS:

val file = spark.textFile("hdfs://...")     // RDD of the log's lines
val errs = file.filter(_.contains("ERROR")) // keep only the error lines
val ones = errs.map(_ => 1)                 // a 1 for every error line
val count = ones.reduce(_+_)                // sum the 1s at the driver

Page 40: Data Science

Spark Lineage

val file = spark.textFile("hdfs://...")
val errs = file.filter(_.contains("ERROR"))
val ones = errs.map(_ => 1)
val count = ones.reduce(_+_)

Each RDD remembers the chain of transformations (its lineage) that produced it: a lost partition of ones is recomputed from errs, and errs from file.

Page 41: Data Science

Towering for Civilization

Page 42: Data Science

Shark  Architecture

Page 43: Data Science

SQL Queries

SELECT [GROUP_BY_COLUMN], COUNT(*)
FROM lineitem
GROUP BY [GROUP_BY_COLUMN]

SELECT *
FROM lineitem l JOIN supplier s ON l.L_SUPPKEY = s.S_SUPPKEY
WHERE SOME_UDF(s.S_ADDRESS)

Page 44: Data Science

SQL Queries

• Data size: 2.1 TB.

• Selectivity: 2.5 million distinct groups!

Time: 2.5 mins

Page 45: Data Science

Machine Learning

• Logistic Regression: search for a hyperplane w that best separates two sets of points (e.g., spammers and non-spammers).

• The algorithm applies gradient-descent optimization, starting with a randomized vector w.

• The algorithm updates w iteratively by moving along the gradient towards the optimal w.

Page 46: Data Science

Machine Learning

def logRegress(points: RDD[Point]): Vector = {
  // start from a random hyperplane w
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    // one gradient-descent step, computed in parallel over the points
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

// Shark: build the training set with SQL, then train on the cached RDD
val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")

val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")), ...)
}

val trainedVector = logRegress(features.cache())

Page 47: Data Science

Batch and/or Real-Time Data Processing

Page 48: Data Science

History

Page 49: Data Science

LinkedIn Recommendations

• The core matching algorithm uses (a customized) Lucene.

• Hadoop is used for a variety of needs:

• computing collaborative-filtering features,

• building Lucene indices offline,

• doing quality analysis of recommendations.

• Lucene does not provide fast real-time indexing.

• To keep indices up-to-date, a real-time indexing library on top of Lucene called Zoie is used.

Page 50: Data Science

LinkedIn Recommendations

• Facets are provided to members for drilling down into and exploring recommendation results.

• The faceted-search library is called Bobo.

• For storing features and for caching recommendation results, a key-value store called Voldemort is used.

• For analyzing tracking and reporting data, a distributed messaging system called Kafka is used.

Page 51: Data Science

LinkedIn  Recommenda9ons

•Bobo,  Zoie,  Voldemort  and  Kara  are  developed  at  LinkedIn  and  are  open  sourced.  

•Kara  is  an  apache  incubator  project.

•Historically,  they  used  R  for  model  training.  Now  experimen9ng  with  Mahout  for  model  training.

•All  the  above  technologies,  combined  with  great  engineers  powers  LinkedIn’s  Recommenda9on  plajorm.

Page 52: Data Science

Live and Batch Affair

• Using Hadoop:

1. Take a snapshot of the data (member profiles) in production.

2. Move it to HDFS.

3. Grandfather members with <ADDED-VALUE> in a matter of hours in the cemetery (Hadoop).

4. Copy this data back online for the live servers (Resurrection).

A sketch of this loop follows below.
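
The slide describes the loop in Hadoop terms; purely as an illustration, here is the same shape written against the Spark API used earlier in the deck. The paths and the addValue helper are hypothetical, not LinkedIn's actual code:

// Hypothetical sketch of the snapshot -> enrich -> copy-back loop.
val snapshot = sc.textFile("hdfs://.../member_profiles_snapshot")  // steps 1-2: snapshot landed in HDFS

val enriched = snapshot.map(profile => addValue(profile))          // step 3: grandfather each member

enriched.saveAsTextFile("hdfs://.../member_profiles_enriched")     // step 4: output, to be copied back online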

Page 53: Data Science

Who are we?

• We are Data Scientists.

Page 54: Data Science

Our Culture

• Our work culture relies heavily on Cloud Computing.

• Cloud Computing is a perspective for us, not a technology!

Page 55: Data Science

What do we do?

• Distributed Data Mining.

• Computational Advertising.

• Natural Language Processing.

• Scalable Data Analytics.

• Data Visualization.

• Probabilistic Inference.

Page 56: Data Science

Ongoing projects

• Data Science Team: 3 faculty; 1 doctoral, 6 master's, and 6 undergraduate students.

• Vista Team: me, 2 master's & 4 undergraduate students.

• Türk Telekom-funded project (T2C2): Scalable Analytics.

• Tübitak 1001-funded project: Computational Advertising.

• Tübitak 1005 (submitted): Computational Advertising, NLP.

• Tübitak 1003 (in preparation): Online Learning.

