Date posted: 13-Jan-2015
Uploaded by: ahmet-bulut
Available, Fault-Tolerant, and Scalable
• High Availability (HA): service availability; can we incur no downtime?
• Fault Tolerance: tolerating and recovering from failures, e.g., software, hardware, and others.
• Scalability: going from 1 to 1,000,000,000,000 comfortably.
Towering for Civilization
[Diagram: 1 user → Website → App Server → DB]
[Diagram: 1,000 users → Website → Load Balancer → App Server 1 / App Server 2 → DB]
[Diagram: at 1,000 users the DB suffers a hardware failure; restoring the DB onto new hardware takes 45 mins]
[Diagram: 1,000,000 users → Load Balancer 1 / Load Balancer 2 → App Server 1 … App Server N → MasterDB, with a copy kept on a SlaveDB]
[Diagram: 1,000,000 users; the MasterDB ships transaction log files to the SlaveDB]
[Diagram: 1,000,000 users; the MasterDB suffers a hardware failure; the SlaveDB is promoted to MasterDB in 22 mins]
[Diagram: 1,000,000 users; back to normal: a backup of the MasterDB is copied to a new SlaveDB in 10 mins]
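The master-to-slave replication in the diagrams above can be sketched as log shipping and replay. This is a minimal illustration of the idea, not a real DB API; the `Node` class and its methods are invented for the example.

```python
# Minimal sketch of transaction-log file shipping: the master appends
# every write to a log, the log file is shipped to the slave, and the
# slave replays it in order to reach the same state. A slave that has
# replayed the full log can be promoted to master after a failure.

class Node:
    def __init__(self):
        self.state = {}   # the "database"
        self.log = []     # transaction log

    def write(self, key, value):
        self.log.append((key, value))   # log first (write-ahead)
        self.state[key] = value

    def replay(self, shipped_log):
        # apply a shipped log segment in order
        for key, value in shipped_log:
            self.state[key] = value

master, slave = Node(), Node()
master.write("user:1", "alice")
master.write("user:2", "bob")

# "ship" the log file to the slave and replay it
slave.replay(master.log)
assert slave.state == master.state   # slave has caught up
```

The 22-minute promotion in the diagram is the cost of detecting the failure, replaying any remaining log segments, and redirecting traffic to the promoted node.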
99.999% availability = 5.26 mins of downtime in a year!
99.99% availability = 4.32 mins of downtime in a month!
2 mins?
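The downtime budgets above follow directly from the availability targets; a quick sketch of the arithmetic (assuming a 365.25-day year and a 30-day month, which reproduces the slide's figures):

```python
# Downtime budget implied by an availability target.

def downtime_minutes(availability_pct, period_minutes):
    """Minutes of allowed downtime for a given availability target."""
    return (1 - availability_pct / 100.0) * period_minutes

YEAR_MIN = 365.25 * 24 * 60   # 525,960 minutes in a year
MONTH_MIN = 30 * 24 * 60      # 43,200 minutes in a month

print(round(downtime_minutes(99.999, YEAR_MIN), 2))   # 5.26 mins/year
print(round(downtime_minutes(99.99, MONTH_MIN), 2))   # 4.32 mins/month
```

Against these budgets, the 22-minute promotion and 10-minute copy in the previous diagrams blow the "five nines" budget for a whole year in a single failure.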
[Diagram: 100,000,000 users → Big DB Server fronted by a Clustered Cache spanning RAM 1 … RAM N]
[Diagram: 100,000,000 users; a software upgrade is rolled across the clustered cache nodes (RAM 1 … RAM N) with 0 mins of downtime]
[Diagram: multiple clustered caches]
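One way a clustered cache can spread keys across its RAM nodes, and survive taking one node out for a software upgrade, is consistent hashing. The sketch below is an assumption about the mechanism (the slides do not name one); node names and the virtual-node count are illustrative.

```python
# Consistent hashing sketch for a clustered cache: keys and nodes are
# hashed onto the same ring, and a key lives on the first node
# clockwise from its hash. Removing one node (e.g., for an upgrade)
# only remaps the keys that node owned.

import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=64):
        # each node appears vnodes times on the ring for smoother balance
        self.ring = sorted((h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def node_for(self, key):
        # first virtual node clockwise from the key's hash
        i = bisect.bisect(self.keys, h(key)) % len(self.keys)
        return self.ring[i][1]

ring = ConsistentHashRing([f"RAM {i}" for i in range(1, 6)])
owner = ring.node_for("user:42")   # which cache node holds this key
```

Because only the departing node's keys move, the rest of the cluster keeps serving hits during the upgrade, which is what makes the "0 mins of downtime" in the diagram plausible.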
Towering for Civilization
Distributed File System
My Precious!!!
No downtime?
[Diagram: data replicated across İstanbul, İzmir, Ankara, and Baku]
Army of machines logging
• Query: find the most issued web request!
• How would you compute it?
A simple sum over the incoming web requests...
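The "simple sum" can be sketched over one machine's log as below; with an army of machines, each node would emit its partial counts and the partials would be summed (the Map/Reduce pattern discussed later in the deck). The toy log is illustrative.

```python
# Count each incoming web request and take the most frequent one.

from collections import Counter

log = [
    "GET /home", "GET /profile", "GET /home",
    "POST /login", "GET /home",
]

counts = Counter(log)                       # partial sum on one node
most_issued, n = counts.most_common(1)[0]
print(most_issued, n)   # GET /home 3
```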
What about recommending items?
• Collaborative Filtering.
• Easy, hard, XXL-hard?
Extract Transform and Load (ETL)
[Diagram: many app servers (1, 2, 45, 77, 657, 1099, …) feed their data through ETL into a central DB]
Working with data: small | big | extra big
• Business Operations: DBMS.
• Business Analytics: Data Warehouse.
• I want interactivity... I get Data Cubes!
• I want the most recent news...
• How recent, how often?
• Real time?
• Near real time?
Sooo?
• Things are looking good, except that we have:
• DON'T-WANT-SO-MANY database objects.
• Database objects such as:
• tables,
• indices,
• views,
• logs.
Ship it!
• The traditional approach has been to ship data to where the queries will be issued.
• The new world order demands that we ship “compute logic” to where the data is.
[Diagram: the compute logic is shipped out to each app server (e.g., App Server 77) instead of pulling the data in]
Map/Reduce (M/R) Framework
What does M/R give me?
• Fine-grained fault tolerance.
• Fine-grained deterministic task model.
• Multi-tenancy.
• Elasticity.
M/R based platforms
• Hadoop.
• Hive, Pig.
• Spark, Shark.
• ... (many others).
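To make the M/R model concrete, here is a toy run of its three phases in plain Python. This is a single-machine simulation of the pattern only; the platforms above run the same shape fault-tolerantly across many machines.

```python
# Map/Reduce in miniature: map each record to a (key, value) pair,
# shuffle (group) by key, then reduce each group with an associative
# function -- here, counting web requests.

from collections import defaultdict
from functools import reduce

records = ["GET /home", "GET /profile", "GET /home"]

# map phase: one (key, 1) pair per record
mapped = [(r, 1) for r in records]

# shuffle phase: group values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# reduce phase: associative combine per key
result = {k: reduce(lambda a, b: a + b, vs) for k, vs in groups.items()}
print(result)   # {'GET /home': 2, 'GET /profile': 1}
```

Because the reduce function is associative, partial results can be combined in any grouping and order, which is what lets the framework re-run just a failed task instead of the whole job.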
Towering for Civilization
Resilient Distributed Datasets
Parallel Operations
Spark
Resilient Distributed Dataset (RDD)
• A read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
• RDDs can always be reconstructed in the face of node failures.
Resilient Distributed Datasets
Parallel Operations
• RDDs can be constructed:
• From a file in a DFS, e.g., Hadoop-DFS (HDFS).
• By slicing a collection (an array) into multiple pieces through parallelization.
• By transforming an existing RDD, e.g., an RDD with elements of type A being mapped to an RDD with elements of type B.
• By persisting an existing RDD through cache and save operations.
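The "parallelization" and "transformation" bullets can be simulated in a few lines of plain Python: slice a collection into partitions, then derive a new dataset by mapping each partition. This illustrates the model only; it is not Spark's actual API.

```python
# Slice a collection into roughly equal partitions (parallelization),
# then transform it: an "RDD" of ints mapped to an "RDD" of strings
# (elements of type A becoming elements of type B).

def parallelize(data, num_slices):
    """Split a list into num_slices roughly equal partitions."""
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

partitions = parallelize(list(range(10)), 3)
print(partitions)   # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]

# transformation applied partition by partition, as a worker would
labels = [[f"n{v}" for v in part] for part in partitions]
```

In the real system each partition lives on a different machine and the mapping function is shipped to it, per the "ship the compute logic" principle from earlier.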
Resilient Distributed Dataset (RDD)
Parallel Operations
• reduce: combine data elements using an associative function to produce a result at the driver.
• collect: send all elements of the dataset to the driver.
• foreach: pass each data element through a UDF.
Resilient Distributed Datasets
Parallel Operations
Spark
•Let’s count the lines containing errors in a large log file stored in HDFS:
val file = spark.textFile("hdfs://...")
val errs = file.filter(_.contains("ERROR"))
val ones = errs.map(_ => 1)
val count = ones.reduce(_+_)
Spark Lineage

val file = spark.textFile("hdfs://...")
val errs = file.filter(_.contains("ERROR"))
val ones = errs.map(_ => 1)
val count = ones.reduce(_+_)
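Why does lineage make recovery cheap? Each dataset remembers only its parent and the transformation that produced it, so a lost partition is recomputed rather than restored from a replica. Below is a plain-Python sketch of that idea; the class and method names are invented for illustration and are not Spark's API.

```python
# Lineage in miniature: an "RDD" either holds materialized partitions
# (base data, e.g., read from HDFS) or a (parent, transformation) pair.
# Losing a partition means re-running the chain for just that partition.

class LineageRDD:
    def __init__(self, partitions=None, parent=None, fn=None):
        self.partitions = partitions    # materialized data, or None
        self.parent, self.fn = parent, fn

    def map_partitions(self, fn):
        # record the transformation; compute nothing yet
        return LineageRDD(parent=self, fn=fn)

    def compute(self, i):
        if self.partitions is not None:
            return self.partitions[i]             # base data
        return self.fn(self.parent.compute(i))    # recompute via lineage

base = LineageRDD(partitions=[["INFO a", "ERROR b"], ["ERROR c"]])
errs = base.map_partitions(lambda p: [l for l in p if "ERROR" in l])

# partition 1 is "lost": recompute it from the lineage chain
print(errs.compute(1))   # ['ERROR c']
```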
Towering for Civilization
Shark Architecture
SQL Queries

SELECT [GROUP_BY_COLUMN], COUNT(*)
FROM lineitem
GROUP BY [GROUP_BY_COLUMN]

SELECT *
FROM lineitem l JOIN supplier s ON l.L_SUPPKEY = s.S_SUPPKEY
WHERE SOME_UDF(s.S_ADDRESS)
SQL Queries
• Data Size: 2.1 TB of data.
• Selectivity: 2.5 million distinct groups!
Time: 2.5 mins
Machine Learning
• Logistic Regression: search for a hyperplane w that best separates two sets of points (e.g., spammers and non-spammers).
• The algorithm applies gradient descent optimization, starting with a randomized vector w.
• The algorithm updates w iteratively by moving along gradients towards the optimal w.
Machine Learning

def logRegress(points: RDD[Point]): Vector = {
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")

val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")), ...)
}

val trainedVector = logRegress(features.cache())
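The gradient step in the Scala code above can be traced in plain Python with a 1-D "hyperplane", so the arithmetic is easy to follow. The toy data, seed, and iteration count are illustrative assumptions.

```python
# Same update rule as the slide: for each point, the gradient term is
# (1 / (1 + exp(-y * (w . x))) - 1) * y * x, summed over all points,
# and w moves against the summed gradient each iteration.

import math
import random

random.seed(0)
points = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]  # (x, y), y in {-1, +1}

w = 2 * random.random() - 1       # randomized start, as in the slide
for _ in range(100):              # ITERATIONS
    gradient = sum((1 / (1 + math.exp(-y * (w * x))) - 1) * y * x
                   for x, y in points)
    w -= gradient

# w now separates the two point sets: sign(w * x) matches y
assert all((w * x > 0) == (y > 0) for x, y in points)
```

In Spark the only distributed parts are the map over the points and the associative reduce of the per-point gradient terms; the small vector w lives at the driver, which is why caching the points pays off across iterations.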
Batch and/or Real-Time Data Processing
History
LinkedIn Recommendations
• The core matching algorithm uses Lucene (customized).
• Hadoop is used for a variety of needs:
• Computing collaborative filtering features,
• Building Lucene indices offline,
• Doing quality analysis of recommendations.
• Lucene does not provide fast real-time indexing.
• To keep indices up-to-date, a real-time indexing library on top of Lucene called Zoie is used.
LinkedIn Recommendations
• Facets are provided to members for drilling down and exploring recommendation results.
• The faceted search library is called Bobo.
• For storing features and for caching recommendation results, a key-value store, Voldemort, is used.
• For analyzing tracking and reporting data, a distributed messaging system called Kafka is used.
LinkedIn Recommendations
• Bobo, Zoie, Voldemort, and Kafka were developed at LinkedIn and are open sourced.
• Kafka is an Apache incubator project.
• Historically, they used R for model training. They are now experimenting with Mahout for model training.
• All the above technologies, combined with great engineers, power LinkedIn's recommendation platform.
Live and Batch Affair
• Using Hadoop:
1. Take a snapshot of data (member profiles) in production.
2. Move it to HDFS.
3. Grandfather members with <ADDED-VALUE> in a matter of hours in the cemetery (Hadoop).
4. Copy this data back online for live servers (Resurrection).
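The four steps above can be sketched as a tiny batch pipeline. All names and the derived field are illustrative stand-ins (the actual <ADDED-VALUE> is not specified in the slides).

```python
# Snapshot production data, batch-compute a derived field offline
# ("in the cemetery"), then publish the enriched records back for
# the live servers ("resurrection").

production = {"member1": {"skills": ["java"]},
              "member2": {"skills": ["scala", "spark"]}}

# 1-2. snapshot and move offline (here: a plain deep-ish copy)
snapshot = {k: dict(v) for k, v in production.items()}

# 3. grandfather: compute a derived field for every member in batch
for member, profile in snapshot.items():
    profile["num_skills"] = len(profile["skills"])   # illustrative added value

# 4. resurrection: copy the enriched data back online
production = snapshot
print(production["member2"]["num_skills"])   # 2
```

The point of the pattern is that the expensive, full-population computation never touches the live serving path; the live servers only ever see the finished, enriched copy.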
Who are we?
• We are Data Scientists.
Our Culture
• Our work culture relies heavily on Cloud Computing.
• Cloud Computing is a perspective for us, not a technology!
What do we do?
• Distributed Data Mining.
• Computational Advertising.
• Natural Language Processing.
• Scalable Data Analytics.
• Data Visualization.
• Probabilistic Inference.
Ongoing projects
• Data Science Team: 3 faculty; 1 doctoral, 6 masters, and 6 undergraduate students.
• Vista Team: me, 2 masters & 4 undergraduate students.
• Türk Telekom funded project (T2C2): Scalable Analytics.
• Tübitak 1001 funded project: Computational Advertising.
• Tübitak 1005 (submitted): Computational Advertising, NLP.
• Tübitak 1003 (in preparation): Online Learning.