Date post: 17-Jan-2016
Transcript
Page 1: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

The Challenges of Managing Big Data

Amr El Abbadi

Computer Science, UC Santa Barbara

amr@cs.ucsb.edu

Page 2: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Computational Thinking

Coined by Jeannette Wing (CMU, NSF, Microsoft).

Essential for the productive citizenry of the 21st century.

Computer Science is the critical component in Science, Commerce, Finance, Engineering, Social Science, and even the Humanities.

Our role as educators is challenging but very rewarding.

Need to understand and teach the foundations and essence of computation.

AlAkhawayn 2015 2

Page 3: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

How Big Data Analysis Guides Hurricane Sandy Response

Accurate and effective public health emergency response demands deep understanding of the context and the details of chaotic situations.

http://www.directrelief.org/2012/11/how-big-data-analysis-guides-our-hurricane-sandy-response/

Page 4: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Meet the New Boss: Big Data

Evaluate more candidates, amass more data and peer more deeply into applicants' personal lives and interests.

Allows employers to predict specific outcomes, such as whether a prospective hire will quit too soon, file disability claims, or steal.

• For example, after a half-year trial that cut attrition by a fifth, Xerox now leaves all hiring for its 48,700 call-center jobs to software.

• The model for the ideal call-center worker is a person who lives near the job, has reliable transportation and uses one or more social networks, but not more than four.

• Note that practices that even unintentionally filter out older or minority applicants can be illegal.

• Wall Street Journal, Sept 20, 2012.

Page 5: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

The Big Data Eco-System in the Cloud

[Figure: layers labeled Infrastructure, Inter-data center, and Analysis & Quality]

Page 6: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.


The 3 V’s of Big Data

Page 7: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Big Data in Numbers

Facebook:
◦ 1.5 Billion users
◦ 140.3 Billion friendships

Twitter in a day:
◦ 500 million tweets sent

YouTube in a day:
◦ 3 billion videos viewed

Stats from facebook.com, twitter.com and youtube.com

Page 8: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

• 104+ hours of video uploaded on YouTube
• 42,408+ app downloads
• 153,804+ new photos uploaded on Facebook
• $263,947+ money spent on web shopping
• 298,013+ new Tweets
• 1,881,737+ YouTube video views
• 2,521,244+ search queries on Google
• 2,692,323+ new Facebook Likes
• 20,234,009+ Flickr photo views
• 204,709,030+ emails sent over the Internet

Source: http://whathappensontheinternetin60seconds.com/

Page 9: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.


Page 10: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.


Reality Check: Inside a Data Center

Page 11: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Scaling in the Cloud

[Figure: client sites, a load balancer (proxy), multiple app servers, and a single DATABASE]

Database becomes the Scalability Bottleneck

Cannot leverage elasticity

Page 12: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Scaling in the Cloud

[Figure: client sites, a load balancer (proxy), multiple app servers, and a single DATABASE]

Page 13: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Scaling for Big Data

[Figure: client sites, a load balancer (proxy), multiple app servers, and Key Value Stores]

Scalable and Elastic, but limited consistency and operational flexibility

Page 14: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.


Page 15: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Key Value Stores

Every read or write of a single row is atomic.

Objective: make all operations single-sited.
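The two properties on this slide (atomic single-row operations, single-sited execution) can be sketched as follows. This is a minimal illustration that assumes a hash-partitioned, in-process store; the names `Partition`, `KeyValueStore` and `partition_for` are hypothetical and not taken from any particular system.

```python
import threading
import zlib


class Partition:
    """One storage site: a dict guarded by a lock, so each row operation is atomic."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._rows.get(key)

    def put(self, key, value):
        with self._lock:
            self._rows[key] = value


class KeyValueStore:
    """Hypothetical hash-partitioned store: every operation is routed to one partition."""

    def __init__(self, num_partitions=4):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def partition_for(self, key):
        # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def get(self, key):
        return self.partitions[self.partition_for(key)].get(key)

    def put(self, key, value):
        self.partitions[self.partition_for(key)].put(key, value)


store = KeyValueStore()
store.put("user:42", {"name": "X"})
row = store.get("user:42")
```

Because a key always maps to one partition, no get or put ever touches two sites. An update spanning rows on different partitions would need coordination that this sketch deliberately does not provide.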

Page 16: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Two approaches to scalability

Scale-up
◦ Classical enterprise setting (RDBMS)
◦ Flexible ACID transactions
◦ Transactions in a single node

Scale-out
◦ Cloud friendly (Key value stores)
◦ Execution at a single server
    Limited functionality & guarantees
◦ No multi-row or multi-step transactions

Page 17: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Why Consistency Matters

On-line Social Media needs to be consistent!
◦ New unanticipated applications

The host’s dilemma
◦ Remove “unpopular” friend X as friend
◦ Post “Party Next Friday at YYY”

Page 18: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

What about the Application Programmer?

[Figure labels: Key-value Stores; Transactions and SQL]

Page 19: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

SQL and Scale-out

[Figure: axes labeled Scale Out and SQL Transactions; regions labeled Key Value Stores and RDBMSs]

Page 20: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.


Cloud Elasticity

Page 21: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Challenge: Elasticity in Database tier

[Figure: Load Balancer, Application/Web/Caching tier, Database tier]

Page 22: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

The need for Live Migration: If the database platform was a phone booth…

What the user wants…

What the service provider wants…

Page 23: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.


Catastrophic Failures: Geo-Replication

Page 24: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

“As a result they had no access to email, calendars, or - most importantly - their documents and Office Online applications”

“most of digital communication - email, Lync, Sharepoint - was out”

“Most of the other high-profile companies, including the likes of Amazon, have had substantial outages … cloud services are still in their infancy, and glitches like this are going to happen”

Page 25: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Geo Replication Promises:
◦ Low latency reads
◦ Tolerate failures & data center outages

Page 26: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Geo Replication


Page 27: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Geo Replication


Page 28: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Geo Replication


Page 29: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Communication Overhead

[Figure: numeric link labels, as extracted: 21101 99, 169, 341, 173, 260]

Page 30: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Latency lower-bound [SIGMOD 2015]

[Figure: Replica 1 in Datacenter A and Replica 2 in Datacenter B, connected by a wide-area link; Transaction T1 and Transaction T2; Round Trip Latency Delay]

Commit latency of T1 + Commit latency of T2 must be greater than or equal to the Round-Trip Time between the two datacenters
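The lower bound on this slide can be read as a shared latency budget: whatever commit latency T1 gives up, T2 must absorb. A small sketch of that arithmetic follows; the function names and the 100 ms round-trip time are illustrative assumptions, not measurements from the talk.

```python
def min_commit_latency_t2(rtt_ms, commit_latency_t1_ms):
    """Smallest commit latency T2 can have, given T1's commit latency and the
    round-trip time between the datacenters, under latency(T1) + latency(T2) >= RTT."""
    return max(0.0, rtt_ms - commit_latency_t1_ms)


def satisfies_bound(latency_t1_ms, latency_t2_ms, rtt_ms):
    """True if the pair of commit latencies respects the lower bound."""
    return latency_t1_ms + latency_t2_ms >= rtt_ms


rtt = 100.0  # hypothetical round-trip time in milliseconds
t2_latency = min_commit_latency_t2(rtt, 30.0)  # T1 commits in 30 ms
```

With these assumed numbers, letting T1 commit in 30 ms forces T2 to take at least 70 ms; no protocol can make both commit faster than the bound allows.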

Page 31: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Fall of ‘One Size Fits all’

Then we have the challenge of VARIETY!

Diverse access methods:
◦ OLTP
◦ OLAP
◦ Graph

Page 32: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Replication Driven Solution

Leverage Replication by storing replicas in different representations

[Figure: OLTP Client, OLAP Client and Graph Client send requests to an Execution Engine backed by Row, Column and Graph replicas]
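A toy sketch of the idea of keeping replicas in different representations: each write is applied to a row replica, a column replica and a graph replica, and each kind of client reads from the replica that suits it. The class `MultiRepresentationStore` and its methods are hypothetical; in a real deployment the replicas would live on separate servers and be kept in sync by a replication protocol, which is omitted here.

```python
from collections import defaultdict


class MultiRepresentationStore:
    """Hypothetical sketch: one logical table kept as row, column, and graph replicas."""

    def __init__(self):
        self.rows = {}                      # row replica: id -> record (OLTP point access)
        self.columns = defaultdict(dict)    # column replica: field -> {id: value} (OLAP scans)
        self.graph = defaultdict(set)       # graph replica: id -> neighbor ids (traversals)

    def insert(self, rec_id, record, neighbors=()):
        # A write is applied to every replica so they all describe the same data.
        self.rows[rec_id] = dict(record)
        for field, value in record.items():
            self.columns[field][rec_id] = value
        for other in neighbors:
            self.graph[rec_id].add(other)
            self.graph[other].add(rec_id)

    def lookup(self, rec_id):               # OLTP client -> row replica
        return self.rows.get(rec_id)

    def column_sum(self, field):            # OLAP client -> column replica
        return sum(self.columns[field].values())

    def neighbors(self, rec_id):            # graph client -> graph replica
        return set(self.graph[rec_id])


store = MultiRepresentationStore()
store.insert("alice", {"age": 30, "posts": 12})
store.insert("bob", {"age": 25, "posts": 3}, neighbors=["alice"])
```

Here `store.column_sum("posts")` scans a single column without touching whole records, while `store.lookup("bob")` returns one full record, which is the trade-off the different representations are meant to exploit.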

Page 33: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

BIG Data Analytics Needs

Large scale data processing is difficult!
◦ Managing hundreds or thousands of processors
◦ Managing parallelization and distribution
◦ I/O Scheduling
◦ Status and monitoring
◦ Fault/crash tolerance

Page 34: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

MapReduce to the Rescue?

Overview:
◦ Data-parallel programming model
◦ An associated parallel and distributed implementation for commodity clusters

Pioneered by Google
◦ Processes 20 PB of data per day

Popularized by open-source Hadoop project
◦ Used by Yahoo!, Facebook, Amazon, and the list is growing …

and now there is SPARK…

Page 35: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

MapReduce (Hadoop) Model

Raw Input: <key, value>

MAP

<K1, V1> <K2, V2> <K3, V3>

REDUCE

Execution Model:
◦ Data splits
◦ Map phase
◦ Intermediate data sort, partition and shuffling
◦ Reduce phase
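The execution model above can be sketched in a single process with the usual word-count example. This is an illustration of the programming model only, not Hadoop's API; the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are made up for this sketch, and a real framework would run the map and reduce calls in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit an intermediate <word, 1> pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Sort and group intermediate pairs by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share a key into one output pair."""
    return key, sum(values)


def run_job(splits):
    """Run map over every split, shuffle the intermediate pairs, then reduce per key."""
    intermediate = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(intermediate).items())


counts = run_job(["big data big challenges", "data in the cloud"])
```

The programmer writes only `map_phase` and `reduce_phase`; splitting, shuffling, scheduling and fault tolerance are the framework's job, which is what makes the model attractive for the difficulties listed earlier.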

Page 36: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

How Target Figured Out A Teen Girl Was Pregnant Before Her Dad Did

Stores everything customers bought + demographic information.

Crawl through the data, assign each shopper a “pregnancy prediction” score + estimate due date

Send coupons timed to specific stages of pregnancy.

Jenny is 23, lives in Atlanta, in March bought cocoa-butter lotion, a large purse, zinc and magnesium supplements and a bright blue rug

87% chance that Jenny’s pregnant and her delivery date is sometime in late August.

http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/

Page 37: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Privacy in the Cloud

[Figure: User sends Data and Query to Cloud Servers and receives Answer]

• Data confidentiality
  – Attacks
    • Unauthorized accesses, side channel attacks
  – Solutions
    • Encryption, querying encrypted data
    • Trusted computing

• Access privacy
  – Attacks
    • Inferences on access patterns or query results
  – Solutions
    • Private information retrieval
    • Query obfuscation
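To make private information retrieval concrete, here is a toy version of the classic two-server XOR scheme. The slides do not say which PIR scheme is meant, so this choice is an assumption; it also assumes two servers that hold identical copies of a bit array and do not collude. All function names are hypothetical.

```python
import secrets


def server_answer(database_bits, indices):
    """Each server XORs together the bits at the requested positions."""
    answer = 0
    for i in indices:
        answer ^= database_bits[i]
    return answer


def make_queries(n, wanted):
    """Client: a random subset for server 1, and the same subset with `wanted` toggled for server 2."""
    q1 = {i for i in range(n) if secrets.randbits(1)}
    q2 = q1 ^ {wanted}  # symmetric difference toggles membership of the wanted index
    return q1, q2


def private_read(database_bits, wanted):
    """Recover one bit without either server learning which index was wanted."""
    q1, q2 = make_queries(len(database_bits), wanted)
    # Each server sees only one uniformly random subset of indices.
    a1 = server_answer(database_bits, q1)
    a2 = server_answer(database_bits, q2)
    return a1 ^ a2  # every index except `wanted` appears in both or neither and cancels


db = [1, 0, 1, 1, 0, 0, 1, 0]
recovered = [private_read(db, i) for i in range(len(db))]
```

Each query on its own is a uniformly random subset, so the access pattern seen by either server reveals nothing about the wanted index; the cost is that the client sends a query as large as the database's index space.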

Page 38: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.


The Internet and how we use it

“Nearly two-thirds of Internet users worldwide use some type of social media” (McCafferty, CACM 2012)

• “The internet's largest impact comes in connecting people to other people for advice or sharing valuable experiences. For about one-third (34%) of those who used the … social networks was part of the decision-making dynamic.” (Horrigan et al. 2006)

Page 39: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.


Diffusion of Information

Diffusion (a.k.a. cascade, spread) in social networks: Does it happen?

• Mass Convergence and Emergency Events (Hughes et al. 2009, Sakaki et al. 2010)

• Education through Social Networks (Cheong et al. 2010)

• Collective Action


Page 40: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Time Critical Social Mobilization

The 10 red balloons DARPA Challenge.

MIT winning team found the 10 balloons in 8 hours and 52 minutes using incentivized diffusion in a social network (Twitter). Science Vol 334, 28 Oct 2011.

Page 41: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Are we missing something?

[Figure: two scenarios in which users post “Morocco is the Best!”; one labeled “Dispersed interest in the topic”, the other “Interest from structural subgroup”]

Traditional trend detection fails to capture the difference between the two scenarios

Page 42: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Coordinated vs Uncoordinated Trends

Consider two traditionally similar hashtags, #pawpawty and #mafiawars (using Prefuse):

• #pawpawty: traditional rank 289; significant as a coordinated trend (rank 24)
• #mafiawars: traditional rank 212; insignificant as a coordinated trend (rank 25812)

[Figure label: Disconnected nodes]

#pawpawty is a hashtag used by animal rights defenders, while #mafiawars is used by gamers. One might entail more of a community formation…

Next question: Are topics of certain categorical nature more (or less) important as structural trends? Yes, political hashtags tend to be more significant as structural trends, while hashtags relating to gaming, music etc. are more significant as uncoordinated trends.

Page 43: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

GeoScope

Reporting diffusion: Geo-correlated trend detection:
◦ Provides high level information about topics and locations
◦ Detects important location-topic pairs in a sliding window

What is the popularity of #ff in Morocco?

Which topics are of interest particularly to Ifrane today?
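The two questions on this slide can be answered, in the simplest possible way, by counting location-topic pairs over a sliding window. The slides do not describe GeoScope's actual algorithm, so the exact-counting class below (`SlidingPairCounter`, a hypothetical name) is only a stand-in that shows the shape of the queries.

```python
from collections import Counter, deque


class SlidingPairCounter:
    """Hypothetical sketch: exact counts of (location, topic) pairs over the last `window` posts."""

    def __init__(self, window):
        self.window = window
        self.recent = deque()
        self.counts = Counter()

    def add(self, location, topic):
        pair = (location, topic)
        self.recent.append(pair)
        self.counts[pair] += 1
        if len(self.recent) > self.window:
            # Evict the oldest post so the counts only reflect the current window.
            old = self.recent.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def popularity(self, location, topic):
        """Share of the location's posts in the window that mention the topic."""
        total = sum(c for (loc, _), c in self.counts.items() if loc == location)
        return self.counts[(location, topic)] / total if total else 0.0

    def top_topics(self, location, k=3):
        """Most frequent topics for one location in the current window."""
        local = Counter({t: c for (loc, t), c in self.counts.items() if loc == location})
        return [topic for topic, _ in local.most_common(k)]


counter = SlidingPairCounter(window=4)
for loc, topic in [("Morocco", "#ff"), ("Morocco", "#ff"), ("Morocco", "#news"),
                   ("Jakarta", "#music"), ("Morocco", "#ff")]:
    counter.add(loc, topic)
```

Exact counting keeps one entry per distinct pair in the window, which would not scale to a full social media stream; a production system would presumably need bounded-memory summaries instead.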

Page 44: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Cities

Users of certain cities like Jakarta have diverse interests

While other cities like Cairo are interested in more local topics

Page 45: Amr El Abbadi Computer Science, UC Santa Barbara amr@cs.ucsb.edu.

Conclusions

◦ Computational Thinking
◦ Data
◦ Cross-disciplinary
◦ Globalization
◦ Privacy and Societal issues
◦ Energy Efficiency

