Juliana Freire PPT

Post on 15-Jun-2015


Exploring Big and not so Big Data: Opportunities and Challenges

Juliana Freire juliana.freire@nyu.edu

Visualization and Data Analysis (ViDA) Center http://bigdata.poly.edu

NYU Poly

2 ViDA Center Juliana Freire

Big Data: What is the Big deal?

http://www.google.com/trends/explore#q=%22big%20data%22


Big Data: What is the Big deal?

  Many success stories
    –  Google: many billions of pages indexed, products, structured data
    –  Facebook: 1.1 billion users using the site each month
    –  Twitter: 517 million accounts, 250 million tweets/day
  This is changing society!


Big Data: What is the Big deal?

  Smart Cities: 50% of the world population lives in cities
    –  Census, crime, emergency visits, cabs, public transportation, real estate, noise, energy, …
    –  Make cities more efficient and sustainable, and improve the lives of their citizens
       http://www.nyu.edu/about/university-initiatives/center-for-urban-science-progress.html
  Enable scientific discoveries: science is now data rich
    –  Petabytes of data generated each day, e.g., Australian radio telescopes, Large Hadron Collider
    –  Social data, e.g., Facebook, Twitter (2,380,000 and 2,880,000 results in Google Scholar!)
  Data is currency



Big Data: What is the Big deal?

  Big data is not new: financial transactions, call detail records, astronomy, …

  What is new is that there are many more data enthusiasts

  More data are widely available, e.g., Web, data.gov, scientific data

  Computing is cheap and easy to access
    –  Server with 64 cores, 512GB RAM ~$11k
    –  Cluster with 1000 cores ~$150k
    –  Pay as you go: Amazon EC2

[Figure: data volumes and % IT investment plotted against rank for scientific fields (Astronomy, Geosciences, Chemistry, Microbiology, Social Sciences, Physics, Medicine), for 2010 and 2020. Plot from Howe and Halperin, DEB 2012]


Big Data: What is the Big deal?

  Big data is not new: financial transactions, call detail records, astronomy, …

  What is new is that there are many more data enthusiasts

  More data are widely available, e.g., Web, data.gov, scientific data, social and urban data

  Computing is cheap and easy to access
    –  Server with 64 cores, 512GB RAM ~$11k
    –  Cluster with 1000 cores ~$150k
    –  Pay as you go: Amazon EC2


Big Data: What is hard?

  Scalability is not the problem…
  Usability is the Big issue

[Diagram labels: data, knowledge; math, statistics, algorithms, machine learning; user interfaces, visual encodings, interaction modes; technology, data management, provenance]

Exploring data is hard, regardless of whether the data is big or small


Case Study: Studying Cab Trips in NYC

Prepare data for analysis
  Raw data for 2011: 63 GB
    –  24 csv files, 2 for each month: one for trip data and another for fare data
    –  ~170M trips
  Cleaning
    –  ~60,000 fare records do not have trip records
    –  ~200 duplicates per month
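The two cleaning steps above (dropping duplicate records and finding fare records with no matching trip) can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in the talk: the key columns `medallion` and `pickup_time`, the helper names, and the sample rows are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical join key; the real trip and fare files may use other columns.
KEY = ("medallion", "pickup_time")

def read_rows(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen, unique = set(), []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique

def split_fares(trips, fares):
    """Separate fare rows into those with and without a matching trip."""
    trip_keys = {tuple(t[k] for k in KEY) for t in trips}
    matched = [f for f in fares if tuple(f[k] for k in KEY) in trip_keys]
    orphans = [f for f in fares if tuple(f[k] for k in KEY) not in trip_keys]
    return matched, orphans

# Made-up sample data, for illustration only.
TRIPS_CSV = """medallion,pickup_time,passengers
A1,2011-01-01 00:05:00,1
A1,2011-01-01 00:05:00,1
B2,2011-01-01 00:10:00,2
"""
FARES_CSV = """medallion,pickup_time,fare
A1,2011-01-01 00:05:00,7.5
B2,2011-01-01 00:10:00,12.0
C3,2011-01-01 00:20:00,9.0
"""

trips = drop_duplicates(read_rows(TRIPS_CSV))
matched, orphans = split_fares(trips, read_rows(FARES_CSV))
print(len(trips), len(matched), len(orphans))
```

At the scale of the real data (~170M trips), the same logic would likely need to stream the files or run inside a database rather than hold everything in Python lists.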


Storage Solutions: Temporal Queries

  SQLite
    –  20 GB of storage (index on pickup_time)
    –  Ordered queries: 9.39s
    –  Reverse ordered queries: 9.41s
    –  Shuffled queries: 9.37s
  Custom storage
    –  12 GB of storage (in-memory binary search instead of index)
    –  Ordered queries: 0.6s
    –  Reverse ordered queries: 1.4s
    –  Shuffled queries: 1.2s
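The idea behind "in-memory binary search instead of index" can be illustrated with Python's standard `bisect` module: keep the records sorted by pickup time and answer a time-range query with two binary searches. This is a minimal sketch under assumed names (`build_store`, `trips_between`) and made-up records; the slides do not describe the actual storage layout.

```python
from bisect import bisect_left, bisect_right

def build_store(trips):
    """Sort trips by pickup time and keep the sort keys in a parallel list."""
    ordered = sorted(trips, key=lambda t: t[0])
    keys = [t[0] for t in ordered]
    return keys, ordered

def trips_between(store, start, end):
    """Return trips with start <= pickup_time <= end using two binary searches."""
    keys, ordered = store
    lo = bisect_left(keys, start)   # first position with key >= start
    hi = bisect_right(keys, end)    # first position with key > end
    return ordered[lo:hi]

# Made-up (pickup_time_in_seconds, taxi_id) records, for illustration only.
sample = [(300, "B2"), (100, "A1"), (200, "A1"), (400, "C3"), (200, "B2")]
store = build_store(sample)
hits = trips_between(store, 150, 300)
print(hits)
```

Each query costs two O(log n) searches plus the size of the result, and the only extra storage is the sorted key list.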


Storage Solutions: Spatial-Temporal

  All trips for a week in a given region
  All trips in a week for a given taxi
  All trips in a week for a given taxi in a given region

Needs a complex indexing scheme that combines spatial, temporal, and taxi id searches


Storage Solutions: Spatial-Temporal

  SQLite
    –  20+10 GB of storage (index on time and id, r-tree for coordinates)
    –  Creating indexes: 52hrs
    –  Range queries: 2.1s
    –  Combined queries: 15.3s
    –  Cross-table queries: 57s
  Custom storage (ours)
    –  12+4 GB of storage (using (4d) kd-tree on time, id and coordinates)
    –  Building kd-tree: 8 mins
    –  Range queries: 0.2s
    –  Combined queries: 0.2s
    –  Cross-table queries: 2s
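To make the kd-tree idea concrete, here is a small textbook-style sketch of a 4-dimensional kd-tree with an axis-aligned range query. It is not the implementation from the talk: the point layout `(time, taxi_id, x, y)`, the numeric taxi ids, the function names, and the sample points are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

# Assumed point layout: (time, taxi_id, x, y), all numeric.
Point = Tuple[float, float, float, float]

class Node:
    def __init__(self, point, axis, left, right):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Build a kd-tree by splitting on the median, cycling through the 4 axes."""
    if not points:
        return None
    axis = depth % 4
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

def range_query(node, lo, hi, out=None):
    """Collect points p with lo[i] <= p[i] <= hi[i] on every axis."""
    if out is None:
        out = []
    if node is None:
        return out
    if all(l <= v <= h for l, v, h in zip(lo, node.point, hi)):
        out.append(node.point)
    a = node.axis
    # Left subtree holds values <= the split value, right holds values >= it,
    # so a subtree is visited only if the query box can overlap it.
    if lo[a] <= node.point[a]:
        range_query(node.left, lo, hi, out)
    if hi[a] >= node.point[a]:
        range_query(node.right, lo, hi, out)
    return out

INF = float("inf")
# Made-up trips, for illustration only.
trips = [
    (10.0, 1, 0.2, 0.3),
    (20.0, 2, 0.8, 0.9),
    (30.0, 1, 0.25, 0.35),
    (40.0, 1, 0.9, 0.1),
    (700.0, 1, 0.2, 0.3),
]
tree = build(trips)

# Taxi 1, time window [0, 100], region x and y in [0, 0.5].
in_region = sorted(range_query(tree, (0, 1, 0.0, 0.0), (100, 1, 0.5, 0.5)))
# Taxi 1, time window [0, 100], any location.
any_region = sorted(range_query(tree, (0, 1, -INF, -INF), (100, 1, INF, INF)))
print(in_region, any_region)
```

A single structure answers all three query types listed earlier (by region, by taxi, or both), because leaving a dimension unbounded simply widens the query box on that axis.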


Summary Statistics

  13,237 Medallion Cabs
  42,000 Taxi Drivers
  Average Number of Rides: 485k/day
  Average Number of Passengers: 660k/day

Analysis/Modeling

[Figure: Rides in 2011, by day from January to December; labeled values 29k and 590k; annotated dates: Apr 2, Apr 3, Aug 28 (Irene), Dec 25]
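Daily ride counts like those in the 2011 chart can be computed by grouping pickup timestamps by calendar day, and days that stand out can then be flagged against a typical value. The sketch below assumes a timestamp format, uses the median as the "typical" count, and picks arbitrary thresholds (0.5x and 1.5x); none of these come from the talk, and the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime
from statistics import median

def rides_per_day(pickup_times):
    """Count rides per calendar day from 'YYYY-MM-DD HH:MM:SS' strings."""
    return Counter(datetime.strptime(t, "%Y-%m-%d %H:%M:%S").date()
                   for t in pickup_times)

def unusual_days(counts, low=0.5, high=1.5):
    """Return days whose count falls outside [low, high] times the median."""
    typical = median(counts.values())
    return sorted(day for day, n in counts.items()
                  if n < low * typical or n > high * typical)

# Made-up timestamps, for illustration only: three normal days, one quiet day.
sample = (["2011-08-26 09:00:00"] * 4 + ["2011-08-27 09:00:00"] * 4
          + ["2011-08-28 09:00:00"] * 1 + ["2011-08-29 09:00:00"] * 4)
counts = rides_per_day(sample)
flagged = unusual_days(counts)
print(flagged)
```

A flagged day is only a starting point: explaining it (a storm, a holiday, bad data) still requires looking at the data, which is where visual exploration comes in.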


Rides per Hour, June 2011

[Figure: rides per hour in June 2011, between 5k and 35k rides/hour, with midnight (0h) marked for each day. Annotations: "Rides at Midnight", "Night Life!", "Weekly Patterns"]

Analysis/Modeling


TLCVis


Drop-offs vs. Pickups

[Figure: two map panels, "Drop-off" and "Pickup"]

Most drop-offs occur on the avenues, while most pickups occur on the streets


Studying Anomalies

8:00AM-8:30AM 6:00AM-6:30AM 4:00AM-4:30AM

Sunday, May 1st 2011




Studying Anomalies

8:00AM-8:30AM 9:30AM-10:00AM Sunday, May 1st 2011

Interpretation: Five Borough Bike Tour


Studying Anomalies

Sunday May 1st 2011

07:00AM-08:00AM


Studying Anomalies

Sunday May 1st 2011

08:00AM-10:00AM


Studying Anomalies

Sunday May 1st 2011

10:00AM-11:00AM


Studying Patterns

May 1st – May 7th 2011

3.6 Million Trips

Compare movement in the airports against the large train stations


Studying Patterns

May 1st – May 7th 2011

3.6 Million Trips

Train Stations Airports



Data exploration reveals bad data…


Uses of Clean Data: FindMeACab App


Take Away

  Data exploration is challenging for both small and big data
  It is hard to prepare data for exploration
  For many tasks, existing tools are too cumbersome, not scalable, etc.
  Need better, usable tools
    –  Tools for data enthusiasts who are not computer scientists!
  Visualization is essential for exploring large volumes of data: “A picture is worth a thousand words”
  Pictures help us think [Tamara Munzner]
    –  Substitute perception for cognition
    –  Free up limited cognitive/memory resources for higher-level problems


Masters in Big Data

  New degree at NYU Poly (Spring 2014)
  Courses:
    –  Machine learning
    –  Massive data analysis
    –  Visualization
    –  Visual Analytics
    –  Database Systems
    –  Algorithms
    –  …

Thanks