Exploring Big and not so Big Data: Opportunities and Challenges
Juliana Freire
Visualization and Data Analysis (ViDA) Center http://bigdata.poly.edu
NYU Poly
2 ViDA Center Juliana Freire
Big Data: What is the Big deal?
http://www.google.com/trends/explore#q=%22big%20data%22!
Big Data: What is the Big deal?
Many success stories
– Google: many billions of pages indexed, products, structured data
– Facebook: 1.1 billion users using the site each month
– Twitter: 517 million accounts, 250 million tweets/day
This is changing society!
Big Data: What is the Big deal?
Smart Cities: 50% of the world population lives in cities
– Census, crime, emergency visits, cabs, public transportation, real estate, noise, energy, …
– Make cities more efficient and sustainable, and improve the lives of their citizens
http://www.nyu.edu/about/university-initiatives/center-for-urban-science-progress.html
Enable scientific discoveries: science is now data rich
– Petabytes of data generated each day, e.g., Australian radio telescopes, Large Hadron Collider
– Social data, e.g., Facebook, Twitter (2,380,000 and 2,880,000 results in Google Scholar!)
Data is currency
Big Data: What is the Big deal?
Big data is not new: financial transactions, call detail records, astronomy, …
What is new is that there are many more data enthusiasts
More data are widely available, e.g., Web, data.gov, scientific data, social and urban data
Computing is cheap and easy to access
– Server with 64 cores, 512GB RAM ~$11k
– Cluster with 1000 cores ~$150k
– Pay as you go: Amazon EC2
[Figure: data volumes and % IT investment by rank, 2010 and 2020, for Astronomy, Geosciences, Chemistry, Microbiology, Social Sciences, Physics, Medicine. Plot from Howe and Halperin, DEB 2012]
Big Data: What is hard?
Scalability is not the problem… Usability is the Big issue
[Diagram labels: data, knowledge, statistics, algorithms, machine learning, math, user interfaces, data visual encodings, interaction modes, technology, data management, provenance]
Exploring data is hard, regardless of whether the data is big or small
Case Study: Studying Cab Trips in NYC
Prepare data for analysis
Raw data for 2011: 63 GB
– 24 CSV files, 2 per month: one for trip data and another for fare data
– ~170M trips
Cleaning
– ~60,000 fare records do not have trip records
– ~200 duplicates per month
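The cleaning step above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the case study: the join key (`medallion`, `pickup_datetime`) and the toy rows are assumptions, since the slides do not say how fare records were matched to trip records or how duplicates were defined.

```python
def clean(trips, fares, key_fields=("medallion", "pickup_datetime")):
    """Drop duplicate rows and separate fare rows that have no matching trip.

    `trips` and `fares` are lists of dicts (e.g. rows from csv.DictReader).
    `key_fields` is a hypothetical join key, not taken from the slides.
    """
    def key(row):
        return tuple(row[f] for f in key_fields)

    def dedupe(rows):
        # Keep the first row seen for each key, preserving input order.
        seen, unique = set(), []
        for row in rows:
            k = key(row)
            if k not in seen:
                seen.add(k)
                unique.append(row)
        return unique

    trips = dedupe(trips)
    fares = dedupe(fares)
    trip_keys = {key(t) for t in trips}
    matched = [f for f in fares if key(f) in trip_keys]
    orphans = [f for f in fares if key(f) not in trip_keys]
    return trips, matched, orphans


# Toy rows standing in for one month of trip and fare data.
trips = [
    {"medallion": "A1", "pickup_datetime": "2011-01-01 00:05:00"},
    {"medallion": "A1", "pickup_datetime": "2011-01-01 00:05:00"},  # duplicate
    {"medallion": "B2", "pickup_datetime": "2011-01-01 00:09:00"},
]
fares = [
    {"medallion": "A1", "pickup_datetime": "2011-01-01 00:05:00", "total": "9.50"},
    {"medallion": "C3", "pickup_datetime": "2011-01-01 00:11:00", "total": "7.00"},  # no trip
]
clean_trips, clean_fares, orphan_fares = clean(trips, fares)
```

At the scale quoted on the slide (~170M trips), the key set would likely need to be built per month or stored on disk rather than held in one Python set.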
Storage Solutions: Temporal Queries
SQLite
– 20 GB of storage (index on pickup_time)
– Ordered queries: 9.39s
– Reverse ordered queries: 9.41s
– Shuffled queries: 9.37s
Custom storage
– 12 GB of storage (in-memory binary search instead of index)
– Ordered queries: 0.6s
– Reverse ordered queries: 1.4s
– Shuffled queries: 1.2s
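The "in-memory binary search instead of index" idea can be illustrated with a short sketch. It assumes pickup times are held in a sorted in-memory array of integers; the function name and the toy values are hypothetical, and the actual custom storage layout is not described in the slides.

```python
from bisect import bisect_left, bisect_right


def trips_in_window(pickup_times, start, end):
    """Return the index range [lo, hi) of trips with start <= pickup_time <= end.

    `pickup_times` must be sorted in ascending order.
    """
    lo = bisect_left(pickup_times, start)   # first index with time >= start
    hi = bisect_right(pickup_times, end)    # first index with time > end
    return lo, hi


# Toy data: pickup times as seconds since some epoch, already sorted.
pickup_times = [10, 20, 20, 35, 50, 80]
lo, hi = trips_in_window(pickup_times, 20, 50)
window = pickup_times[lo:hi]
```

Each query costs two binary searches, and the matching trips form one contiguous slice of the array.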
Storage Solutions: Spatial-Temporal
All trips for a week in a given region
All trips in a week for a given taxi
All trips in a week for a given taxi in a given region
Needs a complex indexing scheme that combines spatial, temporal, and taxi id searches
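As a rough illustration, the three query types can be written against SQLite using Python's standard `sqlite3` module. The table layout, column names, and toy rows are assumptions made for this sketch, and it uses only a plain composite index on time and taxi id, so the spatial predicate is not accelerated by any spatial index here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per trip, pickup location only.
conn.execute(
    "CREATE TABLE trips (taxi_id TEXT, pickup_time INTEGER, lon REAL, lat REAL)"
)
conn.execute("CREATE INDEX idx_time_id ON trips (pickup_time, taxi_id)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?, ?)",
    [
        ("A1", 100, -73.98, 40.75),
        ("A1", 200, -73.80, 40.65),
        ("B2", 150, -73.99, 40.76),
        ("B2", 900, -73.98, 40.75),
    ],
)

WEEK = (0, 500)                          # time window (toy units)
REGION = (-74.00, -73.95, 40.70, 40.80)  # lon_min, lon_max, lat_min, lat_max

TIME = "pickup_time BETWEEN ? AND ?"
SPACE = "lon BETWEEN ? AND ? AND lat BETWEEN ? AND ?"
SELECT = "SELECT taxi_id, pickup_time FROM trips WHERE "
ORDER = " ORDER BY pickup_time"

# 1. All trips for a week in a given region.
in_region = conn.execute(
    SELECT + TIME + " AND " + SPACE + ORDER, WEEK + REGION
).fetchall()

# 2. All trips in a week for a given taxi.
for_taxi = conn.execute(
    SELECT + TIME + " AND taxi_id = ?" + ORDER, WEEK + ("A1",)
).fetchall()

# 3. All trips in a week for a given taxi in a given region.
for_taxi_in_region = conn.execute(
    SELECT + TIME + " AND taxi_id = ? AND " + SPACE + ORDER,
    WEEK + ("A1",) + REGION,
).fetchall()
```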
Storage Solutions: Spatial-Temporal
SQLite
– 20+10 GB of storage (index on time and id, r-tree for coordinates)
– Creating indexes: 52hrs
– Range queries: 2.1s
– Combined queries: 15.3s
– Cross-table queries: 57s
Custom storage (ours)
– 12+4 GB of storage (using (4d) kd-tree on time, id and coordinates)
– Building kd-tree: 8 mins
– Range queries: 0.2s
– Combined queries: 0.2s
– Cross-table queries: 2s
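A 4-d kd-tree over (time, id, coordinates) lets one structure answer temporal, taxi, and spatial constraints in a single range query. The sketch below is a textbook kd-tree written for illustration, not the custom storage measured above; the tuple layout `(time, taxi_id, lon, lat)`, the integer taxi ids, and the toy points are assumptions.

```python
def build(points, depth=0):
    """Build a kd-tree as nested tuples (point, left, right), or None if empty."""
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (
        points[mid],
        build(points[:mid], depth + 1),
        build(points[mid + 1:], depth + 1),
    )


def range_query(node, lo, hi, depth=0, out=None):
    """Collect points p with lo[i] <= p[i] <= hi[i] on every axis."""
    if out is None:
        out = []
    if node is None:
        return out
    point, left, right = node
    axis = depth % len(point)
    if all(l <= v <= h for l, v, h in zip(lo, point, hi)):
        out.append(point)
    # Descend only into subtrees that can overlap the query box.
    if lo[axis] <= point[axis]:
        range_query(left, lo, hi, depth + 1, out)
    if point[axis] <= hi[axis]:
        range_query(right, lo, hi, depth + 1, out)
    return out


# Toy points: (pickup_time, taxi_id, lon, lat).
points = [
    (100, 1, -73.98, 40.75),
    (200, 1, -73.80, 40.65),
    (150, 2, -73.99, 40.76),
    (900, 2, -73.98, 40.75),
]
tree = build(points)
INF = float("inf")

# Any taxi, one time window, one region.
in_region = sorted(
    range_query(tree, (0, -INF, -74.00, 40.70), (500, INF, -73.95, 40.80))
)
# Same window and region, restricted to taxi 1.
taxi_in_region = range_query(tree, (0, 1, -74.00, 40.70), (500, 1, -73.95, 40.80))
```

Leaving a dimension unbounded (here with `-INF` and `INF`) turns the combined query into a pure temporal or spatial one without needing a second index.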
Summary Statistics
13,237 Medallion Cabs
42,000 Taxi Drivers
Average Number of Rides: 485k/day
Average Number of Passengers: 660k/day
Analysis/Modeling
[Figure: Rides in 2011, January to December. Annotated dates: Aug 28 (Irene), Apr 2, Apr 3, Dec 25. Labeled values: 29k and 590k]
Rides per Hour June 2011
Between 5k and 35k rides/hour
[Figure: rides per hour over June 2011, with midnight (0h) marked on the time axis. Annotations: Rides at Midnight, Night Life!, Weekly Patterns]
Analysis/Modeling
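An hourly ride count of the kind plotted for June 2011 can be computed with a simple aggregation. This is a sketch under assumptions: the timestamp format and the sample values are hypothetical, and the slides do not state how the counts were actually produced.

```python
from collections import Counter
from datetime import datetime


def rides_per_hour(pickup_timestamps):
    """Count rides in each (date, hour) bucket."""
    counts = Counter()
    for ts in pickup_timestamps:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")  # assumed format
        counts[(t.date().isoformat(), t.hour)] += 1
    return counts


# Toy pickup timestamps.
sample = [
    "2011-06-01 00:05:00",
    "2011-06-01 00:40:00",
    "2011-06-01 08:15:00",
    "2011-06-02 00:01:00",
]
hourly = rides_per_hour(sample)
# Rides in the midnight hour (0h) for each day.
midnight = {day: n for (day, hour), n in hourly.items() if hour == 0}
```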
TLCVis
Drop-offs vs. Pickups
[Figure: map panels labeled Drop-off and Pickup]
Most of the drop-offs occur on the avenues, while most of the pickups occur on the streets
Studying Anomalies
[Figure: map panels for 8:00AM-8:30AM, 6:00AM-6:30AM, and 4:00AM-4:30AM on Sunday, May 1st 2011]
Studying Anomalies
[Figure: map panels for 8:00AM-8:30AM and 9:30AM-10:00AM on Sunday, May 1st 2011]
Five Borough Bike Tour
Interpretation
Studying Anomalies
[Figures: Sunday May 1st 2011, shown for 07:00AM-08:00AM, 08:00AM-10:00AM, and 10:00AM-11:00AM]
Studying Patterns
May 1st – May 7th 2011
3.6 Million Trips
Compare movement in the airports against the large train stations
[Figure panels: Train Stations, Airports]
Data exploration reveals bad data…
Uses of Clean Data: FindMeACab App
Take Away
Data exploration is challenging for both small and big data
It is hard to prepare data for exploration
For many tasks, existing tools are either too cumbersome or not scalable
Need better, usable tools
– Tools for data enthusiasts who are not computer scientists!
Visualization is essential for exploring large volumes of data: “A picture is worth a thousand words”
Pictures help us think [Tamara Munzner]
– Substitute perception for cognition
– Free up limited cognitive/memory resources for higher-level problems
Masters in Big Data
New degree at NYU Poly
– Spring 2014
Courses:
– Machine learning
– Massive data analysis
– Visualization
– Visual Analytics
– Database Systems
– Algorithms
– …
Thanks