
Introduction to Data Science Section 3 Data Matters 2015 Sponsored by the Odum Institute, RENCI, and...

Date post: 23-Dec-2015
Introduction to Data Science Section 3 Data Matters 2015 Sponsored by the Odum Institute, RENCI, and NCDS Thomas M. Carsey [email protected] 1

1

Introduction to Data Science
Section 3
Data Matters 2015

Sponsored by the Odum Institute, RENCI, and NCDS

Thomas M. Carsey
[email protected]

2

Big Data

3

4

Big Data

• The launch of the Data Science conversation has been sparked primarily by the so-called “Big Data” revolution.

• As mentioned, we have always had data that taxed our technical and computational capacities.

• “Big Data” makes front-page news, however, because of the explosion of data about people.

• Contemporary definitions of Big Data focus on:
  – Volume (the amount of data)
  – Velocity (the speed of data in and out)
  – Variety (the diverse types of data)

5

6

7

How Big is Big?

• 500 million Tweets sent per day (by 2013)
• 300 hours of video uploaded to YouTube every minute.
  – https://www.youtube.com/yt/press/statistics.html
• 1.44 Billion Facebook users (April, 2015)
• Internet Usage:
  – http://www.internetlivestats.com/

8

So Much More

• Locational Tracking (smart cars, smart phones)
• Satellite images (Nightlight Project, parking lot images, crop images)
• Internet of Things
  – Smart energy grid; biochips in livestock; Fitbits; predictive maintenance

9

10

Big Data

• Despite their linkage in many contemporary discussions, Big Data ≠ Data Science.
  – Data science principles apply to all data – big and small.
  – There is also the so-called “Long Tail” of data.

11

The Long Tail

[Figure: long-tail diagram; labels: “Big Data”, “Most Data”]

12

Challenges of Big Data

• Big Data does present some unique challenges.
  – Searching for average patterns may be better served by sampling
  – Searching for rare events might require big data
    • Big haystacks (may) contain more needles.
  – This raises a point about so-called outliers
    • Rare or odd events might distort estimates of “average” effects.
    • However, rare events might also be exactly what you are seeking to study
  – Methods of outlier detection are crucial
    • Note: you may be looking for single outliers, pairs, or clusters
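The trade-off between sampling and big data on this slide can be illustrated with a small simulation. This is only a sketch under invented assumptions: the population size, the 20 planted "rare events", and the function names `prob_at_least_one`, `iqr_outliers`, and `simulate` are all hypothetical and do not come from the slides. It compares a simple random sample with the full data set, computes the chance that n independent draws contain at least one rare event, 1 - (1 - p)^n, and flags outliers with Tukey's IQR fences, which is one common rule among many.

```python
import random
import statistics


def prob_at_least_one(p, n):
    """Chance of at least one rare event in n independent draws,
    when each draw is a rare event with probability p."""
    return 1.0 - (1.0 - p) ** n


def iqr_outliers(values, k=1.5):
    """Return the values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def simulate(pop_size=200_000, n_rare=20, sample_size=2_000, seed=0):
    """Compare a full (hypothetical) data set with a simple random sample."""
    rng = random.Random(seed)
    # Typical observations centered on 50, plus a few planted extreme values.
    population = [rng.gauss(50, 10) for _ in range(pop_size - n_rare)]
    population += [1000.0] * n_rare
    sample = rng.sample(population, sample_size)
    return {
        "full_mean": statistics.fmean(population),
        "sample_mean": statistics.fmean(sample),
        "rare_in_full": sum(v > 500 for v in population),
        "rare_in_sample": sum(v > 500 for v in sample),
    }


if __name__ == "__main__":
    print(simulate())
    # Small sample: rare events are easy to miss.
    print(prob_at_least_one(1e-4, 2_000))
    # Big data: very likely to contain at least one.
    print(prob_at_least_one(1e-4, 200_000))
    print(iqr_outliers([10, 11, 12, 13, 14, 15, 16, 17, 18, 100]))
```

In this toy setup the sample recovers the average pattern well, but it will usually contain few or none of the planted rare events, while the full data set contains all of them.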

13

Challenges in the Long Tail

• Individual data sets are smaller.
• Aggregation could produce a whole greater than the sum of its parts, but:
  – Data sets might have similar measures, but use slightly different measurement strategies, metadata, etc.
    • The DataBridge Project
    • http://databridge.web.unc.edu/
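To make the aggregation problem concrete, here is a minimal sketch of harmonizing two small data sets before pooling them. Everything in it is hypothetical: the two studies, their field names (`income_annual_usd`, `income_monthly_k`), and the conversion rule are invented for illustration. It is not a description of how the DataBridge Project works.

```python
# Hypothetical records: study A reports annual income in dollars,
# study B reports monthly income in thousands of dollars.
study_a = [
    {"id": "a1", "income_annual_usd": 48_000},
    {"id": "a2", "income_annual_usd": 61_500},
]
study_b = [
    {"id": "b1", "income_monthly_k": 3.5},
    {"id": "b2", "income_monthly_k": 4.0},
]


def harmonize_a(row):
    """Study A already uses the target unit; record the source and cast to float."""
    return {
        "id": row["id"],
        "source": "A",
        "income_annual_usd": float(row["income_annual_usd"]),
    }


def harmonize_b(row):
    """Convert thousands of dollars per month to dollars per year."""
    return {
        "id": row["id"],
        "source": "B",
        "income_annual_usd": row["income_monthly_k"] * 1000 * 12,
    }


# Pool only after both data sets share one schema and one unit.
pooled = [harmonize_a(r) for r in study_a] + [harmonize_b(r) for r in study_b]

if __name__ == "__main__":
    for row in pooled:
        print(row)
```

Keeping a `source` field in the pooled records preserves the provenance of each row, so differences in measurement strategy can still be examined after aggregation.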

14

The Promise of Big Data

• There has been a lot of hype about Big Data.
• There is the belief among some that Big Data will solve all sorts of social, economic, and scientific problems.

• The “Truth” must be in there somewhere – we just need to find it.

• We have big problems – Big Data can help us solve them.

15

16

17

Hope or Hype?

• A Washington Post column by Samuel Arbesman titled “Five myths about big data” (8-16-2013) referenced the following tweet, offered as a definition of Big Data.
  – Big Data, n.: the belief that any sufficiently large pile of shit contains a pony with probability approaching 1 (by James Grimmelmann)

18

Even If True, What Kind of Pony?

19

Arbesman’s 5 Myths

• “Big Data” has a clear definition
• Big Data is new
• Big Data is revolutionary
• Bigger Data is better data
• Big Data means the end of scientific theories

20

Does Big = Good?

• Lost in most discussions of Big Data is whether it is representative data or not.
  – We can mine Twitter, but who tweets?
  – We can mine health records, but whose records do we have?
  – We can track online purchasing, but what about off-line market behavior?
• Survey research has spent decades worrying about representativeness, weighting, etc., but I do not see it discussed nearly as much in data science.
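One standard survey-research response to unrepresentative data is post-stratification weighting: compute the estimate within each group, then combine the groups using their known shares of the target population. The sketch below is illustrative only. The groups, the population shares, the sample values, and the function name `poststratified_mean` are all hypothetical, and real weighting schemes are usually more involved.

```python
from collections import defaultdict


def poststratified_mean(records, population_shares):
    """Weight each group's sample mean by its share of the target population.

    records: iterable of (group, value) pairs from the sample.
    population_shares: dict mapping group -> share of the population (sums to 1).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, value in records:
        totals[group] += value
        counts[group] += 1
    missing = set(population_shares) - set(counts)
    if missing:
        # A group with no sampled members cannot be reweighted.
        raise ValueError(f"no sample records for groups: {sorted(missing)}")
    return sum(
        share * totals[group] / counts[group]
        for group, share in population_shares.items()
    )


# Hypothetical sample that over-represents the "young" group (8 of 10 records).
sample = [("young", 1.0)] * 6 + [("young", 0.0)] * 2 + [("old", 1.0), ("old", 0.0)]
# Hypothetical shares of each group in the population of interest.
shares = {"young": 0.25, "old": 0.75}

raw_mean = sum(value for _, value in sample) / len(sample)
weighted_mean = poststratified_mean(sample, shares)

if __name__ == "__main__":
    print(raw_mean, weighted_mean)
```

Here the unweighted mean reflects mostly the over-represented group, while the weighted mean moves toward the value of the group that is under-represented in the sample. Weighting cannot help when a group is missing from the data entirely, which is why the function raises an error in that case.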

21

22

Theory, Methods, and Big Data

• The greatest need for theory and the greatest challenges for computationally intensive methods arise:
  – When data is too small – there is not enough information in the data by itself.
  – When data is too big – the computational costs become too high
  – There is a “just right” that allows for complex models and computationally demanding methods to be used so that theoretical assumptions can be relaxed.

23

One Example of Data Science

24

Data Science and Elections

• The Obama campaigns in 2008 and 2012 are credited for their successful use of social media and data mining.

• Micro-targeting in 2012
  – http://www.theatlantic.com/politics/archive/2012/04/the-creepiness-factor-how-obama-and-romney-are-getting-to-know-you/255499/
  – http://www.mediabizbloggers.com/group-m/How-Data-and-Micro-Targeting-Won-the-2012-Election-for-Obama---Antony-Young-Mindshare-North-America.html
  – Micro-profiles built from multiple sources accessed by apps, real-time updating of data based on door-to-door visits, focused media buys, and highly targeted e-mails and Facebook messages.
  – 1 million people installed the Obama Facebook app that gave access to info on “friends”.

26

27

29

Big Data and Politics: Something Old, Something New . . .

• The massive data collection and micro-targeting regarding voters that defined 2012 is both:
  – New
    • The amount and diversity of data mobilized for near-real-time updating and analysis was unprecedented.
  – Old
    • It is a reversion to retail, door-to-door, personalized politics.
  – “All Politics is Local” – Tip O’Neill.

