CS 102: Big Data Tools and Techniques

Discoveries and Pitfalls

Prof. Jennifer Widom Ethan Chan, Akash Das Sarma, Ivan Gozali

CS102 2

Background

§  Taught freshman seminar on Big Data in fall quarter (CS 46N)

§  High demand, students seemed happy

§  Now trying it as regular course

WARNING: This is a brand-new course!

CS102 3

Who Should Take the Course

§  Non-CS majors (or early CS majors)

§  Undergraduates or graduates

§  Not afraid of numbers

§  Not afraid of computer tools

§  Took equivalent of one programming class

§  Patient and tolerant, because…

This is a brand-new course!

CS102 4

Who is Taking the Course

§  Biology
§  Business
§  Civil/Env Engineering
§  Computer Science
§  Economics
§  Electrical Engineering
§  Energy Resource Engg
§  English
§  Epidemiology
§  Human Biology
§  International Policy
§  Management
§  Materials Sci & Engg
§  Math and Comp Science
§  Management Sci & Engg
§  Psychology
§  Science Tech & Society
§  Statistics
§  Symbolic Systems
§  Undeclared

Undergraduates, Masters, MBA, PhD

CS102 10

What Does “Big Data” Mean?

(1) Collecting large amounts of data
    Via computers, sensors, people, events …

(2) Doing something with it
    Make decisions, confirm hypotheses, gain insights, predict future …

“Data Science” = Going from (1) to (2)

CS102 11

Is Big Data a Fad?

Was computer programming a fad?

§  Ability to collect data will only increase

§  Ability to analyze data will only improve

What would go away?

CS102 12

Rest of Today

1.  Introduction to good stuff (utilities, discoveries)

2.  Introduction to bad stuff (pitfalls, privacy)

3.  What the course will cover

4.  Class logistics

CS102 13

Traffic

(1)  Collect data (2)  Do something with it

CS102 14

Recommendation Systems

(1)  Collect data (2)  Do something with it

CS102 15

Sports

Reams of computer-analyzed basketball video revealed that when Kobe Bryant shot a “pull up jumper going to his left hand,” his success rate was 0.1 points less than average for his team’s possession.

(1)  Collect data (2)  Do something with it

CS102 16

Gene-Drug Relationships

PharmGKB collects, curates, and disseminates knowledge about the impact of human genetic variation on drug responses.

(1) Collect and curate data (2) Do something with it

CS102 17

Ocean Health

44,000 sensors, over 2 billion measurements

Physical, chemical, biological …

CS102 18

Advertising

http://www.fastcompany.com/3036425/a-facebook-users-challenge-to-facebook-heres-all-my-data-now-give-me-ads-i-like

(1)  Collect data (2)  Do something with it

Pitfalls

CS102 20

Pitfall: Causation

CS102 21

Pitfall: Correlation

CS102 25

Privacy

Publicized cases of improper covert collection of individual data

But individual data can also bleed legally through surprising channels

CS102 26

Google Flu Trends

  Extremely accurate geographic flu tracking based on user search terms

  Until one year it didn’t work

“Big Data Hubris”

CS102 27

Google Translate

  Automatically learn language translation from examples on the web

Anyone see a problem?

CS102 28

Football Game Prediction

§  Saturday: receive email from “Prescient Polly” predicting results of four Sunday football games. She’s right.

§  Same thing the following weekend.

§  And two more weekends.

§  Fifth Saturday: Polly offers to place bets for you on Sunday games, for a fee.

Should you do it?

CS102 29

Football Game Prediction

How many contacts does Polly need on week one for 100 potential bettors on week five?

Four games on each of four weekends is 16 predictions, each with two possible outcomes.

2^16 × 100 = 65,536 × 100 ≈ 6.5 million
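The arithmetic on this slide can be checked with a short Python sketch. It is an illustration only, under assumptions the slide does not spell out: every game has exactly two outcomes (no ties), and each weekend Polly splits her remaining recipients evenly across all possible outcome combinations, so only one group sees a fully correct email. The function names `contacts_needed` and `simulate` are hypothetical.

```python
def contacts_needed(games_per_week, weeks, survivors):
    """Week-one mailing size so that `survivors` people see only correct picks."""
    total_games = games_per_week * weeks
    # Assumption: two outcomes per game, so 2**total_games outcome sequences.
    return survivors * 2 ** total_games


def simulate(games_per_week, weeks, contacts):
    """Count recipients who have seen only correct predictions after each week."""
    remaining = contacts
    history = []
    for _ in range(weeks):
        # Only one of the 2**games_per_week combinations sent out is right,
        # so that fraction of the remaining recipients stays convinced.
        remaining //= 2 ** games_per_week
        history.append(remaining)
    return history


week_one = contacts_needed(games_per_week=4, weeks=4, survivors=100)
print(week_one)                   # 6553600
print(simulate(4, 4, week_one))   # [409600, 25600, 1600, 100]
```

The 100 people who reach week five have seen 16 correct predictions in a row, yet Polly never predicted anything: she only discarded everyone who received a wrong email.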

CS102 30

Enough Negativity!

Discoveries outweigh pitfalls

Balance will only improve

CS102 31

What We’ll Cover

§  Data analysis techniques: basic data operations, data mining, machine learning (regression, classification, clustering)

§  Tools for data management & analysis: spreadsheets, data visualization, Jupyter, relational databases / SQL, Python, R

§  Anomaly detection, sampling and statistical significance

§  Plus: guest speakers, history, case studies, pitfalls, privacy issues, …

CS102 32

Assigned Work

Assignment/Project | Topic | Assigned | Due
Assignment #1 | Spreadsheets and basic data visualization | March 31 | April 10
Assignment #2 | SQL, correlation and causation | April 12 | April 21
Project #1 | Personal Data Analysis | April 19 | April 26 / May 12
Assignment #3 | Data mining, regression, Python | April 21 | May 1
Project #2 | Movie-Rating Predictions | May 10 | May 23
Assignment #4 | Classification, clustering, anomalies, sampling & statistical significance, R (extra credit: Tableau) | May 17 | May 29

CS102 33

Exams

Exam | Date
Midterm exam (in class) | May 5
Final exam (at assigned time, but not 3 hours) | June 7

CS102 34

Class Logistics

§  Units: 4 for undergraduates, 3-4 for graduates

§  WAYS requirement: Applied Quantitative Reasoning (WAY-AQR)

§  Textbook? No. Readings? Recommended.

§  Grade weighting: 1/3 each assignments, projects, exams

§  Graded on a curve? No

§  Late policy: 10%/30% for 24/48 hours late, four free late days

Unlikely course will be taught next year

CS102 35

Office Hours

Working office hours every evening Sunday-Thursday
•  7:00-9:30 PM in Huang basement (starting April 3)
•  Staffed by course assistant(s), look for “CS 102” sign
•  Good time & place to work on assignments!

Prof. Widom daytime office hours
•  Default Mondays & Fridays 11:00-12:00
•  Some variation, posted in advance on course home page
•  Also by appointment
•  Gates CS building, office #422

CS102 36

Online

  Website – http://cs102.stanford.edu

  Piazza - for announcements, Q&A, discussion

  Canvas - for assignment submission & grading

  Staff mailing list - [email protected] for private questions only

CS102 37

For Thursday’s Class

1)  Get set up on Google Drive if you’re not already

2)  Download US City Temperatures data from course website (two files)

3)  Copy data files into Google Drive, make sure you can open with Google Sheets

4)  Bring laptop to class (or share)

QUESTIONS?

