CS 102: Big Data Tools and Techniques
Discoveries and Pitfalls
Prof. Jennifer Widom
Ethan Chan, Akash Das Sarma, Ivan Gozali
CS102 2
Background
§ Taught freshman seminar on Big Data in fall quarter (CS 46N)
§ High demand, students seemed happy
§ Now trying it as regular course
WARNING: This is a brand-new course!
Who Should Take the Course
§ Non-CS majors (or early CS majors)
§ Undergraduates or graduates
§ Not afraid of numbers
§ Not afraid of computer tools
§ Took equivalent of one programming class
§ Patient and tolerant, because…
This is a brand-new course!
Who is Taking the Course
§ Biology
§ Business
§ Civil/Env Engineering
§ Computer Science
§ Economics
§ Electrical Engineering
§ Energy Resource Engg
§ English
§ Epidemiology
§ Human Biology
§ International Policy
§ Management
§ Materials Sci & Engg
§ Math and Comp Science
§ Management Sci & Engg
§ Psychology
§ Science Tech & Society
§ Statistics
§ Symbolic Systems
§ Undeclared
Undergraduates, Masters, MBA, PhD
What Does “Big Data” Mean?
(1) Collecting large amounts of data
Via computers, sensors, people, events …
(2) Doing something with it
Make decisions, confirm hypotheses, gain insights, predict future …
“Data Science” = Going from (1) to (2)
Is Big Data a Fad?
Was computer programming a fad?
§ Ability to collect data will only increase
§ Ability to analyze data will only improve
What would go away?
Rest of Today
1. Introduction to good stuff (utilities, discoveries)
2. Introduction to bad stuff (pitfalls, privacy)
3. What the course will cover
4. Class logistics
Sports
Reams of computer-analyzed basketball video revealed that when Kobe Bryant shot a “pull up jumper going to his left hand,” his success rate was 0.1 points less than average for his team’s possession.
(1) Collect data
(2) Do something with it
Gene-Drug Relationships
PharmGKB collects, curates, and disseminates knowledge about the impact of human genetic variation on drug responses.
(1) Collect and curate data (2) Do something with it
Advertising
http://www.fastcompany.com/3036425/a-facebook-users-challenge-to-facebook-heres-all-my-data-now-give-me-ads-i-like
(1) Collect data (2) Do something with it
Privacy
Publicized cases of improper covert collection of individual data
But individual data can also bleed legally through surprising channels
Google Flu Trends
Extremely accurate geographic flu tracking based on user search terms
Until one year it didn’t work
“Big Data Hubris”
Google Translate
Automatically learn language translation from examples on the web
Anyone see a problem?
Football Game Prediction
§ Saturday: receive email from “Prescient Polly” predicting results of four Sunday football games. She’s right.
§ Same thing the following weekend.
§ And two more weekends.
§ Fifth Saturday: Polly offers to place bets for you on Sunday games, for a fee.
Should you do it?
Football Game Prediction
How many contacts does Polly need on week one for 100 potential bettors on week five?
4 games × 4 weeks = 16 predictions, each with 2 possible outcomes
2^16 × 100 = 65,536 × 100 ≈ 6.5 million
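The arithmetic on this slide can be checked with a short Python sketch. It assumes that every game has exactly two outcomes (no ties) and that Polly sends each possible combination of picks to an equal share of her contacts, so only 1 in 16 recipients sees four correct picks in any given week. The function names `contacts_needed` and `survivors_by_week` are hypothetical, chosen only for this illustration.

```python
def contacts_needed(games_per_week: int, weeks: int, survivors: int) -> int:
    """Initial mailing size so that `survivors` recipients see only correct picks.

    Assumes two outcomes per game and an even split of prediction
    combinations across recipients.
    """
    outcome_patterns = 2 ** (games_per_week * weeks)
    return outcome_patterns * survivors


def survivors_by_week(initial: int, games_per_week: int, weeks: int) -> list:
    """Number of recipients who have seen a perfect record after each week."""
    remaining = initial
    history = []
    for _ in range(weeks):
        # Only one of the 2**games_per_week combinations is fully correct.
        remaining //= 2 ** games_per_week
        history.append(remaining)
    return history


initial = contacts_needed(games_per_week=4, weeks=4, survivors=100)
print(initial)  # 6553600
print(survivors_by_week(initial, games_per_week=4, weeks=4))
# [409600, 25600, 1600, 100]
```

Under these assumptions, the 100 people who receive the week-five offer have seen 16 correct predictions in a row without Polly having any predictive skill at all.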
What We’ll Cover
§ Data analysis techniques: basic data operations, data mining, machine learning (regression, classification, clustering)
§ Tools for data management & analysis: spreadsheets, data visualization, Jupyter, relational databases / SQL, Python, R
§ Anomaly detection, sampling and statistical significance
§ Plus: guest speakers, history, case studies, pitfalls, privacy issues, …
Assigned Work
Assignment/Project | Assigned | Due
Assignment #1: Spreadsheets and basic data visualization | March 31 | April 10
Assignment #2: SQL, correlation and causation | April 12 | April 21
Project #1: Personal Data Analysis | April 19 | April 26 and May 12
Assignment #3: Data mining, regression, Python | April 21 | May 1
Project #2: Movie-Rating Predictions | May 10 | May 23
Assignment #4: Classification, clustering, anomalies, sampling & statistical significance, R (extra credit: Tableau) | May 17 | May 29
Exams
Exam | Date
Midterm exam (in class) | May 5
Final exam (at assigned time, but not 3 hours) | June 7
Class Logistics
§ Units: 4 for undergraduates, 3-4 for graduates
§ WAYS requirement: Applied Quantitative Reasoning (WAY-AQR)
§ Textbook? No. Readings? Recommended.
§ Grade weighting: 1/3 each assignments, projects, exams
§ Graded on a curve? No
§ Late policy: 10%/30% penalty for 24/48 hours late; four free late days
Unlikely course will be taught next year
Office Hours
Working office hours every evening Sunday-Thursday
• 7:00-9:30 PM in Huang basement (starting April 3)
• Staffed by course assistant(s), look for “CS 102” sign
• Good time & place to work on assignments!
Prof. Widom daytime office hours
• Default Mondays & Fridays 11:00-12:00
• Some variation, posted in advance on course home page
• Also by appointment
• Gates CS building, office #422
Online
Website – http://cs102.stanford.edu
Piazza - for announcements, Q&A, discussion
Canvas - for assignment submission & grading
Staff mailing list - [email protected] for private questions only
For Thursday’s Class
1) Get set up on Google Drive if you’re not already
2) Download US City Temperatures data from course website (two files)
3) Copy data files into Google Drive, make sure you can open with Google Sheets
4) Bring laptop to class (or share)