Data Science Background and Course Software setup Week 1.

Date post: 20-Jan-2016
Upload: lee-joseph
Data Science Background and Course Software setup

Week 1

Index

Installation process

Lecture 1: Introduction to big data and data science

Lecture 2: Performing data science and preparing data

Installation process (I)

The same development environment for everyone: a virtual machine built with two free software packages, VirtualBox and Vagrant

Hardware and Software Prerequisites

Minimum hardware requirements:

• Free disk space: 3.5 GB

• RAM: 2.5 GB (4+ GB preferred)

• Processor: any recent Intel or AMD multicore processor should be sufficient

Supported operating systems: Windows, Linux, Mac OS X
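
The disk and processor requirements above can be checked with a short script. This is a minimal sketch using only the Python standard library. The function names and the reading of "multicore" as at least two cores are assumptions made here for illustration, and RAM is left out because the standard library has no portable way to query it.

```python
import os
import shutil

# Threshold taken from the requirements listed above.
MIN_FREE_DISK_GB = 3.5
MIN_CPU_CORES = 2  # assumption: "multicore" read as at least two cores


def check_prerequisites(free_disk_bytes, cpu_cores):
    """Return a dict mapping each requirement to True or False."""
    free_gb = free_disk_bytes / 1024 ** 3
    return {
        "disk": free_gb >= MIN_FREE_DISK_GB,
        "cpu": cpu_cores is not None and cpu_cores >= MIN_CPU_CORES,
    }


def check_this_machine(path="."):
    """Measure the machine the script runs on and check it."""
    free_bytes = shutil.disk_usage(path).free
    return check_prerequisites(free_bytes, os.cpu_count())


if __name__ == "__main__":
    for name, ok in check_this_machine().items():
        print(name, "OK" if ok else "below the minimum")
```

The check looks at the disk that holds the current directory, so it is best run from the directory where the virtual machine will be created.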

Installation process (II)

Installation of VirtualBox:

• virtualbox.org -> Downloads

• Choose the appropriate version of VirtualBox for your OS

Installation of Vagrant:

• www.vagrantup.com -> Downloads

• Choose the appropriate version of Vagrant for your OS

Installation of the Virtual Machine:

1. Create a custom directory (e.g., /home/marrval/myvagrant)

2. Download the file https://github.com/spark-mooc/mooc-setup/archive/master.zip to the custom directory and unzip it

3. Copy Vagrantfile to the custom directory you created in step 1

4. Open a DOS prompt (Windows) or Terminal (Mac/Linux), change to the custom directory, and issue the command vagrant up (VirtualBox opens in the background)

Sparkvm is running!
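
The virtual machine steps above can also be scripted. The sketch below is one possible automation in Python and is not part of the course materials: the function names are hypothetical, it assumes the vagrant command is on the PATH, and it searches the unzipped archive for Vagrantfile rather than assuming where the file sits.

```python
import shutil
import subprocess
import urllib.request
import zipfile
from pathlib import Path

# URL from the instructions above.
SETUP_ZIP_URL = "https://github.com/spark-mooc/mooc-setup/archive/master.zip"


def unzip_and_copy_vagrantfile(zip_path, custom_dir):
    """Unzip the archive into custom_dir and copy Vagrantfile to its top level."""
    custom_dir = Path(custom_dir)
    custom_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(custom_dir)
    # Look for the Vagrantfile anywhere below the custom directory.
    matches = sorted(custom_dir.rglob("Vagrantfile"))
    if not matches:
        raise FileNotFoundError("no Vagrantfile found in " + str(zip_path))
    target = custom_dir / "Vagrantfile"
    if matches[0] != target:
        shutil.copy(matches[0], target)
    return target


def setup_vm(custom_dir):
    """Download the setup archive, place the Vagrantfile, and run vagrant up."""
    custom_dir = Path(custom_dir)
    custom_dir.mkdir(parents=True, exist_ok=True)
    zip_path = custom_dir / "master.zip"
    urllib.request.urlretrieve(SETUP_ZIP_URL, str(zip_path))
    unzip_and_copy_vagrantfile(zip_path, custom_dir)
    # Same as typing "vagrant up" in the custom directory.
    subprocess.run(["vagrant", "up"], cwd=str(custom_dir), check=True)
```

Calling setup_vm with the custom directory, for example setup_vm(Path.home() / "myvagrant"), would perform steps 1 to 4 in one go. Nothing is downloaded or started until that call is made.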

Installation process (III)

Basic Instructions for Using the Virtual Machine

• To start the VM, from a DOS prompt (Windows) or Terminal (Mac/Linux), issue the command vagrant up

• To stop the VM, use the command vagrant halt. You should always stop the VM before you log off, turn off, or reboot your computer

• To erase or delete the VM, use the command vagrant destroy

• Once the VM is running, open a web browser to "http://localhost:8001/" to access the notebook. The VM starts the IPython notebook on port 8001, which gives access to an IPython notebook with Spark
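
Before opening the browser, it can help to confirm that something is listening on port 8001. The following is a minimal sketch with the Python standard library; the function name is hypothetical, and the host and port defaults simply mirror the URL given above.

```python
import socket


def notebook_port_is_open(host="localhost", port=8001, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: treat all as "not open".
        return False
```

A True result only means that a server is listening on the port. It does not confirm that the notebook page itself loads, so the browser check is still needed.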

Installation process (IV)

Running Your First Notebook

1. Start the VM

2. Open a web browser to "http://localhost:8001/"

3. Upload the file "lab0_student.ipynb", which is contained in the .zip

4. Run the cells and verify that you do not encounter any errors
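
Errors can also be looked for in a saved copy of the notebook after the cells have been run. The sketch below assumes the notebook has been downloaded as an .ipynb file, which is a JSON document; the function names are hypothetical. It handles both the older layout (cells nested in worksheets, errors marked "pyerr") and the newer one (top-level cells, errors marked "error").

```python
import json


def find_cell_errors(notebook):
    """Return the error names recorded in the code cells of a notebook dict."""
    # Format 4 keeps cells at the top level; format 3 nests them in worksheets.
    cells = list(notebook.get("cells", []))
    for sheet in notebook.get("worksheets", []):
        cells.extend(sheet.get("cells", []))
    errors = []
    for cell in cells:
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            # "error" is the format 4 name, "pyerr" the format 3 name.
            if output.get("output_type") in ("error", "pyerr"):
                errors.append(output.get("ename", "unknown"))
    return errors


def find_errors_in_file(path):
    """Load a saved .ipynb file and list the errors stored in it."""
    with open(path, encoding="utf-8") as handle:
        return find_cell_errors(json.load(handle))
```

An empty list means no error output was stored in the file. It does not prove that every cell was actually run.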

Introduction to big data and data science (I)

Correlation doesn’t imply causation

• Use more data

• Explore more types of data/factors
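
The point about exploring more factors can be illustrated with synthetic data. In the sketch below, which is an illustration and not an example from the lecture, a hidden factor drives two observed quantities: they correlate strongly although neither causes the other, and the correlation largely disappears once the hidden factor is accounted for. All values are randomly generated, and the variable names are made up.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Synthetic data: a hidden factor drives both observed quantities.
hidden = [rng.gauss(0, 1) for _ in range(n)]
x = [h + 0.3 * rng.gauss(0, 1) for h in hidden]
y = [h + 0.3 * rng.gauss(0, 1) for h in hidden]

# x and y are strongly correlated although neither causes the other.
raw = pearson(x, y)

# Once the hidden factor is subtracted out, little correlation is left.
x_rest = [xi - h for xi, h in zip(x, hidden)]
y_rest = [yi - h for yi, h in zip(y, hidden)]
adjusted = pearson(x_rest, y_rest)

print("correlation of x and y:", round(raw, 2))
print("after accounting for the hidden factor:", round(adjusted, 2))
```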

Introduction to big data and data science (II)

Big Data: Why all this excitement?

From 2003 to 2008, Google looked at weekly search queries

Identified 45 terms relevant to people searching about flu

Built a model

Google rolled out flu stories in Google News during this period, and people reading those stories skewed the results

Introduction to big data and data science (III)

Big Data: Why all this excitement?

• Bloggers used data science to analyze the elections

• The campaigns were using data science (database that modeled the behavior of the electorate)

Pollsters try to predict the outcome by polling people -> they have biases (+ errors) -> incorrect results

Challenge: remove biases + errors

Introduction to big data and data science (IV)

Cautionary tale

• How did they come to this conclusion?

• Look at Google Trends searches for MySpace and apply the same model to Facebook

• Correlation doesn’t imply causation

• Identify important factors

Introduction to big data and data science (V)

Where Does Big Data Come From?

• Online activity (which can be recorded): a lot of data is collected, but little of it is analyzed

• Users (user-generated content)

Each individual contribution is not very large

Introduction to big data and data science (VI)

Where Does Big Data Come From?

• Health and scientific computing

• Graphs

• Log files (generated by servers around the Internet)

• The Internet of Things (e.g., sensors in a forest, toll-collection transponders used for traffic reporting)

Performing Data Science and preparing Data (I)

What is Data Science?

• Data Science aims to derive knowledge from big data, efficiently and intelligently

• Data Science encompasses the set of activities, tools, and methods that enable data-driven activities in science, business, medicine, and government

Apply algorithms at scale to large amounts of data, and understand both the algorithms and the results

Collect data, analyze them, and understand the analytical process and results

Collect knowledge, apply algorithms, but do not understand

Performing Data Science and preparing Data (I)

What is Data Science?

• Data Science aims to derive knowledge from big data, efficiently and intelligently

• Data Science encompasses the set of activities, tools, and methods that enable data-driven activities in science, business, medicine, and government

Apply domain-specific knowledge at very large scale, and understand both the algorithms and the results

Performing Data Science and preparing Data (II)

Contrasting Data Science: Database

Performing Data Science and preparing Data (III)

Contrasting Data Science: Database

Contrasting Data Science: Scientific computing

Performing Data Science and preparing Data (IV)

Contrasting Data Science: Traditional Machine Learning

Performing Data Science and preparing Data (V)

Doing data science

• Problem -> collect data -> clean the data -> build a model -> communicate the results
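
The sequence above can be sketched as a chain of small functions, one per stage. This is only an illustration: the readings are made up, the "model" is just a mean, and all function names are hypothetical.

```python
def collect():
    """Collect raw records (made-up readings standing in for a real source)."""
    return ["12.5", "13.1", "", "n/a", "12.9", "13.4"]


def clean(raw_records):
    """Drop records that cannot be parsed as numbers."""
    cleaned = []
    for record in raw_records:
        try:
            cleaned.append(float(record))
        except ValueError:
            continue
    return cleaned


def build_model(values):
    """Fit the simplest possible model: the mean of the observations."""
    return sum(values) / len(values)


def communicate(model, n_used, n_total):
    """Summarize the result in one sentence."""
    return "Estimated value {:.2f} from {} of {} records".format(
        model, n_used, n_total
    )


raw = collect()
values = clean(raw)
model = build_model(values)
print(communicate(model, len(values), len(raw)))
```

Reporting how many records survived cleaning alongside the estimate is part of communicating the result, since it shows how much of the collected data the model actually rests on.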

Performing Data Science and preparing Data (V)

• Cloud computing: key enabler of data science

• Allows data science on a massive scale

Data science practice

Performing Data Science and preparing Data (VI)

What is hard about Data Science?

Performing Data Science and preparing Data (VII)

Data acquisition and Preparation

1. Extract data from sources

2. Load data into the sink

3. Transform data (at the source, at the sink, or in a staging area)
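
The three steps can be sketched with the Python standard library. In this illustration the source is a small piece of made-up CSV text, the sink is an in-memory SQLite table, and the transformation happens in a staging list before loading. The table name, column names, and function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Made-up source data standing in for a real export.
SOURCE_CSV = "city,temp_f\nBerkeley,68\nMadrid,95\nOslo,50\n"


def extract(text):
    """Extract: read rows from the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert Fahrenheit to Celsius in a staging list."""
    return [
        (row["city"], round((float(row["temp_f"]) - 32) * 5 / 9, 1))
        for row in rows
    ]


def load(records, connection):
    """Load: write the records into the sink (here, a SQLite table)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS temps (city TEXT, temp_c REAL)"
    )
    connection.executemany("INSERT INTO temps VALUES (?, ?)", records)
    connection.commit()


sink = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), sink)
print(sink.execute("SELECT city, temp_c FROM temps ORDER BY city").fetchall())
```

Transforming at the sink instead would mean loading the raw Fahrenheit values first and converting them with a query inside the database.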

Performing Data Science and preparing Data (VIII)

Data acquisition and Preparation

• We create pipelines or workflows, which can be scheduled

• Recording the execution of a workflow is known as capturing lineage or provenance (Spark does it automatically)

• Impediments to collaboration: diversity of tools/programming languages, finding a script is hard, most analysis work is ‘thrown away’
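
The idea of a workflow that records its own execution can be sketched in a few lines of Python. This is a toy illustration with a hypothetical Workflow class. It is not how Spark captures lineage internally; it only shows the kind of information a lineage record holds.

```python
import time


class Workflow:
    """Run named steps in order and record what was executed."""

    def __init__(self):
        self.steps = []
        self.lineage = []

    def add_step(self, name, func):
        self.steps.append((name, func))
        return self

    def run(self, data):
        for name, func in self.steps:
            started = time.time()
            result = func(data)
            # One lineage entry per step: what ran, on what, producing what.
            self.lineage.append({
                "step": name,
                "input": repr(data),
                "output": repr(result),
                "seconds": time.time() - started,
            })
            data = result
        return data


workflow = (
    Workflow()
    .add_step("drop_blanks", lambda rows: [r for r in rows if r.strip()])
    .add_step("to_int", lambda rows: [int(r) for r in rows])
    .add_step("total", sum)
)
result = workflow.run(["1", " ", "2", "3"])
for entry in workflow.lineage:
    print(entry["step"], entry["input"], "->", entry["output"])
```

With such a record, a result can be traced back through every step that produced it, which addresses part of the "thrown away" analysis problem listed above.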

Performing Data Science and preparing Data (VIII)

Data Science roles

Individual

Organizational

