Training a New Generation of Data Scientists

Date post: 20-Aug-2015
Upload: cloudera-inc
Transcript
Page 1: Training a New Generation of Data Scientists

Josh Wills | Senior Director of Data Science

Training a New Generation of Data Scientists

Page 2: Training a New Generation of Data Scientists

About Me

Page 3: Training a New Generation of Data Scientists

What Do Data Scientists Do?

Page 4: Training a New Generation of Data Scientists

What I Think I Do

Page 5: Training a New Generation of Data Scientists

What Other People Think I Do

Page 6: Training a New Generation of Data Scientists

What I Actually Do

Page 7: Training a New Generation of Data Scientists

The Emergence of Data Science

Page 8: Training a New Generation of Data Scientists

Data Storage in 2001: Databases

• Structured schemas
• Intensive processing done where data is stored
• Somewhat reliable
• Expensive at scale

Page 9: Training a New Generation of Data Scientists

Data Storage in 2001: Filers

• No schemas, stores any kind of file
• No data processing capability
• Reliable
• Expensive at scale

Page 10: Training a New Generation of Data Scientists

And Then, This Happened

Page 11: Training a New Generation of Data Scientists

Data Economics, Return on Byte

Page 12: Training a New Generation of Data Scientists

Big Data Economics

• No individual record is particularly valuable
• Having every record is incredibly valuable
• Web index
• Recommendation systems
• Sensor data
• Market basket analysis
• Online advertising

Page 13: Training a New Generation of Data Scientists

Enter Hadoop

Page 14: Training a New Generation of Data Scientists

The Hadoop Distributed File System

• Based on the Google File System
• Data stored in large files
• Large block size: 64MB to 256MB per block
• Blocks are replicated to multiple nodes in the cluster
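
To make the block size and replication points concrete, here is a minimal back-of-the-envelope sketch in Python. The function name `hdfs_footprint` is hypothetical, and the defaults (128MB blocks, replication factor 3) are assumptions chosen for illustration; the slide only gives a 64MB to 256MB range and does not state a replication factor.

```python
MB = 1024 * 1024

def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Estimate how a single file is laid out in an HDFS-style file system.

    Returns (number_of_blocks, raw_bytes_stored_across_cluster).
    Assumed defaults, not taken from the slides: 128MB blocks, 3 replicas.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication < 1:
        raise ValueError("sizes must be non-negative and replication >= 1")
    # Integer ceiling division: the last block may be only partially filled.
    num_blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    # Every byte of the file is stored once per replica; a partial last
    # block only takes up the bytes it actually holds.
    raw_bytes = file_size_bytes * replication
    return num_blocks, raw_bytes

if __name__ == "__main__":
    blocks, raw = hdfs_footprint(1024 * MB)
    print(blocks, raw // MB)
```

Under these assumptions a 1GB file splits into 8 blocks and takes up 3GB of raw disk across the cluster, which is the trade the design makes: extra storage in exchange for reliability and more places to run the processing.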

Page 15: Training a New Generation of Data Scientists

Simple, Reliable, Distributed Processing: MapReduce

• Map Stage: Embarrassingly parallel
• Shuffle Stage: Large-scale distributed sort
• Reduce Stage: Process all the values that have the same key in a single step
• Process the data where it is stored
• Write once and you’re done.
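
The three stages can be sketched in a few lines of single-process Python. This is an illustration of the programming model only, not Hadoop's API: the function names are hypothetical, and the word-count job is an assumed example that does not appear in the slides.

```python
from collections import defaultdict

def map_stage(records, mapper):
    """Apply the mapper to each record independently (embarrassingly parallel)."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_stage(pairs):
    """Group values by key and sort by key, standing in for the distributed sort."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items(), key=lambda kv: kv[0])

def reduce_stage(grouped, reducer):
    """Process all the values that share a key in a single step."""
    return {key: reducer(key, values) for key, values in grouped}

# Assumed example job: count word occurrences across lines of text.
def word_count_mapper(line):
    return [(word.lower(), 1) for word in line.split()]

def word_count_reducer(key, values):
    return sum(values)

def run_word_count(lines):
    mapped = map_stage(lines, word_count_mapper)
    grouped = shuffle_stage(mapped)
    return reduce_stage(grouped, word_count_reducer)

if __name__ == "__main__":
    print(run_word_count(["big data", "Big value"]))
```

In a real cluster the map calls run on the nodes that already hold the input blocks, and the shuffle moves data over the network so that each reducer sees every value for its keys. Here all three stages run in one process, so only the shape of the computation carries over.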

Page 16: Training a New Generation of Data Scientists

Thinking Like a Data Scientist

Page 17: Training a New Generation of Data Scientists

Solving Problems vs. Finding Insights

Page 18: Training a New Generation of Data Scientists

Parallelize Everything

Page 19: Training a New Generation of Data Scientists

Abundance vs. Scarcity

Page 20: Training a New Generation of Data Scientists

Building Data Products

Page 21: Training a New Generation of Data Scientists

Create a Data Science Team

Page 22: Training a New Generation of Data Scientists

Choose Good Problems

Page 23: Training a New Generation of Data Scientists

Design the Model

Page 24: Training a New Generation of Data Scientists

Mind the Gap

Page 25: Training a New Generation of Data Scientists

Amortize Costs

Page 26: Training a New Generation of Data Scientists

Measure Everything

Page 27: Training a New Generation of Data Scientists

Rinse and Repeat

Page 28: Training a New Generation of Data Scientists

Work Like a Data Scientist

Page 29: Training a New Generation of Data Scientists

Train Like a Data Scientist

Hadoop Developer Training

Hive and Pig Training

Introduction to Data Science

Page 30: Training a New Generation of Data Scientists

Introduction to Data Science: Building Recommender Systems

http://university.cloudera.com/

Page 31: Training a New Generation of Data Scientists
Page 32: Training a New Generation of Data Scientists

• Submit questions in the Q&A panel

• Watch on-demand video of this webinar at http://cloudera.com

• Follow Josh on Twitter @josh_wills

• Follow Cloudera University @ClouderaU

• Thank you for attending!

Register now for Cloudera training at http://university.cloudera.com

Use discount code DSvideo_10 to save 10% on new enrollments in Cloudera-delivered training classes until June 1

Use discount code 15off2 to save 15% on enrollments in two or more Cloudera-delivered training classes until June 1

