
Lifecycle Seminar Series

Date post: 24-Feb-2016
Page 1: Lifecycle Seminar Series

Lifecycle Seminar Series

Welcome to the Community!

Live Tweet to #DSSS2

Page 2: Lifecycle Seminar Series

The Lifecycle Series

• #1: July 10 The Scientist, The Team and The Purpose
• #2: July 31 Organizing and Feeling Out Your Data

Dates and topics not finalized, but roughly:
• #3: Data / Analytics Preparation
• #4: Modeling, Classification, and Decision-Making
• #5: The Data Science Team
• #6: Telling The Story: Visualizing Results

Page 3: Lifecycle Seminar Series

We Want Contributors!

• Looking for people willing to lead one of the topics in a given seminar
• Looking for people who have an interesting anecdote or challenge to offer
  – Want to try integrating with the main speaker or kicking off the networking session
  – Particularly interested in experiences/anecdotes for Session II (July 31): Organizing and Feeling Out Your Data

Page 4: Lifecycle Seminar Series

Data Lifecycle = Where we are

[Lifecycle diagram: Build Team, Organizing Data, Preparing for Analysis, Finding Insights, Telling Stories]

But!...

Page 5: Lifecycle Seminar Series

Data Science Lifecycle

• Tonight, the focus is on Feeling Out Data
  – Primarily an early-stage skill, but a part of all stages
  – Something everyone can do, increasingly so with modern tools

[Lifecycle diagram: Build Team, Organizing Data, Preparing for Analysis, Finding Insights, Telling Stories; label: Organizing and Feeling Out your Data]

Page 6: Lifecycle Seminar Series

Tonight’s Agenda

• The Data Scientist Seminar Series
  – Followup from Seminar 1
  – Participation opportunities
• Jason Sroka: “Organizing and Feeling Out your Data”
• Wrap-up & Announcements
• Networking Session
  – Buy Jason Tequila!

Page 7: Lifecycle Seminar Series

MarketMeSuite – Our Venue Sponsor

MarketMeSuite’s Inbox For Social is how small businesses convert leads and market on social media

Page 8: Lifecycle Seminar Series
Page 9: Lifecycle Seminar Series

Approach & Goals

• Walk through steps of organizing and feeling out data
  – Focus on Data Scientist Survey
• Use Survey data and anecdotes to touch on Data Science topics
  – Not going deep, but trying to give a real feel
• Tool Discussion
  – Tableau and Google Refine

Page 10: Lifecycle Seminar Series

Data Setup

• We are all getting our data from somewhere
  – Personal data
  – Private data
  – Public data
• Need tool(s) to look at it with
  – Will see Tableau here, many others available
• Focus is on feeling out the data, not managing it
  – Will only mention some data management challenges
  – Not dealing with Big Data tonight (when we go international…)
  – These are topics that will be more central to future Meetup Seminars

Page 11: Lifecycle Seminar Series

What I did

• Quick scan of source
  – Excel File
  – Nulls in Beige
  – True flags in Green
  – 84 Data Rows
• Import the data
  – Tableau reads straight from Excel

[Diagram: Source(s), Import, Analytics Tool]

Page 12: Lifecycle Seminar Series

What a Quick Scan Shows

• Organization of Raw Data
  – Nulls in Beige
  – True flags in Green

Page 13: Lifecycle Seminar Series

Start with the Basics

• The first question
  – How much data is there?
  – 85 records imported
• Move to things you know/understand
  – Simple categories (gender, age, ...)
  – Check assumptions (e.g. more males than females)

Page 14: Lifecycle Seminar Series

Gender

• Simple category
  – Binary
  – Meaningful to everyone
• Data not quite so simple
  – What is a Null, compared to a Blank?

Page 15: Lifecycle Seminar Series

Message #1: Data is Messy!

• Data Scientists have gender issues!
  – We have a Null and 3 blanks
  – Back to the source…
    • Null is a bad record (header?)
    • Blanks were a user option
  – Clean it up
    • Don’t re-discover and re-implement
    • Someone needs to track these!
  – Null filtered in Tableau
    • Count now at 84
  – Blank relabeled to “N/A” in Excel
• Tools Discussion and Seminar 3 will go into Data Cleansing in more detail

[Figures: Before Cleaning, After Cleaning]
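
The Null-versus-blank cleanup on this slide was done in Tableau and Excel. As a rough illustration only, the same two rules (drop the Null record, relabel blanks as "N/A") can be sketched in Python. The rows, field names, and the `clean_gender` helper below are invented for the example and are not the survey data or the speaker's method.

```python
# Hypothetical rows standing in for the imported survey sheet (not the real data).
rows = [
    {"id": "Respondent", "gender": None},  # stray header read as a record -> Null
    {"id": "r1", "gender": "Male"},
    {"id": "r2", "gender": ""},            # respondent skipped the question -> blank
    {"id": "r3", "gender": "Female"},
    {"id": "r4", "gender": "  "},          # blank with stray whitespace
]


def clean_gender(records):
    """Drop Null records and relabel blank answers as "N/A"."""
    cleaned = []
    for rec in records:
        value = rec["gender"]
        if value is None:
            # Null: a bad record, so filter it out entirely.
            continue
        # Blank: a valid "no answer" choice, so keep the row and label it.
        label = value.strip() or "N/A"
        cleaned.append({**rec, "gender": label})
    return cleaned


cleaned = clean_gender(rows)
print(len(rows), "->", len(cleaned))
print([rec["gender"] for rec in cleaned])
```

Keeping both rules in one function gives the cleanup a single place to live, which is one way to avoid re-discovering and re-implementing it later.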

Page 16: Lifecycle Seminar Series

Handedness

• Didn’t we just fix the NULL thing?
  – Yes – this is a new Null
  – Excel had a cut-and-paste error!
    • Formula wasn’t used in column – values were hard-coded
  – Fixed formula, copied throughout

[Figures: Before Cleaning, After Cleaning]

Page 17: Lifecycle Seminar Series

Data Scientist Ethic

• Don’t ignore the warts!
  – Most warts are meaningless
    • Of those that aren’t, most are easy to figure out
      – Of those that aren’t, most are at least easy to fix once you figure it out
        » Of those that aren’t, most times you can get someone else to help you fix it
          • Of those that aren’t, you’ll usually improve your implementation skills when you resolve it
          • Sometimes this line of work sucks
  – The ones that aren’t help you understand the data
    • In this case, a problem with the data process
    • In other cases, interesting quirks and potential insights!

Page 18: Lifecycle Seminar Series

Age

• Survey question: Birth Year

• Seeing old and new issues
  – Blanks
  – Number ranges
    • Survey did not constrain to YYYY

Page 19: Lifecycle Seminar Series

Age

• Survey question: Birth Year

• Seeing old and new issues
  – Blanks
  – Number ranges
    • Survey did not constrain to YYYY
    • Fixed these three entries

Page 20: Lifecycle Seminar Series

Age

• Survey question: Birth Year

• Seeing old and new issues
  – Nulls
    • Turn out to be blanks – valid option in Survey
  – Number ranges
    • Survey did not constrain to YYYY
    • Fixed these three entries
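
A check like the one below is one possible way to surface Birth Year answers that are not in YYYY form. It is a sketch under assumptions: the sample answers are made up, the 1900 to 2010 range is an arbitrary plausibility bound, and `classify_birth_year` is a hypothetical helper, not something from the survey tooling.

```python
import re

FOUR_DIGITS = re.compile(r"\d{4}")


def classify_birth_year(raw, min_year=1900, max_year=2010):
    """Return 'blank', 'ok', or 'needs_review' for one raw survey answer.

    min_year and max_year are assumed bounds, chosen only for this example.
    """
    text = (raw or "").strip()
    if not text:
        return "blank"  # a valid option in the survey, not an error
    if FOUR_DIGITS.fullmatch(text) and min_year <= int(text) <= max_year:
        return "ok"
    return "needs_review"  # e.g. a two-digit year, or an age typed by mistake


# Made-up answers for illustration only.
answers = ["1978", "", "78", "1985", "34", None, "19850"]
flags = [classify_birth_year(a) for a in answers]
for answer, flag in zip(answers, flags):
    print(repr(answer), flag)
```

Entries flagged `needs_review` still need a person to go back to the source and decide what was meant.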

Page 21: Lifecycle Seminar Series

Age, as Age

• Birth Year isn’t our interest, Age is
  – Transform your data to suit your needs
  – Be as direct between the data and the context as you can

[Figures: Birth Year, Age, Decade]
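
The Birth Year to Age to Decade transformation can be sketched as two small functions. The birth years and the `survey_year` value of 2012 are assumptions made for the example (the slides do not state the survey year), and the resulting age is approximate because only the year of birth is known.

```python
def age_from_birth_year(birth_year, survey_year):
    """Approximate age: exact only if the birthday has already passed that year."""
    return survey_year - birth_year


def decade_label(age):
    """Bucket an age into its decade, e.g. 34 -> '30s'."""
    return f"{(age // 10) * 10}s"


# Invented birth years; survey_year is an assumed value for illustration.
birth_years = [1978, 1985, 1962]
survey_year = 2012

ages = [age_from_birth_year(y, survey_year) for y in birth_years]
decades = [decade_label(a) for a in ages]
print(list(zip(birth_years, ages, decades)))
```

Each step moves the field closer to the terms the audience thinks in: few people reason about "born in 1978", many reason about "in their 30s".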

Page 22: Lifecycle Seminar Series

The Art of Data Science

• Message #2: Connect the Data to the Context
  – Transform the data to suit your needs
    • Easy investigation/understanding
    • Analytics goals
    • Operational goals
  – This is where Telling the Story feeds back
    • Effective plots help the data tell their story to you
    • Try things out!

Page 23: Lifecycle Seminar Series

Favorite Color

• Here, I’ve assigned colors near the named color

• Sorting by most prevalent to least

• Blank isn’t adding anything
  – Removing

Page 24: Lifecycle Seminar Series

Favorite Color

• Now, let’s add Gender
• Okay – I see differences!
  – Something to form an impression from
  – Something to come back to
• Blue is now the Official Data Scientist color!

Page 25: Lifecycle Seminar Series

Check Assumptions

• Assumption 1: More Males than Females

• Assumption 2: 10-15% Lefties
  – Underestimate!

• Assumption 3: Different color preferences by Gender
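
Each of these three assumptions reduces to a count or a cross-tabulation. The sketch below shows the idea with `collections.Counter` from the Python standard library. The six records are invented for illustration and do not reproduce the survey results.

```python
from collections import Counter

# Invented (gender, handedness, favorite color) records; not the survey results.
records = [
    ("Male", "Right", "Blue"),
    ("Male", "Left", "Blue"),
    ("Female", "Right", "Green"),
    ("Male", "Right", "Red"),
    ("Female", "Right", "Blue"),
    ("Male", "Left", "Green"),
]


def share(values, target):
    """Fraction of values equal to target."""
    return sum(1 for v in values if v == target) / len(values)


# Assumption 1: more males than females.
gender_counts = Counter(gender for gender, _, _ in records)

# Assumption 2: share of left-handed respondents.
lefty_share = share([hand for _, hand, _ in records], "Left")

# Assumption 3: color preference broken out by gender (a simple cross-tab).
color_by_gender = Counter((gender, color) for gender, _, color in records)

print(gender_counts)
print(f"lefties: {lefty_share:.0%}")
print(sorted(color_by_gender.items()))
```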

Page 26: Lifecycle Seminar Series

Checking Assumptions…

• Familiarizes You with the Data
  – Identifies data issues
  – Tests your assumptions
• Gives you Confidence in the Data…
  – Confidence in the initial source
  – Confidence in Extraction, Transformation, Load
• …and Your Assumptions
  – Confidence in your Intuition where it was right
  – Updates to your Intuition where it was off

Page 27: Lifecycle Seminar Series

Building a Data Model

• Data comes in different types
  – Categorical
    • Gender, Handedness, Favorite Color, any true/false
  – Scalar
    • Age, height, weight
  – Label/identifier
  – …
• These data types often associate with the purpose to which the data will be applied
  – Categories are dimensions along which we might divide the records
  – Measurements (Scalars) are facts about specific instances of what we’re modeling
• A good data model allows for rapid analytics
  – Modular construction of sets of dimensions and measurements
  – Automated investigation of cross-relationships
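
One way to picture the dimension/measurement split is a generic summary function that takes any dimension and any measurement by name, so every pairing can be investigated in a loop. This is a minimal sketch with invented records; the field names and the `summarize` helper are assumptions for the example.

```python
from collections import defaultdict
from statistics import mean

DIMENSIONS = ("gender", "handedness")  # categorical fields: ways to divide records
MEASURES = ("age",)                    # scalar fields: facts about each record

# Invented records for illustration only.
records = [
    {"gender": "Male", "handedness": "Right", "age": 34},
    {"gender": "Female", "handedness": "Right", "age": 28},
    {"gender": "Male", "handedness": "Left", "age": 40},
    {"gender": "Female", "handedness": "Left", "age": 30},
]


def summarize(rows, dimension, measure):
    """Average one measurement within each value of one dimension."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[measure])
    return {key: mean(values) for key, values in groups.items()}


# Every dimension/measure pairing is investigated the same way.
summary = {
    (dim, msr): summarize(records, dim, msr)
    for dim in DIMENSIONS
    for msr in MEASURES
}
for pairing, result in summary.items():
    print(pairing, result)
```

Adding a new dimension or measurement then means adding one name to a tuple rather than writing a new analysis.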

Page 28: Lifecycle Seminar Series

Survey Duration

• Another processed ‘field’
  – Duration = End Time − Start Time
• Plotting it all: sparse info
  – A lot of short times
  – A few long times
• Outliers are hiding the data!
  – After filtering out extremely high values, a different picture emerges…

Same Data, Different Lenses
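
Deriving the duration field and setting aside the extreme values can be sketched as follows. The timestamps are invented, and the 600-second cutoff is an arbitrary assumption for the example, not a threshold taken from the talk.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def duration_seconds(start, end):
    """Processed field: End Time minus Start Time, in seconds."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds()


# Invented (start, end) pairs for illustration only.
sessions = [
    ("2000-01-01 10:00:00", "2000-01-01 10:01:05"),
    ("2000-01-01 10:05:00", "2000-01-01 10:06:30"),
    ("2000-01-01 11:00:00", "2000-01-01 11:00:48"),
    ("2000-01-01 09:00:00", "2000-01-01 13:00:00"),  # started, left, came back
]

durations = [duration_seconds(start, end) for start, end in sessions]

CUTOFF_SECONDS = 600  # assumed threshold for "extremely high"
typical = [d for d in durations if d <= CUTOFF_SECONDS]
outliers = [d for d in durations if d > CUTOFF_SECONDS]

print("typical:", typical)
print("outliers:", outliers)
```

The outliers are kept in their own list rather than discarded, since they may have a story of their own.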

Page 29: Lifecycle Seminar Series

Playing with Plots 1: Beware Bad Binners!

• How you choose bins and plot a histogram can impact your interpretation

[Figure captions: “Very flat; one entry per bin” and “Still flat, but the voids in X-axis have meaning”]

Same Data, Different Axes
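
The effect of bin width is easy to reproduce with a few lines of Python. The sketch below counts the same invented durations at three widths; the values and the `histogram` helper are assumptions for the example and not the survey data.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    return Counter(int(v // bin_width) * bin_width for v in values)


# Invented durations in seconds, for illustration only.
durations = [41, 44, 47, 52, 53, 58, 61, 66, 95]

for width in (1, 10, 60):
    counts = histogram(durations, width)
    print(f"{width:>2}s bins:", sorted(counts.items()))
```

With 1-second bins every count is 1 and the plot looks flat, with 10-second bins a cluster appears, and with 60-second bins most of the shape is lost again.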

Page 30: Lifecycle Seminar Series

Survey Duration: 1 Second Bins

Page 31: Lifecycle Seminar Series

Survey Duration: 3 Second Bins

Page 32: Lifecycle Seminar Series

Survey Duration: 5 Second Bins

Page 33: Lifecycle Seminar Series

Survey Duration: 10 Second Bins

Page 34: Lifecycle Seminar Series

Survey Duration: 15 Second Bins

Page 35: Lifecycle Seminar Series

Survey Duration: 20 Second Bins

Page 36: Lifecycle Seminar Series

Survey Duration: 30 Second Bins

Page 37: Lifecycle Seminar Series

Survey Duration: 45 Second Bins

Page 38: Lifecycle Seminar Series

Survey Duration: 60 Second Bins

Page 39: Lifecycle Seminar Series

Survey Duration: 1,000 Second Bins

Page 40: Lifecycle Seminar Series

The Practice of Data Science

• I just tricked you into looking at a bunch of data!
  – That is Data Science in action
• It is a skill like many others
  – We all have some ability
  – We get better with practice
• It’s pattern recognition

[Figure: survey duration histograms at bin sizes (seconds) of 1,000, 60, 45, 30, 20, 15, 10, 5, 3, and 1]

Page 41: Lifecycle Seminar Series

The Science of Data

• Distributions have meaning
  – Flat: random, fixed
  – Normal distributions: repeated processes
  – Exponential: cumulative processes
• Over time, we interpret data in terms of known distributions
  – Survey Duration: Gaussian + Exponential

[Figures: distribution plots, credit Wikipedia.org]

Page 42: Lifecycle Seminar Series

Survey Duration

• Another processed ‘field’
  – Duration = End Time − Start Time
• Plotting it all: sparse info
  – A lot of short times
  – A few long times
• Outliers are hiding the data!
  – After filtering out extremely high values, a different picture emerges
  – Normal Distribution plus sparse tail
    • People who start, complete, end
    • People who start, stop, return, <repeat>, end

Same Data, Different Lenses

Page 43: Lifecycle Seminar Series

Tools

• I used Tableau here
  – A lot can be done directly in Excel
• Google Refine looks impressive
  http://www.youtube.com/watch?v=B70J_H_zAWM&feature=player_embedded
  – Highlights cleansing issues, supports resolution

[Diagram: Source(s), Import, Analytics Tool]

Page 44: Lifecycle Seminar Series

Data Science Lifecycle

• Tonight, the focus is on Feeling Out Data
  – Primarily an early-stage skill, but a part of all stages
  – Something everyone can do, increasingly so with modern tools

[Lifecycle diagram: Build Team, Organizing Data, Preparing for Analysis, Finding Insights, Telling Stories; label: Organizing and Feeling Out your Data]

Page 45: Lifecycle Seminar Series

Closing Thoughts

• Message #1: Data is Messy
  – Don’t ignore the warts
• Message #2: Connect the Data to the Context
  – Translate data so it is expressed in your terms
• Message #3: Check Your Assumptions
  – Explore the data for insights
• Message #4: Develop Your Intuition
  – Look at a lot of data in a lot of ways

Page 46: Lifecycle Seminar Series

Who Rocks?

• A HUGE thanks to Peggy Sue for executing the survey and organizing the results!

• Super thanks to Tammy for live tweeting and sponsoring us at CIC!

Page 47: Lifecycle Seminar Series

The Lifecycle Series

Quick Note:

• #6: Telling The Story: Visualizing Results
  – Speaker: Hjalmar Gislason
    • CEO of DataMarket.com
    • Conference Speaker
    • Currently writing a book for O’Reilly called Effective Data Visualization

Page 48: Lifecycle Seminar Series

Connect with us!

Jason Sroka @jasonSroka http://www.linkedin.com/in/jasonsroka

Peggy Sue Hecht @hydriad http://www.linkedin.com/in/peggysuehecht

John Baker @johnAlanBaker http://www.linkedin.com/in/jab5569

Tammy Fennell @tammyKFennell http://www.linkedin.com/in/tammykahnfennell

