Lifecycle Seminar Series

Welcome to the Community!

Live Tweet to #DSSS2

The Lifecycle Series

• #1: July 10 The Scientist, The Team and The Purpose
• #2: July 31 Organizing and Feeling Out Your Data

Dates and Topics not Finalized, but roughly:
• #3: Data / Analytics Preparation
• #4: Modeling, Classification, and Decision-Making
• #5: The Data Science Team
• #6: Telling The Story: Visualizing Results

We Want Contributors!

• Looking for people willing to lead one of the Topics in given seminars

• Looking for people who have an interesting anecdote or challenge to offer
  – Want to try integrating with main speaker or kick off networking session
  – Particularly interested in experiences/anecdotes for Session II (July 31): Organizing and Feeling Out Your Data

Data Lifecycle = Where we are

[Diagram: Build Team, Organizing Data, Preparing for Analysis, Finding Insights, Telling Stories]

But!...

Data Science Lifecycle

• Tonight, Focus is on Feeling Out Data
  – Primarily early-stage skill, but a part of all stages
  – Something everyone can do, increasingly so with modern tools

[Diagram: Build Team, Organizing Data, Preparing for Analysis, Finding Insights, Telling Stories. Callout: Organizing and Feeling Out your Data]

Tonight’s Agenda

• The Data Scientist Seminar Series
  – Followup from Seminar 1
  – Participation opportunities
• Jason Sroka: “Organizing and Feeling Out your Data”
• Wrap-up & Announcements
• Networking Session
  – Buy Jason Tequila!

MarketMeSuite – Our Venue Sponsor

MarketMeSuite’s Inbox For Social is how small businesses convert leads and market on social media

Approach & Goals

• Walk through steps of organizing and feeling out data
  – Focus on Data Scientist Survey
• Use Survey data and anecdotes to touch on Data Science topics
  – Not going deep, but trying to give a real feel
• Tool Discussion
  – Tableau and Google Refine

Data Setup

• We are all getting our data from somewhere
  – Personal data
  – Private data
  – Public data
• Need tool(s) to look at it with
  – Will see Tableau here, many others available
• Focus is on feeling out the data, not managing it
  – Will only mention some data management challenges
  – Not dealing with Big Data tonight (when we go international…)
  – These are topics that will be more central to future Meetup Seminars

What I did

• Quick scan of source
  – Excel File
  – Nulls in Beige
  – True flags in Green
  – 84 Data Rows
• Import the data
  – Tableau reads straight from Excel

[Diagram: Source(s), Import, Analytics Tool]

What a Quick Scan Shows

• Organization of Raw Data
  – Nulls in Beige
  – True flags in Green

Start with the Basics

• The first question
  – How much data?
  – 85 records imported
• Move to things you know/understand
  – Simple categories (gender, age, ..)
  – Check assumptions (e.g. more males than females)

Gender

• Simple category
  – Binary
  – Meaningful to everyone
• Data not quite so simple
  – What is a Null, compared to a Blank

Message #1: Data is Messy!

• Data Scientists have gender issues!
  – We have a Null and 3 blanks
  – Back to the source…
    • Null is a bad record (header?)
    • Blanks were user option
  – Clean it up
    • Don’t re-discover and re-implement
    • Someone needs to track these!
  – Null filtered in Tableau
    • Count now at 84
  – Blank relabeled to “N/A” in Excel
• Tools Discussion and Seminar 3 will go into Data Cleansing in more detail

[Plots: Gender, Before Cleaning and After Cleaning]
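
The Null-versus-Blank cleanup can also be sketched outside Tableau and Excel. This is only an illustration in plain Python: the field name `gender`, the helper `clean_gender`, and the sample rows are all made up, and treating a missing value (`None`) as the "bad record" is an assumption about what the Null represents.

```python
from collections import Counter

def clean_gender(records):
    """Drop records whose gender is Null (None); relabel blank answers as "N/A"."""
    cleaned = []
    for rec in records:
        value = rec.get("gender")
        if value is None:
            # Null: treated here as a bad record (e.g. a stray header row)
            continue
        value = value.strip()
        # Blank: a valid "no answer" choice, so keep the row but label it
        cleaned.append({**rec, "gender": value if value else "N/A"})
    return cleaned

# Made-up sample rows, not the survey data
sample = [
    {"id": 1, "gender": "Male"},
    {"id": 2, "gender": ""},
    {"id": 3, "gender": None},
    {"id": 4, "gender": "Female"},
    {"id": 5, "gender": "  "},
]

cleaned = clean_gender(sample)
gender_counts = Counter(rec["gender"] for rec in cleaned)
print(len(sample), "rows before cleaning;", len(cleaned), "rows after")
print(gender_counts)
```

Keeping the rule in one function is one way to avoid re-discovering and re-implementing the same fix each time the data is re-imported.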

Handedness

• Didn’t we just fix the NULL thing?
  – Yes – this is a new Null
  – Excel had a cut-and-paste error!
    • Formula wasn’t used in column – values were hard-coded
  – Fixed formula, copied throughout

[Plots: Handedness, Before Cleaning and After Cleaning]

Data Scientist Ethic

• Don’t ignore the warts!
  – Most warts are meaningless
    • Of those that aren’t, most are easy to figure out
      – Of those that aren’t, most are at least easy to fix once you figure it out
        » Of those that aren’t, most times you can get someone else to help you fix it
          • Of those that aren’t, you’ll usually improve your implementation skills when you resolve it
          • Sometimes this line of work sucks
  – The ones that aren’t help you understand the data
    • In this case, a problem with the data process
    • In other cases, interesting quirks and potential insights!

Age

• Survey question: Birth Year

• Seeing old and new issues
  – Blanks
  – Number ranges
    • Survey did not constrain to YYYY


Age


• Seeing old and new issues
  – Nulls
    • Turn out to be blanks – valid option in Survey
  – Number ranges
    • Survey did not constrain to YYYY
    • Fixed these three entries

Age, as Age

• Birth Year isn’t our interest, Age is
  – Transform your data to suit your needs
  – Be as direct between the data and the context as you can

[Plots: Birth Year, Age, Decade]
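
A minimal sketch of the Birth Year to Age to Decade transform, in plain Python. The helper `age_and_decade`, the reference year of 2012, and the sample birth years are assumptions for illustration only; the slides do not say how the calculation was done in Tableau.

```python
def age_and_decade(birth_year, reference_year):
    """Return (approximate age, age-decade label) for a four-digit birth year."""
    # Approximate: with only a birth year, the age can be off by one year
    age = reference_year - birth_year
    # Round the age down to its decade, e.g. 34 -> "30s"
    decade_label = f"{(age // 10) * 10}s"
    return age, decade_label

REFERENCE_YEAR = 2012  # assumption; use the year the survey was actually run

birth_years = [1978, 1985, 1961]  # made-up values, not the survey data
transformed = [age_and_decade(year, REFERENCE_YEAR) for year in birth_years]
print(transformed)
```

Age and decade are closer to the question being asked than the raw birth year, which is the point of connecting the data to the context.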

The Art of Data Science

• Message #2: Connect the Data to the Context
  – Transform the data to suit your needs
    • Easy investigation/understanding
    • Analytics goals
    • Operational goals
  – This is where Telling the Story feeds back
    • Effective plots help the data tell their story to you
    • Try things out!

Favorite Color

• Here, I’ve assigned colors near the named color

• Sorting by most prevalent to least

• Blank isn’t adding anything
  – Removing

Favorite Color

• Now, let’s add Gender

• Okay – I see differences!
  – Something to form an impression from
  – Something to come back to

• Blue is now the Official Data Scientist color!

Check Assumptions

• Assumption 1: More Males than Females

• Assumption 2: 10-15% Lefties
  – Underestimate!

• Assumption 3: Different color preferences by Gender

Checking Assumptions…

• Familiarizes You with the Data
  – Identifies data issues
  – Tests your assumptions
• Gives you Confidence in the Data…
  – Confidence in the initial source
  – Confidence in Extraction, Transformation, Load
• …and Your Assumptions
  – Confidence in your Intuition where it was right
  – Updates to your Intuition where it was off
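
Checking an assumption can be as simple as comparing an observed share with the range you expected. The sketch below is plain Python with made-up handedness values (not the survey data); the helper `shares` is hypothetical, and the 10-15% range is the assumption stated in the talk.

```python
from collections import Counter

def shares(values):
    """Return each category's share of the total as a fraction between 0 and 1."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

# Made-up answers, not the survey data: 2 Left out of 10
handedness = ["Right", "Right", "Left", "Right", "Left",
              "Right", "Right", "Right", "Right", "Right"]

expected_low, expected_high = 0.10, 0.15  # the assumed range for lefties
left_share = shares(handedness).get("Left", 0.0)

print(f"Left-handed share: {left_share:.0%}")
if left_share > expected_high:
    print("Assumption was an underestimate")
elif left_share < expected_low:
    print("Assumption was an overestimate")
else:
    print("Assumption holds")
```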

Building a Data Model

• Data comes in different types
  – Categorical
    • Gender, Handedness, Favorite Color, any true/false
  – Scalar
    • Age, height, weight
  – Label/identifier
  – …
• These data types are often associated with the purpose to which the data will be applied
  – Categories are dimensions along which we might divide the records
  – Measurements (Scalars) are facts about specific instances of what we’re modeling
• A good data model allows for rapid analytics
  – Modular construction of sets of dimensions and measurements
  – Automated investigation of cross-relationships
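
One way to picture dimensions and measurements is a small sketch that declares them up front and then loops over every combination. This is plain Python, not how Tableau works internally; the field names, the helpers `summarize` and `all_cross_summaries`, and the sample records are all made up for illustration.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

DIMENSIONS = ["gender", "handedness"]  # categorical fields we divide records by
MEASURES = ["age"]                     # scalar facts about each record

def summarize(records, dims, measure):
    """Group records by the given dimensions; return {key: (count, mean of measure)}."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[d] for d in dims)
        groups[key].append(rec[measure])
    return {key: (len(vals), mean(vals)) for key, vals in groups.items()}

def all_cross_summaries(records, dimensions, measures):
    """Summarize every measure over every non-empty combination of dimensions."""
    results = {}
    for size in range(1, len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for measure in measures:
                results[(dims, measure)] = summarize(records, dims, measure)
    return results

# Made-up records, not the survey data
records = [
    {"gender": "Male", "handedness": "Right", "age": 30},
    {"gender": "Male", "handedness": "Left", "age": 40},
    {"gender": "Female", "handedness": "Right", "age": 35},
]

summaries = all_cross_summaries(records, DIMENSIONS, MEASURES)
for (dims, measure), table in summaries.items():
    print(dims, measure, table)
```

Adding a new category or measurement then means adding one name to a list, and every cross-relationship is produced automatically.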

Survey Duration

• Another processed ‘field’
  – Duration = End Time - Start Time
• Plotting it all: sparse info
  – A lot of short times
  – A few long times
• Outliers are hiding the data!
  – After filtering out extremely high values, a different picture emerges…

Same Data, Different Lenses
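
A minimal sketch of the duration field and the outlier filter, in plain Python. The timestamp format, the 600-second cutoff, and the sample rows are assumptions for illustration; the slides do not state the cutoff that was actually used.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout
CUTOFF_SECONDS = 600               # assumed threshold for "extremely high"

def duration_seconds(start_text, end_text):
    """Return End Time minus Start Time in seconds."""
    start = datetime.strptime(start_text, TIME_FORMAT)
    end = datetime.strptime(end_text, TIME_FORMAT)
    return (end - start).total_seconds()

# Made-up (start, end) pairs, not the survey data
rows = [
    ("2012-07-01 10:00:00", "2012-07-01 10:01:30"),
    ("2012-07-01 11:00:00", "2012-07-01 11:02:10"),
    ("2012-07-01 12:00:00", "2012-07-02 09:00:00"),  # left open overnight
]

durations = [duration_seconds(start, end) for start, end in rows]
typical = [d for d in durations if d <= CUTOFF_SECONDS]
outliers = [d for d in durations if d > CUTOFF_SECONDS]

print("all durations:", durations)
print("after filtering:", typical)
print("set aside as outliers:", outliers)
```

The outliers are set aside rather than deleted, so they can still be inspected later.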

Playing with Plots 1: Beware Bad Binners!

• How you choose bins and plot a histogram can impact your interpretation

[Plots: Same Data, Different Axes]
• Very flat; one entry per bin
• Still flat, but the voids in X-axis have meaning

[Plot titles: Survey Duration: 1 Second Bins; Survey Duration: 1,000 Second Bins]
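
The effect of bin size is easy to reproduce. The sketch below is plain Python with made-up durations in seconds (not the survey data); the helper `histogram` is hypothetical and simply counts values per fixed-width bin.

```python
from collections import Counter

def histogram(values, bin_size):
    """Map each bin's lower edge to the number of values falling in that bin."""
    counts = Counter((value // bin_size) * bin_size for value in values)
    return dict(sorted(counts.items()))

# Made-up durations in seconds, not the survey data
durations = [62, 75, 88, 91, 95, 104, 118, 131]

for bin_size in (1, 30, 1000):
    print(f"{bin_size:>5} second bins:", histogram(durations, bin_size))
```

With 1-second bins every value sits alone in its own bin, with 1,000-second bins everything collapses into one bar, and only the middle choice shows any shape.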

The Practice of Data Science

• I just tricked you into looking at a bunch of data!
  – That is Data Science in action
• It is a skill like many others
  – We all have some ability
  – We get better with practice
• It’s pattern recognition

[Plots: Survey Duration at Bin Size (seconds) of 1, 3, 5, 10, 15, 20, 30, 45, 60, and 1,000]

The Science of Data

• Distributions have meaning
  – Flat: random, fixed
  – Normal distributions: repeated processes
  – Exponential: cumulative processes
• Over time, we interpret data in terms of known distributions
  – Survey Duration: Gaussian + Exponential

[Distribution images: Wikipedia.org]

Survey Duration

• Another processed ‘field’
  – Duration = End Time - Start Time
• Plotting it all: sparse info
  – A lot of short times
  – A few long times
• Outliers are hiding the data!
  – After filtering out extremely high values, a different picture emerges
  – Normal Distribution plus sparse tail
    • People who start, complete, end
    • People who start, stop, return, <repeat>, end

Same Data, Different Lenses
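
To build intuition for a normal bulk plus a sparse tail, one can simulate it. This is a toy model in plain Python, not a fit to the survey: the function `simulate_durations` is hypothetical, and the 120-second mean, 30-second spread, 10% interruption rate, and one-hour average pause are all invented parameters.

```python
import random
from statistics import mean, median

def simulate_durations(n, seed=0):
    """Simulate survey durations: a normal bulk plus an occasional exponential pause."""
    rng = random.Random(seed)
    durations = []
    for _ in range(n):
        # People who start, complete, end: roughly normal around two minutes
        duration = max(1.0, rng.gauss(120, 30))
        # People who start, stop, return, end: add a long, exponential pause
        if rng.random() < 0.1:
            duration += rng.expovariate(1 / 3600)
        durations.append(duration)
    return durations

simulated = simulate_durations(1000)
print(f"median: {median(simulated):.0f} s")
print(f"mean:   {mean(simulated):.0f} s")
print(f"max:    {max(simulated):.0f} s")
```

The handful of long pauses pulls the mean well above the median, which is why a plot of everything hides the shape of the bulk.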

Tools

• I used Tableau here
  – A lot can be done directly in Excel
• Google Refine looks impressive
  – Highlights cleansing issues, supports resolution
  – http://www.youtube.com/watch?v=B70J_H_zAWM&feature=player_embedded

[Diagram: Source(s), Import, Analytics Tool]


Closing Thoughts

• Message #1: Data is Messy
  – Don’t ignore the warts
• Message #2: Connect the Data to the Context
  – Translate data so it is expressed in your terms
• Message #3: Check Your Assumptions
  – Explore the data for insights
• Message #4: Develop Your Intuition
  – Look at a lot of data in a lot of ways

Who Rocks?

• A HUGE thanks to Peggy Sue for executing the survey and organizing the results!

• Super thanks to Tammy for live tweeting and sponsoring us at CIC!

The Lifecycle Series

Quick Note:

• #6: Telling The Story: Visualizing Results
  – Speaker: Hjalmar Gislason
    • CEO of DataMarket.com
    • Conference Speaker
    • Currently writing a book for O’Reilly called Effective Data Visualization

Connect with us!

Jason Sroka @jasonSroka http://www.linkedin.com/in/jasonsroka

Peggy Sue Hecht @hydriad http://www.linkedin.com/in/peggysuehecht

John Baker @johnAlanBaker http://www.linkedin.com/in/jab5569

Tammy Fennell @tammyKFennell http://www.linkedin.com/in/tammykahnfennell


Date post:	24-Feb-2016
Category:	Documents
Upload:	justus