Lifecycle Seminar Series
Welcome to the Community!
Live Tweet to #DSSS2
The Lifecycle Series
• #1 (July 10): The Scientist, The Team and The Purpose
• #2 (July 31): Organizing and Feeling Out Your Data
Dates and topics not finalized, but roughly:
• #3: Data / Analytics Preparation
• #4: Modeling, Classification, and Decision-Making
• #5: The Data Science Team
• #6: Telling The Story: Visualizing Results
We Want Contributors!
• Looking for people willing to lead one of the topics in a given seminar
• Looking for people who have an interesting anecdote or challenge to offer
  – Want to try integrating with the main speaker or kicking off the networking session
  – Particularly interested in experiences/anecdotes for Session II (July 31): Organizing and Feeling Out Your Data
Data Lifecycle = Where we are
[Figure: lifecycle stages: Build Team, Organizing Data, Preparing for Analysis, Finding Insights, Telling Stories]
But!...
Data Science Lifecycle
• Tonight, focus is on Feeling Out Data
  – Primarily an early-stage skill, but a part of all stages
  – Something everyone can do, increasingly so with modern tools
[Figure: lifecycle stages: Build Team, Organizing Data, Preparing for Analysis, Finding Insights, Telling Stories; tonight's stage: Organizing and Feeling Out Your Data]
Tonight’s Agenda
• The Data Scientist Seminar Series
  – Follow-up from Seminar 1
  – Participation opportunities
• Jason Sroka: “Organizing and Feeling Out your Data”
• Wrap-up & Announcements
• Networking Session
  – Buy Jason Tequila!
MarketMeSuite – Our Venue Sponsor
MarketMeSuite’s Inbox For Social is how small businesses convert leads and market on social media
Approach & Goals
• Walk through steps of organizing and feeling out data
  – Focus on Data Scientist Survey
• Use Survey data and anecdotes to touch on Data Science topics
  – Not going deep, but trying to give a real feel
• Tool Discussion
  – Tableau and Google Refine
Data Setup
• We are all getting our data from somewhere
  – Personal data
  – Private data
  – Public data
• Need tool(s) to look at it with
  – Will see Tableau here, many others available
• Focus is on feeling out the data, not managing it
  – Will only mention some data management challenges
  – Not dealing with Big Data tonight (when we go international…)
  – These are topics that will be more central to future Meetup Seminars
What I did
• Quick scan of source
  – Excel File
  – Nulls in Beige
  – True flags in Green
  – 84 Data Rows
• Import the data
  – Tableau reads straight from Excel
[Figure: Source(s), Import, Analytics Tool]
What a Quick Scan Shows
• Organization of Raw Data
  – Nulls in Beige
  – True flags in Green
Start with the Basics
• The first question
  – How many records are there?
  – 85 records imported
• Move to things you know/understand
  – Simple categories (gender, age, ..)
  – Check assumptions (e.g. more males than females)
Gender
• Simple category
  – Binary
  – Meaningful to everyone
• Data not quite so simple
  – What is a Null, compared to a Blank?
Message #1: Data is Messy!
• Data Scientists have gender issues!
  – We have a Null and 3 blanks
  – Back to the source…
    • Null is a bad record (header?)
    • Blanks were user option
  – Clean it up
    • Don’t re-discover and re-implement
    • Someone needs to track these!
  – Null filtered in Tableau
    • Count now at 84
  – Blank relabeled to “N/A” in Excel
• Tools Discussion and Seminar 3 will go into Data Cleansing in more detail
[Figure: Before Cleaning, After Cleaning]
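The cleanup steps on this slide can be sketched in a few lines of Python. This is only an illustration, not the Tableau and Excel steps used in the talk: the record layout, the `gender` field name, and the sample rows are all hypothetical.

```python
# Hypothetical records standing in for the survey export; None marks a
# Null (bad record) and "" marks a Blank (respondent skipped the question).
records = [
    {"id": 1, "gender": "Male"},
    {"id": 2, "gender": "Female"},
    {"id": 3, "gender": ""},
    {"id": 4, "gender": None},
    {"id": 5, "gender": "Male"},
]


def clean_gender(rows):
    """Drop Null records and relabel Blanks as "N/A"."""
    cleaned = []
    for row in rows:
        if row["gender"] is None:
            continue  # Null: a bad record, so filter it out
        label = row["gender"].strip() or "N/A"  # Blank: keep, but relabel
        cleaned.append({**row, "gender": label})
    return cleaned


cleaned = clean_gender(records)
print(len(records), "->", len(cleaned))
print([row["gender"] for row in cleaned])
```

Keeping the rule in one function is one way to act on "Don’t re-discover and re-implement": the next person who meets the same Null or Blank reuses the fix instead of rebuilding it.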
Handedness
• Didn’t we just fix the NULL thing?
  – Yes, this is a new Null
  – Excel had a cut-and-paste error!
    • Formula wasn’t used in column; values were hard-coded
  – Fixed formula, copied throughout
[Figure: Before Cleaning, After Cleaning]
Data Scientist Ethic
• Don’t ignore the warts!
  – Most warts are meaningless
    • Of those that aren’t, most are easy to figure out
      – Of those that aren’t, most are at least easy to fix once you figure it out
        » Of those that aren’t, most times you can get someone else to help you fix it
          • Of those that aren’t, you’ll usually improve your implementation skills when you resolve it
          • Sometimes this line of work sucks
  – The ones that aren’t help you understand the data
    • In this case, a problem with the data process
    • In other cases, interesting quirks and potential insights!
Age
• Survey question: Birth Year
• Seeing old and new issues
  – Nulls
    • Turn out to be blanks, a valid option in the Survey
  – Number ranges
    • Survey did not constrain to YYYY
    • Fixed these three entries
Age, as Age
• Birth Year isn’t our interest, Age is
  – Transform your data to suit your needs
  – Be as direct between the data and the context as you can
[Figure: Birth Year, Age, Decade]
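As a rough illustration of the Birth Year to Age to Decade transformation, here is a minimal Python sketch. The helper names, the sample entries, and the `survey_year` value are assumptions made for the example; the talk did this in Tableau and does not state the survey year here.

```python
import re

YEAR_PATTERN = re.compile(r"\d{4}")  # a well-formed YYYY entry


def parse_birth_year(raw):
    """Return the birth year as an int, or None for blank or malformed entries."""
    text = (raw or "").strip()
    if YEAR_PATTERN.fullmatch(text):
        return int(text)
    return None  # blanks and non-YYYY entries need a manual look


def age_and_decade(birth_year, survey_year):
    """Approximate age in years and the decade bucket it falls in."""
    age = survey_year - birth_year
    decade = (age // 10) * 10
    return age, f"{decade}s"


# Hypothetical entries; survey_year is an arbitrary example value.
for raw in ["1975", "", "1980-1985", "1990"]:
    year = parse_birth_year(raw)
    if year is None:
        print(repr(raw), "-> needs review")
    else:
        print(repr(raw), "->", age_and_decade(year, survey_year=2010))
```

Entries that fail the YYYY check are flagged rather than guessed at, which mirrors going back to the source and fixing the handful of bad entries by hand.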
The Art of Data Science
• Message #2: Connect the Data to the Context
  – Transform the data to suit your needs
    • Easy investigation/understanding
    • Analytics goals
    • Operational goals
  – This is where Telling the Story feeds back
    • Effective plots help the data tell their story to you
    • Try things out!
Favorite Color
• Here, I’ve assigned colors near the named color
• Sorting by most prevalent to least
• Blank isn’t adding anything
  – Removing
Favorite Color
• Now, let’s add Gender
• Okay, I see differences!
  – Something to form an impression from
  – Something to come back to
• Blue is now the Official Data Scientist color!
Check Assumptions
• Assumption 1: More Males than Females
• Assumption 2: 10-15% Lefties
  – Underestimate!
• Assumption 3: Different color preferences by Gender
Checking Assumptions…
• Familiarizes You with the Data
  – Identifies data issues
  – Tests your assumptions
• Gives you Confidence in the Data…
  – Confidence in the initial source
  – Confidence in Extraction, Transformation, Load
• …and Your Assumptions
  – Confidence in your Intuition where it was right
  – Updates to your Intuition where it was off
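The three assumptions from the previous slide can each be checked with a simple count. The Python sketch below uses made-up responses and hypothetical field names (`gender`, `handedness`, `color`); it shows the shape of the check, not the survey's actual results.

```python
from collections import Counter

# Hypothetical responses, not the actual survey results.
responses = [
    {"gender": "Male", "handedness": "Right", "color": "Blue"},
    {"gender": "Male", "handedness": "Left", "color": "Green"},
    {"gender": "Female", "handedness": "Right", "color": "Purple"},
    {"gender": "Male", "handedness": "Right", "color": "Blue"},
    {"gender": "Female", "handedness": "Right", "color": "Blue"},
]

# Assumption 1: more males than females.
by_gender = Counter(r["gender"] for r in responses)
more_males = by_gender["Male"] > by_gender["Female"]

# Assumption 2: 10-15% lefties.
left_share = sum(r["handedness"] == "Left" for r in responses) / len(responses)
lefties_in_range = 0.10 <= left_share <= 0.15

# Assumption 3: color preferences differ by gender (cross-tabulate and look).
color_by_gender = Counter((r["gender"], r["color"]) for r in responses)

print("More males than females:", more_males)
print(f"Left-handed share: {left_share:.0%} (within 10-15%: {lefties_in_range})")
for (gender, color), count in sorted(color_by_gender.items()):
    print(gender, color, count)
```

With only a handful of records a cross-tab like this supports an impression, not a conclusion; it is something to come back to once more data is in.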
Building a Data Model
• Data comes in different types
  – Categorical
    • Gender, Handedness, Favorite Color, any true/false
  – Scalar
    • Age, height, weight
  – Label/identifier
  – …
• These data types often associate with the purpose to which the data will be applied
  – Categories are dimensions along which we might divide the records
  – Measurements (Scalars) are facts about specific instances of what we’re modeling
• A good data model allows for rapid analytics
  – Modular construction of sets of dimensions and measurements
  – Automated investigation of cross-relationships
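One way to picture the dimension/measurement split is a small Python sketch in which every field is declared as one or the other, so that any measurement can be summarized along any dimension automatically. The field names, records, and the `summarize` helper are hypothetical, and averaging is just one possible summary.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical field classification, following the slide's categories.
DIMENSIONS = ("gender", "handedness", "favorite_color")  # categorical
MEASURES = ("age",)                                      # scalar

# Hypothetical records, not the actual survey data.
records = [
    {"id": 1, "gender": "Male", "handedness": "Right", "favorite_color": "Blue", "age": 34},
    {"id": 2, "gender": "Female", "handedness": "Left", "favorite_color": "Green", "age": 29},
    {"id": 3, "gender": "Male", "handedness": "Right", "favorite_color": "Blue", "age": 41},
    {"id": 4, "gender": "Female", "handedness": "Right", "favorite_color": "Blue", "age": 37},
]


def summarize(rows, dimension, measure):
    """Average one measure within each value of one dimension."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[measure])
    return {key: mean(values) for key, values in groups.items()}


# Automated pass over every dimension/measure pair.
for dimension in DIMENSIONS:
    for measure in MEASURES:
        print(dimension, measure, summarize(records, dimension, measure))
```

Adding a new dimension or measurement then means adding one name to a tuple, which is the "modular construction" the slide describes.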
Survey Duration
• Another processed ‘field’
  – Duration = End Time - Start Time
• Plotting it all: sparse info
  – A lot of short times
  – A few long times
• Outliers are hiding the data!
  – After filtering out extremely high values, a different picture emerges…
Same Data, Different Lenses
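A minimal Python sketch of the derived duration field and the outlier filter follows. The timestamps are invented, and the cutoff rule (ten times the median) is an assumption chosen for illustration; the slide only says that extremely high values were filtered out.

```python
from datetime import datetime
from statistics import median

# Hypothetical start/end timestamps, not the actual survey log.
FORMAT = "%Y-%m-%d %H:%M:%S"
raw_times = [
    ("2000-01-01 10:00:00", "2000-01-01 10:01:05"),
    ("2000-01-01 10:02:00", "2000-01-01 10:03:30"),
    ("2000-01-01 10:05:00", "2000-01-01 10:06:10"),
    ("2000-01-01 10:07:00", "2000-01-01 18:45:00"),  # left open for hours
    ("2000-01-01 10:09:00", "2000-01-01 10:10:20"),
]


def duration_seconds(start, end):
    """Derived field: End Time minus Start Time, in seconds."""
    delta = datetime.strptime(end, FORMAT) - datetime.strptime(start, FORMAT)
    return delta.total_seconds()


durations = [duration_seconds(start, end) for start, end in raw_times]

# Assumed cutoff: anything over 10x the median counts as "extremely high".
cutoff = 10 * median(durations)
typical = [d for d in durations if d <= cutoff]
outliers = [d for d in durations if d > cutoff]

print("all:", durations)
print("typical:", typical)
print("outliers:", outliers)
```

The outliers are set aside rather than deleted, so they stay available for a second look at what they mean.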
Playing with Plots 1: Beware Bad Binners!
• How you choose bins and plot a histogram can impact your interpretation
[Figure: Very flat; one entry per bin]
[Figure: Still flat, but the voids in the X-axis have meaning]
Same Data, Different Axes
[Figures: Survey Duration histograms with 1, 3, 5, 10, 15, 20, 30, 45, 60, and 1,000 second bins]
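The effect of bin size is easy to reproduce. The Python sketch below bins one set of made-up durations at several widths and prints a text histogram for each; the `histogram` and `show` helpers and the sample values are hypothetical, and the original plots were made in Tableau.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin, keyed by each bin's lower edge."""
    counts = Counter((int(v) // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def show(values, bin_width):
    """Print a text histogram, including empty bins so gaps stay visible."""
    counts = histogram(values, bin_width)
    for edge in range(min(counts), max(counts) + bin_width, bin_width):
        print(f"{edge:>4}-{edge + bin_width - 1:<4} {'#' * counts.get(edge, 0)}")


# Hypothetical durations in seconds, not the actual survey data.
durations = [52, 58, 61, 63, 64, 67, 70, 71, 75, 83, 96, 140]

for width in (5, 15, 60):
    print(f"--- {width} second bins ---")
    show(durations, width)
```

With these sample values, 1 second bins hold one entry each and look flat, while very wide bins put nearly everything into a single bar; the widths in between are where a shape becomes visible.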
The Practice of Data Science
• I just tricked you into looking at a bunch of data!
  – That is Data Science in action
• It is a skill like many others
  – We all have some ability
  – We get better with practice
• It’s pattern recognition
[Figure: histograms by Bin Size (seconds): 1, 3, 5, 10, 15, 20, 30, 45, 60, 1,000]
The Science of Data
• Distributions have meaning
  – Flat: random, fixed
  – Normal distributions: repeated processes
  – Exponential: cumulative processes
• Over time, we interpret data in terms of known distributions
  – Survey Duration: Gaussian + Exponential
[Figures: distribution plots, source Wikipedia.org]
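To get a feel for what "Gaussian + Exponential" could look like, here is a small Python simulation. It is one possible reading of the slide, not a fit to the survey: the data is synthetic, and every parameter (mean, spread, interruption rate, pause length) is an assumption.

```python
import random
from statistics import mean, median

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def simulate_duration(p_interrupted=0.1):
    """One synthetic survey duration in seconds (all parameters assumed)."""
    # Gaussian part: respondents who start, complete, and end in one sitting.
    duration = max(1.0, rng.gauss(70, 15))
    # Exponential part: some respondents stop and return later, adding a pause.
    if rng.random() < p_interrupted:
        duration += rng.expovariate(1 / 600)  # mean pause of 600 seconds
    return duration


durations = [simulate_duration() for _ in range(1000)]

# A long right tail pulls the mean above the median.
print(f"median: {median(durations):.1f} s")
print(f"mean:   {mean(durations):.1f} s")
print(f"max:    {max(durations):.1f} s")
```

Comparing the median with the mean is a quick way to notice a sparse tail before plotting anything.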
Survey Duration
• Another processed ‘field’
  – Duration = End Time - Start Time
• Plotting it all: sparse info
  – A lot of short times
  – A few long times
• Outliers are hiding the data!
  – After filtering out extremely high values, a different picture emerges
  – Normal Distribution plus sparse tail
    • People who start, complete, end
    • People who start, stop, return, <repeat>, end
Same Data, Different Lenses
Tools
• I used Tableau here
  – A lot can be done directly in Excel
• Google Refine looks impressive
  – http://www.youtube.com/watch?v=B70J_H_zAWM&feature=player_embedded
  – Highlights cleansing issues, supports resolution
[Figure: Source(s), Import, Analytics Tool]
Data Science Lifecycle
• Tonight, focus is on Feeling Out Data
  – Primarily an early-stage skill, but a part of all stages
  – Something everyone can do, increasingly so with modern tools
[Figure: lifecycle stages: Build Team, Organizing Data, Preparing for Analysis, Finding Insights, Telling Stories; tonight's stage: Organizing and Feeling Out Your Data]
Closing Thoughts
• Message #1: Data is Messy
  – Don’t ignore the warts
• Message #2: Connect the Data to the Context
  – Translate data so it is expressed in your terms
• Message #3: Check Your Assumptions
  – Explore the data for insights
• Message #4: Develop Your Intuition
  – Look at a lot of data in a lot of ways
Who Rocks?
• A HUGE thanks to Peggy Sue for executing the survey and organizing the results!
• Super thanks to Tammy for live tweeting and sponsoring us at CIC!
The Lifecycle Series
Quick Note:
• #6: Telling The Story: Visualizing Results
  – Speaker: Hjalmar Gislason
    • CEO of DataMarket.com
    • Conference Speaker
    • Currently writing a book for O’Reilly called Effective Data Visualization
Connect with us!
Jason Sroka @jasonSroka http://www.linkedin.com/in/jasonsroka
Peggy Sue Hecht @hydriad http://www.linkedin.com/in/peggysuehecht
John Baker @johnAlanBaker http://www.linkedin.com/in/jab5569
Tammy Fennell @tammyKFennell http://www.linkedin.com/in/tammykahnfennell