An Analyst’s Toolbox and the inclusion of R
Ali Arsalan Kazmi
April 9th, 2015
What is this presentation about?
Tools
They determine:
• What can be done
• How can it be done
• By when can it be done
They undergo rapid changes/improvements
New tools are constantly being made
All tools are designed to be ‘cures’ for specific ‘problems’
Tools
Tools may be:
• Under-utilised
• Over-utilised
• Incorrectly utilised
Tools may not ‘cure’ problems, as such
“What problems do we ‘cure’ using our tools at Aimia?”
Presentation Overview
• R
• An Analyst’s work flow
• Tool #1: Reproducibility
• Tool #2: Automation
• Tool #3: Visualisation
• What Problems R does not solve
• Conclusion: A Data Analyst’s Toolbox
R – Environment for Statistical Computing
Why choose R?
• Lingua franca for computational statisticians
• Can now perform almost all data extraction, manipulation, analysis, and visualisation tasks
• Offers specialist as well as general data analysis functions
• Is continuously improved
• Given all the above, is still free
An Analyst’s Work Flow
Generally, different tools are used at each stage
Each stage faces different problems
…but tools are available to cure these problems…
Tool #1: Reproducibility
Definition: The capacity of an analysis or a work flow to be reproduced
• Computational Reproducibility
• Statistical Reproducibility
What problems does this tool cure?
• Unreliability
• Lack of Quality Control
• Concealment of knowledge
• Static documents and reports (cured by dynamic/reactive ones)
Tool #1: Reproducibility
How to use this tool?
With Excel?
• Absence of operation history
• Unclear intra/inter-sheet organisation
• Quality Control difficult
• Difficult for a newcomer to follow
With R?
• Command history available
• Comments included in script
• Scripting – easier to automate
• Quality Control checks injected into code
• Dynamic/Reactive documents and reports
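The R-side points above (comments in the script, quality-control checks injected into code) can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the `sales` data, its column names, and the specific checks are hypothetical, and the data are simulated so the script runs on its own.

```r
# Reproducible analysis sketch: every step is scripted and commented.
# The data set and column names are hypothetical.

set.seed(2015)  # fix the random seed so simulated values are identical on every run

# Step 1: load the raw data (simulated here so the script is self-contained)
sales <- data.frame(
  member_id = 1:100,
  spend     = round(runif(100, min = 5, max = 500), 2)
)

# Step 2: quality-control checks injected directly into the code;
# the script stops with an error if any assumption is violated
stopifnot(
  !anyNA(sales$spend),               # no missing values
  all(sales$spend >= 0),             # no negative spend
  !any(duplicated(sales$member_id))  # one row per member
)

# Step 3: the analysis itself
summary_stats <- data.frame(
  n_members  = nrow(sales),
  mean_spend = mean(sales$spend),
  max_spend  = max(sales$spend)
)

print(summary_stats)
```

Because the seed, the checks, and the analysis all live in one script, a newcomer can rerun it and obtain the same numbers, or see exactly which check failed.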
Tool #2: Automation
Definition: Using a computer to mechanise a set of tasks on the basis of rules (i.e. programmable code)
What problems does this tool cure?
• Unproductiveness
• Human-induced errors
Tool #2: Automation
How to use this tool?
With Excel?
• Record a macro for basic tasks
• Learn VBA
With R?
• Windows Task Scheduler
• scheduleR package
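As a rough sketch of the Windows Task Scheduler route: the analysis is written as a script that needs no manual input, and the scheduler simply calls `Rscript` on it. The script name, task name, file path, and data below are all hypothetical, and the `schtasks` line in the comment assumes `Rscript` is on the system PATH.

```r
# daily_summary.R: a hypothetical script meant to run unattended.
#
# On Windows, a scheduled task could call it once a day, for example:
#   schtasks /Create /SC DAILY /ST 07:00 /TN "DailySummary" /TR "Rscript C:\path\to\daily_summary.R"
# (task name and path are placeholders)

run_daily_summary <- function(out_dir = tempdir()) {
  # Stand-in for a real data pull
  daily <- data.frame(
    region = c("North", "South", "East"),
    sales  = c(120, 95, 143)
  )
  daily$share <- round(daily$sales / sum(daily$sales), 3)

  # Date-stamped file name, so each run leaves its own output behind
  out_file <- file.path(
    out_dir,
    paste0("summary_", format(Sys.Date(), "%Y%m%d"), ".csv")
  )
  write.csv(daily, out_file, row.names = FALSE)
  out_file
}

report_path <- run_daily_summary()
message("Report written to: ", report_path)
```

Once the task exists, the same rules are applied on every run with no manual steps, which is where the reduction in human-induced errors comes from.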
Tool #3: Visualisation
Definition: A visual generated in response to one or more questions. Opens up the analysis of data pictorially.
What problems does this tool cure?
• Unintuitive Communication of data analyses
• Inaccessible insights from data
What if the user does not have a static set of questions?
• Interactivity
Tool #3: Visualisation
How to use this tool?
With Excel?
• Inflexible
• Difficult to Automate
• Inefficient
• Less fluid Dashboards
With R?
• Charting package (ggplot2) based on a Grammar of Graphics
• Greater charting capabilities
• Easily automated
• Efficient with large data sets
• Interactivity
• Interactive and efficient Dashboards
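To give a flavour of the Grammar of Graphics idea, here is a small ggplot2 sketch in which a chart is built by adding layers: data, aesthetic mappings, a geometry, labels, and a theme. It assumes the ggplot2 package is installed and uses the `mtcars` data set that ships with R; the choice of variables and the output file name are illustrative only.

```r
library(ggplot2)

# Data + aesthetic mappings + geometry + labels + theme,
# each added as a separate layer of the "grammar"
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(
    x      = "Weight (1000 lbs)",
    y      = "Miles per gallon",
    colour = "Cylinders",
    title  = "Fuel efficiency by weight"
  ) +
  theme_minimal()

# Saving from code (rather than by hand) is what makes charts easy to automate
out_file <- file.path(tempdir(), "mpg_vs_weight.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
```

Swapping `geom_point()` for another geometry, or adding a facet, changes the chart without rewriting the rest of the specification.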
What Problems R does not solve
R:
• Can be difficult for some to use…?
• Has limitations based on size of data
But it is rapidly improving (packages support parallelisation and Hadoop)
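As one small illustration of the parallelisation point only, the `parallel` package that ships with R can spread independent tasks over several worker processes. The function `slow_square` and the choice of two workers are assumptions made for this sketch.

```r
library(parallel)  # ships with base R

# A deliberately slow, hypothetical task
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

cl <- makeCluster(2)                        # two worker processes; adjust to the machine
results <- parLapply(cl, 1:8, slow_square)  # tasks are split across the workers
stopCluster(cl)                             # always release the workers

unlist(results)
```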
A Data Analyst’s Toolbox
In today’s world, the tools used by analysts, computational statisticians, and computer scientists are continuously evolving…
A toolbox, ideally, will contain tools to cure multiple problems across multiple dimensions
Over/under-utilisation of tools, and incorrect use of tools, will keep us bound by problems
Thank You