The Data-Information-Knowledge-Wisdom Hierarchy
- Russell Ackoff
• Data: individual facts (quantities, characters, or symbols)
• Information: answers "What?", "How much?", "How many?"
• Knowledge: answers "How?"
• Wisdom: answers "Why?"
1 exabyte = 1 billion GB = 10^18 bytes
How do we make decisions?
• From experience.
• From data (experiments), via statistics (probability, uncertainty); at large scale, big data and data science.
Questions that you can answer with data science
• How much? - or - How many?
– Regression algorithms
• What is it? Is this A or B?
– Classification algorithms
• Is this weird?
– Anomaly detection algorithms
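As an illustration of the third question, here is a minimal z-score anomaly detector on hypothetical sensor readings. The dataset, function name, and cutoff are all assumptions for illustration; 3.0 is a common default threshold, but a tiny sample needs a lower cutoff.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [x for x in values if abs(x - mu) / sigma > threshold]

readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 25.0]  # one odd reading
# a lower cutoff suits this tiny sample: the outlier inflates sigma itself
print(zscore_outliers(readings, threshold=1.5))
```

Note the design caveat visible here: a large outlier inflates the standard deviation it is judged against, which is why robust variants (median absolute deviation) are often preferred.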
Correlation vs. causation
An observed correlation between A and B can arise in several ways:
(1) A causes B; (2) B causes A; (3) a third variable C causes both A and B;
(4) A and B influence each other; (5) coincidence.
Causation is not observed but inferred.
• Social drinking vs. earnings
• Energy consumption vs. economic growth
• Debt rate vs. performance of company
• Shoe size vs. reading ability
• Ice cream consumption vs. rate of drowning
• Obesity vs. diabetes (risk factor)
• Children who get tutored get worse grades than
children who do not get tutored
Population vs. sample
• A statistic computed on a sample (size n) estimates a parameter of the population (size N).
• Standard deviation (s): spread of the individual observations.
• Standard error of the mean: uncertainty of the sample mean, SE = s / √n.
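A small sketch with hypothetical measurements, showing the difference between the two quantities:

```python
from math import sqrt
from statistics import mean, stdev

sample = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2, 4.3, 3.7]  # hypothetical measurements
n = len(sample)
s = stdev(sample)     # sample standard deviation: spread of the data
se = s / sqrt(n)      # standard error: uncertainty of the sample mean
print(round(mean(sample), 3), round(s, 3), round(se, 3))
```

The standard deviation stays roughly constant as n grows, while the standard error shrinks like 1/√n: more data pins down the mean, not the spread.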
Control errors
Null hypothesis (H0): A has no effect on B.
• True negative: no effect, and our conclusion is "not significant".
• False positive ("Type I error"): no effect, but we reject H0. Controlled via the confidence level (p-value).
• True positive: a real effect, and we reject H0.
• False negative ("Type II error"): a real effect, but our conclusion is "not significant". Controlled via statistical power (sample size).
Avoid confounding variables
Confounding/nuisance variables are undesired sources of variation (C, D, E, F) that, alongside the independent variable (A), affect the dependent variable (B).
• If you can, fix the confounding variable (make it a constant).
• If you can't fix the confounding variable, use blocking.
• If you can neither fix nor block the confounding variable, use randomization.
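As an illustration of the third strategy, random assignment of subjects to treatment and control groups spreads unmeasured confounders evenly across groups on average. The subject names and group sizes are hypothetical.

```python
import random

subjects = [f"subject_{i}" for i in range(8)]  # hypothetical subjects
rng = random.Random(42)                        # seeded only for reproducibility
rng.shuffle(subjects)                          # randomize the order
treatment, control = subjects[:4], subjects[4:]
print(treatment, control)
```

Blocking would instead group subjects by the known confounder first (e.g., by age band) and randomize within each block.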
Common probability distributions
Regression analysis
R²: coefficient of determination, 0 to 1
R: correlation coefficient, -1 to +1
• Linear regression
• Logistic regression
• Nonlinear regression
• Stepwise regression (forward or backward)
• Ridge, LASSO & ElasticNet regression: handle multicollinear variables
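A minimal ordinary-least-squares sketch on hypothetical data, computing the slope, intercept, and the R² defined above:

```python
from statistics import mean

def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x, plus R^2."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # residual SS
    ss_tot = sum((y - my) ** 2 for y in ys)                       # total SS
    r2 = 1 - ss_res / ss_tot  # coefficient of determination, 0 to 1
    return a, b, r2

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical, roughly y = 2x
a, b, r2 = linear_fit(xs, ys)
print(round(a, 3), round(b, 3), round(r2, 4))
```

R² near 1 here means the straight line explains almost all of the variance in y; it says nothing about causation.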
Machine learning
• Learning: improving performance from experience.
• Machine learning: teaching computers to make and improve predictions (classification, regression) based on data; an approach to achieving artificial intelligence.
• Data mining: using algorithms to create knowledge from data.
Bayesian statistics for machine learning
Bayes' rule provides the tools to update the probability of a hypothesis as more evidence or information becomes available.
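As a sketch, Bayes' rule P(H|E) = P(E|H)·P(H) / P(E) applied with hypothetical numbers for a diagnostic-style update; all probabilities below are assumptions for illustration.

```python
p_h = 0.01          # prior: P(hypothesis), e.g. base rate of a condition
p_e_h = 0.95        # likelihood: P(evidence | hypothesis)
p_e_not_h = 0.05    # false-positive rate: P(evidence | not hypothesis)

# total probability of seeing the evidence at all
p_e = p_e_h * p_h + p_e_not_h * (1 - p_h)

# Bayes' rule: posterior P(hypothesis | evidence)
posterior = p_e_h * p_h / p_e
print(round(posterior, 3))
```

Note how a rare hypothesis stays fairly unlikely even after a positive test; the posterior then serves as the prior for the next piece of evidence, which is the "update as information becomes available" idea above.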
Common data science algorithms
• Linear regression
• Decision tree
• Random forest
• Association rule mining
• K-Means clustering
Supervised = predictive; unsupervised = exploratory.
Decision tree
• The attribute with the largest std reduction is chosen for the decision node.
• Stop when the std for a branch becomes smaller than a certain fraction (e.g., 5%) of the std for the full dataset, or when too few instances remain in the branch.
• Example (from saedsayad.com): full dataset std = 9.32; one candidate attribute splits it into branches of 5/14 (std = 10.87), 4/14 (std = 3.49), and 5/14 (std = 7.78).
http://www.saedsayad.com/decision_tree_reg.htm
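The slide's numbers can be checked in a few lines: the size-weighted average of the branch stds is subtracted from the full-dataset std to give the reduction for that split.

```python
def std_reduction(full_std, branches):
    """branches: list of (n_in_branch, branch_std); returns the std reduction."""
    n = sum(nb for nb, _ in branches)
    weighted = sum(nb / n * s for nb, s in branches)  # weighted branch std
    return full_std - weighted

# numbers from the slide's example (saedsayad.com)
reduction = std_reduction(9.32, [(5, 10.87), (4, 3.49), (5, 7.78)])
print(round(reduction, 2))  # -> 1.66
```

The tree-building loop computes this for every candidate attribute and splits on the one with the largest reduction.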
Decision tree
• You can define a split-point for either a categorical or a continuous variable.
• Split the dataset based on the homogeneity of the data.
Classification & Regression Trees (CART) (Ankit Sharma, 2014)
Random forest
• Averages multiple deep decision trees trained on different parts of the same training set, overcoming the overfitting problem of an individual decision tree.
• A widely used machine learning algorithm for classification.
– Approximately two-thirds of the total training data are selected at random to grow each tree.
– Predictor variables are selected at random, and the best split is used to split the node.
– For each tree, the leftover one-third of the data is used to calculate the out-of-bag error rate.
– Each tree gives a classification; the forest chooses the classification with the most votes over all the trees in the forest.
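The bagging-and-voting core of the steps above can be sketched on a hypothetical 1-D toy dataset, using depth-1 "stumps" as the trees. This is a simplified illustration, not a full random forest: with a single feature there is no random subset of predictor variables to draw at each split.

```python
import random
from collections import Counter

def train_stump(data):
    """Pick the threshold on x that best separates the two classes."""
    best_thr, best_err = None, float("inf")
    for thr in sorted({x for x, _ in data}):
        # predict class 1 iff x >= thr; count the mistakes
        err = sum((x >= thr) != bool(y) for x, y in data)
        if err < best_err:
            best_thr, best_err = thr, err
    return best_thr

def bagged_stumps(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        # bootstrap: sample n points with replacement (~2/3 unique on average)
        boot = [rng.choice(data) for _ in data]
        trees.append(train_stump(boot))
    return trees

def predict(trees, x):
    votes = Counter(int(x >= thr) for thr in trees)  # each tree casts a vote
    return votes.most_common(1)[0][0]                # majority wins

data = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (8, 1)]  # (x, label)
trees = bagged_stumps(data)
print(predict(trees, 2.0), predict(trees, 7.5))
```

The bootstrap step is also where the slide's "approximately two-thirds" comes from: sampling n points with replacement leaves each point out with probability (1 − 1/n)^n ≈ 1/e, so about 63% of the points appear in each tree's training set and the rest form its out-of-bag set.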
Variable importance plot
Random forests can be used to rank the importance of variables in a regression or classification problem (e.g., classifying the income of adults).
• Mean decrease accuracy: how much the model accuracy decreases if we drop that variable.
• Mean decrease Gini: a measure of variable importance based on the Gini impurity index used for the calculation of splits in trees.
Association rule mining (the Apriori algorithm)
An association rule is a pattern stating that when X occurs, Y occurs with a certain probability (an if/then statement). Initially used for Market Basket Analysis, to find how items purchased by customers are related.

support(X → Y) = count(X ∪ Y) / n
confidence(X → Y) = count(X ∪ Y) / count(X)

Goal: find all rules that satisfy the user-specified minimum support and minimum confidence.

Example, with minimum support = 50%. Transactions:
TID 100: {1, 3, 4}
TID 200: {2, 3, 5}
TID 300: {1, 2, 3, 5}
TID 400: {2, 5}

Candidate 1-itemsets C1 ({1}:2, {2}:3, {3}:3, {4}:1, {5}:3) → frequent L1 ({1}:2, {2}:3, {3}:3, {5}:3), dropping {4}.
Candidate 2-itemsets C2 ({1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2) → frequent L2 ({1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2).
Candidate 3-itemset C3 ({2,3,5}) → frequent L3 ({2,3,5}:2).

Rules from {2, 3, 5}:
{2, 3} → {5}: confidence = 100%
{3, 5} → {2}: confidence = 100%
{2, 5} → {3}: confidence = 67%
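The support and confidence numbers in the example can be reproduced directly from the four transactions:

```python
# transactions from the example above
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
n = len(transactions)

def count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    return count(itemset) / n

def confidence(x, y):
    """Confidence of the rule x -> y."""
    return count(x | y) / count(x)

print(support({2, 3, 5}))        # 2/4 = 0.5, meets the 50% minimum support
print(confidence({2, 3}, {5}))   # 2/2 = 1.0 (100%)
print(confidence({2, 5}, {3}))   # 2/3, about 67%
```

Apriori avoids scoring every possible itemset this way by pruning: any superset of an infrequent itemset (like {4} here) is skipped without counting.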
K-Means clustering
The algorithm works iteratively to assign each data point to one of K groups based on feature similarity (e.g., a defined distance measure). Outputs:
• the centroids of the K clusters;
• labels for the training data.
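A minimal sketch of the iterative assign-then-update loop (Lloyd's algorithm) on a hypothetical 2-D toy dataset; the fixed iteration count and random initialization are simplifying assumptions.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(cluster):
    """Component-wise mean of a list of points."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # start from k random points
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    labels = [min(range(k), key=lambda i: dist2(p, centroids[i])) for p in points]
    return centroids, labels

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]  # two obvious blobs
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

The two outputs listed on the slide are exactly what the function returns: the K centroids and a cluster label for each training point.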
Open-source language for data science

Demand for deep analytical talent in the U.S. is projected to be 50-60% greater than supply by 2018.

Become a data scientist?
Job trends from indeed.com