
Fairness in Machine Learning
Summer School 2020
Flavio Figueiredo


About Me

Flavio Figueiredo

Whenever someone asks me: What do you research?

● Distributed Systems
○ Dependability, File Sharing, Social Exchanges

● Information Retrieval
○ Folksonomies

● Social Networks
○ Popularity and evolution

● Human Computer Interaction
○ User studies

● Machine Learning
○ Learning models from social data

Human Factors in Computer Science

Distributed Systems: Why and how do people share information?

Information Retrieval: How does unstructured human knowledge grow?

Social Dynamics: How do user actions impact popularity? How do users perceive popularity?

Machine Learning: How can we capture complex human behavior?

Interest in Human-Centric Machine Learning


Fairness is one of the main issues!


Simplified Background

● Mathematically speaking, what is the goal of a supervised learning system?

● The goal is to learn some parameters

● These parameters should maximize some prediction function over the labels y

This is just one view. Optimization vs. Bayesian and other topics are out of scope.
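The formula on the original slide was an image; as a minimal sketch of this optimization view, assuming a likelihood-style objective (my notation):

```latex
% One common way to write the supervised learning goal: find parameters
% that maximize a prediction (likelihood) function over the labels y.
\hat{\theta} \;=\; \arg\max_{\theta} \; \sum_{i=1}^{n} \log p\left(y_i \mid \mathbf{x}_i ;\, \theta\right)
```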

Simplified Background

● The goal of a supervised learning algorithm is to discriminate


Notation

We observe a dataset sampled from some joint distribution

Our ML model creates a hypothesis focused on good predictions
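The notation itself was an image in the original deck; a minimal reconstruction under standard assumptions:

```latex
% Dataset sampled from some (unknown) joint distribution:
\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}, \qquad (\mathbf{x}_i, y_i) \sim P(X, Y)
% Hypothesis chosen for good predictions, e.g., minimizing expected loss:
h^{*} = \arg\min_{h \in \mathcal{H}} \; \mathbb{E}_{(X, Y)}\left[\ell\big(h(X), Y\big)\right]
```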


Pipeline

Data

(Pipeline figure: data is split over time into Training, Validation (Dev), and Test; a sequence of Test sets spans development and production.)

Hopefully. Let’s assume so.

Data

(Same pipeline figure, with the splits over time across development and production, now annotated: all splits are assumed to come from the Same Distribution.)

Simplified Background

● The goal of a supervised learning algorithm is to discriminate
● Why are we now so worried that it does? It seems we can trust them.

Machine Bias

● ProPublica analysis of COMPAS (which stands for Correctional Offender Management Profiling for Alternative Sanctions)
https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Criminal Prediction

https://callingbullshit.org/case_studies/case_study_criminal_machine_learning.html


Discrimination

● The overall goal of an ML model is discrimination
● The best hypothesis is a good discriminator
○ Both at training time and at testing time
● In society, discrimination has a different connotation

Regulation


Law (USA)

Slide from: https://mrtz.org/nips17/#/9


Two Guiding Principles

Disparate Treatment: Individuals from a sensitive group must be treated equally.

Disparate Impact: The impact of decisions must affect groups equally.

Extreme case: A credit scoring system that denies all loans?

Real Case: https://en.wikipedia.org/wiki/Ricci_v._DeStefano

COMPAS


https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Machine Bias

There’s software used across the country to predict future criminals. And it’s biased against blacks. By Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner, ProPublica, May 23, 2016

COMPAS

Correctional Offender Management Profiling for Alternative Sanctions

● First developed in 1998
● Still used today
● Predicts chance of recidivism
● Developed by Northpointe (now Equivant)
● Most information comes from a manual:

"A Practitioner's Guide to COMPAS Core"

COMPAS

Used in (or was used in, may be outdated)

● Florida
● New York
● Wisconsin
● California
● and others...

Overall

It’s not the algorithm, it’s the data.

https://cacm.acm.org/magazines/2017/2/212422-its-not-the-algorithm-its-the-data/fulltext

“There is no debate that both of these types of technologies are being used on a fairly widespread basis in the U.S. According to a 2013 article published by Sonja B. Starr, a professor of law at the University of Michigan Law School, nearly every state has adopted some type of risk-based assessment tools to aid in sentencing.”


How is the model trained?

Who knows?!


How is the model trained?

We know the input


How is the model trained?

COMPAS Questionnaire
https://www.documentcloud.org/documents/2702103-Sample-Risk-Assessment-COMPAS-CORE.html

Is COMPAS Using Well Known ML?

● Linear Regression
● Logistic Regression
● Support Vector Machines
● Random Forests
● etc.

Who knows? But they cite such methods.

Several Citations to ML Work

Note that it is a regression model. It seems that they regress onto their own risk score.

Decile Scores

My interpretation:

● It seems that COMPAS scales inputs and outputs in deciles
● Inputs come from the questionnaire and other variables
● The output is usually some kind of risk
● I would guess that this is fed to some black-box ML system


Single Equation

Only one equation is present in the manual:

Violent Recidivism Risk Score = (age ∗ −w) + (age-at-first-arrest ∗ −w) + (history of violence ∗ w) + (vocation education ∗ w) + (history of noncompliance ∗ w),

where w is weight, the size of which is "determined by the strength of the item’s relationship to person offense recidivism that we observed in our study data."
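To make the weighted-sum structure concrete, here is a toy sketch in Python. The weight values and inputs are entirely hypothetical (Northpointe does not publish them); only the (item ∗ weight) form comes from the manual.

```python
# Hypothetical illustration of the weighted-sum form quoted above.
# The weights below are made up; the real values are not published.
# Negative weights mirror the (age * -w) terms in the manual's equation.
weights = {
    "age": -0.5,
    "age_at_first_arrest": -0.3,
    "history_of_violence": 0.8,
    "vocation_education": 0.4,
    "history_of_noncompliance": 0.6,
}

def violent_risk_score(items):
    """Weighted sum of the (already scaled) input items."""
    return sum(weights[name] * value for name, value in items.items())

print(violent_risk_score({
    "age": 3, "age_at_first_arrest": 2, "history_of_violence": 5,
    "vocation_education": 4, "history_of_noncompliance": 1,
}))
```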


ProPublica’s Study

● Most analysis of such software is done by the developers
● ProPublica decided to do its own analysis
● COMPAS was chosen due to its popularity

Anecdotal Evidence

Several examples in the article


Anecdotal Evidence

● Great for news articles
● But the problem needs to be better understood
● Use machine learning to understand machine learning

Data


Response


Some Initial Data Science

● Understanding Decile Scores (output of the model)


Some Initial Data Science

● Risk of Recidivism Score


Some Initial Data Science

● Violent Risk Score
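The histograms on these slides were images; here is a sketch of how one could reproduce them from ProPublica's published data. The URL and column names (decile_score, v_decile_score, race) are assumptions based on their compas-analysis repository.

```python
# Sketch: distribution of COMPAS decile scores by race, in the spirit of
# ProPublica's analysis. Column names are assumed from their repository.
import pandas as pd
import matplotlib.pyplot as plt

url = ("https://raw.githubusercontent.com/propublica/"
       "compas-analysis/master/compas-scores-two-years.csv")
df = pd.read_csv(url)

for race in ("African-American", "Caucasian"):
    scores = df.loc[df["race"] == race, "decile_score"]
    plt.hist(scores, bins=range(1, 12), alpha=0.5, label=race)

plt.xlabel("Decile score")   # swap in v_decile_score for the violent scale
plt.ylabel("Count")
plt.legend()
plt.show()
```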


Logistic Regression

● The model here is trying to predict the output of COMPAS
● This works somewhat like reverse engineering
● We are not predicting real recidivism!
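A sketch of this reverse-engineering step, assuming scikit-learn and ProPublica's column names (decile_score, age, sex, race, priors_count); this is my approximation, not ProPublica's exact model specification.

```python
# Predict whether COMPAS assigns a high decile score from defendant
# features: a reverse-engineering regression, not real recidivism.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("compas-scores-two-years.csv")
y = (df["decile_score"] >= 5).astype(int)           # proxy for "high risk"
X = pd.get_dummies(df[["age", "sex", "race", "priors_count"]],
                   drop_first=True)

model = LogisticRegression(max_iter=1000).fit(X, y)
for name, coef in zip(X.columns, model.coef_[0]):
    print(f"{name:>30s}: {coef:+.3f}")  # sign = direction, size = impact
```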


Results of the Model

● Significant values indicate that the features may be able to explain the COMPAS score
● Positive values point towards a recidivism indication by COMPAS
● Negative values are the opposite
● Higher values mean more impact

Violent Recidivism

● Significant values indicate that the features may be able to explain the COMPAS score
● Positive values point towards a recidivism indication by COMPAS
● Negative values are the opposite
● Higher values mean more impact

Fairness


21 fairness definitions and their politics

Arvind Narayanan - FAT* Conference 2018 Tutorial

● Computer scientists are on a wild goose chase for a single definition
● There is value in the various definitions
● Each can lead to trustworthiness

What is Fairness?
Sahil Verma and Julia Rubin (2018) -- Fairness Definitions Explained

● A lot of these metrics worry about some form of equality

● Let S be some subset of sensitive attributes:
S = { col(j, X) | column j is sensitive }
N = { col(i, X) | column i is not sensitive }

Balanced Representation

(Figure: a binary data matrix X, split into a sensitive column S and the remaining non-sensitive columns N.)

Is this fairness?

Classifier Evaluation

If we assume that the future has the same distribution as the past.

● We can measure the error rates on our development split
● This indicates that our model works

Or we can...

● Actually wait for the future
● Then make claims

In either case, there are multiple ways to evaluate a classifier. Compare predictions with ground-truth labels.
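A minimal sketch of this comparison, using scikit-learn (an assumption; any metrics library works) and toy arrays in place of a real dev/test split:

```python
# Compare predictions with ground-truth labels.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # toy ground truth
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # toy predictions

# Rows = true class (0, 1), columns = predicted class (0, 1).
print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))
```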


Classifier Evaluation

● Different tasks will focus on different metrics
● Search engines usually optimize for precision (definition soon)
○ Retrieved cases
● When predicting some rare cancer, recall is more important
○ All cases
● What should we optimize for with recidivism?!

To understand the metrics we need a confusion matrix.


Parity in Predictions


Classifier Evaluation (COMPAS Example)


             | High Risk           | Low Risk
Recidivism   | True Positive (TP)  | False Negative (FN)
Stayed Clean | False Positive (FP) | True Negative (TN)

What is the impact of each kind of error?



Classifier Evaluation (COMPAS Example)


             | High Risk                                | Low Risk
Recidivism   | True Positive (TP)                       | False Negative (FN): criminal on the loose!
Stayed Clean | False Positive (FP): innocent arrested!  | True Negative (TN)

COMPAS in Practice

● Each individual is assigned a score

● Scores are broken down into deciles


COMPAS in Practice

● Let’s assume three cut-off points
● Purple individuals have recidivated
● Orange ones have not
● Now we need to pick a cut-off point

(Figure: seven individuals with decile scores D1 D2 D2 D1 D3 D3 D1, next to the 2x2 confusion matrix: rows = true (1/0), columns = prediction (1/0).)

COMPAS in Practice

● Let’s say that everyone >= D2 is at risk
● What is each score?
● TP = 3, FP = 1, FN = 0, TN = 3.

COMPAS in Practice

● Let’s say that everyone >= D1 is at risk
● What is each score?
● TP = 3, FP = 4, FN = 0, TN = 0.

COMPAS in Practice

● Let’s say that everyone >= D3 is at risk
● What is each score?
● TP = 1, FP = 1, FN = 2, TN = 3.
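A small script that reproduces the three cut-off cases above. The recidivism labels are an assumption consistent with the slides' counts (which individuals are "purple" is not stated explicitly):

```python
# Confusion-matrix counts for each decile cut-off in the toy example.
deciles = [1, 2, 2, 1, 3, 3, 1]   # D1 D2 D2 D1 D3 D3 D1
recid   = [0, 1, 1, 0, 1, 0, 0]   # 1 = recidivated (assumed assignment)

for cutoff in (1, 2, 3):
    pred = [int(d >= cutoff) for d in deciles]
    tp = sum(p == 1 and y == 1 for p, y in zip(pred, recid))
    fp = sum(p == 1 and y == 0 for p, y in zip(pred, recid))
    fn = sum(p == 0 and y == 1 for p, y in zip(pred, recid))
    tn = sum(p == 0 and y == 0 for p, y in zip(pred, recid))
    print(f">= D{cutoff}: TP={tp}, FP={fp}, FN={fn}, TN={tn}")
```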



COMPAS in Practice

● We usually explore such metrics in normalized terms
● TPR = True Positive Rate, or Recall
○ TP / (TP + FN)
○ Row normalization
● FPR = False Positive Rate
○ FP / (FP + TN)
○ Row normalization
● How do we maximize each?
○ Recall is maximized when we say everyone is going to recidivate
○ FPR is minimized if we say no one is going to
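Continuing the toy example, the row-normalized rates for the ">= D2" cut-off (TP = 3, FP = 1, FN = 0, TN = 3):

```python
# Row-normalized rates for the ">= D2" cut-off of the toy example.
TP, FP, FN, TN = 3, 1, 0, 3
TPR = TP / (TP + FN)   # recall: 3/3 = 1.0
FPR = FP / (FP + TN)   # 1/4 = 0.25
print(TPR, FPR)
```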


COMPAS Evaluation (From the developers)


Predictive Validity of the COMPAS Reentry Risk Scales
An Outcomes Study Conducted for the Michigan Department of Corrections: Updated Results on an Expanded Release Sample

https://epic.org/algorithmic-transparency/crim-justice/EPIC-16-06-23-WI-FOIA-201600805-MDOC_ReentryStudy082213.pdf

COMPAS in Practice

● TPR, or Recall, of around 80%
○ 80% of actual recidivism cases are correctly flagged
● FPR of around 43%
○ 43% of innocent individuals receive a high score
○ 57% of non-recidivism cases are correctly classified

COMPAS in Practice

● Suggested cut-off point at 4
● Anybody above that is predicted as a recidivism risk
● Anybody below is low risk
● AUC of 0.72
○ A good score

COMPAS in Practice

● In all fairness, we are only showing one result
● The study considers various aspects of COMPAS
● If you wish, you can present many of the COMPAS studies

Two views of the confusion matrix


Other View of the Confusion Matrix


             | High Risk           | Low Risk
Recidivism   | True Positive (TP)  | False Negative (FN)
Stayed Clean | False Positive (FP) | True Negative (TN)

Rows: data focused (recall). Columns: prediction focused (precision).

Column Normalization

● Positive Predictive Value (PPV), or Precision
○ TP / (TP + FP)

● False Discovery Rate (FDR)
○ FP / (TP + FP)
○ = 1 - PPV = 1 - Precision

● Negative Predictive Value (NPV)
○ TN / (TN + FN)

● False Omission Rate (FOR)
○ 1 - NPV

Word of advice: some books transpose the matrix. Don’t just memorize the layout!
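And the column-normalized metrics for the same toy ">= D2" cut-off:

```python
# Column-normalized metrics for the ">= D2" toy cut-off.
TP, FP, FN, TN = 3, 1, 0, 3
PPV = TP / (TP + FP)   # precision: 3/4 = 0.75
FDR = 1 - PPV          # false discovery rate: 0.25
NPV = TN / (TN + FN)   # 3/3 = 1.0
FOR = 1 - NPV          # false omission rate: 0.0
print(PPV, FDR, NPV, FOR)
```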


Propublica Dataset

Now let’s look at the ProPublica study. It uses a different dataset (Michigan vs. Broward County).

Propublica Dataset

Let’s look at results comparing races

Table from Krishna Gummadi

(Table: confusion-matrix rates broken down by race, for different cut-offs.)

Northpointe: FDR rates are comparable! COMPAS is Fair!

Propublica Dataset

● Now let’s look at the rows
● Recall that row normalization focuses on the data!
● Columns focus on predictions. In the previous results, the prediction-focused rates were balanced!

Propublica Dataset

(Confusion matrix relabeled by consequence: a False Negative is a criminal on the loose; a False Positive is an innocent arrested.)

● Error rates are not comparable
● What are the consequences?

Rates on Imbalanced Datasets

● Both datasets are imbalanced to begin with
● There are arguments from both sides
● Presentation and project ideas:
○ Re-evaluate COMPAS
■ Other classifiers
■ Other metrics
○ Present counter-argument papers

Impossibility of Fairness


Fairness is Impossible in Practice

● Two proofs, one from:
Alexandra Chouldechova

and one from

Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan

● References at the end of the slides

In Practice

● Classifiers output some probability or score
● A threshold on this score yields the decision
● This is exactly what we have done with the decile scores

Formalizing

● The classifier now outputs a score
● From which we can create classes
● This is essentially what COMPAS does
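The slide's formulas were images; a minimal reconstruction, assuming a simple threshold rule:

```latex
% The classifier outputs a score (COMPAS: a decile), ...
S = f(\mathbf{x}) \in \{1, \dots, 10\}
% ... from which classes are created via a high-risk cut-off s_{HR}:
\hat{Y} =
\begin{cases}
  1 & \text{if } S \ge s_{HR} \\
  0 & \text{otherwise}
\end{cases}
```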


Formalizing

● Now, let each value be associated with a random variable:
R is for race
Y is for true values (recidivism or not), not predictions
S is for score

● Each has a probability distribution, e.g., P(R = 1) and P(R = 0)
● Also, let σ be the true fraction of individuals that have recidivated. This is called our base rate (or prevalence): σ = P(Y = 1)

3 Definitions of Fairness

● When y = 1: recidivism
● When y = 0: no recidivism

Calibration

(Figure: score distributions for s = 1, 2, 3 and R = white / R = black, calibrated for all scores (deciles).)

(Figure: the same six panels, but not calibrated for score 2.)

(Figure: the same six panels, not calibrated for any score.)


Intuition

Calibration: Scores have the same meaning.
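Formally (the deck's equation was an image; this follows Chouldechova's standard definition):

```latex
% Calibration: within each score value, the same fraction of each
% group actually recidivates.
P(Y = 1 \mid S = s,\, R = b) \;=\; P(Y = 1 \mid S = s,\, R = w)
\qquad \text{for all scores } s
```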


Predictive Parity

(Figure: the same six panels; predictive parity holds for the high-risk cut-off s_HR = 2.)


Intuition

Calibration: Scores have the same meaning.

Predictive Parity: I expect the same precision for each group. This was Northpointe’s argument (see previous slides). Also, from the previous slides we can see that PP differs from Calibration.
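Formally (again reconstructing a lost equation; s_HR is the high-risk cut-off from the earlier figures):

```latex
% Predictive parity: equal precision (PPV) above the high-risk cut-off.
P(Y = 1 \mid S \ge s_{HR},\, R = b) \;=\; P(Y = 1 \mid S \ge s_{HR},\, R = w)
```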


Measuring Calibration and PP
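The measurement slide was an image; a sketch of how both quantities could be estimated empirically from ProPublica's data (column names and the s_HR = 5 cut-off are assumptions):

```python
# Empirical calibration and predictive parity checks.
import pandas as pd

df = pd.read_csv("compas-scores-two-years.csv")

# Calibration: P(Y = 1 | S = s, R = r) for each decile s and race r.
calib = df.groupby(["decile_score", "race"])["two_year_recid"].mean()
print(calib.unstack("race"))

# Predictive parity: P(Y = 1 | S >= s_HR, R = r), with s_HR = 5 assumed.
high_risk = df[df["decile_score"] >= 5]
print(high_risk.groupby("race")["two_year_recid"].mean())
```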


Error Rate Balance

● Focused on the row normalization of the confusion matrix
● Note that it is conditioned on the true value
● Recall-like interpretation
● This is where ProPublica showed that COMPAS is unfair
● Easy to see with a plot (next slide)

Error Rate Balance

(Plot: FPR and FNR per group, across cut-off points.)

Intuition

Calibration: Scores have the same meaning.

Predictive Parity: I expect the same precision for each group. This was Northpointe’s argument (see previous slides). Also, from the previous slides we can see that PP differs from Calibration.

Error Rate Balance: We cannot be biased toward fewer/more errors for a certain group.
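Formally, as a reconstruction in the same notation:

```latex
% Error rate balance: equal false positive and false negative rates.
P(S \ge s_{HR} \mid Y = 0,\, R = b) = P(S \ge s_{HR} \mid Y = 0,\, R = w) \quad \text{(FPR)}
P(S <   s_{HR} \mid Y = 1,\, R = b) = P(S <   s_{HR} \mid Y = 1,\, R = w) \quad \text{(FNR)}
```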


Error Rate Balance

Previous graph:

● For different cut-off points
● Imbalanced error rates
● [Main Result] COMPAS is never fair from this point of view. Why?
○ This is actually valid for any classifier

Proof that FPR and FNR cannot be equal

Knowing that:

● σ = P(Y = 1) = (TP + FN) / N
● 1 - σ = P(Y = 0) = (FP + TN) / N
● FPR = FP / (FP + TN)
● PPV = TP / (TP + FP)
● 1 - FNR = Recall = TP / (TP + FN)

We can write the identity below:

$$\frac{FP}{FP + TN} \;=\; \frac{TP + FN}{FP + TN} \cdot \frac{FP}{TP} \cdot \frac{TP}{TP + FN}$$

Rewriting each factor using the definitions above:

$$FPR \;=\; \frac{\sigma}{1 - \sigma} \cdot \frac{1 - PPV}{PPV} \cdot (1 - FNR)$$
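A quick numeric sanity check of this identity on the earlier toy counts (TP = 3, FP = 1, FN = 0, TN = 3):

```python
# Verify: FPR == sigma/(1-sigma) * (1-PPV)/PPV * (1-FNR).
TP, FP, FN, TN = 3, 1, 0, 3
N = TP + FP + FN + TN

sigma = (TP + FN) / N
FPR = FP / (FP + TN)
PPV = TP / (TP + FP)
FNR = FN / (TP + FN)

lhs = FPR
rhs = sigma / (1 - sigma) * (1 - PPV) / PPV * (1 - FNR)
print(lhs, rhs)   # both 0.25
```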

Two Groups

Each group has its own confusion matrix. Then, for group w:

$$FPR_w = \frac{\sigma_w}{1 - \sigma_w} \cdot \frac{1 - PPV_w}{PPV_w} \cdot (1 - FNR_w)$$

And for group b:

$$FPR_b = \frac{\sigma_b}{1 - \sigma_b} \cdot \frac{1 - PPV_b}{PPV_b} \cdot (1 - FNR_b)$$

Calibration and PP are equal across groups

If PP and Calibration are Equal

Then PPV_w = PPV_b. Divide one identity by the other; the PPV factors cancel:

$$\frac{FPR_w}{FPR_b} = \frac{\sigma_w}{1 - \sigma_w} \cdot \frac{1 - \sigma_b}{\sigma_b} \cdot \frac{1 - FNR_w}{1 - FNR_b}$$

The two error rates can only be equal simultaneously when σ_w = σ_b.

This is not the case! From the data, the recidivism base rate is not equal across groups. Try setting FPR_w = FPR_b with σ_w ≠ σ_b: it is impossible to also reach equal FNR. Symmetrically, try setting FNR_w = FNR_b with σ_w ≠ σ_b: it is impossible to reach equal FPR.

Proof sketch: σ / (1 - σ) is bijective (strictly increasing). Write FPR_w = a · b and FPR_b = c · d, where a and c are the base-rate factors and b and d collect the error-rate terms. When we set both FPRs equal, b = d only when a = c, i.e., only when σ_w = σ_b.

Results

● It is impossible to achieve predictive parity (with calibration) and balance for both error rates when base rates differ
● Our classifiers will always favor one group
● Where do we go from here:
○ We can tolerate some error threshold
○ Decide toward which group the errors should lean
○ More definitions of fairness [next class]

What can we do?


Hard Problem

Society (and, as a consequence, our datasets) is unfair.
Accountability is difficult (who do we blame?).
Datasets and models are hard to understand.

Lots of Research

https://fairmlbook.org/


Thank You!


References

● Predictive Validity of the COMPAS Reentry Risk Scales
https://epic.org/algorithmic-transparency/crim-justice/EPIC-16-06-23-WI-FOIA-201600805-MDOC_ReentryStudy082213.pdf

● Inherent Trade-Offs in the Fair Determination of Risk Scores (Kleinberg, Mullainathan, and Raghavan)
https://arxiv.org/pdf/1609.05807.pdf

● Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments (Chouldechova)
https://www.andrew.cmu.edu/user/achoulde/files/disparate_impact.pdf

● Inherent Trade-offs in Algorithmic Fairness
https://www.youtube.com/watch?v=p5yY2MyTJXA