
Fairness in Machine Learning
Summer School 2020
Flavio Figueiredo


About Me

Flavio Figueiredo

Whenever someone asks me: What do you research?

● Distributed Systems
○ Dependability, File Sharing, Social Exchanges

● Information Retrieval
○ Folksonomies

● Social Networks
○ Popularity and evolution

● Human Computer Interaction
○ User studies

● Machine Learning
○ Learning models from social data

Human Factors in Computer Science

Distributed Systems: Why and how do people share information?

Information Retrieval: How does unstructured human knowledge grow?

Social Dynamics: How do user actions impact popularity? How do users perceive popularity?

Machine Learning: How can we capture complex human behavior?

Interest in Human-Centric Machine Learning


Fairness is one of the main issues!


Simplified Background

● Mathematically speaking, what is the goal of a supervised learning system?

● The goal is to learn some parameters

● These parameters should maximize some prediction function over the labels y

This is just one view. Optimization vs. Bayesian and other topics are out of scope.
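The formula on the original slide was an image; as a minimal sketch of this optimization view, assuming a likelihood-style objective (my notation):

```latex
% One common way to write the supervised learning goal: find parameters
% that maximize a prediction (likelihood) function over the labels y.
\hat{\theta} \;=\; \arg\max_{\theta} \; \sum_{i=1}^{n} \log p\left(y_i \mid \mathbf{x}_i ;\, \theta\right)
```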

Simplified Background

● The goal of a supervised learning algorithm is to discriminate


Notation

We observe a dataset sampled from some joint distribution

Our ML model creates a hypothesis focused on good predictions
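The notation itself was an image in the original deck; a minimal reconstruction under standard assumptions:

```latex
% Dataset sampled from some (unknown) joint distribution:
\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}, \qquad (\mathbf{x}_i, y_i) \sim P(X, Y)
% Hypothesis chosen for good predictions, e.g., minimizing expected loss:
h^{*} = \arg\min_{h \in \mathcal{H}} \; \mathbb{E}_{(X, Y)}\left[\ell\big(h(X), Y\big)\right]
```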


Pipeline

Data

(Pipeline figure: data is split over time into Training, Validation (Dev), and Test; a sequence of Test sets spans development and production.)

Hopefully. Let’s assume so.

Data

(Same pipeline figure, with the splits over time across development and production, now annotated: all splits are assumed to come from the Same Distribution.)

Simplified Background

● The goal of a supervised learning algorithm is to discriminate
● Why are we now so worried that it does? It seems we can trust them.

Machine Bias

● ProPublica analysis of COMPAS (which stands for Correctional Offender Management Profiling for Alternative Sanctions)
https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Criminal Prediction

https://callingbullshit.org/case_studies/case_study_criminal_machine_learning.html


Discrimination

● The overall goal of an ML model is discrimination
● The best hypothesis is a good discriminator
○ Both at training time and at testing time
● In society, discrimination has a different connotation

Regulation


Law (USA)

Slide from: https://mrtz.org/nips17/#/9


Two Guiding Principles

Disparate Treatment: Individuals from a sensitive group must be treated equally.

Disparate Impact: The impact of decisions must affect groups equally.

Extreme case: A credit scoring system that denies all loans?

Real Case: https://en.wikipedia.org/wiki/Ricci_v._DeStefano

COMPAS


https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Machine Bias

There’s software used across the country to predict future criminals. And it’s biased against blacks. By Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner, ProPublica, May 23, 2016

COMPAS

Correctional Offender Management Profiling for Alternative Sanctions

● First developed in 1998
● Still used today
● Predicts chance of recidivism
● Developed by Northpointe (now Equivant)
● Most information comes from a manual:

"A Practitioner's Guide to COMPAS Core"

COMPAS

Used in (or was used in, may be outdated)

● Florida
● New York
● Wisconsin
● California
● and others...

Overall

It’s not the algorithm, it’s the data.

https://cacm.acm.org/magazines/2017/2/212422-its-not-the-algorithm-its-the-data/fulltext

“There is no debate that both of these types of technologies are being used on a fairly widespread basis in the U.S. According to a 2013 article published by Sonja B. Starr, a professor of law at the University of Michigan Law School, nearly every state has adopted some type of risk-based assessment tools to aid in sentencing.”


How is the model trained?

Who knows?!


How is the model trained?

We know the input


How is the model trained?

COMPAS Questionnaire
https://www.documentcloud.org/documents/2702103-Sample-Risk-Assessment-COMPAS-CORE.html

Is COMPAS Using Well Known ML?

● Linear Regression
● Logistic Regression
● Support Vector Machines
● Random Forests
● etc.

Who knows? But they cite such methods.

Several Citations to ML Work

Note that it is a regression model. It seems that they regress onto their own risk score.

Decile Scores

My interpretation:

● It seems that COMPAS scales inputs and outputs in deciles
● Inputs come from the questionnaire and other variables
● The output is usually some kind of risk
● I would guess that this is fed to some black-box ML system


Single Equation

Only one equation is present in the manual:

Violent Recidivism Risk Score = (age ∗ −w) + (age-at-first-arrest ∗ −w) + (history of violence ∗ w) + (vocation education ∗ w) + (history of noncompliance ∗ w),

where w is weight, the size of which is "determined by the strength of the item’s relationship to person offense recidivism that we observed in our study data."
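To make the weighted-sum structure concrete, here is a toy sketch in Python. The weight values and inputs are entirely hypothetical (Northpointe does not publish them); only the (item ∗ weight) form comes from the manual.

```python
# Hypothetical illustration of the weighted-sum form quoted above.
# The weights below are made up; the real values are not published.
# Negative weights mirror the (age * -w) terms in the manual's equation.
weights = {
    "age": -0.5,
    "age_at_first_arrest": -0.3,
    "history_of_violence": 0.8,
    "vocation_education": 0.4,
    "history_of_noncompliance": 0.6,
}

def violent_risk_score(items):
    """Weighted sum of the (already scaled) input items."""
    return sum(weights[name] * value for name, value in items.items())

print(violent_risk_score({
    "age": 3, "age_at_first_arrest": 2, "history_of_violence": 5,
    "vocation_education": 4, "history_of_noncompliance": 1,
}))
```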


ProPublica’s Study

● Most analysis of such software is done by the developers
● ProPublica decided to do its own analysis
● COMPAS was chosen due to its popularity

Anecdotal Evidence

Several examples in the article


Anecdotal Evidence

● Great for news articles
● But the problem needs to be better understood
● Use machine learning to understand machine learning

Data


Response


Some Initial Data Science

● Understanding Decile Scores (output of the model)


Some Initial Data Science

● Risk of Recidivism Score


Some Initial Data Science

● Violent Risk Score
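The histograms on these slides were images; here is a sketch of how one could reproduce them from ProPublica's published data. The URL and column names (decile_score, v_decile_score, race) are assumptions based on their compas-analysis repository.

```python
# Sketch: distribution of COMPAS decile scores by race, in the spirit of
# ProPublica's analysis. Column names are assumed from their repository.
import pandas as pd
import matplotlib.pyplot as plt

url = ("https://raw.githubusercontent.com/propublica/"
       "compas-analysis/master/compas-scores-two-years.csv")
df = pd.read_csv(url)

for race in ("African-American", "Caucasian"):
    scores = df.loc[df["race"] == race, "decile_score"]
    plt.hist(scores, bins=range(1, 12), alpha=0.5, label=race)

plt.xlabel("Decile score")   # swap in v_decile_score for the violent scale
plt.ylabel("Count")
plt.legend()
plt.show()
```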


Logistic Regression

● The model here is trying to predict the output of COMPAS
● This works somewhat like reverse engineering
● We are not predicting real recidivism!
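A sketch of this reverse-engineering step, assuming scikit-learn and ProPublica's column names (decile_score, age, sex, race, priors_count); this is my approximation, not ProPublica's exact model specification.

```python
# Predict whether COMPAS assigns a high decile score from defendant
# features: a reverse-engineering regression, not real recidivism.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("compas-scores-two-years.csv")
y = (df["decile_score"] >= 5).astype(int)           # proxy for "high risk"
X = pd.get_dummies(df[["age", "sex", "race", "priors_count"]],
                   drop_first=True)

model = LogisticRegression(max_iter=1000).fit(X, y)
for name, coef in zip(X.columns, model.coef_[0]):
    print(f"{name:>30s}: {coef:+.3f}")  # sign = direction, size = impact
```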


Results of the Model

● Significant values indicate that the features may be able to explain the COMPAS score
● Positive values point towards a recidivism indication by COMPAS
● Negative values are the opposite
● Higher values mean more impact

Violent Recidivism

● Significant values indicate that the features may be able to explain the COMPAS score
● Positive values point towards a recidivism indication by COMPAS
● Negative values are the opposite
● Higher values mean more impact

Fairness


21 fairness definitions and their politics

Arvind Narayanan - FAT* Conference 2018 Tutorial

● Computer scientists are on a wild goose chase for a single definition
● There is value in the various definitions
● Each can lead to trustworthiness

What is Fairness?
Sahil Verma and Julia Rubin (2018) -- Fairness Definitions Explained

● A lot of these metrics worry about some form of equality

● Let S be some subset of sensitive attributes:
S = { col(j, X) | column j is sensitive }
N = { col(i, X) | column i is not sensitive }

Balanced Representation

(Figure: a binary data matrix X, split into a sensitive column S and the remaining non-sensitive columns N.)

Is this fairness?

Classifier Evaluation

If we assume that the future has the same distribution as the past.

● We can measure the error rates on our development split
● This indicates that our model works

Or we can...

● Actually wait for the future
● Then make claims

In either case, there are multiple ways to evaluate a classifier. Compare predictions with ground-truth labels.
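A minimal sketch of this comparison, using scikit-learn (an assumption; any metrics library works) and toy arrays in place of a real dev/test split:

```python
# Compare predictions with ground-truth labels.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # toy ground truth
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # toy predictions

# Rows = true class (0, 1), columns = predicted class (0, 1).
print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))
```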


Classifier Evaluation

● Different tasks will focus on different metrics
● Search engines usually optimize for precision (definition soon)
○ Retrieved cases
● When predicting some rare cancer, recall is more important
○ All cases
● What should we optimize for with recidivism?!

To understand the metrics we need a confusion matrix.


Parity in Predictions


Classifier Evaluation (COMPAS Example)


             | High Risk           | Low Risk
Recidivism   | True Positive (TP)  | False Negative (FN)
Stayed Clean | False Positive (FP) | True Negative (TN)

What is the impact of each kind of error?



Classifier Evaluation (COMPAS Example)


             | High Risk                                | Low Risk
Recidivism   | True Positive (TP)                       | False Negative (FN): criminal on the loose!
Stayed Clean | False Positive (FP): innocent arrested!  | True Negative (TN)

COMPAS in Practice

● Each individual is assigned a score

● Scores are broken down into deciles


COMPAS in Practice

● Let’s assume three cut-off points
● Purple individuals have recidivated
● Orange ones have not
● Now we need to pick a cut-off point

(Figure: seven individuals with decile scores D1 D2 D2 D1 D3 D3 D1, next to the 2x2 confusion matrix: rows = true (1/0), columns = prediction (1/0).)

COMPAS in Practice

● Let’s say that everyone >= D2 is at risk
● What is each score?
● TP = 3, FP = 1, FN = 0, TN = 3.

COMPAS in Practice

● Let’s say that everyone >= D1 is at risk
● What is each score?
● TP = 3, FP = 4, FN = 0, TN = 0.

COMPAS in Practice

● Let’s say that everyone >= D3 is at risk
● What is each score?
● TP = 1, FP = 1, FN = 2, TN = 3.
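A small script that reproduces the three cut-off cases above. The recidivism labels are an assumption consistent with the slides' counts (which individuals are "purple" is not stated explicitly):

```python
# Confusion-matrix counts for each decile cut-off in the toy example.
deciles = [1, 2, 2, 1, 3, 3, 1]   # D1 D2 D2 D1 D3 D3 D1
recid   = [0, 1, 1, 0, 1, 0, 0]   # 1 = recidivated (assumed assignment)

for cutoff in (1, 2, 3):
    pred = [int(d >= cutoff) for d in deciles]
    tp = sum(p == 1 and y == 1 for p, y in zip(pred, recid))
    fp = sum(p == 1 and y == 0 for p, y in zip(pred, recid))
    fn = sum(p == 0 and y == 1 for p, y in zip(pred, recid))
    tn = sum(p == 0 and y == 0 for p, y in zip(pred, recid))
    print(f">= D{cutoff}: TP={tp}, FP={fp}, FN={fn}, TN={tn}")
```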



COMPAS in Practice

● We usually explore such metrics in normalized terms
● TPR = True Positive Rate, or Recall
○ TP / (TP + FN)
○ Row normalization
● FPR = False Positive Rate
○ FP / (FP + TN)
○ Row normalization
● How do we maximize each?
○ Recall is maximized when we say everyone is going to recidivate
○ FPR is minimized if we say no one is going to
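Continuing the toy example, the row-normalized rates for the ">= D2" cut-off (TP = 3, FP = 1, FN = 0, TN = 3):

```python
# Row-normalized rates for the ">= D2" cut-off of the toy example.
TP, FP, FN, TN = 3, 1, 0, 3
TPR = TP / (TP + FN)   # recall: 3/3 = 1.0
FPR = FP / (FP + TN)   # 1/4 = 0.25
print(TPR, FPR)
```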


COMPAS Evaluation (From the developers)


Predictive Validity of the COMPAS Reentry Risk Scales
An Outcomes Study Conducted for the Michigan Department of Corrections: Updated Results on an Expanded Release Sample

https://epic.org/algorithmic-transparency/crim-justice/EPIC-16-06-23-WI-FOIA-201600805-MDOC_ReentryStudy082213.pdf

COMPAS in Practice

● TPR, or Recall, of around 80%
○ 80% of actual recidivism cases are correctly flagged
● FPR of around 43%
○ 43% of innocent individuals receive a high score
○ 57% of non-recidivism cases are correctly classified

COMPAS in Practice

● Suggested cut-off point at 4
● Anybody above that is predicted as a recidivism risk
● Anybody below is low risk
● AUC of 0.72
○ A good score

COMPAS in Practice

● In all fairness, we are only showing one result
● The study considers various aspects of COMPAS
● If you wish, you can present many of the COMPAS studies

Two views of the confusion matrix


Other View of the Confusion Matrix


             | High Risk           | Low Risk
Recidivism   | True Positive (TP)  | False Negative (FN)
Stayed Clean | False Positive (FP) | True Negative (TN)

Rows: data focused (recall). Columns: prediction focused (precision).

Column Normalization

● Positive Predictive Value (PPV), or Precision
○ TP / (TP + FP)

● False Discovery Rate (FDR)
○ FP / (TP + FP)
○ = 1 - PPV = 1 - Precision

● Negative Predictive Value (NPV)
○ TN / (TN + FN)

● False Omission Rate (FOR)
○ 1 - NPV

Word of advice: some books transpose the matrix. Don’t just memorize the layout!
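And the column-normalized metrics for the same toy ">= D2" cut-off:

```python
# Column-normalized metrics for the ">= D2" toy cut-off.
TP, FP, FN, TN = 3, 1, 0, 3
PPV = TP / (TP + FP)   # precision: 3/4 = 0.75
FDR = 1 - PPV          # false discovery rate: 0.25
NPV = TN / (TN + FN)   # 3/3 = 1.0
FOR = 1 - NPV          # false omission rate: 0.0
print(PPV, FDR, NPV, FOR)
```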


Propublica Dataset

Now let’s look at the ProPublica study. It uses a different dataset (Michigan vs. Broward County).

Propublica Dataset

Let’s look at results comparing races

Table from Krishna Gummadi

(Table: confusion-matrix rates broken down by race, for different cut-offs.)

Northpointe: FDR rates are comparable! COMPAS is Fair!

Propublica Dataset

● Now let’s look at the rows
● Recall that row normalization focuses on the data!
● Columns focus on predictions. In the previous results, the prediction-focused rates were balanced!

Propublica Dataset

(Confusion matrix relabeled by consequence: a False Negative is a criminal on the loose; a False Positive is an innocent arrested.)

● Error rates are not comparable
● What are the consequences?

Rates on Imbalanced Datasets

● Both datasets are imbalanced to begin with
● There are arguments from both sides
● Presentation and project ideas:
○ Re-evaluate COMPAS
■ Other classifiers
■ Other metrics
○ Present counter-argument papers

Impossibility of Fairness


Fairness is Impossible in Practice

● Two proofs, one from:
Alexandra Chouldechova

and one from

Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan

● References at the end of the slides

In Practice

● Classifiers output some probability or score
● A threshold on this score yields the decision
● This is exactly what we have done with the decile scores

Formalizing

● The classifier now outputs a score
● From which we can create classes
● This is essentially what COMPAS does
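The slide's formulas were images; a minimal reconstruction, assuming a simple threshold rule:

```latex
% The classifier outputs a score (COMPAS: a decile), ...
S = f(\mathbf{x}) \in \{1, \dots, 10\}
% ... from which classes are created via a high-risk cut-off s_{HR}:
\hat{Y} =
\begin{cases}
  1 & \text{if } S \ge s_{HR} \\
  0 & \text{otherwise}
\end{cases}
```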


Formalizing

● Now, let each value be associated with a random variable:
R is for race
Y is for true values (recidivism or not), not predictions
S is for score

● Each has a probability distribution, e.g., P(R = 1) and P(R = 0)
● Also, let σ be the true fraction of individuals that have recidivated. This is called our base rate (or prevalence): σ = P(Y = 1)

3 Definitions of Fairness

● When y = 1: recidivism
● When y = 0: no recidivism

Calibration

(Figure: score distributions for s = 1, 2, 3 and R = white / R = black, calibrated for all scores (deciles).)

(Figure: the same six panels, but not calibrated for score 2.)

(Figure: the same six panels, not calibrated for any score.)


Intuition

Calibration: Scores have the same meaning.
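Formally (the deck's equation was an image; this follows Chouldechova's standard definition):

```latex
% Calibration: within each score value, the same fraction of each
% group actually recidivates.
P(Y = 1 \mid S = s,\, R = b) \;=\; P(Y = 1 \mid S = s,\, R = w)
\qquad \text{for all scores } s
```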


Predictive Parity

(Figure: the same six panels; predictive parity holds for the high-risk cut-off s_HR = 2.)


Intuition

Calibration: Scores have the same meaning.

Predictive Parity: I expect the same precision for each group. This was Northpointe’s argument (see previous slides). Also, from the previous slides we can see that PP differs from Calibration.
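Formally (again reconstructing a lost equation; s_HR is the high-risk cut-off from the earlier figures):

```latex
% Predictive parity: equal precision (PPV) above the high-risk cut-off.
P(Y = 1 \mid S \ge s_{HR},\, R = b) \;=\; P(Y = 1 \mid S \ge s_{HR},\, R = w)
```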


Measuring Calibration and PP
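The measurement slide was an image; a sketch of how both quantities could be estimated empirically from ProPublica's data (column names and the s_HR = 5 cut-off are assumptions):

```python
# Empirical calibration and predictive parity checks.
import pandas as pd

df = pd.read_csv("compas-scores-two-years.csv")

# Calibration: P(Y = 1 | S = s, R = r) for each decile s and race r.
calib = df.groupby(["decile_score", "race"])["two_year_recid"].mean()
print(calib.unstack("race"))

# Predictive parity: P(Y = 1 | S >= s_HR, R = r), with s_HR = 5 assumed.
high_risk = df[df["decile_score"] >= 5]
print(high_risk.groupby("race")["two_year_recid"].mean())
```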


Error Rate Balance

● Focused on the row normalization of the confusion matrix
● Note that it is conditioned on the true value
● Recall-like interpretation
● This is where ProPublica showed that COMPAS is unfair
● Easy to see with a plot (next slide)

Error Rate Balance

(Plot: FPR and FNR per group, across cut-off points.)

Intuition

Calibration: Scores have the same meaning.

Predictive Parity: I expect the same precision for each group. This was Northpointe’s argument (see previous slides). Also, from the previous slides we can see that PP differs from Calibration.

Error Rate Balance: We cannot be biased toward fewer/more errors for a certain group.
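Formally, as a reconstruction in the same notation:

```latex
% Error rate balance: equal false positive and false negative rates.
P(S \ge s_{HR} \mid Y = 0,\, R = b) = P(S \ge s_{HR} \mid Y = 0,\, R = w) \quad \text{(FPR)}
P(S <   s_{HR} \mid Y = 1,\, R = b) = P(S <   s_{HR} \mid Y = 1,\, R = w) \quad \text{(FNR)}
```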


Error Rate Balance

Previous graph:

● For different cut-off points
● Imbalanced error rates
● [Main Result] COMPAS is never fair from this point of view. Why?
○ This is actually valid for any classifier

Proof that FPR and FNR cannot be equal

Knowing that:

● σ = P(Y = 1) = (TP + FN) / N
● 1 - σ = P(Y = 0) = (FP + TN) / N
● FPR = FP / (FP + TN)
● PPV = TP / (TP + FP)
● 1 - FNR = Recall = TP / (TP + FN)

We can write the identity below:

$$\frac{FP}{FP + TN} \;=\; \frac{TP + FN}{FP + TN} \cdot \frac{FP}{TP} \cdot \frac{TP}{TP + FN}$$

Rewriting each factor using the definitions above:

$$FPR \;=\; \frac{\sigma}{1 - \sigma} \cdot \frac{1 - PPV}{PPV} \cdot (1 - FNR)$$
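A quick numeric sanity check of this identity on the earlier toy counts (TP = 3, FP = 1, FN = 0, TN = 3):

```python
# Verify: FPR == sigma/(1-sigma) * (1-PPV)/PPV * (1-FNR).
TP, FP, FN, TN = 3, 1, 0, 3
N = TP + FP + FN + TN

sigma = (TP + FN) / N
FPR = FP / (FP + TN)
PPV = TP / (TP + FP)
FNR = FN / (TP + FN)

lhs = FPR
rhs = sigma / (1 - sigma) * (1 - PPV) / PPV * (1 - FNR)
print(lhs, rhs)   # both 0.25
```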

Two Groups

Each group has its own confusion matrix. Then, for group w:

$$FPR_w = \frac{\sigma_w}{1 - \sigma_w} \cdot \frac{1 - PPV_w}{PPV_w} \cdot (1 - FNR_w)$$

And for group b:

$$FPR_b = \frac{\sigma_b}{1 - \sigma_b} \cdot \frac{1 - PPV_b}{PPV_b} \cdot (1 - FNR_b)$$

Calibration and PP are equal across groups

If PP and Calibration are Equal

Then PPV_w = PPV_b. Divide one identity by the other; the PPV factors cancel:

$$\frac{FPR_w}{FPR_b} = \frac{\sigma_w}{1 - \sigma_w} \cdot \frac{1 - \sigma_b}{\sigma_b} \cdot \frac{1 - FNR_w}{1 - FNR_b}$$

The two error rates can only be equal simultaneously when σ_w = σ_b.

This is not the case! From the data, the recidivism base rate is not equal across groups. Try setting FPR_w = FPR_b with σ_w ≠ σ_b: it is impossible to also reach equal FNR. Symmetrically, try setting FNR_w = FNR_b with σ_w ≠ σ_b: it is impossible to reach equal FPR.

Proof sketch: σ / (1 - σ) is bijective (strictly increasing). Write FPR_w = a · b and FPR_b = c · d, where a and c are the base-rate factors and b and d collect the error-rate terms. When we set both FPRs equal, b = d only when a = c, i.e., only when σ_w = σ_b.

Results

● It is impossible to achieve predictive parity (with calibration) and balance for both error rates when base rates differ
● Our classifiers will always favor one group
● Where do we go from here:
○ We can tolerate some error threshold
○ Decide toward which group the errors should lean
○ More definitions of fairness [next class]

What can we do?


Hard Problem

Society (and, as a consequence, our datasets) is unfair.
Accountability is difficult (who do we blame?).
Datasets and models are hard to understand.

Lots of Research

https://fairmlbook.org/


Thank You!


References

● Predictive Validity of the COMPAS Reentry Risk Scales
https://epic.org/algorithmic-transparency/crim-justice/EPIC-16-06-23-WI-FOIA-201600805-MDOC_ReentryStudy082213.pdf

● Inherent Trade-Offs in the Fair Determination of Risk Scores (Kleinberg, Mullainathan, and Raghavan)
https://arxiv.org/pdf/1609.05807.pdf

● Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments (Chouldechova)
https://www.andrew.cmu.edu/user/achoulde/files/disparate_impact.pdf

● Inherent Trade-offs in Algorithmic Fairness
https://www.youtube.com/watch?v=p5yY2MyTJXA