The Bigot in the Machine: Data Bias - Insurance Ireland · Accessibility - video description to the...

The Bigot in the Machine: Data Bias

Prof Alan F. Smeaton, Dublin City University, [email protected]

Insight in 4 years

450+ Academics, Researchers, Engineers & PhDs

300+ Research Awards won over 4 years

83+ Industry Funding Partners

60+ H2020 consortia

580+ collaborations in 40 Countries

8 Spinouts

57 License agreements

Built on 14 years of research in Data Analytics and AI

4 Co-Lead Universities

1,395+ Scientific Papers

100+ PhD/Masters graduated, 250+ by 2019

4 Taught Masters Programs, 330 Graduates per year

Funding Partners


Multi-nationals (Examples)

Irish & SME (Examples)


* Based on Insight’s Top 40 Companies

Impact Assessment Matrix with Industry


Broadcast

Security Autonomy

Archives


Simple message … data bias can hurt you !

But before that … computer vision … the home from which deep learning grew

• For years, we were happy

• Image tagging was our objective


And in 2012, this happened !

• Krizhevsky, Sutskever and Hinton at U Toronto, “won” the ImageNet large scale visual recognition challenge with a “convolutional neural network” (deep learning)

• 6 months later, they all work at Google


From John’s presentation …


Neural networks are much more complex …


Neural Nets can have many layers


Neural Nets can vary layer dimensions


Configuring Neural Networks …

• … while there are platforms like GPUs and TensorFlow, optimising hyperparameters is a black art !

• Let's look at a contemporary computer vision example ... automatically captioning video

• Many real world applications

✓ Video summarisation

✓ Supporting search and browsing

✓ Accessibility - video description to the blind


Insight and Adapt collaboration


LSTM

Sequence to Sequence - Video to Text (S2VT)

“Video caption”

CNN

● Required training on 00,000’s of video-caption pairs – never enough !

● Video features generated with a CNN, passed to a 2x LSTM stack

● LSTMs encode the features and decode them into natural language descriptions
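The data flow described above (per-frame CNN features in, a two-layer LSTM stack that first reads the frames and then emits words) can be sketched roughly as follows. This is only an illustration, not the Insight and Adapt system: the weights are random and untrained, the vocabulary, sizes and the `caption` function are hypothetical, bias terms are omitted, and the frame features are random stand-ins for real CNN output.

```python
import math
import random

random.seed(0)

VOCAB = ["<eos>", "a", "man", "cat", "riding", "bike"]  # toy vocabulary (hypothetical)
FEAT, HID = 4, 5  # toy sizes; real CNN features are far larger


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """A plain LSTM cell with untrained random weights and no bias terms."""

    def __init__(self, n_in, n_hid):
        # One weight matrix per gate: input, forget, output, candidate.
        self.w = [rand_matrix(n_hid, n_in + n_hid) for _ in range(4)]

    def step(self, x, state):
        h, c = state
        z = x + h  # concatenate the input with the previous hidden state
        i, f, o = ([sigmoid(a) for a in matvec(w, z)] for w in self.w[:3])
        g = [math.tanh(a) for a in matvec(self.w[3], z)]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h, c


def caption(frame_features, max_words=6):
    lstm1 = LSTMCell(FEAT, HID)               # lower layer: sees frame features
    lstm2 = LSTMCell(HID + len(VOCAB), HID)   # upper layer: sees layer 1 + previous word
    w_out = rand_matrix(len(VOCAB), HID)      # hidden state -> word scores
    s1 = ([0.0] * HID, [0.0] * HID)
    s2 = ([0.0] * HID, [0.0] * HID)
    no_word = [0.0] * len(VOCAB)

    # Encoding phase: read every frame, emit nothing.
    for feat in frame_features:
        s1 = lstm1.step(feat, s1)
        s2 = lstm2.step(s1[0] + no_word, s2)

    # Decoding phase: feed padding instead of frames, pick words greedily.
    words, prev = [], no_word
    for _ in range(max_words):
        s1 = lstm1.step([0.0] * FEAT, s1)
        s2 = lstm2.step(s1[0] + prev, s2)
        scores = matvec(w_out, s2[0])
        idx = max(range(len(VOCAB)), key=scores.__getitem__)
        if VOCAB[idx] == "<eos>":
            break
        words.append(VOCAB[idx])
        prev = [1.0 if k == idx else 0.0 for k in range(len(VOCAB))]
    return words


# Random stand-ins for the CNN features of 8 video frames.
frames = [[random.random() for _ in range(FEAT)] for _ in range(8)]
result = caption(frames)
print(result)
```

With untrained weights the words are meaningless; the sketch only shows why so many video-caption pairs are needed, since every weight above has to be learned from them.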


#990

a baseball player holding a bat on a field

#1599

a white cat sitting on top of a table

#603

a green truck is parked on a street

#1695

a person riding a bike down a street

Some Insight – Adapt automatic captions …

How does it perform ?


• Took part in an evaluation benchmark – dozens of groups worldwide – run by US National Institute of Standards and Technology

• Human assessment scores [0..100] for each caption from each group on each video – micro-averaged per caption then overall averages, standardised for variation across humans’ mean and std deviation

• Human captions are c.85% satisfactory, ours are c.50% satisfactory

• For many videos ours are as good as human, for others we’re poor – why ?

1. Our training data has bias – not broad enough to cover all test videos

2. Some of the videos are really difficult to caption anyway
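The idea of standardising human scores for each assessor's own mean and standard deviation can be sketched as below. This is a simplified illustration, not the benchmark's actual scoring procedure: the assessor names, the scores and the helper functions are all hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical raw scores: (assessor, system, video, score in [0, 100]).
raw = [
    ("anna", "human", "v1", 90), ("anna", "ours", "v1", 60),
    ("anna", "human", "v2", 80), ("anna", "ours", "v2", 30),
    ("ben", "human", "v1", 70), ("ben", "ours", "v1", 55),
    ("ben", "human", "v2", 65), ("ben", "ours", "v2", 20),
]


def standardise(rows):
    """Convert each score to a z-score relative to its own assessor."""
    by_assessor = {}
    for assessor, _, _, score in rows:
        by_assessor.setdefault(assessor, []).append(score)
    stats = {a: (mean(s), pstdev(s)) for a, s in by_assessor.items()}
    out = []
    for assessor, system, _, score in rows:
        mu, sigma = stats[assessor]
        z = (score - mu) / sigma if sigma else 0.0
        out.append((system, z))
    return out


def system_means(z_rows):
    """Average the standardised scores per captioning system."""
    per_system = {}
    for system, z in z_rows:
        per_system.setdefault(system, []).append(z)
    return {s: mean(v) for s, v in per_system.items()}


scores = system_means(standardise(raw))
print(scores)
```

Standardising first means a harsh assessor and a generous one contribute on the same scale before the per-system averages are compared.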

Bias in Training Video Dataset

• We have many videos of men playing soccer

• … all manually captioned accordingly, used as training data, but ...

... or ...
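A skew like this can be looked for before training by counting who appears in the captions for a given activity. The sketch below is only a rough illustration: the captions, the `SUBJECTS` word list and the `subject_counts` helper are hypothetical, and simple word matching would miss many real cases.

```python
from collections import Counter

# Hypothetical training captions; not taken from the actual dataset.
captions = [
    "a man playing soccer on a field",
    "two men playing soccer",
    "a man kicking a soccer ball",
    "a woman sitting in a chair with a laptop",
    "a man riding a bike down a street",
]

# Very small, illustrative mapping from subject words to a group label.
SUBJECTS = {"man": "male", "men": "male", "woman": "female", "women": "female"}


def subject_counts(captions, keyword):
    """Count the first subject word in every caption that mentions keyword."""
    counts = Counter()
    for text in captions:
        words = text.lower().split()
        if keyword in words:
            for w in words:
                if w in SUBJECTS:
                    counts[SUBJECTS[w]] += 1
                    break
    return counts


soccer = subject_counts(captions, "soccer")
print(soccer)
```

If one group never appears with an activity in the training captions, the model has no examples from which to learn that combination.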


Easy, and difficult, videos to caption


#1002

a woman sitting in a chair with a laptop

#1457

a woman wearing a pink shirt and tie

#1249

a man holding a fork and a cat

#1734

a man in a suit and tie standing at a table

Observations

• Video captioning is hard + need large, diverse set of video-caption pairs, with good coverage, no bias

• History of machine learning has many examples of data bias

• 2015 Google Photos app tagged two African American users as gorillas

• 2015 Google AdWords shown to advertise more lower-paid jobs to women and minorities

• 2016 Google Image search for CEOs found almost all men

• 2017 FaceApp, built by Russian developers to transform (beautify ?) faces in photos, automatically lightened skin tones

TRECVID 2017

Takeaway Message …

• Importantly, we use ML to inform decisions based on large datasets in personal finance, healthcare, job applications, legal system

• ML exacerbates disparities baked into data sets; algorithmic bias results from using machine learning even where no discrimination is intended - there is a bigot in the machine.

• But volume is your friend, when there’s enough data volume, biases disappear … or so we thought !

• Data-driven approaches assume all data points are created equally, they are not

• Mistaken belief that correlation between data sets equals causation

• University of Edinburgh study found significant correlation linking higher chocolate consumption per capita to serial killer activity per capita,
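The point above about volume can be illustrated with a small simulation: if the collection process itself is skewed, more data only makes the wrong answer more precise. This is a toy sketch under invented assumptions (a population split 50/50 between two groups, with one group's records kept only a quarter of the time); the function name and all numbers are hypothetical.

```python
import random


def biased_sample_mean(n, seed=0):
    """Estimate the share of group A from n records collected with a skew."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        # True population: half group A (value 1), half group B (value 0).
        value = 1 if rng.random() < 0.5 else 0
        # Skewed collection: group B records are kept only 25% of the time.
        if value == 1 or rng.random() < 0.25:
            kept.append(value)
    return sum(kept) / n


for n in (100, 10_000, 100_000):
    print(n, round(biased_sample_mean(n), 3))

estimate = biased_sample_mean(100_000)
```

Under these assumptions the estimate settles near 0.8 rather than the true 0.5 as n grows, so the bias does not wash out with volume.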


Insight and Aviva

• Several projects, including propensity modelling, multi-policy pricing models, CLV modelling … all using machine learning, some deep learning ...

• … all done with real customer data, in Aviva

• We have to stop and ask ourselves ...

• What baked-in biases exist in this data we’re aware of ?

• What baked-in biases exist in this data we’re not aware of ?

• Do those biases matter to the application using the data ?
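One possible first step on these questions is to compare how often a model gives a favourable outcome to each group in the data. The sketch below is an assumption-laden illustration, not anything from the Insight and Aviva projects: the groups, the decisions and the `positive_rates` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical model decisions: (group, favourable_outcome).
decisions = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", False), ("B", False), ("B", False),
]


def positive_rates(rows):
    """Return the share of favourable outcomes for each group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for group, outcome in rows:
        totals[group] += 1
        positives[group] += int(outcome)
    return {g: positives[g] / totals[g] for g in totals}


rates = positive_rates(decisions)
# Ratio of the lowest group rate to the highest; 1.0 would mean equal rates.
ratio = min(rates.values()) / max(rates.values())
print(rates, round(ratio, 2))
```

A low ratio does not prove the model is unfair, but it flags where the third question (does the bias matter to this application ?) needs a human answer.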


Thank You

[email protected]


Date post:	10-Jul-2020
