Archives, algorithms and people

Description

How we put the BBC World Service radio archive online using machines and crowdsourcing. A talk given to the UK Museums on the Web conference, November 2013. One of the major challenges of a big digitisation project is that you simply swap an under-used physical archive for its digital equivalent; without easy ways to navigate the data, there is no way for your users to get to the bits they want. We recently worked with the BBC World Service to generate metadata for their radio archive: 50,000 programmes spanning more than 45 years. We first used algorithms to generate "good enough" topics so the archive could go online, and then used crowdsourcing to improve the data. Throughout 2013 we have been running this experiment to crowdsource improvements to the metadata we automatically created. At http://worldservice.prototyping.bbc.co.uk people can search and browse programmes, listen to them, and correct or add topics. This talk describes how we went about it and what we have learnt from this massive online multimedia archive - about understanding audio, automatically generating topics and crowdsourcing improvements to the data.
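To make the approach concrete, here is a minimal sketch of the "good enough" topic step, assuming a simple TF-IDF-style scoring over noisy speech-to-text transcripts. It is not the pipeline the archive actually used, and every name and threshold in it is hypothetical.

import math
import re
from collections import Counter

# Purely illustrative sketch (not the archive's real pipeline): score words
# that are frequent in one programme's noisy transcript but rare across the
# rest of the collection, TF-IDF style, and keep the top few as "good enough"
# topic candidates. All names and thresholds are hypothetical.

STOPWORDS = {"the", "and", "that", "this", "with", "from", "have", "about"}


def tokenise(transcript: str) -> list[str]:
    """Lower-case the transcript and keep longer alphabetic tokens only."""
    return [t for t in re.findall(r"[a-z]+", transcript.lower())
            if len(t) > 3 and t not in STOPWORDS]


def topic_candidates(transcripts: dict[str, str], top_n: int = 5) -> dict[str, list[str]]:
    """Return the top-scoring candidate topics for each programme."""
    tokens = {pid: tokenise(text) for pid, text in transcripts.items()}

    # Document frequency: how many programmes mention each term at all.
    doc_freq = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))
    n_docs = len(tokens)

    topics = {}
    for pid, toks in tokens.items():
        tf = Counter(toks)
        scored = {term: count * math.log(n_docs / doc_freq[term])
                  for term, count in tf.items()}
        topics[pid] = sorted(scored, key=scored.get, reverse=True)[:top_n]
    return topics


if __name__ == "__main__":
    # Toy transcripts standing in for noisy speech-to-text output.
    sample = {
        "prog1": "report from the suez canal crisis egypt canal shipping",
        "prog2": "interview about cricket in india cricket test match",
    }
    print(topic_candidates(sample, top_n=3))

In practice the candidate terms would be matched against a controlled vocabulary of concepts rather than kept as raw words, but the shape of the problem - rank likely topics from imperfect transcripts, then let people correct them - is the same.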

Transcript

Tristan Ferne / @tristanf
Executive Producer

BBC Research & Development

Archives, algorithms and people
or

How we put the BBC World Service radio archive online using machines and crowdsourcing

The BBC World Service archive

1947-2012

Spelling mistake

Missing data

Sometimes incorrect data
No semantic data

The missing metadata

How it works

Listening machines

Noisy transcripts

Algorithms

Algorithms and people

The prototype

Show Synopsis editing version

Machine learning
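The "Algorithms and people" and "Machine learning" slides refer to combining machine-generated tags with listeners' edits. As a purely hypothetical illustration of one way those signals could be reconciled (the talk does not spell out the prototype's actual rules), a tag might be accepted once its machine confidence plus net crowd votes cross a threshold:

from dataclasses import dataclass

# Hypothetical sketch only: the weights and threshold below are invented for
# illustration; the talk does not specify how the real prototype reconciled
# machine tags with crowd edits.

@dataclass
class TagEvidence:
    machine_confidence: float = 0.0  # score from the automatic tagger, 0..1
    upvotes: int = 0                 # listeners who confirmed the tag
    downvotes: int = 0               # listeners who rejected the tag


def tag_score(ev: TagEvidence, vote_weight: float = 0.2) -> float:
    """Blend the machine score with net crowd votes."""
    return ev.machine_confidence + vote_weight * (ev.upvotes - ev.downvotes)


def accept_tag(ev: TagEvidence, threshold: float = 0.8) -> bool:
    """Keep the tag once the blended score clears the threshold."""
    return tag_score(ev) >= threshold


if __name__ == "__main__":
    borderline = TagEvidence(machine_confidence=0.5, upvotes=3, downvotes=1)
    print(accept_tag(borderline))  # True: 0.5 + 0.2 * (3 - 1) = 0.9

With these invented weights, a borderline machine tag (confidence 0.5) is accepted after three confirmations and one rejection.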

Results

70,000 tag edits

How much data?

1,000 synopsis edits

71,000 edits

36,000 listenable programmes

1m machine tags

70,000 programmes

3,000 users

36% of programmes listened to

21% of programmes tagged

And four lost programmes

Tags are a large and sparse space

When is a tag correct?

When is a programme tagged completely?

How do you measure crowd-sourced data?

How good is the data?

Who does the work?

1 person = 30% of edits

10 people = 70% of edits

10% of people = 98% of edits

The shape of the archive

Places mentioned

Linking from the News

The Last Danish Christmas Broadcast

“Entirely in Danish”

We can significantly improve the data

It’s cost-effective with re-usable technology

A crowdsourcing approach

What we’ve learnt

How good are the machine tags?

How much crowdsourcing do you need?

When is your data good enough?

Open questions

worldservice.prototyping.bbc.co.uk
www.bbc.co.uk/rd
github.com/bbrd

tristan.ferne@bbc.co.uk
@tristanf