
Sonia Ranade: 'Traces Through Time overview and next steps'

Date post: 30-Jul-2015
Upload: historyspot
Transcript
Page 1: Sonia Ranade: 'Traces Through Time overview and next steps'
Page 2: Sonia Ranade: 'Traces Through Time overview and next steps'

Sonia Ranade

23rd June 2015

Traces Through Time

Archives & Society/Digital History seminar

Page 3: Sonia Ranade: 'Traces Through Time overview and next steps'

Introduction to Traces Through Time

Page 4: Sonia Ranade: 'Traces Through Time overview and next steps'

The Traces Through Time project

• Funded by the AHRC

• In partnership with IHR and the Universities of Brighton and Leiden

• Project ran from January 2014 to March 2015

Aims

• To trace individuals through records in the National Archives and other institutions

• To handle the variation and inconsistency that we find in historical records

• To assign confidence to the links that we find

• To develop methods that work for records from different time periods

Page 5: Sonia Ranade: 'Traces Through Time overview and next steps'

Datasets

• Highly structured

• Often only partially transcribed

• Big data

Early data

• Semi-structured narrative text

• Rich in context and relationships

• Small data

Modern data

Page 6: Sonia Ranade: 'Traces Through Time overview and next steps'

Linking records

Page 7: Sonia Ranade: 'Traces Through Time overview and next steps'

Linking records

A 4-stage pipeline for processing and linking:

• Data cleansing and standardisation

• Statistics

• Optimisation

• Linking and confidence

Page 8: Sonia Ranade: 'Traces Through Time overview and next steps'

Basic Probabilistic Model

• Attribute by attribute comparison

• Calculate ratio:

• Probability of comparison score given it’s the same person

vs.

• Probability of comparison score given they’re different people

• Use frequency tables to calculate scores

• Add up the scores

• Assumes that attributes are independent
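
The model on this slide can be sketched in a few lines. This is an illustration only, not the project's code: the function names, the toy surname counts and the agreement rate `M_AGREE` are all assumptions. Log base 10 is also an assumption; it is consistent with the -0.699 penalties in the score tables, since log10(0.2) is about -0.699, but the slides do not state the base.

```python
import math
from collections import Counter

# Toy frequency table of surnames (hypothetical counts, for illustration only).
SURNAME_COUNTS = Counter({"smith": 500, "davies": 200, "morle": 1})
TOTAL = sum(SURNAME_COUNTS.values())

M_AGREE = 0.95  # assumed P(values agree | same person)


def agreement_score(value, counts, total, m_agree=M_AGREE):
    """Log10 likelihood ratio for two records agreeing on `value`.

    The rarer the value, the less likely two different people share it,
    so the higher the score.
    """
    p_value = counts[value] / total  # chance a random person has this value
    return math.log10(m_agree / p_value)


def disagreement_score(p_random_agree, m_agree=M_AGREE):
    """Log10 likelihood ratio for two records disagreeing on an attribute."""
    return math.log10((1 - m_agree) / (1 - p_random_agree))


def total_score(attribute_scores):
    """Add up per-attribute scores; only valid if attributes are independent."""
    return sum(attribute_scores)


rare = agreement_score("morle", SURNAME_COUNTS, TOTAL)
common = agreement_score("smith", SURNAME_COUNTS, TOTAL)
print(f"rare surname: {rare:.3f}, common surname: {common:.3f}")
```

Adding logarithms is the same as multiplying the underlying ratios, which is why the final step relies on the independence assumption.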

Page 9: Sonia Ranade: 'Traces Through Time overview and next steps'

Very high-scoring matches

Surname to Fname3 are name comparison scores; Day, Month and Year are DOB scores.

AIR 76 name | AIR 76 DOB | ADM 188 name | ADM 188 DOB | Surname | Fname1 | Fname2 | Fname3 | Day | Month | Year | Total score
denzil adair bartlett morle | 19 May 1879 | denzil adair bartlett morle | 19 May 1879 | 5.4442 | 4.4379 | 6.2457 | 5.5897 | 1.3302 | 0.9180 | 1.5307 | 25.50
kingsley storrs stanton parker | 02 January 1900 | kingsley storrs stanton parker | 02 January 1900 | 2.5287 | 4.8050 | 7.1751 | 5.2720 | 1.3302 | 0.9180 | 1.1603 | 23.19
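
As a quick check of the "add up the scores" step, the seven attribute scores in each of the two rows above sum to the reported totals. The snippet below only repeats that arithmetic; the variable and function names are made up for the example.

```python
# Attribute scores copied from the two rows above, in column order:
# Surname, Fname1, Fname2, Fname3, Day, Month, Year.
morle_scores = [5.4442, 4.4379, 6.2457, 5.5897, 1.3302, 0.9180, 1.5307]
parker_scores = [2.5287, 4.8050, 7.1751, 5.2720, 1.3302, 0.9180, 1.1603]


def total(scores):
    """Sum the per-attribute scores and round to two decimals, as on the slide."""
    return round(sum(scores), 2)


print(total(morle_scores))   # 25.5
print(total(parker_scores))  # 23.19
```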

Page 10: Sonia Ranade: 'Traces Through Time overview and next steps'

Lower confidence matches

Surname to Fname3 are name comparison scores; Day, Month and Year are DOB scores.

AIR 76 name | AIR 76 DOB | ADM 188 name | ADM 188 DOB | Surname | Fname1 | Fname2 | Fname3 | Day | Month | Year | Total score
f e orton | | frank edward orton | 06 December 1890 | 3.812 | 1.289 | 0.915 | 0.000 | 0.000 | 0.000 | 0.000 | 6.016
harry leonard gardner | | harry leonard garner | 06 March 1897 | 1.839 | 1.939 | 2.237 | 0.000 | 0.000 | 0.000 | 0.000 | 6.016
charles measures | | charles measures | 29 October 1868 | 4.362 | 1.653 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 6.015
charles measures | | charles measures | 25 December 1861 | 4.362 | 1.653 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 6.015
g w lester | | george w lester | 11 October 1853 | 3.542 | 1.305 | 1.168 | 0.000 | 0.000 | 0.000 | 0.000 | 6.014
george scarrott | | george scarrott | 14 August 1878 | 4.592 | 1.422 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 6.014
w j cann | | william james cann | 17 July 1858 | 3.834 | 1.212 | 0.967 | 0.000 | 0.000 | 0.000 | 0.000 | 6.013
 | | william james cann | 29 November 1867 | 3.834 | 1.212 | 0.967 | 0.000 | 0.000 | 0.000 | 0.000 | 6.013
d s gordon | | david stevenson gordon | 29 August 1896 | 3.225 | 1.407 | 1.381 | 0.000 | 0.000 | 0.000 | 0.000 | 6.013
h c lyons | | henry charles lyons | 16 November 1893 | 3.415 | 1.268 | 1.330 | 0.000 | 0.000 | 0.000 | 0.000 | 6.013
arthur edward john dobson | | arthur edward dobson | 28 February 1885 | 3.181 | 1.685 | 1.847 | -0.699 | 0.000 | 0.000 | 0.000 | 6.013
j w chilton | | james william chilton | 15 July 1884 | 3.833 | 0.967 | 1.212 | 0.000 | 0.000 | 0.000 | 0.000 | 6.012
 | | joseph wright chilton | 07 October 1899 | 3.833 | 0.967 | 1.212 | 0.000 | 0.000 | 0.000 | 0.000 | 6.012
arthur william smith | 25 July 1899 | arthur smith | 25 July 1899 | 1.693 | 1.685 | -0.699 | 0.000 | 1.330 | 0.918 | 1.085 | 6.012
john bertram granville bradley | 12 March 1899 | john bradley | 12 March 1899 | 2.875 | 1.201 | -0.699 | -0.699 | 1.330 | 0.918 | 1.085 | 6.011
h h england | | henry humphrey england | 30 July 1898 | 3.475 | 1.268 | 1.268 | 0.000 | 0.000 | 0.000 | 0.000 | 6.010
 | | herbert henry england | 07 February 1889 | 3.475 | 1.268 | 1.268 | 0.000 | 0.000 | 0.000 | 0.000 | 6.010
albert edward monecroft | | albert edward morecroft | 29 May 1896 | 2.404 | 1.759 | 1.847 | 0.000 | 0.000 | 0.000 | 0.000 | 6.010
w t randall | | william thomas randall | 13 December 1890 | 3.250 | 1.212 | 1.548 | 0.000 | 0.000 | 0.000 | 0.000 | 6.010
 | | william thomas randall | 31 August 1894 | 3.250 | 1.212 | 1.548 | 0.000 | 0.000 | 0.000 | 0.000 | 6.010
 | | william thomas randall | 06 June 1883 | 3.250 | 1.212 | 1.548 | 0.000 | 0.000 | 0.000 | 0.000 | 6.010
e j breen | | edward james breen | 23 May 1902 | 4.128 | 0.915 | 0.967 | 0.000 | 0.000 | 0.000 | 0.000 | 6.010
c w moran | | charles william moran | 01 January 1886 | 3.462 | 1.330 | 1.212 | 0.000 | 0.000 | 0.000 | 0.000 | 6.004
 | | christopher walter moran | 21 February 1885 | 3.462 | 1.330 | 1.212 | 0.000 | 0.000 | 0.000 | 0.000 | 6.004
t h johnston | | thomas henry johnston | 24 January 1894 | 3.188 | 1.548 | 1.268 | 0.000 | 0.000 | 0.000 | 0.000 | 6.004
a c m pym | | albert charles pym | 08 January 1899 | 4.398 | 0.975 | 1.330 | -0.699 | 0.000 | 0.000 | 0.000 | 6.004
c a walter | | charles alfred walter | 18 April 1892 | 3.695 | 1.330 | 0.975 | 0.000 | 0.000 | 0.000 | 0.000 | 6.000

Page 11: Sonia Ranade: 'Traces Through Time overview and next steps'

The threshold for high confidence matches: Computer says ‘Maybe’

Surname to Fname3 are name comparison scores; Day, Month and Year are DOB scores.

AIR 76 name | AIR 76 DOB | ADM 188 name | ADM 188 DOB | Surname | Fname1 | Fname2 | Fname3 | Day | Month | Year | Total score
clifton james twine | | clipton james twine | 14 November 1897 | 4.576208 | 2.009343 | 1.524698 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.110249
josiah c wedgewood | | josiah wedgewood | 27 January 1897 | 5.215215 | 3.593108 | -0.698970 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.109353
harold vaughan hicks | 25 January 1897 | harold hicks | 25 January 1897 | 3.190312 | 2.100176 | -0.698970 | 0.000000 | 1.330211 | 0.918030 | 1.265789 | 8.105548
rupert john goodman crouch | 12 February 1897 | rupert john goodman crouch | 12 February 1892 | 3.633749 | 3.716703 | 1.200945 | 5.302958 | 1.330211 | 0.918030 | -8.000000 | 8.102596
r h wrate | | roy holcombe wrate | 29 May 1899 | 5.469066 | 1.313985 | 1.267723 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.050774
 | | roy holcombe wrate | 29 May 1899 | 5.469066 | 1.313985 | 1.267723 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.050774
john norman longfield | 12 December 1884 | john norman longfield | 13 December 1884 | 4.829306 | 1.200945 | 2.417149 | 0.000000 | -2.735969 | 0.918030 | 1.417104 | 8.046566
d f crittall | | daniel frederick crittall | 06 August 1896 | 5.340946 | 1.406961 | 1.288987 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.036893
c c v terry | | christopher charles vincent terry | 13 October 1899 | 3.375747 | 1.330257 | 1.330257 | 1.979755 | 0.000000 | 0.000000 | 0.000000 | 8.016016
william john davies | 27 December 1894 | william john davies | 27 December 1894 | 2.057942 | 1.201008 | 1.200945 | 0.000000 | 1.330211 | 0.918030 | 1.303943 | 8.012078
h v phippen | | harold victor phippen | 19 July 1892 | 4.742067 | 1.267723 | 1.979755 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7.989546
edward heron | 21 August 1893 | edward george appelbe heron | 21 August 1893 | 3.937212 | 1.846789 | -0.698970 | -0.698970 | 1.330211 | 0.918030 | 1.332133 | 7.966435
f r maddaford | | frank richard maddaford | 16 December 1892 | 5.360395 | 1.288987 | 1.313985 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7.963367
albert john penaluna | | albert john penaluna | 06 May 1892 | 5.001995 | 1.759049 | 1.200945 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7.961989
harry ward | 10 May 1898 | harry ward | 10 May 1898 | 2.458691 | 1.939398 | 0.000000 | 0.000000 | 1.330211 | 0.918030 | 1.243949 | 7.890279
w h a rockett | | william henry albert rockett | 12 May 1904 | 4.431952 | 1.212178 | 1.267723 | 0.975004 | 0.000000 | 0.000000 | 0.000000 | 7.886857

Page 12: Sonia Ranade: 'Traces Through Time overview and next steps'

Refining the basic probabilistic model

• We’ve assumed independence of attributes: is this a valid assumption?

• Spelling variants and similar strings, e.g. ‘Sidney’ and ‘Sydney’

• Name variants, e.g. ‘Jack’ and ‘John’ or ‘Henry’ and ‘Harry’

• What about incorrectly transcribed initials?

• What about dates?
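
One way to picture the name questions on this slide is a comparison function that separates exact matches, known variants, initials and near spellings. The sketch below is an assumption-laden illustration, not the project's method: the categories, the 0.8 threshold, the two-entry variant table and the choice of `difflib.SequenceMatcher` as the similarity measure are all made up for the example.

```python
from difflib import SequenceMatcher

# Variant pairs named on the slide; a real table would be far larger.
NAME_VARIANTS = {frozenset({"jack", "john"}), frozenset({"henry", "harry"})}


def string_similarity(a, b):
    """Similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def compare_forenames(a, b, threshold=0.8):
    """Classify a forename pair; categories and threshold are illustrative."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return "missing"
    if a == b:
        return "exact"
    if frozenset({a, b}) in NAME_VARIANTS:
        return "known variant"
    if len(a) == 1 or len(b) == 1:
        # One side is an initial, as in 'f e orton' vs 'frank edward orton'.
        return "initial match" if a[0] == b[0] else "different"
    if string_similarity(a, b) >= threshold:
        return "similar spelling"
    return "different"


for pair in [("Sidney", "Sydney"), ("Jack", "John"), ("f", "frank")]:
    print(pair, "->", compare_forenames(*pair))
```

Each category would then need its own score in the probabilistic model; how those scores should be set is exactly the question the slide raises.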

Page 13: Sonia Ranade: 'Traces Through Time overview and next steps'

How independent are attributes?

Page 14: Sonia Ranade: 'Traces Through Time overview and next steps'

A stark example

Page 15: Sonia Ranade: 'Traces Through Time overview and next steps'

An anomaly

Surname to Fname3 are name comparison scores; Day, Month and Year are DOB comparison scores.

AIR 76 name | AIR 76 DOB | ADM 188 name | ADM 188 DOB | Surname | Fname1 | Fname2 | Fname3 | Day | Month | Year | Total score
j j gabell | | jonathan joseph gabell | 29 October 1898 | 5.271 | 0.967 | 0.967 | 0.000 | 0.000 | 0.000 | 0.000 | 7.205
e s mendoza | | elias sidney mendoza | 28 June 1900 | 4.908 | 0.915 | 1.381 | 0.000 | 0.000 | 0.000 | 0.000 | 7.205
robert m blackwood | | robert maxwell blackwood | 06 October 1887 | 4.428 | 1.764 | 1.012 | 0.000 | 0.000 | 0.000 | 0.000 | 7.204
a j loton | | alfred john loton | 05 April 1897 | 5.260 | 0.975 | 0.967 | 0.000 | 0.000 | 0.000 | 0.000 | 7.202
h r rickey | | henry rickey | 16 July 1851 | 6.633 | 1.268 | -0.699 | 0.000 | 0.000 | 0.000 | 0.000 | 7.201
h v briscoe | | hugh villiers briscoe | 25 March 1896 | 3.953 | 1.268 | 1.980 | 0.000 | 0.000 | 0.000 | 0.000 | 7.201
john wall | 16 June 1884 | john wall | 16 June 1886 | 3.231 | 1.201 | 0.000 | 0.000 | 1.330 | 0.918 | 0.521 | 7.200
h h girdlestone | | horace howard girdlestone | 12 January 1900 | 4.663 | 1.268 | 1.268 | 0.000 | 0.000 | 0.000 | 0.000 | 7.198
thomas francis taylor | 1885 | thomas francis taylor | 13 February 1885 | 2.030 | 1.492 | 2.244 | 0.000 | 0.000 | 0.000 | 1.430 | 7.197
harold powell | September 1881 | harold powell | 25 September 1881 | 2.687 | 2.100 | 0.000 | 0.000 | 0.000 | 0.918 | 1.488 | 7.193
frederick williams | 31 January 1894 | frederick williams | 31 January 1894 | 1.957 | 1.681 | 0.000 | 0.000 | 1.330 | 0.918 | 1.304 | 7.190
c h munford | | charles henry munford | 23 November 1897 | 4.591 | 1.330 | 1.268 | 0.000 | 0.000 | 0.000 | 0.000 | 7.189
patrick cashman | | patrick cashman | 01 June 1879 | 4.654 | 2.534 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 7.188
 | | patrick cashman | 18 October 1878 | 4.654 | 2.534 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 7.188
 | | patrick cashman | 16 September 1884 | 4.654 | 2.534 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 7.188
 | | patrick cashman | 17 March 1878 | 4.654 | 2.534 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 7.188
 | | patrick cashman | 04 April 1875 | 4.654 | 2.534 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 7.188
 | | patrick cashman | 04 April 1870 | 4.654 | 2.534 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 7.188
f g donnison | | frederick george donnison | 28 December 1899 | 4.593 | 1.289 | 1.305 | 0.000 | 0.000 | 0.000 | 0.000 | 7.187
percival albert wright | | percival albert wright | 10 October 1883 | 2.317 | 3.110 | 1.759 | 0.000 | 0.000 | 0.000 | 0.000 | 7.186
stanley b crick | | stanley benjamin charles crick | 08 May 1899 | 4.005 | 2.267 | 1.608 | -0.699 | 0.000 | 0.000 | 0.000 | 7.181
g h tidman | | george henry tidman | 18 May 1900 | 4.608 | 1.305 | 1.268 | 0.000 | 0.000 | 0.000 | 0.000 | 7.181
william edward back | | william edward back | 30 December 1855 | 4.132 | 1.201 | 1.847 | 0.000 | 0.000 | 0.000 | 0.000 | 7.180
f w doy | | frederick william doy | 12 August 1891 | 4.679 | 1.289 | 1.212 | 0.000 | 0.000 | 0.000 | 0.000 | 7.180
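
Two rows above carry partial dates of birth on the AIR 76 side ('1885' and 'September 1881'), and the missing components score 0.000. A component-by-component comparison consistent with that pattern could look like the sketch below. The parsing rules and function names are assumptions, and the slides do not say how the project actually handled partial dates.

```python
MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]


def parse_partial_date(text):
    """Parse 'DD Month YYYY', 'Month YYYY' or 'YYYY' into (day, month, year).

    Components that are absent come back as None.
    """
    day = month = year = None
    for token in text.lower().split():
        if token in MONTHS:
            month = MONTHS.index(token) + 1
        elif token.isdigit() and len(token) == 4:
            year = int(token)
        elif token.isdigit():
            day = int(token)
    return day, month, year


def compare_dobs(dob_a, dob_b):
    """Per component: True (agree), False (disagree) or None (not comparable)."""
    names = ("day", "month", "year")
    a, b = parse_partial_date(dob_a), parse_partial_date(dob_b)
    return {name: (None if x is None or y is None else x == y)
            for name, x, y in zip(names, a, b)}


print(compare_dobs("1885", "13 February 1885"))
print(compare_dobs("September 1881", "25 September 1881"))
```

A component marked None would simply contribute nothing to the total. The year scores in the tables look graded rather than agree/disagree (a two-year gap still scores 0.521 in the 'john wall' row), which this sketch does not try to reproduce.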

Page 16: Sonia Ranade: 'Traces Through Time overview and next steps'

Where were all the Cashmans born?

Page 17: Sonia Ranade: 'Traces Through Time overview and next steps'

Clustering surnames and forenames…

Page 18: Sonia Ranade: 'Traces Through Time overview and next steps'

Attribute Similarity

Page 19: Sonia Ranade: 'Traces Through Time overview and next steps'

An example:

Name | Rank | Service Number | Date of Death | Regiment / Service
BOULONOIS, PERCY THOMAS | Private | 28645 | 29/12/1917 | Royal Fusiliers

WO 372 – Percy J. Boulonois, GS/27645

MH 47 – Percy Thomas Boulonois

CWGC – Percy Thomas Boulonois, 28645

Page 20: Sonia Ranade: 'Traces Through Time overview and next steps'

Henry and Harry, an example

• Looked at high confidence matched pairs

• Approx 1% of Henrys are also recorded as Harry

• Tried three scenarios:

• Apply a weighting based on string similarity (our standard approach)

• Assume Henry and Harry are interchangeable

• Include the observed 1% probability in the calculation

• Ran tests with and without Date of Birth
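
The three scenarios can be compared side by side in a small sketch. Only the 1% figure comes from the slide. The name frequencies, the agreement rate and, in particular, the way string similarity is turned into a weight (a linear blend between the disagreement and agreement scores) are assumptions chosen for illustration, and the printed numbers are not project results.

```python
import math
from difflib import SequenceMatcher

# Hypothetical values, for illustration only.
P_HENRY = 0.04           # assumed share of people named Henry
P_HARRY = 0.02           # assumed share of people named Harry
M_AGREE = 0.95           # assumed P(forenames agree | same person)
P_HENRY_AS_HARRY = 0.01  # from the slide: approx 1% of Henrys recorded as Harry


def score_exact(p_value):
    """Log10 likelihood ratio for exact agreement on a forename."""
    return math.log10(M_AGREE / p_value)


def score_disagree(p_random_agree):
    """Log10 likelihood ratio for outright disagreement."""
    return math.log10((1 - M_AGREE) / (1 - p_random_agree))


def scenario_similarity(a, b, p_value):
    """Scenario 1: blend disagreement and agreement scores by string similarity."""
    sim = SequenceMatcher(None, a, b).ratio()
    low, high = score_disagree(p_value), score_exact(p_value)
    return low + sim * (high - low)


def scenario_interchangeable(p_value):
    """Scenario 2: treat Henry and Harry as the same name."""
    return score_exact(p_value)


def scenario_observed_rate(p_variant_given_name, p_variant):
    """Scenario 3: P(Harry | truly Henry) / P(Harry | a different person).

    The probability of the first record saying Henry appears in both the
    numerator and the denominator, so it cancels.
    """
    return math.log10(p_variant_given_name / p_variant)


print("similarity:     ", round(scenario_similarity("henry", "harry", P_HENRY), 3))
print("interchangeable:", round(scenario_interchangeable(P_HENRY), 3))
print("observed 1%:    ", round(scenario_observed_rate(P_HENRY_AS_HARRY, P_HARRY), 3))
```

Running with and without date of birth would then amount to adding, or leaving out, the day, month and year scores before comparing the totals.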

Page 21: Sonia Ranade: 'Traces Through Time overview and next steps'

Browsing Individuals

Page 22: Sonia Ranade: 'Traces Through Time overview and next steps'

From Conscription to Henry VIII

Page 23: Sonia Ranade: 'Traces Through Time overview and next steps'

Future work…

