Date posted: 30-Jul-2015
Category: Education
Uploaded by: historyspot
Sonia Ranade
23rd June 2015
Traces Through Time
Archives & Society/Digital History seminar
Introduction to Traces Through Time
The Traces Through Time project
• Funded by the AHRC
• In partnership with IHR and the Universities of Brighton and Leiden
• Project ran from January 2014 to March 2015
Aims
• To trace individuals through records in the National Archives, and other institutions
• To handle the variation and inconsistency that we find in historical records
• To assign confidence to the links that we find
• To develop methods that work for records from different time periods
Datasets
Early data
• Semi-structured narrative text
• Rich in context and relationships
• Small data
Modern data
• Highly structured
• Often only partially transcribed
• Big data
Linking records
A 4-stage pipeline for processing and linking:
• Data cleansing and Standardisation
• Statistics
• Optimisation
• Linking and Confidence
Basic Probabilistic Model
• Attribute by attribute comparison
• Calculate ratio:
• Probability of comparison score given it’s the same person
vs.
• Probability of comparison score given they’re different people
• Use frequency tables to calculate scores
• Add up the scores
• Assumes that attributes are independent
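The ratio described above can be written as a log weight per attribute, with the weights added to give a total. The sketch below is one minimal reading of that idea, not the project's implementation. The names `frequency_table`, `agreement_score` and `total_score`, the constant `m_prob`, and the toy surname list are all assumptions made for illustration. The use of base-10 logs is also an inference: the -0.699 entries in the score tables match log10(0.2), but the slides do not state the base.

```python
import math
from collections import Counter


def frequency_table(values):
    """Relative frequency of each value in one attribute column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def agreement_score(value, freq, m_prob=0.95):
    """log10 of P(agree | same person) / P(agree | different people).

    m_prob is an assumed constant. The chance that two unrelated
    records agree on `value` is approximated by its relative frequency.
    """
    u_prob = freq[value]
    return math.log10(m_prob / u_prob)


def total_score(attribute_scores):
    """Under the independence assumption the per-attribute scores add."""
    return sum(attribute_scores)


# Toy data (hypothetical): a common and a rare surname.
surnames = ["smith"] * 90 + ["morle"] * 10
freq = frequency_table(surnames)
rare = agreement_score("morle", freq)
common = agreement_score("smith", freq)

# Agreement on a rare value is stronger evidence than on a common one.
print(rare > common)  # True

# Name scores from the 'f e orton' row of the tables: 3.812 + 1.289 + 0.915
print(round(total_score([3.812, 1.289, 0.915]), 3))  # 6.016
```

Rare values score higher because chance agreement between different people is less likely, which is consistent with the larger surname scores shown for unusual names in the tables.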
AIR 76 ADM 188 Name comparison scores DOB scores Total
Name DOB Name DOB Surname Fname1 Fname2 Fname3 Day Month Year Score
denzil adair bartlett morle 19 May 1879 denzil adair bartlett morle 19 May 1879 5.4442 4.4379 6.2457 5.5897 1.3302 0.9180 1.5307 25.50
kingsley storrs stanton parker 02 January 1900 kingsley storrs stanton parker 02 January 1900 2.5287 4.8050 7.1751 5.2720 1.3302 0.9180 1.1603 23.19
Very high-scoring matches
AIR 76 ADM 188 Name comparison scores DOB scores Total
Name DOB Name DOB Surname Fname1 Fname2 Fname3 Day Month Year Score
f e orton frank edward orton 06 December 1890 3.812 1.289 0.915 0.000 0.000 0.000 0.000 6.016
harry leonard gardner harry leonard garner 06 March 1897 1.839 1.939 2.237 0.000 0.000 0.000 0.000 6.016
charles measures charles measures 29 October 1868 4.362 1.653 0.000 0.000 0.000 0.000 0.000 6.015
charles measures charles measures 25 December 1861 4.362 1.653 0.000 0.000 0.000 0.000 0.000 6.015
g w lester george w lester 11 October 1853 3.542 1.305 1.168 0.000 0.000 0.000 0.000 6.014
george scarrott george scarrott 14 August 1878 4.592 1.422 0.000 0.000 0.000 0.000 0.000 6.014
w j cann william james cann 17 July 1858 3.834 1.212 0.967 0.000 0.000 0.000 0.000 6.013
william james cann 29 November 1867 3.834 1.212 0.967 0.000 0.000 0.000 0.000 6.013
d s gordon david stevenson gordon 29 August 1896 3.225 1.407 1.381 0.000 0.000 0.000 0.000 6.013
h c lyons henry charles lyons 16 November 1893 3.415 1.268 1.330 0.000 0.000 0.000 0.000 6.013
arthur edward john dobson arthur edward dobson 28 February 1885 3.181 1.685 1.847 -0.699 0.000 0.000 0.000 6.013
j w chilton james william chilton 15 July 1884 3.833 0.967 1.212 0.000 0.000 0.000 0.000 6.012
joseph wright chilton 07 October 1899 3.833 0.967 1.212 0.000 0.000 0.000 0.000 6.012
arthur william smith 25 July 1899 arthur smith 25 July 1899 1.693 1.685 -0.699 0.000 1.330 0.918 1.085 6.012
john bertram granville bradley 12 March 1899 john bradley 12 March 1899 2.875 1.201 -0.699 -0.699 1.330 0.918 1.085 6.011
h h england henry humphrey england 30 July 1898 3.475 1.268 1.268 0.000 0.000 0.000 0.000 6.010
herbert henry england 07 February 1889 3.475 1.268 1.268 0.000 0.000 0.000 0.000 6.010
albert edward monecroft albert edward morecroft 29 May 1896 2.404 1.759 1.847 0.000 0.000 0.000 0.000 6.010
w t randall william thomas randall 13 December 1890 3.250 1.212 1.548 0.000 0.000 0.000 0.000 6.010
william thomas randall 31 August 1894 3.250 1.212 1.548 0.000 0.000 0.000 0.000 6.010
william thomas randall 06 June 1883 3.250 1.212 1.548 0.000 0.000 0.000 0.000 6.010
e j breen edward james breen 23 May 1902 4.128 0.915 0.967 0.000 0.000 0.000 0.000 6.010
c w moran charles william moran 01 January 1886 3.462 1.330 1.212 0.000 0.000 0.000 0.000 6.004
christopher walter moran 21 February 1885 3.462 1.330 1.212 0.000 0.000 0.000 0.000 6.004
t h johnston thomas henry johnston 24 January 1894 3.188 1.548 1.268 0.000 0.000 0.000 0.000 6.004
a c m pym albert charles pym 08 January 1899 4.398 0.975 1.330 -0.699 0.000 0.000 0.000 6.004
c a walter charles alfred walter 18 April 1892 3.695 1.330 0.975 0.000 0.000 0.000 0.000 6.000
Lower confidence matches
AIR 76 ADM 188 Name comparison scores DOB scores Total
Name DOB Name DOB Surname Fname1 Fname2 Fname3 Day Month Year Score
clifton james twine clipton james twine 14 November 1897 4.576208 2.009343 1.524698 0.000000 0.000000 0.000000 0.000000 8.110249
josiah c wedgewood josiah wedgewood 27 January 1897 5.215215 3.593108 -0.698970 0.000000 0.000000 0.000000 0.000000 8.109353
harold vaughan hicks 25 January 1897 harold hicks 25 January 1897 3.190312 2.100176 -0.698970 0.000000 1.330211 0.918030 1.265789 8.105548
rupert john goodman crouch 12 February 1897 rupert john goodman crouch 12 February 1892 3.633749 3.716703 1.200945 5.302958 1.330211 0.918030 -8.000000 8.102596
r h wrate roy holcombe wrate 29 May 1899 5.469066 1.313985 1.267723 0.000000 0.000000 0.000000 0.000000 8.050774
roy holcombe wrate 29 May 1899 5.469066 1.313985 1.267723 0.000000 0.000000 0.000000 0.000000 8.050774
john norman longfield 12 December 1884 john norman longfield 13 December 1884 4.829306 1.200945 2.417149 0.000000 -2.735969 0.918030 1.417104 8.046566
d f crittall daniel frederick crittall 06 August 1896 5.340946 1.406961 1.288987 0.000000 0.000000 0.000000 0.000000 8.036893
c c v terry christopher charles vincent terry 13 October 1899 3.375747 1.330257 1.330257 1.979755 0.000000 0.000000 0.000000 8.016016
william john davies 27 December 1894 william john davies 27 December 1894 2.057942 1.201008 1.200945 0.000000 1.330211 0.918030 1.303943 8.012078
h v phippen harold victor phippen 19 July 1892 4.742067 1.267723 1.979755 0.000000 0.000000 0.000000 0.000000 7.989546
edward heron 21 August 1893 edward george appelbe heron 21 August 1893 3.937212 1.846789 -0.698970 -0.698970 1.330211 0.918030 1.332133 7.966435
f r maddaford frank richard maddaford 16 December 1892 5.360395 1.288987 1.313985 0.000000 0.000000 0.000000 0.000000 7.963367
albert john penaluna albert john penaluna 06 May 1892 5.001995 1.759049 1.200945 0.000000 0.000000 0.000000 0.000000 7.961989
harry ward 10 May 1898 harry ward 10 May 1898 2.458691 1.939398 0.000000 0.000000 1.330211 0.918030 1.243949 7.890279
w h a rockett william henry albert rockett 12 May 1904 4.431952 1.212178 1.267723 0.975004 0.000000 0.000000 0.000000 7.886857
The threshold for high confidence matches: Computer says ‘Maybe’
Refining the basic probabilistic model
• We’ve assumed independence of attributes: is this a valid assumption?
• Spelling variants and similar strings, e.g. ‘Sidney’ and ‘Sydney’
• Name variants, e.g. ‘Jack’ and ‘John’ or ‘Henry’ and ‘Harry’
• What about incorrectly transcribed initials?
• What about dates?
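The difference between the first two refinements can be shown in a few lines: spelling variants are close as strings, while name variants such as ‘Jack’ and ‘John’ are not and need a lookup of some kind. This is a sketch under assumptions, not the project's method. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure, and `NAME_VARIANTS` is a hypothetical table holding only the pairs named on the slide.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity in [0, 1] using the standard library matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical lookup of known name variants (pairs taken from the slide).
NAME_VARIANTS = {
    frozenset(("jack", "john")),
    frozenset(("henry", "harry")),
}


def is_known_variant(a, b):
    """True if the two names are listed as variants of one another."""
    return frozenset((a.lower(), b.lower())) in NAME_VARIANTS


# Spelling variants look alike as strings...
print(name_similarity("Sidney", "Sydney") > 0.8)  # True
# ...but name variants may not, so string similarity alone misses them.
print(name_similarity("Jack", "John") < 0.5)      # True
print(is_known_variant("Jack", "John"))           # True
```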
How independent are attributes?
A stark example
AIR 76 Person ADM 188 Person Name comparison scores DOB comparison scores Total
Name DOB Name DOB Surname Fname1 Fname2 Fname3 Day Month Year Score
j j gabell jonathan joseph gabell 29 October 1898 5.271 0.967 0.967 0.000 0.000 0.000 0.000 7.205
e s mendoza elias sidney mendoza 28 June 1900 4.908 0.915 1.381 0.000 0.000 0.000 0.000 7.205
robert m blackwood robert maxwell blackwood 06 October 1887 4.428 1.764 1.012 0.000 0.000 0.000 0.000 7.204
a j loton alfred john loton 05 April 1897 5.260 0.975 0.967 0.000 0.000 0.000 0.000 7.202
h r rickey henry rickey 16 July 1851 6.633 1.268 -0.699 0.000 0.000 0.000 0.000 7.201
h v briscoe hugh villiers briscoe 25 March 1896 3.953 1.268 1.980 0.000 0.000 0.000 0.000 7.201
john wall 16 June 1884 john wall 16 June 1886 3.231 1.201 0.000 0.000 1.330 0.918 0.521 7.200
h h girdlestone horace howard girdlestone 12 January 1900 4.663 1.268 1.268 0.000 0.000 0.000 0.000 7.198
thomas francis taylor 1885 thomas francis taylor 13 February 1885 2.030 1.492 2.244 0.000 0.000 0.000 1.430 7.197
harold powell September 1881 harold powell 25 September 1881 2.687 2.100 0.000 0.000 0.000 0.918 1.488 7.193
frederick williams 31 January 1894 frederick williams 31 January 1894 1.957 1.681 0.000 0.000 1.330 0.918 1.304 7.190
c h munford charles henry munford 23 November 1897 4.591 1.330 1.268 0.000 0.000 0.000 0.000 7.189
patrick cashman patrick cashman 01 June 1879 4.654 2.534 0.000 0.000 0.000 0.000 0.000 7.188
patrick cashman 18 October 1878 4.654 2.534 0.000 0.000 0.000 0.000 0.000 7.188
patrick cashman 16 September 1884 4.654 2.534 0.000 0.000 0.000 0.000 0.000 7.188
patrick cashman 17 March 1878 4.654 2.534 0.000 0.000 0.000 0.000 0.000 7.188
patrick cashman 04 April 1875 4.654 2.534 0.000 0.000 0.000 0.000 0.000 7.188
patrick cashman 04 April 1870 4.654 2.534 0.000 0.000 0.000 0.000 0.000 7.188
f g donnison frederick george donnison 28 December 1899 4.593 1.289 1.305 0.000 0.000 0.000 0.000 7.187
percival albert wright percival albert wright 10 October 1883 2.317 3.110 1.759 0.000 0.000 0.000 0.000 7.186
stanley b crick stanley benjamin charles crick 08 May 1899 4.005 2.267 1.608 -0.699 0.000 0.000 0.000 7.181
g h tidman george henry tidman 18 May 1900 4.608 1.305 1.268 0.000 0.000 0.000 0.000 7.181
william edward back william edward back 30 December 1855 4.132 1.201 1.847 0.000 0.000 0.000 0.000 7.180
f w doy frederick william doy 12 August 1891 4.679 1.289 1.212 0.000 0.000 0.000 0.000 7.180
An anomaly
Where were all the Cashmans born?
Clustering surnames and forenames…
Attribute Similarity
An example:
Name: BOULONOIS, PERCY THOMAS
Rank: Private
Service Number: 28645
Date of Death: 29/12/1917
Regiment / Service: Royal Fusiliers
WO 372 – Percy J. Boulonois, GS/27645
MH 47 – Percy Thomas Boulonois
CWGC – Percy Thomas Boulonois, 28645
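In the example above the WO 372 service number differs from the CWGC one by a single digit once the GS/ prefix is set aside. One way to treat such near-misses as partial rather than total disagreement is sketched below. The functions `digits_only` and `digit_agreement` are hypothetical helpers, and position-by-position digit comparison is an assumed measure, not one the slides describe.

```python
import re


def digits_only(service_number):
    """Drop prefixes and punctuation such as 'GS/' and keep the digits."""
    return re.sub(r"\D", "", service_number)


def digit_agreement(a, b):
    """Fraction of digit positions that agree; 0.0 if lengths differ."""
    a, b = digits_only(a), digits_only(b)
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


print(digit_agreement("GS/27645", "28645"))  # 0.8
print(digit_agreement("28645", "28645"))     # 1.0
```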
Henry and Harry, an example
• Looked at high confidence matched pairs
• Approx 1% of Henrys are also recorded as Harry
• Tried three scenarios:
• Apply a weighting based on string similarity (our standard approach)
• Assume Henry and Harry are interchangeable
• Include the observed 1% probability in the calculation
• Ran tests with and without Date of Birth
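The three scenarios can be compared side by side as forename scores for a Henry/Harry pair. The sketch below is illustrative only. The 1% rate comes from the slide, but every other number (the full-agreement score, the disagreement score, the string similarity and the frequency of ‘Harry’) is made up, and the interpolation used for the first scenario is an assumed form of similarity weighting rather than the project's formula.

```python
import math


def score_by_similarity(full_agreement, disagreement, similarity):
    """Scenario 1: slide between disagreement and agreement by similarity."""
    return disagreement + similarity * (full_agreement - disagreement)


def score_interchangeable(full_agreement):
    """Scenario 2: treat Henry and Harry as the same name."""
    return full_agreement


def score_observed_rate(variant_rate, variant_freq):
    """Scenario 3: log10 of P(variant | same person) / P(variant | different)."""
    return math.log10(variant_rate / variant_freq)


# Hypothetical inputs, chosen only to make the comparison concrete.
full_agreement = 1.5   # score if both records said 'henry'
disagreement = -0.7    # score for two unrelated forenames
similarity = 0.6       # string similarity of 'henry' and 'harry'
variant_rate = 0.01    # from the slide: about 1% of Henrys recorded as Harry
harry_freq = 0.02      # assumed share of records with forename 'harry'

s1 = score_by_similarity(full_agreement, disagreement, similarity)
s2 = score_interchangeable(full_agreement)
s3 = score_observed_rate(variant_rate, harry_freq)
print(round(s1, 3), round(s2, 3), round(s3, 3))
```

With these made-up inputs the interchangeable scenario is the most generous and the observed-rate scenario the most cautious. The ordering depends entirely on the assumed frequencies, so it should not be read as a result of the project.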
Browsing Individuals
From Conscription to Henry VIII
Future work…