Unexpected Repurposing: the British Library's digital collections and UCL teaching, research and infrastructure
Professor Melissa Terras
Professor of Digital Humanities, UCL Dept of Information Studies
Director, UCL Centre for Digital Humanities
@melissaterras
#openglam
British Library, 28th May 2008. https://web.archive.org/web/20110707135434/http://pressandpolicy.bl.uk/Press-Releases/The-British-Library-19th-Century-Book-Digitisation-Project-343.aspx
Returned to the Library in 2012, placed under a CC0 public domain license for commercial and non-commercial use.
Optically Character Recognised (OCR) generated Text
Scanned Page
OCR XML Generated by ABBYY FineReader
https://www.flickr.com/photos/britishlibrary
http://blpublicdomain.wikispaces.com/home
https://historicaltexts.jisc.ac.uk/results?filter=service%7C%7Cbl&tab=date
Data: what can we do with 65,000 books?
224GB compressed ALTO XML
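As a rough illustration of what working with ALTO XML involves, the sketch below pulls the OCR text out of a single page. It assumes the usual ALTO layout, in which each `TextLine` element holds `String` elements whose `CONTENT` attribute carries a word. The sample page is hand-written for this example and is not taken from the dataset, and the namespace shown is only one of several ALTO versions, so the code ignores namespaces altogether.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written ALTO-like page, for illustration only (not from the dataset).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="The"/><SP/><String CONTENT="fever"/>
    </TextLine>
    <TextLine>
      <String CONTENT="abated"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Drop any '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the OCR text of each TextLine as a plain string."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            # Collect the CONTENT attribute of every String inside this line.
            words = [
                child.get("CONTENT", "")
                for child in elem.iter()
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


page_lines = alto_lines(SAMPLE_ALTO)
print(page_lines)
```

At the scale of 224GB, a function like this would be applied file by file rather than loading the collection into memory at once.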
http://www0.cs.ucl.ac.uk/staff/D.Mohamedally/
Staff and Students, working together
• James Baker, Adam Farquhar
• Melissa Terras, Dean Mohamedally, Tim Weyrich
• Stefan Alborzpour, Stelios Georgiou, Nektaria Stavrou, Wendy Wong, Jonathan Lloyd, Meral Sahin, Divya Surendran, James Durrant, Muhammad Rafdi, Ali Sarraf
Approach
• How can we search the dataset differently?
• Complex and multifaceted needs of humanities researchers
• Boolean and Advanced Search
• Microsoft Azure: 5 APIs were implemented that functionally scale to the data
• Offering unconventional services such as bulk download of text based on metadata queries, word frequency lists, and OCR text previews.
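Two of the services listed above, selecting text by a metadata query and producing a word frequency list, can be sketched in a few lines. This is a minimal illustration and not the students' Azure implementation: the records, the field names (`title`, `year`, `text`), and the helper functions are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical records: the field names and texts are invented for illustration.
BOOKS = [
    {"title": "A Treatise on Fevers", "year": 1821, "text": "The fever and the cure."},
    {"title": "Travels in Wales", "year": 1865, "text": "The hills, the rain, the sea."},
]


def select_by_year(books, start, end):
    """Metadata query: keep books published between start and end inclusive."""
    return [b for b in books if start <= b["year"] <= end]


def word_frequencies(books):
    """Count lower-cased alphabetic words across the selected books."""
    counts = Counter()
    for book in books:
        counts.update(re.findall(r"[a-z]+", book["text"].lower()))
    return counts


subset = select_by_year(BOOKS, 1800, 1850)
freqs = word_frequencies(subset)
print(freqs.most_common(3))
```

The same two steps, filter on metadata and then aggregate over the text, underlie a bulk download: the filtered subset is simply written out instead of counted.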
picaguess.herokuapp.com, dx.doi.org/10.5281/zenodo.15980
James Baker, Tim Weyrich, Dean Mohamedally
Jonathan Lloyd, Meral Sahin, Divya Surendran
http://blbigdata.herokuapp.com/
James Baker, Tim Weyrich, Dean Mohamedally
Ali Sarraf, James Durrant, Muhammad Rafdi
github.com/UCL-dataspring
Method
• 65k books from the British Library:
• 17th - 19th century
• 224GB compressed ALTO XML
• UCL High Performance Computing
• Support from RITS and UCLDH
• 4 humanities researchers
• Turn research questions into computational queries
• Learn from the researchers about their needs, wants, desires, and method.
Results
Taking Humanities data to HPC…
https://www.flickr.com/photos/epublicist/3546059144
Case Study 1: History of Medicine, Oliver Duke-Williams, UCL
Case Study 2: History of Images, Will Finley, Sheffield
What did this tell us?
• Best practice recommendations:
– Derived datasets for home use
– Documenting decisions
– Fixed/defined dataset
– Normalisations
Common Queries
• searches for all variants of a word
• searches that return keywords in context traced over time
• NOT searches for a word or phrase that ignore another word or phrase
• searches for a word when in close proximity to a second word
• searches based on image metadata
… All returned in a derived dataset, in context.
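Several of these query types reduce to simple operations on a tokenised text. The sketch below is an illustration under stated assumptions, not the project's code (which is at github.com/UCL-dataspring): the function names, the tokeniser, and the sample sentence are all invented here.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, keyword, width=3):
    """Keyword in context: each hit with up to `width` tokens either side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits


def near(tokens, first, second, distance=5):
    """Proximity search: True if `first` occurs within `distance` tokens of `second`."""
    pos_first = [i for i, t in enumerate(tokens) if t == first]
    pos_second = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) <= distance for i in pos_first for j in pos_second)


def contains_but_not(tokens, wanted, unwanted):
    """NOT search: the wanted word is present and the unwanted word is absent."""
    return wanted in tokens and unwanted not in tokens


def variants(tokens, pattern):
    """Variant search: every token that fully matches a regular expression."""
    return [t for t in tokens if re.fullmatch(pattern, t)]


# Invented sample sentence, for illustration only.
tokens = tokenize(
    "The cholera spread quickly through the town, and cholera hospitals were opened."
)
print(kwic(tokens, "cholera", width=2))
print(near(tokens, "cholera", "town", distance=3))
print(contains_but_not(tokens, "cholera", "plague"))
print(variants(tokens, r"hospitals?"))
```

Tracing results over time would mean running the same functions per book and grouping the hits by publication year from the metadata.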
Do try this at home…
1. Invest in research software engineer capacity to deploy and maintain openly licensed large-scale digital collections from across the GLAM sector in order to facilitate research in the arts, humanities and social and historical sciences.
2. Invest in training library staff to run these initial queries in collaboration with humanities faculty, to support work with subsets of data that are produced, and to document and manage resulting code and derived data.
github.com/UCL-dataspring
With thanks to
• BL Labs and Digital Curators: James Baker, Adam Farquhar, Mahendra Mahey, Ben O’Steen, Hana Lewis
• UCL CS Student Project Team: James Baker, Tim Weyrich, Dean Mohamedally
• Bluclobber Project Team: James Baker, James Hetherington, David Beavan, Anne Welsh, Helen O’Neill, Will Finley, Oliver Duke-Williams, Adam Farquhar.
• UCL Research IT Services: James Hetherington, Clare Gryce, Raquel Algere.