Date posted: 22-Jan-2018
Category: Data & Analytics
Uploaded by: driireland
Experience with Ingestion of Large Collections
Stuart Kenny
Research IT, Trinity College Dublin
The Fairy Tales of Charles Perrault. Illustrated by Harry Clarke.
Intro. Thomas Bodkin. London: George G. Harrap, [1922].
Internet Archive version of a copy in the New York Public Library.
Web. 25 December 2012.
My what a big collection you have!
About DRI (https://repository.dri.ie/)
● DRI is an interactive trusted digital repository for
contemporary and historical, social and cultural
data held by Irish institutions
● RIA (lead), NUIM, TCD, DIT, NUIG, NCAD
● Partners: academic, cultural, social, government
Outline
• What’s our problem?
• Example collections
• Ingest solutions
• Current ingest process
• Possible future process
Ingesting Objects
• Ingest form
o Suitable for single objects/small collections
o Flat hierarchies
o Simple metadata standards
• Multiple standards
o e.g., MARC, EAD
o XML upload
• How to handle complex standards, many objects?
Example Collection: Clarke Stained Glass
• MODS metadata
• 10,025 objects
• 42 sub-collections
• 20,047 files, 2.82 TB
• Problems:
o Large number of objects
o Data transfer
Example Collection: TCD Children’s Books
• MARC metadata
• 207,889 objects
• 16 sub-collections
• Problems:
o Large number of objects
o Very slow to ingest
o Timeouts and errors
Example Collection: Kilkenny Design Workshop
• EAD metadata
• 2,040 objects
• 2,734 series/files
• 2,231 files, 1.2GB
• Problems:
o Very complex metadata standard
o Hierarchical structure
EAD, and why I don’t quite hate it as much as I did...
• Single XML file upload
• Structure encoded in metadata
• URLs to files
• But
o One-shot ingest
o How to edit/update?
o Slow to ingest
o Requires a lot of resources
Sufia Batch Upload
• Add multiple files
• New work for each
• Metadata for each work
• How to handle multiple standards?
• Different metadata for each work?
Avalon Batch Ingest
• Ingest package
o Manifest file
o Plus content files
• Manifest file is spreadsheet
o Metadata for items
o Names of content files
• Ingest package uploaded to Avalon DropBox
Approach up to now
• Command line client
o Enter text commands at ‘command prompt’
• Written in Ruby
• Run locally by user
• Metadata and asset files arranged in fixed directory structure
• Client iterates over the directory and creates each object as a single ingest
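The directory walk described above might look roughly like the sketch below. The original client was written in Ruby and its actual directory layout is not given in the slides, so the layout here (one sub-directory per object holding a `metadata.xml` plus its asset files) and the function name `plan_single_ingests` are assumptions made purely for illustration.

```python
from pathlib import Path


def plan_single_ingests(root):
    """Yield (metadata_path, asset_paths) for each object directory under root.

    Assumed layout (hypothetical, not taken from the slides):
        root/<object>/metadata.xml
        root/<object>/<asset files...>
    """
    root = Path(root)
    for obj_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        metadata = obj_dir / "metadata.xml"
        if not metadata.is_file():
            continue  # skip directories that carry no metadata record
        assets = sorted(
            p for p in obj_dir.iterdir() if p.is_file() and p != metadata
        )
        # Each yielded pair corresponds to one single-object ingest request.
        yield metadata, assets
```

Each yielded pair would become one separate ingest request, which is what makes this approach slow for collections with many thousands of objects.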
Problems
• Lack of user familiarity with command line
• Multiple platform support
o i.e., Windows
• Difficulty of installing
• Multiple single ingests
o Slow
o Error prone
• Required lots of user support
• In the end, most ingests were performed by the dev team
Current Attempt
• Web-based UI
• Borrow heavily from Avalon approach
• Upload metadata XML plus assets to online storage
• Add manifest spreadsheet
o Each row contains path to metadata
o Paths to zero or more asset files
o Paths relative to online storage directory
• Backend processes manifest and ingests as background task
• UI updates status
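A minimal sketch of how a backend could read such a manifest and track status, under stated assumptions: the manifest is exported as CSV with no header row, the first cell of each row is the metadata path and any remaining cells are asset paths. The names `read_manifest`, `run_jobs` and the `ingest` callable are hypothetical and are not DRI's implementation.

```python
import csv
import io
from pathlib import PurePosixPath


def read_manifest(manifest_text, storage_root):
    """Turn CSV manifest text into a list of pending ingest jobs.

    Paths in the manifest are taken as relative to storage_root.
    """
    root = PurePosixPath(storage_root)
    jobs = []
    for row in csv.reader(io.StringIO(manifest_text)):
        cells = [c.strip() for c in row if c.strip()]
        if not cells:
            continue  # ignore blank rows
        metadata, *assets = cells  # zero or more asset paths per row
        jobs.append({
            "metadata": str(root / metadata),
            "assets": [str(root / a) for a in assets],
            "status": "pending",
        })
    return jobs


def run_jobs(jobs, ingest):
    """Ingest each job in turn, recording a status the UI could display.

    `ingest` is a hypothetical callable taking (metadata_path, asset_paths).
    """
    for job in jobs:
        try:
            ingest(job["metadata"], job["assets"])
            job["status"] = "ingested"
        except Exception as exc:
            job["status"] = f"failed: {exc}"
    return jobs
```

Recording a status per row means a failure in one object does not abort the rest of the batch, and the list of jobs can be polled to refresh the UI.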
Current Attempt
[Diagram: UI, Online Storage and Repository, with steps: select manifest, retrieve remote files, ingest, update status]
• Hydra BrowseEverything
o Gem to access cloud storage
o DropBox, Google Drive…
• User uploads files
• In UI selects collection and manifest to ingest
• Everything handled server side in background
• Can view status in UI
Outstanding Issues
• Online storage
o Dropbox type storage size limits
• Creating a spreadsheet is less easy than arranging a directory structure
• Possible solutions
o Provide online storage
o Has to be per user
o Generate required manifest from uploaded directory structure
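The last idea, generating the manifest from an uploaded directory structure, could be sketched as follows. This is an illustration only: the layout (one sub-directory per object containing `metadata.xml` plus assets), the CSV output format and the function name `manifest_from_directory` are all assumptions, not a description of what DRI built.

```python
import csv
import io
from pathlib import Path


def manifest_from_directory(root):
    """Build CSV manifest text from an uploaded directory tree.

    Each output row holds the metadata path followed by zero or more
    asset paths, all relative to root.
    """
    root = Path(root)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for obj_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        metadata = obj_dir / "metadata.xml"
        if not metadata.is_file():
            continue  # no metadata record, so nothing to ingest
        assets = sorted(
            p for p in obj_dir.iterdir() if p.is_file() and p != metadata
        )
        writer.writerow(
            [p.relative_to(root).as_posix() for p in [metadata, *assets]]
        )
    return out.getvalue()
```

With something along these lines, users could keep arranging files in directories, as they did for the command line client, while the batch ingest still runs server side from a manifest.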