Stuart Kenny; Kathryn Cassidy - Experience with Ingestion of Large Collections at DRI

Date post: 22-Jan-2018

Experience with Ingestion of Large Collections

Stuart Kenny

Research IT, Trinity College Dublin


The Fairy Tales of Charles Perrault. Illustrated by Harry Clarke. Intro. Thomas Bodkin. London: George G. Harrap, [1922]. Internet Archive version of a copy in the New York Public Library. Web. 25 December 2012.

My what a big collection you have!


About DRI (https://repository.dri.ie/)

● DRI is an interactive trusted digital repository for contemporary and historical, social and cultural data held by Irish institutions

● RIA (lead), NUIM, TCD, DIT, NUIG, NCAD

● Partners: academic, cultural, social, government


Outline

• What’s our problem?

• Example collections

• Ingest solutions

• Current ingest process

• Possible future process


Ingesting Objects

• Ingest form

o Suitable for single objects/small collections

o Flat hierarchies

o Simple metadata standards

• Multiple standards

o e.g., MARC, EAD

o XML upload

• How to handle complex standards, many objects?


Example Collection: Clarke Stained Glass

• MODS metadata

• 10,025 objects

• 42 sub-collections

• 20,047 files, 2.82 TB

• Problems:

o Large number of objects

o Data transfer


Example Collection: TCD Children’s Books

• MARC metadata

• 207,889 objects

• 16 sub-collections

• Problems:

o Large number of objects

o Very slow to ingest

o Timeouts and errors


Example Collection: Kilkenny Design Workshop

• EAD metadata

• 2,040 objects

• 2,734 series/files

• 2,231 files, 1.2 GB

• Problems:

o Very complex metadata standard

o Hierarchical structure


EAD, and why I don’t quite hate it as much as I did...

• Single XML file upload

• Structure encoded in metadata

• URLs to files

• But

o One-shot ingest

o How to edit/update?

o Slow to ingest

o Requires a lot of resources


Sufia Batch Upload

• Add multiple files

• New work for each

• Metadata for each work

• How to handle multiple standards?

• Different metadata for each work?


Avalon Batch Ingest

• Ingest package

o Manifest file

o Plus content files

• Manifest file is spreadsheet

o Metadata for items

o Names of content files

• Ingest package uploaded to Avalon DropBox


Approach up to now

• Command line client

o Enter text commands at ‘command prompt’

• Written in Ruby

• Run locally by user

• Metadata and asset files arranged in fixed directory structure

• Client iterates over the directory and creates each object as a single ingest
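
The last two bullets can be illustrated with a minimal sketch. The original client was written in Ruby and its actual directory layout is not given in the slides, so everything here is an assumption: the `metadata/` and `data/` folder names, the rule that assets share a file stem with their metadata file, and the `ingest_one` callback that stands in for a single-object ingest call.

```python
from pathlib import Path


def find_objects(collection_dir):
    """Pair each metadata XML file with the asset files that share its stem.

    Assumed (hypothetical) layout:
        collection_dir/metadata/<id>.xml
        collection_dir/data/<id>.<ext>
    """
    root = Path(collection_dir)
    objects = []
    for metadata in sorted((root / "metadata").glob("*.xml")):
        assets = sorted(
            p for p in (root / "data").glob("*")
            if p.is_file() and p.stem == metadata.stem
        )
        objects.append((metadata, assets))
    return objects


def ingest_collection(collection_dir, ingest_one):
    """Call ingest_one(metadata, assets) once per object and collect failures."""
    failures = []
    for metadata, assets in find_objects(collection_dir):
        try:
            ingest_one(metadata, assets)
        except Exception as exc:  # one failed object should not stop the run
            failures.append((metadata.name, str(exc)))
    return failures
```

Because every object is a separate ingest, a collection of several thousand objects means several thousand round trips, which is consistent with the slowness and error-proneness reported for this approach.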


Problems

• Lack of user familiarity with command line

• Multiple platform support

o i.e., Windows

• Difficulty of installing

• Multiple single ingests

o Slow

o Error prone

• Required lots of user support

• In the end, most ingests were performed by the dev team


Current Attempt

• Web-based UI

• Borrow heavily from Avalon approach

• Upload metadata XML plus assets to online storage

• Add manifest spreadsheet

o Each row contains path to metadata

o Paths to zero or more asset files

o Paths relative to online storage directory

• Backend processes manifest and ingests as background task

• UI updates status
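
As a rough illustration of the manifest step, the sketch below reads a manifest and turns each row into a pending ingest job. It assumes the spreadsheet has been exported as CSV, that the first cell of a row is the metadata path and any further cells are asset paths, and that a `status` field is what the UI would display. None of these details, nor the function names, come from the DRI implementation.

```python
import csv
import io
from pathlib import PurePosixPath


def resolve(storage_root, relative):
    """Join a manifest path onto the storage root, rejecting escapes."""
    rel = PurePosixPath(relative.strip())
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"path must stay inside storage directory: {relative}")
    return PurePosixPath(storage_root) / rel


def read_manifest(csv_text, storage_root):
    """Return one ingest job per row: a metadata path plus zero or more assets."""
    jobs = []
    for row in csv.reader(io.StringIO(csv_text)):
        cells = [c for c in row if c.strip()]
        if not cells:
            continue  # skip blank rows
        metadata, *assets = cells
        jobs.append({
            "metadata": resolve(storage_root, metadata),
            "assets": [resolve(storage_root, a) for a in assets],
            "status": "pending",  # assumed field a UI could poll
        })
    return jobs
```

A background worker could then take jobs from this list one at a time and set `status` as each ingest succeeds or fails, so the user never has to keep a session open for the whole collection.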


Current Attempt

[Diagram. Components: UI, Online Storage, Repository. Steps: select manifest, retrieve remote files, ingest, update status.]


• Hydra BrowseEverything

o Gem to access cloud storage

o DropBox, Google Drive…

• User uploads files

• In UI selects collection and manifest to ingest

• Everything handled server side in background

• Can view status in UI


Outstanding Issues

• Online storage

o Dropbox-type storage size limits

• Creating a spreadsheet is less easy than arranging a directory structure

• Possible solutions

o Provide online storage

o Has to be per user

o Generate required manifest from uploaded directory structure
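
The last suggestion, generating the manifest from an uploaded directory structure, might look something like the sketch below. It is only an illustration: the `metadata/<id>.xml` and `data/<id>.<ext>` naming convention, the CSV output, and the `build_manifest` function are assumptions, not a description of anything DRI has built.

```python
import csv
import io
from pathlib import PurePosixPath


def build_manifest(relative_paths):
    """Build manifest CSV text from paths relative to the upload directory.

    Assumed (hypothetical) convention: metadata/<id>.xml describes an object,
    and every data/<id>.<ext> file is one of its assets.
    """
    paths = [PurePosixPath(p) for p in relative_paths]
    assets_by_id = {}
    for p in paths:
        if p.parts[:1] == ("data",):
            assets_by_id.setdefault(p.stem, []).append(str(p))

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for p in sorted(paths):
        if p.parts[:1] == ("metadata",) and p.suffix == ".xml":
            # One row per object: metadata path first, then its asset paths.
            writer.writerow([str(p)] + sorted(assets_by_id.get(p.stem, [])))
    return out.getvalue()
```

This would let depositors keep arranging files in folders, which the slides report they find easier, while the server still receives the row-per-object manifest that the background ingest expects.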

