Repository Development at LC - Access 2009

Post on 08-Dec-2014

description

Given at Access 2009 in Charlottetown, PEI. Watch video of the actual talk at http://hosting2.epresence.tv/UPEI/1/watch/72.aspx

transcript

Repository Development at LC

Daniel Chudnov - 2009-10-01 - dchud at loc gov

Access 2009 - Charlottetown, PEI

who we are
what we do
what’s next

who we are

30ish people
dev, QA, PM, ops

from libs, uni, industry, etc.

OSI
Office of Strategic Initiatives

“...capture the digital artifact, register and/or deposit it for the Copyright Office, pass it along to those who decide whether to include it in the Library, and allow it to be incorporated digitally in the collection, with the optimum flow-through of information for registration, cataloging, indexing, and preservation.”

(search for “LC21”)

or, to be precise

“capture the digital artifact, register and/or deposit it for the Copyright Office, pass it along to those who decide whether to include it in the Library, and allow it to be incorporated digitally in the collection, with the optimum flow-through of information for registration, cataloging, indexing, and preservation.”

(search for “LC21”)

what we do

“capture the digital artifact”

at scale

world scale
then

web scale

wdl.org

partners all over

the world

content from all over

the world

users all over

the world

wdl.org/ru/

wdl.org/zh/

wdl.org/ar/

launched April 2009

lots of press

9,026 req/s
1.25 Gbit/s
on day one

no crash
just bugs

(yay!)

that was new for LC

how?

solaris
apache
nginx
mysql
solr

django
jquery

clean URIs

static pages

global edge caching

what we do

capture the artifact

pass it along

cataloging, indexing

chroniclingamerica.loc.gov

139,582 title records

1,442,462 pages

freely available now

download whole issues - tell friends - mash it up

100+ TB
16 of 50+ states/terr.
and growing quickly

how?

solaris
apache
mysql
solr

django

clean URIs

page caching

capture the artifact

pass it along

cataloging, indexing, preservation

preservation
storage

“movage”

capture the artifact

BagIt

packing slip for data

data in a Bag

.
|-- bag-info.txt
|-- bagit.txt
|-- data
|   |-- batch.xml
|   |-- batch_1.xml
|   |-- batch_ne_dewitt_rework
|   |   |-- 00206538016_batch.xml
|   |   |-- 00206538028_batch.xml
|   |   `-- sn99021999
|   `-- sn99021999
|       |-- 00206538016
|       |   |-- 0000.jp2
|       |   |-- 0000.pdf
|       |   |-- 0000.tif
|       |   |-- 0000.xml
|       |   |-- 0001.jp2
|       |   |-- 0001.pdf
|       |   |-- 0001.tif
|       |   |-- 0001.xml

identifies a bag

where the data starts

packing slip

.
|-- bag-info.txt
|-- bagit.txt
|-- data
|   |-- batch.xml
|   |-- batch_1.xml
|   |-- batch_ne_dewitt_rework
|   |   |-- 00206538016_batch.xml
|   |   |-- 00206538028_batch.xml
|   |   `-- sn99021999
|   `-- sn99021999
|       |-- 00206538016
|       |   |-- 0000.jp2
|       |   |-- 0000.pdf
|       |   |-- 0000.tif
|       |   |-- 0000.xml
|       |   |-- 0001.jp2
|       |   |-- 0001.pdf
|       |   | ...
|-- manifest-md5.txt
`-- tagmanifest-md5.txt

71607ad119be88c842268a76f0b6b9e9 data/sn99021999/00206538107/1884091301/0621.pdf
c602d2ac07508059ce5f5597e239b97f data/sn99021999/00206538120/1885100601/0831.xml
a59795bd1584532d5cbc0b1d82f75cf8 data/sn99021999/00206538016/1880061401/0593.pdf
3c64fac7e2d49671e0d93908ae42a779 data/sn99021999/00206539616/1888101801/0905.xml
03158a560baa7479b3805d2b45ee02cd data/sn99021999/00206538028/1880111501/0405.tif
fa56ea18580e1446939ed62709e5b2db data/sn99021999/00206538077/1883061901/1145.pdf
bf4fb83ff8305e8256970a3466c1a12d data/sn99021999/00206538120/1885061501/0043.pdf
8f3649fc812de74b9d9443ee90a8ac9c data/sn99021999/00206538120/1885111101/1109.tif
e0b83a7f9ca228271fdaecf6348e1cec data/sn99021999/00206538120/1885101201/0871.xml
1c2f84e12792c123ba0aabedd0c0bbad data/sn99021999/00206538107/1884071401/0197.xml
080e557fe9f68037605e5b80df4bc4ac data/sn99021999/0020653820A/1888050701/0543.tif
532efe32c156459d9d9589caf618f502 data/sn99021999/00206538120/1885071401/0250.tif
ce607af59a96f2656d9448f38ffda072 data/sn99021999/0020653820A/1888052801/0731.pdf
60b626d8fd40aca1b425e86a004bb055 data/sn99021999/00206539628/1888111801/0088.xml
a467cd62350334c7aa83cf1e9056c1c6 data/sn99021999/00206539616/1888091701/0629.jp2
1a434f7a4d843a2c8ffe8d0824fafc3f data/sn99021999/00206538028/1880120801/0482.jp2
22996d89b4a3334256afaddcaa0238d8 data/sn99021999/00206538016/1874102001/0259.jp2
36f550da273ad4c592fee1761c98322a data/sn99021999/00206538016/1880052201/0518.jp2
7f7ccec3f2afae896338498372fd476e data/sn99021999/00206539616/1888080101/0200.pdf
c247a5d74d0e7f857c534d935661adbe data/sn99021999/00206538107/1884072601/0286.jp2
4d497a18a154adcc8636239378ab340b data/sn99021999/00206539628/1889021101/0868.pdf
2e8ca2558b54b5c49b2f20a355a60895 data/sn99021999/00206538065/1882092001/0136.xml
fb71493048e5010100f18012f5060d42 data/sn99021999/00206538028/1880123001/0569.xml
40b100432890b055a5defbfbea815d57 data/sn99021999/00206538107/1884090901/0590.xml
46f6d61480dadc1c988b0baa4de8b6c4 data/sn99021999/00206539628/1888122801/0463.pdf
1cb8af0648e8c9df395b63226fe7371f data/sn99021999/00206538016/1874101501/0244.pdf
9257834023c683b02f354888b2740b8f data/sn99021999/00206539616/1888102301/0956.xml
0d52b3b2b1c5459b7e8d500a8566b0bf data/sn99021999/00206538120/1885080801/0425.tif

defines two things

1

what i think i’m sending you

2

whether you received it

just like a

packing slip
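The "did you receive it" half of the packing slip can be sketched in a few lines of Python. This is a minimal illustration, not LC's tooling: it assumes a bag directory containing a manifest-md5.txt whose lines are "checksum path" as on the manifest slide, and the names md5_of and check_bag are made up for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_bag(bag_dir: Path) -> dict:
    """Compare every manifest-md5.txt entry against the files on disk.

    Returns a dict mapping each listed path to "ok", "missing" or "changed".
    """
    results = {}
    manifest = bag_dir / "manifest-md5.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each line is "<checksum> <path relative to the bag root>".
        expected, relative = line.split(maxsplit=1)
        target = bag_dir / relative
        if not target.is_file():
            results[relative] = "missing"
        elif md5_of(target) != expected.lower():
            results[relative] = "changed"
        else:
            results[relative] = "ok"
    return results


if __name__ == "__main__":
    # Build a throwaway bag: one file present, one listed but never delivered.
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp)
        (bag / "data").mkdir()
        (bag / "data" / "0000.xml").write_bytes(b"<page/>")
        listed = hashlib.md5(b"<page/>").hexdigest()
        (bag / "manifest-md5.txt").write_text(
            f"{listed} data/0000.xml\n{listed} data/0001.xml\n", encoding="utf-8"
        )
        for relative, status in check_bag(bag).items():
            print(status, relative)
```

A fuller check would also flag files under data/ that the manifest does not list; the sketch only covers what the sender said was shipped.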

works across space

works across systems

works across orgs

works across time

easy to make

md5deep
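To show why a manifest is "easy to make", here is a rough Python equivalent of walking a payload directory and emitting one "checksum path" line per file. It is a sketch under assumptions: the function name build_manifest is invented here, the payload is assumed to live under data/ as in the bag layout shown earlier, and it is not a substitute for md5deep or the LC tools.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(bag_dir: Path) -> str:
    """Return manifest-md5.txt text covering every file under bag_dir/data."""
    lines = []
    for path in sorted((bag_dir / "data").rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.md5()
        # Read in chunks so large page images do not need to fit in memory.
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        relative = path.relative_to(bag_dir).as_posix()
        lines.append(f"{digest.hexdigest()} {relative}")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Tiny throwaway payload, just to have something to list.
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp)
        (bag / "data" / "sn99021999").mkdir(parents=True)
        (bag / "data" / "sn99021999" / "0000.xml").write_bytes(b"<page/>")
        (bag / "manifest-md5.txt").write_text(build_manifest(bag), encoding="utf-8")
        print((bag / "manifest-md5.txt").read_text(encoding="utf-8"), end="")
```

Paths are written relative to the bag root with forward slashes, matching the manifest lines on the slide.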

BIL

BagIt Library

bvar@sun9 /ingest/bvar/test $ bag create --dest new_bag test_data/*
12:08:47,044 [main] INFO CommandLineBagDriver : Performing operation: create
2.301112941466272:2.3
12:08:47,141 [main] INFO ManifestImpl : Creating manifest for manifest-md5.txt
12:09:09,493 [main] INFO ManifestImpl : Creating manifest for tagmanifest-md5.txt
12:09:09,511 [main] INFO AbstractBagImpl : Writing bag
12:09:41,507 [main] INFO CommandLineBagDriver : Operation completed.
12:09:41,508 [main] INFO CommandLineBagDriver : Returning 0
bvar@sun9 /ingest/bvar/push/test_bag $ bag isvalid .
11:55:45,582 [main] INFO CommandLineBagDriver : Performing operation: isvalid
11:55:46,378 [main] INFO ManifestImpl : Creating manifest for manifest-md5.txt
11:55:46,458 [main] INFO ManifestImpl : Creating manifest for tagmanifest-md5.txt
11:55:46,540 [main] INFO AbstractBagImpl : Completion check: Result is true.
11:56:21,273 [main] INFO AbstractBagImpl : Validity check: Result is true.
11:56:21,273 [main] INFO CommandLineBagDriver : Result is true.
11:56:21,274 [main] INFO CommandLineBagDriver : Returning 0
bvar@sun9 /ingest/bvar/push/test_bag $

Bagger

free/open source releases from LC

sf.net/projects/loc-xferutils/

get yours today - tell friends - start trading bags

that was new for LC

pass it along

transfer
inventory
workflow

transfer UI - inventory - workflow

how?

apache
spring/mvc
hibernate

mysql

and other automation strategies

lots of work

still to do

lots of integration still to do

register/deposit for

Copyright

not my area, but

we hope to support eDeposit

with these tools

“Deposit Demand”

June 2009
Federal Register

Proposed Rulemaking

stay tuned
or

ask my colleagues :)

(ask me whom to ask)

but, not my area

“allow it to be... incorporated digitally

in the collection”

“allow it to be...

incorporated digitally

in the collection”

how?

traditional approach:

catalog records
exhibit sites

cost of integrating everything

is high

cost of updating everything

is high

cost of consistent web strategies

is low

for example

Linked Data

use URIs as names for things
use HTTP URIs
provide useful information
include links to other URIs

http://www.w3.org/DesignIssues/LinkedData.html

id.loc.gov

LCSH on the web

free

clean URIs

follow your nose

formats

view source

<link rel="alternate" type="application/rdf+xml" href="/authorities/sh00009460.rdf" />
<link rel="alternate" type="text/plain" href="/authorities/sh00009460.nt" />
<link rel="alternate" type="application/json" href="/authorities/sh00009460.json" />
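"Follow your nose" from an HTML page to its other formats can be sketched with nothing but the Python standard library. The class and function names below are invented for illustration, and the base URL http://id.loc.gov/authorities/sh00009460 is an assumption inferred from the identifiers on these slides; no network request is made.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AlternateLinks(HTMLParser):
    """Collect media type -> href for every <link rel="alternate"> element."""

    def __init__(self):
        super().__init__()
        self.alternates = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> elements also arrive here by default.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "alternate" in rel_tokens and href:
            self.alternates[attributes.get("type")] = href


def find_alternates(html: str, base_url: str) -> dict:
    """Return media type -> absolute URL for the page's alternate formats."""
    parser = AlternateLinks()
    parser.feed(html)
    parser.close()
    return {kind: urljoin(base_url, href) for kind, href in parser.alternates.items()}


if __name__ == "__main__":
    page = (
        '<link rel="alternate" type="application/rdf+xml" href="/authorities/sh00009460.rdf" />'
        '<link rel="alternate" type="text/plain" href="/authorities/sh00009460.nt" />'
        '<link rel="alternate" type="application/json" href="/authorities/sh00009460.json" />'
    )
    # Assumed page address, used only to resolve the relative hrefs.
    base = "http://id.loc.gov/authorities/sh00009460"
    for kind, url in find_alternates(page, base).items():
        print(kind, url)
```

The point of the design is that a client needs no site-specific documentation: the page itself advertises where the machine-readable versions live.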

<rdf:RDF>
  <rdf:Description rdf:about="http://id.loc.gov/authorities/sh00009460#concept">
    <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2000-11-27T10:39:57-04:00</dcterms:modified>
    <skos:prefLabel xml:lang="en">National parks and reserves--Prince Edward Island</skos:prefLabel>
    <owl:sameAs rdf:resource="info:lc/authorities/sh00009460"/>
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
    <skos:inScheme rdf:resource="http://id.loc.gov/authorities#conceptScheme"/>
    <skos:inScheme rdf:resource="http://id.loc.gov/authorities#topicalTerms"/>
    <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2000-10-17T00:00:00-04:00</dcterms:created>
    <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh2002010534#concept"/>
    <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh2008004743#concept"/>
    <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh2003002637#concept"/>
    <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh00009458#concept"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://id.loc.gov/authorities/sh2002010534#concept">
    <skos:prefLabel xml:lang="en">Prince Edward Island National Park (P.E.I.) </skos:prefLabel>
  </rdf:Description>

explicit concepts, schema, meaning

a web of data...

...with precise meaning

at this URI is this

concept with this

meaning

a standard way to refer to a heading
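As a rough illustration of "at this URI is this concept with this meaning", the sketch below reads a trimmed copy of the RDF/XML excerpt shown earlier and lists each concept's label and narrower concepts. Two assumptions are worth flagging: the xmlns declarations are added here because the slide excerpt omits them (they are the standard RDF and SKOS namespace URIs), and plain XML parsing is used only to keep the example dependency-free; a real consumer would more likely use an RDF library.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

# Trimmed from the slide; the namespace declarations and closing tag are added.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <rdf:Description rdf:about="http://id.loc.gov/authorities/sh00009460#concept">
    <skos:prefLabel xml:lang="en">National parks and reserves--Prince Edward Island</skos:prefLabel>
    <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh2002010534#concept"/>
    <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh2008004743#concept"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://id.loc.gov/authorities/sh2002010534#concept">
    <skos:prefLabel xml:lang="en">Prince Edward Island National Park (P.E.I.) </skos:prefLabel>
  </rdf:Description>
</rdf:RDF>"""


def describe(rdf_xml: str) -> dict:
    """Map each described URI to its preferred label and narrower concept URIs."""
    root = ET.fromstring(rdf_xml)
    concepts = {}
    for node in root.findall("rdf:Description", NS):
        label = node.find("skos:prefLabel", NS)
        concepts[node.get(RDF_ABOUT)] = {
            "label": label.text.strip() if label is not None and label.text else None,
            "narrower": [n.get(RDF_RESOURCE) for n in node.findall("skos:narrower", NS)],
        }
    return concepts


if __name__ == "__main__":
    for uri, info in describe(SAMPLE).items():
        print(uri, "|", info["label"], "|", len(info["narrower"]), "narrower")
```

Each narrower URI is itself a name that can be looked up the same way, which is what makes the heading a stable, shared reference.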

freely available now

download the whole thing - tell friends - amaze enemies

that was new for LC

another example

<link rel="resourcemap" type="application/rdf+xml" href="/lccn/sn83030214/1905-01-15/ed-1/seq-25.rdf" />
<link rel="alternate" type="image/jp2" href="/lccn/sn83030214/1905-01-15/ed-1/seq-25.jp2" />
<link rel="alternate" type="application/pdf" href="/lccn/sn83030214/1905-01-15/ed-1/seq-25.pdf" />
<link rel="alternate" type="application/xml" href="/lccn/sn83030214/1905-01-15/ed-1/seq-25/ocr.xml" />
<link rel="alternate" type="text/plain" href="/lccn/sn83030214/1905-01-15/ed-1/seq-25/ocr.txt" />

<rdf:Description rdf:about="/lccn/sn83030214/1905-01-15/ed-1/seq-25#page">
  <ore:isDescribedBy rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25.rdf"/>
  <foaf:depiction rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25/thumbnail.jpg"/>
  <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25.jp2"/>
  <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25/ocr.txt"/>
  <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25.pdf"/>
  <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25/ocr.xml"/>
  <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25/thumbnail.jpg"/>
  <rdf:type rdf:resource="http://chroniclingamerica.loc.gov/terms#Page"/>
  <ore:isAggregatedBy rdf:resource="/lccn/sn83030214/1905-01-15/ed-1#issue"/>
  <dcterms:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1905-01-15</dcterms:issued>
  <ndnp:sequence rdf:datatype="http://www.w3.org/2001/XMLSchema#long">25</ndnp:sequence>
  <dcterms:title>New-York tribune. - 1905-01-15 - 25</dcterms:title>
</rdf:Description>

OAI-ORE aggregation

this is a page

it has these files in these formats

it is this sequence number

it is part of this issue

it has this issue date

it has this title
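The clean URIs in the example above follow a visible pattern: title identifier, issue date, edition, page sequence, then a suffix per format. The sketch below only restates that pattern as code; the function name page_uris is invented, and the pattern is inferred from this one example rather than taken from any published guarantee.

```python
from datetime import date


def page_uris(lccn: str, issued: date, edition: int, sequence: int) -> dict:
    """Compose the page URI and its per-format URIs, as seen in the slide example."""
    base = f"/lccn/{lccn}/{issued.isoformat()}/ed-{edition}/seq-{sequence}"
    return {
        "page": base + "#page",
        "issue": f"/lccn/{lccn}/{issued.isoformat()}/ed-{edition}#issue",
        "application/rdf+xml": base + ".rdf",
        "image/jp2": base + ".jp2",
        "application/pdf": base + ".pdf",
        "application/xml": base + "/ocr.xml",
        "text/plain": base + "/ocr.txt",
    }


if __name__ == "__main__":
    # The New-York tribune page from the slide: 1905-01-15, edition 1, page 25.
    for kind, uri in page_uris("sn83030214", date(1905, 1, 15), 1, 25).items():
        print(kind, uri)
```

In practice a client would read these links from the page or its RDF rather than construct them, but the predictability is what makes the URIs easy to cite and cache.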

all explicit concepts

all exposed in the app

on the web

that was new for LC

the web is the API

the

web is the

API

there’s an API doc...

...it’s just a bunch of links

“...make resources

available and

useful...”

from the mission of the Library

“allow it to be...

incorporated digitally

in the collection”

from the LC21 report

“...sustain and preserve a

universal collection...”

from the mission of the Library

each app consistent

about meaning

follow your nose to

concept definitions

in our apps and in yours

distributed conceptual integration

the web is a universal collection

this is a way to incorporate digitally

our digital artifacts on our web

your digital artifacts in your web

our digital artifacts in your web

your digital artifacts in our web

available &

useful &c.

summary

content that scales on the way in

apps that scale on the way out

movage
movage
movage

transfer
inventory
workflow

all in active development

the BagIt spec

try it - it works

free/open source software releases

free data you can use

web of data available and useful

view source:

wdl.org
chroniclingamerica.loc.gov

id.loc.gov
sf.net/projects/loc-xferutils/

dchud at loc gov - @dchud