Metadata Extraction & Web Archives: Automating the Record Creation Process

Date post: 12-Jan-2016
Page 1: Metadata Extraction & Web Archives: Automating the Record Creation Process

Metadata Extraction & Web Archives:
Automating the Record Creation Process

Abbie Grotke / [email protected]
Gina Jones / [email protected]

Library of Congress
Office of Strategic Initiatives
Web Capture Team

Page 2

Library of Congress Web Archives

• Since 2000, 20+ thematic, event-based collections
• 100 TB+ of data collected
• 12,500+ URLs

http://www.loc.gov/lcwa

Page 3

Web Archiving Tools

• Crawling:
– Heritrix
– WARC

• Access:
– Wayback Machine
– NutchWAX

International Internet Preservation Consortium
netpreserve.org

Page 4

LC’s Web Archive Workflow

• Identify & select URLs (LS or LAW)
• Determine crawl strategy, create a seed list for crawling (OSI)
• Sites harvested by Internet Archive or in-house crawlers (OSI)
• Quality Review (OSI & curators)
• Create “catalogers list” (OSI) and XML MODS template (LS) for metadata extraction

Page 5

Describing the Archives

• Collection-level MARC record in OPAC
• Item-level MODS records in LCWA
– One record per recommended URL for each distinct collection

• With so many thousands of URLs to process, how do we streamline the process?

Page 6

XML MODS Template

Page 7

Metadata Extraction

• For each URL that will be cataloged:
– Get archived web site metadata
– Combine with URL Nominations Database metadata
– If elections/campaign web site, metadata also pulled from our candidate Access database (used to create subject terms)

• Using XML template, we add collection and record level metadata

• Create a single file for delivery
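
The steps on this slide can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the Library's actual tooling: the function names (build_record, build_collection), the dictionary keys, and the choice of MODS elements are hypothetical stand-ins for the real XML MODS template. The sketch merges the per-URL sources into one MODS record and wraps all records in a single modsCollection document for delivery.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag):
    """Qualify a tag name with the MODS namespace."""
    return f"{{{MODS_NS}}}{tag}"


def build_record(url, archived, nomination, extra_subjects=()):
    """Merge the per-URL sources into one <mods> element.

    archived       -- metadata taken from the archived site (hypothetical keys)
    nomination     -- metadata from the URL nominations database (hypothetical keys)
    extra_subjects -- e.g. subject terms derived from candidate data
    """
    mods = ET.Element(q("mods"))

    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = archived.get("title", "")

    if archived.get("abstract"):
        ET.SubElement(mods, q("abstract")).text = archived["abstract"]

    # Subject terms from the nominations database come first,
    # followed by any terms supplied from other sources.
    for term in list(nomination.get("subjects", [])) + list(extra_subjects):
        subject = ET.SubElement(mods, q("subject"))
        ET.SubElement(subject, q("topic")).text = term

    location = ET.SubElement(mods, q("location"))
    ET.SubElement(location, q("url")).text = url

    if nomination.get("access_rights"):
        ET.SubElement(mods, q("accessCondition")).text = nomination["access_rights"]
    return mods


def build_collection(records):
    """Wrap all records in one <modsCollection> and serialize to a string."""
    collection = ET.Element(q("modsCollection"))
    collection.extend(records)
    return ET.tostring(collection, encoding="unicode")
```

Calling build_record once per cataloged URL and passing the resulting list to build_collection yields one XML string that could be written to a single delivery file.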

Page 8

Data Sources for Metadata Extraction

Page 9

URL Nominations Database

• URL
• Access Rights
• Language(s)
• Category
• Subject Terms

Page 10

Election Candidate Metadata

• Name
• URL
• Party Affiliation
• State
• Race
• District (House)
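
As a rough illustration of how candidate fields like these might be turned into subject terms, here is a minimal Python sketch. The Candidate class, the subject_terms function, and especially the wording of the generated terms are hypothetical; the slides do not show the format of the terms the Library actually produces.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One row of candidate data, mirroring the fields listed on the slide."""
    name: str
    url: str
    party: str
    state: str
    race: str                        # e.g. "House" or "Senate"
    district: Optional[int] = None   # only meaningful for House races


def subject_terms(c: Candidate) -> List[str]:
    """Derive subject terms from a candidate record (hypothetical term format)."""
    terms = [c.name, c.party, f"{c.race} race, {c.state}"]
    # The district is only added for House races, as on the slide.
    if c.race == "House" and c.district is not None:
        terms.append(f"{c.state} congressional district {c.district}")
    return terms
```

Keeping the derivation in one small function means the term format can be changed in a single place without touching the record-building code.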

Page 11

Archived Web Site Metadata

From 1st capture:
• Document Title
• Keywords
• Abstract
• Mime Types

From Wayback index:
• Capture Dates (First & Last)
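
A minimal Python sketch of this kind of extraction, using only the standard library, is shown here. It assumes that the title comes from the HTML title element, that keywords and abstract come from the "keywords" and "description" meta tags, and that capture dates are available as 14-digit Wayback-style timestamps (YYYYMMDDhhmmss). The names HeadMetadataParser, extract_head_metadata, and capture_range are hypothetical; the slides do not describe how the Library's extraction is implemented.

```python
from datetime import datetime
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and selected <meta name=...> values."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr_map = dict(attrs)
            name = (attr_map.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attr_map.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_head_metadata(html):
    """Return title, keywords and abstract found in an HTML document."""
    parser = HeadMetadataParser()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "keywords": parser.meta.get("keywords", ""),
        # Assumption: the meta description stands in for the abstract.
        "abstract": parser.meta.get("description", ""),
    }


def capture_range(timestamps):
    """Return (first, last) capture dates from 14-digit timestamps."""
    parsed = sorted(datetime.strptime(ts, "%Y%m%d%H%M%S") for ts in timestamps)
    return parsed[0].date(), parsed[-1].date()
```

Note that capture_range expects at least one timestamp; a URL with no captures would need to be handled by the caller.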

Page 12

Combined Data in Template

Page 13

Combined Data in Template

Page 14

Combined Data in Template

