Metadata Extraction & Web Archives: Automating the Record Creation Process


Abbie Grotke / abgr@loc.gov
Gina Jones / gjon@loc.gov

Library of Congress
Office of Strategic Initiatives

Web Capture Team

Library of Congress Web Archives

• Since 2000, 20+ thematic, event-based collections

• 100 TB+ of data collected

• 12,500+ URLs

http://www.loc.gov/lcwa

Web Archiving Tools

• Crawling:
– Heritrix
– WARC
• Access:
– Wayback Machine
– NutchWAX

International Internet Preservation Consortium
netpreserve.org

LC’s Web Archive Workflow

• Identify & select URLs (LS or LAW)
• Determine crawl strategy, create a seed list for crawling (OSI)
• Sites harvested by Internet Archive or in-house crawlers (OSI)
• Quality Review (OSI & curators)
• Create “catalogers list” (OSI) and XML MODS template (LS) for metadata extraction

Describing the Archives

• Collection-level MARC record in OPAC
• Item-level MODS records in LCWA
– One record per recommended URL for each distinct collection

• With so many thousands of URLs to process, how do we streamline the process?

XML MODS Template

Metadata Extraction

• For each URL that will be cataloged:
– Get archived web site metadata
– Combine with URL Nominations Database metadata
– If elections/campaign web site, metadata also pulled from our candidate Access database (used to create subject terms)

• Using the XML template, we add collection-level and record-level metadata

• Create a single file for delivery
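The steps above can be sketched roughly as follows. This is a minimal illustration, not the Library's actual tooling: the template contents, the dictionary keys (`title`, `abstract`, `url`, `subject_terms`), and the function names are all assumptions made for the example. Only the MODS namespace and element names (`mods`, `titleInfo`, `abstract`, `location`, `subject`, `modsCollection`) come from the MODS schema itself.

```python
import copy
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
NS = {"m": MODS_NS}
ET.register_namespace("", MODS_NS)  # write MODS as the default namespace

# Hypothetical template: collection-level values are pre-filled,
# record-level elements are left empty for each URL.
TEMPLATE = f"""
<mods xmlns="{MODS_NS}">
  <titleInfo><title/></titleInfo>
  <abstract/>
  <location><url/></location>
  <relatedItem type="host">
    <titleInfo><title>Example Web Archive Collection</title></titleInfo>
  </relatedItem>
</mods>
"""


def fill_template(template_root, site, nomination):
    """Copy the template and add record-level metadata for one URL."""
    record = copy.deepcopy(template_root)  # leave the template untouched
    record.find("m:titleInfo/m:title", NS).text = site["title"]
    record.find("m:abstract", NS).text = site["abstract"]
    record.find("m:location/m:url", NS).text = nomination["url"]
    for term in nomination["subject_terms"]:
        subject = ET.SubElement(record, f"{{{MODS_NS}}}subject")
        ET.SubElement(subject, f"{{{MODS_NS}}}topic").text = term
    return record


def build_delivery_file(records):
    """Wrap all records in one modsCollection and serialize it."""
    collection = ET.Element(f"{{{MODS_NS}}}modsCollection")
    collection.extend(records)
    return ET.tostring(collection, encoding="unicode")
```

Copying the template for every URL keeps the collection-level values identical across records, while the record-level values come from the per-URL data sources.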

Data Sources for Metadata Extraction

URL Nominations Database

• URL
• Access Rights
• Language(s)
• Category
• Subject Terms

Election Candidate Metadata

• Name
• URL
• Party Affiliation
• State
• Race
• District (House)
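As a rough sketch of how candidate fields like these could be turned into subject terms: the `Candidate` class, the `subject_terms` function, and the term format below are illustrative assumptions only. The slides do not specify how the Library actually forms its subject terms.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One row of candidate metadata (field names are hypothetical)."""
    name: str
    url: str
    party: str
    state: str
    race: str
    district: Optional[str] = None  # only meaningful for House races


def subject_terms(candidate: Candidate) -> List[str]:
    """Derive a simple list of subject terms from one candidate row."""
    office = candidate.race
    # The district applies only to House races, as in the field list above.
    if candidate.race == "House" and candidate.district:
        office = f"{candidate.race}, District {candidate.district}"
    return [candidate.name, candidate.party, candidate.state, office]
```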

Archived Web Site Metadata

From 1st capture:
• Document Title
• Keywords
• Abstract
• MIME Types
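A minimal sketch of pulling the title, keywords, and abstract out of a captured HTML page, using only the Python standard library. It assumes the abstract is taken from the page's `description` meta tag, which the slides do not confirm, and it does not cover MIME types. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and named <meta> tags from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            # Meta names are matched case-insensitively.
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_site_metadata(html):
    """Return title, keyword list, and abstract for one captured page."""
    parser = HeadMetadataParser()
    parser.feed(html)
    raw_keywords = parser.meta.get("keywords", "")
    keywords = [k.strip() for k in raw_keywords.split(",") if k.strip()]
    return {
        "title": parser.title.strip(),
        "keywords": keywords,
        "abstract": parser.meta.get("description", ""),
    }
```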

From Wayback index:

• Capture Dates (First & Last)
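A sketch of deriving the first and last capture dates from index lines. It assumes a simplified, space-separated index layout in which the second field is a 14-digit `YYYYMMDDhhmmss` timestamp and the third is the original URL. Real index files declare their own field order, so the positions used here are an assumption, and `capture_range` is a hypothetical name.

```python
from datetime import datetime


def capture_range(index_lines, url):
    """Return (first_date, last_date) of captures for url, or None."""
    stamps = []
    for line in index_lines:
        fields = line.split()
        # Assumed layout: urlkey, timestamp, original URL, then other fields.
        if len(fields) >= 3 and fields[2] == url:
            stamps.append(datetime.strptime(fields[1], "%Y%m%d%H%M%S"))
    if not stamps:
        return None
    return min(stamps).date(), max(stamps).date()
```

Taking the minimum and maximum avoids relying on the index lines being sorted by timestamp.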

Combined Data in Template
