Wrangling DigiTool Data For LOCKSS
Brian Meuse - Digital Collections Systems Analyst
University Libraries
Boston College
MetaArchive Cooperative Annual Meeting
October 23, 2009
eTD@BC
• Electronic Theses and Dissertations
  – Undergraduate Honors Theses
  – Graduate Level Theses and Dissertations
• Archive and distribute
• Provide global Open Access to content
  – Embargoes when needed
  – No mandate to publish
What happens next?
• ProQuest processes the student's submission
• ProQuest FTPs back
  – Thesis (PDF)
  – Any additional files (third-party permissions)
  – Descriptive metadata
• Once a student uploads to ProQuest, we get the files back within a day.
DigiTool
• Oracle backend
  – Maintains object relationships
  – Stores all associated MetaData (XML)
  – Original filenames
• File storage
  – Simple directories on filesystem
  – Renamed to Unique Identifier (PID)
DigiTool To LOCKSS
• Export ETD files from DigiTool
  – Export function
  – Duplicate data
  – Current ETD collection is ~1GB
  – Bobbie Hanvey, ~30,000 photo-negatives, ~600GB
DigiTool To LOCKSS
• Direct URL links
  – MetaData
  – Objects (viewers for different formats)
• Direct links are not persistent
  – Redirected to a URL with a session id
  – Every node gets a different URL
  – Not good for polling
DigiTool To LOCKSS
• DigiTool API
  – SOAP web service
  – Can query database
  – Retrieve XML
    • MetaData
    • Links to objects
DigiTool To LOCKSS
• createMetaArchiveAU.pl
#!/usr/bin/perl -w
use strict;
use SOAP::Lite;
use FileHandle;
use Getopt::Long;
use LWP::Simple;
use Time::localtime;
use XML::LibXSLT;
use XML::LibXML;
….
DigiTool To LOCKSS
• Query DigiTool …
<x_condition>
  <type>contains</type>
  <element>type</element>
  <value>electronic thesis dissertation</value>
</x_condition>
<x_condition>
  <type>after</type>
  <element>createDate</element>
  <value>FROM</value>
</x_condition>
…
• The XML response is a list of PIDs
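The two query conditions shown on this slide can be assembled programmatically. The following is a minimal sketch in Python, not the Perl of the original script, and it uses only the element names visible on the slide. The function names `make_condition` and `build_conditions` are hypothetical, and the example date is illustrative only: the slides do not show what format DigiTool expects in place of the FROM placeholder.

```python
import xml.etree.ElementTree as ET


def make_condition(cond_type, element, value):
    """Build one <x_condition> with the three children shown on the slide."""
    cond = ET.Element("x_condition")
    ET.SubElement(cond, "type").text = cond_type
    ET.SubElement(cond, "element").text = element
    ET.SubElement(cond, "value").text = value
    return cond


def build_conditions(from_date):
    """Return the ETD type filter plus a createDate lower bound.

    from_date takes the place of the FROM placeholder; its format is an
    assumption, since the slides do not specify it.
    """
    return [
        make_condition("contains", "type", "electronic thesis dissertation"),
        make_condition("after", "createDate", from_date),
    ]


if __name__ == "__main__":
    for cond in build_conditions("2009-01-01"):  # illustrative date only
        print(ET.tostring(cond, encoding="unicode"))
```

Building the elements with an XML library rather than string concatenation means any special characters in a value are escaped automatically.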
DigiTool To LOCKSS
• Retrieve digital entity for each PID
• XML contains
  – All metadata for the object
  – PIDs of related objects
  – Filename and path of the file on the server
DigiTool To LOCKSS
• Metadata
<md shared="false">
  <name>descriptive</name>
  <type>etd-ms</type>
  <value><![CDATA[<thesis>
    <title>The Impact of Pension Policy on Older Adults' Life Satisfaction: an Analysis of Longitudinal Multilevel Data</title>
    <creator>Calvo, Esteban</creator>
    <subject>aging</subject>
    <subject>individualization</subject>
    <subject>life satisfaction</subject>
    <subject>pension policy</subject>
    <subject>redistribution</subject>
    <subject>subjective well-being</subject>
    <publisher>Boston College</publisher>
    <contributor role="advisor">Williamson, John B.</contributor>
    <date>2009</date>
    <type>Electronic Thesis or Dissertation</type>
    <type>text</type>
    <format>application/pdf</format>
    <identifier>http://hdl.handle.net/2345/752</identifier>
    <language>English</language>
    <rights>I hereby allow Boston College to include and preserve my dissertation/thesis in electronic form in the Boston College Institutional Repository, which shall include the right to publicly post my dissertation/thesis on the World Wide Web. I will retain copyright ownership, but I grant to Boston College the non-exclusive right to copy, distribute, and publicly display my dissertation/thesis in any form as may be necessary or convenient in the future as file formats, storage media, and distribution mechanisms evolve.</rights>
    <degree>
      <name>PhD</name>
      <level>Doctoral</level>
      <discipline>Sociology</discipline>
      <grantor>Boston College. Graduate School of Arts & Sciences.</grantor>
    </degree>
  </thesis>]]></value>
</md>
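Because the descriptive record sits inside a CDATA section, reading it takes two parsing passes: one for the `<md>` wrapper and one for the embedded `<thesis>` document. The sketch below is in Python, not the original Perl, and uses a shortened copy of the record above. The function name `parse_descriptive_md` and the choice of fields are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the record shown on the slide.
SAMPLE_MD = """<md shared="false">
  <name>descriptive</name>
  <type>etd-ms</type>
  <value><![CDATA[<thesis><creator>Calvo, Esteban</creator><subject>aging</subject><subject>pension policy</subject><date>2009</date><identifier>http://hdl.handle.net/2345/752</identifier></thesis>]]></value>
</md>"""


def parse_descriptive_md(md_xml):
    """Parse the <md> wrapper, then the thesis XML carried in its CDATA."""
    md = ET.fromstring(md_xml)
    # The parser hands back CDATA content as plain text, so parse it again.
    thesis = ET.fromstring(md.findtext("value"))
    return {
        "md_type": md.findtext("type"),
        "creator": thesis.findtext("creator"),
        "subjects": [s.text for s in thesis.findall("subject")],
        "date": thesis.findtext("date"),
        "identifier": thesis.findtext("identifier"),
    }


if __name__ == "__main__":
    print(parse_descriptive_md(SAMPLE_MD))
```

One caveat: the grantor value as transcribed contains a bare `&`. That is legal inside CDATA, but a strict parser would reject it on the second pass unless it is escaped as `&amp;` first.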
DigiTool To LOCKSS
• Related objects
<relations>
  <relation>
    <type>manifestation</type>
    <pid>106483</pid>
  </relation>
  <relation>
    <type>manifestation</type>
    <pid>108561</pid>
  </relation>
  <relation>
    <type>manifestation</type>
    <pid>108562</pid>
  </relation>
</relations>
DigiTool To LOCKSS
• Filename and path
<stream_ref>
  <file_name>Calvo-Esteban.pdf</file_name>
  <file_extension>pdf</file_extension>
  <mime_type>application/pdf</mime_type>
  <directory_path>/exlibris1/bcd03storage/2009/08/27/file_1/106484</directory_path>
  <file_id>1</file_id>
  <storage_id>1005</storage_id>
  <external_type>-1</external_type>
  <file_size_bytes>349524</file_size_bytes>
</stream_ref>
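The `<stream_ref>` block carries the two values needed later: the original filename and the location of the renamed file on disk. A minimal Python sketch of extracting them, using the element names shown above; `parse_stream_ref` is a hypothetical helper, not part of the original Perl script.

```python
import xml.etree.ElementTree as ET

STREAM_REF = """<stream_ref>
  <file_name>Calvo-Esteban.pdf</file_name>
  <file_extension>pdf</file_extension>
  <mime_type>application/pdf</mime_type>
  <directory_path>/exlibris1/bcd03storage/2009/08/27/file_1/106484</directory_path>
  <file_size_bytes>349524</file_size_bytes>
</stream_ref>"""


def parse_stream_ref(xml_text):
    """Return the original filename, storage path, MIME type and size."""
    ref = ET.fromstring(xml_text)
    return {
        "file_name": ref.findtext("file_name"),
        "storage_path": ref.findtext("directory_path"),
        "mime_type": ref.findtext("mime_type"),
        "size_bytes": int(ref.findtext("file_size_bytes")),
    }


if __name__ == "__main__":
    print(parse_stream_ref(STREAM_REF))
```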
DigiTool To LOCKSS
• Retrieve each related item to get filename and path for those items
<relations>
  <relation>
    <type>manifestation</type>
    <pid>106483</pid>
  </relation>
  <relation>
    <type>manifestation</type>
    <pid>108561</pid>
  </relation>
  <relation>
    <type>manifestation</type>
    <pid>108562</pid>
  </relation>
</relations>
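Walking the related items amounts to pulling each `<pid>` out of `<relations>` and retrieving that entity in turn. The sketch below shows the loop in Python, not the original Perl. `fetch_entity` is a hypothetical stand-in for the SOAP retrieval, whose actual call is not shown in the slides, and it is passed in as a parameter so the example runs without a DigiTool server.

```python
import xml.etree.ElementTree as ET

RELATIONS = """<relations>
  <relation><type>manifestation</type><pid>106483</pid></relation>
  <relation><type>manifestation</type><pid>108561</pid></relation>
  <relation><type>manifestation</type><pid>108562</pid></relation>
</relations>"""


def related_pids(relations_xml, relation_type="manifestation"):
    """List the PIDs of all relations of the given type, in document order."""
    root = ET.fromstring(relations_xml)
    return [
        rel.findtext("pid")
        for rel in root.findall("relation")
        if rel.findtext("type") == relation_type
    ]


def collect_related(relations_xml, fetch_entity):
    """Call fetch_entity(pid) for every related PID and map PID to result.

    fetch_entity is a caller-supplied function (hypothetical here) that
    would retrieve the digital entity XML for one PID.
    """
    return {pid: fetch_entity(pid) for pid in related_pids(relations_xml)}


if __name__ == "__main__":
    # Stub fetcher for demonstration; a real one would query the web service.
    print(collect_related(RELATIONS, lambda pid: f"entity XML for {pid}"))
```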
DigiTool To LOCKSS
• Generate a script that creates the links
  – Symbolic link for AU
  – From manifest web directory to object
ln -s /exlibris1/bcd03storage/2009/08/27/file_1/106484 18640905-20090930/106484/Calvo-Esteban.pdf
• When the file is harvested, it will be given the original filename.
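The link-generation step can be sketched as a function that turns one object's storage path and original filename into an `ln -s` line. This is a Python illustration, not the original Perl. The names `symlink_command` and `build_link_script` are hypothetical, and the `<AU directory>/<PID>/<filename>` layout is inferred from the single command shown on the slide.

```python
import shlex


def symlink_command(storage_path, au_dir, pid, file_name):
    """Build an ln -s line linking the PID-named file to its original name."""
    link_name = f"{au_dir}/{pid}/{file_name}"
    # shlex.quote leaves simple paths untouched and quotes anything unsafe.
    return f"ln -s {shlex.quote(storage_path)} {shlex.quote(link_name)}"


def build_link_script(au_dir, entries):
    """Join one ln -s line per (storage_path, pid, file_name) entry."""
    lines = ["#!/bin/sh"]
    lines += [symlink_command(path, au_dir, pid, name) for path, pid, name in entries]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(symlink_command(
        "/exlibris1/bcd03storage/2009/08/27/file_1/106484",
        "18640905-20090930",
        "106484",
        "Calvo-Esteban.pdf",
    ))
```

With the values from the slide, `symlink_command` reproduces the `ln -s` line shown above. Quoting matters here because original filenames come from student uploads and may contain spaces or shell metacharacters.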
DigiTool To LOCKSS
<relations>
  <relation>
    <type>manifestation</type>
    <pid>106483</pid> <!-- OA Permission -->
  </relation>
  <relation>
    <type>manifestation</type>
    <pid>108561</pid> <!-- Fulltext Index -->
  </relation>
  <relation>
    <type>manifestation</type>
    <pid>108562</pid> <!-- Thumbnail -->
  </relation>
</relations>
DigiTool To LOCKSS
<html xmlns:xb="http://com/exlibris/digitool/repository/api/xmlbeans">
<head>
  <title>Manifest for Calvo, Esteban 2009</title>
</head>
<body>
  <h2>Electronic Theses and Dissertations at Boston College</h2>
  <h3>Manifest for Calvo, Esteban 2009</h3>
  <ul>
    <li><a href="http://dcollections.bc.edu/webclient/DeliveryManager?metadata_request=true&GET_XML=1&pid=106484">Metadata and Relationships</a></li>
    <li><a href="Calvo-Esteban.pdf">ETD PDF</a></li>
    <li><a href="Calvo-Esteban-permission.txt">Permissions/Suppressed file</a></li>
    <li><a href="_106484_pdf_thumbnail.jpg">Thumbnail</a></li>
  </ul>
</body>
</html>
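A manifest page like this one can be produced from the creator, year, and a list of links. The sketch below uses plain string templating in Python. The original script loads XML::LibXSLT, so it may well have used an XSLT transform instead; this is a simplified stand-in, and `build_manifest` is a hypothetical name.

```python
from html import escape


def build_manifest(creator, year, links):
    """Render a manifest page from (href, label) pairs.

    Text and attribute values are HTML-escaped, so the ampersands in the
    DeliveryManager URL come out as &amp; in the href.
    """
    heading = escape(f"Manifest for {creator} {year}")
    items = "\n".join(
        f'    <li><a href="{escape(href)}">{escape(label)}</a></li>'
        for href, label in links
    )
    return (
        "<html>\n"
        f"<head>\n  <title>{heading}</title>\n</head>\n"
        "<body>\n"
        "  <h2>Electronic Theses and Dissertations at Boston College</h2>\n"
        f"  <h3>{heading}</h3>\n"
        f"  <ul>\n{items}\n  </ul>\n"
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    print(build_manifest("Calvo, Esteban", "2009", [
        ("http://dcollections.bc.edu/webclient/DeliveryManager"
         "?metadata_request=true&GET_XML=1&pid=106484",
         "Metadata and Relationships"),
        ("Calvo-Esteban.pdf", "ETD PDF"),
        ("Calvo-Esteban-permission.txt", "Permissions/Suppressed file"),
        ("_106484_pdf_thumbnail.jpg", "Thumbnail"),
    ]))
```

The relative links resolve against the manifest's own directory, which is where the symbolic links with the original filenames are created.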