+ All Categories
Home > Documents > Mining the Blob: There's Gold in the Directory!

Mining the Blob: There's Gold in the Directory!

Date post: 03-Jan-2016
Category:
Upload: kareem-owens
View: 20 times
Download: 1 times
Share this document with a friend
Description:
Mining the Blob: There's Gold in the Directory!. Kathryn Lybarger @ zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014. MARC. Patron searching. Staff searching. Using what?. Though the catalog: was created from raw MARC - PowerPoint PPT Presentation
42
Mining the Blob: There's Gold in the Directory! Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014
Transcript
Page 1: Mining the Blob:  There's Gold in the Directory!

Mining the Blob: There's Gold in the Directory!

Kathryn Lybarger @zemkatELUNA 2014 (Montreal) #ELUNA2014May 1, 2014

Page 2: Mining the Blob:  There's Gold in the Directory!

MARC

Page 3: Mining the Blob:  There's Gold in the Directory!
Page 4: Mining the Blob:  There's Gold in the Directory!

Patron searching

Page 5: Mining the Blob:  There's Gold in the Directory!

Staff searching

Page 6: Mining the Blob:  There's Gold in the Directory!

Using what?

Though the catalog:was created from raw MARCactually contains MARC still

We are using indexesHigh-speed access to common elementsRelationship between elementsSome elements grouped into one indexSome elements modified to be more useful

Page 7: Mining the Blob:  There's Gold in the Directory!
Page 8: Mining the Blob:  There's Gold in the Directory!
Page 9: Mining the Blob:  There's Gold in the Directory!

Modified fields : call number

MFHD_MASTER.DISPLAY_CALL_NOCall number as it would displayPS7 .A6 1937PS1000.A8 G53 1856

MFHD_MASTER.NORMALIZED_CALL_NONormalized spacing so that things sort properlyPS 7 A 6 1937PS 1000 A 8 G 53 1856

Page 10: Mining the Blob:  There's Gold in the Directory!

Grouped elements

Many indexed fields (including all 6XX) are indexed together in BIB_INDEXYou can distinguish between them using the INDEX_CODE

Page 11: Mining the Blob:  There's Gold in the Directory!

Functions

String functionsMID – extract only part of a field

MID(FIELD, start, length)LEFT, RIGHT – extract left or right side

Aggregate functionsMIN, MAX – minimum or maximumCOUNT – how many match?

Page 12: Mining the Blob:  There's Gold in the Directory!

A common conversation…

How do I get this data?Just use the common backbone.

How about this other data?Link in these more obscure tables.

And this data?Use functions to extract and group.

And what about this other thing?Oh… you’ll have to use the BLOB. Sorry.

Page 13: Mining the Blob:  There's Gold in the Directory!
Page 14: Mining the Blob:  There's Gold in the Directory!

The BLOB

Binary Large ObjectA lump of binary data stored as a single entity in a databaseNot indexed into its individual meaningful partsOften, slower than pre-defined indexes

In VoyagerBIB_DATA (and friends)

BIB_ID – record IDSEGMENT – actual binary data (990 bytes)SEQNUM – which segment (record may be longer)

Page 15: Mining the Blob:  There's Gold in the Directory!

BLOB functions

FunctionsGetAuthBlob(AUTH_ID)GetBibBlob(BIB_ID)GetMFHDBlob(MFHD_ID)

Examples:GetField(GetBibBlob([BIB_TEXT].[BIB_ID]),’6’,1)GetSubField(GetFieldRaw(GetBibBlob([BIB_TEXT].[BIB_ID]),’650’,1),’x’,2)

Page 16: Mining the Blob:  There's Gold in the Directory!

MARC (binary) structure

“Binary” is a bit misleadingMostly readable charactersSome control characters

Subfield delimiter, end-of-field, end-of-record

MARC record structureLeaderDirectory

An index to the variable fieldsVariable fields

Page 17: Mining the Blob:  There's Gold in the Directory!

MARC in a text editor

Page 18: Mining the Blob:  There's Gold in the Directory!

MARC leader

First 24 characters of the MARC recordSome of these will always be the same

10 – indicator count (always 2 – ind1, ind2)11 – subfield code length (always 2, $b)

Some vary with the record00-04 – record length12-16 – base address of data

Page 19: Mining the Blob:  There's Gold in the Directory!

A few questions

What is the shortest record in the catalog?

What is the longest record in the catalog?

(are these records any good?)

Page 20: Mining the Blob:  There's Gold in the Directory!

Shortest record (SQL)

MID(BIB_DATA.RECORD_SEGMENT, 1, 5)First five characters of the segmentWe want the smallest one of theseOnly check the first segment of a record

SELECT MIN(MID(RECORD_SEGMENT, 1, 5)) FROM BIB_DATA WHERE SEQNUM=‘1’;

This only gives the lengthWe want the actual shortest record(s)

Page 21: Mining the Blob:  There's Gold in the Directory!

Shortest record

SELECT BIB_ID FROM BIB_DATA WHERE SUBSTR(RECORD_SEGMENT,1,5) = (SELECT MIN(SUBSTR(RECORD_SEGMENT,1,5)) FROM BIB_DATA WHERE SEQNUM='1');

Page 22: Mining the Blob:  There's Gold in the Directory!

GitHub

To avoid filling slides with SQL queries, I’ve made them available in a GitHub repository

I encourage you to try my queries in Access, and let me know what you find in your catalog!

http://github.com/zemkat/Voyager/queries

Page 23: Mining the Blob:  There's Gold in the Directory!

Our smallest unsuppressed record

Page 24: Mining the Blob:  There's Gold in the Directory!

More small unsuppressed…

Page 25: Mining the Blob:  There's Gold in the Directory!

Largest unsuppressed record?

416 links – yikes!

Page 26: Mining the Blob:  There's Gold in the Directory!

MARC directory

Occurs right after the leaderFrom byte 24 to (base address of data) -1

Each variable field has a triplet (12 bytes) in the directory

(three bytes) – tag(next four bytes) – length of field(next five bytes) – starting position

Variable fields have indicators, subfield codes, subfield data

Page 27: Mining the Blob:  There's Gold in the Directory!

How many fields?

Directory length =base data address- 24 (for leader)- 1 (field terminator for leader)- 1 (field terminator for directory)+ 1 (since we start counting at zero)

Directory length / 12 = number of variable fields

A record with this leader has 22 variable fields:01012nam a22002897 4500

Page 28: Mining the Blob:  There's Gold in the Directory!

A few questions

Which record has the most fields?

Which record has the fewest fields?

(are these records any good?)

Page 29: Mining the Blob:  There's Gold in the Directory!
Page 30: Mining the Blob:  There's Gold in the Directory!

All large or small?

Page 31: Mining the Blob:  There's Gold in the Directory!

Size outliers

Too many fieldsLinks to every issue of a serialISBNs for every volumeTitles of every Slovak folk song ever

Too few fieldsProvisional order records that never got overlaidNon-Roman alphabet “[ARABIC CHARACTERS]”Mystery from past catalog

Page 32: Mining the Blob:  There's Gold in the Directory!

Too many problems!

What are the smallest records with call numbers in their holdings?

What are the smallest records with OCLC numbers?

Smallest records by location?

Page 33: Mining the Blob:  There's Gold in the Directory!

Tag Report

Which fields are we using? (how many records does each appear in?)Are we using any that we shouldn’t?

What percentage of our records does each field appear in?

SoftwareRuns on linux, outputs Excel filesOpen source, available on github

Page 34: Mining the Blob:  There's Gold in the Directory!
Page 35: Mining the Blob:  There's Gold in the Directory!

Missing or repeating fields

The Voyager tag table specifies:What is mandatoryWhat may repeat

Tag table can be bypassed:Preference in Cataloging moduleBulk import

How to find missing mandatory fields? (GDC!)Repeating fields that shouldn’t? (GDC!)

Page 36: Mining the Blob:  There's Gold in the Directory!

Triple Nickel

How are we using this tag in Voyager?

Are we using it consistently?

How would this look in our OPAC display?

SoftwareRuns on linux, outputs Excel filesTwo sheets: raw, groupedOpen source, available on github

Page 37: Mining the Blob:  There's Gold in the Directory!

506 report

Grouped to show common usage:

Page 38: Mining the Blob:  There's Gold in the Directory!

Sorted to show variance

Restrict to subscribers.Restricted subscribers.Restricted to subscribers.Restricted to scubscribers.Restricted to subesribers.Restricted to subsacribers.Restricted to subscibers.Restricted to subscribres.

Restricted to subscribrs.Restricted to subsribers.Restricted to suscribers.Restricted to users.Restrictged to subscribers.Restrictricted to subscribersRestrocted to subscribers.Restrticted to subscribers.

Page 39: Mining the Blob:  There's Gold in the Directory!

Record growth

OCLC Bibliographic Notification(now in Worldshare Metadata Collection Manager)Informs you which of your records have changed in OCLCDownload those records

ChallengesAvoid overriding local improvementsToo many records change to look at them all!How much have they changed (for us) ?

Page 40: Mining the Blob:  There's Gold in the Directory!

Record growth

Focus on:Which records have grown the most?(Byte size or number of fields)

Compare size of new record against equivalent in Voyager

Sort this list to only look at large improvements

(software unreleased)

Page 41: Mining the Blob:  There's Gold in the Directory!

EigenRecords

What do our MARC records look like, on average?

What types do we have? How do they vary?

Investigate using principal component analysis of directory data

(in progress)

Page 42: Mining the Blob:  There's Gold in the Directory!

Any questions?

Kathryn Lybarger @[email protected]

GitHub:http://github.com/zemkat/Voyager/

Blogs:http://pc.blog.zemows.orghttp://problem-cataloger.tumblr.comhttp://library-computer.tumblr.com


Recommended