
Another Way to Attack the BLOB:

Date post: 14-Jan-2016
Another Way to Attack the BLOB: Server-side Access via PL/SQL and Perl
Transcript
Page 1: Another Way to Attack the  BLOB:

Another Way to Attack the BLOB:

Server-side Access via PL/SQL and Perl

Page 2: Another Way to Attack the  BLOB:

Why Server-side?

• Your choice of tools to handle queries and generate reports

• Complete programmatic control

• Easier to write complex reports

• No (well, fewer) limitations

• Easier to restrict database access to the masses

Page 3: Another Way to Attack the  BLOB:

Syllabus

• Brief MARC record review

• The BLOB Plan of Attack

• Data Retrieval via PL/SQL

• Required tools for Perl: getting DBD & DBI

• Data Retrieval via Perl

Page 9: Another Way to Attack the  BLOB:

MARC

• MARC is an acronym for MAchine Readable Cataloging.

• It’s a standard format for storing an item’s data.

• It’s machine readable, but not so easy for us humans to read.

• With a bit of practice, a raw MARC record can be parsed by hand.

• However, doing so is about as exciting and satisfying as trying to thread a needle one-handed.

Page 10: Another Way to Attack the  BLOB:

A MARC record’s three pieces:

• Leader

• Directory

• Data

Page 11: Another Way to Attack the  BLOB:

01551nam 22003738a 4500001001300000003000600013005001700019008004100036

010001700077035001800094040001800112043001200130049003000142050002500172

074000900197082001600206086001700222099001700239100001800256245011000274

260011200384300003800496490005400534500016500588500007500753500003400828

500003900862504005200901650004600953650005000999650004901049710002901098

830005001127ocm10726696 OCoLC19961223115432.0840406s1996 dcuab

b f000 0 eng a 84600065 a(GPO)97054409 dGPOdDLCdMvI an-us-az awdoc,sudci3114100999573400aQE611.5.U6bF84 1996 a06 /\/\/\/\/\/\/\/\/\/\/\/\/\/\ skipping part of record here /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ skipping part of record here /\/\/\/\/\/\/\/\/\/\/\/\

turalzArizonazMohave County.2 aGeological Survey (U.S.) 0aGeologic

al Survey professional paper ;v1266.

Partial view of a MARC record

The leader is the fixed 24-character block at the very start of the record, through “4500”.

Page 12: Another Way to Attack the  BLOB:

Partial view of a MARC record (same record as on Page 11)

The directory follows the leader: it is the run of 12-character entries from 001001300000 through 830005001127.

Page 13: Another Way to Attack the  BLOB:

Partial view of a MARC record (same record as on Page 11)

The data is everything after the directory, starting at ocm10726696.

Page 14: Another Way to Attack the  BLOB:

01551nam 22003738a 4500001001300000003000600013005001700019008004100036

010001700077035001800094040001800112043001200130049003000142050002500172

074000900197082001600206086001700222099001700239100001800256245011000274

260011200384300003800496490005400534500016500588500007500753500003400828

500003900862504005200901650004600953650005000999650004901049710002901098

830005001127ocm10726696 OCoLC19961223115432.0840406s1996 dcuab

b f000 0 eng a 84600065 a(GPO)97054409 dGPOdDLCdMvI an-us-az awdoc,sudci3114100999573400aQE611.5.U6bF84 1996 a06 /\/\/\/\/\/\/\/\/\/\/\/\/\/\ skipping part of record here /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ skipping part of record here /\/\/\/\/\/\/\/\/\/\/\/\

turalzArizonazMohave County.2 aGeological Survey (U.S.) 0aGeologic

al Survey professional paper ;v1266.

Dissection of MARC record leader (pertinent details)

• Record length: 01551, the first 5 characters.

• Base address, the offset at which the data starts: 00373.
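A minimal Python sketch of reading those two leader values (the talk's own code is PL/SQL and Perl and is on the handout; this is only an illustration). It assumes the two blanks after "nam" were collapsed to one in the slide text, since a MARC leader is always 24 characters. The function name parse_leader is hypothetical.

```python
# Leader of the sample record. Assumption: the slide text shows a single blank
# after "nam"; the real leader has two, giving the fixed length of 24.
LEADER = "01551nam  22003738a 4500"


def parse_leader(leader: str) -> tuple[int, int]:
    """Return (record_length, base_address) from a 24-character leader."""
    if len(leader) != 24:
        raise ValueError("a MARC leader is exactly 24 characters")
    record_length = int(leader[0:5])    # columns 1-5
    base_address = int(leader[12:17])   # columns 13-17
    return record_length, base_address


print(parse_leader(LEADER))
```

Python slices are 0-based and exclude the end index, so "columns 13-17" becomes leader[12:17].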

Page 15: Another Way to Attack the  BLOB:

Dissection of MARC record directory

01551nam 22003738a 4500001001300000003000600013005001700019008004100036

010001700077035001800094040001800112043001200130049003000142050002500172

01551nam 22003738a 4500   header

How to parse it:

tag  len   offset
001  0013  00000
003  0006  00013
005  0017  00019
008  0041  00036
010  0017  00077
035  0018  00094
040  0018  00112
etc.

Each 12-character “triplet” is associated with one field.

Page 16: Another Way to Attack the  BLOB:

Where in the record does a field’s data start?

01551nam 22003738a 4500001001300000003000600013005001700019008004100036

010001700077035001800094040001800112043001200130049003000142050002500172

01551nam 22003738a 4500   header

tag  len   offset
001  0013  00000
003  0006  00013
005  0017  00019
008  0041  00036
010  0017  00077
035  0018  00094
040  0018  00112
etc.

Where a field’s data starts is determined by adding its offset to the base address.

Data for the first field, tag 001, begins at position 373, tag 003 begins at 386, tag 005 begins at 392, etc.
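The arithmetic above can be sketched in Python as follows (an illustration only, not the handout code). The directory string holds the first seven entries from the slide, the base address 373 comes from the leader, and parse_directory is a hypothetical helper name.

```python
BASE_ADDRESS = 373  # from columns 13-17 of the sample leader

# First seven directory entries of the sample record, as shown on the slide.
DIRECTORY = (
    "001001300000" "003000600013" "005001700019" "008004100036"
    "010001700077" "035001800094" "040001800112"
)


def parse_directory(directory: str):
    """Yield (tag, length, offset) for each complete 12-character entry."""
    usable = len(directory) - len(directory) % 12
    for i in range(0, usable, 12):
        entry = directory[i:i + 12]
        # 3 characters of tag, 4 of length, 5 of offset
        yield entry[0:3], int(entry[3:7]), int(entry[7:12])


for tag, length, offset in parse_directory(DIRECTORY):
    # A field's data starts at base address + offset.
    print(tag, length, offset, BASE_ADDRESS + offset)
```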

Page 17: Another Way to Attack the  BLOB:

01551nam 22003738a 4500001001300000003000600013005001700019008004100036

010001700077035001800094040001800112043001200130049003000142050002500172

074000900197082001600206086001700222099001700239100001800256245011000274

260011200384300003800496490005400534500016500588500007500753500003400828

500003900862504005200901650004600953650005000999650004901049710002901098

830005001127ocm10726696 OCoLC19961223115432.0840406s1996 dcuab

b f000 0 eng a 84600065 a(GPO)97054409 dGPOdDLCdMvI an-us-az awdoc,sudci3114100999573400aQE611.5.U6bF84 1996 a06 /\/\/\/\/\/\/\/\/\/\/\/\/\/\ skipping part of record here /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ skipping part of record here /\/\/\/\/\/\/\/\/\/\/\/\

turalzArizonazMohave County.2 aGeological Survey (U.S.) 0aGeologic

al Survey professional paper ;v1266.

Partial view of a raw MARC record, data section

The “box characters” that show up when a raw record is displayed are the MARC format’s binary separator characters.

Page 18: Another Way to Attack the  BLOB:

01551nam 22003738a 4500001001300000003000600013005001700019008004100036010001700077035001800094040001800112043001200130049003000142050002500172074000900197082001600206086001700222099001700239100001800256245011000274260011200384300003800496490005400534500016500588500007500753500003400828500003900862504005200901650004600953650005000999650004901049710002901098830005001127<TAG>ocm10726696 <TAG>OCoLC<TAG>19961223115432.0<TAG>840406s1996 dcuab b f000 0 eng <TAG> <SUB>a 84600065 <TAG> <SUB>a(GPO)97054409<TAG> <SUB>dGPO<SUB>dDLC<SUB>dMvI<TAG> <SUB>an-us-az<TAG> <SUB>awdoc,sudc<SUB>i31141009995734<TAG>00<SUB>aQE611.5.U6<SUB>bF84 1996<TAG> <SUB>a06/\/\/\/\/\/\/\/\/\/\/\/\/\/\ skipping part of record here /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ skipping part of record here /\/\/\/\/\/\/\/\/\/\/\/\tural<SUB>zArizona<SUB>zYavapai County.<TAG>0<SUB>aGeology, Structural<SUB>zArizona<SUB>zMohave County.<TAG>2 <SUB>aGeological Survey (U.S.)<TAG> 0<SUB>aGeological Survey professional paper ;<SUB>v1266.<TAG><EOR>

Partial view of a raw MARC record, data section

The MARC format uses the following characters:

<TAG>  hex 1e  tag delimiter
<SUB>  hex 1f  subfield delimiter
<EOR>  hex 1d  end of record indicator
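As a rough Python illustration of how these three characters carve up the data section (not the handout code), the sketch below splits the tail of the sample record, with the slide's placeholders replaced by the real control characters. The names FIELD_END, SUBFIELD, RECORD_END and split_fields are assumptions made for this example.

```python
FIELD_END = "\x1e"   # shown as <TAG> on the slide
SUBFIELD = "\x1f"    # shown as <SUB>
RECORD_END = "\x1d"  # shown as <EOR>

# Tail of the sample record's data section.
sample = (
    "2 \x1faGeological Survey (U.S.)\x1e"
    " 0\x1faGeological Survey professional paper ;\x1fv1266.\x1e\x1d"
)


def split_fields(data: str) -> list[list[str]]:
    """Split a data section into fields, and each field into its pieces.

    The first piece of each field holds the indicators; every later piece
    starts with its one-character subfield code.
    """
    data = data.rstrip(RECORD_END)
    return [field.split(SUBFIELD) for field in data.split(FIELD_END) if field]


for pieces in split_fields(sample):
    print(pieces)
```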

Page 24: Another Way to Attack the  BLOB:

Programmer’s MARC format review

• Get the record length from the 1st 5 columns.

• Get the data base-address from columns 13-17.

• Parse through the directory for the desired field by looking at the 1st 3 columns of each tag’s 12-character “triplet”. Get the tag’s length (next 4 columns) and offset (last 5 columns of the “triplet”).

• Read the tag’s data by adding the tag’s offset to the record’s base address and then, starting at that position, reading as many columns as the tag’s length.

• Make sure the position you’re reading from is not beyond the end of the record.

Beware of the common “off by 1” error. Depending on the language you’re using, you could be off by 1 in either direction regarding your position within the record.
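The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the handout code: build_record is a hypothetical helper that fabricates a tiny two-field record so the example is self-contained, get_field is a hypothetical name, and lengths are counted in characters, which matches byte counts only for plain ASCII data.

```python
FIELD_END, RECORD_END = "\x1e", "\x1d"


def build_record(fields):
    """Build a tiny MARC-style record from (tag, data) pairs (demo helper)."""
    directory, body, offset = "", "", 0
    for tag, data in fields:
        data += FIELD_END
        directory += f"{tag}{len(data):04d}{offset:05d}"
        body += data
        offset += len(data)
    base = 24 + len(directory) + 1          # leader + directory + terminator
    total = base + len(body) + 1            # + end-of-record character
    leader = f"{total:05d}nam  22{base:05d}8a 4500"
    return leader + directory + FIELD_END + body + RECORD_END


def get_field(record, wanted):
    """Return the data of the first field with tag `wanted`, or None."""
    record_length = int(record[0:5])        # columns 1-5
    base = int(record[12:17])               # columns 13-17
    # Directory entries sit between the leader and the terminator at base - 1.
    for pos in range(24, base - 1, 12):
        tag = record[pos:pos + 3]
        length = int(record[pos + 3:pos + 7])
        offset = int(record[pos + 7:pos + 12])
        start = base + offset               # 0-based index in Python
        if start + length > record_length:
            raise ValueError(f"field {tag} runs past the end of the record")
        if tag == wanted:
            return record[start:start + length].rstrip(FIELD_END)
    return None


record = build_record([("001", "ocm10726696 "), ("003", "OCoLC")])
print(get_field(record, "003"))
```

On the off-by-1 warning: Python string indexes start at 0, so base + offset is used directly, whereas a 1-based function such as PL/SQL's SUBSTR needs base + offset + 1.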

Page 26: Another Way to Attack the  BLOB:

The BLOB Plan of Attack

• Voyager’s BLOB data is stored the same way for the Auth, Bib, and Mfhd data tables.

table_data (where “table” is auth, bib, or mfhd)

table_id   record_segment   seqnum

Page 27: Another Way to Attack the  BLOB:

The BLOB Plan of Attack

table_data (where “table” is auth, bib, or mfhd)

table_id   record_segment   seqnum

A MARC record is typically stored entirely in one row in the table. Records longer than the record_segment size have to be stored in more than one row.

Page 28: Another Way to Attack the  BLOB:

The BLOB Plan of Attack

table_data (where “table” is auth, bib, or mfhd)

table_id   record_segment   seqnum

Each table_id is unique to an item’s record. However, if more than one row makes up a record, those rows share the same table_id. In that case, we’ll have seqnum = 1, 2, 3, etc., for that record.

Page 29: Another Way to Attack the  BLOB:

The BLOB Plan of Attack

auth_id   record_segment   seqnum
635406    MARC data        1

An example of a record contained completely in one row.

This record is ready to be processed after extraction from the record_segment.

Page 30: Another Way to Attack the  BLOB:

The BLOB Plan of Attack

auth_id   record_segment   seqnum
635406    MARC data        1
635406    MARC data        2
635406    MARC data        3

This longer record is spread across 3 rows.

Assemble the MARC record by concatenating MARC data in seqnum order:

MARC record = record_segment (seqnum 1) + record_segment (seqnum 2) + record_segment (seqnum 3)

This record is then ready to be processed.
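A minimal Python sketch of that concatenation (illustration only). The segment strings are made-up placeholders standing in for real record_segment values, and assemble is a hypothetical name.

```python
def assemble(rows):
    """Join record_segment values in ascending seqnum order.

    rows: iterable of (seqnum, record_segment) pairs for ONE record,
    in any order.
    """
    ordered = sorted(rows, key=lambda row: row[0])
    return "".join(segment for _, segment in ordered)


# Placeholder segments for auth_id 635406, deliberately out of order.
rows = [(2, "-middle-"), (3, "end"), (1, "start")]
print(assemble(rows))
```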

Page 32: Another Way to Attack the  BLOB:

PL/SQL Example

The example code retrieves a few MARC records, and displays them on the screen in human-readable format, along with some diagnostics.

(The code examined in the following slides starts on Page 2 of the handout.)

Page 33: Another Way to Attack the  BLOB:

PL/SQL Example

Use a cursor to retrieve data.

Also declare necessary variables in this section.

Page 34: Another Way to Attack the  BLOB:

PL/SQL Example

Open the cursor and start looping through the rows.

Page 35: Another Way to Attack the  BLOB:

PL/SQL Example

Get a row from the cursor into the program variables

Page 36: Another Way to Attack the  BLOB:

PL/SQL Example

Assemble the MARC record. The typical record fits into one row, so seqnum = 1 and we skip the loop.

Page 37: Another Way to Attack the  BLOB:

PL/SQL Example

For a longer, multi-segment record (from an earlier example), we first have seqnum = 3 and put that segment into marc. Then we have seqnum = 2 and PREPEND that segment to marc. Finally we exit the loop because seqnum is now 1, and the last statement here takes care of that segment.

Page 38: Another Way to Attack the  BLOB:

PL/SQL Example

Why go “backwards” in assembling a MARC record?

If we predicate the segment-to-MARC-record assembly on when the auth_id changes in our loop structure, once it changes we've gone too far and can't go back to get the last segment to completely assemble the now previous record.

It’s simpler to predicate looping on seqnum in reverse order because there will always be a seqnum of 1.

If there are multiple segments, we'll always end with a seqnum of 1 and still be on the same auth_id and can go on processing the record.

This reasoning is not for PL/SQL only, although that is “where” the idea came from.
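A minimal Python sketch of the backwards assembly (an illustration of the idea, not the handout's PL/SQL). It assumes the rows arrive ordered by id ascending and seqnum descending; the function name records and the one-letter segments are hypothetical.

```python
def records(rows):
    """Yield (record_id, marc) pairs from ordered rows.

    rows: iterable of (record_id, record_segment, seqnum), ordered by
    record_id ascending and seqnum descending, so every record ENDS on
    seqnum 1.
    """
    marc = ""
    for record_id, segment, seqnum in rows:
        marc = segment + marc      # prepend: later segments arrive first
        if seqnum == 1:            # seqnum 1 always exists and comes last
            yield record_id, marc
            marc = ""


rows = [
    (635406, "C", 3),
    (635406, "B", 2),
    (635406, "A", 1),
    (635407, "X", 1),
]
for record_id, marc in records(rows):
    print(record_id, marc)
```

Because the record is complete the moment seqnum 1 arrives, there is no need to look ahead for a change of id or to special-case the final record.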

Page 39: Another Way to Attack the  BLOB:

PL/SQL Example

Now that we have a MARC record, let’s get the record length and data base-address. We set our pointer to the start of the directory and start looping through the directory.

Page 40: Another Way to Attack the  BLOB:

PL/SQL Example

As we loop through the directory, we read the tag id, its length, and its offset in the data part. The actual tag address where we get the data is the data base-address plus the offset.

Page 41: Another Way to Attack the  BLOB:

PL/SQL Example

In the last line here, the subfield delimiters (hex 1f = dec 31) are replaced by the vertical bar character “|” for better readability.

Page 42: Another Way to Attack the  BLOB:

PL/SQL Example

Along with the subfield delimiter substitution, we add some spacing to further increase readability.

Thus, instead of

0aPetroleumxDrilling fluids

we get

0|a Petroleum |x Drilling fluids

for tag data.
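One way to get that output, sketched in Python (illustration only; the handout's PL/SQL may do it differently). The input places the invisible hex 1f delimiter where the slide's raw example shows nothing, and format_field is a hypothetical name.

```python
SUBFIELD = "\x1f"  # subfield delimiter, hex 1f = dec 31


def format_field(data: str) -> str:
    """Show each subfield as '|<code> <text>', separated by single spaces."""
    indicators, *subfields = data.split(SUBFIELD)
    shown = [f"|{sub[0]} {sub[1:]}" for sub in subfields if sub]
    return indicators + " ".join(shown)


raw = "0\x1faPetroleum\x1fxDrilling fluids"
print(format_field(raw))
```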

Page 43: Another Way to Attack the  BLOB:

PL/SQL Example

Page 44: Another Way to Attack the  BLOB:

PL/SQL Example

Now we can output the tag’s data. Output is broken into 80 character chunks to get around the 255 character limit of dbms_output and for better readability.
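The chunking itself is simple; a Python sketch for illustration (the handout does this in PL/SQL with dbms_output). The function name chunks is hypothetical, and the width of 80 is the figure from the slide.

```python
def chunks(text: str, width: int = 80) -> list[str]:
    """Split text into consecutive pieces of at most `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


field_data = "x" * 200  # stand-in for a long tag's data
for piece in chunks(field_data):
    print(piece)
```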

Page 45: Another Way to Attack the  BLOB:

PL/SQL Example

We’re done with this tag, so we move on to the next tag in the directory. At the end, close loops and clean up.

End looping for directory traversal

End looping for cursor

Don’t forget that this ending character is required for your PL/SQL code to run!

Page 46: Another Way to Attack the  BLOB:

PL/SQL Example

Demo…

example.pls

Page 48: Another Way to Attack the  BLOB:

Additional tools required for Perl to talk to Oracle:

• DBI, the generic DataBase Interface software.

• DBD, the specific DataBase Driver, for Oracle in our case.

Page 49: Another Way to Attack the  BLOB:

Getting and installing DBI and DBD

Point your browser to:

http://www.cpan.org/authors/id/TIMB/

Complete the above URL with

“DBD-Oracle-1.12.tar.gz” to get the DBD software

“DBI-1.20.tar.gz” to get the DBI software

Page 50: Another Way to Attack the  BLOB:

Getting and installing DBI and DBD

• gunzip each file.

• un-tar each file.

• READ the instructions!

• Installation takes 4 or 5 steps and requires you to be root.

• If you don’t have root access, or if you’re uncomfortable doing any of this, seek out your SysAdmin for assistance.

Page 52: Another Way to Attack the  BLOB:

Perl Example

The following real-world example lets you retrieve an arbitrary range of MARC records from your choice of Auth, Bib, or Mfhd.

Output goes to <stdout>, and can be raw MARC data, or formatted for human readability.

(The code examined in the following slides starts on Page 5 of the handout.)

Page 53: Another Way to Attack the  BLOB:

Perl Example

Must pull in DBI stuff.

Handle program arguments and show how to use it if necessary.

Page 54: Another Way to Attack the  BLOB:

Perl Example

Here we create the database connection and assign its context to a database handle. We need to specify the type of database (Oracle), the name of the machine to which we’re connecting, the SID, and the username and password.

Page 55: Another Way to Attack the  BLOB:

Perl Example

We saw this query in the PL/SQL example. Here we build the query statement, inserting the program arguments where needed. This allows this query to work with any MARC table type and an arbitrary table_id range.
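A hedged Python sketch of building such a statement (the actual Perl query is on the handout and is not reproduced in these slides). The column and table names follow the table_id / table_data pattern and the ordering shown in the large-table pseudocode later in the talk; build_query and ALLOWED_TABLES are hypothetical names.

```python
ALLOWED_TABLES = {"auth", "bib", "mfhd"}


def build_query(table: str, first_id: int, last_id: int) -> str:
    """Build the select for one MARC table type and an inclusive id range."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"table must be one of {sorted(ALLOWED_TABLES)}")
    first_id, last_id = int(first_id), int(last_id)  # reject non-numeric ids
    return (
        f"select {table}_id, record_segment, seqnum "
        f"from {table}_data "
        f"where {table}_id between {first_id} and {last_id} "
        f"order by {table}_id asc, seqnum desc"
    )


print(build_query("bib", 1, 100))
```

Table names cannot be passed as bind variables, which is why the table argument is checked against a fixed list before it is spliced into the text.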

Page 56: Another Way to Attack the  BLOB:

Perl Example

Create the query context and assign it to a statement handle.

Then execute the statement and receive a return code.

Page 57: Another Way to Attack the  BLOB:

Perl Example

This is how we get rows from the result set of the query, via the statement handle. The three columns in the row fall into the list of three variables.

Page 58: Another Way to Attack the  BLOB:

Perl Example

Raw output:

On record transition, output the MARC record we just built, reset the ID variable, and store the MARC data for the record we just started reading.

If on the same record, keep on storing MARC data.

After the loop, output the last record.

Page 59: Another Way to Attack the  BLOB:

Perl Example

Formatted (not raw) output:

On record transition, store the accumulated MARC record and start building a new one; otherwise just prepend to the present MARC record.

After the loop, store the last record.

(We’re effectively building a MARC file in memory, a virtual file, in the $marcstuff variable.)

Page 60: Another Way to Attack the  BLOB:

Perl Example

Release the resources associated with the statement handle and the database handle.

Page 61: Another Way to Attack the  BLOB:

Perl Example

Executing this part for formatted, readable output

MARC data contains no CR-LFs; instead it uses the hex 1d character to delimit the end of a MARC record. Create the array of MARC records here.
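The split on hex 1d can be sketched in Python like this (illustration only; the talk does it in Perl on the $marcstuff variable). The name split_records and the placeholder record contents are assumptions.

```python
RECORD_END = "\x1d"  # end-of-record indicator


def split_records(marcstuff: str) -> list[str]:
    """Split an in-memory MARC 'file' into records, keeping each terminator."""
    return [rec + RECORD_END for rec in marcstuff.split(RECORD_END) if rec]


# Two placeholder records back to back, as they would sit in the virtual file.
marcstuff = "first record" + RECORD_END + "second record" + RECORD_END
for record in split_records(marcstuff):
    print(len(record))
```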

Page 62: Another Way to Attack the  BLOB:

Perl Example

Executing this part for formatted, readable output

Start looping through the array of MARC records.

Page 63: Another Way to Attack the  BLOB:

Perl Example

Executing this part for formatted, readable output

We get and output the leader, and then get the record length and the data base-address. Then we position ourselves at the start of the directory.

Page 64: Another Way to Attack the  BLOB:

Perl Example

Executing this part for formatted, readable output

Loop through the directory

Page 65: Another Way to Attack the  BLOB:

Perl Example

Executing this part for formatted, readable output

Get the tag id, its length, and its offset. Then read the tag’s data. The actual tag address where we get the data is the data base-address plus the offset.

Page 66: Another Way to Attack the  BLOB:

Perl Example

Executing this part for formatted, readable output

Now do some formatting for readability. We substitute the vertical bar character “|” for the subfield delimiter, and remove the other delimiters.

Page 67: Another Way to Attack the  BLOB:

Perl Example

Executing this part for formatted, readable output

Output the tag’s parameters, and the data. Then go to the next tag in the directory.

Page 68: Another Way to Attack the  BLOB:

Perl Example

Executing this part for formatted, readable output

End of program stuff. Close loops and show count of records output.

Page 69: Another Way to Attack the  BLOB:

Perl Example

Demo…

example.pl

Page 70: Another Way to Attack the  BLOB:

Perl

• PROBLEM: if you’re reading the entire table, you can still run into problems with too much data at one time.

• SOLUTION: process your data in small chunks.

• Dividing the table into chunks of about 50,000 rows has worked very well for us.

• The following method has proven useful:

Page 71: Another Way to Attack the  BLOB:

Perl

Large Table Solution in a Nutshell

• This example uses the BIB_DATA table.

In your setup section:

set db_increment to 50,000
set max_bib_id to highest bib_id from table
set beginning_bib_id to 0
set ending_bib_id to db_increment

Page 72: Another Way to Attack the  BLOB:

Perl

Large Table Solution in a Nutshell

This outer loop goes through the entire table:

while beginning_bib_id <= max_bib_id
    call chunkthrudb
    set beginning_bib_id to (ending_bib_id + 1)
    increment ending_bib_id by db_increment
end while

Page 73: Another Way to Attack the  BLOB:

Perl

Large Table Solution in a Nutshell

This inner loop goes through db_increment-sized chunks:

sub chunkthrudb
    select bib_id, record_segment, seqnum
    from bib_data
    where bib_id >= beginning_bib_id
      and bib_id <= ending_bib_id
    order by bib_id asc, seqnum desc

    build the MARC record and call processrec
end sub

Page 74: Another Way to Attack the  BLOB:

Perl

Large Table Solution in a Nutshell

sub processrec
    process the MARC record as needed
end sub
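The chunk boundaries from the outer loop can be sketched in Python as a generator (illustration only; the talk's code is Perl). It assumes each chunk's query treats both bounds as inclusive (bib_id >= begin and bib_id <= end); with an exclusive upper bound and a next start of ending_bib_id + 1, the ids sitting exactly on a chunk boundary would never be selected. The name id_ranges is hypothetical.

```python
def id_ranges(max_id: int, increment: int = 50_000):
    """Yield inclusive (begin, end) id ranges covering 0..max_id, no gaps."""
    begin, end = 0, increment
    while begin <= max_id:          # <= so an id equal to `begin` is not lost
        yield begin, end
        begin = end + 1             # next chunk starts right after this one
        end += increment


for begin, end in id_ranges(120_000):
    print(begin, end)
```

Each pair would be handed to the chunk query, and every record built from that chunk passed on for processing.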

Page 75: Another Way to Attack the  BLOB:

Perl

Large Table Solution in a Nutshell

Page 8 of the handout has a diagram illustrating this process.

Page 76: Another Way to Attack the  BLOB:

Questions?

Email: [email protected]: 616.387.3885

Thanks for listening.

