Crushing, Blending, and Stretching Transactional Data
Data Warehousing and Mining Data from Voyager and Other Library and University Systems for Assessment of Library Operations
VALE Users’/NJLA CUS/NJ ACRL Conference
Busch Campus Center, Rutgers University, Piscataway, New Jersey, Friday, January 8, 2010
Ray Schwartz, Systems Specialist Librarian
Cheng Library, William Paterson University, Wayne, New Jersey, USA
schwartzr2 @ wpunj.edu
2
Outline
• Assessment and Why Now?
• What is Data Mining and Data Warehousing and Why Do We Do It?
• Our Library and University
• Groups and Services
• Steps
• Reporting
3
Recent Extent of Assessment
• ILSs collect transactional data for circulation and allocation of collection funds.
• ILL and Document Delivery services supply general transactional data.
• Reports from other vendor services
– Bibliographic utilities
– Subscription agents
– Book jobbers
• Many other ways of collecting transactional data
– Gate counts
– Reference transaction counts
– Reshelving counts
7
What is different now?
• Most ILSs have search and web server logs
• Most (if not all) Full-Text Databases have usage reports
• Link Resolver logs
• Proxy Server logs
• Google Scholar Library Links
8
What would we like to see?
• Breakdowns by department and majors.
• Combined usage by department/majors of more than one library service.
9
What is Data Mining and Data Warehousing?
• Extracting data from legacy systems and other resources;
• cleaning, scrubbing and preparing data for decision support;
• maintaining data in appropriate data stores;
• accessing and analysing data using a variety of end user tools;
• and mining data for significant relationships.
• Chaffey, D., Mayer, R., Johnston, K., & Ellis-Chadwick, F. (2002). Internet Marketing: Strategy, Implementation and Practice (2nd ed.). Financial Times/ Prentice Hall.
10
• The primary purpose of these efforts is to provide easy access to specifically prepared data that can be used with decision support applications such as management reports, queries, decision support systems, executive information systems and data mining.
• Chaffey, D., Mayer, R., Johnston, K., & Ellis-Chadwick, F. (2002). Internet Marketing: Strategy, Implementation and Practice (2nd ed.). Financial Times/ Prentice Hall.
11
Of course there are many ways to measure
– Scott Nicholson’s
Measurement Model
12
Measurement Matrix with methodologies
Perspective: Internal (Library System)
• Topic: Library System – Procedures and Standards
– Staff survey and interviews
– Audits of collections, systems, or staff
• Topic: Use – Recorded interactions with interface & materials
– Bibliomining
– Transaction/Web Log Analysis
– Observation of User Behavior
Perspective: External (User)
• Topic: Library System – Usability
– Effectiveness of the system for the staff and institution
• Topic: Use – Knowledge states and User citations to materials
– How useful is the library system?
– Focus groups, User Citation tracking
Nicholson, Scott (2004). A conceptual framework for the holistic measurement and cumulative evaluation of library services. Journal of Documentation 60(2), pp. 164-181.
13
Our University
• 9000 undergraduates
• 1000 graduates (mostly education majors)
• 400 faculty
• 800 adjuncts
• 1000 staff
14
Our Library
• 19 librarians and 26 library staff
• 350,000 volumes
• 18,000 audiovisual items
• 22,000 print and electronic periodicals
• 100 general and subject specific databases
15
Our Systems since 2005
• Voyager ILS
• Online Periodical Database (OPD)
• Clio ILL Software
• EZProxy Server
• Banner – University ERP
• University Networked Drive K:
• University Email Server
• University Web Server
16
Systems Chart – ca. 2005
[Diagram of the systems and their current relationships. Legend: internal-only WPUNJ server; externally accessible WPUNJ server; non-WPUNJ server.]
• Integrated Library System (Voyager): Patrons, Searches, Circulation, Media Scheduling
• Online Periodicals Database: DBMS, Scripting Language
• Banner (University ERP System): SIS, HRS
• Web Server (www.wpunj.edu, Scripting Language): ILL Form, Page, ER Micro Form, Serials Form
• Proxy Server (EZProxy Log): Off Campus Dbase Hits & ILL Form
• University Networked Drive K: ILL (Cliodata), Patrons, Materials
• University Email Server
• Serials Solutions A to Z
• OCLC – Bibliographic Utility: WorldCat, ILL
• Other Vendors' Database Services & Usage Reports
17
Vendor Services
• Serials Solutions
• OCLC – Bibliographic Utility
• Blackwell – Book Jobber
• Ebsco – Subscription Agent
• Marcive – Authority Control
• Database Vendors
18
The Question
Which categories of patrons are accessing which services?
19
First Step – Patron Statistical Categories
20
• Voyager Patron Database allows a maximum of 10 statistical categories per patron record.
• Decide which statistical categories are
needed for each patron group defined.
• Work with your University Information Systems Department to extract the relevant data from the relevant sources.
21
Groups and Services
Groups
• Major
• Status
– Undergrad or Grad
– Faculty, Adjunct Faculty or Staff
• Department
• College
• Degree
• No. of Credits
• Year of Study
• Campus Location
Services
• Circulation
– Books
– Media
– Reserve
– By Fund Code
– Location
• ILL / Document Delivery
• Databases
• Library Web Pages
– Subject Area Resource Guides
– Reference Requests
• Catalog
• Other Vendor Services
– Serials Solutions
22
History Department - 12 months - Feb. 2008
PATRON STATUS           BOOK CIRC  MEDIA CIRC  EQUIP CIRC  TOTAL CIRC  MEMBERS  BORROWERS  % BORROWING  CIRC/MEMBER  CIRC/BORROWER
UNDERGRADUATE STUDENTS      2,715         250         698       3,663      238        186          78%        15.39          19.69
GRADUATE STUDENTS             419          13          76         508       14         13          93%        36.29          39.08
ADJUNCT FACULTY               100          65          20         185       32         20          63%         5.78           9.25
FULL-TIME FACULTY             159         115         194         468       24         23          96%        19.50          20.35
HISTORY TOTALS              3,393         443         988       4,824      308        242          79%        15.66          19.93
LIBRARY TOTALS             23,370       8,713      20,703      52,756    7,418      4,981          67%         7.11          10.59
DEFINITIONS:
BOOK CIRCULATION = books, book disks, maps, oversize, Curriculum materials, reserve books, NJ History, Leisure Lounge
MEDIA CIRCULATION = audio & video materials, including media reserves
EQUIPMENT CIRCULATION = camcorders, overhead & data projectors, laptops, easels, DVD players, etc.
MEMBER = declared major or department member
BORROWER = any member who borrowed materials
LIBRARY TOTAL = declared undergrad & grad majors, adjuncts & full time faculty borrowers
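The derived columns in the table follow from the raw counts: total circulation is the sum of book, media and equipment circulation, % borrowing is borrowers over members, and the two ratios divide total circulation by members and by borrowers. A minimal sketch of that arithmetic, assuming the rounding shown on the slide (whole percent, two decimals); the class and method names are illustrative and not taken from the original reports:

```python
from dataclasses import dataclass


@dataclass
class CircRow:
    """One row of the circulation table (raw counts only)."""
    status: str
    book: int
    media: int
    equip: int
    members: int
    borrowers: int

    @property
    def total(self) -> int:
        # TOTAL CIRC = BOOK CIRC + MEDIA CIRC + EQUIP CIRC
        return self.book + self.media + self.equip

    def pct_borrowing(self) -> int:
        # Share of members who borrowed anything, as a whole percent.
        return round(100 * self.borrowers / self.members)

    def circ_per_member(self) -> float:
        return round(self.total / self.members, 2)

    def circ_per_borrower(self) -> float:
        return round(self.total / self.borrowers, 2)


# Counts copied from the UNDERGRADUATE STUDENTS row of the slide.
undergrads = CircRow("UNDERGRADUATE STUDENTS", 2715, 250, 698, 238, 186)
print(undergrads.total, undergrads.pct_borrowing(),
      undergrads.circ_per_member(), undergrads.circ_per_borrower())
# prints: 3663 78 15.39 19.69
```

The same calculation reproduces the other History rows, which is a quick way to check a reconstructed table against its source counts.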
23
Problems with Configuration of Services
• Little to no linkage of data
• Need to search multiple services to get complete picture of serial holdings
• Multiple user IDs for authentication
24
Retirement of the OPD
• Serials holdings data was extracted from the OPD and added to Voyager catalog
• From Voyager catalog, serials holdings data is extracted and added to Serials Solutions A to Z list
25
Email Reports from the ILS
26
Voyager Overdue and Fine Notices - Daily
27
Quarterly Extract for Serials Solutions A to Z Service
28
Authentication of ILL and other forms is routed through the EZProxy server
29
New Services Added
• Serials Solutions MARC Record Service
• Serials Solutions Link Resolver
• OCLC Worldcat Collection Analysis
30
Second Step – Set Up an Application Server
31
Our Systems in 2008
• Voyager ILS
• Shared Application Server
• Clio ILL Software
• EZProxy Server
• Banner – University ERP
• University Networked Drive K:
• University Email Server
• University Web Server
32
Systems Chart - 2008
[Diagram of the systems and their current relationships. Legend: internal-only WPUNJ server; externally accessible WPUNJ server; non-WPUNJ server.]
• Integrated Library System (Voyager): Patrons, Searches, Circulation, Media Scheduling
• Banner (University ERP System): SIS, HRS
• Application Server (Scripting Language, Web Server, DBMS): Usage by Off Campus Dbase Patron Groups, ILL Patrons/Materials Requested, ILL Patrons/Materials Received
• Web Server (www.wpunj.edu, Scripting Language): ILL Form, Page, ER Micro Form, Serials Form
• Proxy Server (EZProxy Log): Off Campus Dbase Hits & ILL Form
• University Networked Drive K: ILL (Cliodata), Patrons, Materials
• University Email Server
• Serials Solutions: A to Z, MARC Records, Link Resolver
• OCLC – Bibliographic Utility: WorldCat, ILL, WCA
• Other Vendors' Database Services & Usage Reports
33
What is an Application Server?
• A machine or its software that works in conjunction with a web server to deliver application services such as the dynamic creation of a webpage from content stored in a database. From http://www.webtools.ca.gov/help/Glossary.asp
• Web Server Software (Apache or IIS)
• Database Management System – DBMS (MySQL, Oracle, MS SQL Server)
• Scripting Language (Perl, PHP, ColdFusion, ASP)
34
Why an Application Server?
• Relevant data in logfiles needs to be in a database to be analyzed.
• Need your own DBMS to create new tables and queries.
35
• Decide how you will use the Application Server.
• Decide on the best and most plausible configuration.
36
Daily and Weekly Email Reports from the Application Server
Circ Fines Audit Daily Report - Daily at 6:05 AM.
Dupe Patron Record Report - Daily at 5:56 AM.
Hobart Media Services Equipment Pickup Summary - Daily at 6:58 AM.
Media Service Scheduling Rooms Report - Daily at 6:02 AM.
Media Services Equipment Pickup Summary - Daily at 7:00 AM.
Received Title Alert - Daily at 6:59 AM.
Reserves Overdues - Daily at 5:59 AM.
Scheduled LIS Tasks - Daily at 6:00 AM.
ILL Borrowing Overdues Report - Weekly at 5:59 AM.
ILL Lending Reports - Weekly at 6:15 AM.
37
Monthly Email Reports from the Application Server
Circ Fines Audit - Monthly at 6:10 AM.
Circulation by Location and Item Type - Monthly at 6:21 AM.
Circulation Lost and Paid - Monthly at 6:25 AM.
Circulation Online Renewal Count - Monthly at 6:30 AM.
Media Circulation - Monthly at 6:35 AM.
Reserve Circulation - Monthly at 6:40 AM.
38
39
On Demand Reports
40
Lending Services Reports
Lists of patrons with fines between $10 and $19.99
• Student and Alumni fines list - Sorted by either Name, Amount or Notice Date.
• PALS and Courtesy Patron fines list - Sorted by Name.
• All other Patron fines list - Sorted by Name.
Lists of patrons with fines over $19.99
• Student and Alumni fines list - Sorted by either Name, IID, Amount, Notice Date or Notes.
• PALS and Courtesy Patron fines list - Sorted by Name.
• VALE Patron fines list - Sorted by Name.
• All other Patron fines list - Sorted by Name.
Lists of patrons with overdues older than 30 days
• Student and Alumni overdues list - Sorted by either Name, IID or Notes.
• PALS and Courtesy Patron overdues list - Sorted by Name.
• All other Patron overdues list except VALE - Sorted by Name.
41
Lending Services Reports, cont.
Lists of VALE patrons with overdues older than 6 months
• VALE patron overdues list - Sorted by Name.
Miscellaneous Reports
• Patrons with the word "Collection Agency" or "CA" in their notes.
• Patrons with the word "FINE" in one of their notes.
• Patrons with the word "SOILS" in their notes.
• Patrons with the word "FALL07 SOILS" in their notes.
• Patrons with the word "HOLD" in their notes.
• Combined list of HOLD, FINE, and CA.
Circulation Reports by Item Type from 2003 to the present
• All Staff.
• All Colleges.
• Undergraduates by Major.
• Graduates by Major.
• Patrons that have reached a total fine balance of $10 or more after 31-Dec-2009 and 30-Nov-2009
42
One of Our Projects
• Mining EZProxy logfiles and linking to patron statistical categories from the Voyager Patron Database
– What majors and departments are accessing which database services?
– What majors and departments are accessing the ILL services?
43
Systems Chart - 2008
[Diagram of the systems and their current relationships, now showing the analyses produced on the application server. Legend: internal-only WPUNJ server; externally accessible WPUNJ server; non-WPUNJ server.]
• Integrated Library System (Voyager): Patrons, Searches, Circulation, Media Scheduling
• Banner (University ERP System): SIS, HRS
• Application Server (Scripting Language, Web Server, DBMS): Usage by Off Campus Dbase Patron Groups, ILL Patrons/Materials Requested, ILL Patrons/Materials Received
• Analyses: ILL Collection and Patron Group Analyses; Off Campus Database Hits by Patron Group
• Web Server (www.wpunj.edu, Scripting Language): ILL Form, Page, ER Micro Form, Serials Form
• Proxy Server (EZProxy Log): Off Campus Dbase Hits & ILL Form
• University Networked Drive K: ILL (Cliodata), Patrons, Materials
• University Email Server
• Serials Solutions: A to Z, MARC Records, Link Resolver
• OCLC: WorldCat, ILL, WCA
• Other Vendors' Database Services & Usage Reports
44
ILL request form authentications by major
Book Count  Major
        90  M- History
        28  M- Non-Degree
        25  M- Pub Pol & Intl Affairs
        20  M- Spanish
        18  M- English
        16  M- Undecided
        14  M- Art
        14  M- Education
        11  M- Sociology
        10  M- Biology
         9  M- Music
         9  M- Special Programs
         8  M- Psychology
         7  M- Biotechnology
         7  M- Political Science
         6  M- Anthropology
         6  M- Music - Jazz Studies
         4  M- Business
         4  M- Communication
         4  M- Nursing

Article Count  Major
           62  M- Psychology
           60  M- Sociology
           42  M- Applied Clinical Psych
           35  M- Education
           31  M- History
           30  M- Spanish
           29  M- Nursing
           19  M- Communication Disorders
           19  M- Communication
           14  M- Biotechnology
           14  M- Counseling
           14  M- English
           12  M- Non-Degree
           10  M- Community/Sch Health
            7  M- Biology
            7  M- Political Science
            6  M- Undecided
            5  M- Comm Media Studies
            5  M- Reading
            4  M- Business
45
Which Databases are Accessed by Majors and Departments?
46
By Major and Host
Major                       Count  Host
M- Nursing                   3377  ebscohost.com
M- Non-Degree                3010  ebscohost.com
M- Psychology                2303  ebscohost.com
M- Counseling                1487  ebscohost.com
M- Communication             1359  ebscohost.com
M- Education                 1267  ebscohost.com
M- Business                  1246  proquest.umi.com
M- Sociology                 1152  ebscohost.com
M- Business                  1145  lexis-nexis.com
M- Undecided                 1100  ebscohost.com
M- Applied Clinical Psych    1075  ebscohost.com
M- English                   1034  ebscohost.com
M- Sociology                  916  csa.com
M- Business                   794  ebscohost.com
M- Accounting                 738  lexis-nexis.com
M- Reading                    683  ebscohost.com
M- Physical Education         653  ebscohost.com
M- Special Programs           600  ebscohost.com
M- Non-Degree                 463  ereserve.wpunj.edu
47
By Dept and Host
Department                Count  Host
S- Information Systems      933  webscript.exe?fs.scr
S- Psychology Dept.         742  ebscohost.com
S- Accounting and Law       559  lexis-nexis.com
S- Political Sci Dept.      308  lexis-nexis.com
S- Nursing Dept.            204  ebscohost.com
S- Market & Mgt. Dept.      175  proquest.umi.com
S- Library                  167  ebscohost.com
S- Sociology Dept.          151  ebscohost.com
S- Sociology Dept.          134  csa.com
S- History Dept.            121  serials.abc-clio.com
S- Exercise & Mov Sci       110  ebscohost.com
S- Political Sci Dept.      104  ebscohost.com
S- Library                  103  ILL_article.cfm
S- Library                  100  webscript.exe?fs.scr
S- History Dept.             94  webscript.exe?fs.scr
48
By Dept and Service
Department                Count  Service
S- Information Systems      933  http://www.wpunj.edu/scripts/webscript.exe?fs.scr
S- Accounting and Law       549  http://www.lexis-nexis.com/universe
S- Psychology Dept.         364  http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=psych
S- Nursing Dept.            114  http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=c8h
S- Sociology Dept.           96  http://www.csa.com/htbin/dbrng.cgi?&db=socioabs-set-c&adv=1
S- Sociology Dept.           75  http://search.ebscohost.com/login.asp?profile=asp
S- Philosophy Dept.          74  http://webspirs4.silverplatter.com:8900/c119646?sp.form.first.p=srchmain.htm&sp.dbid.p=S(PHIL
S- Library                   65  http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=asp
S- Anthropology Dept.        62  http://www.sciencedirect.com/
S- History Dept.             61  http://serials.abc-clio.com/active/start?_appname=serials&initialdb=AHL
S- Psychology Dept.          61  http://search.ebscohost.com/login.asp?profile=psyart
S- History Dept.             58  http://serials.abc-clio.com/active/start?_appname=serials&initialdb=HA
S- Psychology Dept.          54  http://search.ebscohost.com/login.asp?profile=psych
S- Psychology Dept.          42  http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=psyart
S- English Dept.             42  http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=mzh
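Counts like the ones in these tables can be produced by grouping proxy hits on the patron group and the host of the requested URL. The following is a minimal sketch in Python, not the queries actually used for the slides. It assumes each hit has already been linked to a statistical category, and the function name and record layout are hypothetical. The hosts on the slides appear shortened by a rule the slides do not state, so this sketch keeps the full hostname.

```python
from collections import Counter
from urllib.parse import urlsplit


def hits_by_group_and_host(records):
    """Count hits per (patron group, host) pair.

    `records` is an iterable of (group, url) tuples, where `group` is a
    statistical category such as a major or department.
    """
    counts = Counter()
    for group, url in records:
        # urlsplit(...).hostname is None when the string has no host part.
        host = urlsplit(url).hostname or "(no host)"
        counts[(group, host)] += 1
    return counts


# Illustrative records only; the URLs are taken from the slide above.
sample = [
    ("S- Psychology Dept.",
     "http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=psych"),
    ("S- Psychology Dept.",
     "http://search.ebscohost.com/login.asp?profile=psyart"),
    ("S- Anthropology Dept.", "http://www.sciencedirect.com/"),
]

for (group, host), count in hits_by_group_and_host(sample).most_common():
    print(f"{group:24} {count:5} {host}")
```

Grouping on the full URL instead of the hostname gives the finer "By Dept and Service" breakdown.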
49
Admin VLANs
Vlan ID  Vlan Name
      2  Servers
      4  Admin
      5  Science
      6  Test Servers
      7  NAS
    101  Energy Management
    102  Diebold
    104  Xerox
    150  Media Services
    161  Dorms Offices
    162  RBI
    163  Police
    164  Maintenance

Labs VLANs
Vlan ID  Vlan Name
      3  Lab Servers
      9  Imaging
    160  Lib Labs
    174  STU VPN
    175  Ben Shahn Lab
    178  Hobart Lab
    179  SCI Lab
    187  CS Lab
    192  Atrium
    209  Labs
    212  Resnet Labs
    214  Raub Labs
    228  VR Labs

IP Address Location = 149.151.VlanID.*
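Because the third octet of an on-campus address is the VLAN ID, an on-campus hit can be assigned to a location by a simple lookup. A minimal sketch under that rule, using the VLAN names from the slide; the function name is hypothetical, and addresses outside 149.151.*.* are treated as off campus:

```python
import ipaddress

# VLAN ID -> name, copied from the Admin and Labs tables on the slide.
VLAN_NAMES = {
    2: "Servers", 4: "Admin", 5: "Science", 6: "Test Servers", 7: "NAS",
    101: "Energy Management", 102: "Diebold", 104: "Xerox",
    150: "Media Services", 161: "Dorms Offices", 162: "RBI",
    163: "Police", 164: "Maintenance",
    3: "Lab Servers", 9: "Imaging", 160: "Lib Labs", 174: "STU VPN",
    175: "Ben Shahn Lab", 178: "Hobart Lab", 179: "SCI Lab",
    187: "CS Lab", 192: "Atrium", 209: "Labs", 212: "Resnet Labs",
    214: "Raub Labs", 228: "VR Labs",
}


def vlan_for(ip):
    """Return (vlan_id, vlan_name) for a campus address, else None.

    Applies the rule 149.151.VlanID.* from the slide; VLAN IDs that are
    not in the table are reported with the name "unknown".
    """
    octets = ipaddress.IPv4Address(ip).packed  # four bytes, one per octet
    if octets[0] != 149 or octets[1] != 151:
        return None  # not a campus address under the stated rule
    vlan_id = octets[2]
    return vlan_id, VLAN_NAMES.get(vlan_id, "unknown")


print(vlan_for("149.151.160.25"))  # (160, 'Lib Labs')
print(vlan_for("192.0.2.1"))       # None
```

Counting hits per returned VLAN name would give a breakdown of on-campus database use by location.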
50
FY08/09 On Campus Hits to Databases by Class C IP Address
51
Some concerns
Patron Privacy and Standards
52
Using Voyager as the model for Patron Privacy
53
• Active Circ transactions are stored in a table with patron ID and statistical categories.
• Completed Circ transactions are stored in a table without the patron ID, but still with the patron statistical categories.
• The Patron Table contains the total counts of transactions for each patron, but no link to which transactions they are.
54
• EZProxy transactions would be stored in one table with patron statistical categories, but without the user ID.
• User IDs would be stored in another table with counts for each service divided by academic year.
• Logs are collected, loaded, and deleted monthly.
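The two-table design described above can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for the MySQL DBMS on the application server, and the table names, column names, and the `record_hit` function are hypothetical. The academic year is passed in by the caller, since the slides do not say how it is derived.

```python
import sqlite3

SCHEMA = """
-- Transactions keep the statistical category but not the user ID.
CREATE TABLE proxy_hits (
    hit_time      TEXT NOT NULL,
    service       TEXT NOT NULL,
    stat_category TEXT NOT NULL
);
-- Per-user totals keep the user ID but no individual transactions.
CREATE TABLE user_counts (
    user_id       TEXT NOT NULL,
    service       TEXT NOT NULL,
    academic_year TEXT NOT NULL,
    hits          INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id, service, academic_year)
);
"""


def open_store():
    """Create an in-memory database with the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record_hit(conn, user_id, stat_category, service, hit_time, academic_year):
    """Store one hit without linking the transaction to the user."""
    conn.execute(
        "INSERT INTO proxy_hits (hit_time, service, stat_category) "
        "VALUES (?, ?, ?)",
        (hit_time, service, stat_category),
    )
    key = (user_id, service, academic_year)
    # Create the counter row on first use, then increment it.
    conn.execute(
        "INSERT OR IGNORE INTO user_counts (user_id, service, academic_year) "
        "VALUES (?, ?, ?)",
        key,
    )
    conn.execute(
        "UPDATE user_counts SET hits = hits + 1 "
        "WHERE user_id = ? AND service = ? AND academic_year = ?",
        key,
    )
    conn.commit()
```

This mirrors the Voyager model on the previous slide: totals per patron survive, but nothing in `proxy_hits` says which patron made a given request.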
55
Example of EZProxy log entry
• IP address: nj.dhcp.embarqhsd.net
• (Not used): -
• User ID: theuser
• Date/time: 1/1/2008 4:25:15 AM
• Method: GET
• Page retrieved: http://ezproxy.wpunj.edu:2048/connect?session=sGHMbeSss121YxZa&url=http://www.wpunj.edu/scripts/webscript.exe?fs.scr
• Version: HTTP/1.1
• Response code: 302
• No. of bytes: 537
• Referring URL: http://ezproxy.wpunj.edu:2048/login?url=http://www.wpunj.edu/scripts/webscript.exe?fs.scr
• User agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)
56
Perl Script for loading ezproxy log into MySQL
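use strict;
use warnings;

# Map month abbreviations in the log timestamp to two-digit numbers.
my %month = (Jan=>'01', Feb=>'02', Mar=>'03', Apr=>'04', May=>'05', Jun=>'06',
             Jul=>'07', Aug=>'08', Sep=>'09', Oct=>'10', Nov=>'11', Dec=>'12');

# Four leading fields, [dd/Mon/yyyy:hh:mm:ss zone], "method url version",
# response code, bytes, "referring URL", "user agent".
my $pattern = '^(\S*) (\S*) (\S*) (\S*) '.
              '\[(..)\/(...)\/(....):(..):(..):(..) .....\]'.
              ' "(\S*) (\S*) (\S*)" '.
              '(\d*) (-|\d*) "([^"]*)" "([^"]*)"';

while (<>) {
    if (m/$pattern/) {
        # Escape single quotes in the URL, referrer and user agent.
        my ($tgt, $ref, $agt) = (esc($12), esc($16), esc($17));
        # A "-" byte count is stored as NULL.
        my $byt = $15 eq '-' ? 'NULL' : $15;
        print "INSERT INTO ezproxylogs VALUES ('$1','$2','$3',".
              " TIMESTAMP '$7/$month{$6}/$5 $8:$9:$10','$11','$tgt',".
              "'$13',$14,$byt,'$ref','$agt');\n";
    } else {
        # MySQL needs a space after "--" for the line to be a comment.
        print "-- Skipped line $.\n";
    }
}

# Double any single quotes so the value is safe inside an SQL string literal.
sub esc {
    my ($p) = @_;
    $p =~ s/'/''/g;
    return $p;
}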
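For comparison, the same parsing step can be sketched in Python. This is not the script used in the project. It assumes the same log layout as the Perl pattern (four leading fields, a bracketed timestamp, the quoted request, response code, byte count, and quoted referrer and user agent), and the function name and sample line are hypothetical. It returns the fields as a dictionary so they can be passed to a parameterized INSERT rather than escaped by hand.

```python
import re

MONTHS = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04", "May": "05",
          "Jun": "06", "Jul": "07", "Aug": "08", "Sep": "09", "Oct": "10",
          "Nov": "11", "Dec": "12"}

LOG_PATTERN = re.compile(
    r'^(\S*) (\S*) (\S*) (\S*) '
    r'\[(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2}) [+-]\d{4}\] '
    r'"(\S*) (\S*) (\S*)" '
    r'(\d{3}) (-|\d+) "([^"]*)" "([^"]*)"'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    if m is None or m.group(6) not in MONTHS:
        return None
    (host, ident, user, _unused, day, mon, year, hh, mm, ss,
     method, url, version, status, nbytes, referrer, agent) = m.groups()
    return {
        "host": host,
        "ident": ident,
        "user": user,
        "timestamp": f"{year}-{MONTHS[mon]}-{day} {hh}:{mm}:{ss}",
        "method": method,
        "url": url,
        "version": version,
        "status": int(status),
        # "-" means no byte count was logged.
        "bytes": None if nbytes == "-" else int(nbytes),
        "referrer": referrer,
        "agent": agent,
    }


# Hypothetical sample line in the layout the pattern expects.
SAMPLE = ('192.0.2.10 - theuser - [01/Jan/2008:04:25:15 -0500] '
          '"GET http://www.example.org/db HTTP/1.1" 302 537 '
          '"http://ezproxy.example.org/login" "Mozilla/4.0"')

print(parse_line(SAMPLE)["timestamp"])  # 2008-01-01 04:25:15
```

As in the Perl script, the fourth leading field is matched but not stored.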
57
Created table to assist the linking
SELECT PATRON_ADDRESS.ADDRESS_TYPE,
       Left([ADDRESS_LINE1], InStr([ADDRESS_LINE1], "@") - 1) AS usr,
       PATRON_ADDRESS.PATRON_ID,
       PATRON_ADDRESS.ADDRESS_STATUS,
       PATRON_ADDRESS.EFFECT_DATE,
       PATRON_ADDRESS.EXPIRE_DATE,
       PATRON_ADDRESS.MODIFY_DATE,
       PATRON_ADDRESS.MODIFY_OPERATOR_ID
INTO emailprefix
FROM PATRON_ADDRESS
WHERE (((PATRON_ADDRESS.ADDRESS_TYPE)="3"));
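The query above keeps the part of each type "3" address before the "@" as `usr`, which can then be matched against the user ID recorded in the EZProxy log. A minimal sketch of that linking step, assuming the proxy user ID equals the email prefix; the function names and sample rows are hypothetical, and lowercasing the prefix is an added assumption, not something the query does:

```python
def email_prefix(address):
    """Return the part of an email address before "@", or None."""
    prefix, sep, _domain = address.partition("@")
    if not sep or not prefix:
        return None  # no "@" or nothing before it
    return prefix.lower()  # assumption: user IDs compare case-insensitively


def build_prefix_index(patron_addresses):
    """Map email prefix -> patron ID for email-type addresses.

    `patron_addresses` is an iterable of (patron_id, address_type,
    address_line1) tuples, mirroring the PATRON_ADDRESS columns used in
    the query. Only address type "3" rows are kept, as in the WHERE clause.
    """
    index = {}
    for patron_id, address_type, address_line1 in patron_addresses:
        prefix = email_prefix(address_line1)
        if address_type == "3" and prefix is not None:
            index[prefix] = patron_id
    return index


# Hypothetical rows for illustration.
rows = [
    (1001, "3", "theuser@example.edu"),
    (1002, "1", "300 Example Road"),
    (1003, "3", "no-at-sign"),
]
index = build_prefix_index(rows)
print(index.get("theuser"))  # 1001
```

With the patron ID in hand, the statistical categories for a log entry can be looked up in the patron data.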
58
Immediate Tasks
• SIS/HRS extracts to import into MySQL DBMS on the application server.
– To be able to store more statistical categories.
• Export Patron SIF from MySQL into Voyager Patron database.
59
Systems Chart - 2010
[Diagram of the systems and their current relationships. Legend: internal-only WPUNJ server; externally accessible WPUNJ server; non-WPUNJ server.]
• Integrated Library System (Voyager): Patrons, Searches, Circulation, Media Scheduling
• Banner (University ERP System): SIS, HRS
• Application Server (Scripting Language, Web Server, DBMS): Patrons, Usage by Off Campus Dbase Patron Groups, ILL Patrons/Materials Requested, ILL Patrons/Materials Received
• Analyses: ILL Collection and Patron Group Analyses; Off Campus Database Hits by Patron Group
• Web Server (www.wpunj.edu, Scripting Language): ILL Form, Page, ER Micro Form, Serials Form
• Proxy Server (EZProxy Log): Off Campus Dbase Hits & ILL Form
• University Networked Drive K: ILL (Cliodata), Patrons, Materials
• University Email Server
• Serials Solutions: A to Z, MARC Records, Link Resolver
• OCLC: WorldCat, ILL, WCA
• Other Vendors' Database Services & Usage Reports
60
Reporting and Standards
• Reporting
– Emailed periodically - e.g., daily dossiers, and other event triggered reports.
– On demand, via email, web pages or a printer.
• Standards
– Share data for comparative research.
– Groups of libraries and consortia
61
Questions?
Ray Schwartz, Systems Specialist Librarian
Cheng Library, William Paterson University,
Wayne, New Jersey, USA
schwartzr2 @ wpunj.edu