Digital Libraries: An Aid to Education through Interoperable
Open Archives of Resources
U. Kentucky, February 24, 2000
Edward A. Fox
fox@vt.edu http://fox.cs.vt.edu
CC CS DLRL Internet TIC
Virginia Tech, Blacksburg, VA, USA
Acknowledgements (Selected)
Sponsors: ACM, Adobe, IBM, Microsoft, NSF, OCLC, US Dept. of Education, …
Co-PIs: Marc Abrams, Robert Akscyn, John Eaton, Gail McMillan
Students: Fernando Das Neves, Robert France, Neill Kipp, Paul Mather, Constantinos Phanouriou, James Powell, Ohm Sornil, David Watkins, Chang Zhang, Jianxin Zhao
Remember!
VT (education and technology)
PetaPlex, Envision, MARIAN, NRGDL, 5S (to understand and build DLs)
CSTC, CRIM (add to, use) -> NSDL
OAI (convention, meetings, proposals)
Virginia Tech Background
Largest university in Virginia, land-grant, town population 35K plus 25K students
Blacksburg Electronic Village, since 1992, with 80% of community on Internet
Net.Work.Virginia, largest ATM network, with over 750 sites, for education, research, government
LMDS, Local Multipoint Distribution Service, gigabit wireless networking - 1/3 of Virginia
Math Emporium, 500 workstations
Faculty Development Initiative, round 2
Supporting Authors (Teachers and Learners)
Faculty Development Initiative
ETD Support
Virginia Tech Digital Library
University Libraries
Classifying/Cataloging/Preserving
Collaboration
Visualization
MM
IR
EPub
HCI
Model Classroom of the 21st Century
Technology Showcase ATM Video Conf. Develop MM
New Media Center
Dig. Library & Archives
McBryde 110
Model Classroom of 21st Century
ATM-based VTEL system
Apple G3, Media 100, 120G, BetaCam SP, FireWire, one of almost any device
Large Smart Board
IBM Multimedia PC, …
Supports spring multimedia class (CS4624)
Tom Wilkinson’s staff and systems supporting innovation in learning grants
ACITC
Advanced Communications and Information Technology Center, opening summer 2000
Connects to the library, with a focus on IT
1/3 high-tech (multimedia) classrooms
1/3 digital/electronic library (reading room)
1/3 research labs: 10, including:
– Digital Library Research Laboratory (DLRL)
– Center for Applied Technologies in the Humanities
– Center for Human-Computer Interaction (HCI) – extending 5 year $2M NSF Research Infrastructure project that has usability laboratories (individuals, 2-person teams, groups)
– HPC; Multimedia; Visualization (CAVE), ...
End-to-End Innovation
[Diagram labels: three OC3 links]
NET.WORK.VIRGINIA
World’s Most Advanced Public Network
Statewide Access
Regional / National Access
Blacksburg Electronic Village
LMDS Wireless Technology
Multimedia Service Access Point
Local Community Access
Internet 2 / NGI
Multimedia Network Access Point
PetaPlex
Digital Library Machine (“super” object store)
Parallel computer / storage utility for scale of 1000 to 100,000,000 gigabytes (1 Tbyte - 100 Pbyte)
Knowledge Systems Incorporated is supplying VT-PetaPlex-1 for $250,000 with
– high speed backbone connection(s)
– 2.5 terabytes through 100 “nanoservers”:
– Each = Network connection + IBM 25GB disk + 233 MHz Pentium II + Linux
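The figures on this slide can be cross-checked with a little arithmetic. The sketch below only restates the numbers quoted above (100 nanoservers, 25 GB each, $250,000) and assumes decimal units (1000 GB = 1 Tbyte), as the slide's own scale range implies. The function names are illustrative, not part of any PetaPlex software.

```python
import math

# Figures quoted on the slide.
NANOSERVERS = 100            # nanoservers in VT-PetaPlex-1
DISK_GB_PER_NODE = 25        # one IBM 25GB disk per nanoserver
SYSTEM_PRICE_USD = 250_000   # quoted price of VT-PetaPlex-1

# Assumption: decimal units, matching "1000 gigabytes (1 Tbyte)" on the slide.
GB_PER_TB = 1000


def total_capacity_tb(nodes: int, gb_per_node: int) -> float:
    """Raw capacity in terabytes for a given number of nanoservers."""
    return nodes * gb_per_node / GB_PER_TB


def price_per_tb(price_usd: float, nodes: int, gb_per_node: int) -> float:
    """System price divided by raw capacity."""
    return price_usd / total_capacity_tb(nodes, gb_per_node)


def nodes_needed(target_tb: float, gb_per_node: int) -> int:
    """Nanoservers required to reach a target raw capacity."""
    return math.ceil(target_tb * GB_PER_TB / gb_per_node)


if __name__ == "__main__":
    print(total_capacity_tb(NANOSERVERS, DISK_GB_PER_NODE))   # 2.5 (terabytes)
    print(price_per_tb(SYSTEM_PRICE_USD, NANOSERVERS, DISK_GB_PER_NODE))
    # Upper end of the stated scale: 100 Pbyte = 100,000 Tbyte.
    print(nodes_needed(100_000, DISK_GB_PER_NODE))
```

With 25 GB disks, the upper end of the stated range would take millions of nodes, which is one way to read the emphasis on very cheap, mass-produced nanoservers.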
PetaPlex Complex
[Diagram: front-end machine (RS/6000, 1G RAM, 4 Proc.), an array of nanoservers, and Service Machines 1 through 4]
PetaPlex Service Machine Possibilities
Front-end provides handle/repository abstraction through hashing
Small object server
Large object server
– video on demand
– streaming audio
Information retrieval server
Proxy / cache server (e.g., 1 terabyte server of 1000 worldwide for Comsat/Intelsat)
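The slide says only that the front end provides the handle/repository abstraction "through hashing". One plausible reading is sketched below: a handle is hashed to pick the nanoserver that stores the object. The hash function (SHA-256), the modulo placement, the class name, and the sample handle are all assumptions made for illustration, and in-memory dictionaries stand in for the nanoservers.

```python
import hashlib


def nanoserver_for(handle: str, num_servers: int) -> int:
    """Map a handle to a nanoserver index (assumed scheme: SHA-256 mod N)."""
    digest = hashlib.sha256(handle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


class FrontEnd:
    """Hypothetical front end: clients see handles, never server locations."""

    def __init__(self, num_servers: int) -> None:
        # One dict per nanoserver stands in for its local object store.
        self.stores = [dict() for _ in range(num_servers)]

    def put(self, handle: str, data: bytes) -> int:
        """Store an object and return the index of the server that holds it."""
        index = nanoserver_for(handle, len(self.stores))
        self.stores[index][handle] = data
        return index

    def get(self, handle: str) -> bytes:
        """Fetch an object by handle; the hash alone locates it."""
        index = nanoserver_for(handle, len(self.stores))
        return self.stores[index][handle]


if __name__ == "__main__":
    front_end = FrontEnd(num_servers=100)
    where = front_end.put("example/etd-0001", b"thesis bytes")
    print(where, front_end.get("example/etd-0001"))
```

A plain modulo scheme moves most objects whenever the number of servers changes; a production design would likely need something like consistent hashing, which the slides do not discuss.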
Comparison: Network of Workstations (NOW), Beowulf, PetaPlex
Architecture
– NOW: Cluster of general purpose workstation class machines using off-the-shelf network interconnect
– Beowulf: General purpose PCs, interconnected with a customized network
– PetaPlex: Special purpose architecture tuned for superstorage. Uses a mix of off-the-shelf PC components and specialized network interconnects.
Cost per node
– NOW: Workstation prices. Between $2000-$2500/node
– Beowulf: Mid to low-end PC prices. Between $1200-$1800 per node
– PetaPlex: Mass produced components will reduce price to around $100/node
Target area
– NOW: Computation
– Beowulf: Computation
– PetaPlex: Storage; computation is a secondary function
Filesystem support
– NOW: UNIX flavors
– Beowulf: UNIX flavors
– PetaPlex: Replaces location dependent files with location independent fine-grained URN named objects
ENVISION
NSF “A User-Centered Database from the Computer Science Literature” (1991-93)
Collected bib/typesetter data, converted to SGML
Scanned thousands of page images
MARIAN search engine - can be made available (also applied to the Virginia Tech library catalog)
Used as part of a prototype object-based DL, with tailored visualization interface (L. Nowell dissertation)
MARIAN
Multiple Access Retrieval of Information with ANnotations (Musical: Marian the Librarian …)
Evolved from 1980’s CODER system to a distributed Online Public Access Catalog (OPAC), then DL backend, now becoming a full DL system
From C/C++ to Java by Jianxin Zhao
Future uses: NDLTD, NUDL, PetaPlex
MARIAN Layers
Database Layer
Search Engine Layer
User Information Layer
User Interface Layer
User User User User
MARIAN Parallelism
[Chart: Java part response time vs. query rate comparison (type 1 requests). x-axis: query rate (#/min), 0 to 500; y-axis: response time (ms), 0 to 4000. Series: all modules in one machine; one "webgate"; two "webgate"s; four "webgate"s]
MARIAN Response Time
[Chart: Four "webgate"s, decomposed time delay vs. query rate. x-axis: query rate (#/min), 0 to 500; y-axis: time delay (ms), 0 to 4000. Legend text: system after Java server]
France Dissertation
Key developer since CODER
Applying computational linguistics efforts with machine readable dictionaries
Applying opportunistic handling of term lists for ranking, usable displays (“to be or not to be, that is the”)
Developing and evaluating a variety of interfaces
Network Research Group
NSF 3 year grant on WWW logging, characterization, and optimization: Abrams, Fox, Pollard (CNS)
Core member of Web Characterization Activity of World-Wide Web Consortium
Providing DL to support WCA (at http://www.w3c.org/WCA):
– logs
– tools
– publications
Example: NRG Tools
WebJamma: Artificial HTTP traffic generator
WebWatcher: HTTP traffic monitoring and logging system
CLFmunge: Anonymizes common log format
HTTPdump: Protocol decode for tcpdump
Caching proxy simulator
Splus programs
Log description and validation interface & routines
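As an illustration of what anonymizing a common log format (CLF) file can involve, here is a minimal sketch. It is not the CLFmunge implementation, whose behavior the slides do not describe; it assumes one approach, replacing the client host with a salted hash and masking the authenticated user name. The function name, salt, and sample log line are hypothetical.

```python
import hashlib
import re

# CLF fields: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(r'^(\S+) (\S+) (\S+) (\[[^\]]*\] ".*" \d{3} \S+)$')


def anonymize_clf_line(line: str, salt: str) -> str:
    """Return the log line with host and user replaced by pseudonyms."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("not a common log format line: " + line)
    host, _ident, user, rest = match.groups()
    # The same host always maps to the same pseudonym, so per-client
    # analysis still works without exposing the real address.
    pseudonym = hashlib.sha256((salt + host).encode("utf-8")).hexdigest()[:12]
    masked_user = "-" if user == "-" else "user"
    return f"host-{pseudonym} - {masked_user} {rest}"


if __name__ == "__main__":
    sample = ('192.0.2.7 - frank [10/Oct/2000:13:55:36 -0700] '
              '"GET /index.html HTTP/1.0" 200 2326')
    print(anonymize_clf_line(sample, salt="keep-this-secret"))
```

A salted hash keeps repeat visits by the same client linkable, which matters for caching and traffic characterization studies, but it is weaker than dropping the field entirely if the salt leaks.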
How do universities and digital libraries relate?
Each U. will have its own digital library. Hence there will be large numbers (i.e., critical mass).
All students will learn how to use and how to “feed” digital libraries (and bring those habits to future work as needs and skills).
All digital library problems (esp. federation, flexibility, personalization) appear at U’s (so they are a good type of testbed, with willing collaborators in-place for developing solutions).
Start with NDLTD, extend to NUDL
Digital Libraries --- Virginia Tech
MARIAN (NLM)
CS DL Prototype - ENVISION (NSF, ACM)
TULIP (Elsevier, OCLC)
BEV History Base (NSF, Blacksburg)
DL for CS Education - EI (NSF, ACM)
WATERS, NCSTRL (NSF)
NDLTD (SURA, US Dept. of Education)
CSTC (NSF, ACM), CRIM (NSF, SIGMM)
WCA (Log) Repository (W3C)
VT-PetaPlex-1 (Knowledge Systems)
Digital Libraries --- Objectives
World Lit.: 24hr / 7day / from desktop
Integrated “super” information systems: 5S: streams, structures, spaces, scenarios, societies
Ubiquitous, Higher Quality, Lower Cost
Education, Knowledge Sharing, Discovery
Disintermediation -> Collaboration
Universities Reclaim Property
Interactive Courseware, Student Works
Scalable, Sustainable, Usable, Useful
DLs: Why of Global Interest?
National projects can preserve antiquities and heritage: cultural, historical, linguistic, scholarly
Knowledge and information are essential to economic and technological growth, education
DL - a domain for international collaboration
– wherein all can contribute and benefit
– which leverages investment in networking
– which provides useful content on Internet & WWW
– which will tie nations and peoples together more strongly and through deeper understanding
DL Challenges
Preservation - so people will trust DLs
Supporting infrastructure - networks, ...
Scalability, sustainability, interoperability
DL industry - critical mass by covering libraries, archives, museums, corporate info, govt info, personal info - “quality WWW” integrating IR, HT, MM, ...
– Need tools & methods to make them easier to build
Locating Digital Libraries in Computing and Communications Technology Space
[Figure: axes are Computing (flops), Communications (bandwidth, connectivity), and Digital content (less to more). Digital Libraries technology trajectory: intellectual access to globally distributed information]
Digital Library Content
[Figure: Content Types: Articles, Reports, Books; Text Documents; Speech, Music; Video, Audio; (Aerial) Photos; Geographic Information; Models, Simulations; Software, Programs; Genome (Human, animal, plant); Bio Information; 2D, 3D, VR, CAT; Images and Graphics]
Definition: Digital Libraries are complex systems that
help satisfy info needs of users (societies)
provide info services (scenarios)
organize info in usable ways (structures)
present info in usable ways (spaces)
communicate info with users (streams)
Definition: 5S Framework
Societies: interacting people (and computers)
Scenarios: services, functions, operations, methods
Spaces: domains + constraints (e.g., distance, adjacency): 2D, vector, probability
Structures: relations, trees, nodes and arcs
Streams: sequences of items (text, audio, video, network traffic)
(5 Element System: Fire, Wood, Earth, Metal, Water)
5S: Components
Societies: roles, rituals, reasons, relationships, artifacts
Scenarios: acquire, index, consult, administer, preserve
Spaces: physical, temporal, functional, presentational, conceptual
Structures: architectures, taxonomies, schema, grammars, links, objects
Streams: granularities, protocols, paths, flows, turbulences
5S: Combinations
Societies + Scenarios = user model
Societies + Scenarios + Spaces = user interface
Streams + Structures = markup
Streams + Structures + Scenarios = object
Structures + Scenarios = DBMS
How to Build a Digital Library
Understand the problem (using the 5S Framework)
Solve the problem (using the Star Methodology)
– design, develop, evaluate, refine, operate
Neill Kipp Dissertation
Training interested groups about 5S and the Star Methodology, refining the Framework to have solid mathematical foundation
Case studies of projects at Virginia Tech or involving VT staff/students: CSTC, NDLTD, NARA (National Archives, with SAIC), Lexis, ...
Open also to study DL projects elsewhere
Focusing too on the design artifacts developed and related issues of efficient description and representation (esp. with markup, hypermedia)
[Diagram: Digital Libraries, Interactive Experiences, Enhancing Learning, surrounded by these projects:]
NDLTD: Networked DL of Theses & Dissertations
Student Portfolios, Self-Archiving
Gray Literature (Dept. of Educ.)
W3C WCA Repository: Logs, Tools, Publications
CSTC: CS Teaching Center
CRIM: Curriculum Resources Inter. MM
Computer Science (with NSF and ACM)
Enhancing Learning with DLs
[Diagram: Digital Libraries, Interactive Experiences, Enhancing Learning, surrounded by these activities:]
Authoring (text, markup, hypermedia, cataloging - DC)
Submitting Work (ETD) (Metadata, PDF, XML)
Preserving (using stds, migrating, versioning)
Adding to Digital Library (student)
Discovering, Browsing, Searching, Retrieving
Annotating, Downloading, Installing, Feedback
5S Framework: Societies, Scenarios, Streams, Spaces, Structures
Using Digital Library (direct) (info literacy)
Indirectly Using Digital Library (embedded, by agent, ...)
Using DL Contents (tools, datasets, env's, courseware, ...)
Collaboration (in/around DL and its artifacts - distance educ.)
Other Interactive Learning Activities
NSF Education Innovation (EI)
NSF “Interactive Learning with a Digital Library in Computer Science” (1993-98)
45 online courses (esp. Internet, IR, MM, Professionalism, overall EI project pages): 100+K accesses/wk
Tools: SWAN (visualization), QUIZIT
Evaluation
– traditional
– network logging and analysis
– tools for visualization
Digital Library Courseware
http://ei.cs.vt.edu/~dlib/
WWW pages or large PDF copy files
Online quizzes based on book by Michael Lesk (Morgan Kaufmann Publishers)
Contents based on book, with several other popular topics added (e.g., agents)
Separate pages to supplement: Definitions, Resources (People, Projects), and References
CS -> CSTC -> CRIM
NSF and ACM Education Committee are funding a 2 year project “A Computer Science Teaching Center” - CSTC - http://www.cstc.org/
College of NJ, U. Ill. Springfield, Virginia Tech
Focus initially on labs, visualization, multimedia
Multimedia part is also supported by a 2nd grant to Virginia Tech and The George Washington University: http://www.cstc.org/~crim/ (with curricular guidelines also under development)
CS Teaching Center (CSTC)
Instead of building large, expensive multimedia packages that become obsolete and are difficult to re-use, concentrate on small knowledge units.
Learners benefit from having well-crafted modules that have been reviewed and tested.
Use digital libraries to build a powerful base of support for learners, upon which a variety of courses, self-study tutorials & reference resources can be built. [See NSF NSDL - National Science (math, engineering, technology education) Digital Library (formerly SMETE-lib) at http://www.dlib.org/smete/public/smete-public.html]
ACM Education Board and SIG support, new NSF grant with COLLEGIS Research Institute and others …
CRIM Rationale
MM field needs properly trained personnel
Support this with resources + curricula
Together these help us move toward a DL for Interactive MM -> CS -> NSDL
Benefits will go to teachers (who have more to build upon) and students (who will have a richer environment for learning)
CRIM Project Activities
Workshops, other ways to involve community
WWW site including DL in CSTC re MM
– Devised cataloging schema, designed interface
– Referring to all MM syllabi and curriculum
– Inviting learning resources for the CRIM DL, with reviews, reuse certifications
Publish report on MM curriculum through ACM and IEEE, after careful review
CSTC, CRIM will lead to ACM Journal of Educational Resources in Computing (JERiC)
Virginia Tech CRIM Related Courses
Art: Digital Art and Design course (Photoshop)
CS: 1604 Introduction to the Internet (1 cr.)
CS: 3604 Professionalism in Computing
CS: 4624 Multimedia, Hypertext and Information Access (3 cr.)
CS: 5604 Information Storage & Retrieval (3 cr.)
CS: 6604 Digital Libraries (3 cr.)
SMETE Library -> NSDL (from www.dlib.org to NSF DLI-2)
Context: Global movement toward Digital Libraries (see April 1998 CACM)
NSF effort: Science, Mathematics, Engineering, and Technology Education Digital Library (focussed on undergraduates)
– 3 workshops, yearly increasing funds / new calls
SMETE Library likely to operate as distributed federation, with separate parts for each key discipline, and to lead to a global effort
Open Archives Initiative History
xxx at LANL = Los Alamos National Laboratory (Ginsparg) for high-energy physics - 1991
CSTR + WATERS = NCSTRL (Lagoze) - 1994
xxx + NCSTRL = CoRR collaboration - 1998
UPS (Universal Preprint Service) – 1999 mtg
– Herbert Van de Sompel (U. Ghent, SFX) …
– Dublin Core (DC), XML
– Dienst protocol and software (Lagoze)
Renamed late 1999 as OAI
OAI Philosophy
Self-archiving = submission mechanism
Long-term storage system = archive
Open interface = harvesting mechanism
Data provider + service provider
Start with e-prints / pre-prints
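The division of labor named on this slide (self-archiving, an archive, an open harvesting interface, data providers and service providers) can be sketched in a few lines. This is an in-memory stand-in, not the Dienst protocol or any OAI software. The class names, the `list_records` method, and the sample records are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    identifier: str
    title: str
    datestamp: str  # ISO date (YYYY-MM-DD), so string comparison orders correctly


class DataProvider:
    """An archive: accepts self-archived records and exposes them for harvesting."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._records: Dict[str, Record] = {}

    def submit(self, record: Record) -> None:
        """Self-archiving = submission mechanism."""
        self._records[record.identifier] = record

    def list_records(self, since: str = "") -> List[Record]:
        """Open interface = harvesting mechanism (optionally incremental)."""
        return [r for r in self._records.values() if r.datestamp >= since]


class ServiceProvider:
    """Harvests metadata from many archives and layers a service on top."""

    def __init__(self) -> None:
        self.index: Dict[str, Record] = {}

    def harvest(self, provider: DataProvider, since: str = "") -> int:
        harvested = provider.list_records(since)
        for record in harvested:
            self.index[record.identifier] = record
        return len(harvested)

    def search(self, word: str) -> List[str]:
        """A deliberately simple service: title substring search."""
        needle = word.lower()
        return sorted(r.identifier for r in self.index.values()
                      if needle in r.title.lower())


if __name__ == "__main__":
    etds = DataProvider("example-etd-archive")
    etds.submit(Record("etd:1", "Caching proxies for the Web", "1999-11-02"))
    etds.submit(Record("etd:2", "Digital library interfaces", "2000-01-15"))
    service = ServiceProvider()
    service.harvest(etds)
    print(service.search("digital"))
```

The point of the split is that the archive only has to store records and answer harvest requests; searching, linking, and other services can be built by anyone who harvests.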
Open Archives (protoproto)
ArXiv & Los Alamos National Lab
CogPrints & U. Southampton
NACA & NASA (reports)
NCSTRL & Cornell U.
NDLTD & Virginia Tech
RePEc & U. Surrey
(Washington U. & EconWPA)
Open Archives Members
Original Participants in the Open Archives Initiative
– Caroline Arms, Library of Congress
– Leslie Carr, University of Southampton
– Mark Doyle, American Physical Society
– Dale Flecker, Harvard University
– Edward A. Fox, Virginia Tech
– Michael Friedman, HighWire Press, Stanford University
– Paul M. Gherman, Vanderbilt University
– Paul Ginsparg, Los Alamos National Laboratory & xxx
– Stevan Harnad, University of Southampton
– Thomas Krichel, University of Surrey & RePEc
– Carl Lagoze, Cornell University
– Rick Luce, Los Alamos National Laboratory
– Clifford Lynch, Coalition for Networked Information
– Kurt Maly, Old Dominion University
– Michael L. Nelson, NASA Langley Research Center
– John Ober, California Digital Library
– Bob Parks, Washington University & EconWPA
– Herbert Van de Sompel, University of Ghent
– Eric F. Van de Velde, California Institute of Technology
– Don Waters, The Andrew W. Mellon Foundation
– Ken Weiss, California Digital Library
Others Joining (selected)
– University of Virginia – Jim French, Worthy Martin, Thornton Staples
– NEC Research Institute - C. Lee Giles and Steve Lawrence
– Internet Archive - Kurt Bollacker, Marlita Kahn
– India - University of Mysore – Shalini Urs
– Mexico – University of Monterrey - David Garza Salazar
VT Open Archives – Initial Set
NDLTD – global (DC – listserv)
NDLTD – VT (MARC, DC)
CSTC (DC format, ACM format)
W3C WCA logs (XML, atomic)
Approaches to Open Archives
Build By Discipline
Build By Institution
Author, Category, Interdisciplinary, Year, Language, Query …
Institutions / Disciplines
Universities: part, all, sets of
Disciplines: buy in as in Germany
– Physics, Chemistry, Math, Sociology, Educ.
Basis for Federation:– Language – German, Spanish, French, CJK
– Politics – OhioLink, National Library of Portugal, ISTEC for Latin America
– Economics – Developing Countries (UNESCO)
Open Archives Initiative (OAI)
www.openarchives.org
Santa Fe meeting, Oct. 21-22, 1999 and protoproto
Next mtg June 3, San Antonio, between HT’00 & DL’00
LANL, CNI, DLF, Mellon, …
Convention (see Feb. D-Lib Magazine)
Archives -> Open Archives
– Support unique archive identifiers
– Implement Open Archives Metadata Set (DC-based, using XML)
– Implement Dienst harvesting interface
– Register the archive
Build tools, layer other services: linking, searching, …
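To make "DC-based, using XML" concrete, the sketch below builds a small Dublin Core record with a unique archive-qualified identifier. The slides do not give the actual Open Archives Metadata Set schema, so the `record` wrapper element, the `archive:local-id` identifier scheme, and the sample values are assumptions. Only the Dublin Core element namespace is a real, published URI.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Dublin Core element set namespace (published by DCMI).
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def make_record(archive_id: str, local_id: str, fields: Dict[str, str]) -> str:
    """Serialize Dublin Core fields as XML under a hypothetical <record> wrapper."""
    record = ET.Element("record")
    # Assumed scheme: archive identifier plus local identifier gives a
    # globally unique name for the record.
    record.set("identifier", f"{archive_id}:{local_id}")
    for name, value in fields.items():
        element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(record, encoding="unicode")


if __name__ == "__main__":
    xml_text = make_record(
        archive_id="example-archive",   # hypothetical registered archive name
        local_id="0001",
        fields={
            "title": "An Example Thesis Title",
            "creator": "Student, A.",
            "date": "2000-02-24",
            "type": "Thesis",
        },
    )
    print(xml_text)
```

Because every archive emits the same small set of elements, a harvester can index records from many sources without knowing each archive's native format (MARC, a local schema, and so on).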
Figure 1. Layers Related to Open Archives Initiative
Services
…
Search/Browse
Authoring Citation Checking Submission
Metadata Creation
Editorial: Reviewing, Certification
Registry
Archives: Name, ID, Description, Terms and Conditions, …
Metadata Formats: Name, XML DTD, …
…
Archive Formats: Name, Standard, Preservation Process, …
Protocols Tools
Services
Copy-Edit / Add Value Citation DB Updating
Authority Control
Preservation Conversion
Text/MM Editing
Gazetteer Cataloging
Collaboration
Annotation
Summarization
Citation / Linking
SFX
CiteSeer
Repository NCSTRL Repository
…
EconWPA Repository
RePEc Repository
Repository for NDLTD Open Archives Harvesting Protocol
Metadata Formats: OA Metadata Set, NDLTD Standard (DC-based) Set
Transaction Log
Training Resources
VT Partition
Record (Metadata)
Record (Full Content)
… …
UVA Partition
Metadata Content
Caltech Partition
Metadata Content
Interoperability for NDLTD
Naming
Data exchange: share MARC records
Performance, reliability: replication (mirroring)
Federated searching
– Query on content, metadata, links/relationships
Dynamic linking / extended services
Browsing, viz., working in concept space
Annotating/reviewing/certifying
Perspective/goals: removing barriers
Mechanisms
Sharing
– Join federation, run software
– Make metadata and archive available
Aggregating
– By discipline
– By institution
– By genre
Automating
– Workflow
– Harvesting and providing services
– Federated searching
– Dynamic linking
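Federated searching, as opposed to harvesting, sends the query to each archive at search time and merges the answers. The sketch below shows that pattern with toy in-memory archives. The term-count scoring, the function names, and the sample documents are assumptions for illustration, and it naively treats scores from different archives as comparable, which real federations generally cannot assume.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]              # (identifier, score)
SearchFn = Callable[[str], List[Hit]]


def make_archive(docs: Dict[str, str]) -> SearchFn:
    """Build a toy archive whose search scores documents by query term counts."""
    def search(query: str) -> List[Hit]:
        terms = query.lower().split()
        hits: List[Hit] = []
        for identifier, text in docs.items():
            words = text.lower().split()
            score = sum(words.count(term) for term in terms)
            if score > 0:
                hits.append((identifier, float(score)))
        return hits
    return search


def federated_search(query: str, archives: Dict[str, SearchFn],
                     limit: int = 10) -> List[Hit]:
    """Send the query to every archive and merge the hits into one ranking."""
    merged: List[Hit] = []
    for name, search in archives.items():
        for identifier, score in search(query):
            # Qualify identifiers by archive so names cannot collide.
            merged.append((f"{name}/{identifier}", score))
    # Highest score first; ties broken by identifier for a stable order.
    merged.sort(key=lambda hit: (-hit[1], hit[0]))
    return merged[:limit]


if __name__ == "__main__":
    archives = {
        "site-a": make_archive({"1": "digital library digital archive"}),
        "site-b": make_archive({"1": "library catalog", "2": "network logs"}),
    }
    print(federated_search("digital library", archives))
```

Qualifying each identifier with its archive name mirrors the naming concern in the interoperability list: two institutions can both have a record "1" without ambiguity.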
OAI-Related Proposals
CNPQ – collaboration with PUC Rio
CONACyT – collaboration with UDLA and Monterrey (Mexico)
FIPSE preproposal – GSDI + OAI – with Caltech, U. Cincinnati (OhioLink), U. Kentucky, U. Iowa, USF (FL center for library automation)