Date posted: 20-Dec-2015
National Partnership for Advanced Computational Infrastructure
Data Intensive Computing
Information Based Computing
Digital Libraries / Metacomputing Services
Reagan W. Moore
San Diego Supercomputer Center
[email protected]
http://www.npaci.edu/DICE
Distributed Archives
Application
Digital Library
Data Mining
Information Based Computing
Information Discovery
Collection Building
Co-evolution of Technology
• Supercomputer Centers and Digital Libraries
  • Both support large scale processing & storage of data
• Will the supercomputer centers of the future be digital libraries?
Researchers
Chaitanya Baru
Amarnath Gupta
Bertram Ludaescher
Richard Marciano
Yannis Papakonstantinou
Arcot Rajasekar
Wayne Schroeder
Michael Wan
Outline
• Two views of computing
  • Execution environment - metacomputing systems
  • Data Management environment - digital library
• Analysis for moving data to the process or the process to the data
• Data Management Environment
• Information Based Computing
Digital Libraries
Multimedia / GIS / MVD / XML / LDAP / CORBA / Z39.50
Publication / Services Environment
Presentation Interface
Object Based Information Model
Data Management for publication
Data Resources
Parallel I/O - MPI
Constructors: turning data sets into objects
Data Resources
Data Management for execution
Metacomputing Environment
Execution Environment
Choice between Environments
• Should we provide services for manipulating information?
  • Move the process to the data
• Should we provide execution environments?
  • Move data to the process
Data Distribution Comparison
Data --(B)--> Data Handling Platform --(b)--> Supercomputer
Execution rate r < R
Bandwidths linking systems are B & b
Operations per bit for analysis is O
Operations per bit for data transfer is o
Reduce size of data from S bytes to s bytes and analyze
Should the data reduction be done before transmission?
Distributing Services
Compare times for analyzing data with size reduction from S to s
Reduction on the Data Handling Platform, before transmission:
  Read Data (S/B) -> Reduce Data (OS/r) -> Transmit Data (os/r) -> Network (s/b) -> Receive Data (os/R)
Reduction on the Supercomputer, after transmission:
  Read Data (S/B) -> Transmit Data (oS/r) -> Network (S/b) -> Receive Data (oS/R) -> Reduce Data (OS/R)
Comparison of Time
Processing at supercomputer
T(Super) = S/B + oS/r + S/b + oS/R + OS/R
Processing at archive
T(Archive) = S/B + OS/r + os/r + s/b + os/R
Optimization Parameter Selection
Have algebraic equation with eight independent variables.
T(Super) < T(Archive)
S/B + oS/r + S/b + oS/R + OS/R < S/B + OS/r + os/r + s/b + os/R
Which variable provides the simplest optimization criterion?
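The two pipelines can be compared numerically. The sketch below names each function by where the reduction step runs: on the data handling platform (rate r) before transmission, or on the supercomputer (rate R) after all S bytes have been shipped. The function names and the parameter values are illustrative assumptions, not figures from the slides, and S, s, B, b are assumed to be expressed in the same data unit.

```python
def time_reduce_at_archive(S, s, B, b, r, R, O, o):
    """Read S, reduce S -> s at rate r, then send and receive only s."""
    return S / B + O * S / r + o * s / r + s / b + o * s / R


def time_reduce_at_supercomputer(S, s, B, b, r, R, O, o):
    """Read S, send and receive all of S, then reduce at rate R."""
    return S / B + o * S / r + S / b + o * S / R + O * S / R


# Illustrative values only (assumed, not from the slides).
params = dict(S=1e9, s=1e7, B=1e8, b=1e7, r=1e8, R=1e10, o=1.0)

for O in (1.0, 1000.0):
    t_archive = time_reduce_at_archive(O=O, **params)
    t_super = time_reduce_at_supercomputer(O=O, **params)
    winner = "supercomputer" if t_super < t_archive else "archive"
    print(f"O={O:g}: archive {t_archive:.1f}, supercomputer {t_super:.1f} -> {winner}")
```

With these assumed values the low-complexity case favors reducing at the archive and the high-complexity case favors shipping everything to the supercomputer.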
Scaling Parameters
Data size reduction ratio s/S
Execution slow down ratio r/R
Problem complexity o/O
Communication/Execution balance r/(ob)
When r/(ob) = 1, the data processing rate is the same as the data transmission rate.
Optimal designs have r/(ob) = 1
Note (r/o) is the number of bits/sec that can be processed.
Complexity Analysis
Moving all of the data is faster, T(Super) < T(Archive)
Sufficiently complex analysis
O > o (1 - s/S) [1 + r/R + r/(ob)] / (1 - r/R)
Note, as the execution ratio approaches 1, the required complexity becomes infinite
Also, as the amount of data reduction goes to zero, the required complexity goes to zero.
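A minimal sketch of the complexity criterion, checked against the two pipeline times (reduce at the archive versus ship everything and reduce at the supercomputer). Function names and numeric values are assumptions chosen for illustration.

```python
def transfer_times(S, s, B, b, r, R, O, o):
    """Return (time reducing at archive, time reducing at supercomputer)."""
    at_archive = S / B + O * S / r + o * s / r + s / b + o * s / R
    at_super = S / B + o * S / r + S / b + o * S / R + O * S / R
    return at_archive, at_super


def complexity_threshold(S, s, b, r, R, o):
    """Value of O above which shipping all the data is faster."""
    return o * (1 - s / S) * (1 + r / R + r / (o * b)) / (1 - r / R)


# Illustrative values only (assumed, not from the slides).
p = dict(S=1e9, s=1e7, b=1e7, r=1e8, R=1e10, o=1.0)
O_min = complexity_threshold(**p)

for O in (0.5 * O_min, 2.0 * O_min):
    at_archive, at_super = transfer_times(B=1e8, O=O, **p)
    print(f"O={O:.2f}: supercomputer faster? {at_super < at_archive}")
```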
Bandwidth Optimization
Moving all of the data is faster, T(Super) < T(Archive)
Sufficiently fast network
b > (r/O) (1 - s/S) / [1 - r/R - (o/O) (1 + r/R) (1 - s/S)]
Note the denominator changes sign when
O < o (1 + r/R) (1 - s/S) / (1 - r/R)
Even with an infinitely fast network, it is better to do the processing at the archive if the complexity is too small.
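The bandwidth criterion can be sketched the same way. When the denominator is zero or negative, no finite bandwidth makes shipping all of the data faster, which the sketch reports as infinity. Names and values are illustrative assumptions.

```python
import math


def transfer_times(S, s, B, b, r, R, O, o):
    """Return (time reducing at archive, time reducing at supercomputer)."""
    at_archive = S / B + O * S / r + o * s / r + s / b + o * s / R
    at_super = S / B + o * S / r + S / b + o * S / R + O * S / R
    return at_archive, at_super


def bandwidth_threshold(S, s, r, R, O, o):
    """Network bandwidth b above which shipping all the data is faster."""
    shrink = 1 - s / S
    denom = 1 - r / R - (o / O) * (1 + r / R) * shrink
    if denom <= 0:
        return math.inf  # complexity too small: archive wins at any bandwidth
    return (r / O) * shrink / denom


# Illustrative values only (assumed, not from the slides).
p = dict(S=1e9, s=1e7, r=1e8, R=1e10, o=1.0)
b_min = bandwidth_threshold(O=1000.0, **p)
print(f"b_min for O=1000: {b_min:.4g}")
print(f"b_min for O=1: {bandwidth_threshold(O=1.0, **p)}")
```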
Execution Rate Optimization
Moving all of the data is faster, T(Super) < T(Archive)
Sufficiently fast supercomputer
R > r [1 + (o/O) (1 - s/S)] / [1 - (o/O) (1 - s/S) (1 + r/(ob))]
Note the denominator changes sign when
O < o (1 - s/S) [1 + r/(ob)]
Even with an infinitely fast supercomputer, it is better to process at the archive if the complexity is too small.
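A sketch of the execution-rate criterion, again checked against the two pipeline times. A non-positive denominator means no supercomputer speed suffices. Names and values are illustrative assumptions.

```python
import math


def transfer_times(S, s, B, b, r, R, O, o):
    """Return (time reducing at archive, time reducing at supercomputer)."""
    at_archive = S / B + O * S / r + o * s / r + s / b + o * s / R
    at_super = S / B + o * S / r + S / b + o * S / R + O * S / R
    return at_archive, at_super


def execution_rate_threshold(S, s, b, r, O, o):
    """Supercomputer rate R above which shipping all the data is faster."""
    k = (o / O) * (1 - s / S)
    denom = 1 - k * (1 + r / (o * b))
    if denom <= 0:
        return math.inf  # complexity too small: archive wins at any R
    return r * (1 + k) / denom


# Illustrative values only (assumed, not from the slides).
p = dict(S=1e9, s=1e7, b=1e7, r=1e8, o=1.0)
R_min = execution_rate_threshold(O=12.0, **p)
print(f"R_min for O=12: {R_min:.4g}")
print(f"R_min for O=5: {execution_rate_threshold(O=5.0, **p)}")
```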
Data Reduction Optimization
Moving all of the data is faster, T(Super) < T(Archive)
Data reduction is small enough
s > S {1 - (O/o) (1 - r/R) / [1 + r/R + r/(ob)]}
Note the criterion changes sign when
O > o [1 + r/R + r/(ob)] / (1 - r/R)
When the complexity is sufficiently large, it is faster to process on the supercomputer even when data can be reduced to one bit.
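A sketch of the data-reduction criterion. When the bound on s is negative, every reduced size satisfies it, so the sketch returns 0.0 to signal that the supercomputer wins regardless of how far the data can be reduced. Names and values are illustrative assumptions.

```python
def transfer_times(S, s, B, b, r, R, O, o):
    """Return (time reducing at archive, time reducing at supercomputer)."""
    at_archive = S / B + O * S / r + o * s / r + s / b + o * s / R
    at_super = S / B + o * S / r + S / b + o * S / R + O * S / R
    return at_archive, at_super


def data_size_threshold(S, b, r, R, O, o):
    """Reduced size s above which shipping all the data is faster."""
    bound = S * (1 - (O / o) * (1 - r / R) / (1 + r / R + r / (o * b)))
    return max(bound, 0.0)  # 0.0 means any amount of reduction still loses


# Illustrative values only (assumed, not from the slides).
p = dict(S=1e9, b=1e7, r=1e8, R=1e10, o=1.0)
s_min = data_size_threshold(O=5.0, **p)
print(f"s_min for O=5: {s_min:.4g}")
print(f"s_min for O=20: {data_size_threshold(O=20.0, **p)}")
```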
Is the Future Environment a Metacomputer or a Digital Library?
• Sufficiently high complexity
  • Move data to processing engine
  • Digital Library execution of remote services
  • Traditional supercomputer processing of applications
• Sufficiently low complexity
  • Move process to the data source
  • Metacomputing execution of remote applications
  • Traditional digital library service
The IBM Digital Library Architecture
Application (DL client)
Metadata in DB2 or Oracle
Videocharger DB2 ADSM Oracle
Library Server
Text and Image indices
“Federated” search
Object Server
Distributed storage resources
(SRB)(MCAT)
Generalization of Digital Library
• Scaling transparency
  • Support for arbitrary size data sets
  • Support for arbitrary data type
• Location transparency
  • Access to remote data
  • Access to heterogeneous (non-uniform) storage systems
  • Remove restriction of local disk space size
• Name service transparency
  • Support for multiple views (naming conventions) for data
• Presentation transparency
  • Support for alternate representations of data
Describing Information Content
Information Level | Infrastructure - Scientific Data | Infrastructure - Text
Federation        | Ontology                         | Digital Library
Data Collection   | Schema                           | Dublin Core
Data Set          | Metadata                         | Provenance
Features          | XML                              | XML
Logical type      | Vector bundle                    | Mime Type
Structure         | MPI Datatype                     | DTD
File Format       | HDF v5                           | Electronic record
State-of-the-art Information Management: Digital Library
Infrastructure Levels Language
Data Flow Systems Data Control
Format Presentation
Ontologies Schema Definition
Schema Manipulation
Access Discovery
Metadata Metadata Definition
Metadata Manipulation
Database Handling
Archive Collection Layout
Storage Management
Media Storage
High Performance Storage
• Provide access to tertiary storage - scale size of repository
  • Disk caches
  • Tape robots
  • Manage migration of data between disk and tape
• High Performance Storage System - IBM
  • Provides service classes
  • Support for parallel I/O
  • Support for terabyte sized data sets
  • Provide recoverable name space
State-of-the-art Storage: HPSS
• Store Teraflops computer output
  • Growth - 200 TB data per year
  • Data access rate - 7 TB/day = 80 MB/sec
  • 2-week data cache - 10 TB
  • Scalable control platform
    • 8-node SP (32 processors)
• Support digital libraries
  • Support for millions of data sets
  • Integration with database meta-data catalogs
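The storage figures above can be checked with a little arithmetic. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which the slide does not state, and compares two weeks of growth at 200 TB per year against the 10 TB cache.

```python
TB = 1e12  # decimal terabyte (assumption)
MB = 1e6   # decimal megabyte (assumption)
SECONDS_PER_DAY = 24 * 60 * 60

# 7 TB/day expressed as a sustained rate in MB/s.
access_rate_mb_s = 7 * TB / SECONDS_PER_DAY / MB

# Data written in two weeks at 200 TB/year.
two_week_growth_tb = 200 * 14 / 365

print(f"7 TB/day is about {access_rate_mb_s:.0f} MB/s")
print(f"Two weeks of growth is about {two_week_growth_tb:.1f} TB")
```

Under these assumptions the sustained rate comes out close to the quoted 80 MB/sec, and two weeks of new data fits in the 10 TB cache with some headroom.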
HPSS Archival Storage System
[Diagram: HPSS hardware configuration. Labels: High Performance Gateway Node; High Node and Wide Node disk movers with HiPPI drivers; Silver Node running storage / purge, bitfile / migration, name service / PVL, and log daemon; Silver Nodes running tape / disk movers, DCE / FTP / HIS, and log client; HiPPI switch; Trail-Blazer3 switch; SSA RAID arrays of 54 GB, 108 GB, and 160 GB; 830 GB MaxStrat RAID; RS6000 tape mover and PVR (9490); 9490 robots with four, seven, and eight tape drives; 3490 tape; Magstar 3590 tape.]
HPSS Bandwidths
• SDSC has achieved:
  • Node-HPGN 90 MB/s
  • Texas Memory Box 80 MB/s
  • MaxStrat disk 60 MB/s
  • SSA RAID 20-30 MB/s
• Striping required to achieve desired I/O rates
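A rough sketch of why striping is needed: the number of devices to stripe across in order to reach a target rate, assuming ideal linear scaling with no striping overhead (an assumption, real systems scale less cleanly). The function name and the 80 MB/s target are illustrative.

```python
import math


def stripes_needed(target_mb_s, device_mb_s):
    """Devices to stripe across, assuming rates add up linearly."""
    return math.ceil(target_mb_s / device_mb_s)


# SSA RAID at the low and high end of its measured range, then MaxStrat disk.
for device_rate in (20, 30, 60):
    print(f"{device_rate} MB/s devices: {stripes_needed(80, device_rate)} stripes for 80 MB/s")
```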
Turning Archives into Digital Libraries
• Meta-data based access to data sets
• Support for application of methods (procedures) to data sets
• Support for information discovery
• Support for publication of data sets
• Research issue - optimization of data distribution between database and archive
DB2/HPSS Integration
[Diagram labels: Database Table with columns C1 C2 C3 C4 C5; DB2; HPSS; DB2 disk buffer; HPSS disk cache]
• Collaboration with IBM TJ Watson Research Center
  • Ming-Ling Lo, Sriram Padmanabhan, Vibby Gottemukkala
• Features:
  • Prototype, works with DB2 UDB (Version 5)
  • DB2 is able to use an HPSS file as a tablespace container
  • DB2 handles DCE authentication to HPSS
  • Regular as well as long (LOB) data can be stored in HPSS
  • Optional disk buffer between DB2 and HPSS
Generalizing Digital Libraries
• SRB - Location transparency
  • Access to heterogeneous systems
  • Access to remote systems
• MCAT - Name service transparency
  • Extensible Schema support
• MIX - Presentation transparency
  • Mediation of information with XML
  • Support for semi-structured data
• Access scaling
  • MPI-I/O access to data sets using parallel I/O
SRB Software Architecture
Application (SRB client)
SRB APIs
SRB
  User Authentication
  Dataset Location
  Access Control
  Type
  Replication
  Logging
Metadata Catalog (MCAT)
UniTree / HPSS / DB2 / Illustra / Unix
14 Installed SRB Sites
U Michigan
U Maryland
Washington U
U Texas
U Houston
UC Davis
UC Berkeley
UC Santa Barbara
UCLA
UCSD
Caltech
Rutgers
NCSA
Montana State University
Large Archives
SRB / MCAT Features
• Support for Collection hierarchy
  • allows grouping of heterogeneous data sets into a single logical collection
  • hierarchical access control, with ticket mechanism
• Replication
  • optional replication at the time of creation
  • can choose replica on read
• Proxy operations
  • supports proxy (remote) move and copy operations
• Monitoring capability
• Supports storing/querying of system- and user-defined “metadata” for data sets and resources
• API for ad hoc querying of metadata
• Ability to extend schemas and define new schemas
• Ability to associate data sets with multiple metadata schemas
• Ability to relate attributes across schemas
• Implemented in Oracle and DB2
MCAT Schema Integration
• Publish schema for each collection
  • Clusters of attributes form a table
  • Tables implement the schema
• Use Tokens to define semantic meaning
  • Associate Token with each attribute
• Use DAG to automate queries
  • Specify directed linkage between clusters of attributes
  • Tokens - Clusters - Attributes
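One way to picture how a DAG over attribute clusters could automate queries: tokens resolve to (cluster, attribute) pairs, and a search along the directed links yields the joins needed to connect the cluster being filtered to the cluster being selected. This is a toy sketch only. Every table, token, and function name below is hypothetical and does not reflect the actual MCAT schema or API.

```python
from collections import deque

# Hypothetical token -> (cluster, attribute) mapping.
TOKENS = {
    "DATA_NAME": ("dataset", "dataset_name"),
    "COLLECTION_NAME": ("collection", "collection_name"),
    "RESOURCE": ("replica", "resource_name"),
}

# Hypothetical directed links between clusters: source -> [(target, join key)].
LINKS = {
    "collection": [("dataset", "collection_id")],
    "dataset": [("replica", "dataset_id")],
    "replica": [],
}


def join_path(start, goal):
    """Breadth-first search along directed links; returns (src, dst, key) steps."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for nxt, key in LINKS[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, nxt, key)]))
    return None


def build_query(select_token, where_token, value):
    """Build a parameterized SQL string from two tokens and a filter value."""
    sel_cluster, sel_attr = TOKENS[select_token]
    whr_cluster, whr_attr = TOKENS[where_token]
    path = join_path(whr_cluster, sel_cluster)
    if path is None:
        raise ValueError(f"no link from {whr_cluster} to {sel_cluster}")
    sql = f"SELECT {sel_cluster}.{sel_attr} FROM {whr_cluster}"
    for src, dst, key in path:
        sql += f" JOIN {dst} ON {src}.{key} = {dst}.{key}"
    sql += f" WHERE {whr_cluster}.{whr_attr} = ?"
    return sql, (value,)


sql, args = build_query("RESOURCE", "COLLECTION_NAME", "climate")
print(sql)
```

The caller names only tokens; the join chain is derived from the graph, which is the sense in which the linkage automates query construction in this sketch.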
Publishing A New Schema
Adding Attributes to the New Schema
Displaying Attributes From Selected Schemas
Security
• Integration of SDSC Encryption Authentication system (SEA) with Globus GSI
  • Kerberos within security domain
  • Globus for inter-realm authentication
• Access control lists per data set
• Audit trails of usage
• Need support for third-party authentication
  • User A accesses data under the control of digital library B when the data is stored at site C
MIX: Mediation of Information using XML
[Diagram labels: two BBQ Interfaces; Active View 1 and Active View 2 (support for “active” views); XMAS query; Mediator; XMAS query “fragment”; three Wrappers over an SQL Database, a Spreadsheet, and HTML files; XML data returned from the Wrappers; Local Data Repository]
Wrapper: convert XMAS query to local query language, and data in native format to XML
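The wrapper's second job, turning data in native format into XML, can be sketched with the Python standard library. This toy example wraps an in-memory SQLite table; it does not parse XMAS (the query is written directly in SQL), and the table, column, and function names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(cursor, root_tag="result", row_tag="row"):
    """Wrap each row of a query result as an XML element, one child per column."""
    root = ET.Element(root_tag)
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        row_elem = ET.SubElement(root, row_tag)
        for name, value in zip(columns, row):
            ET.SubElement(row_elem, name).text = str(value)
    return root


# Hypothetical native source: a small SQL table of data sets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (name TEXT, size_mb INTEGER)")
conn.executemany("INSERT INTO datasets VALUES (?, ?)",
                 [("a.hdf", 120), ("b.hdf", 45)])

cursor = conn.execute(
    "SELECT name, size_mb FROM datasets WHERE size_mb > ?", (100,))
xml_root = rows_to_xml(cursor)
print(ET.tostring(xml_root, encoding="unicode"))
```

A mediator in this picture would only ever see the XML form, regardless of whether the source was a database, a spreadsheet, or HTML files.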
Integration of Digital Library with Metacomputing Systems
• NTON OC-192 network (LLNL - Caltech - SDSC)
• HPSS archive
• Globus metacomputing system
• SRB data handling system
• MCAT extensible metadata
• MIX semi-structured data mediation using XML
• ICE collaboration environment
• Feature extraction
INFORMATION SERVICES
Data Intensive and High-Performance Distributed Computing
Local Resource Management
Data Repositories
Resources Layer
Fault Detection
Resource Management
Generic Services Layer
Domain Specific Services Layer
Application Toolkits
Network Caching
Metadata
Communication Libs. Grid-enabled Libs Visualization
Resource Discovery Resource Brokering
End-to-End QoS
Remote Data Access
Interdomain Security
Scheduling
Research Activities
• Support for remote execution of data manipulation procedures
  • Globus - SRB integration
• Automated feature extraction
  • XML based tagging of features
  • XML query language for storing attributes into the Intelligent Archive
• Integration with RIO - parallel I/O transport
Views of Software Infrastructure
• Software infrastructure supports user applications
• Reason for existence of software is to provide explicit capabilities required by applications
• What is the user perspective for building new software systems?
• Is the integration of digital library and metacomputing systems the final version?
Software Integration Projects
• NSF
  • Computational Grid - Middleware using distributed state information to support metacomputing services
• DOE
  • Data Visualization Corridor - collaboratively visualize multi-terabyte sized data sets
• NASA
  • Information Power Grid - integrate data repositories with applications and visualization systems
• DARPA
  • Quorum - provide quality of service guarantees
User Requirements - Five Software Environments
• Code Development
  • Resources support
• Run-time
  • Parallel Tools and Libraries
• Distributed Run-Time
  • Metacomputing environment
• Interaction Environments
  • Collaboration, presentation
• Publication / Discovery / Retrieval
  • Data intensive computing environment
Metacomputing Environment Data Flow Perspective
Archival Storage System
Remote Data Manipulation
Data Handling System
Data Staging System
Data Caching System
Distributed Execution Environment
Object Oriented Interface
Application
Publication Environment Data Flow Perspective
Archival Storage System
Remote Data Manipulation
Data Handling System
Collection Management Software
Digital Library Services
Data Set Constructor
Run-time Access
Application
Run-time Environment Data Flow Perspective
Archival Storage System
Data Handling System
Data Caching System
Library Interoperation
Data Structures Library
Memory Tiling
Parallel I/O Library
Application
Interaction Environment Data Flow Perspective
Archival Storage System
Data Manipulation System
Data Caching System
Data Formatting System
Rendering System
Visualization Environment
Collaboration Environment
Application
Taxonomy of User Requirements
Environments: Code Development | Run-time | Distributed Run-Time / Metacomputing | Collaboratories / Interaction / Presentation | Publication / Discovery / Retrieval
Capabilities:
Data manipulation
  Data caching
  Data subsetting
  Data analysis
  Data discovery API
  Common directory
  Information discovery
Data naming / aggregation
  location transparency
  File system federation
  collection federation
Data access
  Small file manipulation
  Parallel I/O
  Remote I/O
  Remote data access
  distributed data access
Data organization
  Data structures
  Data format
  schemas
Archives
  Version management
  High-performance archive
  Large data storage
  Persistent archive
Comparison of Environments
Environments: Code Development | Run-time | Distributed Run-Time / Metacomputing | Collaboratories / Interaction / Presentation | Publication / Discovery / Retrieval
Capabilities:
Product sharing
  Application publication
  persistent objects
  Visualization modules
  Data collection building
Reusable software
  Math / Thread libraries
  Parallel thread libraries
  Distributed thread libraries
Application building
  Debuggers, compilers
  Task graph
  Data flow systems
Interoperability
  shared data
  Language interoperation
  view control
  schema interoperability
Performance
  Performance
  Resource utilization
Usability
  GUI
Look and feel
  Desktop environment
  Distributed desktop
  presentation architecture
  digital library workspace
Reservation
  resource reservation
  instrument reservation
  disk space reservation
Queuing
  local queuing
  global queuing
Scheduling
  job mix scheduling
  Distributed scheduling
Comparison of Environments
Environments: Code Development | Run-time | Distributed Run-Time / Metacomputing | Collaboratories / Interaction / Presentation | Publication / Discovery / Retrieval
Capabilities:
Communication software
  Heterogeneous network
Dynamic control
  real-time steering
  teleinstrumentation
Execution
  job execution
  Load balancing
  Distributed execution
  collaboration service
  remote service
Operating System
  Clusters
  Distributed clusters
Authorization
  access control
  global access
  access control
Authentication
  authentication for CPU
  Single sign-on
  authentication for data
PACI Environments
Environments: Code Development | Run-time | Distributed Run-Time / Metacomputing | Collaboratories / Interaction / Presentation | Publication / Discovery / Retrieval
Capabilities:
Data manipulation
  Data caching - ADR
  Data subsetting - SRB, Vis5D
  Data analysis tools - Rocke
  Data discovery API
  Common directory structure
  Object ID - Legion, Pathname - Globus
  Information discovery - MCAT, Infobus
Data naming / aggregation
  location transparency - DFS
  File system federation - Legion
  collection federation - MCAT
Data access
  Small file manipulation - Unix
  Parallel I/O - MPI, PANDA
  Remote I/O
  Remote data access - Corba/SRB
  distributed data access - SRB, Infobus
Data organization
  Data structures - KeLP, SDDA, CARTE
  Data format - HPFv5
  schemas - MCAT
Archives
  Version management - CVS, RCS
  GASS
  Large data storage - UDB
  Persistent archive - HPSS
PACI Environments
Environments: Code Development | Run-time | Distributed Run-Time / Metacomputing | Collaboratories / Interaction / Presentation | Publication / Discovery / Retrieval
Capabilities:
Product sharing
  Application publication - LDAP
  persistent objects - Legion
  Visualization modules - AVS
  Data collection building - MCAT
Reusable software
  Netsolve, Symera
  DCOM support
Application building
  Debuggers, compilers - Titanium, P compiler
  Task graph, AppLeS, Treadmarks - distributed shared memory
  Data flow systems - AVS
Interoperability
  shared data - HPSS
  Language interoperation - Metachaos
  view control - ICE
  schema interoperability - MCAT
Performance
  Performance - Pablo, Paradyne
  Resource utilization - NWS
Usability
  Documentation
  GUI - Pancake
Look and feel
  Desktop environment - Unix
  Distributed desktop - ?
  presentation architecture - ICE, CORBA
  digital library workspace - ADL, ELIB, UMDL, MSU, MSD, ESA
PACI Environments
Environments: Code Development | Run-time | Distributed Run-Time / Metacomputing | Collaboratories / Interaction / Presentation | Publication / Discovery / Retrieval
Capabilities:
Reservation
  Job performance monitor
  resource reservation - Maui scheduler
  instrument reservation - ?
  disk space reservation - ?
Queuing
  local queuing - LSF, Loadleveler
  Generic batch interface - Globus
Scheduling
  job mix scheduling - MAUI
  Distributed scheduling - Vernon
Communication software
  Heterogeneous network - Nexus
Dynamic control
  real-time steering - ?
  teleinstrumentation - ICE
Execution
  job execution - Unix
  Load balancing - KeLP
  Distributed execution - Globus, Legion, Condor, HPVM, High-performance Java
  collaboration service - ICE, Java, Habanero, Tango, Virtual Director
  remote service - ELIB, ADL
Operating System
  Clusters - NOW
  Distributed clusters - Millenium
Authorization
  access control - Unix
  global access - Globus/LDAP
  access control - MCAT/SEA
Future Systems
• Automation of
  • Information discovery
  • Application execution
  • Publication of results
• Integration of
  • Code Development
  • Run-time support
  • Distributed computing
  • Collaborative analysis
  • Information publication
Further Information
http://www.npaci.edu/DICE