Slide 1
ISTORE Software Runtime Architecture
Aaron Brown, David Oppenheimer, Kimberly Keeton, Randi Thomas, Jim Beck, John
Kubiatowicz, and David Patterson
http://iram.cs.berkeley.edu/istore
1999 Winter IRAM Retreat
Slide 2
ISTORE Runtime Software Architecture
• Runtime system goals for the ISTORE meta-appliance
  (1) Provide mechanisms that allow network service applications to exploit introspection (monitor + adapt)
  (2) Allow the appliance designer to tailor runtime system policies and interfaces
• How the goals are achieved
  (1) Introspection: layered local and global runtime system libraries that manipulate and react to monitoring data
  (2) Specialization: runtime system is extensible using domain-specific languages (DSLs)
Slide 3
Roadmap
• Layered software structure
• Example of introspection
• Runtime system extensibility using DSLs
• Conclusion
Slide 4
Layered software structure

[Diagram: device nodes and front-end node(s) connected by a switch to the LAN/WAN. Each device node layers, bottom to top: HW Device, Device Interface & RTOS, Local Runtime, Distributed Global Runtime, Parallel App. Worker Code. Each front-end node layers: HW Device (NIC), Device Interface & RTOS, Local Runtime, Distributed Global Runtime, Application Front-End Code.]
Slide 5
Device Interface Layer

[Diagram: the layered structure from Slide 4, highlighting the Device Interface & RTOS layer on the device and front-end nodes.]
Slide 6
Device interface layer
• Microkernel OS modules
• Traditional OS services
  – Networking, memory management, process scheduling, threads, ...
• Device-specific monitoring
  – Raw access patterns
  – Utilization statistics
  – Environmental parameters
  – Indications of impending failure
• Self-characterization of performance, functional capabilities (see the sketch below)
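
A minimal C++ sketch of the kind of health record and predictive check such a module might export; the DiskHealth fields, failure_suspected function, and its threshold are illustrative assumptions, not the actual ISTORE interface:

#include <cstdint>
#include <cstdio>

// Per-device health record a microkernel monitoring module might export.
struct DiskHealth {
    uint64_t reads, writes;   // raw access counts
    double   utilization;     // fraction of time busy, 0..1
    uint32_t ecc_retries;     // corrected-error retries since boot
    uint32_t media_errors;    // unrecoverable media errors
    double   temperature_c;   // environmental parameter
};

// Rising ECC retry rates or media errors often precede outright failure.
bool failure_suspected(const DiskHealth& prev, const DiskHealth& cur) {
    return cur.media_errors > prev.media_errors ||
           (cur.ecc_retries - prev.ecc_retries) > 100;  // arbitrary threshold
}

int main() {
    DiskHealth before{1000, 500, 0.42, 3, 0, 38.0};
    DiskHealth after {1500, 700, 0.55, 180, 0, 41.5};
    if (failure_suspected(before, after))
        std::printf("notify global fault handling mechanism\n");
    return 0;
}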
Slide 7
Local runtime layer

[Diagram: the layered structure from Slide 4, highlighting the Local Runtime layer.]
Slide 8
Local runtime layer
• Non-distributed mechanisms needed by network service applications
• Feeds information to global layer or performs local operations on behalf of global layer
• Example mechanisms
  – Application-specific filtering/aggregation of device monitoring data (see the sketch after this list)
    » Example: OLTP server vs. DSS server
  – Data layout and naming
    » Example: record-based interface for DB, file-based for web server
  – Device scheduling
    » Example: maximize TPS vs. maximize disk bandwidth utilization
  – Caching
    » Coherence essential vs. coherence unnecessary
    » More efficient caching implementation possible in second case
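
A small C++ sketch of the filtering/aggregation idea: the local runtime lets the appliance designer plug in a reducer that condenses raw device samples into whatever the global layer should see. All names here (IoSample, Aggregator, the reducers) are hypothetical:

#include <cstdio>
#include <functional>
#include <vector>

// Raw per-request record produced by the device interface layer.
struct IoSample { double latency_ms; std::size_t bytes; };

// Application-installed reducer from raw samples to one reported metric.
using Aggregator = std::function<double(const std::vector<IoSample>&)>;

// OLTP server: transactions are latency-bound, so report mean latency.
double oltp_reducer(const std::vector<IoSample>& s) {
    double sum = 0;
    for (const auto& x : s) sum += x.latency_ms;
    return s.empty() ? 0.0 : sum / s.size();
}

// DSS server: scans are bandwidth-bound, so report total bytes moved.
double dss_reducer(const std::vector<IoSample>& s) {
    std::size_t bytes = 0;
    for (const auto& x : s) bytes += x.bytes;
    return static_cast<double>(bytes);
}

int main() {
    std::vector<IoSample> window{{2.1, 4096}, {3.4, 8192}, {1.9, 4096}};
    Aggregator agg = oltp_reducer;  // chosen per application; dss_reducer for DSS
    std::printf("report to global layer: %.2f\n", agg(window));
    return 0;
}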
Slide 9
Global runtime layer

[Diagram: the layered structure from Slide 4, highlighting the Distributed Global Runtime layer.]
Slide 10
Global runtime layer
• Aggregate, process, react to monitoring data
• Relies on local per-device runtime mechanisms to provide monitoring data, implement control actions
• Provides application interface that hides distributed implementation of runtime services
• Example services
  – High-level services
    » Load balancing: replicate and/or migrate heavily used data objects when a disk becomes over-utilized
    » Availability: replicate data from failed or failing component to restore required redundancy
    » Plug-and-play: integrate new devices into the system
  – Low-level services used to implement high-level global services
    » Distributed directory tracks data and metadata objects (see the sketch after this list)
    » Migration, replication, caching
    » Inter-brick communication
    » Distributed transactions
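
A hedged C++ sketch of the distributed directory service named above, showing how it can hide the distributed implementation behind a single-system interface; the class and method names are invented for illustration:

#include <cstdio>
#include <map>
#include <string>
#include <vector>

using ObjectId = std::string;
using DeviceId = int;

// Maps each object to the devices holding its replicas. A real version
// would itself be partitioned across bricks; applications only see this
// single-system interface.
class Directory {
    std::map<ObjectId, std::vector<DeviceId>> replicas_;
public:
    void add_replica(const ObjectId& o, DeviceId d) { replicas_[o].push_back(d); }
    void delete_replicas(const ObjectId& o) { replicas_.erase(o); }
    // Load-balancing hook: a real implementation would pick the
    // least-loaded replica; this stub just takes the first.
    DeviceId locate(const ObjectId& o) const { return replicas_.at(o).front(); }
};

int main() {
    Directory dir;
    dir.add_replica("customer-table/page42", 3);
    dir.add_replica("customer-table/page42", 7);  // redundancy
    std::printf("route request to device %d\n", dir.locate("customer-table/page42"));
    return 0;
}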
Slide 11
Distributed application worker code

[Diagram: the layered structure from Slide 4, highlighting the Parallel App. Worker Code layer on the device nodes.]
Slide 12
Distributed application worker code
• Runs on top of global runtime system
• Written by appliance designer
• Application-specific (see the sketch after this list)
  – Database
    » scan, sort, join, aggregate, update record, delete record, ...
  – Transformational web proxy
    » fetch web page (from disk or remote site), apply transformation filter, update user preferences database, ...
• System administration tools implemented at this level
  – Customized runtime system defines administrative interface tailored to application
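
As a concrete (and entirely hypothetical) illustration, worker code might be exposed to the front-end as a table of named operators registered by the appliance designer:

#include <algorithm>
#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Record = std::string;
using Operator = std::function<std::vector<Record>(std::vector<Record>)>;

// Application-specific operators that run on the device bricks, on top
// of the global runtime; "scan" and "sort" are stubs for illustration.
std::map<std::string, Operator> worker_ops = {
    {"scan", [](std::vector<Record> in) { return in; }},
    {"sort", [](std::vector<Record> in) {
        std::sort(in.begin(), in.end());
        return in;
    }},
};

int main() {
    auto out = worker_ops["sort"]({"beta", "alpha"});
    std::printf("%s %s\n", out[0].c_str(), out[1].c_str());
    return 0;
}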
Slide 13
Application front-end code

[Diagram: the layered structure from Slide 4, highlighting the Application Front-End Code layer on the front-end node(s).]
Slide 14
Application front-end code
• Runs on front-end interface bricks
• Accepts requests from LAN/WAN connection
  – Incoming requests made using standard high-level protocols
    » HTTP, NFS, SQL, ODBC, ...
• Invokes and coordinates appropriate worker code components that execute on internal bricks
  – Takes into account locality and load balancing (see the sketch after this list)
  – Database: front-end performs SQL query optimization, invokes distributed relational operators on data storage devices
  – Transformational proxy: front-end invokes distiller thread on appropriate device brick
    » if data is cached, invoke on disk node
    » otherwise, fetch data from web and invoke on compute node or disk node
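
A minimal sketch of the locality-aware dispatch described above for the transformational proxy; the node-selection functions are assumptions, not the real ISTORE front-end API:

#include <cstdio>
#include <optional>
#include <string>

using NodeId = int;

// Stub for a directory lookup: which disk node, if any, caches this URL?
std::optional<NodeId> cached_on(const std::string& url) {
    if (url == "http://example.com/a.html") return 3;
    return std::nullopt;
}

// Front-end picks where the distiller (worker) thread runs: on the disk
// node holding the cached copy, otherwise on some compute node, which
// fetches the page from the web first.
NodeId choose_worker_node(const std::string& url) {
    if (auto n = cached_on(url)) return *n;  // locality: run where the data is
    return 9;                                // stub: any available compute node
}

int main() {
    std::printf("distill on node %d\n", choose_worker_node("http://example.com/a.html"));
    std::printf("distill on node %d\n", choose_worker_node("http://example.com/b.html"));
    return 0;
}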
Slide 15
Roadmap
• Layered software structure
• Example of introspection
• Runtime system extensibility using DSLs
• Conclusion
Slide 16
From introspection to adaptation
• Example: slowly-failing data disk in large DB system
  (1) Detect problem
  (2) Repair problem while continuing to handle incoming requests
  (3) Return to normal system operation

Intelligent HW components + Continuous monitoring + Extensible, application-tailored runtime system = Adaptive, self-maintaining appliance
Slide 17
Failing disk: detection
• Microkernel monitoring module that continuously tracks the disk's health detects an exceptional condition, e.g.
  – ECC failures
  – Media errors
  – Increased rates of ECC retries
• Notifies global fault handling mechanism
Slide 18
Failing disk: reaction
• Global fault handling mechanism...
  – Prevents system from sending more work to failed device
    » Modifies global directory to remove entries corresponding to failed component's data
  – Applies application-specific response to impending failure
    » Transactional system: discard work currently in progress on failing device, reissue to another data replica
    » Non-transactional system w/o coherent replicas: checkpoint computation, restore on another data replica
    » Transformational web proxy: do nothing
  – Instructs disk runtime system to shut disk down
    » Disk device is considered failed
Slide 19
Failing disk: return to normal operation
• Global fault handling mechanism...
  – Rebuilds data redundancy
    » By allocating space for a new replica on a functioning disk and copying data to it from existing replicas
    » Using an application-specific data replication mechanism
      • Where to allocate new replicas, how to copy data, how to lay out data for new replicas, how to update global directory
      • Example in upcoming slide
• Life returns to normal
  – Degree of fault-tolerance has been restored
• Failed component can be replaced during regularly-scheduled maintenance
Slide 20
Roadmap• Layered software structure• Example of introspection• Runtime system extensibility using DSLs• Conclusion
Slide 21
Runtime system extensibility
• Two ways of looking at system
  – Partitioned on functional/mechanism boundaries
    » Collection of libraries: failure detection, transactions, ...
    » Mechanisms are isolated

[Diagram: the application sitting atop a row of isolated libraries -- libfail, librepl, libtrxn, libcache -- atop the OS.]
Slide 22
Runtime system extensibility
• Two ways of looking at system
  – Partitioned on functional/mechanism boundaries
    » Collection of libraries: failure detection, transactions, ...
    » Mechanisms are isolated
  – Partitioned on global system properties
    » This is how the programmer thinks about the system (high-level)
    » e.g. application-specific data availability policy
      • Failure detection (which devices to monitor, ...)
      • Replication (used to restore redundancy)
      • Transactions (how to restart work in progress)
      • Caching (how to handle dirty cached objects during failure)

[Diagram: the same library stack -- application, libfail, librepl, libtrxn, libcache, OS.]
Slide 23
Runtime system extensibility
• Two ways of looking at system
  – Partitioned on functional/mechanism boundaries
    » Collection of libraries: failure detection, transactions, ...
    » Mechanisms are isolated
  – Partitioned on global system properties
    » This is how the programmer thinks about the system (high-level)
    » e.g. application-specific data availability policy
      • Failure detection (which devices to monitor, ...)
      • Replication (used to restore redundancy)
      • Transactions (how to restart work in progress)
      • Caching (how to handle dirty cached objects during failure)

[Diagram: the same library stack, now with a policy specification fed through a compiler to produce a customized runtime system library spanning the base libraries between the application and the OS.]
Slide 24
Extensibility using DSLs
• DSLs are languages specialized for a particular task
• Each ISTORE DSL
  – Encapsulates high-level semantics of one system behavior
  – Allows declarative specification of
    » Behavior of one aspect of the system (a "policy")
    » Interfaces to coordinated mechanisms that implement the policy
  – Is compiled into an implementation that might coordinate several local and/or global base runtime system mechanisms
    » May be implemented as background and/or foreground tasks
• Analysis tools can potentially infer unspecified emergent system behaviors from the specifications
  – e.g. what impact will a new redundancy policy have on transaction commit time?
• Extensions compiled together with local and global base mechanisms form the distributed runtime system
Slide 25
Extensibility using DSLs: Example

Avail::FailureDetected(Device d) {
  Object o; ObjList objs;
  Transaction t; TxnList txns;
  Replica x, c, r;

  Directory::MarkDeviceDisabled(d);
  Admin::AlertFailure(d);
  objs = Directory::GetObjects(d);        // objs stored on failed device
  foreach o (objs) {
    x = Directory::GetReplica(o, d);      // find o's replica on d
    Directory::DeleteReplica(x);          // delete from global directory
    txns = Txn::GetActiveTxns(x);
    foreach t (txns) {
      Txn::AbortTxn(t);                   // abort pending txns for o on d
    }
    c = Directory::GetReplica(o);         // find still-accessible copy
    r = LoadBalancer::AllocateReplica(o); // get space for new replica
    LocalRuntime::CopyObject(c, c->device, r, r->device); // copy it
    Directory::AddReplica(r, r->device);  // update directory
    foreach t (txns) {
      Txn::IssueTxn(t, r);                // reissue txns on new replica
    }
  }
}
Slide 26
Extensibility using DSLs (cont.)
• Similar specification written for each extension to base library
• Other examples of extensible system behaviors
  – Transaction response time requirements
  – Prioritizing operations based on type of data processed
  – Resource allocation
  – Backup policy
  – Exported administrative interface
Slide 27
Why use DSLs?
• Possible choices
  – Each appliance designer writes runtime system from scratch
    » Similar to exokernel operating systems
  – All designers use single parameterized runtime system library
    » Similar to tunable kernel parameters in modern OSs
  – Designer writes high-level specification of system behavior
    » DSL compiler automatically translates specification into runtime system extensions that coordinate base mechanisms
    » Advantages include
      • Programmability
      • Performance
      • Reliability, verifiability, safety
      • Artificial diversity
Slide 28
DSL advantages (cont.)
• Programmability
  – High-level specification close to designer's abstraction level
    » Easier to write, reason about, maintain, modify runtime system code
    » Simple enough to allow site-specific customization at installation time
• Performance
  – Aggressive DSL compiler can take advantage of high-level semantics of specification language
  – Base library mechanisms can be highly optimized; optimization complexity hidden from appliance designer
  – Web example: infer that TCP checksums should be stored with web pages
Slide 29
DSL advantages (cont.)
• Reliability
  – Automatically generate code that's easy to forget or get wrong
    » Example: synchronization operations to serialize accesses to distributed data structure
• Verifiability
  – Of input code (DSL specification)
    » More abstract form of semantic checking
    » e.g. DSL supports types natural to behavior being specified => type-checking verifies some semantic constraints
      • e.g. "ensure no unencrypted objects are written to disk" (see the sketch after this list)
  – Of output code (coordinated use of base mechanisms)
    » DSL compiler writer satisfied DSL compiler is correct => appliance designer inherits verification effort
• Safety (prevent runtime errors)
  – Whole classes of general programming errors not possible
    » DSLs hide details: runtime memory management, IPC, ...
    » Compiler automatically adds code: synchronization, ...
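
A small C++ analogue of the type-checking point above: if the DSL gives encrypted and unencrypted data distinct types and the disk-write interface accepts only the encrypted one, the constraint becomes a compile-time error. The types here are invented for illustration:

#include <cstdio>
#include <string>

struct Plaintext { std::string bytes; };
struct Encrypted { std::string bytes; };  // distinct type, not an alias

// Stand-in for the appliance's cipher; a real one would do actual crypto.
Encrypted encrypt(const Plaintext& p) { return Encrypted{p.bytes}; }

// Accepts only Encrypted values, so "no unencrypted objects are written
// to disk" is verified by the type checker, not by a runtime test.
void write_to_disk(const Encrypted& e) {
    std::printf("wrote %zu bytes\n", e.bytes.size());
}

int main() {
    Plaintext p{"secret record"};
    write_to_disk(encrypt(p));  // OK
    // write_to_disk(p);        // rejected at compile time
    return 0;
}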
Slide 30
DSL advantages (cont.)
• Artificial diversity
  – Potentially allow system to continue operation in face of internal bugs or malicious attack
    » Multiple implementations of component run simultaneously on different data replicas
    » Continuously check each other with respect to high-level behavior (see the sketch below)
    » Non-traditional fault-tolerance, but related to process pairs
  – Potentially usable to enhance performance
    » Select best-performing implementation(s) for future use; periodically reevaluate choice
  – Examples of possible implementation differences
    » Low-level: runtime memory layout, code ordering and layout
    » High-level: system resource usage (recompute vs. use stored data, general space/time/bandwidth tradeoffs)

[Diagram: a single specification fed through the DSL compiler, producing Implementation 1, Implementation 2, Implementation 3, ...]
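
A toy C++ sketch of the cross-checking idea: two independently derived implementations of the same specification run and compare their results, flagging a fault on divergence. The example computation is arbitrary:

#include <cstdio>
#include <cstdlib>

// Two "diverse" implementations of one specification: sum of 1..n.
long sum_loop(int n) { long s = 0; for (int i = 1; i <= n; ++i) s += i; return s; }
long sum_formula(int n) { return static_cast<long>(n) * (n + 1) / 2; }

int main() {
    const int n = 1000;
    long a = sum_loop(n), b = sum_formula(n);
    if (a != b) {  // implementations disagree: signal a fault
        std::fprintf(stderr, "divergence detected: %ld vs %ld\n", a, b);
        return EXIT_FAILURE;
    }
    std::printf("implementations agree: %ld\n", a);
    return EXIT_SUCCESS;
}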
Slide 31
ISTORE software summary
• ISTORE software architecture provides an extensible runtime environment for distributed network service application code
  – Layered local and global mechanism libraries provide introspection and self-maintenance
  – Mechanisms can be customized using DSL-based specifications of application policy
    » DSL code coordinates base mechanisms to implement application semantics and interfaces
    » DSL-based extension offers significant advantages in programmability, performance, reliability, safety, diversity
Slide 32
ISTORE summary
• Network services are increasing in importance
  – Self-maintaining, scalable storage appliances match the needs of these services
• ISTORE provides a flexible architecture for implementing storage-based network service apps
  – Modular, intelligent, fault-tolerant hardware platform is easy to configure, scale, and administer
  – Runtime system allows applications to leverage intelligent hardware, achieve introspection, and provide self-maintenance through
    » Layered runtime software structure
    » DSL-based extensibility that allows easy application-specific customization
Slide 33
Agenda
• Overview of ISTORE: Motivation and Architecture
• Hardware Details and Prototype Plans
• Software Architecture
• Discussion and Feedback
Slide 34
Backup slides
Slide 35
What ISTORE is not
• An extensible operating system
  – Use commodity OS, only add hardware monitoring module
    » MM could just be a device driver => no need for microkernel OS
    » ISTORE could be built on top of an extensible operating system for even greater flexibility
• An attempt to make commodity OSs extensible
  – Extensible runtime system allows designer to customize higher-level operations than OS extensions do
  – Closest to an extensible distributed operating system built on top of a commodity single-node operating system
• A multiple-protection-domain system
  – Assumes non-malicious programmer
  – If user-downloaded code permitted, sandbox must be implemented as part of (trusted) application
  – DSLs specify resource allocation/scheduling policies; appliance designer responsible for ensuring fairness
• A framework for building generic servers
Slide 36
ISTORE boot process
(1) Initially, undifferentiated ISTORE system
(2) On boot, each device brick contacts system boot server
(3) Device bricks download customized runtime system and application worker code
  – Front-end bricks also download application front-end code
• Runtime system libraries structured as shared libraries => hot upgrade (see the sketch below)
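
A sketch of the shared-library hot-upgrade point using POSIX dlopen; the library name and entry-point symbol are hypothetical:

#include <cstdio>
#include <dlfcn.h>

int main() {
    // Load (or, on upgrade, reload) the customized runtime downloaded at boot.
    void* h = dlopen("./libistore_runtime.so", RTLD_NOW);
    if (!h) { std::fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }

    // Look up an entry point in the freshly loaded version.
    using InitFn = void (*)();
    auto init = reinterpret_cast<InitFn>(dlsym(h, "runtime_init"));
    if (init) init();  // switch over to the new runtime code

    // At shutdown; a hot upgrade would dlopen the new version first,
    // then dlclose the old one once its work has drained.
    dlclose(h);
    return 0;
}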
Slide 37
Example Appliances
• E-commerce
• Web search engine
• Transformational web/PDA proxy
• Election server
• Mail server
• News server
• NFS server
• Database server: OLTP, DSS, mixed OLTP-DSS
• Video server