Red Hat Enterprise MRG: Messaging, Realtime, Grid
Pronounced "merge" – http://www.redhat.com/mrg
Presentation version 7, May 2008
MRG = Messaging, Realtime, Grid
Integrated platform for high-performance distributed computing
Messaging
● High-speed, interoperable, open-standard messaging middleware
● Fast, reliable, built on AMQP
Realtime
● Predictable, low-latency, quality-of-service Linux kernel
Grid
● Scheduler – high-performance computing (HPC) for distributed workloads, cycle stealing
● Manages large pools of computers & tasks
The 3 components are complementary, but can be used independently
MRG Messaging – objectives
Interoperability = AMQP 0-10; clients in Python, C++, Java, JMS, Ruby, .NET
Platforms = RHEL differentiated; non-RHEL interop / support
Performance = Scale up and out for transient and durable messaging
Quality of Service (QoS) = Reliable; transactional (5 modes) through 2PC (ACID); manageable; clustering, HA, federation, InfiniBand
Application & Infrastructure = Used at both application and infrastructure level
Ecosystem
● AMQP Working Group
● Community involvement
Features: http://www.redhat.com/mrg/messaging/features/
MRG Realtime – product objectives
Determinism = ability to schedule high-priority tasks predictably and consistently
Non-invasive = no application changes / recompile required
Priority = ensure that the highest-priority applications are not blocked by low-priority ones
Quality of Service (QoS) = trustworthy, consistent response times
Proven results
● Context-switch latency under 25 μs; 99.9999% under 20 μs (from interrupt to commencing running new process)
● Average of 38% improvement over stock RHEL 5
● Timer event precision enhanced to μs level, rather than ms
Features: http://www.redhat.com/mrg/realtime/features/
MRG Grid – product objectives
Scale up/out = commodity/SMB hardware in dedicated or part-time-use farms
Resource management = better asset utilization, and dynamic re-prioritization of work
Manageable = single job interface for realtime jobs, batch, virtualized, cycle-stealing, or bare-metal execution
Flexibility = seamless, flexible High Throughput Computing (HTC) and High Performance Computing (HPC) across
● Local / remote grids
● Remote clouds (Amazon EC2)
● Cycle-stealing from desktop PCs
Features: http://www.redhat.com/mrg/grid/features/
Putting it all together
Messaging and Realtime go hand in hand (determinism)
Messaging and Grid go hand in hand (scale)
Compound use cases / an integrated platform save bespoke integration (TCO)
One-stop management of the MRG platform (operational flexibility)
Partners: http://www.redhat.com/mrg/partners/
● AMD, Cisco, IBM, Intel, UW Madison
● Realtime Java (IBM)
What we will manage
Manage 1 to N servers from a single interface
Instrumentation data
● Current / historical / trends
● OS / Messaging / Grid / RT overlaid
Configuration
● Messaging – logs, federation, clustering, HA, QoS, ...
● RT – taskset, priority, tuning, ...
● Grid – pools, scheduler policy, targets, profiles, ...
Sample actions
● Kill clients, purge queue, increase pool size, close sessions, ...
Messaging – what is all the fuss about?
Common complaints in deploying distributed / highly scaled systems:
Cost = in most cases the deployment topology is dictated by licenses, not architecture
Features = most deployments have to create a set of services before they can start; these include routing, replay, IVQ, LVQ, SoWQ, etc.
Openness = no way to create open market exchanges
Performance = many have created proprietary solutions to meet throughput requirements
Standards failures = WS-* soup
Scope = need to be able to use messaging from the OS up to the application level
Management = need a better way to manage it...
AMQP – born out of user frustration...
John O'Hara from JPMC started working on defining AMQP out of frustration at seeing money wasted on patching around the core issues
Moving from messaging as a necessary evil to messaging as an enabler for services liquidity required a standard to underpin major investment in long-term projects
Messaging to simply solve 80% of enterprise use cases (current work):
● Pub/sub patterns
● Large message transfer (including file)
● Consumer patterns
● Eventing patterns
● Dynamic configuration
● Wiring patterns
● Task queue patterns
● Fan-out
High-volume use cases (future work):
● Fan-out using multicast
● High-throughput optimization
What is AMQP?
An Open Standard for Middleware:
Middleware: software that connects other software together. Middleware connects islands of automation, both within an enterprise and out to external systems.
Why it is different:
A straightforward and complete solution for business messaging
Cost effective for pervasive deployment
Totally open (developed in partnerships)
Created by users and technologists (messaging, OS, and network) working together
Made to satisfy real needs (also needs to provide things like IVQ, LVQ, replay, ...)
AMQP = a practical standard for long-term services architecture
The AMQP Model
The AMQP architecture specifies modular components and rules as the building blocks:
Exchanges
The "Exchange" receives messages from publisher applications and routes them to queues, based on arbitrary criteria – typically topic & message headers
Queues
The "Queue" stores messages until they can be safely processed by a consumer application (or multiple applications)
Bindings
The "Binding" defines the relationship between a queue and an exchange and provides the message routing criteria
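The exchange/queue/binding relationship can be sketched as a toy in-memory model. This is illustrative only – real brokers implement routing server-side, and the class and variable names here are invented for the example:

```python
from collections import defaultdict, deque

class DirectExchange:
    """Toy direct exchange: routes a message to every queue whose
    binding key exactly matches the message's routing key."""
    def __init__(self):
        self.bindings = defaultdict(list)   # routing key -> bound queues

    def bind(self, queue, routing_key):
        # a "binding" relates a queue to this exchange via a routing key
        self.bindings[routing_key].append(queue)

    def publish(self, routing_key, message):
        # deliver to each matching queue; unroutable messages are dropped
        for queue in self.bindings[routing_key]:
            queue.append(message)

ex = DirectExchange()
q = deque()                     # a "queue" stores messages for consumers
ex.bind(q, "key")
ex.publish("key", "Hello World!")
ex.publish("other", "dropped")  # no binding for "other"
print(q.popleft())              # -> Hello World!
```

The same three concepts appear directly in the Python client example later in this deck (exchange_declare, queue_declare, queue_bind).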
The AMQP 0-10 Architecture/Stack
AMQP is built in layers, from the model down to the transport:
● L4 – Model (queues, exchanges, messages, transactions, ...)
● L3 – Execution: stateful interceptors, frameset sequencing, frameset (dis)assembly, "subchannel" demultiplexing, frame flow control, replication & recovery, stateless interceptors
● L2 – Session: (de)multiplexing by channel, stateless interceptors, framing, heartbeating, integrity check (over TCP); (de)multiplexing by channel, stateless interceptors, framing (over SCTP)
● L1/L0 – Transport: TCP, SCTP, or other
Make it go fast...
Issues, trade-offs, asynchronous patterns, using the OS:
● Bubbles in IO pipelines
● Decoupling the ack (sync versus async)
● Dealing with correlation
● Context switches (latency versus throughput for multi-core)
For fast middleware there are quite a few issues to be overcome; some can be dealt with better when done in conjunction with the OS.
[Chart: message throughput (single pipeline), msg size = 100 bytes – message throughput (k msgs/sec) and one-way page latency (μs) versus page size (KiB), for read and write]
RHM journal test bed data
So what does the data tell us for durable messages?
Using async IO, full DMA, and pipelined IO in the Red Hat Messaging journal:
● We can trade CPU for latency
● We hit the wall in the trade-off on a single core at ~50-byte messages
● At ~15k message sizes CPU becomes negligible and the fibre channel becomes the limiting factor
● The IO rate is easier to sustain with larger messages
● Best not to have to page any given queue (this will happen if you have a slow consumer and the queue depth backs up), else the rate will be <= 40% of the non-paged rate
● Schemes for avoiding paging include: terminate slow consumers; throttle the publisher; most likely design for 30%–35% of the max write rate for the target message size on the given hardware if slow consumers will not be terminated
● Ignore specific data rates, as the rate will be entirely determined by the submission and consumption pattern of your application
● Best rates can be achieved with a well-pipelined data flow, and async ack on both publisher and consumer
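The benefit of decoupling the ack from the publish can be seen with a back-of-the-envelope model. This is a sketch; the per-message cost and round-trip figures below are invented for illustration, not measured MRG numbers:

```python
import math

def batch_time(n_msgs, t_send, rtt, ack_every):
    """Total time to publish n_msgs when the publisher blocks for a
    broker round trip (rtt) once every ack_every messages; t_send is
    the per-message transmit cost. Illustrative numbers only."""
    return n_msgs * t_send + math.ceil(n_msgs / ack_every) * rtt

n, t_send, rtt = 100_000, 5e-6, 500e-6   # 5 us/msg, 0.5 ms round trip
sync_rate  = n / batch_time(n, t_send, rtt, ack_every=1)
async_rate = n / batch_time(n, t_send, rtt, ack_every=100)
print(f"sync ack : {sync_rate/1e3:6.1f} k msgs/sec")
print(f"async ack: {async_rate/1e3:6.1f} k msgs/sec")
```

With a synchronous ack per message, the round trips dominate and throughput collapses; batching the acks keeps the pipeline full – the "bubbles in IO pipelines" point from the earlier slide.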
Clustering
When deploying a cluster in a messaging system, any configuration trades off throughput, latency, and connection management.
● Uses RHEL 5 technology
● Handles commodity and central storage scenarios
[Diagram: deployment trade-offs (throughput, latency, connections) and cluster configuration examples]
Example – Python (publish / get)

client = connect()
chan = client.channel(0)
chan.channel_open()

chan.exchange_declare(0, "test", "direct")
chan.queue_declare(queue="test-queue")
chan.queue_bind(queue="test-queue", exchange="test", routing_key="key")

reply = chan.basic_consume(queue="test-queue")
print "consumer: %s" % reply.consumer_tag
queue = client.queue(reply.consumer_tag)

BODY = "Hello World!"
chan.basic_publish(exchange="test", routing_key="key", content=Content(BODY))
msg = queue.get()
print "received: %s" % msg.content.body
assert msg.content.body == BODY, "bad message body: %s" % msg.content.body
chan.basic_ack(msg.delivery_tag, True)
Red Hat Confidential
Realtime - Illustrating determinism
What do you tune...?
Turn stuff off
● cpuspeed, desktop, sendmail, RPC stuff, NFS, console mouse (gpm), anacron jobs
What is your precision need – nanoseconds or milliseconds? (non-RT)
● pmtimer: precise, but slow and not scalable
● TSC: fast, but P-state drift issues on some hardware
Process affinity / IRQ balance (tuna)
● Isolate IRQs, processes, and the OS on multicore boxes
Interrupt affinity (tuna)
● Interrupts also use CPUs
Configure the network stack, NUMA, etc.
Write better code
● Don't malloc()/free() over and over and over
● Avoid lock contention
● Don't call gettimeofday() more than needed
● Avoid rereading and reparsing config files, rebuilding caches
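The process pinning that tuna automates can also be done programmatically. A minimal sketch using Python's standard-library wrappers for the Linux scheduler calls (these interfaces exist only on Linux, hence the guard):

```python
import os

if hasattr(os, "sched_getaffinity"):        # Linux-only interfaces
    allowed = os.sched_getaffinity(0)       # 0 = the calling process
    print("allowed CPUs:", sorted(allowed))
    # Pin the process to a single CPU from its allowed set, e.g. to
    # keep a latency-sensitive task off CPUs servicing heavy IRQ load.
    target = {min(allowed)}
    os.sched_setaffinity(0, target)
    assert os.sched_getaffinity(0) == target
```

C programs use the underlying sched_setaffinity(2) call directly; tuna does the same thing across whole process groups and IRQs from one interface.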
Realtime how? – Community project
Upstream -rt developer participants and approximate contribution rate:
● 45% – Ingo Molnar – lead developer and overall upstream leader; focus on scheduler, locking, interrupts. Red Hat full-time employee
● 35% – Thomas Gleixner – developer, primarily concentrating on timers. Contractor to Red Hat
● 10% – Steven Rostedt – developer. Red Hat full-time employee
● 10% – all other participants (IBM, MontaVista, TimeSys, Novell, etc.)
All efforts are ultimately shaped towards long-term mainstream inclusion
Substantial additional Red Hat internal staffing for productization
● Testing & test development
● Tool development – system mgmt & performance monitoring
● RHEL 5-based tools remain relevant
● gdb, OProfile, Frysk, SystemTap, kprobes, kexec/kdump
● Latency Tracer – new RHEL-RT feature
● Runtime trace capture of longest-latency codepaths
● Selectable triggers for threshold tracing
● Detailed kernel profiles based on latency triggers
Real-time kernel work upstream
Items in 2.6.18 and prior are in RHEL 5
Work over the last 2+ years: 90% Red Hat, 120,000 lines
Mostly maintained in the -rt tree
Over the last year, many patches have moved from -rt to the mainline kernel:
● BKL made preemptible (2.6.8)
● Mutex patch (2.6.16)
● Semaphore-to-mutex conversion (ongoing, ~85% done)
● Hrtimers subsystem (2.6.16)
● Robust futexes (2.6.17)
● Lock validator (2.6.18)
● Priority inheritance futexes (PI-futex) (2.6.18)
● Generic IRQ layer (2.6.18)
● Core time rewrite (2.6.18)
● Sleepable RCU (2.6.19)
● Latency Tracer (circa 2.6.18)
● High-res timers + dynticks (2.6.21)
● CFS – completely fair scheduler (2.6.23)
● Conversion of spinlocks to mutexes (2.6.23+)
● All interrupt handling in threads (~2.6.23+)
● Full rt-preempt (~2.6.24+)
Improve kernel lock synchronization
Improve granularity – identify and correct contention points
Mutexes rather than semaphores
● Mutexes are lighter weight
Lock validator
● Efficient runtime confirmation of lock ordering
● Can detect race conditions without actually hitting them
Priority Inheritance (PI)
● Prevents low-priority processes from blocking higher-priority ones. Problem scenario:
● Low-priority process takes lock
● High-priority process needs lock, but must wait
● Long-running medium-priority process preempts low-priority process
● Solution: temporarily boost the low-priority process to allow completion
● Required for realtime Java – 1000s of threads
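The problem scenario above can be worked through with a toy discrete-time scheduler. All task names, priorities, and work units here are invented for illustration; the point is only that boosting the lock holder bounds the high-priority task's wait:

```python
def run(pi_enabled):
    """Simulate three tasks; 'low' holds a lock that 'high' needs.
    Returns the tick at which 'high' completes."""
    work = {"low": 2, "med": 5, "high": 1}   # ticks of CPU each needs
    prio = {"low": 1, "med": 2, "high": 3}   # higher number = higher prio
    lock_holder = "low"                      # low grabs the lock at t=0
    t = 0
    while work["high"] > 0:
        # high is blocked for as long as low still holds the lock
        blocked = {"high"} if lock_holder == "low" else set()
        runnable = [x for x in work if work[x] > 0 and x not in blocked]

        def eff(x):
            # PI: the lock holder inherits the blocked waiter's priority
            if pi_enabled and x == lock_holder and "high" in blocked:
                return prio["high"]
            return prio[x]

        cur = max(runnable, key=eff)         # run highest effective prio
        work[cur] -= 1
        t += 1
        if cur == "low" and work["low"] == 0:
            lock_holder = None               # critical section done
    return t

print("high finishes without PI at t =", run(False))  # med runs first
print("high finishes with PI at t =", run(True))      # low is boosted
```

Without PI, the medium task preempts the lock holder and the high task waits behind all of medium's work; with PI, the boosted holder finishes its critical section immediately and the high task's wait is bounded by that critical section.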
Timer precision & interrupt handling
Timer enhancements
● Infrastructure cleanup – factor common code, increase fields to represent nanosecond precision
● Timer precision – utilize high-resolution hardware timers at microsecond precision rather than approximate periodic timer-interrupt millisecond precision
● Generic time-of-day – cleanly accommodate diverse clock sources
● VDSO gettimeofday() – performance enhancement for millisecond accuracy
● Dynamic ticks – power savings – no need to take a timer interrupt 1000 times per second on an idle system – transition to a low-power state (great for OLPC)
Interrupt handling
● Generic IRQ mechanism – infrastructure cleanup – factor common code
● More fine-grained hardware interrupt control
CFS – completely fair scheduler
● Provides fair interactive response times in almost all situations
● Includes a modular scheduler framework – realtime task scheduler first
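An application can check what timer precision the running kernel actually offers. A small sketch using Python's standard-library clock APIs (POSIX systems; the HZ figures in the comment are typical examples, not MRG measurements):

```python
import time

# Kernel-reported resolution of the monotonic clock: on a kernel with
# high-resolution timers this is typically 1 ns; on a purely
# tick-driven kernel it can be as coarse as 1/HZ (e.g. 1-10 ms).
res = time.clock_getres(time.CLOCK_MONOTONIC)
print("CLOCK_MONOTONIC resolution:", res, "seconds")

# Observed granularity: smallest nonzero step between successive reads.
t0 = time.monotonic_ns()
while (t1 := time.monotonic_ns()) == t0:
    pass
print("observed step:", t1 - t0, "ns")
```

The equivalent C calls are clock_getres(2) and clock_gettime(2); the VDSO enhancement above makes the latter cheap enough to call in hot paths.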
No application changes required
All of the realtime enhancements are in the kernel – under the hood from an application perspective.
No application changes are required to benefit from realtime enhancements.
● Applications which are latency-bottlenecked due to kernel scheduling and interrupt handling will see benefit.
● Latencies introduced entirely in userspace (suboptimal application code, unbounded Java garbage collection, etc.) are not eliminated.
Recompilation is not required (same gcc/glibc as standard RHEL 5)
● Applications recompiled on RHEL 5 benefit from PI-mutex glibc implementation enhancements that avoid syscall overhead on uncontested locks.
How? – Realtime Java (RTSJ)
Versions of Java which are more deterministic – primarily by removing garbage-collection unpredictability and inter-JVM communication
RHEL-RT is the only Linux kernel having the prerequisites (i.e., priority inheritance, preemption)
Working closely with IBM
● IBM WebSphere Real Time
● Realtime-spec conformant – 200,000 RT-thread capable
● Exclusive realtime garbage collector
● 1 ms max GC pause time
● Uses at most 30% CPU in any 10 ms window
Deployed by the US Navy
● DDG Destroyer program
MRG Grid
Brings advantages of scale-out and flexible deployment to any application
Delivers better asset utilization, allowing applications to take advantage of all available computing resources
Dynamically provisions additional peak capacity for “Christmas Rush”-like situations
Executes across multiple platforms and in virtual machines
Provides seamless and flexible High Throughput Computing (HTC) and High Performance Computing (HPC) across
● Local grids
● Remote grids
● Remote clouds (Amazon EC2)
● Cycle-stealing from desktop PCs
MRG Grid is Based on Condor
MRG Grid is based on the Condor Project created and hosted by the University of Wisconsin, Madison
Red Hat and the University of Wisconsin have signed a strategic partnership around Condor:
● University of Wisconsin makes Condor source code available under OSI-approved open source license
● Red Hat & University of Wisconsin jointly fund and staff Condor development on-campus at the University of Wisconsin
Red Hat and the University of Wisconsin's partnership will:
● Add enhanced enterprise features, management, and supportability to Condor and MRG Grid
● Add High Throughput Computing capabilities to Linux
Red Hat is Initially Adding To Condor:
Enterprise Supportability
● Break Condor out from a statically-linked blob into multiple well-maintained and individually patchable RPMs
Web-Based Management Console
● Unified management across all of MRG for job, system, and workload management/monitoring
Virtualization Support via libvirt Integration
● Support scheduling of virtual machines on Linux using libvirt API's
AMQP Messaging Integration
● Enable job submission to Condor via AMQP Messaging clients
● Enable sub-second, low-latency scheduling for sub-second jobs
Amazon EC2 Integration
● Enable automatic EC2 provisioning, job submission, results storage, and teardown via the Condor scheduler
● Runs as a job, so it can be a dependency for other jobs or executed based on rules (e.g. add capacity at EC2 if the local grid is out of capacity)
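Job submission to Condor is driven by a submit description file. A minimal sketch of a vanilla-universe batch job using Condor's standard submit syntax (the executable and file names are illustrative):

```
# analyze.sub – submit description for a single batch job
universe   = vanilla
executable = analyze
arguments  = input.dat
output     = analyze.out
error      = analyze.err
log        = analyze.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
```

Submitted with condor_submit analyze.sub; the scheduler matches the job to a machine in the pool, transfers the files, and writes results back on exit. The AMQP integration above adds a messaging-based path to the same scheduler.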
MRG Grid Features
Management Tools
Desktop Cycle-Stealing
Cloud scheduling (Amazon EC2)
AMQP Messaging Integration – sub-second, messaging API for job submission
Virtualization – submit a virtual machine (VM) as a user job; supports migration of the VM
Policies
Federated Grids/Clusters
Multiple Standards-Based APIs
Workflow Management
High Availability
Disk Space Management
Database Support
Compute On-Demand
Dynamic Pool Creation
Priority Based Scheduling
Accounting
Security
Parallel Universe - extensible framework for running parallel (including MPI) jobs
Java Universe
Time Scheduling for Job Execution
Backfill
File Staging
Dedicated and Undedicated Node Management
Master-Worker (MW) - a single master process can allocate and manage multiple worker processes
Condor-C – move jobs across queues
Red Hat Enterprise MRG Availability
MRG Announcement & Beta Launch: December 2007● Public & Interactive beta program
MRG v1.0: First half 2008● Messaging, Realtime, Management console● MRG Grid Technology Preview● Support: Hub from North America / UK
MRG v1.1: Late 2008● Spec level Security, Expanded mgmt console, Grid● Support: World Wide
Supported Platform matrix: http://www.redhat.com/mrg/hardware/