Red Hat Enterprise MRG: Messaging, Realtime, Grid
Pronounced "merge" – http://www.redhat.com/mrg
Presentation version 7, May 2008
MRG = Messaging, Realtime, Grid
Integrated platform for high-performance distributed computing
Messaging
● High-speed, interoperable, open-standard messaging middleware
● Fast, reliable, built on AMQP
Realtime
● Predictable, low-latency, quality-of-service Linux kernel
Grid
● Scheduler – high-performance computing (HPC) for distributed workloads, cycle stealing
● Manages large pools of computers & tasks
The 3 components are complementary, but can be used independently
MRG Messaging – objectives
Interoperability = AMQP 0-10; clients in Python, C++, Java, JMS, Ruby, .NET
Platforms = RHEL differentiated; non-RHEL interop / support
Performance = Scale up and out for transient and durable messaging
Quality of Service (QoS) = Reliable; transactional (5 modes) through 2PC (ACID); manageable; clustering, HA, federation, InfiniBand
Application & Infrastructure = Used at both application and infrastructure level
Ecosystem
● AMQP Working Group
● Community involvement
Features: http://www.redhat.com/mrg/messaging/features/
MRG Realtime – product objectives
Determinism = ability to schedule high-priority tasks predictably and consistently
Non-invasive = no application changes / recompile required
Priority = ensure that the highest-priority applications are not blocked by low-priority ones
Quality of Service (QoS) = trustworthy, consistent response times
Proven results
● Context-switch latency under 25 μs; 99.9999% under 20 μs (from interrupt to commencing running new process)
● Average of 38% improvement over stock RHEL 5
● Timer event precision enhanced to μs level, rather than ms
Features: http://www.redhat.com/mrg/realtime/features/
MRG Grid – product objectives
Scale up/out = commodity/SMB hardware in dedicated or part-time-use farms
Resource management = better asset utilization, and dynamic re-prioritization of work
Manageable = single job interface for realtime jobs, batch, virtualized, cycle-stealing, or bare-metal execution
Flexibility = seamless, flexible High Throughput Computing (HTC) and High Performance Computing (HPC) across
● Local / remote grids
● Remote clouds (Amazon EC2)
● Cycle-stealing from desktop PCs
Features: http://www.redhat.com/mrg/grid/features/
Putting it all together
Messaging and Realtime go hand in hand (determinism)
Messaging and Grid go hand in hand (scale)
Compound use cases / an integrated platform save bespoke integration (TCO)
One-stop management of the MRG platform (operational flexibility)
Partners: http://www.redhat.com/mrg/partners/
● AMD, Cisco, IBM, Intel, UW Madison
● Realtime Java (IBM)
What we will manage
Manage 1 to N servers from a single interface
Instrumentation data
● Current / historical / trends
● OS / Messaging / Grid / RT overlaid
Configuration
● Messaging – logs, federation, clustering, HA, QoS, ...
● RT – taskset, priority, tuning, ...
● Grid – pools, scheduler policy, targets, profiles, ...
Sample actions
● Kill clients, purge queue, increase pool size, close sessions, ...
Messaging – what is all the fuss about?
Common complaints in deploying distributed / highly scaled systems:
Cost = in most cases the deployment topology is dictated by licenses, not architecture
Features = most deployments have to create a set of services before they can start; these include routing, replay, IVQ, LVQ, SoWQ, etc.
Openness = no way to create open market exchanges
Performance = many have created proprietary solutions to meet throughput requirements
Standards failures = WS-* soup
Scope = need to be able to use messaging from the OS up to the application level
Management = need a better way to manage it...
AMQP – born out of user frustration...
John O'Hara from JPMC started working on defining AMQP out of frustration at seeing money wasted on patching around the core issues
Moving from messaging as a necessary evil to messaging as an enabler for services liquidity required a standard to underpin major investment in long-term projects
Messaging to simply solve 80% of enterprise use cases (current work):
● Pub/sub patterns
● Large message transfer (including file)
● Consumer patterns
● Eventing patterns
● Dynamic configuration
● Wiring patterns
● Task queue patterns
● Fan-out
High-volume use cases (future work):
● Fan-out using multicast
● High-throughput optimization
What is AMQP?
An Open Standard for Middleware:
Middleware: software that connects other software together. Middleware connects islands of automation, both within an enterprise and out to external systems.
Why it is different:
A straightforward and complete solution for business messaging
Cost effective for pervasive deployment
Totally open (developed in partnerships)
Created by users and technologists (messaging, OS, and network) working together
Made to satisfy real needs (also needs to provide things like IVQ, LVQ, replay, ...)
AMQP = a practical standard for long-term services architecture
The AMQP Model
The AMQP architecture specifies modular components and rules as the building blocks:
Exchanges
The "Exchange" receives messages from publisher applications and routes them to queues, based on arbitrary criteria – typically topic & message headers
Queues
The "Queue" stores messages until they can be safely processed by a consumer application (or multiple applications)
Bindings
The "Binding" defines the relationship between a queue and an exchange and provides the message routing criteria
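The exchange/queue/binding relationship can be sketched as a toy in-memory model. This is illustrative only – real brokers implement routing server-side, and the class and variable names here are invented for the example:

```python
from collections import defaultdict, deque

class DirectExchange:
    """Toy direct exchange: routes a message to every queue whose
    binding key exactly matches the message's routing key."""
    def __init__(self):
        self.bindings = defaultdict(list)   # routing key -> bound queues

    def bind(self, queue, routing_key):
        # a "binding" relates a queue to this exchange via a routing key
        self.bindings[routing_key].append(queue)

    def publish(self, routing_key, message):
        # deliver to each matching queue; unroutable messages are dropped
        for queue in self.bindings[routing_key]:
            queue.append(message)

ex = DirectExchange()
q = deque()                     # a "queue" stores messages for consumers
ex.bind(q, "key")
ex.publish("key", "Hello World!")
ex.publish("other", "dropped")  # no binding for "other"
print(q.popleft())              # -> Hello World!
```

The same three concepts appear directly in the Python client example later in this deck (exchange_declare, queue_declare, queue_bind).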
The AMQP 0-10 Architecture/Stack
AMQP is built in layers, from the model down to the transport:
● L4 – Model (queues, exchanges, messages, transactions, ...)
● L3 – Execution: stateful interceptors, frameset sequencing, frameset (dis)assembly, "subchannel" demultiplexing, frame flow control, replication & recovery, stateless interceptors
● L2 – Session: (de)multiplexing by channel, stateless interceptors, framing, heartbeating, integrity check (over TCP); (de)multiplexing by channel, stateless interceptors, framing (over SCTP)
● L1/L0 – Transport: TCP, SCTP, or other
Make it go fast...
Issues, trade-offs, asynchronous patterns, using the OS:
● Bubbles in IO pipelines
● Decoupling the ack (sync versus async)
● Dealing with correlation
● Context switches (latency versus throughput for multi-core)
For fast middleware there are quite a few issues to be overcome; some can be dealt with better when done in conjunction with the OS.
[Chart: message throughput (single pipeline), msg size = 100 bytes – message throughput (k msgs/sec) and one-way page latency (μs) versus page size (KiB), for read and write]
RHM journal test bed data
So what does the data tell us for durable messages?
Using async IO, full DMA, and pipelined IO in the Red Hat Messaging journal:
● We can trade CPU for latency
● We hit the wall in the trade-off on a single core at ~50-byte messages
● At ~15k message sizes CPU becomes negligible and the fibre channel becomes the limiting factor
● The IO rate is easier to sustain with larger messages
● Best not to have to page any given queue (this will happen if you have a slow consumer and the queue depth backs up), else the rate will be <= 40% of the non-paged rate
● Schemes for avoiding paging include: terminate slow consumers; throttle the publisher; most likely design for 30%–35% of the max write rate for the target message size on the given hardware if slow consumers will not be terminated
● Ignore specific data rates, as the rate will be entirely determined by the submission and consumption pattern of your application
● Best rates can be achieved with a well-pipelined data flow, and async ack on both publisher and consumer
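The benefit of decoupling the ack from the publish can be seen with a back-of-the-envelope model. This is a sketch; the per-message cost and round-trip figures below are invented for illustration, not measured MRG numbers:

```python
import math

def batch_time(n_msgs, t_send, rtt, ack_every):
    """Total time to publish n_msgs when the publisher blocks for a
    broker round trip (rtt) once every ack_every messages; t_send is
    the per-message transmit cost. Illustrative numbers only."""
    return n_msgs * t_send + math.ceil(n_msgs / ack_every) * rtt

n, t_send, rtt = 100_000, 5e-6, 500e-6   # 5 us/msg, 0.5 ms round trip
sync_rate  = n / batch_time(n, t_send, rtt, ack_every=1)
async_rate = n / batch_time(n, t_send, rtt, ack_every=100)
print(f"sync ack : {sync_rate/1e3:6.1f} k msgs/sec")
print(f"async ack: {async_rate/1e3:6.1f} k msgs/sec")
```

With a synchronous ack per message, the round trips dominate and throughput collapses; batching the acks keeps the pipeline full – the "bubbles in IO pipelines" point from the earlier slide.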
Clustering
When deploying a cluster in a messaging system, any configuration trades off throughput, latency, and connection management.
● Uses RHEL 5 technology
● Handles commodity and central storage scenarios
[Diagram: deployment trade-offs (throughput, latency, connections) and cluster configuration examples]
Example – Python (publish / get)

client = connect()
chan = client.channel(0)
chan.channel_open()

chan.exchange_declare(0, "test", "direct")
chan.queue_declare(queue="test-queue")
chan.queue_bind(queue="test-queue", exchange="test", routing_key="key")

reply = chan.basic_consume(queue="test-queue")
print "consumer: %s" % reply.consumer_tag
queue = client.queue(reply.consumer_tag)

BODY = "Hello World!"
chan.basic_publish(exchange="test", routing_key="key", content=Content(BODY))
msg = queue.get()
print "received: %s" % msg.content.body
assert msg.content.body == BODY, "bad message body: %s" % msg.content.body
chan.basic_ack(msg.delivery_tag, True)
Red Hat Confidential
Realtime - Illustrating determinism
What do you tune...?
Turn stuff off
● cpuspeed, desktop, sendmail, RPC stuff, NFS, console mouse (gpm), anacron jobs
What is your precision need – nanoseconds or milliseconds? (non-RT)
● pmtimer: precise, but slow and not scalable
● TSC: fast, but P-state drift issues on some hardware
Process affinity / IRQ balance (tuna)
● Isolate IRQs, processes, and the OS on multicore boxes
Interrupt affinity (tuna)
● Interrupts also use CPUs
Configure the network stack, NUMA, etc.
Write better code
● Don't malloc()/free() over and over and over
● Avoid lock contention
● Don't call gettimeofday() more than needed
● Avoid rereading and reparsing config files, rebuilding caches
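The process pinning that tuna automates can also be done programmatically. A minimal sketch using Python's standard-library wrappers for the Linux scheduler calls (these interfaces exist only on Linux, hence the guard):

```python
import os

if hasattr(os, "sched_getaffinity"):        # Linux-only interfaces
    allowed = os.sched_getaffinity(0)       # 0 = the calling process
    print("allowed CPUs:", sorted(allowed))
    # Pin the process to a single CPU from its allowed set, e.g. to
    # keep a latency-sensitive task off CPUs servicing heavy IRQ load.
    target = {min(allowed)}
    os.sched_setaffinity(0, target)
    assert os.sched_getaffinity(0) == target
```

C programs use the underlying sched_setaffinity(2) call directly; tuna does the same thing across whole process groups and IRQs from one interface.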
Realtime how? – Community project
Upstream -rt developer participants and approximate contribution rate:
● 45% – Ingo Molnar – lead developer and overall upstream leader; focus on scheduler, locking, interrupts. Red Hat full-time employee
● 35% – Thomas Gleixner – developer, primarily concentrating on timers. Contractor to Red Hat
● 10% – Steven Rostedt – developer. Red Hat full-time employee
● 10% – all other participants (IBM, MontaVista, TimeSys, Novell, etc.)
All efforts are ultimately shaped towards long-term mainstream inclusion
Substantial additional Red Hat internal staffing for productization
● Testing & test development
● Tool development – system mgmt & performance monitoring
● RHEL 5-based tools remain relevant
● gdb, OProfile, Frysk, SystemTap, kprobes, kexec/kdump
● Latency Tracer – new RHEL-RT feature
● Runtime trace capture of longest-latency codepaths
● Selectable triggers for threshold tracing
● Detailed kernel profiles based on latency triggers
Real-time kernel work upstream
Items in 2.6.18 and prior are in RHEL 5
Work over the last 2+ years: 90% Red Hat, 120,000 lines
Mostly maintained in the -rt tree
Over the last year, many patches have moved from -rt to the mainline kernel:
● BKL made preemptible (2.6.8)
● Mutex patch (2.6.16)
● Semaphore-to-mutex conversion (ongoing, ~85% done)
● Hrtimers subsystem (2.6.16)
● Robust futexes (2.6.17)
● Lock validator (2.6.18)
● Priority inheritance futexes (PI-futex) (2.6.18)
● Generic IRQ layer (2.6.18)
● Core time rewrite (2.6.18)
● Sleepable RCU (2.6.19)
● Latency Tracer (circa 2.6.18)
● High-res timers + dynticks (2.6.21)
● CFS – completely fair scheduler (2.6.23)
● Conversion of spinlocks to mutexes (2.6.23+)
● All interrupt handling in threads (~2.6.23+)
● Full rt-preempt (~2.6.24+)
Improve kernel lock synchronization
Improve granularity – identify and correct contention points
Mutexes rather than semaphores
● Mutexes are lighter weight
Lock validator
● Efficient runtime confirmation of lock ordering
● Can detect race conditions without actually hitting them
Priority Inheritance (PI)
● Prevents low-priority processes from blocking higher-priority ones. Problem scenario:
● Low-priority process takes lock
● High-priority process needs lock, but must wait
● Long-running medium-priority process preempts low-priority process
● Solution: temporarily boost the low-priority process to allow completion
● Required for realtime Java – 1000s of threads
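The problem scenario above can be worked through with a toy discrete-time scheduler. All task names, priorities, and work units here are invented for illustration; the point is only that boosting the lock holder bounds the high-priority task's wait:

```python
def run(pi_enabled):
    """Simulate three tasks; 'low' holds a lock that 'high' needs.
    Returns the tick at which 'high' completes."""
    work = {"low": 2, "med": 5, "high": 1}   # ticks of CPU each needs
    prio = {"low": 1, "med": 2, "high": 3}   # higher number = higher prio
    lock_holder = "low"                      # low grabs the lock at t=0
    t = 0
    while work["high"] > 0:
        # high is blocked for as long as low still holds the lock
        blocked = {"high"} if lock_holder == "low" else set()
        runnable = [x for x in work if work[x] > 0 and x not in blocked]

        def eff(x):
            # PI: the lock holder inherits the blocked waiter's priority
            if pi_enabled and x == lock_holder and "high" in blocked:
                return prio["high"]
            return prio[x]

        cur = max(runnable, key=eff)         # run highest effective prio
        work[cur] -= 1
        t += 1
        if cur == "low" and work["low"] == 0:
            lock_holder = None               # critical section done
    return t

print("high finishes without PI at t =", run(False))  # med runs first
print("high finishes with PI at t =", run(True))      # low is boosted
```

Without PI, the medium task preempts the lock holder and the high task waits behind all of medium's work; with PI, the boosted holder finishes its critical section immediately and the high task's wait is bounded by that critical section.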
Timer precision & interrupt handling
Timer enhancements
● Infrastructure cleanup – factor common code, increase fields to represent nanosecond precision
● Timer precision – utilize high-resolution hardware timers at microsecond precision rather than approximate periodic timer-interrupt millisecond precision
● Generic time-of-day – cleanly accommodate diverse clock sources
● VDSO gettimeofday() – performance enhancement for millisecond accuracy
● Dynamic ticks – power savings – no need to take a timer interrupt 1000 times per second on an idle system – transition to a low-power state (great for OLPC)
Interrupt handling
● Generic IRQ mechanism – infrastructure cleanup – factor common code
● More fine-grained hardware interrupt control
CFS – completely fair scheduler
● Provides fair interactive response times in almost all situations
● Includes a modular scheduler framework – realtime task scheduler first
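An application can check what timer precision the running kernel actually offers. A small sketch using Python's standard-library clock APIs (POSIX systems; the HZ figures in the comment are typical examples, not MRG measurements):

```python
import time

# Kernel-reported resolution of the monotonic clock: on a kernel with
# high-resolution timers this is typically 1 ns; on a purely
# tick-driven kernel it can be as coarse as 1/HZ (e.g. 1-10 ms).
res = time.clock_getres(time.CLOCK_MONOTONIC)
print("CLOCK_MONOTONIC resolution:", res, "seconds")

# Observed granularity: smallest nonzero step between successive reads.
t0 = time.monotonic_ns()
while (t1 := time.monotonic_ns()) == t0:
    pass
print("observed step:", t1 - t0, "ns")
```

The equivalent C calls are clock_getres(2) and clock_gettime(2); the VDSO enhancement above makes the latter cheap enough to call in hot paths.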
No application changes required
All of the realtime enhancements are in the kernel – under the hood from an application perspective.
No application changes are required to benefit from realtime enhancements.
● Applications which are latency-bottlenecked due to kernel scheduling and interrupt handling will see benefit.
● Latencies introduced entirely in userspace (suboptimal application code, unbounded Java garbage collection, etc.) are not eliminated.
Recompilation is not required (same gcc/glibc as standard RHEL 5)
● Applications recompiled on RHEL 5 benefit from PI-mutex glibc implementation enhancements that avoid syscall overhead on uncontested locks.
How? – Realtime Java (RTSJ)
Versions of Java which are more deterministic – primarily by removing garbage-collection unpredictability and inter-JVM communication
RHEL-RT is the only Linux kernel having the prerequisites (i.e., priority inheritance, preemption)
Working closely with IBM
● IBM WebSphere Real Time
● Realtime-spec conformant – 200,000 RT-thread capable
● Exclusive realtime garbage collector
● 1 ms max GC pause time
● Uses at most 30% CPU in any 10 ms window
Deployed by the US Navy
● DDG Destroyer program
MRG Grid
Brings advantages of scale-out and flexible deployment to any application
Delivers better asset utilization, allowing applications to take advantage of all available computing resources
Dynamically provisions additional peak capacity for “Christmas Rush”-like situations
Executes across multiple platforms and in virtual machines
Provides seamless and flexible High Throughput Computing (HTC) and High Performance Computing (HPC) across
● Local grids
● Remote grids
● Remote clouds (Amazon EC2)
● Cycle-stealing from desktop PCs
MRG Grid is Based on Condor
MRG Grid is based on the Condor Project created and hosted by the University of Wisconsin, Madison
Red Hat and the University of Wisconsin have signed a strategic partnership around Condor:
● University of Wisconsin makes Condor source code available under OSI-approved open source license
● Red Hat & University of Wisconsin jointly fund and staff Condor development on-campus at the University of Wisconsin
Red Hat and the University of Wisconsin's partnership will:
● Add enhanced enterprise features, management, and supportability to Condor and MRG Grid
● Add High Throughput Computing capabilities to Linux
Red Hat is Initially Adding To Condor:
Enterprise Supportability
● Break Condor out from a statically-linked blob into multiple well-maintained and individually patchable RPMs
Web-Based Management Console
● Unified management across all of MRG for job, system, and workload management/monitoring
Virtualization Support via libvirt Integration
● Support scheduling of virtual machines on Linux using libvirt API's
AMQP Messaging Integration
● Enable job submission to Condor via AMQP Messaging clients
● Enable sub-second, low-latency scheduling for sub-second jobs
Amazon EC2 Integration
● Enable automatic EC2 provisioning, job submission, results storage, and teardown via the Condor scheduler
● Runs as a job, so it can be a dependency for other jobs or executed based on rules (e.g. add capacity at EC2 if the local grid is out of capacity)
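Job submission to Condor is driven by a submit description file. A minimal sketch of a vanilla-universe batch job using Condor's standard submit syntax (the executable and file names are illustrative):

```
# analyze.sub – submit description for a single batch job
universe   = vanilla
executable = analyze
arguments  = input.dat
output     = analyze.out
error      = analyze.err
log        = analyze.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
```

Submitted with condor_submit analyze.sub; the scheduler matches the job to a machine in the pool, transfers the files, and writes results back on exit. The AMQP integration above adds a messaging-based path to the same scheduler.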
MRG Grid Features
Management Tools
Desktop Cycle-Stealing
Cloud scheduling (Amazon EC2)
AMQP Messaging Integration – sub-second, messaging API for job submission
Virtualization – submit a virtual machine (VM) as a user job; supports migration of the VM
Policies
Federated Grids/Clusters
Multiple Standards-Based APIs
Workflow Management
High Availability
Disk Space Management
Database Support
Compute On-Demand
Dynamic Pool Creation
Priority Based Scheduling
Accounting
Security
Parallel Universe - extensible framework for running parallel (including MPI) jobs
Java Universe
Time Scheduling for Job Execution
Backfill
File Staging
Dedicated and Undedicated Node Management
Master-Worker (MW) - a single master process can allocate and manage multiple worker processes
Condor-C – move jobs across queues
Red Hat Enterprise MRG Availability
MRG Announcement & Beta Launch: December 2007● Public & Interactive beta program
MRG v1.0: First half 2008● Messaging, Realtime, Management console● MRG Grid Technology Preview● Support: Hub from North America / UK
MRG v1.1: Late 2008● Spec level Security, Expanded mgmt console, Grid● Support: World Wide
Supported Platform matrix: http://www.redhat.com/mrg/hardware/