
MC91: High Availability for WebSphere MQ

MC91: High Availability for WebSphere MQ on UNIX platforms

Version 7

April 2008

WebSphere MQ Development

IBM Hursley

Property of IBM


Take Note!

Before using this document, be sure to read the general information under “Notices”.

Fourth Edition, April 2008

This edition applies to Version 7.0 of SupportPac MC91 and to all subsequent releases and modifications unless otherwise indicated in new editions.

© Copyright International Business Machines Corporation 2000, 2008. All rights reserved. Note to US Government Users—Documentation related to restricted rights—Use, duplication or disclosure is subject to restrictions set forth in GSA ADP Schedule contract with IBM Corp.


Table of Contents

Take Note!
Table of Contents
Notices
Summary of Changes
   Trademarks
IMPORTANT: VERSIONS AND MIGRATION
Introduction
   Concepts
   Definition of the word “cluster”
   Functional Capabilities
Installation
   Installing the SupportPac
Configuration
   Step 1. Configure the HA Cluster
   Step 2. Configure the shared disks
   Step 3. Create the Queue Manager
   Step 4. Configure the movable resources
   Step 5. Configure the Application Server or Agent
   Step 6. Configure an Application Monitor
   Step 7. Removal of Queue Manager from Cluster
   Step 8. Deletion of Queue Manager
Upgrading WMQ software in a cluster
   Applying maintenance
Commands
   hacrtmqm
   halinkmqm
   hadltmqm
   hamqm_start
   hamqm_stop
   /MQHA/bin/rc.local
Working with other HA products
   Related products
Suggested Test
Appendix A. Sample Configuration Files for VCS
   types.cf
   main.cf
Appendix B. Messages produced by MQM agent for VCS


Notices

This report is intended to help the customer or IBM systems engineer configure WebSphere MQ (WMQ) for UNIX platforms in a highly available manner using various High Availability products.

References in this report to IBM products or programs do not imply that IBM intends to make these available in all countries in which IBM operates.

While the information may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere.

The data contained in this report was determined in a controlled environment, and therefore results obtained in other operating environments may vary significantly.

The following paragraph does not apply to any country where such provisions are inconsistent with local law:

INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore this statement may not apply to you.

References in this publication to IBM products, programs, or services do not imply that IBM intends to make these available in all countries in which IBM operates. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any of the intellectual property rights of IBM may be used instead of the IBM product, program, or service. The evaluation and verification of operation in conjunction with other products, except those expressly designated by IBM, are the responsibility of the user.

Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact Laboratory Counsel, Mail Point 151, IBM United Kingdom Laboratories, Hursley Park, Winchester, Hampshire SO21 2JN, England. Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.

IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to the IBM Director of Licensing, 500 Columbus Avenue, Thornwood, New York 10594, U.S.A.


Summary of Changes

May 1997

• Version 1.0 Initial release of the HACMP SupportPac MC63

November 2000

• Version 2.0 Updated to reflect current versions of products

December 2005

• Version 6.0 Replacement of MC63, MC6A and MC6B with MC91 to combine HACMP, MC/ServiceGuard and Veritas Cluster Server (VCS) documentation and scripts. Updated to support WebSphere MQ V6. Version numbered to match current version of WMQ.

January 2007

• Version 6.0.1 Updated with various comments for clarity. Code changes for MC/ServiceGuard to properly export environment variables. Code changes for VCS monitor script where queue manager name includes a “.”. Added amqcrsta to the list of processes to kill.

April 2008

• Version 7.0 Updated for compatibility with WebSphere MQ V7. Removed migration script.

Trademarks

The following terms are trademarks of the IBM Corporation in the United States, or other countries, or both:

o IBM
o MQ
o MQSeries
o AIX
o HACMP
o WebSphere MQ

The following are trademarks of Hewlett Packard in the United States, or other countries, or both:

o HP-UX
o Multi-Computer ServiceGuard (MC/ServiceGuard)

The following terms are trademarks of Sun Microsystems in the United States, or other countries, or both:

o Solaris

The following terms are trademarks of Symantec Corporation in the United States, or other countries, or both:

o Veritas

UNIX is a registered trademark in the United States and other countries licensed exclusively through X/Open Company Limited.


Other company, product, and service names, which may be denoted by a double asterisk (**), may be trademarks or service marks of others.


IMPORTANT: VERSIONS AND MIGRATION

This SupportPac is designed for WebSphere MQ (WMQ) V7. In an attempt to simplify the scripts, which had accumulated a lot of baggage from many previous versions of WMQ, the directories handled by the halinkmqm and hadltmqm scripts have been reduced to only those required by WMQ V7.

Because there is little change between WMQ V6 and WMQ V7 in the areas relevant to HA, the scripts can still be used with WMQ V6.

No script is provided in this package to automate migration from earlier versions of WMQ. This is because of the variety of older versions that might have been in use, and because WMQ V5.3 is in any case no longer a supported release; providing meaningful error handling while keeping the scripts simple enough to understand was considered too problematic. The only step required for V6 to V7 migration is described here.

Ideally queue managers will be newly created or recreated for V7 using these scripts. SupportPac MS03 can be useful to rebuild definitions of objects from a queue manager. However, we recognise it is not an ideal world! If you have a queue manager created with the V6 HA scripts and you wish to continue to use them with WMQ V7 then there is just one small change in the directory layout that you need to implement manually. This change should be done BEFORE upgrading the WMQ product version (or at least before restarting a queue manager) as migration of queue manager data is done automatically during the first restart after code upgrades. For older versions of WMQ, the approach we recommend for HA is to recreate the queue manager.

To change an existing V6 HA queue manager into the V7 layout, you need to add the qmgrlocl subdirectory to the local IPC directory for the queue manager, and create a symlink from the queue manager’s data directory. The queue manager can be kept running through this modification, as it will not be using the new directory until the product code is updated and the queue manager restarted.

On all nodes of the cluster, assuming you are running as root or mqm:

mkdir /var/mqm/ipc/<QMGR>/qmgrlocl
chown mqm:mqm /var/mqm/ipc/<QMGR>/qmgrlocl
chmod 775 /var/mqm/ipc/<QMGR>/qmgrlocl

On the node that currently owns the queue manager data directory:

ln -fs /var/mqm/ipc/<QMGR>/qmgrlocl /var/mqm/qmgrs/<QMGR>/qmgrlocl


Introduction

Concepts

This SupportPac provides notes and sample scripts to assist with the installation and configuration of WebSphere MQ (WMQ) V6 and V7 in High Availability (HA) environments. Three different platforms and environments are described here, but they share a common design and this design can also be extended for many other systems.

Specifically this SupportPac deals with the following HA products:

• HACMP (High Availability Cluster Multi Processing)

• Veritas Cluster Server (VCS)

• MC/ServiceGuard (MCSG)

The corresponding operating systems for which these HA products have been built are AIX, Solaris and HP-UX respectively. There is a separate SupportPac MC41 that provides similar function for WMQ on i5/OS, and support for Microsoft Cluster Services on Windows is built into the WMQ product. A section of this document describes how designs for other platform/HA product combinations can be implemented. In some cases, an HA product vendor might also include support for WMQ components, but we cannot comment on their suitability.

This document shows how to create and configure WMQ queue managers such that they are amenable to operation within an HA cluster. It also shows how to configure the HA product to take control of such queue managers. This SupportPac does not include details of how to configure redundant power supplies, redundant disk controllers, disk mirroring or multiple network or adapter configurations. The reader is referred to the HA product’s documentation for assistance with these topics.

WMQ includes many functions to assist with availability. However by using WMQ and HA products together, it is possible to further enhance the availability of WMQ queue managers. With a suitably configured HA cluster, it is possible for failures of power supplies, nodes, disks, disk controllers, networks, network adapters or queue manager processes to be detected and automatically trigger recovery procedures to bring an affected queue manager back online as quickly as possible. More information about WMQ’s availability features can be found at

ibm.com/developerworks/websphere/library/techarticles/0505_hiscock/0505_hiscock.html

It is assumed that the reader of this document has already decided to use an HA cluster – we will not go through the benefits of these systems again.

Definition of the word “cluster”

The word “cluster” has a number of different meanings within the computing industry. Throughout this document, unless explicitly noted otherwise, the word “cluster” is used to describe an HA cluster, which is a collection of nodes and resources (such as disks and networks) which cooperate to provide high availability of services running within the cluster. It is worth making a clear distinction between such an “HA cluster” and the use of the phrase “WMQ Cluster”, which refers to a collection of queue managers which can allow access to their queues by other queue managers in the cluster. The relationship between these two types of cluster is described in “Relationship to WMQ Clusters” later in this chapter.


Functional Capabilities

Cluster Configurations

This SupportPac can be used to help set up either standby or takeover configurations, including mutual takeover where all cluster nodes are running WMQ workload. Throughout this document we try to use the word “node” to refer to the entity that is running an operating system and the HA software; “system” or “machine” or “partition” or “blade” might be considered synonyms in this usage.

A standby configuration is the most basic cluster configuration in which one node performs work whilst the other node acts only as standby. The standby node does not perform work and is referred to as idle; this configuration is sometimes called “cold standby”. Such a configuration requires a high degree of hardware redundancy. To economise on hardware, it is possible to extend this configuration to have multiple worker nodes with a single standby node, the idea being that the standby node can take over the work of any other worker node. This is still referred to as a standby configuration and sometimes as an “N+1” configuration.

A takeover configuration is a more advanced configuration in which all nodes perform some kind of work and critical work can be taken over in the event of a node failure. A “one sided takeover” configuration is one in which a standby node performs some additional, non critical and non movable work. This is rather like a standby configuration but with (non critical) work being performed by the standby node. A “mutual takeover” configuration is one in which all nodes are performing highly available (movable) work. This type of cluster configuration is also sometimes referred to as “Active/Active” to indicate that all nodes are actively processing critical workload.

With the extended standby configuration or either of the takeover configurations it is important to consider the peak load which may be placed on any node which can take over the work of other nodes. Such a node must possess sufficient capacity to maintain an acceptable level of performance.

Cluster Diagram

[Diagram: an HA cluster of Node A and Node B, each with its own internal disks, attached to shared disks and connected by a public network (e.g. ethernet) and a private network (e.g. serial). WebSphere MQ clients and remote queue managers connect over the public network. The resources managed by the cluster, the highly available queue managers QMgr1 and QMgr2 together with their service addresses and data, can migrate to other adapters or nodes. The cluster could also have additional nodes, public and private networks, network adapters, disks and disk controllers.]


WMQ Monitoring

This SupportPac includes a monitor for WMQ, which will allow the HA product to monitor the health of the queue manager and initiate recovery actions that you configure, including the ability to restart the queue manager locally or move it to an alternate system.

Relationship to WMQ Clusters

WMQ Clusters reduce administration and provide load balancing of messages across instances of cluster queues. They also offer higher availability than a single queue manager, because following a failure of a queue manager, messaging applications can still access surviving instances of a cluster queue. However, WMQ Clusters alone do not provide automatic detection of queue manager failure and automatic triggering of queue manager restart or failover. HA clusters provide these features. The two types of cluster can be used together to good effect.

WMQ Clients

WMQ Clients which are communicating with a queue manager that may be subject to a restart or takeover should be written to tolerate a broken connection and should repeatedly attempt to reconnect. WebSphere MQ Version 7 introduces features in the processing of the Client Channel Definition Table that assist with connection availability and workload balancing; however, these are not directly relevant when working with a failover system.

The Extended Transactional Client, which allows a WMQ Client to participate in two-phase transactions, must always connect to the same queue manager and cannot use techniques such as an IP load-balancer to select from a list of queue managers. When an HA product is used, a queue manager maintains its identity (name and address) whichever node it is running on, so the ETC can be used with queue managers that are under HA control.


Installation

The operating system, the HA product and WebSphere MQ should already be installed, using the normal procedures on all nodes in the cluster. You should install WMQ onto local disks (which might be on a SAN, but they appear to be local filesystems) on each of the nodes and not attempt to share a single installation on shared disks. It is important that under normal operating conditions you are running identical versions of software on all cluster nodes. The only exception to this is during a rolling upgrade.

When installing WMQ, ignore the advice in the WMQ documentation about creating separate /var/mqm and /var/mqm/log filesystems. This is not the preferred configuration in an HA environment. See under “Chapter 3. Configuration” for more details.

When installing WMQ in a cluster, it is essential that the “mqm” username and “mqm” groupname have been created and each have the same numeric value on all of the cluster nodes.

Installing the SupportPac

For HACMP and MC/ServiceGuard: For each node in the cluster, log on as mqm or root. Create the /MQHA/bin directory. This is the working directory assumed by the example scripts. Download the SupportPac onto each of the cluster nodes into a temporary directory, uncompress and untar it. Then copy the files from the mcsg or hacmp subdirectory to /MQHA/bin.

All of the scripts in the working directory need to have executable permission. The easiest way to do this is to change to the working directory and run

chmod 755 ha*

You could use a different location than the default working directory if you wanted to, but you would have to change the example scripts.
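As an illustration only, the sequence on each node might look like the following. The archive name mc91.tar.Z and the hacmp subdirectory are assumptions; substitute the actual file you downloaded, and use the mcsg subdirectory for MC/ServiceGuard.

mkdir -p /MQHA/bin                  # working directory assumed by the example scripts
cd /tmp
uncompress mc91.tar.Z               # or gunzip, depending on how the archive was packaged
tar -xvf mc91.tar
cp hacmp/* /MQHA/bin                # use the mcsg subdirectory for MC/ServiceGuard
cd /MQHA/bin
chmod 755 ha*                       # make the scripts executable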

For VCS: For each node in the cluster, log on as root. Create the /opt/VRTSvcs/bin/MQM directory. This is the working directory assumed by VCS and the example scripts. Download the SupportPac onto each of the cluster nodes into a temporary directory, uncompress and untar it, and then copy the files from the vcs subdirectory to the /opt/VRTSvcs/bin/MQM directory.

Ensure that all the methods and utility scripts are executable, by issuing:

chmod +x online offline monitor clean ha* explain

The agent methods are written in perl. You need to copy or link the ScriptAgent binary (supplied as part of VCS) into the MQM agent directory, as follows:

cp /opt/VRTSvcs/bin/ScriptAgent /opt/VRTSvcs/bin/MQM/MQMAgent

The MQM resource type needs to be added to the cluster configuration file. This can be done using the VCS GUI or ha* commands while the cluster is running, or by editing the types.cf file with the cluster stopped. If you choose to do this by editing the types.cf file, stop the cluster and edit the /etc/VRTSvcs/conf/config/types.cf file, appending the MQM type definition shown in Appendix A. For convenience, this definition can be copied directly from the types.MQM file. This sets the OnlineWaitLimit, OfflineTimeout and LogLevel attributes of the resource type to recommended values. See Appendix A for more details.
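If you take the types.cf route, a minimal sketch of the sequence is shown below. It assumes the types.MQM file was copied into /opt/VRTSvcs/bin/MQM along with the rest of the SupportPac; adjust the path if you placed it elsewhere.

hastop -all                                                   # stop the cluster on all nodes
cat /opt/VRTSvcs/bin/MQM/types.MQM >> /etc/VRTSvcs/conf/config/types.cf
hastart                                                       # restart the cluster on each node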


Configure and restart the cluster and check that the new resource type is recognized correctly by issuing the following command:

hatype -display MQM


Configuration

All HA products have the concept of a unit of failover. This is a set of definitions that contains all the processes and resources needed to deliver a highly available service and ideally should contain only those processes and resources. This approach maximises the independence of each service, providing flexibility and minimising disruption during a failure or planned maintenance.

In HACMP, the unit of failover is called a resource group. On other HA products the name might be different, but the concept is the same. On VCS, it is known as a service group, and on MC/ServiceGuard it is a package.

The smallest unit of failover for WMQ is a queue manager, since you cannot move part of a queue manager without moving the whole thing. It follows that the optimal configuration is to place each queue manager in a separate resource group, together with the resources upon which it depends: the shared disks used by the queue manager (which should be in a volume group or disk group reserved exclusively for that resource group), the IP address used to connect to the queue manager (the service address), and an object which represents the queue manager.

You can put multiple queue managers into the same resource group, but if you do so they all have to failover to another node together, even if the problem causing the failover is confined to one queue manager. This causes unnecessary disruption to applications using the other queue managers.

HACMP/ES users who wish to use application monitoring should also note the restriction that only one Application Server in a resource group can be monitored. If you were to place multiple queue managers into the same group and wanted to monitor all of them, you would need to write a monitor capable of monitoring multiple queue managers.

If mirroring or RAID is used to provide protection from disk failures, references in the following text to physical disks should be taken to mean the disk or group of disks being used to store the data described.

A queue manager that is to be used in an HA cluster needs to have its recovery logs and data on shared disks, so that they can be accessed by a surviving node in the event of a node failure. A node running a queue manager must also maintain a number of files on non-shared disks. These files include files that relate to all queue managers on the node, such as /var/mqm/mqs.ini, and queue manager specific files that are used to generate internal control information. Files related to a queue manager are therefore divided between local/private and shared disks.

Regarding the queue manager files that are stored on shared disk it would, in principle, be possible to use a single shared disk for all the recovery data (logs and data) related to a queue manager. However, for optimal performance, it is recommended practice to place logs and data in separate filesystems such that they can be separately tuned for disk I/O. The example scripts included in this SupportPac use separate filesystems. The layout is described in “Step 2. Configure the Shared Disks”.

If the HA cluster will contain multiple queue managers then, depending on your chosen cluster configuration, two or more queue managers may need to run on the same node, after a takeover. To provide correct routing of WMQ channel traffic to the queue managers, you should use a different TCP/IP port number for each queue manager. The standard WMQ port is 1414. It is common practice to use a range of port numbers immediately above 1414 for additional queue managers. Note that whichever port number you assign to a queue manager, that port needs to be consistently defined on all cluster nodes that may host the queue manager, and all channels to that queue manager need to refer to the port.

When configuring a listener for incoming WMQ connections you can choose between inetd and runmqlsr. If you use inetd then you do not need to perform any start or stop of the listener from within the cluster scripts. If you use runmqlsr then you must configure the node so that the listener is started along with the queue manager. This can be done on HACMP and MC/ServiceGuard by a user exit called by the start scripts in this SupportPac. The preferred way of starting the channel listener from WMQ V6 onwards is to set up a Listener object which automatically starts the service along with the queue manager; this removes the need for any additional configuration in the start scripts.
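For example, a Listener object under queue manager control can be defined once with runmqsc; the listener name, port number and queue manager name below are illustrative only. With CONTROL(QMGR) the listener is started and stopped along with the queue manager.

echo "DEFINE LISTENER(HA.LSTR) TRPTYPE(TCP) PORT(1415) CONTROL(QMGR) REPLACE" | runmqsc ha.csq1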

The example scripts and utilities provided in the SupportPac and the descriptions of the configuration steps deal with one queue manager at a time. For additional queue managers, repeat Steps 2 through 8.

Step 1. Configure the HA Cluster

An initial configuration is straightforward and should present no difficulties for a trained implementer using the product documentation:

For HACMP: • Configure TCP/IP on the cluster nodes for HACMP. Remember to configure ~root/.rhosts, /etc/rc.net, etc.

• Configure the cluster, cluster nodes and adapters to HACMP as usual.

• Synchronise the Cluster Topology.

For VCS: • Configuration of the VCS cluster should be performed as described in the VCS documentation.

• Create a cluster and configure the networks and systems as usual.

For MC/ServiceGuard: • Create and configure the template ASCII package configuration file:

o Set the PACKAGE_NAME
o Set the NODE_NAME
o Set the RUN_SCRIPT
o Set the HALT_SCRIPT
o Set the SERVICE_NAME
o Set the SUBNET

• Create and configure the template package control script:

o Set VG, LV and IP
o Set the SUBNET
o Set the SERVICE_NAME

• Set up the customer_defined_run_cmds function to use the supplied hamqm_start script.

• Set up the customer_defined_stop_cmds function to use the supplied hamqm_stop script.

• Disable Node Fail Fast. When enabled, Node Fail Fast causes the machine to panic when the package fails; it is only necessary if WMQ services are so tightly integrated with other services on the node that it is impractical to fail over WMQ independently.

• Enable Package Switching so that packages can be moved between nodes.


For all: Once the initial configuration has been performed, test that the basic cluster is working - for example, that you can create a filesystem and that it can be moved from one node to another and that the filesystems mount correctly on each node.

Step 2. Configure the shared disks

This step creates the volume group (or disk group) and filesystems needed for the queue manager. The suggested layout is based on the advice earlier that each queue manager should be put into a separate resource group. You should perform this step and the subsequent steps for each queue manager that you wish to make highly available.

So that this queue manager can be moved from one node to another without disrupting any other queue managers, you should designate a group containing shared disks which is used exclusively by this queue manager and no others.

For performance, it is recommended that a queue manager uses separate filesystems for logs and data. The suggested layout therefore creates two filesystems within the volume group.

If you are using Veritas Volume Manager (VxVM) to control disks, you do not require the optional VxVM cluster feature that allows concurrent access to a shared disk by multiple systems. This capability is not needed for a failover service such as WMQ and it is recommended that a queue manager’s data or log filesystems are stored on disks that are not concurrently accessible from multiple nodes.

You can optionally protect each of the filesystems from disk failures by using mirroring or RAID; this is not shown in the suggested layout.

Mount points must all be owned by the mqm user.

You will need the following filesystems:

Per node: /var on a local non-shared disk - this is a standard filesystem or directory which will already exist. You only need one of these per node regardless of the number of queue managers that the node may host. It is important that all queue managers that may run on this node use one filesystem for some of their internal control information and the example scripts designate /var/mqm for this purpose. With the suggested configuration, not much WMQ data is stored in /var, so it should not need to be extended.

Even with a simple active/passive setup it is still recommended that you have independent mounted filesystems for queue manager data and logs, with /var/mqm continuing to be a node-specific directory, as applying maintenance to the software sometimes requires access to /var/mqm. Installing updates to a “passive” or standby node would not be possible if the whole directory is only accessible to the active node. This configuration also ensures that there is a per-node copy of mqs.ini which will be updated on standby nodes by the halinkmqm script.

Per queue manager: /MQHA/<qmgr>/data on shared disks - this is where the queue manager data directory will reside. /MQHA is the top level directory used in the example scripts.

/MQHA/<qmgr>/log on shared disks - this is where the queue manager recovery logs will reside.
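For example, on each node that may host a queue manager called ha.csq1 (the name is illustrative), the mount points could be created and handed to the mqm user as follows:

mkdir -p /MQHA/ha.csq1/data /MQHA/ha.csq1/log
chown -R mqm:mqm /MQHA/ha.csq1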


The filesystems are shown on the following diagram. The subdirectories and symlinks are all created automatically in the next step. The diagram does not show ALL of the necessary symlinks but is an indication of the structure. You only need to create the filesystems that are on shared disk (e.g. /MQHA/ha.csq1/data and /MQHA/ha.csq1/log), then proceed to Step 3.

[Diagram: filesystem organisation for a single queue manager called ha.csq1. On each node the local /var filesystem (internal disks) contains /var/mqm, with ha!csq1 entries under the qmgrs and ipc directories, including isem and esem subdirectories and an @ipcc/isem symlink. The shared disks hold the /MQHA/ha.csq1/data and /MQHA/ha.csq1/log filesystems, containing the queue manager's qmgrs data directory and its recovery logs; symlinks connect the local and shared directories.]

For HACMP: 1. Create the volume group that will be used for this queue manager’s data and log files.

2. Create the /MQHA/<qmgr>/data and /MQHA/<qmgr>/log filesystems using the volume group created above.

3. For each node in turn, import the volume group, vary it on, ensure that the filesystems can be mounted, unmount the filesystems and varyoff the volume group.
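A sketch of item 3 for one node is shown below; the volume group name and hdisk device are assumptions and will differ in your cluster.

importvg -y mqvg hdisk2             # import the shared volume group on this node
varyonvg mqvg
mount /MQHA/ha.csq1/data
mount /MQHA/ha.csq1/log
umount /MQHA/ha.csq1/log
umount /MQHA/ha.csq1/data
varyoffvg mqvg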

For VCS: 1. Create the disk group that will be used for this queue manager's data and log files, specifying the nodes that may host the queue manager. Add sufficient disks to the disk group to support the creation of volumes described below.

2. Create volumes within the disk group to support the creation of the /MQHA/<qmgr>/data and /MQHA/<qmgr>/log filesystems.

3. For each node in turn, create the mount points for the filesystems, import the disk group (temporarily), ensure that the filesystems can be mounted, unmount the filesystems and deport the disk group.
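A rough sketch of these steps is shown below; the disk group, volume and device names are assumptions, and the volume sizes will depend on your workload.

vxdg init mqdg mqdg01=c1t1d0                    # create the disk group
vxassist -g mqdg make mqdata 2g                 # volume for /MQHA/<qmgr>/data
vxassist -g mqdg make mqlog 1g                  # volume for /MQHA/<qmgr>/log
mkfs -F vxfs /dev/vx/rdsk/mqdg/mqdata
mkfs -F vxfs /dev/vx/rdsk/mqdg/mqlog
mkdir -p /MQHA/ha.csq1/data /MQHA/ha.csq1/log
mount -F vxfs /dev/vx/dsk/mqdg/mqdata /MQHA/ha.csq1/data
mount -F vxfs /dev/vx/dsk/mqdg/mqlog /MQHA/ha.csq1/log
umount /MQHA/ha.csq1/data
umount /MQHA/ha.csq1/log
vxdg deport mqdg                                # so that another node can import it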


For MC/ServiceGuard: 1. Create the volume group that will be used for this queue manager's data and log files (e.g. /dev/vg01).

2. Create a logical volume in this volume group (e.g. /dev/vg01/<qmgr>).

3. Mount the logical volume to be shared at /MQHA/<qmgr>.

4. Create the /MQHA/<qmgr>/data and /MQHA/<qmgr>/log filesystems using the volume group created above.

5. Unmount /MQHA/<qmgr>.

6. Issue a vgchange -a n /dev/vg01/<qmgr>.

7. Issue a vgchange -c y /dev/vg01/<qmgr>.

8. Issue a vgexport -m /tmp/mq.map -s -p -v /dev/vg01.

9. Copy the mq.map file created in 8 above to /tmp on the adoptive node.

10. On the adoptive node create the same logical volume and volume group as in steps 1 and 2 above.

11. Issue a vgimport -m /tmp/mq.map -s -v /dev/vg01

12. Mount the volume group on the adoptive node to check the configuration is correct

Step 3. Create the Queue Manager

Select a node on which to create the queue manager. It does not matter on which node you do this; any of the nodes that might host the queue manager can be used.

When you create the queue manager, it is strongly advised that you should use the hacrtmqm script included in the SupportPac. It is possible to create the queue manager manually, but using hacrtmqm will save a lot of effort. For example, hacrtmqm moves and relinks some subdirectories and for HACMP creates an HACMP/ES Application Monitor for the queue manager. The move and relink of these subdirectories is to ensure smooth coexistence of queue managers which may run on the same node.

1. Select a node on which to perform the following actions

2. Ensure the queue manager’s filesystems are mounted on the selected node.

3. Create the queue manager on this node, using the hacrtmqm script

4. Start the queue manager manually, using the strmqm command

5. Create any queues and channels

6. Test the queue manager

7. End the queue manager manually, using endmqm

8. On the other nodes, which may takeover the queue manager, run the halinkmqm script
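As an illustration of steps 4 to 7 only, assuming a queue manager named ha.csq1 and the WMQ sample programs in their default location (/usr/mqm/samp/bin on AIX); the queue name is arbitrary. The arguments to hacrtmqm and halinkmqm are described in the Commands chapter later in this document.

strmqm ha.csq1
echo "DEFINE QLOCAL(TEST.QUEUE)" | runmqsc ha.csq1
/opt/mqm/samp/bin/amqsput TEST.QUEUE ha.csq1    # put a test message, then end input with a blank line
/opt/mqm/samp/bin/amqsget TEST.QUEUE ha.csq1    # get the message back
endmqm -w ha.csq1                               # wait for the queue manager to end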


Step 4. Configure the movable resources

The queue manager has been created and the standby/takeover nodes have been updated. You now need to define a resource or service group which will contain the queue manager and all its associated resources.

For HACMP: The resource group can be either cascading or rotating. Whichever you choose, remember that the resource group will use the IP address as the service label. This is the address which clients and channels will use to connect to the queue manager.

If you choose cascading, it is recommended that you consider disabling the automatic fallback facility by setting Cascading Without Fallback to true. This is to avoid the interruption to the queue manager which would be caused by the reintegration of the top priority node after a failure. Unless you have a specific requirement which would make automatic fallback desirable in your configuration, then it is probably better to manually move the queue manager resource group back to the preferred node when it will cause minimum disruption.

1. Create a resource group and select the type as discussed above.

2. Configure the resource group in the usual way adding the service IP label, volume group and filesystem resources to the resource group.

3. Synchronise the cluster resources.

4. Start HACMP on each cluster node in turn and ensure that the cluster stabilizes, that the respective volume groups are varied on by each node and that the filesystems are mounted correctly.

For VCS: The service group will contain the queue manager resource and the disk group and IP address for the queue manager. The IP address is the one which clients and channels will use to connect to the queue manager.

Set up the SystemList attribute for the service group.

Because a queue manager can only run on one node at a time, the service group will be a failover group, which is the default setting (0) of the Parallel attribute.

You may wish to consider what settings you would prefer for the OnlineRetryLimit, OnlineRetryInterval, FailoverPolicy, AutoStart, AutoRestart and AutoFailover attributes.

1. Create a service group with the properties discussed or chosen above.

2. Add the disk group and IP address to the service group.

3. Ensure that the service group can be switched to each of the nodes in the SystemList and that on each node the filesystems created earlier are successfully mounted.

4. Verify that the service group behaves as you would expect for your chosen settings of attributes that control retries, failovers and automatic start or restart.
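If you prefer the command line to the GUI, a minimal sketch of items 1 and 2 follows; the group, resource, node, disk group and network values are all assumptions, and a complete group will also need resources for the filesystems (see the main.cf example in Appendix A).

haconf -makerw
hagrp -add MQ1
hagrp -modify MQ1 SystemList nodeA 0 nodeB 1
hagrp -modify MQ1 AutoStartList nodeA
hares -add MQ1_dg DiskGroup MQ1                 # disk group resource
hares -modify MQ1_dg DiskGroup mqdg
hares -modify MQ1_dg Enabled 1
hares -add MQ1_ip IP MQ1                        # service address resource
hares -modify MQ1_ip Device hme0
hares -modify MQ1_ip Address 192.168.1.50
hares -modify MQ1_ip Enabled 1
haconf -dump -makero
hagrp -online MQ1 -sys nodeA                    # item 3: try the group on each node in turn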

For MC/ServiceGuard: The following steps show how to configure a cluster under MC/ServiceGuard and how to configure nodes into the cluster. This information was supplied by Hewlett Packard and you should consult your MC/ServiceGuard documentation for a full explanation of the commands. The example commands are to set up a cluster of 2 nodes called ptaca2 and ptaca3. Examples of the files mentioned below are contained in the appendices of the Hewlett Packard documentation.

To configure the cluster:

1. Create the ascii template file:

cmquerycl -v -C /etc/cmcluster/cluster.ascii -n ptaca2 -n ptaca3

2. Modify this template to reflect the environment and then verify the cluster configuration:

cmcheckconf -v -C /etc/cmcluster/cluster.ascii

3. Apply the configuration file; this creates the cluster and automatically distributes the “cmclconfig” file throughout the cluster:

cmapplyconf -v -C /etc/cmcluster/cluster.ascii

4. Start and stop the cluster to check that the above has worked:

cmruncl -v -n ptaca2 -n ptaca3
cmviewcl -v
cmhaltcl -f -v
cmruncl -n ptaca2 -n ptaca3

To configure the MC/ServiceGuard package called mq1 on the first node:

1. Create and tailor the mq1 package for your environment:

cd /etc/cmcluster
mkdir mq1
cmmakepkg -s mq1.conf

2. Edit the mq1.conf file to reflect your environment.

3. Change into the mq1 directory created above.

4. Issue the following command:

cmmakepkg -s mq1.cntl

5. Shutdown the cluster:

cmhaltcl -f -v

6. Distribute the configuration files:

cmapplyconf -v -C /etc/cmcluster/cluster.ascii -P /etc/cmcluster/mq1/mq1.conf

To test the cluster and package startup:

1. Shutdown all queue managers (if any are running).

2. Unmount all logical volumes in the volume group you created earlier (e.g. /dev/vg01).

3. Deactivate the volume group.

4. Start the cluster:

cmruncl

5. Check that the package has started:

cmviewcl -v


To assign the dynamic IP address of the package:

1. Halt the package:

cmhaltpkg mq1

2. Edit the mq1.cntl script to add the package’s IP address.

3. Restart the package:

cmrunpkg -v mq1

4. Check the package has started and has clients:

cmviewcl -v

To add the second node to the cluster:

1. Edit the mq1.conf file and add the following line:

NODE_NAME ptaca2

2. Apply the new configuration:

cmapplyconf -v -C /etc/cmcluster/cluster.ascii -P mq1.ascii

3. Halt the cluster:

cmhaltcl -f -v

4. Restart the cluster:

cmruncl -v

To test package switching:

1. Halt the mq1 package:

cmhaltpkg mq1

2. Start the mq1 package on the node ptaca3:

cmrunpkg -n ptaca3 mq1

3. Enable package switching for the mq1 package on ptaca3:

cmmodpkg -e mq1

4. Halt the mq1 package:

cmhaltpkg mq1

5. Start the mq1 package on the node ptaca2:

cmrunpkg -n ptaca2 mq1

6. Enable package switching for the mq1 package on ptaca2:

cmmodpkg -e mq1

Step 5. Configure the Application Server or Agent

The queue manager is represented within the resource group by an application server or agent. The SupportPac includes example server start and stop methods which allow the HA products to start and end a queue manager, in response to cluster commands or cluster events.

For HACMP and MC/ServiceGuard the hamqm_start, hamqm_stop and hamqm_applmon programs are ksh scripts. For VCS, similar function is provided by the online, offline, monitor and clean perl programs.


The start and stop scripts allow you to specify a user exit to be invoked just after the queue manager is brought online or just before it is taken offline. Use of the user exit is optional. The purpose of the user exit is to allow you to start or stop additional processes following the start of a queue manager or just before ending it. For example, you may wish to start a listener, a trigger monitor or a command server. Note that WMQ V6 allows the queue manager to start these services and any arbitrary user program automatically, which might make use of the user exit redundant.

For HACMP: 1. Define an application server which will start and stop the queue manager. The start and stop scripts contained in the SupportPac may be used unmodified, or may be used as a basis from which you can develop customized scripts. The examples are called hamqm_start and hamqm_stop.

2. Add the application server to the resource group definition created in the previous step.

3. Optionally, create a user exit in /MQHA/bin/rc.local

4. Synchronise the cluster configuration.

5. Test that the node can start and stop the queue manager, by bringing the resource group online and offline.
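A hypothetical sketch of such a user exit is shown below. The calling convention (queue manager name passed as the first argument) and the choice of starting a trigger monitor are assumptions made for illustration; check the description of /MQHA/bin/rc.local later in this document before relying on it.

#!/bin/ksh
# /MQHA/bin/rc.local - illustrative user exit, run after the queue manager has been started
QMGR=$1
# start a trigger monitor in the background as the mqm user
su mqm -c "runmqtrm -m $QMGR -q SYSTEM.DEFAULT.INITIATION.QUEUE" &
exit 0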

For VCS: The agent, which is called MQM, allows VCS to monitor and control the queue manager, in response to cluster commands or cluster events. You could use the example agent exactly as it is, or you could use it as a guide to develop your own customised agent by writing a set of scripts which implement the agent interface. The example agent allows you to create multiple resources of resource type MQM, either in the same or different service groups.

All the methods operate on the queue manager with the same name as the resource to which the operation is being applied. The resource name is the first parameter in the ArgList passed to each method.

The online and offline methods are robust in that they don't assume anything about the state of the queue manager on entry. The offline method uses the OfflineTimeout attribute of the MQM resource type to determine how quickly it needs to operate, attempting a graceful shutdown first and more severe means of stopping the queue manager if it would otherwise run out of time. When invoked by the cluster, the offline method issues an immediate stop of the queue manager (endmqm -i) and allows just under half the value of OfflineTimeout seconds for it to complete. If it does not complete within that time, a preemptive stop (endmqm -p) is issued and a similar time is allowed. If the queue manager still hasn't stopped within that time, the offline method terminates it forcefully. The reason each phase is allowed slightly less than half of OfflineTimeout is that the agent reserves a little time in case it needs to run the abrupt termination. By the expiry of OfflineTimeout, the offline method will have brought the queue manager to a stop, forcefully if necessary. It is better if the queue manager can be shut down gracefully, because a clean shutdown leads to a faster restart.
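As a worked example of the timing only (this is not the agent's actual code): with OfflineTimeout set to 600 seconds, each of the immediate and preemptive stop phases is allowed a little under 300 seconds, and the small remainder is reserved for the forceful termination. The 20 second margin below is an assumption for illustration.

OFFLINE_TIMEOUT=600
RESERVE=20                                      # assumed margin kept back for the forceful kill
PHASE=$(( (OFFLINE_TIMEOUT - RESERVE) / 2 ))    # time allowed for endmqm -i, and again for endmqm -p
echo "each stop phase is allowed ${PHASE} seconds"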

You could modify the online and offline scripts to include additional startup or shutdown tasks to be performed just before or after an online or offline transition. Alternatively you could configure VCS triggers to perform such actions. Either of these could be used to allow you to start or stop additional processes following the start of a queue manager or just before ending it. For example, you may wish to start a listener, a trigger monitor or a command server. You may also want to start some other application that uses the queue manager. You may wish to send a notification to an application, a monitoring system, or a human administrator. Again, remember that WMQ V6 allows the queue manager to start services, including any arbitrary user program, automatically.


The MQM type should already have been added to the types.cf file, in the previous section on installation. Resources of type MQM, which represent individual queue managers, are configured by editing the main.cf configuration file. They could also be added using the VCS GUI or the ha* commands. An example of a main.cf entry for a resource of type MQM is included in Appendix A of this document.

1. Add a resource entry into the /etc/VRTSvcs/conf/config/main.cf file. See Appendix A for an example of a complete main.cf file. Set the resource attributes to your preferred values.

2. Create resource dependencies between the queue manager resource and the filesystems and IP address. The main.cf in Appendix A provides an example.

3. Start the service group and check that it starts the queue manager successfully. You can test the queue manager by using runmqsc and inspecting its queues.

Stop the service group and check that it stops the queue manager.

For MC/ServiceGuard: To define the start command so that it can be run as the user mqm under MC/ServiceGuard control, create a wrapper function in the package control script (/etc/cmcluster/mq1/mq1.cntl) that contains the following line:

su mqm -c "/MQHA/bin/hamqm_start_su $qmgr"

To define the stop command so that it can be run as the user mqm under MC/ServiceGuard control, create a wrapper function in the package control script (/etc/cmcluster/mq1/mq1.cntl) that contains the following line:

su mqm -c "/MQHA/bin/hamqm_stop_su $qmgr 30"
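A sketch of what these wrapper functions might look like inside mq1.cntl is shown below. The queue manager name is an assumption, and the function names follow the customer_defined_run_cmds and customer_defined_stop_cmds hooks mentioned in Step 1; check your own control script template before copying this.

# illustrative fragment of /etc/cmcluster/mq1/mq1.cntl
qmgr=ha.csq1                        # assumed queue manager name

function customer_defined_run_cmds
{
    su mqm -c "/MQHA/bin/hamqm_start_su $qmgr"
}

function customer_defined_stop_cmds
{
    su mqm -c "/MQHA/bin/hamqm_stop_su $qmgr 30"
}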

Step 6. Configure a monitor

For HACMP: If you are using HACMP/ES then you can configure an application monitor which will monitor the health of the queue manager and trigger recovery actions as a result of MQ failures, not just node or network failures. Recovery actions include the ability to perform local restarts of the queue manager (see below) or to cause a failover of the resource group to another node.

To benefit from queue manager monitoring you must define an Application Monitor. If you created the queue manager using hacrtmqm then one of these will have been created for you, in the /MQHA/bin directory, and is called hamqm_applmon.$qmgr.

The example application monitor determines whether the queue manager is still starting or whether it considers itself to be fully started. If the queue manager is still starting then the application monitor will allow it to complete its startup processing. This is important because the startup time of the queue manager can vary, depending on how much log replay needs to be performed. From WebSphere MQ V6, the queue manager issues messages to the console during startup giving an indication of its progress through the replay phase.

If the application monitor only tested whether the queue manager was running, then it would be difficult to choose a stabilisation interval that was both short enough to allow sensitive monitoring and long enough to cater for the cases where there is a lot of log replay to perform. There would be a risk that a genuine failure could go undetected, or that a valid startup was abandoned during log replay. You may wish to incorporate similar monitoring behaviour if you decided to write your own application monitor.


If you are using HACMP/ES, and have configured the application monitor, as described above, then the recovery actions that you can configure include the ability to perform local restarts of the queue manager. HACMP will attempt local restarts up to a maximum number of attempts within a specified period. The maximum attempts threshold and the period are configurable. For WMQ, it is recommended that the threshold is set to 1 so that only one restart is attempted, and that the time period is set to a small multiple of the expected start time for the queue manager. With these settings, if successive restarts fail without a significant period of stability between, then the queue manager resource group will be moved to a different node. Attempting more restarts on a node on which a restart has just failed is unlikely to succeed.

1. To enable queue manager monitoring, define a custom application monitor for the Application Server created in Step 5, providing the name of the monitor script and telling HACMP how frequently to invoke it. Set the stabilisation interval to 10 seconds, unless your queue manager is expected to take a long time to restart. This would normally be the case if your environment has long-running transactions that might cause a substantial amount of recovery/replay to be required.

2. To configure for local restarts, specify the Restart Count and Restart Interval.

3. Synchronise the cluster resources.

4. Test the operation of the application monitoring, and in particular verify that the local restart capability is working as configured. A convenient way to provoke queue manager failures is to identify the Execution Controller process (called amqzxma0) associated with the queue manager, and kill it.
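For example, to provoke a failure of a queue manager called ha.csq1 (the name is illustrative), find and kill its Execution Controller:

ps -ef | grep amqzxma0 | grep ha.csq1           # note the process id of the Execution Controller
kill -9 <pid>                                   # substitute the process id found above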

For VCS: No special steps are needed for VCS as the monitoring process is automatically invoked as needed.

For MC/ServiceGuard: Create a monitoring script file for use by MC/ServiceGuard. This script can initially be a renamed copy of the /MQHA/bin/hamqm_applmon_su script supplied with this SupportPac, which checks the health of a named queue manager using the PING QMGR command. You may wish to add extra features to your copy of this script to check that other processes essential to your environment are functioning correctly.

If you wish to have one common script for all queue managers under MC/ServiceGuard control, then they can all use the same copy of the script. However, if you wish the monitoring of different queue managers to check different things, it is suggested that you call your script $qmgr.mon, after the queue manager it is used for. The monitoring script should ideally be run regularly, but each check should be a quick, short-lived operation, to avoid slowing down WMQ and the node it runs on.


The monitoring script is then called from the package control script for the queue manager (/etc/cmcluster/mq1/mq1.cntl). This is achieved by adding the following lines to the mq1.cntl file created during MC/ServiceGuard configuration:

SERVICE_NAME[n]=mq1
SERVICE_COMMAND[n]="su mqm -c \"/etc/cmcluster/mq1/mq1.mon QMA\""

where:
  mq1.mon is a renamed copy of the hamqm_applmon_su script
  mq1 is the name of the package being monitored
  QMA is the name of the queue manager to monitor
  n is the number of the service being monitored.
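What such a monitoring script might contain is sketched below. The sketch assumes that the service command is expected to keep running and to exit only when a check fails, with each individual check kept short; the polling interval is illustrative, and the shipped hamqm_applmon_su script should be consulted for the exact convention expected by MC/ServiceGuard.

#!/bin/sh
# Illustrative monitoring script sketch (e.g. /etc/cmcluster/mq1/mq1.mon).
# Invoked as the mqm user via SERVICE_COMMAND with the queue manager name
# as its parameter.
QMGR=$1

while :
do
    # Quick health check using PING QMGR, as in the supplied script
    if echo "ping qmgr" | runmqsc "$QMGR" > /dev/null 2>&1
    then
        sleep 30        # healthy - wait before the next check (interval illustrative)
    else
        exit 1          # check failed - let MC/ServiceGuard take action
    fi
done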

Step 7. Removal of Queue Manager from the Cluster

Should you decide to remove the queue manager from the cluster, it is sufficient to remove the application server (and application monitor, if configured) from the HA configuration. You may also decide to delete the resource group. This does not destroy the queue manager, which will continue to function normally, but under manual control.

Once the queue manager has been removed from the HA configuration, it will not be highly available, and will remain on one node. Other nodes will still remember the queue manager and you may wish to tidy up the other nodes. Refer to Step 8 for details.

Similar considerations apply to the IP address used by the queue manager. Removal is easiest if the mapping of queue managers to disk groups, network interfaces and service groups is simple, with each disk group and service group used exclusively by one queue manager.

For HACMP:

1. Delete the application monitor, if configured.

2. Delete the application server

3. Remove the filesystem, service label and volume group resources from the resource group.

4. Synchronise the cluster resources configuration.

For VCS:

1. Stop the resource, by taking it offline.

2. Delete the resource from the cluster by using the VCS GUI, VCS ha* commands or by editing the main.cf configuration file.

3. If you wish to keep the queue manager, either destroy the service group (if you have no further use for it) or modify it by removing the disk group and public network interface used by the queue manager, provided they are not also used by any other services or applications.

For MC/ServiceGuard:

1. Delete the application server.

2. Delete the monitoring script.


3. Remove the filesystem, service label and volume group resources from the package.

Step 8. Deletion of Queue Manager

If you decide to delete the queue manager, then you should first remove it from the cluster configuration, as described in the previous step. Then, to delete the queue manager, perform the following actions.

1. Make sure the queue manager is stopped, by issuing the endmqm command.

2. On the node which currently has the queue manager’s shared disks and has the queue manager’s filesystems mounted, run the hadltmqm script provided in the SupportPac.

3. You can now destroy the filesystems /MQHA/<qmgr>/data and /MQHA/<qmgr>/log.

4. You can also destroy the volume group.

5. On each of the other nodes in the cluster,

a. Run the hadltmqm command as above, which will clean up the subdirectories related to the queue manager.

b. Manually remove the queue manager stanza from the /var/mqm/mqs.ini file.

The queue manager has now been completely removed from the cluster and the nodes.
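For example, for a queue manager named ha.csq1 (the name is illustrative, and the filesystem and volume group commands are platform specific so are only indicated here), the sequence might be:

# On the node that currently has the shared disks and filesystems mounted:
endmqm -i ha.csq1
hadltmqm ha.csq1
# ...then destroy the /MQHA/ha.csq1/data and /MQHA/ha.csq1/log filesystems
# and, if no longer required, the volume group.

# On each of the other nodes in the cluster:
hadltmqm ha.csq1
# ...then edit /var/mqm/mqs.ini and remove the QueueManager stanza for ha.csq1.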


Upgrading WMQ software in a cluster

Applying maintenance

All nodes in a cluster should normally be running exactly the same version/release of the WMQ software. Sometimes however it is necessary to apply updates, such as for service fixes or for new versions of the product. This is best done by means of a “rolling upgrade”.

The principle of a rolling upgrade is to apply the new software to each node in turn, while continuing the WMQ service on other nodes. Assuming a two-node active/active cluster, the steps are:

1. Select one node to upgrade first

2. At a suitable time, when the moving of a queue manager will not cause a serious disruption to service, manually force a migration of the active queue manager to its partner node

3. On the node that is now running both queue managers, disable the failover capabilities for the queue managers.

4. Upgrade the software on the node that is not running any queue managers

5. Re-enable failover, and move both queue managers across to the newly upgraded node

6. Disable failover again

7. Upgrade the original box

8. Re-enable failover

9. When it will cause least disruption, move one of the queue managers across to balance the workload

It should be obvious how to modify these steps for standby configurations and for “N+1” configurations. The overriding rule that has to be observed is that once a queue manager has been running on a node with a new level of software, it must not be transferred to a node running old software.

While there are times in this process when no failover is permitted, and service could therefore be lost if a failure occurred, these periods are comparatively short, lasting only while the newer software is installed. More complex HA configurations can minimise these windows.


Commands

This section gives details of the configuration commands used to set up and monitor queue managers.

hacrtmqm

Purpose

The hacrtmqm command creates the queue manager and ensures that its directories are arranged to allow for HA operation.

This command makes use of two environment variables to determine where the data and log directories should be created.

export MQHAFSDATA="/MQHA/<qmgr>/data"
export MQHAFSLOG="/MQHA/<qmgr>/log"

The invocation of the hacrtmqm command uses exactly the same parameters that you would normally use for crtmqm. You do not need to set MQSPREFIX or specify the -ld parameter for the log directory as these are both handled automatically by hacrtmqm.

Note: You must be root to run the hacrtmqm command.

Syntax

hacrtmqm <crtmqm parameters>

Parameters

• crtmqm parameters are exactly the same as for the regular WMQ crtmqm command

Example

# export MQHAFSDATA="/MQHA/ha.csq1/data"
# export MQHAFSLOG="/MQHA/ha.csq1/log"
# hacrtmqm -c "Highly available queue manager" ha.csq1


halinkmqm

Purpose

Internally, hacrtmqm uses a script called halinkmqm to relink the subdirectories used for IPC keys and create a symbolic link from /var/mqm/qmgrs/<qmgr> to the /MQHA/<qmgr>/data/qmgrs/<qmgr> directory.

As indicated at the end of the hacrtmqm output, you must run halinkmqm on the remaining cluster nodes which will act as standby nodes for this queue manager. Do not run halinkmqm on the node on which you created the queue manager with hacrtmqm - it has already been run there.

The halinkmqm command creates the necessary links and inserts a stanza for the queue manager into the mqs.ini file.

For HACMP, the halinkmqm command is also responsible for creating an HACMP/ES Application Monitor on the standby/takeover nodes.

Note 1: You must be the mqm user or in the mqm group to run this command.

Note 2: The “mangled” queue manager directory might include a “!” character. With some shells, in particular if you are using bash, then this is considered a special character (like the “*” and “?” characters are) and might be expanded to unwanted values. To avoid this, make sure the parameter is enclosed in quotes on the command line as this will inhibit shell expansion.

Syntax

halinkmqm <qmgr name> <mangled qmgr directory> <qmgr data directory>

Parameters

• qmgr name - The name of the queue manager as you specified it to hacrtmqm (e.g. ha.csq1)

• mangled qmgr directory - The name of the directory under /var/mqm/qmgrs/ which closely resembles the qmgr name.

• qmgr data directory - The directory you selected for the queue manager data, and to which you set MQHAFSDATA before issuing hacrtmqm. (e.g. /MQHA/ha.csq1/data).

Example

$ halinkmqm ha.csq1 ha!csq1 /MQHA/ha.csq1/data


hadltmqm

Purpose

The hadltmqm command deletes a queue manager. It destroys the queue manager's log files and control files and, on the owning node only, removes the definition of the queue manager from the /var/mqm/mqs.ini file. This is similar to the behaviour of the dltmqm command, which the hadltmqm command uses internally. The hadltmqm command also deletes the symbolic links used to locate the IPC subdirectories, and the subdirectories themselves.

Note: You must be the mqm user or in the mqm group to run this command.

Syntax

hadltmqm <qmgr name>

Parameters

• qmgr name - the name of the queue manager to be deleted
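For example, using the queue manager name from the earlier examples:

$ hadltmqm ha.csq1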


hamqm_start

Purpose

This script starts the queue manager. It is robust in that it does not assume anything about the state of the queue manager on entry, and will forcefully kill any existing processes that might be associated with the queue manager, to ensure a clean start.

One error that can occur on restart of a queue manager (especially if it is an attempt to restart on the same node, instead of failing over to a different node) is that previously-connected application programs may not have disconnected. If that happens, the queue manager will not restart, will return error 24, and will show message AMQ8041. It might be appropriate for you to modify the supplied scripts to parse the output of this message and kill the offending application programs. This is not part of the default scripts, as it could be considered disruptive.
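A hedged sketch of such a modification is shown below. It assumes that the offending application processes can be identified because the queue manager name appears on their command line; the detection and process selection are illustrative only and are not part of the supplied scripts.

# Illustrative fragment only - not part of the supplied hamqm_start script.
# Detect the AMQ8041 condition and report the processes that appear to be
# still attached.  Whether to kill them automatically is a local decision.
QMGR=$1
START_OUTPUT=`su mqm -c "strmqm $QMGR" 2>&1`
if [ $? -ne 0 ] && echo "$START_OUTPUT" | grep AMQ8041 > /dev/null
then
    echo "Queue manager $QMGR blocked by connected applications:"
    # Assumption: connected applications show the queue manager name on
    # their command line.  Exclude this script's own process id.
    ps -ef | grep "$QMGR" | grep -v grep | awk -v me=$$ '$2 != me'
    # kill -9 <pid>   # extend here if automatic cleanup is wanted
fi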

Syntax

/MQHA/bin/hamqm_start <qmgr>

Parameters

• qmgr - the name of the queue manager to be started

Example

/MQHA/bin/hamqm_start ha.csq1


hamqm_stop

Purpose

The stop script attempts to shut down the queue manager gracefully. When invoked, it issues an immediate stop of the queue manager (endmqm -i) and allows a defined time for this to complete. If it does not complete within that time, a pre-emptive stop (endmqm -p) is issued and a further delay is allowed. If the queue manager still hasn't stopped within that time, then this command terminates the queue manager forcefully. It is clearly better if the queue manager can be shut down gracefully, because a clean shutdown leads to a faster restart and is less disruptive to clients and applications.

Syntax

/MQHA/bin/hamqm_stop <qmgr> <timeout>

Parameters

• qmgr - the name of the queue manager to be stopped

• timeout - the time in seconds to use on each of the levels of severity of stop

Example

/MQHA/bin/hamqm_stop ha.csq1 30

When the stop script is called, part of the processing is to forcefully kill all of the processes associated with the queue manager if they do not stop properly. In previous versions of the HA SupportPacs, the list of processes was hardcoded in the stop or restart scripts. For this version, the list of processes is in an external file called hamqproc. As shipped, the list includes all previous and current known internal processes from WMQ. The order in which the processes are killed is not especially important: if the queue manager has not ended cleanly, there will probably be FDC files created regardless of the sequence of the kill operations.

The list does not include external commands such as user applications or trigger monitors, but you might choose to add such commands to the file. Processing of this file requires that the queue manager name is one of the visible parameters to any additional commands.
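For example, to include a trigger monitor (runmqtrm) and a hypothetical user application called myqmgrapp, entries such as the following might be added, assuming the file lists one process name per line; check the shipped hamqproc file for its actual layout and conventions before editing it:

runmqtrm
myqmgrapp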


/MQHA/bin/rc.local

Purpose

For HACMP and MC/ServiceGuard, if you want to make use of the user exit capability, create a script called /MQHA/bin/rc.local. The script can contain anything you like. Ensure that the rc.local script is executable. The start and stop scripts will invoke the rc.local script as user "mqm". It is recommended that you test the exit by invoking it manually, before putting it under cluster control.

Syntax

/MQHA/bin/rc.local <qmgr> <phase>

Parameters

• qmgr - the name of the queue manager which this invocation relates to

• phase – either of the strings "pre_offline" or "post_online", to indicate what is about to happen or what has just happened.

The example start and stop scripts are written such that this script is invoked asynchronously (i.e. in the background). This is a conservative policy that aims to reduce the likelihood that an errant script could delay, possibly indefinitely, other cluster operations which the node needs to perform. The asynchronous invocation policy does have the disadvantage that the exit script cannot make any assumptions about the state of the queue manager, since it may change immediately after the script is invoked. The script should therefore be written in a robust style.
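A minimal sketch of a robust rc.local is shown below, assuming the exit simply records the phase transitions; the log file location is illustrative.

#!/bin/sh
# Illustrative user exit - invoked as user mqm with: rc.local <qmgr> <phase>
QMGR=$1
PHASE=$2

# Runs asynchronously, so make no assumptions about queue manager state.
case "$PHASE" in
post_online)
    echo "`date` $QMGR brought online on `hostname`" >> /tmp/mqha_exit.log
    ;;
pre_offline)
    echo "`date` $QMGR about to be taken offline on `hostname`" >> /tmp/mqha_exit.log
    ;;
*)
    ;;
esac
exit 0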


Working with other HA products

There are many other HA products that could be used to control the operation of a WebSphere MQ system on UNIX and Linux systems. They all appear to operate in essentially the same way, with methods to start, stop and monitor arbitrary resources. The scripts in this SupportPac have been used as the basis of similar scripts for some other HA products.

Creation and deletion of queue managers are likely to be identical regardless of the HA product. You can take the hacrtmqm, halinkmqm and hadltmqm scripts and probably use them unchanged, for the platform you are using.

The scripts provided in this SupportPac are separated into directories for the specific pairing of HA product and operating system on which they were originally implemented. But for other combinations, you should select the appropriate mix. For example, if you want to use VCS on AIX, you should use the scripts from the hacmp directory for creating, relinking, and deleting the queue manager. For Linux, the scripts that are used for MC/ServiceGuard are probably the best starting point – this is because WMQ for Linux systems continues to use the “ssem” subdirectories which are no longer used on AIX and Solaris.

More differences occur in the runtime aspects of HA, but this tends to simply be a matter of the “style”. For example, does the HA product issue the periodic monitoring calls, or does the “WMQ agent” have to do it itself; what is the natural language for the scripts to be written in; which return codes indicate errors; how do you issue text messages for error conditions; how are dependencies configured? When you can answer these questions, you should be able to take one or other of these sets of scripts and modify them to fit.

Related products

There are SupportPacs available for configuring WebSphere Message Broker in Highly Available configurations. They follow the same model as this one, and recommend that the broker’s underlying queue manager is put into an HA solution using this SupportPac.


Suggested Test

You may find that the following tests are helpful when determining whether your configuration is working as intended.

Create a queue manager, e.g. QM1, and put it under HA control, as defined in the preceding chapters.

Start the QM1 queue manager and use runmqsc to define the following objects:

**************************************************************
*                                                            *
*  Create the queues for QMgr (QM1) clustered QMgr          *
*                                                            *
**************************************************************
**************************************************************
*  Define the outbound xmit queue                            *
**************************************************************
DEFINE QLOCAL(XMITQ1) +
       USAGE(XMITQ) +
       DEFPSIST(YES) +
       TRIGGER +
       INITQ(SYSTEM.CHANNEL.INITQ) +
       REPLACE
**************************************************************
*  Define the inbound/outbound message queue                 *
**************************************************************
DEFINE QREMOTE(QM1_INBOUND_Q) +
       RNAME(QM2_INBOUND_Q) +
       RQMNAME(QM2) +
       XMITQ(XMITQ1) +
       DEFPSIST(YES) +
       REPLACE
**************************************************************
*  Define the channels between QM1 <-> QM2                   *
**************************************************************
* Channel 1
DEFINE CHANNEL(QM2_SDR.TO.QM1_RCV) +
       CHLTYPE(RCVR) +
       TRPTYPE(TCP) +
       HBINT(30) +
       REPLACE
* Channel 2
DEFINE CHANNEL(QM1_SDR.TO.QM2_RCV) +
       CHLTYPE(SDR) +
       TRPTYPE(TCP) +
       CONNAME(QM2_ip_address) +
       XMITQ(XMITQ1) +
       HBINT(30) +
       REPLACE


Create another “out of cluster” queue manager, e.g. QM2, start it and create the following objects:

**************************************************************
*                                                            *
*  Create the queues for "out-of-cluster" QMgr QM2           *
*                                                            *
**************************************************************
**************************************************************
*  Define the inbound message queue                          *
**************************************************************
DEFINE QLOCAL(QM2_INBOUND_Q) +
       DEFPSIST(YES) +
       REPLACE
**************************************************************
*  Define the outbound xmit queue                            *
**************************************************************
DEFINE QLOCAL(XMITQ1) +
       USAGE(XMITQ) +
       DEFPSIST(YES) +
       INITQ(SYSTEM.CHANNEL.INITQ) +
       REPLACE
**************************************************************
*  Define the outbound message queue                         *
**************************************************************
DEFINE QREMOTE(QM2_OUTBOUND_Q) +
       RNAME(QM1_INBOUND_Q) +
       RQMNAME(QM1) +
       XMITQ(XMITQ1) +
       DEFPSIST(YES) +
       REPLACE
**************************************************************
*  Define the channels between QM2 <-> QM1                   *
**************************************************************
* Channel 1
DEFINE CHANNEL(QM2_SDR.TO.QM1_RCV) +
       CHLTYPE(SDR) +
       TRPTYPE(TCP) +
       CONNAME(QM1_ip_address) +
       XMITQ(XMITQ1) +
       HBINT(30) +
       REPLACE
* Channel 2
DEFINE CHANNEL(QM1_SDR.TO.QM2_RCV) +
       CHLTYPE(RCVR) +
       TRPTYPE(TCP) +
       HBINT(30) +
       REPLACE


On the node running QM2, run a script similar to the following, which will prime the transmission queue with a number of persistent messages.

#!/bin/sh
# $1 controls size of message buffer
# Actual amount sent is between $1 and 2*$1 KBytes
#
# e.g. QM2_put 10
#
rm -f message_buffer
# Construct the set of messages to send
SIZE=`expr $1 \* 1000`
MSG=0
date >> message_buffer
while [ $MSG -le $SIZE ]
do
  # double the previous buffer
  cp message_buffer .previous
  cat .previous .previous > message_buffer
  MSG=`ls -l message_buffer | awk '{ print $5 }'`
done
echo "Putting $MSG Bytes onto outbound queue"
cat message_buffer | /usr/lpp/mqm/samp/bin/amqsput QM2_OUTBOUND_Q QM2

Each line of text in the message_buffer file will be sent as a WMQ message. Run initially with a small number of messages. The script will prime the transmission queue. When the transmission queue is primed, use runmqsc to start the QM2_SDR.TO.QM1_RCV channel, and the messages will be transmitted to QM1 and then routed via QM1's transmission queue back to QM2's inbound queue.

When you are happy that your definitions are correct and you get end-to-end transmission of messages, run the script again with a much larger number of messages to be sent. When the transmission queue is primed, note how many messages it contains.

Once again start the sender channel, but after only a few seconds of transmitting messages reset the node on which QM1 is running so that it fails over to another cluster node. The QM1 queue manager should restart on the takeover node and the channels should be restarted. All the messages should be received at QM2’s inbound queue. You could use runmqsc to inspect the queue, or run the amqsget sample to retrieve the messages into a file.
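For example, the messages could be retrieved into a file with the amqsget sample, using the sample path shown in the priming script above; the output file name is illustrative:

/usr/lpp/mqm/samp/bin/amqsget QM2_INBOUND_Q QM2 > received_messages.txt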


Appendix A. Sample Configuration Files for VCS

types.cf

The MQM resource type can be created by adding the following resource type definition to the types.cf file. If you add the resource type in this way, make sure you have stopped the cluster and use hacf -verify to check that the modified file is correct.

# Append the following type definition to types.cf
type MQM (
        static int OfflineTimeout = 60
        static str OnlineWaitLimit
        static str LogLevel = error
        static str ArgList[] = { QMName, UserID }
        NameRule = resource.QMName
        str QMName
        str UserID = mqm
)
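For example, with the cluster stopped, the edited configuration can be checked with hacf -verify against the configuration directory used for main.cf later in this appendix:

# hacf -verify /etc/VRTSvcs/conf/config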

As well as creating the MQM resource type, this also sets the values of the following resource type attributes:

• OfflineTimeout

The VCS default of 300 seconds is quite long for a queue manager, so the suggested value for this attribute is 60 seconds. You can adjust this attribute to suit your own configuration, but it is recommended that you do not set it any shorter than approximately 15 seconds.

• OnlineWaitLimit

It is recommended that you configure the OnlineWaitLimit for the MQM resource type. The default setting is 2, but to accelerate detection of start failures, this attribute should be set to 0.

• LogLevel

It is recommended that you run the MQM agent with LogLevel set to ‘error’. This will display any serious error conditions (in the VCS log). If you want more detail of what the MQM agent is doing, then you can increase the LogLevel to ‘debug’ or ‘all’, but this will produce far more messages and is not recommended for regular operation.
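If the cluster is already running, these attributes can alternatively be adjusted with the VCS hatype command, for example as shown below; the haconf commands make the running configuration writable and then save it again, and the values shown are the recommendations above.

haconf -makerw
hatype -modify MQM OfflineTimeout 60
hatype -modify MQM OnlineWaitLimit 0
hatype -modify MQM LogLevel error
haconf -dump -makero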

main.cf

A resource of type MQM can be defined by adding a resource entry to the /etc/VRTSvcs/conf/config/main.cf file. The following is a complete main.cf for a simple cluster (called Kona) with two nodes (sunph1, sunph2) and one service group (vxg1), which includes resources for one queue manager (VXQM1) with an IP address (resource name vxip1), filesystems managed by Mount resources (vxmnt1, vxmnt2) and a DiskGroup (resource name vxdg1).



include "types.cf" cluster Kona ( UserNames = { admin = "cDRpdxPmHpzS." } CounterInterval = 5 Factor = { runque = 5, memory = 1, disk = 10, cpu = 25, network = 5 } MaxFactor = { runque = 100, memory = 10, disk = 100, cpu = 100, network = 100 } ) system sunph1 system sunph2 snmp vcs ( TrapList = { 1 = "A new system has joined the VCS Cluster", 2 = "An existing system has changed its state", 3 = "A service group has changed its state", 4 = "One or more heartbeat links has gone down", 5 = "An HA service has done a manual restart", 6 = "An HA service has been manually idled", 7 = "An HA service has been successfully started" } ) group vxg1 ( SystemList = { sunph1, sunph2 } ) DiskGroup vxdg1 ( DiskGroup = vxdg1 ) IP vxip1 ( Device = hme0 Address = "9.20.110.247" ) MQM VXQM1 ( QMName = VXQM1 ) Mount vxmnt1 ( MountPoint = "/MQHA/VXQM1/data" BlockDevice = "/dev/vx/dsk/vxdg1/vxvol1" FSType = vxfs )


        Mount vxmnt2 (
                MountPoint = "/MQHA/VXQM1/log"
                BlockDevice = "/dev/vx/dsk/vxdg1/vxvol2"
                FSType = vxfs
                )

        NIC vxnic1 (
                Device = hme0
                NetworkType = ether
                )

        VXQM1 requires vxip1
        VXQM1 requires vxmnt1
        VXQM1 requires vxmnt2
        vxip1 requires vxnic1
        vxmnt1 requires vxdg1
        vxmnt2 requires vxdg1

        // resource dependency tree
        //
        // group vxg1
        // {
        //     MQM VXQM1
        //     {
        //         IP vxip1
        //         {
        //             NIC vxnic1
        //         }
        //         Mount vxmnt1
        //         {
        //             DiskGroup vxdg1
        //         }
        //         Mount vxmnt2
        //         {
        //             DiskGroup vxdg1
        //         }
        //     }
        // }


Appendix B. Messages produced by MQM agent for VCS

The following is a numerically sorted list of the messages that are produced by the MQM agent. These messages should appear in the VCS cluster log. A message is only produced by the MQM agent if the current setting of LogLevel is at least as high as the level specified for that message.

The explanation of any message can be viewed on the cluster systems, by running:

/opt/VRTSvcs/bin/MQM/explain <msgid>

Message id 3005001 Severity trace

Text qmname <qmname>; userid <userid>; loglevel <loglevel>

Explanation: This message is output to the VCS log whenever an MQM agent method is invoked. It records the parameters passed to the method.

This message has trace severity, so will only appear if LogLevel='all'

Message id 3005002 Severity trace

Text completed without error

Explanation: This message indicates that an MQM agent method completed and that no errors were encountered.

This message has trace severity, so will only appear if LogLevel='all'

Message id 3005003 Severity trace

Text Queue Manager <qmname> is responsive

Explanation: The queue manager has been tested by issuing a ping command, and it responded to it. This is taken as an indication that the queue manager is running correctly.

This message has trace severity, so will only appear if LogLevel='all'

Message id 3005004 Severity trace

Text Queue Manager <qmname> is starting

Explanation: The queue manager did not respond to a ping test, but it appears to be still starting up. This is not viewed as an error condition. If startup is taking a long time then it is possible that there is a lot of log replay to perform.

This message has trace severity, so will only appear if LogLevel='all'

Message id 3005005 Severity trace


Text Queue Manager <qmname> not responding (ping=<result>)

Explanation: The queue manager did not respond to a ping test and is not in the process of starting up. If the queue manager resource is currently supposed to be online then this is an error condition which will be handled by VCS either restarting the service group or performing a failover.

This message has trace severity, so will only appear if LogLevel='all'

Message id 3005006 Severity error

Text Problem with loglevels!!

Explanation: This message indicates a problem with either the MQM agent code or the VCS cluster software. It is generated when the current setting of the LogLevel attribute is not one of the values in the set documented in the VCS Agent Developer's Guide. The MQM agent will only produce log messages in accordance with the setting of LogLevel. An apparently invalid setting will suppress the logging of messages.

This message has error severity and should be investigated/reported.

Message id 3005007 Severity error

Text Could not locate queue manager <qmname>

Explanation: The queue manager name supplied to the method as parameter <qmname> does not identify a known queue manager. Please check the cluster configuration and retry.

Message id 3005008 Severity debug

Text <qmname> not running normally, will be terminated

Explanation: The queue manager identified by <qmname> is the subject of an offline operation. The queue manager was found to be not in the normal online state and the offline method will ensure that the queue manager is fully stopped.

This message has debug severity and should only appear if LogLevel is set to 'debug' or 'all'.

Message id 3005009 Severity debug

Text <qmname> claims to be running, take offline

Explanation: The queue manager identified by <qmname> is the subject of an offline operation. The status of the queue manager has been checked and was found to be a normal online state. The offline method will attempt to perform a graceful shutdown of the queue manager.

This message has debug severity and should only appear if LogLevel is set to 'debug' or 'all'.


Message id 3005010 Severity error

Text Invalid value for OfflineTimeout

Explanation: This message indicates that the current value of attribute OfflineTimeout is not set to a valid value. Any value greater than 0 is valid. The MQM agent offline method exits immediately and the clean method will terminate the queue manager.

This message has error severity and should be investigated, using hatype.

Message id 3005011 Severity debug

Text attempting <severity> stop of <qmname>

Explanation: As a result of an offline operation, the MQM agent is about to issue an end command of the specified severity for the queue manager identified by <qmname>.

This message has debug severity and should only appear if LogLevel is set to 'debug' or 'all'.

Message id 3005012 Severity error

Text Could not validate userid, <userid>

Explanation: The userid supplied to the method as parameter <userid> does not identify a known user. Please check the cluster configuration and retry.

Message id 3005013 Severity error

Text Could not fork a process

Explanation: The MQM agent tried to fork a process, but was unable to. This is indicative of a serious problem with the system and the operation was abandoned.

This message has error severity and should be investigated/reported.

Message id 3005014 Severity trace

Text <qmname> online method scheduling monitor in <wait_time> seconds

Explanation: This message is for information only and indicates that the online method for queue manager <qmname> has requested that VCS schedule the first monitor of the queue manager to start in <wait_time> seconds.

This message has debug severity and should only appear if LogLevel is set to 'debug' or 'all'.

Message id 3005015 Severity error


Text Could not run hatype

Explanation: The MQM agent tried to use hatype to read an attribute value, but was unable to. The operation had to be abandoned.

This message has error severity and should be investigated/reported.

Message id 3005016 Severity debug

Text waiting for <severity> stop of <qmname> to complete

Explanation: The MQM agent is waiting for the queue manager identified by <qmname> to stop, as a result of an offline operation.

This message has debug severity and should only appear if LogLevel is set to 'debug' or 'all'.

Message id 3005017 Severity debug

Text <qmname> is still running...

Explanation: This message is for information only. The queue manager identified by <qmname> is the subject of an offline operation and the MQM agent is currently waiting for the queue manager to stop. If the queue manager fails to stop within the time allowed by the agent, the agent will use a more forceful stop to ensure that the queue manager is fully stopped within OfflineTimeout.

This message has debug severity and should only appear if LogLevel is set to 'debug' or 'all'.

Message id 3005018 Severity debug

Text <qmname> is stopping

Explanation: The queue manager identified by <qmname> is currently stopping as a result of an offline operation. This is for information only.

This message has debug severity and should only appear if LogLevel is set to 'debug' or 'all'.

Message id 3005019 Severity debug

Text <qmname> has stopped

Explanation: The queue manager identified by <qmname> has now stopped as a result of an offline operation. This is for information only.

This message has debug severity and should only appear if LogLevel is set to 'debug' or 'all'.

Message id 3005020 Severity error


Text ended with errors

Explanation: This message indicates that the method which reported it encountered and detected a serious error condition and did not complete successfully. Preceding messages should describe the nature of the error.

This message has error severity and should be investigated/reported.

Message id 3005021 Severity debug

Text strmqm for <qmname> completed

Explanation: This message is notification that a start command was issued for the queue manager with the name <qmname> and that it completed successfully.

This message has debug severity and should only appear if LogLevel is set to 'debug' or 'all'.

Message id 3005022 Severity error

Text <qmname> could not be started (rc=<rc>)

Explanation: An attempt to start the queue manager did not succeed. If <rc> is 16 then check that the queue manager exists. If <rc> is 25 then check that the directory structure for the queue manager exists and is complete - this error could occur if the queue manager has been deleted but still has a /var/mqm/mqs.ini entry on one of the cluster systems. It could also indicate a problem with the content of the service group. Check that the queue manager's filesystems are being mounted and that the MQM resource has a resource dependency on the relevant Mount resources. For all values of <rc> check the MQ error logs for details of why the start failed.

This message has error severity and should be investigated/reported.

Message id 3005023 Severity error

Text Could not open file <filename>

Explanation: The file named <filename> could not be opened. The operation was abandoned. Please check that the file exists and is readable.

This message has error severity and should be investigated/reported.

Message id 3005024 Severity error

Text Could not list running processes

Explanation: An attempt to list the processes which are currently running did not succeed. This is indicative of a serious problem with the system and the operation was abandoned.

This message has error severity and should be investigated/reported.


Message id 3005025 Severity error

Text Could not find directory for <qmname>

Explanation: An attempt to locate the queue manager directory for the queue manager named <qmname> did not succeed. Please check the directory structure for the queue manager.

This message has error severity and should be investigated/reported.

