HACMP
HACMP configurations
Cluster Configurations :
A standby configuration is the most basic cluster configuration
in which one node performs work whilst the other node acts only as standby.
The standby node does not perform work and is referred to as idle;
this configuration is sometimes called "cold standby".
A takeover configuration is a more advanced configuration
in which all nodes perform some kind of work
and critical work can be taken over in the event of a node failure.
A "one sided takeover" configuration is one in which
a standby node performs some additional, non-critical and non-movable work.
This is rather like a standby configuration
but with (non critical) work being performed by the standby node.
A "mutual takeover" configuration is one in which
all nodes are performing highly available (movable) work.
This type of cluster configuration is also sometimes referred to as "Active/Active"
to indicate that all nodes are actively processing critical workload.
HACMP, VCS, ServiceGuard , Heartbeat and MSCS all use a "shared nothing" clustering architecture.
A shared nothing cluster has no concurrently shared resources,
and works by transferring ownership of resources from one node to another,
to work around failures or in response to operator commands.
Resources are things like disks, network addresses, or critical processes.
Configuration :
All HA products have the concept of a unit of failover.
This is a set of definitions that contains all the processes and resources
needed to deliver a highly available service
and ideally should contain only those processes and resources.
In HACMP, the unit of failover is called a resource group.
On other HA products the name might be different, but the concept is the same.
On VCS, it is known as a service group,
on MC/ServiceGuard it is a package,
in Heartbeat it is a resource group, and in MSCS it is a group.
The smallest unit of failover for WMQ is a queue manager,
since you cannot move part of a queue manager without moving the whole thing.
It follows that the optimal configuration is to place each queue manager in a separate resource group,
with the resources upon which it depends.
The resource group should therefore contain
the shared disks used by a queue manager,
which should be in a volume group or disk group reserved exclusively for the resource group,
the IP address used to connect to the queue manager (the service address)
and an object which represents the queue manager.
Failover -
Invoking a secondary system to take over when the primary system fails.
HACMP software samples
Examples of HA cluster software :
- IBM PowerHA (HACMP)
- Veritas Cluster Server (VCS)
- Microsoft Cluster Service (MSCS)
- Red Hat Cluster
When not to use HA WebSphere MQ queue manager clusters
HA WebSphere MQ queue manager clusters require
additional proprietary HA hardware (shared disks)
and external HA clustering software (such as HACMP).
This increases the administration costs of the environment
because you also need to administer the HA components.
This approach also increases the initial implementation costs
because extra hardware and software are required.
Therefore, balance these initial costs
with the potential costs incurred if a queue manager fails and messages become trapped.
If trapped messages are not a problem for the applications
(for example, the response time of the application is irrelevant or the data is updated frequently),
then HA WebSphere MQ queue manager clusters are probably not required.
General recommendations
Some of the advice pertinent to an HA environment in general is:
- Each node in the cluster must be sized large enough to support the total load under failure conditions.
Dynamic process allocation under AIX
(shutting down less important processes on the production failover machine)
can help lower hardware costs.
- Optimal resource utilization in an HA environment occurs when
each node in an Active/Active cluster drives similar loads.
- Use HA clustering for high availability only, not for performance or simplified administration.
- The HA solution in the QA environment should exactly mimic the production environment.
This helps to avoid critical production problems and minimizes down time.
- Test individual HA components for online, offline, and failover.
- Attempt to restart the individual HA components
(for example, InterChange Server or Message Broker)
a minimum of three times on the primary node before failing over to the alternate node.
- Automatic fallback is not recommended.
Manual fallback is recommended to minimize system disruption.
- It is recommended that a stock of spare components such as network adapters, cables, etc. be maintained.
Related to this is also a need to define an escalation plan to deal with unexpected failures.
- Be sure to create a list identifying who to call when something HA specific goes wrong,
or the normal contacts are not available.
Highly Available WebSphere Business Integration Solutions, SG24-6328-00, chapter 8.2, page 122.
MC91 - HA for MQ
This SupportPac has now been withdrawn.
The support is now included in the WebSphere MQ V7.0.1 product and documentation.
url MC91 - high availability for MQ on Unix.
Install into /MQHA/bin : the sample scripts assume this path.
This SupportPac provides notes and sample scripts
to assist with the installation and configuration
of WebSphere MQ (WMQ) V6 and V7 in High Availability (HA) environments.
Three different platforms and environments are described here,
but they share a common design
and this design can also be extended for many other systems.
Specifically this SupportPac deals with the following HA products:
- HACMP (High Availability Cluster Multi Processing)
- Veritas Cluster Server (VCS)
- MC/ServiceGuard (MCSG)
MC91 installation :
16/03/2009 20:40 1.310.445 mc91.tar.Z
{mqm - /MQHA/bin/ } $ uncompress mc91.tar.Z
{mqm - /MQHA/bin/ } $ tar -xvf mc91.tar
MC91 configuration :
- [1] Configure the HA Cluster
- [2] Configure the shared disks
- [3] Create the Queue Manager
- [4] Configure the movable resources
- [5] Configure the Application Server or Agent
- [6] Configure a monitor
[1] Configure the HA Cluster
- Configure TCP/IP on the cluster nodes for HACMP. Remember to configure ~root/.rhosts, /etc/rc.net, etc.
- Configure the cluster, cluster nodes and adapters to HACMP as usual.
- Synchronise the Cluster Topology.
[2] Configure the shared disks
This step creates the volume group (or disk group) and filesystems needed for the queue manager.
So that this queue manager can be moved from one node to another
without disrupting any other queue managers,
you should designate a group containing shared disks
which is used exclusively by this queue manager and no others.
For performance, it is recommended that a queue manager uses separate filesystems for logs and data.
The suggested layout therefore creates two filesystems within the volume group.
You can optionally protect each of the filesystems from disk failures by using mirroring or RAID.
Mount points must all be owned by the mqm user.
You will need the following filesystems:
- Per node:
- /var on a local non-shared disk - this is a standard filesystem or directory which will already exist.
- Per queue manager:
- /MQHA/<qmgr>/data on shared disks - this is where the queue manager data directory will reside.
- /MQHA/<qmgr>/log on shared disks - this is where the queue manager recovery logs will reside.
The steps are :
- Create the volume group that will be used for this queue manager's data and log files.
- Create the /MQHA/<qmgr>/data and /MQHA/<qmgr>/log filesystems using the volume group created above.
- For each node in turn, import the volume group, vary it on, ensure that the filesystems can be mounted, unmount the filesystems and varyoff the volume group.
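On AIX the steps above map roughly to the following commands. The volume group name haqmvg, the disk hdisk2 and the filesystem sizes are illustrative examples, not values prescribed by the SupportPac:

```shell
# --- on the first node ---
mkvg -y haqmvg hdisk2                          # create the volume group
crfs -v jfs2 -g haqmvg -m /MQHA/QM1/data -a size=1G
crfs -v jfs2 -g haqmvg -m /MQHA/QM1/log  -a size=1G
mount /MQHA/QM1/data ; mount /MQHA/QM1/log
chown mqm:mqm /MQHA/QM1/data /MQHA/QM1/log     # mount points must belong to mqm
umount /MQHA/QM1/data ; umount /MQHA/QM1/log
varyoffvg haqmvg

# --- on each other node in turn ---
importvg -y haqmvg hdisk2
varyonvg haqmvg
mount /MQHA/QM1/data ; mount /MQHA/QM1/log     # verify the mounts work
umount /MQHA/QM1/data ; umount /MQHA/QM1/log
varyoffvg haqmvg
```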
[3] Create the Queue Manager
- Select a node on which to perform the following actions
- Ensure the queue manager's filesystems are mounted on the selected node.
- Create the queue manager on this node, using the hacrtmqm script
- Start the queue manager manually, using the strmqm command
- Create any queues and channels
- Test the queue manager
- End the queue manager manually, using endmqm
- On the other nodes, which may take over the queue manager, run the halinkmqm script
[4] Configure the movable resources
The resource group will use the IP address as the service label.
This is the address which clients and channels will use to connect to the queue manager.
- Create a resource group and select the type as discussed above.
- Configure the resource group in the usual way adding the service IP label, volume group and filesystem resources to the resource group.
- Synchronise the cluster resources.
- Start HACMP on each cluster node in turn and ensure that the cluster stabilizes, that the respective volume groups are varied on by each node and that the filesystems are mounted correctly.
[5] Configure the Application Server or Agent
The queue manager is represented within the resource group by an application server or agent.
- Define an application server which will start and stop the queue manager.
The start and stop scripts contained in the SupportPac may be used unmodified,
or may be used as a basis from which you can develop customized scripts.
The examples are called hamqm_start and hamqm_stop.
- Add the application server to the resource group definition created in the previous step.
- Optionally, create a user exit in /MQHA/bin/rc.local
- Synchronise the cluster configuration.
- Test that the node can start and stop the queue manager,
by bringing the resource group online and offline.
[6] Configure a monitor
You can configure an application monitor which will monitor the health of the queue manager
and trigger recovery actions as a result of MQ failures, not just node or network failures.
Recovery actions include the ability to perform local restarts of the queue manager
or to cause a failover of the resource group to another node.
To benefit from queue manager monitoring you must define an Application Monitor.
If you created the queue manager using hacrtmqm,
then one of these will have been created for you,
in the /MQHA/bin directory,
and is called hamqm_applmon.$qmgr.
- To enable queue manager monitoring,
define a custom application monitor for the Application Server created in previous step,
providing the name of the monitor script and telling HACMP how frequently to invoke it.
Set the stabilisation interval to 10 seconds,
unless your queue manager is expected to take a long time to restart.
This would normally be if your environment has long-running transactions
that might cause a substantial amount of recovery/replay to be required.
- To configure for local restarts, specify the Restart Count and Restart Interval.
- Synchronise the cluster resources.
- Test the operation of the application monitoring,
and in particular verify that the local restart capability is working as configured.
A convenient way to provoke queue manager failures is
to identify the Execution Controller process (called amqzxma0) associated with the queue manager, and kill it.
Conclusion : the files we have to copy into /hacmp/, and adapt for our system, are :
- hamqm_start
- hamqm_stop
- hamqm_applmon.$qmgr
Then, using smitty, we have to
- define an application server
- define a custom application monitor
IC91 - HA for MB
url IC91 - high availability for MB on distributed platforms.
Install into /MQHA/bin : the sample scripts assume this path.
A broker runs as a pair of processes, called bipservice and bipbroker.
The latter in turn creates the execution groups that run message flows.
It is this collection of processes that is managed by the HA software.
When creating the queue manager,
don't configure the application server or application monitor described in SupportPac MC91.
You will create an application server that covers the broker, queue manager and broker database instance.
When creating channels between queue managers,
the sender channel should use the service address of the broker resource group
and the broker queue manager's port number.
- General Configuration
- [0] Configure the cluster
- Configure the shared disks
- Configuration steps for Broker - UNIX
- [1] Create and configure the queue manager. Create the resource group.
- [2] Create and configure the broker database
- [3] Create the message broker
- [4] Place the broker under cluster control
[0] Configure the HA Cluster
- Configure TCP/IP on the cluster nodes as described in your cluster software documentation.
- Configure the cluster, cluster nodes and adapters to HA Software as usual.
- Synchronise the Cluster Topology.
- Now would be a good time to create and configure the user accounts
that will be used to run the database instances, brokers and UNS.
Home directories, (numeric) user ids, passwords, profiles and group memberships
should be the same on all cluster nodes.
[1] Create and configure the queue manager
- On one node, create a clustered queue manager as described in SupportPac MC91, using the hacrtmqm command.
Use the volume group that you created for the broker and place the volume group and queue manager
into a resource group to which the broker will be added.
Don't configure the application server or application monitor described in SupportPac MC91 -
you will create an application server that covers the broker, queue manager and broker database instance.
- Set up queues and channels between the broker queue manager and the Configuration Manager queue manager:
- On the Configuration Manager queue manager create a transmission queue for communication to the broker queue manager. Ensure that the queue is given the same name and case as the broker queue manager. The transmission queue should be set to trigger the sender channel.
- On the Configuration Manager queue manager create a sender and receiver channel for communication with the broker queue manager. The sender channel should use the service address of the broker resource group and the broker queue manager's port number.
- On the broker queue manager create a transmission queue for communication to the Configuration Manager queue manager. Ensure that the queue is given the same name and case as the Configuration Manager queue manager. The transmission queue should be set to trigger the sender channel.
- On the broker queue manager create sender and receiver channels to match those just created on the Configuration Manager queue manager. The sender channel should use the IP address of the machine where the Configuration Manager queue manager runs, and the corresponding listener port number.
- If you are using a UNS, set up queues and channels between the broker queue manager and the UNS queue manager:
- On the broker queue manager create a transmission queue for communication to the UNS queue manager. Ensure that the queue is given the same name and case as the UNS queue manager. The transmission queue should be set to trigger the sender channel.
- On the broker queue manager create a sender and receiver channel for communication with the UNS queue manager. If the UNS is clustered, the sender channel should use the service address of the UNS resource group and the UNS queue manager's port number.
- On the UNS queue manager create a transmission queue for communication to the broker queue manager. Ensure that the queue is given the same name and case as the broker queue manager. The transmission queue should be set to trigger the sender channel.
- On the UNS queue manager create a sender and receiver channel for communication with the broker queue manager, with the same names as the receiver and sender channel just created on the broker queue manager. The sender channel should use the service address of the broker resource group and the broker queue manager's port number.
- Test that the above queue managers can communicate regardless of which node owns the resource groups they belong to.
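The channel setup above can be sketched in MQSC. The queue manager names (CFGMGR.QM, BROKER.QM), the channel names and the port 1414 are placeholder examples, not taken from the SupportPac:

```shell
# Run against the Configuration Manager queue manager (example names throughout)
runmqsc CFGMGR.QM <<'EOF'
* Transmission queue, named exactly after the broker queue manager,
* set to trigger the sender channel
DEFINE QLOCAL('BROKER.QM') USAGE(XMITQ) TRIGGER TRIGTYPE(FIRST) +
       INITQ('SYSTEM.CHANNEL.INITQ') TRIGDATA('CFGMGR.TO.BROKER') REPLACE
* Sender channel pointing at the SERVICE address of the broker resource group
DEFINE CHANNEL('CFGMGR.TO.BROKER') CHLTYPE(SDR) TRPTYPE(TCP) +
       CONNAME('broker-service-addr(1414)') XMITQ('BROKER.QM') REPLACE
* Matching receiver channel for the opposite direction
DEFINE CHANNEL('BROKER.TO.CFGMGR') CHLTYPE(RCVR) TRPTYPE(TCP) REPLACE
EOF
```

The mirror-image definitions on the broker queue manager use the fixed IP address of the Configuration Manager machine in the sender's CONNAME, as described above.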
[2] Create and configure the broker database
There are two options regarding where the broker database is run, either inside or outside the cluster.
If you choose to run the database outside the cluster
then simply follow the instructions in the WMB documentation for creating the broker database
but ensure that you consider whether the database is a single point of failure
and make appropriate provision for the availability of the database.
[3] Create the message broker
- Create the broker on the node hosting the logical host using the hamqsicreatebroker command.
- Ensure that you can start and stop the broker manually using the mqsistart and mqsistop commands.
- On any other nodes in the resource group's nodelist
(i.e. excluding the one on which you just created the broker),
run the hamqsiaddbrokerstandby command to create the information needed by these nodes to enable them to host the broker.
[4] Place the broker under cluster control
- Create an application server which will run the broker, its queue manager and the database instance,
using the example scripts provided in this SupportPac.
The example scripts are called hamqsi_start_broker_as and hamqsi_stop_broker_as.
- You can also specify an application monitor using the hamqsi_applmon.<broker> script created by hamqsicreatebroker.
An application monitor script cannot be passed parameters, so just specify the name of the monitor script.
Also configure the other application monitor parameters,
including the monitoring interval and the restart parameters you require.
- Synchronise the cluster resources.
- Ensure that the broker, queue manager and database instance are stopped, and start the application server.
- Check that the components started and test that the resource group can be moved from one node to the other
and that they run correctly on each node.
- Ensure that stopping the application server stops the components.
- With the application server started, verify that the HACMP local restart capability is working as configured.
A convenient way to cause failures is to identify the bipservice for the broker and kill it.
MQ and HA
HisCock HA MQ whitepaper
HA in Clustering
A key problem in using MQ clustering for high availability
is the problem of stuck messages (in the transmission queue).
Message expiry.
If the message isn't delivered to its target
by the time the end-user would have timed out,
have it self-destruct.
(fjb_saper) Easy answer:
MQ clustering => load balancing.
Hardware clustering => HA (high availability).
If you've got an HA set up there are 2 main options:
- Use HA software so "the queue manager"
is presented on a single IP/port
no matter where it happens to be running
so the value in the TAB file is always valid.
- Use the TAB file to define multiple instances of "THEQM"
which identify QMA, QMB, etc
Complete (MQ&MB) Schema
The first schema looks like this one :
.-------------------. .-------------------.
| | | |
| AIX-1 (active) | | AIX-2 (pasive) |
| | | |
| .---------. | | .---------. |
| | | | | | | |
| | MB1(a) | | | | MB1(p) | |
| | | | | | | |
| .---------. | | .---------. |
| | | |
| .---------. | | .---------. |
| | | | | | | |
| | QM1(a) | | | | QM1(p) | |
| | | | | | | |
| .---------. | | .---------. |
| | | |
.-------------------. .-------------------.
We have an active machine, AIX-1, running QM1 and MB1,
and a passive machine, AIX-2, which is almost always stopped.
So, to improve the utilization of this second machine,
we can create a second queue manager and a second broker on AIX-2,
and place its backup replicas in AIX-1 :
.--------------------------------. .--------------------------------.
| | | |
| AIX-1 (active) | | AIX-2 (active) |
| | | |
| .---------. | | .---------. |-
| | | | | | | | \
| | MB1(a) | | | | MB1(p) | | |
| | | | | | | | |
| .---------. | | .---------. | |
| | | | | => Service address 1
| .---------. | | .---------. | |
| | | | | | | | |
| | QM1(a) | | | | QM1(p) | | |
| | | | | | | | /
| .---------. | | .---------. |-
| | | |
| .---------. | | .---------. |-
| | | | | | | | \
| | MB2(p) | | | | MB2(a) | | |
| | | | | | | | |
| .---------. | | .---------. | |
| | | | | => Service address 2
| .---------. | | .---------. | |
| | | | | | | | |
| | QM2(p) | | | | QM2(a) | | |
| | | | | | | | /
| .---------. | | .---------. |-
| | | |
.--------------------------------. .--------------------------------.
Finally, we join QM1 and QM2 in an MQ cluster,
so while one machine is moving to its backup image,
the source messages are still processed.
An n+1 architecture is also possible : have "n" machines running,
and 1 more being the backup of all those "n" machines - we assume they will fail one at a time !
Install MC91 (HACMP for MQ) first, then IC91 (HACMP for MB).
Complete list :
MB_HACMP (ext, ***)
Install checklist :
MB_HACMP
/MQHA/bin/hamqproc
/MQHA/bin/hamqproc contains the list of processes to be killed by hamqm_stop_su ($srchstr, the queue manager match pattern, is set earlier in the script) :
for process in `cat /MQHA/bin/hamqproc`
do
ps -ef | grep $process | grep -v grep | \
egrep "$srchstr" | awk '{print $2}'| \
xargs kill -9
done
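The pipeline's behaviour can be checked offline by feeding it a captured `ps -ef` line instead of live processes. The sample line and its PID are made up for the demo:

```shell
# One captured line of `ps -ef` output (fabricated for this demo)
sample="mqm 12345     1   0 10:00      - 00:00:01 amqzxma0 -m QMPROD01"
# The same match pattern hamqm_stop_su builds for queue manager QMPROD01
srchstr="( |-m)QMPROD01[ ]*.*$"
# Extract the PID exactly as the stop script does (field 2 of ps -ef)
pid=$(echo "$sample" | grep amqzxma0 | grep -v grep | \
      egrep "$srchstr" | awk '{print $2}')
echo "$pid"   # the PID that would be passed to kill -9
```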
{bestp}
WMQ in HA Clusters - best practices
- Separate disks for data files and logs
- While not essential for HA reasons, it is recommended for performance
- All our examples use this configuration
- Channel state is hardened
- Sender channels will be automatically restarted (if triggered)
- Use the virtual IP address (or name) in the CONNAME
- Requesters will not be auto-restarted
- But a 'server' machine is less likely to have requester channels
- May need to use custom scripts or services to restart things like
- Trigger monitors
- Command server
- Applications
- It's exactly the same as a normal queue manager restart
- So non-persistent messages will disappear
- Long-running transactions may cause long-running restart
- if cluster is used, place FR's on HACMP machines
TMM04 {BCN}
Perl
You will need to ensure the shebang line points to it.
So you may need to set it to
#!/usr/bin/perl
or #!/bin/perl
or even #!/usr/perl - whatever the local standard is.
How do you find the "local standard" ?
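One portable way to discover it (the fallback path below is an assumption, not a guarantee):

```shell
# 'command -v' is POSIX and works in ksh, bash and sh; fall back to the
# common default location if perl is not on the PATH (fallback is an assumption)
PERL=$(command -v perl || echo /usr/bin/perl)
echo "#!${PERL}"   # the shebang line to use
```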
HACMP "stop" script
#!/bin/ksh
# DESCRIPTION:
# /MQHA/bin/ha_mqm_stop_su <qmname>
#
# Stops the QM.
# Check to see if the QM is already stopped.
# If so, just make sure no processes are lying around.
QM=$1    # queue manager name, passed as the first argument
online=`/MQHA/bin/hamqm_running ${QM}`
if [ "${online}" != "1" ]
then
# QM is reported as offline; ensure no processes remain
# Note that this whole script should be executed under su, which is why there's no su in the following loop.
# The regular expression in the next line contains a tab character. Edit only with tab-friendly editors.
srchstr="( |-m)$QM[ ]*.*$"
for process in runmqlsr amqpcsea amqhasmx amqharmx amqzllp0 \
amqzlaa0 runmqchi amqrrmfa amqzxma0
do
ps -ef | grep $process | grep -v grep | \
egrep "$srchstr" | awk '{print $2}'| \
xargs kill -9
done
exit 0
fi
It can be done (newer stop_su) providing the names in a file :
see here.
HACMP "link" script
The core is :
# Args:
# $1: Qmgr name
# $2: Mangled qmgr directory name -- may or may not be the same as qmgr
# $3: Shared Prefix -- e.g. /MQHA//data
if [ -r $3/qmgrs/$2/qm.ini ]
then
# We're running on the master node that owns the queue manager
# so we will create symlinks back to /var/mqm/ipc subdirs
for topdir in @ipcc @qmpersist @app
do
for subdir in esem isem msem shmem spipe
do
rm -fr $ipcorig/$subdir
rm -fr $ipcorig/$topdir/$subdir
ln -fs $ipcbase/$subdir $ipcorig/$subdir
ln -fs $ipcbase/$topdir/$subdir $ipcorig/$topdir/$subdir
done
done
rm -rf $ipcorig/qmgrlocl
ln -fs $ipcbase/qmgrlocl $ipcorig/qmgrlocl
else
# We're running on a standby node, so all we have to do is to
# update the config file that tells us where the queue manager lives
cat >> /var/mqm/mqs.ini <<EOF
QueueManager:
Name=$1
Prefix=$3
Directory=$2
EOF
fi
HACMP "MQ monit" script
"simple" one
Just does a "ping qmgrname" :
dy0608:/MQHA/bin # more hamqm_applmon.QMPROD01
#!/bin/ksh
su mqm -c /MQHA/bin/hamqm_applmon_su QMPROD01
dy0608:/MQHA/bin # more hamqm_applmon_su
#!/bin/ksh
QM=$1
# Test the operation of the QM.
echo "ping qmgr" | runmqsc ${QM} > /dev/null 2>&1
pingresult=$?
# pingresult will be 0 on success; non-zero on error (man runmqsc)
if [ $pingresult -eq 0 ]
then # ping succeeded
echo "hamqm_applmon: Queue Manager ${QM} is responsive"
result=0
else # ping failed
result=$pingresult
fi
exit $result
New alternative (use -m to select the queue manager ; -n suppresses message translation, so "RUNNING" matches reliably) :
dspmq -n -m <qmname> | grep "RUNNING"
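A monitor script built on this check might look like the sketch below. dspmq only exists on MQ hosts, so its output is simulated here with a variable (the exact output format is an assumption):

```shell
QM=QMPROD01                                   # example queue manager name
# What `dspmq -n -m $QM` is expected to print for a running qmgr (simulated)
dspmq_out="QMNAME(${QM}) STATUS(RUNNING)"
if echo "$dspmq_out" | grep -q "RUNNING"
then
    echo "applmon: queue manager ${QM} is running"
    rc=0
else
    echo "applmon: queue manager ${QM} is NOT running"
    rc=1
fi
```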
"complex" one
Verifies that a few key processes are still running :
Check_qmgr: # Check for the main processes
for proc in amqzxma0 amqhasmx amqzllp0
do
# $PATTERN (set earlier in the script) restricts the match to this queue manager
if ps -u mqm -o pid,args | eval /usr/xpg4/bin/grep -E '$PATTERN' |\
grep -w $proc > /dev/null
then
rc=0
else
rc=1
fi
Gracias, Vicente !
HACMP "MB monit" script
STATE="stopped"
#
cnt=`ps -ef | grep db2sysc | grep -v grep | grep $DBINST | wc -l`
if [ $cnt -gt "0" ]
then
# Found one or more db2sysc process, so database instance assumed to be running normally
echo "hamqsi_monitor_broker_as: Broker database is running"
STATE="started"
else
# Did not find a db2sysc process, but check to see whether db2start is still running and only report error if there is not one.
cnt=`ps -ef | grep db2start | grep -v grep | grep $DBINST | wc -l`
if [ $cnt -gt "0" ]
then
echo "hamqsi_monitor_broker_as: Broker database is starting"
STATE="starting"
else
echo "hamqsi_monitor_broker_as: Broker database is not running correctly"
STATE="stopped"
fi
fi
# Decide whether to continue or to exit
case $STATE in
stopped)
echo "hamqsi_monitor_broker_as: Database instance ($DBINST) is not running correctly"
exit 1
;;
starting)
echo "hamqsi_monitor_broker_as: Database instance ($DBINST) is starting"
echo "hamqsi_monitor_broker_as: WARNING - Stabilisation Interval may be too short"
echo "hamqsi_monitor_broker_as: WARNING - No test of broker $BROKER will be conducted"
exit 0
;;
started)
echo "hamqsi_monitor_broker_as: Database instance ($DBINST) is running"
: # fall through and proceed by testing the broker ('continue' is only valid inside a loop)
;;
esac
# ------------------------------------------------------------------
# Check the MQSI Broker is running
#
# Re-initialise STATE for safety
STATE="stopped"
#
# The broker runs as a process called bipservice which is responsible for starting and re-starting the admin agent process (bipbroker).
# The bipbroker is responsible for starting any DataFlowEngines.
# If no execution groups have been assigned to the broker there will be no DataFlowEngine processes.
# There should always be a bipservice and bipbroker process pair.
# This monitor script only tests for bipservice, because bipservice should restart bipbroker if necessary
# - the monitor script should not attempt to restart bipbroker and it may be premature to report an absence of a bipbroker as a failure.
cnt=`ps -ef | grep "bipservice $BROKER" | grep -v grep | wc -l`
if [ $cnt -eq 0 ]
then
echo "hamqsi_monitor_broker_as: MQSI Broker $BROKER is not running"
STATE="stopped"
else
echo "hamqsi_monitor_broker_as: MQSI Broker $BROKER is running"
STATE="started"
fi
# Decide how to exit
case $STATE in
stopped)
echo "hamqsi_monitor_broker_as: Broker ($BROKER) is not running correctly"
exit 1
;;
started)
echo "hamqsi_monitor_broker_as: Broker ($BROKER) is running"
exit 0
;;
esac
HA logs
An easy way to monitor cluster events and messages is by tailing the following HACMP log files:
- /tmp/hacmp.out - Outputs detailed logs on cluster events and output from
application start and stop scripts. Note that output from the monitoring is not
part of this log file.
- /usr/es/adm/cluster.log - Contains timestamped, formatted messages
generated by HACMP start and stop scripts and daemons.
This file contains only high level events, and not detailed log information.
The above log files are local to each node.
Application monitoring in HACMP has its own set of log files.
HA sanity tests
Manual system start up tests :
- Node 1 and node 2 are both active:
- disable the cluster on both nodes by using smitty clsstop
- stopping the cluster unmounts the shared drives. Mount the shared drives.
- start QM1 on node 1
- start QM2 on node 2
- observe the results to verify that no errors are reported during MQ operation.
- Node 2 is the only active node:
- start QM1 on node 2
- start QM2 on node 2
- observe the results to verify that no errors are reported during MQ operation.
- Node 1 is the only active node:
- start QM2 on node 1
- start QM1 on node 1
- observe the results to verify that no errors are reported during MQ operation.
Verify HA configuration
The tests cases for verifying the HA configuration are:
- Shared files - the following files should be located in shared directories:
- MQ logs
- Queue manager data for every queue manager should reside in a shared location.
/MQHA/<qmgr>/data and /MQHA/<qmgr>/log
Separate disks for data files and logs -
while not essential for HA reasons, it is recommended for performance
Test HA control
The objective of this test suite is to verify that HACMP is able
to start, restart, and monitor all the individual applications that are part of the cluster.
- Automatic system startup/restart under HACMP control
- Restart attempts setting
Verify the number of retry attempts.
Each WebSphere Business Integration application will be restarted three times
before a resource group failover is initiated.
The number of retry attempts can be configured in HACMP.
- Failover
- Fallback
Fallback refers to the movement of a resource group from a secondary or a failover node to the primary node,
which is being reintegrated into the cluster.
In the current WebSphere Business Integration cluster, automatic fallback is disabled.
However, manual reintegration should be validated:
- Bring node 1 down:
shutdown -r now
- Node 1 resource group fails over to node 2.
- Verify failover by looking at the HACMP log file:
tail -f /tmp/hacmp.out
- Start up cluster on node 1 after node 1 is back up:
smitty clstart
- Observe that the resource groups are still running on node 2
even though the cluster on node 1 is back up.
- Repeat the above test for node 2 fallback.
Migration / maintenance
Assuming a two-node active/active cluster, the steps are
- Select one machine to upgrade first
- At a suitable time, when the moving of a queue manager will not cause a serious disruption to service,
manually force a migration of the active queue manager to its partner node
- On the machine that is now running both queue managers, disable the failover capabilities for the queue managers.
- Upgrade the software on the machine that is not running any queue managers
- Re-enable failover, and move both queue managers across to the newly upgraded machine
- Disable failover again
- Upgrade the original box
- Re-enable failover
- When it will cause least disruption, move one of the queue managers across to balance the workload
HACMP tips
- Fix packs : move both environments to a single machine and apply maintenance to the "empty" machine.
Fail back all components and apply maintenance to the other machine.
Caution : there is a point of no return !
- design applications to handle duplicates
if store & forward plus persistence is used,
when we introduce disaster recovery,
a chance of duplicate messages is created.
if an end-to-end transaction number is used (and in the E2E ack as well),
the duplicate problem is solved and persistence is no longer required.
- design applications to be able to run in multiple instances
- a PR (partial repository) is a subscriber of the FR's (full repositories)
and (if there are 2) it receives duplicates
- how to monitor HACMP is working ok ?
Include checking all of the things that the qmgr must have to satisfy the business applications, like:
the status of channels, listeners, availability of application queues (put/get enabled, trigger attributes),
existence of processes, namelists, trigger monitor, applications, ...
- maybe you don't trust any "process monitoring"
and prefer to put a message into a queue to validate that the qmgr is working !
- script to go to backup node :
- endmqm (in background)
- wait some time
- kill remaining mq resources in specific order (see SysAdmin manual)
- the CM (Configuration Manager) does not need to be under HACMP :
not many changes to the production environment.
The CM can run on a VMware image !
FileSystem Requirements
What are the filesystem requirements for HACMP with MQ ?
And when using multi-instance ?
NFS v4 !
Multi-instance MQ & MB
multi-instance queue managers -
good
intro.
In multi-instance terminology,
there is an Active qmgr and a Standby qmgr;
both are running, and the standby waits to acquire the file locks held by the active instance.
Read
nfsv4 specs,
RFC 3530 : lease period
Increase messaging availability :
url
Creating a
multi-instance qmgr on Linux
Both machines have different IP addresses !
So,
"do NOT use multi-instance queue managers as full repositories",
page 19,
but
here
it says
"if you still need better availability, consider hosting the full repository queue managers as multi-instance queue managers" !
CONNAME has been expanded to support more than one "ipaddress(port)" combination, across all channel types that use it :
define channel(CH_NAME) chltype(SDR) trptype(TCP) xmitq(XQN) conname('<ip>(<port>)') replace
sample : DEFINE CHANNEL(CHANNEL1) CHLTYPE(CLNTCONN) TRPTYPE(TCP) CONNAME('server1(2345),server2(2345)') QMNAME(QM1) REPLACE
Mind the 48 character limit for "CONNAME" !
developerWorks :
complete sample,
part 2 ;
creating a multi-instance queue manager for MQ on
Linux.
- create shared directories :
url
- create multi-instance MQ :
url
- create multi-instance MB :
url
When you intend to use a queue manager as a multi-instance queue manager,
create a single queue manager on one of the servers using the WebSphere MQ crtmqm command,
placing its queue manager data and logs in shared network storage.
On the other server, rather than create the queue manager again, use the WebSphere MQ addmqinf command
to create a reference to the queue manager data and logs on the network storage.
You can now run the queue manager from either of the servers.
Each of the servers references the same queue manager data and logs;
there is only one queue manager, and it is active on only one server at a time.
Once the standby instance has started, you can swap the active instance to the other server by stopping the active instance with the switchover option, which transfers control to the standby.
The active instance of QM1 has exclusive access to the shared queue manager data and logs folders when it is running.
The standby instance of QM1 detects when the active instance has failed, and becomes the active instance.
It takes over the QM1 data and logs in the state they were left by the active instance, and accepts reconnections from clients and channels.
The active instance might fail for various reasons that result in the standby taking over:
- Failure of the server hosting the active queue manager instance.
- Failure of connectivity between the server hosting the active queue manager instance and the file system.
- Unresponsiveness of queue manager processes, detected by WebSphere MQ, which then shuts down the queue manager.
You can add the queue manager configuration information to multiple servers, and choose any two servers to run as the active/standby pair.
A multi-instance queue manager is one part of a high availability solution.
You need some additional components to build a useful high availability solution.
- client and channel reconnection to transfer WebSphere MQ connections to the computer that takes over running the active queue manager instance.
- a high performance shared network file system that manages locks correctly and provides protection against media and file server failure.
- resilient networks and power supplies to eliminate single points of failure in the basic infrastructure.
- applications that tolerate failover. In particular you need to pay close attention to the behavior of transactional applications, and to applications that browse WebSphere MQ queues.
- monitoring and management of the active and standby instances to ensure that they are running, and to restart active instances that have failed. Although multi-instance queue managers restart automatically, you need to be sure that your standby instances are running, ready to take over, and that failed instances are brought back online as new standby instances.
WebSphere MQ Clients and channels reconnect automatically to the standby queue manager when it becomes active.
Reconnection, and the other components in a high availability solution are discussed in related topics.
Automatic client reconnect is not supported by WebSphere MQ classes for Java.
MQ,
MB.
NFS specs and samples
Filesystem
requisites :
The storage must be accessed via a network file system protocol that is POSIX-compliant and supports lease-based locking.
Network File System version 4 (NFS v4) satisfies this requirement.
Also "NAS" or "GPFS".
Probeid
ZX155001 :
If you are using NFS v4 as the shared file system,
you must use hard mounts and synchronous writes, and disable write caching, to fulfil these requirements.
Verification
tool : amqmfsck
Examples of suitable highly available networked storage:
- IBM GPFS
- Veritas Cluster File System
- Highly available NFSv4
2-instance creation summary
Summary:
- Set up shared filesystems for QM data and logs
- Create the queue manager on machine1
crtmqm -md /shared/qmdata -ld /shared/qmlog QM1
- Define the queue manager on machine2 (or edit mqs.ini)
addmqinf -s QueueManager -v Name=QM1 -v Directory=QM1 -v Prefix=/var/mqm -v DataPath=/shared/qmdata/QM1
- Start an instance on machine1 - it becomes Active
strmqm -x QM1
- Start another instance on machine2 - it becomes Standby
strmqm -x QM1
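Once both instances are up, you can inspect and swap them; a sketch using the same QM1 as above:

```shell
# Show which instance of QM1 is active and which is standby.
dspmq -x -m QM1

# Planned switchover: end the active instance and transfer control to
# the standby (run on the machine hosting the active instance).
endmqm -s QM1
```

dspmq -x lists the instances per server, which is a quick way to confirm that the standby really is connected and waiting.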
Filesystem verification tool
Mind "File System Check tool" : amqmfsck,
( applies only to UNIX and IBM i systems ).
Details
here
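Typical use of amqmfsck against the shared directory looks like this sketch (the path is a placeholder):

```shell
# Basic test of POSIX file-locking behaviour on the shared directory.
amqmfsck /shared/qmdata

# Run these on BOTH machines at the same time:
amqmfsck -c /shared/qmdata   # test writing to the directory concurrently
amqmfsck -w /shared/qmdata   # test waiting for and releasing locks
```

If any of these checks fail, the file system is not suitable for multi-instance queue managers.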
Verifying the multi-instance queue manager on Linux
Use the sample programs amqsghac, amqsphac and amqsmhac to verify a multi-instance queue manager configuration.
url
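For example, run the high-availability samples against the queue manager, then trigger a switchover and watch the clients reconnect (queue Q1 is assumed to exist):

```shell
# Window 1: put a message every couple of seconds (HA put sample).
amqsphac Q1 QM1

# Window 2: get the messages back (HA get sample).
amqsghac Q1 QM1

# Window 3: on the machine with the active instance, force a switchover;
# both samples should report reconnection to the standby instance.
endmqm -s QM1
```

Seeing the samples resume after the switchover confirms that automatic client reconnection is working end to end.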
Conversion (unix)
Implementation considerations for Multi-Instance queue managers in MQ cluster environment :
How to convert queue manager to be multi-instance
Windows domains and multi-instance queue managers
The only way to ensure each of the servers running queue manager instances
use the same local mqm group with the same SID as the owner of the queue manager data and log directories on the file server
is to make the local mqm group a domain local group.
In order to use domain local groups, you must run multi-instance queue managers on a domain controller.
On a domain controller all local groups are implicitly domain local groups.
url
Failover mechanism
How does the standby queue manager take over ?
Actions that cause a failover.
Failover of a multi-instance queue manager can be triggered by hardware or software failures,
including networking problems which prevent the queue manager writing to its data or log files.
To be confident that a shared file system will provide integrity
and work with a multi-instance queue manager when such a problem occurs unexpectedly,
test all possible failure scenarios.
A list of actions that would cause a failover includes:
- Shutting down the operating system including syncing the disks
- Halting the operating system without syncing the disks
- Physically pressing the server's reset button
- Physically pulling the network cable out of the server
- Physically pulling the power cable out of the server
- Physically switching the machine off
Two IPs
A remote client can access the multi-instance qmgr via:
DEFINE CHANNEL(CHANNEL1) CHLTYPE(SVRCONN) TRPTYPE(TCP) MCAUSER('mqm') REPLACE
DEFINE CHANNEL(CHANNEL1) CHLTYPE(CLNTCONN) TRPTYPE(TCP) CONNAME('ipaddr1(1414),ipaddr2(1414)') QMNAME(QM1) REPLACE
START CHANNEL(CHANNEL1)
Multi-instance MB
MB starts/stops as an MQ service ...
Configuring a multi-instance Message Broker for
High Availability support :
A multi-instance broker is created using the mqsicreatebroker command,
with an additional -e option that specifies the location in shared network storage of the broker registry and other configuration data.
Additional instances of the broker can then be created on other machines in the network using a new command called mqsiaddbrokerinstance,
using the -e option to target the same location in shared network storage.
Broker logging, error handling and shared Java Classes remain local to the machine that hosts the broker or broker instance.
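A sketch of the broker commands described above (broker, queue manager, and path names are placeholders):

```shell
# Machine 1: create the multi-instance broker, placing its registry and
# configuration data on shared network storage (-e).
mqsicreatebroker BRKA -q QMA -e /shared/brkdata

# Machine 2: add a second instance pointing at the same shared storage.
mqsiaddbrokerinstance BRKA -e /shared/brkdata
```

As with the queue manager, there is only one broker; whichever instance obtains the locks on the shared storage becomes active.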
Configuring a WebSphere Message Broker to run in
multi-instance mode
Active/active multi-instance MB
You can create an ACTIVE/ACTIVE scenario using multi-instance brokers/queue managers.
The setup would look like this:
- machine 1 - QMA & BRKA (Active), QMB & BRKB (Passive)
- machine 2 - QMA & BRKA (Passive), QMB & BRKB (Active)
QMA & QMB working in a cluster to provide load balancing.
HA manager
Using a broker with an existing
high availability manager,
using a broker with an existing
Windows cluster
Multi-instance or HA cluster?
Multi-instance queue manager: advantages and drawbacks
- Integrated into the IIB and MQ products
- Faster failover than HA cluster*
- Delay before queue manager restart is much shorter*
- Runtime performance of networked storage must be considered
- IP address of standby instance is different to primary
- No automatic fail-back to primary hardware when restored
- More susceptible to MQ and OS defects
HA cluster: advantages and drawbacks
- Capable of handling a wider range of failures
- Failover historically rather slow, but some HA clusters are improving
- Some customers frustrated by unnecessary failovers
- Require MC91 SupportPac or equivalent configuration
- Extra product purchase and skills required
Storage distinction
- Multi-instance queue manager typically uses NAS
- HA clustered queue manager typically uses SAN