HACMP
HACMP configurations
Cluster Configurations :
A standby configuration is the most basic cluster configuration
in which one node performs work whilst the other node acts only as standby.
The standby node does not perform work and is referred to as idle;
this configuration is sometimes called "cold standby".
A takeover configuration is a more advanced configuration
in which all nodes perform some kind of work
and critical work can be taken over in the event of a node failure.
A "one sided takeover" configuration is one in which
a standby node performs some additional, non-critical and non-movable work.
This is rather like a standby configuration
but with (non critical) work being performed by the standby node.
A "mutual takeover" configuration is one in which
all nodes are performing highly available (movable) work.
This type of cluster configuration is also sometimes referred to as "Active/Active"
to indicate that all nodes are actively processing critical workload.
HACMP, VCS, ServiceGuard , Heartbeat and MSCS all use a "shared nothing" clustering architecture.
A shared nothing cluster has no concurrently shared resources,
and works by transferring ownership of resources from one node to another,
to work around failures or in response to operator commands.
Resources are things like disks, network addresses, or critical processes.
Configuration :
All HA products have the concept of a unit of failover.
This is a set of definitions that contains all the processes and resources
needed to deliver a highly available service
and ideally should contain only those processes and resources.
In HACMP, the unit of failover is called a resource group.
On other HA products the name might be different, but the concept is the same.
On VCS, it is known as a service group,
on MC/ServiceGuard it is a package,
in Heartbeat it is a resource group, and in MSCS it is a group.
The smallest unit of failover for WMQ is a queue manager,
since you cannot move part of a queue manager without moving the whole thing.
It follows that the optimal configuration is to place each queue manager in a separate resource group,
with the resources upon which it depends.
The resource group should therefore contain
the shared disks used by a queue manager,
which should be in a volume group or disk group reserved exclusively for the resource group,
the IP address used to connect to the queue manager (the service address)
and an object which represents the queue manager.
Failover -
Invoking a secondary system to take over when the primary system fails.
HACMP software samples
Examples of HA cluster software :
- IBM PowerHA (HACMP)
- Veritas Cluster Server (VCS)
- Microsoft Cluster Service (MSCS)
- Red Hat Cluster
When not to use HA WebSphere MQ queue manager clusters
HA WebSphere MQ queue manager clusters require
additional proprietary HA hardware (shared disks)
and external HA clustering software (such as HACMP).
This increases the administration costs of the environment
because you also need to administer the HA components.
This approach also increases the initial implementation costs
because extra hardware and software are required.
Therefore, balance these initial costs
with the potential costs incurred if a queue manager fails and messages become trapped.
If trapped messages are not a problem for the applications
(for example, the response time of the application is irrelevant or the data is updated frequently),
then HA WebSphere MQ queue manager clusters are probably not required.
General recommendations
Some of the advice pertinent to an HA environment in general is:
- Each node in the cluster must be sized large enough to support the total load under failure conditions.
Dynamic process allocation under AIX
(shutting down less important processes on the production failover machine)
can help lower hardware costs.
- Optimal resource utilization in an HA environment occurs when
each node in an Active/Active cluster drives similar loads.
- Use HA clustering for high availability only, not for performance or simplified administration.
- The HA solution in the QA environment should exactly mimic the production environment.
This helps to avoid critical production problems and minimizes down time.
- Test individual HA components for online, offline, and failover.
- Attempt to restart the individual HA components
(for example, InterChange Server or Message Broker)
a minimum of three times on the primary node before failing over to the alternate node.
- Automatic fallback is not recommended.
Manual fallback is recommended to minimize system disruption.
- It is recommended that a stock of spare components such as network adapters, cables, etc. be maintained.
Related to this is also a need to define an escalation plan to deal with unexpected failures.
- Be sure to create a list identifying who to call when something HA specific goes wrong,
or the normal contacts are not available.
Highly Available WebSphere Business Integration Solutions, SG24-6328-00, chapter 8.2, page 122.
MC91 - HA for MQ
This SupportPac has now been withdrawn.
The support is now included in the WebSphere MQ V7.0.1 product and documentation.
url MC91 - high availability for MQ on Unix.
Install into /MQHA/bin : the sample scripts assume this path.
This SupportPac provides notes and sample scripts
to assist with the installation and configuration
of WebSphere MQ (WMQ) V6 and V7 in High Availability (HA) environments.
Three different platforms and environments are described here,
but they share a common design
and this design can also be extended for many other systems.
Specifically this SupportPac deals with the following HA products:
- HACMP (High Availability Cluster Multi Processing)
- Veritas Cluster Server (VCS)
- MC/ServiceGuard (MCSG)
MC91 installation :
16/03/2009 20:40 1.310.445 mc91.tar.Z
{mqm - /MQHA/bin/ } $ uncompress mc91.tar.Z
{mqm - /MQHA/bin/ } $ tar -xvf mc91.tar
MC91 configuration :
- [1] Configure the HA Cluster
- [2] Configure the shared disks
- [3] Create the Queue Manager
- [4] Configure the movable resources
- [5] Configure the Application Server or Agent
- [6] Configure a monitor
[1] Configure the HA Cluster
- Configure TCP/IP on the cluster nodes for HACMP. Remember to configure ~root/.rhosts, /etc/rc.net, etc.
- Configure the cluster, cluster nodes and adapters to HACMP as usual.
- Synchronise the Cluster Topology.
[2] Configure the shared disks
This step creates the volume group (or disk group) and filesystems needed for the queue manager.
So that this queue manager can be moved from one node to another
without disrupting any other queue managers,
you should designate a group containing shared disks
which is used exclusively by this queue manager and no others.
For performance, it is recommended that a queue manager uses separate filesystems for logs and data.
The suggested layout therefore creates two filesystems within the volume group.
You can optionally protect each of the filesystems from disk failures by using mirroring or RAID.
Mount points must all be owned by the mqm user.
You will need the following filesystems:
- Per node:
- /var on a local non-shared disk - this is a standard filesystem or directory which will already exist.
- Per queue manager:
- /MQHA/<qmgr>/data on shared disks - this is where the queue manager data directory will reside.
- /MQHA/<qmgr>/log on shared disks - this is where the queue manager recovery logs will reside.
The steps are :
- Create the volume group that will be used for this queue manager's data and log files.
- Create the /MQHA/<qmgr>/data and /MQHA/<qmgr>/log filesystems using the volume group created above.
- For each node in turn, import the volume group, vary it on, ensure that the filesystems can be mounted, unmount the filesystems and varyoff the volume group.
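On AIX the steps above map roughly to the following commands. The volume group name haqmvg, the disk hdisk2 and the filesystem sizes are illustrative examples, not values prescribed by the SupportPac:

```shell
# --- on the first node ---
mkvg -y haqmvg hdisk2                          # create the volume group
crfs -v jfs2 -g haqmvg -m /MQHA/QM1/data -a size=1G
crfs -v jfs2 -g haqmvg -m /MQHA/QM1/log  -a size=1G
mount /MQHA/QM1/data ; mount /MQHA/QM1/log
chown mqm:mqm /MQHA/QM1/data /MQHA/QM1/log     # mount points must belong to mqm
umount /MQHA/QM1/data ; umount /MQHA/QM1/log
varyoffvg haqmvg

# --- on each other node in turn ---
importvg -y haqmvg hdisk2
varyonvg haqmvg
mount /MQHA/QM1/data ; mount /MQHA/QM1/log     # verify the mounts work
umount /MQHA/QM1/data ; umount /MQHA/QM1/log
varyoffvg haqmvg
```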
[3] Create the Queue Manager
- Select a node on which to perform the following actions
- Ensure the queue manager's filesystems are mounted on the selected node.
- Create the queue manager on this node, using the hacrtmqm script
- Start the queue manager manually, using the strmqm command
- Create any queues and channels
- Test the queue manager
- End the queue manager manually, using endmqm
- On the other nodes, which may take over the queue manager, run the halinkmqm script
[4] Configure the movable resources
The resource group will use the IP address as the service label.
This is the address which clients and channels will use to connect to the queue manager.
- Create a resource group and select the type as discussed above.
- Configure the resource group in the usual way adding the service IP label, volume group and filesystem resources to the resource group.
- Synchronise the cluster resources.
- Start HACMP on each cluster node in turn and ensure that the cluster stabilizes, that the respective volume groups are varied on by each node and that the filesystems are mounted correctly.
[5] Configure the Application Server or Agent
The queue manager is represented within the resource group by an application server or agent.
- Define an application server which will start and stop the queue manager.
The start and stop scripts contained in the SupportPac may be used unmodified,
or may be used as a basis from which you can develop customized scripts.
The examples are called hamqm_start and hamqm_stop.
- Add the application server to the resource group definition created in the previous step.
- Optionally, create a user exit in /MQHA/bin/rc.local
- Synchronise the cluster configuration.
- Test that the node can start and stop the queue manager,
by bringing the resource group online and offline.
[6] Configure a monitor
You can configure an application monitor which will monitor the health of the queue manager
and trigger recovery actions as a result of MQ failures, not just node or network failures.
Recovery actions include the ability to perform local restarts of the queue manager
or to cause a failover of the resource group to another node.
To benefit from queue manager monitoring you must define an Application Monitor.
If you created the queue manager using hacrtmqm,
then one of these will have been created for you,
in the /MQHA/bin directory,
and is called hamqm_applmon.$qmgr.
- To enable queue manager monitoring,
define a custom application monitor for the Application Server created in previous step,
providing the name of the monitor script and telling HACMP how frequently to invoke it.
Set the stabilisation interval to 10 seconds,
unless your queue manager is expected to take a long time to restart.
This would normally be if your environment has long-running transactions
that might cause a substantial amount of recovery/replay to be required.
- To configure for local restarts, specify the Restart Count and Restart Interval.
- Synchronise the cluster resources.
- Test the operation of the application monitoring,
and in particular verify that the local restart capability is working as configured.
A convenient way to provoke queue manager failures is
to identify the Execution Controller process (called amqzxma0) associated with the queue manager, and kill it.
Conclusion : the files we have to copy into /hacmp/, and adapt for our system, are :
- hamqm_start
- hamqm_stop
- hamqm_applmon.$qmgr
Then, using smitty, we have to
- define an application server
- define a custom application monitor
IC91 - HA for MB
url IC91 - high availability for MB on distributed platforms.
Install into /MQHA/bin : the sample scripts assume this path.
A broker runs as a pair of processes, called bipservice and bipbroker.
The latter in turn creates the execution groups that run message flows.
It is this collection of processes that is managed by the HA software.
When creating the queue manager,
don't configure the application server or application monitor described in SupportPac MC91.
You will create an application server that covers the broker, queue manager and broker database instance.
When creating channels between queue managers,
the sender channel should use the service address of the broker resource group
and the broker queue manager's port number.
- General Configuration
- [0] Configure the cluster
- Configure the shared disks
- Configuration steps for Broker - UNIX
- [1] Create and configure the queue manager. Create the resource group.
- [2] Create and configure the broker database
- [3] Create the message broker
- [4] Place the broker under cluster control
[0] Configure the HA Cluster
- Configure TCP/IP on the cluster nodes as described in your cluster software documentation.
- Configure the cluster, cluster nodes and adapters to HA Software as usual.
- Synchronise the Cluster Topology.
- Now would be a good time to create and configure the user accounts
that will be used to run the database instances, brokers and UNS.
Home directories, (numeric) user ids, passwords, profiles and group memberships
should be the same on all cluster nodes.
[1] Create and configure the queue manager
- On one node, create a clustered queue manager as described in SupportPac MC91, using the hacrtmqm command.
Use the volume group that you created for the broker and place the volume group and queue manager
into a resource group to which the broker will be added.
Don't configure the application server or application monitor described in SupportPac MC91 -
you will create an application server that covers the broker, queue manager and broker database instance.
- Set up queues and channels between the broker queue manager and the Configuration Manager queue manager:
- On the Configuration Manager queue manager create a transmission queue for communication to the broker queue manager. Ensure that the queue is given the same name and case as the broker queue manager. The transmission queue should be set to trigger the sender channel.
- On the Configuration Manager queue manager create a sender and receiver channel for communication with the broker queue manager. The sender channel should use the service address of the broker resource group and the broker queue manager's port number.
- On the broker queue manager create a transmission queue for communication to the Configuration Manager queue manager. Ensure that the queue is given the same name and case as the Configuration Manager queue manager. The transmission queue should be set to trigger the sender channel.
- On the broker queue manager create sender and receiver channels to match those just created on the Configuration Manager queue manager. The sender channel should use the IP address of the machine where the Configuration Manager queue manager runs, and the corresponding listener port number.
- If you are using a UNS, set up queues and channels between the broker queue manager and the UNS queue manager:
- On the broker queue manager create a transmission queue for communication to the UNS queue manager. Ensure that the queue is given the same name and case as the UNS queue manager. The transmission queue should be set to trigger the sender channel.
- On the broker queue manager create a sender and receiver channel for communication with the UNS queue manager. If the UNS is clustered, the sender channel should use the service address of the UNS resource group and the UNS queue manager's port number.
- On the UNS queue manager create a transmission queue for communication to the broker queue manager. Ensure that the queue is given the same name and case as the broker queue manager. The transmission queue should be set to trigger the sender channel.
- On the UNS queue manager create a sender and receiver channel for communication with the broker queue manager, with the same names as the receiver and sender channel just created on the broker queue manager. The sender channel should use the service address of the broker resource group and the broker queue manager's port number.
- Test that the above queue managers can communicate regardless of which node owns the resource groups they belong to.
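The channel setup above can be sketched in MQSC. The queue manager names (CFGMGR.QM, BROKER.QM), the channel names and the port 1414 are placeholder examples, not taken from the SupportPac:

```shell
# Run against the Configuration Manager queue manager (example names throughout)
runmqsc CFGMGR.QM <<'EOF'
* Transmission queue, named exactly after the broker queue manager,
* set to trigger the sender channel
DEFINE QLOCAL('BROKER.QM') USAGE(XMITQ) TRIGGER TRIGTYPE(FIRST) +
       INITQ('SYSTEM.CHANNEL.INITQ') TRIGDATA('CFGMGR.TO.BROKER') REPLACE
* Sender channel pointing at the SERVICE address of the broker resource group
DEFINE CHANNEL('CFGMGR.TO.BROKER') CHLTYPE(SDR) TRPTYPE(TCP) +
       CONNAME('broker-service-addr(1414)') XMITQ('BROKER.QM') REPLACE
* Matching receiver channel for the opposite direction
DEFINE CHANNEL('BROKER.TO.CFGMGR') CHLTYPE(RCVR) TRPTYPE(TCP) REPLACE
EOF
```

The mirror-image definitions on the broker queue manager use the fixed IP address of the Configuration Manager machine in the sender's CONNAME, as described above.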
[2] Create and configure the broker database
There are two options regarding where the broker database is run, either inside or outside the cluster.
If you choose to run the database outside the cluster
then simply follow the instructions in the WMB documentation for creating the broker database
but ensure that you consider whether the database is a single point of failure
and make appropriate provision for the availability of the database.
[3] Create the message broker
- Create the broker on the node hosting the logical host using the hamqsicreatebroker command.
- Ensure that you can start and stop the broker manually using the mqsistart and mqsistop commands.
- On any other nodes in the resource group's nodelist
(i.e. excluding the one on which you just created the broker),
run the hamqsiaddbrokerstandby command to create the information needed by these nodes to enable them to host the broker.
[4] Place the broker under cluster control
- Create an application server which will run the broker, its queue manager and the database instance,
using the example scripts provided in this SupportPac.
The example scripts are called hamqsi_start_broker_as and hamqsi_stop_broker_as.
- You can also specify an application monitor using the hamqsi_applmon.<broker> script created by hamqsicreatebroker.
An application monitor script cannot be passed parameters, so just specify the name of the monitor script.
Also configure the other application monitor parameters,
including the monitoring interval and the restart parameters you require.
- Synchronise the cluster resources.
- Ensure that the broker, queue manager and database instance are stopped, and start the application server.
- Check that the components started and test that the resource group can be moved from one node to the other
and that they run correctly on each node.
- Ensure that stopping the application server stops the components.
- With the application server started, verify that the HACMP local restart capability is working as configured.
A convenient way to cause failures is to identify the bipservice for the broker and kill it.
MQ and HA
HisCock HA MQ whitepaper
HA in Clustering
A key problem in using MQ clustering for high availability
is the problem of stuck messages (in the transmission queue).
Message expiry.
If the message isn't delivered to its target
by the time the end-user would have timed out,
have it self-destruct.
(fjb_saper) Easy answer:
MQ clustering => load balancing.
Hardware clustering => HA (high availability).
If you've got an HA set up there are 2 main options:
- Use HA software so "the queue manager"
is presented on a single IP/port
no matter where it happens to be running
so the value in the TAB file is always valid.
- Use the TAB file to define multiple instances of "THEQM"
which identify QMA, QMB, etc
Complete (MQ&MB) Schema
The first schema looks like this one :
.-------------------. .-------------------.
| | | |
| AIX-1 (active) | | AIX-2 (pasive) |
| | | |
| .---------. | | .---------. |
| | | | | | | |
| | MB1(a) | | | | MB1(p) | |
| | | | | | | |
| .---------. | | .---------. |
| | | |
| .---------. | | .---------. |
| | | | | | | |
| | QM1(a) | | | | QM1(p) | |
| | | | | | | |
| .---------. | | .---------. |
| | | |
.-------------------. .-------------------.
We have an active machine, AIX-1, running QM1 and MB1,
and a passive machine, AIX-2, which is almost always stopped.
So, to improve the utilization of this second machine,
we can create a second queue manager and a second broker on AIX-2,
and place its backup replicas in AIX-1 :
.--------------------------------. .--------------------------------.
| | | |
| AIX-1 (active) | | AIX-2 (active) |
| | | |
| .---------. | | .---------. |-
| | | | | | | | \
| | MB1(a) | | | | MB1(p) | | |
| | | | | | | | |
| .---------. | | .---------. | |
| | | | | => Service address 1
| .---------. | | .---------. | |
| | | | | | | | |
| | QM1(a) | | | | QM1(p) | | |
| | | | | | | | /
| .---------. | | .---------. |-
| | | |
| .---------. | | .---------. |-
| | | | | | | | \
| | MB2(p) | | | | MB2(a) | | |
| | | | | | | | |
| .---------. | | .---------. | |
| | | | | => Service address 2
| .---------. | | .---------. | |
| | | | | | | | |
| | QM2(p) | | | | QM2(a) | | |
| | | | | | | | /
| .---------. | | .---------. |-
| | | |
.--------------------------------. .--------------------------------.
Finally, we join QM1 and QM2 in an MQ cluster,
so while one machine is moving to its backup image,
the source messages are still processed.
An n+1 architecture is also possible : have "n" machines running,
and 1 more being the backup of all those "n" machines - we assume they will fail one at a time !
Install MC91 (HACMP for MQ) first, then IC91 (HACMP for MB).
Complete list :
MB_HACMP (ext, ***)
Install checklist :
MB_HACMP
/MQHA/bin/hamqproc
/MQHA/bin/hamqproc contains the list of processes to be killed by hamqm_stop_su ($srchstr, the queue manager match pattern, is set earlier in the script) :
for process in `cat /MQHA/bin/hamqproc`
do
ps -ef | grep $process | grep -v grep | \
egrep "$srchstr" | awk '{print $2}'| \
xargs kill -9
done
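The pipeline's behaviour can be checked offline by feeding it a captured `ps -ef` line instead of live processes. The sample line and its PID are made up for the demo:

```shell
# One captured line of `ps -ef` output (fabricated for this demo)
sample="mqm 12345     1   0 10:00      - 00:00:01 amqzxma0 -m QMPROD01"
# The same match pattern hamqm_stop_su builds for queue manager QMPROD01
srchstr="( |-m)QMPROD01[ ]*.*$"
# Extract the PID exactly as the stop script does (field 2 of ps -ef)
pid=$(echo "$sample" | grep amqzxma0 | grep -v grep | \
      egrep "$srchstr" | awk '{print $2}')
echo "$pid"   # the PID that would be passed to kill -9
```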
{bestp}
WMQ in HA Clusters - best practices
- Separate disks for data files and logs
- While not essential for HA reasons, it is recommended for performance
- All our examples use this configuration
- Channel state is hardened
- Sender channels will be automatically restarted (if triggered)
- Use the virtual IP address (or name) in the CONNAME
- Requesters will not be auto-restarted
- But a 'server' machine is less likely to have requester channels
- May need to use custom scripts or services to restart things like
- Trigger monitors
- Command server
- Applications
- It's exactly the same as a normal queue manager restart
- So non-persistent messages will disappear
- Long-running transactions may cause long-running restart
- if cluster is used, place FR's on HACMP machines
TMM04 {BCN}
Perl
You will need to ensure the shebang line points to it.
So you may need to set it to
#!/usr/bin/perl
or #!/bin/perl
or even #!/usr/perl - whatever the local standard is.
How do you find the "local standard" ?
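One portable way to discover it (the fallback path below is an assumption, not a guarantee):

```shell
# 'command -v' is POSIX and works in ksh, bash and sh; fall back to the
# common default location if perl is not on the PATH (fallback is an assumption)
PERL=$(command -v perl || echo /usr/bin/perl)
echo "#!${PERL}"   # the shebang line to use
```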
HACMP "stop" script
#!/bin/ksh
# DESCRIPTION:
# /MQHA/bin/ha_mqm_stop_su <qmname>
#
# Stops the QM.
# Check to see if the QM is already stopped.
# If so, just make sure no processes are lying around.
QM=$1    # queue manager name, passed as the first argument
online=`/MQHA/bin/hamqm_running ${QM}`
if [ "${online}" != "1" ]
then
# QM is reported as offline; ensure no processes remain
# Note that this whole script should be executed under su, which is why there's no su in the following loop.
# The regular expression in the next line contains a tab character. Edit only with tab-friendly editors.
srchstr="( |-m)$QM[ ]*.*$"
for process in runmqlsr amqpcsea amqhasmx amqharmx amqzllp0 \
amqzlaa0 runmqchi amqrrmfa amqzxma0
do
ps -ef | grep $process | grep -v grep | \
egrep "$srchstr" | awk '{print $2}'| \
xargs kill -9
done
exit 0
fi
It can be done (newer stop_su) providing the names in a file :
see here.
HACMP "link" script
The core is :
# Args:
# $1: Qmgr name
# $2: Mangled qmgr directory name -- may or may not be the same as qmgr
# $3: Shared Prefix -- e.g. /MQHA//data
if [ -r $3/qmgrs/$2/qm.ini ]
then
# We're running on the master node that owns the queue manager
# so we will create symlinks back to /var/mqm/ipc subdirs
for topdir in @ipcc @qmpersist @app
do
for subdir in esem isem msem shmem spipe
do
rm -fr $ipcorig/$subdir
rm -fr $ipcorig/$topdir/$subdir
ln -fs $ipcbase/$subdir $ipcorig/$subdir
ln -fs $ipcbase/$topdir/$subdir $ipcorig/$topdir/$subdir
done
done
rm -rf $ipcorig/qmgrlocl
ln -fs $ipcbase/qmgrlocl $ipcorig/qmgrlocl
else
# We're running on a standby node, so all we have to do is to
# update the config file that tells us where the queue manager lives
cat >> /var/mqm/mqs.ini <<EOF
QueueManager:
Name=$1
Prefix=$3
Directory=$2
EOF
fi
HACMP "MQ monit" script
"simple" one
Just does a "ping qmgrname" :
dy0608:/MQHA/bin # more hamqm_applmon.QMPROD01
#!/bin/ksh
su mqm -c /MQHA/bin/hamqm_applmon_su QMPROD01
dy0608:/MQHA/bin # more hamqm_applmon_su
#!/bin/ksh
QM=$1
# Test the operation of the QM.
echo "ping qmgr" | runmqsc ${QM} > /dev/null 2>&1
pingresult=$?
# pingresult will be 0 on success; non-zero on error (man runmqsc)
if [ $pingresult -eq 0 ]
then # ping succeeded
echo "hamqm_applmon: Queue Manager ${QM} is responsive"
result=0
else # ping failed
result=$pingresult
fi
exit $result
New alternative (use -m to select the queue manager ; -n suppresses message translation, so "RUNNING" matches reliably) :
dspmq -n -m <qmname> | grep "RUNNING"
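A monitor script built on this check might look like the sketch below. dspmq only exists on MQ hosts, so its output is simulated here with a variable (the exact output format is an assumption):

```shell
QM=QMPROD01                                   # example queue manager name
# What `dspmq -n -m $QM` is expected to print for a running qmgr (simulated)
dspmq_out="QMNAME(${QM}) STATUS(RUNNING)"
if echo "$dspmq_out" | grep -q "RUNNING"
then
    echo "applmon: queue manager ${QM} is running"
    rc=0
else
    echo "applmon: queue manager ${QM} is NOT running"
    rc=1
fi
```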
"complex" one
Verifies that a few key processes are still running :
Check_qmgr: # Check for the main processes
for proc in amqzxma0 amqhasmx amqzllp0
do
# $PATTERN (set earlier in the script) restricts the match to this queue manager
if ps -u mqm -o pid,args | eval /usr/xpg4/bin/grep -E '$PATTERN' |\
grep -w $proc > /dev/null
then
rc=0
else
rc=1
fi
Gracias, Vicente !
HACMP "MB monit" script
STATE="stopped"
#
cnt=`ps -ef | grep db2sysc | grep -v grep | grep $DBINST | wc -l`
if [ $cnt -gt "0" ]
then
# Found one or more db2sysc process, so database instance assumed to be running normally
echo "hamqsi_monitor_broker_as: Broker database is running"
STATE="started"
else
# Did not find a db2sysc process, but check to see whether db2start is still running and only report error if there is not one.
cnt=`ps -ef | grep db2start | grep -v grep | grep $DBINST | wc -l`
if [ $cnt -gt "0" ]
then
echo "hamqsi_monitor_broker_as: Broker database is starting"
STATE="starting"
else
echo "hamqsi_monitor_broker_as: Broker database is not running correctly"
STATE="stopped"
fi
fi
# Decide whether to continue or to exit
case $STATE in
stopped)
echo "hamqsi_monitor_broker_as: Database instance ($DBINST) is not running correctly"
exit 1
;;
starting)
echo "hamqsi_monitor_broker_as: Database instance ($DBINST) is starting"
echo "hamqsi_monitor_broker_as: WARNING - Stabilisation Interval may be too short"
echo "hamqsi_monitor_broker_as: WARNING - No test of broker $BROKER will be conducted"
exit 0
;;
started)
echo "hamqsi_monitor_broker_as: Database instance ($DBINST) is running"
: # fall through and proceed by testing the broker ('continue' is only valid inside a loop)
;;
esac
# ------------------------------------------------------------------
# Check the MQSI Broker is running
#
# Re-initialise STATE for safety
STATE="stopped"
#
# The broker runs as a process called bipservice which is responsible for starting and re-starting the admin agent process (bipbroker).
# The bipbroker is responsible for starting any DataFlowEngines.
# If no execution groups have been assigned to the broker there will be no DataFlowEngine processes.
# There should always be a bipservice and bipbroker process pair.
# This monitor script only tests for bipservice, because bipservice should restart bipbroker if necessary
# - the monitor script should not attempt to restart bipbroker and it may be premature to report an absence of a bipbroker as a failure.
cnt=`ps -ef | grep "bipservice $BROKER" | grep -v grep | wc -l`
if [ $cnt -eq 0 ]
then
echo "hamqsi_monitor_broker_as: MQSI Broker $BROKER is not running"
STATE="stopped"
else
echo "hamqsi_monitor_broker_as: MQSI Broker $BROKER is running"
STATE="started"
fi
# Decide how to exit
case $STATE in
stopped)
echo "hamqsi_monitor_broker_as: Broker ($BROKER) is not running correctly"
exit 1
;;
started)
echo "hamqsi_monitor_broker_as: Broker ($BROKER) is running"
exit 0
;;
esac
HA logs
An easy way to monitor cluster events and messages is by tailing the following HACMP log files:
- /tmp/hacmp.out - Outputs detailed logs on cluster events and output from
application start and stop scripts. Note that output from the monitoring is not
part of this log file.
- /usr/es/adm/cluster.log - Contains timestamped, formatted messages
generated by HACMP start and stop scripts and daemons.
This file contains only high level events, and not detailed log information.
The above log files are local to each node.
Application monitoring in HACMP has its own set of log files.
HA sanity tests
Manual system start up tests :
- Node 1 and node 2 are both active:
- disable the cluster on both nodes by using smitty clsstop
- stopping the cluster unmounts the shared drives. Mount the shared drives.
- start QM1 on node 1
- start QM2 on node 2
- observe the results to verify that no errors are reported during MQ operation.
- Node 2 is the only active node:
- start QM1 on node 2
- start QM2 on node 2
- observe the results to verify that no errors are reported during MQ operation.
- Node 1 is the only active node:
- start QM2 on node 1
- start QM1 on node 1
- observe the results to verify that no errors are reported during MQ operation.
Verify HA configuration
The tests cases for verifying the HA configuration are:
- Shared files - the following files should be located in shared directories:
- MQ logs
- Queue manager data for every queue manager should reside in a shared location.
/MQHA/<qmgr>/data and /MQHA/<qmgr>/log
Separate disks for data files and logs -
while not essential for HA reasons, it is recommended for performance
Test HA control
The objective of this test suite is to verify that HACMP is able
to start, restart, and monitor all the individual applications that are part of the cluster.
- Automatic system startup/restart under HACMP control
- Restart attempts setting
Verify the number of retry attempts.
Each WebSphere Business Integration application will be restarted three times
before a resource group failover is initiated.
The number of retry attempts can be configured in HACMP.
- Failover
- Fallback
Fallback refers to the movement of a resource group from a secondary or a failover node to the primary node,
which is being reintegrated into the cluster.
In the current WebSphere Business Integration cluster, automatic fallback is disabled.
However, manual reintegration should be validated:
- Bring node 1 down:
shutdown -r now
- Node 1 resource group fails over to node 2.
- Verify failover by looking at the HACMP log file:
tail -f /tmp/hacmp.out
- Start up cluster on node 1 after node 1 is back up:
smitty clstart
- Observe that the resource groups are still running on node 2
even though the cluster on node 1 is back up.
- Repeat the above test for node 2 fallback.
Migration / maintenance
Assuming a two-node active/active cluster, the steps are
- Select one machine to upgrade first
- At a suitable time, when the moving of a queue manager will not cause a serious disruption to service,
manually force a migration of the active queue manager to its partner node
- On the machine that is now running both queue managers, disable the failover capabilities for the queue managers.
- Upgrade the software on the machine that is not running any queue managers
- Re-enable failover, and move both queue managers across to the newly upgraded machine
- Disable failover again
- Upgrade the original box
- Re-enable failover
- When it will cause least disruption, move one of the queue managers across to balance the workload
HACMP tips
- Fix packs : move both environments to a single machine and apply maintenance to the "empty" machine.
Fail back all components and apply maintenance to the other machine.
Caution : there is a point of no return !
- design applications to handle duplicates
if store & forward plus persistence is used,
when we introduce disaster recovery,
a chance of duplicate messages is created.
if an end-to-end transaction number is used (and in the E2E ack as well),
the duplicate problem is solved and persistence is no longer required.
- design applications to be able to run in multiple instances
- a PR (partial repository) is a subscriber of the FR's (full repositories)
and (if there are 2) it receives duplicates
- how to monitor HACMP is working ok ?
Include checking all of the things that the qmgr must have to satisfy the business applications, like:
the status of channels, listeners, availability of application queues (put/get enabled, trigger attributes),
existence of processes, namelists, trigger monitor, applications, ...
- maybe you don't trust any "process monitoring"
and prefer to put a message into a queue to validate that the qmgr is working !
- script to go to backup node :
- endmqm (in background)
- wait some time
- kill remaining mq resources in specific order (see SysAdmin manual)
- the CM (Configuration Manager) does not need to be under HACMP :
not many changes to the production environment.
The CM can run on a VMware image !
FileSystem Requirements
What are the filesystem requirements for HACMP with MQ ?
And when using multi-instance ?
NFS v4 !
Multi-instance MQ & MB
multi-instance queue managers -
good
intro.
In multi-instance terminology,
there is an Active qmgr and a Standby qmgr;
both are running, and the standby waits to acquire the file locks held by the active instance.
Read
nfsv4 specs,
RFC 3530 : lease period
Increase messaging availability :
url
Creating a
multi-instance qmgr on Linux
Both machines have different IP addresses !
So,
"do NOT use multi-instance queue managers as full repositories",
page 19,
but
here
it says
"if you still need better availability, consider hosting the full repository queue managers as multi-instance queue managers" !
CONNAME has been expanded to support more than one "ipaddress(port)" combination, across all channel types that use it :
define channel(CH_NAME) chltype(SDR) trptype(TCP) xmitq(XQN) conname('<ip>(<port>)') replace
sample : DEFINE CHANNEL(CHANNEL1) CHLTYPE(CLNTCONN) TRPTYPE(TCP) CONNAME('server1(2345),server2(2345)') QMNAME(QM1) REPLACE
Mind the 48 character limit for "CONNAME" !
developerWorks :
complete sample,
part 2 ;
creating a multi-instance queue manager for MQ on
Linux.
- create shared directories :
url
- create multi-instance MQ :
url
- create multi-instance MB :
url
When you intend to use a queue manager as a multi-instance queue manager,
create a single queue manager on one of the servers using the WebSphere MQ crtmqm command,
placing its queue manager data and logs in shared network storage.
On the other server, rather than create the queue manager again, use the WebSphere MQ addmqinf command
to create a reference to the queue manager data and logs on the network storage.
You can now run the queue manager from either of the servers.
Each of the servers references the same queue manager data and logs;
there is only one queue manager, and it is active on only one server at a time.
Once the standby instance has started, you can swap the active instance to the other server by stopping the active instance with the switchover option, which transfers control to the standby.
The active instance of QM1 has exclusive access to the shared queue manager data and logs folders when it is running.
The standby instance of QM1 detects when the active instance has failed, and becomes the active instance.
It takes over the QM1 data and logs in the state they were left by the active instance, and accepts reconnections from clients and channels.
The active instance might fail for various reasons that result in the standby taking over:
- Failure of the server hosting the active queue manager instance.
- Failure of connectivity between the server hosting the active queue manager instance and the file system.
- Unresponsiveness of queue manager processes, detected by WebSphere MQ, which then shuts down the queue manager.
You can add the queue manager configuration information to multiple servers, and choose any two servers to run as the active/standby pair.
A multi-instance queue manager is one part of a high availability solution.
You need some additional components to build a useful high availability solution.
- client and channel reconnection to transfer WebSphere MQ connections to the computer that takes over running the active queue manager instance.
- a high performance shared network file system that manages locks correctly and provides protection against media and file server failure.
- resilient networks and power supplies to eliminate single points of failure in the basic infrastructure.
- applications that tolerate failover. In particular you need to pay close attention to the behavior of transactional applications, and to applications that browse WebSphere MQ queues.
- monitoring and management of the active and standby instances to ensure that they are running, and to restart active instances that have failed. Although multi-instance queue managers restart automatically, you need to be sure that your standby instances are running, ready to take over, and that failed instances are brought back online as new standby instances.
WebSphere MQ Clients and channels reconnect automatically to the standby queue manager when it becomes active.
Reconnection, and the other components in a high availability solution are discussed in related topics.
Automatic client reconnect is not supported by WebSphere MQ classes for Java.
MQ,
MB.
NFS specs and samples
Filesystem
requisites :
The storage must be accessed via a network file system protocol that is POSIX-compliant and supports lease-based locking.
Network File System version 4 (NFS v4) satisfies this requirement.
Also "NAS" or "GPFS".
Probeid
ZX155001 :
If you are using NFS v4 as the shared file system,
you must use hard mounts and synchronous writes, and disable write caching, to fulfil these requirements.
Verification
tool : amqmfsck
Examples of suitable highly available networked storage:
- IBM GPFS
- Veritas Cluster File System
- Highly available NFSv4
2-instance creation summary
Summary:
- Set up shared filesystems for QM data and logs
- Create the queue manager on machine1
crtmqm -md /shared/qmdata -ld /shared/qmlog QM1
- Define the queue manager on machine2 (or edit mqs.ini)
addmqinf -s QueueManager -v Name=QM1 -v Directory=QM1 -v Prefix=/var/mqm -v DataPath=/shared/qmdata/QM1
- Start an instance on machine1 - it becomes Active
strmqm -x QM1
- Start another instance on machine2 - it becomes Standby
strmqm -x QM1
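Once both instances are up, you can inspect and swap them; a sketch using the same QM1 as above:

```shell
# Show which instance of QM1 is active and which is standby.
dspmq -x -m QM1

# Planned switchover: end the active instance and transfer control to
# the standby (run on the machine hosting the active instance).
endmqm -s QM1
```

dspmq -x lists the instances per server, which is a quick way to confirm that the standby really is connected and waiting.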
Filesystem verification tool
Mind "File System Check tool" : amqmfsck,
( applies only to UNIX and IBM i systems ).
Details
here
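Typical use of amqmfsck against the shared directory looks like this sketch (the path is a placeholder):

```shell
# Basic test of POSIX file-locking behaviour on the shared directory.
amqmfsck /shared/qmdata

# Run these on BOTH machines at the same time:
amqmfsck -c /shared/qmdata   # test writing to the directory concurrently
amqmfsck -w /shared/qmdata   # test waiting for and releasing locks
```

If any of these checks fail, the file system is not suitable for multi-instance queue managers.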
Verifying the multi-instance queue manager on Linux
Use the sample programs amqsghac, amqsphac and amqsmhac to verify a multi-instance queue manager configuration.
url
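For example, run the high-availability samples against the queue manager, then trigger a switchover and watch the clients reconnect (queue Q1 is assumed to exist):

```shell
# Window 1: put a message every couple of seconds (HA put sample).
amqsphac Q1 QM1

# Window 2: get the messages back (HA get sample).
amqsghac Q1 QM1

# Window 3: on the machine with the active instance, force a switchover;
# both samples should report reconnection to the standby instance.
endmqm -s QM1
```

Seeing the samples resume after the switchover confirms that automatic client reconnection is working end to end.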
Conversion (unix)
Implementation considerations for Multi-Instance queue managers in MQ cluster environment :
How to convert queue manager to be multi-instance
Windows domains and multi-instance queue managers
The only way to ensure each of the servers running queue manager instances
use the same local mqm group with the same SID as the owner of the queue manager data and log directories on the file server
is to make the local mqm group a domain local group.
In order to use domain local groups, you must run multi-instance queue managers on a domain controller.
On a domain controller all local groups are implicitly domain local groups.
url
Failover mechanism
How does the standby queue manager take over ?
Actions that cause a failover.
Failover of a multi-instance queue manager can be triggered by hardware or software failures,
including networking problems which prevent the queue manager writing to its data or log files.
To be confident that a shared file system will provide integrity
and work with a multi-instance queue manager when such a problem occurs unexpectedly,
test all possible failure scenarios.
A list of actions that would cause a failover includes:
- Shutting down the operating system including syncing the disks
- Halting the operating system without syncing the disks
- Physically pressing the server's reset button
- Physically pulling the network cable out of the server
- Physically pulling the power cable out of the server
- Physically switching the machine off
Two IPs
A remote client can access the multi-instance qmgr via:
DEFINE CHANNEL(CHANNEL1) CHLTYPE(SVRCONN) TRPTYPE(TCP) MCAUSER('mqm') REPLACE
DEFINE CHANNEL(CHANNEL1) CHLTYPE(CLNTCONN) TRPTYPE(TCP) CONNAME('ipaddr1(1414),ipaddr2(1414)') QMNAME(QM1) REPLACE
START CHANNEL(CHANNEL1)
Multi-instance MB
MB starts/stops as an MQ service ...
Configuring a multi-instance Message Broker for
High Availability support :
A multi-instance broker is created using the mqsicreatebroker command,
with an additional -e option that specifies the location in shared network storage of the broker registry and other configuration data.
Additional instances of the broker can then be created on other machines in the network using a new command called mqsiaddbrokerinstance,
using the -e option to target the same location in shared network storage.
Broker logging, error handling and shared Java Classes remain local to the machine that hosts the broker or broker instance.
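A sketch of the broker commands described above (broker, queue manager, and path names are placeholders):

```shell
# Machine 1: create the multi-instance broker, placing its registry and
# configuration data on shared network storage (-e).
mqsicreatebroker BRKA -q QMA -e /shared/brkdata

# Machine 2: add a second instance pointing at the same shared storage.
mqsiaddbrokerinstance BRKA -e /shared/brkdata
```

As with the queue manager, there is only one broker; whichever instance obtains the locks on the shared storage becomes active.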
Configuring a WebSphere Message Broker to run in
multi-instance mode
Active/active multi-instance MB
You can create an ACTIVE/ACTIVE scenario using multi-instance brokers/queue managers.
The setup would look like this:
- machine 1 - QMA & BRKA (Active), QMB & BRKB (Passive)
- machine 2 - QMA & BRKA (Passive), QMB & BRKB (Active)
QMA & QMB working in a cluster to provide load balancing.
HA manager
Using a broker with an existing
high availability manager,
using a broker with an existing
Windows cluster
Multi-instance or HA cluster?
Multi-instance queue manager: advantages and drawbacks
- Integrated into the IIB and MQ products
- Faster failover than HA cluster*
- Delay before queue manager restart is much shorter*
- Runtime performance of networked storage must be considered
- IP address of standby instance is different to primary
- No automatic fail-back to primary hardware when restored
- More susceptible to MQ and OS defects
HA cluster: advantages and drawbacks
- Capable of handling a wider range of failures
- Failover historically rather slow, but some HA clusters are improving
- Some customers frustrated by unnecessary failovers
- Require MC91 SupportPac or equivalent configuration
- Extra product purchase and skills required
Storage distinction
- Multi-instance queue manager typically uses NAS
- HA clustered queue manager typically uses SAN