HACMP
Cluster Configurations :
A standby configuration is the most basic cluster configuration
in which one node performs work whilst the other node acts only as standby.
The standby node does not perform work and is referred to as idle;
this configuration is sometimes called "cold standby".
A takeover configuration is a more advanced configuration
in which all nodes perform some kind of work
and critical work can be taken over in the event of a node failure.
A "one sided takeover" configuration is one in which
a standby node performs some additional, non critical and non movable work.
This is rather like a standby configuration
but with (non critical) work being performed by the standby node.
A "mutual takeover" configuration is one in which
all nodes are performing highly available (movable) work.
This type of cluster configuration is also sometimes referred to as "Active/Active"
to indicate that all nodes are actively processing critical workload.
HACMP, VCS, ServiceGuard, Heartbeat and MSCS all use a "shared nothing" clustering architecture.
A shared nothing cluster has no concurrently shared resources,
and works by transferring ownership of resources from one node to another,
to work around failures or in response to operator commands.
Resources are things like disks, network addresses, or critical processes.
Configuration :
All HA products have the concept of a unit of failover.
This is a set of definitions that contains all the processes and resources
needed to deliver a highly available service
and ideally should contain only those processes and resources.
In HACMP, the unit of failover is called a resource group.
On other HA products the name might be different, but the concept is the same.
On VCS, it is known as a service group,
on MC/ServiceGuard it is a package,
in Heartbeat it is a resource group, and in MSCS it is a group.
The smallest unit of failover for WMQ is a queue manager,
since you cannot move part of a queue manager without moving the whole thing.
It follows that the optimal configuration is to place each queue manager in a separate resource group,
with the resources upon which it depends.
The resource group should therefore contain
the shared disks used by a queue manager,
which should be in a volume group or disk group reserved exclusively for the resource group,
the IP address used to connect to the queue manager (the service address)
and an object which represents the queue manager.
Failover -
Invoking a secondary system to take over when the primary system fails.
When not to use HA WebSphere MQ queue manager clusters
HA WebSphere MQ queue manager clusters require
additional proprietary HA hardware (shared disks)
and external HA clustering software (such as HACMP).
This increases the administration costs of the environment
because you also need to administer the HA components.
This approach also increases the initial implementation costs
because extra hardware and software are required.
Therefore, balance these initial costs
with the potential costs incurred if a queue manager fails and messages become trapped.
If trapped messages are not a problem for the applications
(for example, the response time of the application is irrelevant or the data is updated frequently),
then HA WebSphere MQ queue manager clusters are probably not required.
General recommendations
Some of the advice pertinent to an HA environment in general is:
- Each node in the cluster must be sized large enough to support the total load under failure conditions.
Dynamic process allocation under AIX
(shutting down less important processes on the production failover machine)
can help lower hardware costs.
- Optimal resource utilization in an HA environment occurs when
each node in an Active/Active cluster drives similar loads.
- Use HA clustering for high availability only, not for performance or simplified administration.
- The HA solution in the QA environment should exactly mimic the production environment.
This helps to avoid critical production problems and minimizes down time.
- Test individual HA components for online, offline, and failover.
- Attempt to restart the individual HA components
(for example, InterChange Server or Message Broker)
a minimum of three times on the primary node before failing over to the alternate node.
- Automatic fallback is not recommended.
Manual fallback is recommended to minimize system disruption.
- It is recommended that a stock of spare components such as network adapters, cables, etc. be maintained.
Related to this is also a need to define an escalation plan to deal with unexpected failures.
- Be sure to create a list identifying who to call when something HA specific goes wrong,
or the normal contacts are not available.
Highly Available WebSphere Business Integration Solutions, SG24-6328-00, chapter 8.2, page 122.
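The advice to restart a component a few times locally before failing over can be sketched as a small retry wrapper. This is only an illustration of the idea, not how HACMP implements its restart count; the function name restart_with_retries and its arguments are hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: run a start command up to MAX times.
# Returns 0 as soon as the command succeeds, and 1 if every attempt fails
# (the point at which cluster software would fail over instead).
restart_with_retries() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]
    do
        if "$@"
        then
            echo "started on attempt $attempt"
            return 0
        fi
        echo "attempt $attempt failed" >&2
        attempt=$((attempt + 1))
    done
    return 1
}
```

For example, restart_with_retries 3 some_start_command would try the (placeholder) start command three times, and a non-zero return would be the signal to move the resource group to the alternate node.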
MC91 - HA for MQ
This SupportPac has now been withdrawn.
The support is now included in the WebSphere MQ V7.0.1 product and documentation.
url MC91 - high availability for MQ on Unix.
Install into /MQHA/bin : samples use it.
This SupportPac provides notes and sample scripts
to assist with the installation and configuration
of WebSphere MQ (WMQ) V6 and V7 in High Availability (HA) environments.
Three different platforms and environments are described here,
but they share a common design
and this design can also be extended for many other systems.
Specifically this SupportPac deals with the following HA products:
- HACMP (High Availability Cluster Multi Processing)
- Veritas Cluster Server (VCS)
- MC/ServiceGuard (MCSG)
MC91 installation :
16/03/2009 20:40 1.310.445 mc91.tar.Z
{mqm - /MQHA/bin/ } $ uncompress mc91.tar.Z
{mqm - /MQHA/bin/ } $ tar -xvf mc91.tar
MC91 configuration :
- [1] Configure the HA Cluster
- [2] Configure the shared disks
- [3] Create the Queue Manager
- [4] Configure the movable resources
- [5] Configure the Application Server or Agent
- [6] Configure a monitor
[1] Configure the HA Cluster
- Configure TCP/IP on the cluster nodes for HACMP. Remember to configure ~root/.rhosts, /etc/rc.net, etc.
- Configure the cluster, cluster nodes and adapters to HACMP as usual.
- Synchronise the Cluster Topology.
[2] Configure the shared disks
This step creates the volume group (or disk group) and filesystems needed for the queue manager.
So that this queue manager can be moved from one node to another
without disrupting any other queue managers,
you should designate a group containing shared disks
which is used exclusively by this queue manager and no others.
For performance, it is recommended that a queue manager uses separate filesystems for logs and data.
The suggested layout therefore creates two filesystems within the volume group.
You can optionally protect each of the filesystems from disk failures by using mirroring or RAID.
Mount points must all be owned by the mqm user.
You will need the following filesystems:
- Per node:
- /var on a local non-shared disk - this is a standard filesystem or directory which will already exist.
- Per queue manager:
- /MQHA/<qmgr>/data on shared disks - this is where the queue manager data directory will reside.
- /MQHA/<qmgr>/log on shared disks - this is where the queue manager recovery logs will reside.
The steps are :
- Create the volume group that will be used for this queue manager's data and log files.
- Create the /MQHA/<qmgr>/data and /MQHA/<qmgr>/log filesystems using the volume group created above.
- For each node in turn, import the volume group, vary it on, ensure that the filesystems can be mounted, unmount the filesystems and varyoff the volume group.
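The per-node check in the last step can be sketched with the standard AIX LVM commands. To stay safe, the function below only prints the commands instead of running them. The volume group name (qm1vg) and disk name (hdisk2) are made-up examples, and the exact options your AIX level needs should be checked against its documentation.

```sh
#!/bin/sh
# Print (do not run) the per-node check of a queue manager's shared volume group.
# $1: queue manager name, $2: volume group name, $3: one physical disk of the group.
# All three values are supplied by the caller; the ones used below are examples.
print_node_check() {
    qmgr=$1
    vg=$2
    disk=$3
    echo "importvg -y $vg $disk"
    echo "varyonvg $vg"
    for fs in data log
    do
        echo "mount /MQHA/$qmgr/$fs"
    done
    for fs in data log
    do
        echo "umount /MQHA/$qmgr/$fs"
    done
    echo "varyoffvg $vg"
}

print_node_check QM1 qm1vg hdisk2
```

Reviewing the printed list before running it by hand on each node makes it harder to forget the final varyoffvg, which would leave the volume group held by the wrong node.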
[3] Create the Queue Manager
- Select a node on which to perform the following actions
- Ensure the queue manager's filesystems are mounted on the selected node.
- Create the queue manager on this node, using the hacrtmqm script
- Start the queue manager manually, using the strmqm command
- Create any queues and channels
- Test the queue manager
- End the queue manager manually, using endmqm
- On the other nodes, which may takeover the queue manager, run the halinkmqm script
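As a rough sketch, the sequence above might look like the following list of commands, printed here and not executed. The halinkmqm arguments (name, mangled directory name, shared prefix) follow the argument list documented in the link script; the shared prefix is assumed to follow the /MQHA/<qmgr>/data layout, definitions.mqsc is a hypothetical file name, and hacrtmqm may need further arguments or environment settings that are described in the SupportPac documentation and not shown here.

```sh
#!/bin/sh
# Print (do not run) the creation sequence for one queue manager.
# $1: queue manager name
# $2: mangled directory name (may or may not be the same as the name)
print_qmgr_setup() {
    qmgr=$1
    dir=$2
    prefix="/MQHA/$qmgr/data"   # assumed layout, see the filesystem list above
    echo "# on the node that has the filesystems mounted:"
    echo "hacrtmqm $qmgr"
    echo "strmqm $qmgr"
    echo "runmqsc $qmgr < definitions.mqsc"
    echo "endmqm $qmgr"
    echo "# on each other node that may take over:"
    echo "halinkmqm $qmgr $dir $prefix"
}

print_qmgr_setup QM1 QM1
```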
[4] Configure the movable resources
The resource group will use the IP address as the service label.
This is the address which clients and channels will use to connect to the queue manager.
- Create a resource group and select the type as discussed above.
- Configure the resource group in the usual way adding the service IP label, volume group and filesystem resources to the resource group.
- Synchronise the cluster resources.
- Start HACMP on each cluster node in turn and ensure that the cluster stabilizes, that the respective volume groups are varied on by each node and that the filesystems are mounted correctly.
[5] Configure the Application Server or Agent
The queue manager is represented within the resource group by an application server or agent.
- Define an application server which will start and stop the queue manager.
The start and stop scripts contained in the SupportPac may be used unmodified,
or may be used as a basis from which you can develop customized scripts.
The examples are called hamqm_start and hamqm_stop.
- Add the application server to the resource group definition created in the previous step.
- Optionally, create a user exit in /MQHA/bin/rc.local
- Synchronise the cluster configuration.
- Test that the node can start and stop the queue manager,
by bringing the resource group online and offline.
[6] Configure a monitor
You can configure an application monitor which will monitor the health of the queue manager
and trigger recovery actions as a result of MQ failures, not just node or network failures.
Recovery actions include the ability to perform local restarts of the queue manager
or to cause a failover of the resource group to another node.
To benefit from queue manager monitoring you must define an Application Monitor.
If you created the queue manager using hacrtmqm,
then one of these will have been created for you,
in the /MQHA/bin directory,
and is called hamqm_applmon.$qmgr.
- To enable queue manager monitoring,
define a custom application monitor for the Application Server created in the previous step,
provide the name of the monitor script, and tell HACMP how frequently to invoke it.
Set the stabilisation interval to 10 seconds,
unless your queue manager is expected to take a long time to restart.
This would normally be if your environment has long-running transactions
that might cause a substantial amount of recovery/replay to be required.
- To configure for local restarts, specify the Restart Count and Restart Interval.
- Synchronise the cluster resources.
- Test the operation of the application monitoring,
and in particular verify that the local restart capability is working as configured.
A convenient way to provoke queue manager failures is
to identify the Execution Controller process (called amqzxma0) associated with the queue manager, and kill it.
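A minimal sketch of locating that process follows. It assumes the "ps -ef" column layout (PID in the second column) and that the queue manager name appears on the command line as "-m NAME" or "-mNAME"; the function name find_execution_controller is hypothetical.

```sh
#!/bin/sh
# Print the PID of the execution controller (amqzxma0) of one queue manager.
# Reads "ps -ef" style lines on stdin; $1 is the queue manager name.
find_execution_controller() {
    awk -v qm="$1" '
        /amqzxma0/ {
            for (i = 1; i <= NF; i++)
                if ($i == ("-m" qm) || ($i == "-m" && $(i + 1) == qm)) {
                    print $2
                    exit
                }
        }'
}
```

On a test cluster, "ps -ef | find_execution_controller QM1" prints the PID to kill; matching the name as a whole field avoids hitting QM10 when QM1 is meant.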
Conclusion : the files we have to copy into /hacmp/, and adapt for our system, are :
- hamqm_start
- hamqm_stop
- hamqm_applmon.$qmgr
Then, using smitty, we have to
- define an application server
- define a custom application monitor
IC91 - HA for MB
url IC91 - high availability for MB on distributed platforms.
Install into /MQHA/bin : samples use it.
A broker runs as a pair of processes, called bipservice and bipbroker.
The latter in turn creates the execution groups that run message flows.
It is this collection of processes which are managed by HA Software.
When creating the queue manager,
don't configure the application server or application monitor described in SupportPac MC91.
You will create an application server that covers the broker, queue manager and broker database instance.
When creating channels between queue managers,
the sender channel should use the service address of the broker resource group
and the broker queue manager's port number.
- General Configuration
- [0] Configure the cluster
- Configure the shared disks
- Configuration steps for Broker - UNIX
- [1] Create and configure the queue manager. Create the resource group.
- [2] Create and configure the broker database
- [3] Create the message broker
- [4] Place the broker under cluster control
[0] Configure the HA Cluster
- Configure TCP/IP on the cluster nodes as described in your cluster software documentation.
- Configure the cluster, cluster nodes and adapters to HA Software as usual.
- Synchronise the Cluster Topology.
- Now would be a good time to create and configure the user accounts
that will be used to run the database instances, brokers and UNS.
Home directories, (numeric) user ids, passwords, profiles and group memberships
should be the same on all cluster nodes.
[1] Create and configure the queue manager
- On one node, create a clustered queue manager as described in SupportPac MC91, using the hacrtmqm command.
Use the volume group that you created for the broker and place the volume group and queue manager
into a resource group to which the broker will be added.
Don't configure the application server or application monitor described in SupportPac MC91 -
you will create an application server that covers the broker, queue manager and broker database instance.
- Set up queues and channels between the broker queue manager and the Configuration Manager queue manager:
- On the Configuration Manager queue manager create a transmission queue for communication to the broker queue manager. Ensure that the queue is given the same name and case as the broker queue manager. The transmission queue should be set to trigger the sender channel.
- On the Configuration Manager queue manager create a sender and receiver channel for communication with the broker queue manager. The sender channel should use the service address of the broker resource group and the broker queue manager's port number.
- On the broker queue manager create a transmission queue for communication to the Configuration Manager queue manager. Ensure that the queue is given the same name and case as the Configuration Manager queue manager. The transmission queue should be set to trigger the sender channel.
- On the broker queue manager create sender and receiver channels to match those just created on the Configuration Manager queue manager. The sender channel should use the IP address of the machine where the Configuration Manager queue manager runs, and the corresponding listener port number.
- If you are using a UNS, set up queues and channels between the broker queue manager and the UNS queue manager:
- On the broker queue manager create a transmission queue for communication to the UNS queue manager. Ensure that the queue is given the same name and case as the UNS queue manager. The transmission queue should be set to trigger the sender channel.
- On the broker queue manager create a sender and receiver channel for communication with the UNS queue manager. If the UNS is clustered, the sender channel should use the service address of the UNS resource group and the UNS queue manager's port number.
- On the UNS queue manager create a transmission queue for communication to the broker queue manager. Ensure that the queue is given the same name and case as the broker queue manager. The transmission queue should be set to trigger the sender channel.
- On the UNS queue manager create a sender and receiver channel for communication with the broker queue manager, with the same names as the receiver and sender channel just created on the broker queue manager. The sender channel should use the service address of the broker resource group and the broker queue manager's port number.
- Test that the above queue managers can communicate regardless of which node owns the resource groups they belong to.
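The queue and channel pairs described above all follow one pattern, so one side of a pair can be sketched as a small generator of MQSC definitions. The queue manager names, the address and port, and the NAME.TO.NAME channel naming are illustrative assumptions; remember that channel names are limited to 20 characters.

```sh
#!/bin/sh
# Print the MQSC for one side of a queue manager pair:
# a triggered transmission queue named after the remote queue manager,
# a sender channel pointing at the remote address, and a receiver channel.
# $1: local qmgr, $2: remote qmgr, $3: remote (service) address, $4: remote port.
print_channel_defs() {
    local_qm=$1
    remote_qm=$2
    remote_addr=$3
    remote_port=$4
    cat <<EOF
DEFINE QLOCAL('${remote_qm}') USAGE(XMITQ) +
       TRIGGER TRIGTYPE(FIRST) +
       INITQ('SYSTEM.CHANNEL.INITQ') +
       TRIGDATA('${local_qm}.TO.${remote_qm}')
DEFINE CHANNEL('${local_qm}.TO.${remote_qm}') CHLTYPE(SDR) TRPTYPE(TCP) +
       CONNAME('${remote_addr}(${remote_port})') XMITQ('${remote_qm}')
DEFINE CHANNEL('${remote_qm}.TO.${local_qm}') CHLTYPE(RCVR) TRPTYPE(TCP)
EOF
}
```

Something like "print_channel_defs CMQM BRKQM brk-service 1414 | runmqsc CMQM" would then define the Configuration Manager side, with brk-service standing for the service address of the broker resource group. Using the service address, not a node address, is what keeps the channel working after a failover.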
[2] Create and configure the broker database
There are two options regarding where the broker database is run, either inside or outside the cluster.
If you choose to run the database outside the cluster
then simply follow the instructions in the WMB documentation for creating the broker database
but ensure that you consider whether the database is a single point of failure
and make appropriate provision for the availability of the database.
[3] Create the message broker
- Create the broker on the node hosting the logical host using the hamqsicreatebroker command.
- Ensure that you can start and stop the broker manually using the mqsistart and mqsistop commands.
- On any other nodes in the resource group's nodelist
(i.e. excluding the one on which you just created the broker),
run the hamqsiaddbrokerstandby command to create the information needed by these nodes to enable them to host the broker.
[4] Place the broker under cluster control
- Create an application server which will run the broker, its queue manager and the database instance,
using the example scripts provided in this SupportPac.
The example scripts are called hamqsi_start_broker_as and hamqsi_stop_broker_as.
- You can also specify an application monitor using the hamqsi_applmon.<broker> script created by hamqsicreatebroker.
An application monitor script cannot be passed parameters, so just specify the name of the monitor script.
Also configure the other application monitor parameters,
including the monitoring interval and the restart parameters you require.
- Synchronise the cluster resources.
- Ensure that the broker, queue manager and database instance are stopped, and start the application server.
- Check that the components started and test that the resource group can be moved from one node to the other
and that they run correctly on each node.
- Ensure that stopping the application server stops the components.
- With the application server started, verify that the HACMP local restart capability is working as configured.
A convenient way to cause failures is to identify the bipservice for the broker and kill it.
MQ and HA
HisCock HA MQ whitepaper
HA in Clustering
A key problem in using MQ clustering for high availability
is the problem of stuck messages (in Xmit queue).
Message expiry.
If the message isn't delivered to its target
by the time the end-user would have timed out,
get it to self destruct.
(fjb_saper) Easy answer:
MQ clustering => load balancing.
Hardware clustering => HA (high availability).
If you've got an HA set up there are 2 main options:
- Use HA software so "the queue manager"
is presented on a single IP/port
no matter where it happens to be running
so the value in the TAB file is always true.
- Use the TAB file to define multiple instances of "THEQM"
which identify QMA, QMB, etc
Complete (MQ&MB) Schema
The first schema looks like this :
.-------------------.     .-------------------.
|                   |     |                   |
|  AIX-1 (active)   |     |  AIX-2 (passive)  |
|                   |     |                   |
|   .---------.     |     |   .---------.     |
|   |         |     |     |   |         |     |
|   | MB1(a)  |     |     |   | MB1(p)  |     |
|   |         |     |     |   |         |     |
|   .---------.     |     |   .---------.     |
|                   |     |                   |
|   .---------.     |     |   .---------.     |
|   |         |     |     |   |         |     |
|   | QM1(a)  |     |     |   | QM1(p)  |     |
|   |         |     |     |   |         |     |
|   .---------.     |     |   .---------.     |
|                   |     |                   |
.-------------------.     .-------------------.
We have an active machine, AIX-1, running QM1 and MB1,
and a passive machine, AIX-2, which is idle almost all the time.
To make better use of this second machine,
we can create a second queue manager and a second broker on AIX-2,
and place their backup replicas on AIX-1 :
.--------------------------------.     .--------------------------------.
|                                |     |                                |
|         AIX-1 (active)         |     |         AIX-2 (active)         |
|                                |     |                                |
|          .---------.           |     |          .---------.           |-
|          |         |           |     |          |         |           | \
|          | MB1(a)  |           |     |          | MB1(p)  |           |  |
|          |         |           |     |          |         |           |  |
|          .---------.           |     |          .---------.           |  |
|                                |     |                                |  => Service address 1
|          .---------.           |     |          .---------.           |  |
|          |         |           |     |          |         |           |  |
|          | QM1(a)  |           |     |          | QM1(p)  |           |  |
|          |         |           |     |          |         |           | /
|          .---------.           |     |          .---------.           |-
|                                |     |                                |
|          .---------.           |     |          .---------.           |-
|          |         |           |     |          |         |           | \
|          | MB2(p)  |           |     |          | MB2(a)  |           |  |
|          |         |           |     |          |         |           |  |
|          .---------.           |     |          .---------.           |  |
|                                |     |                                |  => Service address 2
|          .---------.           |     |          .---------.           |  |
|          |         |           |     |          |         |           |  |
|          | QM2(p)  |           |     |          | QM2(a)  |           |  |
|          |         |           |     |          |         |           | /
|          .---------.           |     |          .---------.           |-
|                                |     |                                |
.--------------------------------.     .--------------------------------.
Finally, we join QM1 and QM2 in an MQ cluster,
so that while one machine is moving to its backup image,
incoming messages are still processed.
An n+1 architecture is also possible : "n" machines running,
plus 1 more acting as the backup for all "n" of them - on the assumption that they fail one at a time !
MC91 (HACMP for MQ) is installed first, then IC91 (HACMP for MB).
Complete list :
MB_HACMP (ext, ***)
Install checklist :
MB_HACMP
Stopping queue managers in WebSphere MQ for UNIX systems
To stop a queue manager running under MQ for UNIX systems :
- Find the process IDs of the queue manager programs
that are still running using the ps command.
For example, if the queue manager is called QMNAME, use the following command:
ps -ef | grep QMNAME
- End any queue manager processes that are still running.
Use the kill command,
specifying the process IDs discovered using the ps command.
- End the processes in the following order :
amqzmuc0 Critical process manager
amqzxma0 execution controller
amqzfuma OAM process
amqzlaa0 LQM agents
amqzlsa0 LQM agents
amqzmgr0 process controller
amqzmur0 restartable process manager
amqrmppa process pooling process
amqrrmfa the repository process (for clusters)
amqzdmaa deferred message processor
amqpcsea the command server
Note: processes that fail to stop can be ended using kill -9.
If you stop the queue manager manually,
FFSTs might be taken, and FDC files placed in /var/mqm/errors.
Do not regard this as a defect in the queue manager.
The queue manager should restart normally,
even after you have stopped it using this method.
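The ordering rule can be sketched as a filter that takes "ps -ef" output and prints the PIDs of one queue manager's processes in the order listed above. It assumes the PID is in the second column and that the queue manager name appears as its own argument or as "-mNAME"; the function name pids_in_kill_order is hypothetical.

```sh
#!/bin/sh
# Process names in the documented stop order.
ORDER="amqzmuc0 amqzxma0 amqzfuma amqzlaa0 amqzlsa0 amqzmgr0 amqzmur0 amqrmppa amqrrmfa amqzdmaa amqpcsea"

# Print the PIDs of one queue manager's processes in that order.
# Reads "ps -ef" style lines on stdin; $1 is the queue manager name.
pids_in_kill_order() {
    qmgr=$1
    listing=$(cat)
    for proc in $ORDER
    do
        printf '%s\n' "$listing" |
            awk -v p="$proc" -v qm="$qmgr" '
                index($0, p) {
                    for (i = 1; i <= NF; i++)
                        if ($i == qm || $i == ("-m" qm)) { print $2; break }
                }'
    done
}
```

Running "ps -ef | pids_in_kill_order QMNAME" and reviewing the list before passing it to kill keeps the manual procedure from ending processes of a neighbouring queue manager.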
URL (sys admin guide)
/MQHA/bin/hamqproc contains the list of processes to be killed by hamqm_stop_su :
for process in `cat /MQHA/bin/hamqproc`
do
ps -ef | grep $process | grep -v grep | \
egrep "$srchstr" | awk '{print $2}'| \
xargs kill -9
done
Critical MQ processes [Doug]
amqzlaa0 - LQM agent - a qmgr will have at least one of these if running
amqzmur0 - journal utility manager
amqrmppa - channel process pooler
amqhasmn - the logger
{bestp}
WMQ in HA Clusters - best practices
- Separate disks for data files and logs
- While not essential for HA reasons, it is recommended for performance
- All our examples use this configuration
- Channel state is hardened
- Sender channels will be automatically restarted (if triggered)
- Use the virtual IP address (or name) in the CONNAME
- Requesters will not be auto-restarted
- But a 'server' machine is less likely to have requester channels
- May need to use custom scripts or services to restart things like
- Trigger monitors
- Command server
- Applications
- It's exactly the same as a normal queue manager restart
- So non-persistent messages will disappear
- Long-running transactions may cause long-running restart
- If an MQ cluster is used, place the full repositories (FRs) on the HACMP machines
TMM04 {BCN}
Perl
You will need to ensure the shebang line points to it.
So you may need to set it to
#!/usr/bin/perl
or #!/bin/perl
or even #!/usr/perl - whatever is local standard.
How to know the "local standard" ?
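One way to answer that question is to ask the shell where it finds the interpreter. The sketch below uses the POSIX "command -v" lookup; the function name shebang_for is hypothetical.

```sh
#!/bin/sh
# Print a shebang line for an interpreter, using the path found on this system.
# Returns non-zero if the interpreter is not on the PATH.
shebang_for() {
    path=$(command -v "$1") || return 1
    echo "#!$path"
}

shebang_for perl || echo "perl not found on PATH" >&2
```

The result reflects the PATH of the user running it, so it is worth running it as the user that will execute the script (for example mqm).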
HACMP "stop" script
#!/bin/ksh
# DESCRIPTION:
# /MQHA/bin/ha_mqm_stop_su <qmname>
#
# Stops the QM.
# Check to see if the QM is already stopped.
# If so, just make sure no processes are lying around.
online=`/MQHA/bin/hamqm_running ${QM}`
# Quote the result so the test still works if hamqm_running prints nothing
if [ "${online}" != "1" ]
then
# QM is reported as offline; ensure no processes remain
# Note that this whole script should be executed under su, which is why there's no su in the following loop.
# The regular expression in the next line contains a tab character. Edit only with tab-friendly editors.
srchstr="( |-m)$QM[ ]*.*$"
for process in runmqlsr amqpcsea amqhasmx amqharmx amqzllp0 \
amqzlaa0 runmqchi amqrrmfa amqzxma0
do
ps -ef | grep $process | grep -v grep | \
egrep "$srchstr" | awk '{print $2}'| \
xargs kill -9
done
exit 0
fi
In the newer stop_su, the process names can instead be provided in a file
(/MQHA/bin/hamqproc, described above).
HACMP "link" script
The core is :
# Args:
# $1: Qmgr name
# $2: Mangled qmgr directory name -- may or may not be the same as qmgr
# $3: Shared Prefix -- e.g. /MQHA//data
if [ -r $3/qmgrs/$2/qm.ini ]
then
# We're running on the master node that owns the queue manager
# so we will create symlinks back to /var/mqm/ipc subdirs
for topdir in @ipcc @qmpersist @app
do
for subdir in esem isem msem shmem spipe
do
rm -fr $ipcorig/$subdir
rm -fr $ipcorig/$topdir/$subdir
ln -fs $ipcbase/$subdir $ipcorig/$subdir
ln -fs $ipcbase/$topdir/$subdir $ipcorig/$topdir/$subdir
done
done
rm -rf $ipcorig/qmgrlocl
ln -fs $ipcbase/qmgrlocl $ipcorig/qmgrlocl
else
# We're running on a standby node, so all we have to do is to
# update the config file that tells us where the queue manager lives
cat >> /var/mqm/mqs.ini <<EOF
QueueManager:
Name=$1
Prefix=$3
Directory=$2
EOF
fi
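One thing to watch in the standby branch is that the stanza is appended unconditionally, so running the link step twice on the same node would leave two identical stanzas in mqs.ini. Whether the full script guards against this is not visible in the excerpt. A hedged sketch of an idempotent append, with a hypothetical function name and the ini file path passed in so that it can be tried on a scratch copy:

```sh
#!/bin/sh
# Append a QueueManager stanza to an mqs.ini style file, unless a stanza
# with the same Name is already present.
# $1: ini file, $2: qmgr name, $3: shared prefix, $4: mangled directory name.
add_qmgr_stanza() {
    ini=$1
    name=$2
    prefix=$3
    dir=$4
    # Look for an exact "Name=<name>" line, ignoring surrounding blanks.
    if awk -v n="$name" '
            { sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "") }
            $0 == "Name=" n { found = 1 }
            END { exit !found }' "$ini" 2>/dev/null
    then
        return 0    # already registered, nothing to do
    fi
    cat >> "$ini" <<EOF
QueueManager:
   Name=$name
   Prefix=$prefix
   Directory=$dir
EOF
}
```

The exact-match test matters: a plain substring search for Name=QM1 would also match Name=QM10 and skip a stanza that is really missing.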
HACMP "MQ monit" script
"simple" one
Just does a "ping qmgrname" :
dy0608:/MQHA/bin # more hamqm_applmon.QMPROD01
#!/bin/ksh
su mqm -c "/MQHA/bin/hamqm_applmon_su QMPROD01"
dy0608:/MQHA/bin # more hamqm_applmon_su
#!/bin/ksh
QM=$1
# Test the operation of the QM.
echo "ping qmgr" | runmqsc ${QM} > /dev/null 2>&1
pingresult=$?
# pingresult will be 0 on success; non-zero on error (man runmqsc)
if [ $pingresult -eq 0 ]
then # ping succeeded
echo "hamqm_applmon: Queue Manager ${QM} is responsive"
result=0
else # ping failed
result=$pingresult
fi
exit $result
"complex" one
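Verifies that a few key processes are still running :
# Check_qmgr: check for the main processes.
# PATTERN is assumed to be set earlier in the script (not shown)
# so that it selects this queue manager's processes.
rc=0
for proc in amqzxma0 amqhasmx amqzllp0
do
  if ps -u mqm -o pid,args | eval /usr/xpg4/bin/grep -E '$PATTERN' |\
     grep -w $proc > /dev/null
  then
    : # process found, keep checking the others
  else
    rc=1 # a main process is missing; stop and report failure
    break
  fi
done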
Thanks, Vicente !
HACMP "MB monit" script
STATE="stopped"
#
cnt=`ps -ef | grep db2sysc | grep -v grep | grep $DBINST | wc -l`
if [ $cnt -gt "0" ]
then
# Found one or more db2sysc process, so database instance assumed to be running normally
echo "hamqsi_monitor_broker_as: Broker database is running"
STATE="started"
else
# Did not find a db2sysc process, but check to see whether db2start is still running and only report error if there is not one.
cnt=`ps -ef | grep db2start | grep -v grep | grep $DBINST | wc -l`
if [ $cnt -gt "0" ]
then
echo "hamqsi_monitor_broker_as: Broker database is starting"
STATE="starting"
else
echo "hamqsi_monitor_broker_as: Broker database is not running correctly"
STATE="stopped"
fi
fi
# Decide whether to continue or to exit
case $STATE in
stopped)
echo "hamqsi_monitor_broker_as: Database instance ($DBINST) is not running correctly"
exit 1
;;
starting)
echo "hamqsi_monitor_broker_as: Database instance ($DBINST) is starting"
echo "hamqsi_monitor_broker_as: WARNING - Stabilisation Interval may be too short"
echo "hamqsi_monitor_broker_as: WARNING - No test of broker $BROKER will be conducted"
exit 0
;;
started)
echo "hamqsi_monitor_broker_as: Database instance ($DBINST) is running"
: # nothing more to do here; fall through to the broker test below
;;
esac
# ------------------------------------------------------------------
# Check the MQSI Broker is running
#
# Re-initialise STATE for safety
STATE="stopped"
#
# The broker runs as a process called bipservice which is responsible for starting and re-starting the admin agent process (bipbroker).
# The bipbroker is responsible for starting any DataFlowEngines.
# If no execution groups have been assigned to the broker there will be no DataFlowEngine processes.
# There should always be a bipservice and bipbroker process pair.
# This monitor script only tests for bipservice, because bipservice should restart bipbroker if necessary
# - the monitor script should not attempt to restart bipbroker and it may be premature to report an absence of a bipbroker as a failure.
cnt=`ps -ef | grep "bipservice $BROKER" | grep -v grep | wc -l`
if [ $cnt -eq 0 ]
then
echo "hamqsi_monitor_broker_as: MQSI Broker $BROKER is not running"
STATE="stopped"
else
echo "hamqsi_monitor_broker_as: MQSI Broker $BROKER is running"
STATE="started"
fi
# Decide how to exit
case $STATE in
stopped)
echo "hamqsi_monitor_broker_as: Broker ($BROKER) is not running correctly"
exit 1
;;
started)
echo "hamqsi_monitor_broker_as: Broker ($BROKER) is running"
exit 0
;;
esac
HA logs
An easy way to monitor cluster events and messages is by tailing the following HACMP log files:
- /tmp/hacmp.out - Outputs detailed logs on cluster events and output from
application start and stop scripts. Note that output from the monitoring is not
part of this log file.
- /usr/es/adm/cluster.log - Contains timestamped, formatted messages
generated by HACMP start and stop scripts and daemons.
This file contains only high level events, and not detailed log information.
The above log files are local to each node.
Application monitoring in HACMP has its own set of log files.
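When tailing /tmp/hacmp.out, it can help to filter the detailed trace down to the event banners. The sketch below assumes that events are marked with lines containing "EVENT START", "EVENT COMPLETED" or "EVENT FAILED"; check a real log from your HACMP level before relying on those strings. The function name cluster_events is hypothetical.

```sh
#!/bin/sh
# Show only the event banner lines of an HACMP log.
# $1: log file (defaults to /tmp/hacmp.out)
cluster_events() {
    grep -E 'EVENT (START|COMPLETED|FAILED)' "${1:-/tmp/hacmp.out}"
}
```

For live monitoring during a failover test, the same filter can be applied to a tail: tail -f /tmp/hacmp.out | grep -E 'EVENT (START|COMPLETED|FAILED)'.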
HA sanity tests
Manual system start up tests :
- Node 1 and node 2 are both active:
- disable the cluster on both nodes by using smitty clstop
- stopping the cluster unmounts the shared drives. Mount the shared drives.
- start QM1 on node 1
- start QM2 on node 2
- observe the results to verify that no errors are reported during MQ operation.
- Node 2 is the only active node:
- start QM1 on node 2
- start QM2 on node 2
- observe the results to verify that no errors are reported during MQ operation.
- Node 1 is the only active node:
- start QM2 on node 1
- start QM1 on node 1
- observe the results to verify that no errors are reported during MQ operation.
Verify HA configuration
The test cases for verifying the HA configuration are:
- Shared files - the following files should be located in shared directories:
- MQ logs
- Queue manager data for every queue manager should reside in a shared location.
/MQHA/<qmgr>/data and /MQHA/<qmgr>/log
Separate disks for data files and logs -
while not essential for HA reasons, it is recommended for performance
Test HA control
The objective of this test suite is to verify if HACMP is able
to start, restart, and monitor all the individual applications that are part of the cluster.
- Automatic system startup/restart under HACMP control
- Restart attempts setting
Verify the number of retry attempts.
Each WebSphere Business Integration application will be restarted three times
before a resource group failover is initiated.
The number of retry attempts can be configured in HACMP.
- Failover
- Fallback
Fallback refers to the movement of a resource group from a secondary or a failover node to the primary node,
which is being reintegrated into the cluster.
In the current WebSphere Business Integration cluster, automatic fallback is disabled.
However, manual reintegration should be validated:
- Bring node 1 down:
shutdown -r now
- Node 1 resource group fails over to node 2.
- Verify failover by looking at the HACMP log file:
tail -f /tmp/hacmp.out
- Start up cluster on node 1 after node 1 is back up:
smitty clstart
- Observe that the resource groups are still running on node 2
even though the cluster on node 1 is back up.
- Repeat the above test for node 2 fallback.
Migration / maintenance
Assuming a two-node active/active cluster, the steps are
- Select one machine to upgrade first
- At a suitable time, when the moving of a queue manager will not cause a serious disruption to service,
manually force a migration of the active queue manager to its partner node
- On the machine that is now running both queue managers, disable the failover capabilities for the queue managers.
- Upgrade the software on the machine that is not running any queue managers
- Re-enable failover, and move both queue managers across to the newly upgraded machine
- Disable failover again
- Upgrade the original box
- Re-enable failover
- When it will cause least disruption, move one of the queue managers across to balance the workload
HACMP tips
- Fix Packs : move both environments to a single machine and apply maintenance to "empty" machine.
Failback all components and apply maintenance to the other machine.
Caution : there is a point of "no way back" !
- design applications to handle duplicates :
if store & forward plus persistence is used,
introducing disaster recovery creates a chance of duplicate messages.
If an end-to-end transaction number is used (in the E2E_Ack as well),
the duplicate problem is solved and persistence is no longer required.
- design applications to be able to run in multiple instances
- a PR (partial repository) is a subscriber of the FR's (full repositories)
and (if there are 2) it receives duplicates
- how to check that HACMP is working correctly ?
Include checking all of the things that the qmgr must have to satisfy the business applications, like:
the status of channels, listeners, availability of application queues (put/get enabled, trigger attributes),
existence of processes, namelists, trigger monitor, applications, ...
- maybe you don't trust any "process monitoring"
and prefer to put a message into a queue to validate the qmgr is working !
- script to go to backup node :
- endmqm (in background)
- wait some time
- kill remaining mq resources in specific order (see SysAdmin manual)
- The Configuration Manager (CM) does not need to run under HACMP :
it makes few changes to the production environment.
The CM can run on a VMware image !
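The tip about putting a message onto a queue, as a stricter test than process monitoring, could be sketched with the amqsput and amqsget sample programs shipped with MQ. This assumes both samples are on the PATH and that a dedicated local queue (HA.PROBE is a made-up name) exists; note that amqsget waits several seconds for further messages before it ends, so the monitor interval has to allow for that.

```sh
#!/bin/sh
# Round-trip probe: put a uniquely tagged message and read it back.
# $1: queue manager name, $2: name of a queue reserved for the probe.
# Returns 0 only if the message that was put is seen again on the get.
probe_qmgr() {
    qmgr=$1
    queue=$2
    token="ha-probe-$$"
    echo "$token" | amqsput "$queue" "$qmgr" > /dev/null 2>&1 || return 1
    amqsget "$queue" "$qmgr" 2>/dev/null | grep "$token" > /dev/null
}
```

A monitor script could then end with "probe_qmgr QMPROD01 HA.PROBE; exit $?", so that a queue manager whose processes exist but which cannot move messages is still reported as failed.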
What are the requirements for a HACMP (MQ) filesystem ?