Tool Mentor: OMEGAMON - Resolve Incident and Recover Service
TM015 - How to Use OMEGAMON to Automatically Recover a Service
Tool: IBM Tivoli OMEGAMON XE and DE
Relationships
Main Description

Context

Tool mentors explain how a tool can perform tasks, which are part of ITUP processes and activities. The tasks are listed as Related Elements in the Relationships section.

Details

To show how IBM® Tivoli® OMEGAMON® supports this activity, this mentor walks through a real example involving WebSphere® MQ. Although WebSphere MQ is generally very reliable, services within the MQ environment sometimes fail. Such a failure can delay message processing, misdirect messages, or cause applications to fail.

The OMEGAMON monitoring agent for MQ can issue certain MQ commands to recover from some failures.

Assuming that the OMEGAMON base infrastructure is installed and working, all that is needed to implement MQ monitoring is to install an OMEGAMON MQ Agent and configure it to work with a specific Queue Manager. This is explained in the manual, Using IBM Tivoli OMEGAMON XE for WebSphere MQ Monitoring, which can be found at http://publibfp.boulder.ibm.com/epubs/pdf/c3168880.pdf

The OMEGAMON Agent for MQ comes with a set of predefined Situations that raise alerts when certain conditions occur. In this example, the focus is only on the one that identifies when a Channel has stopped: "MQSeries_MQ_Channel_Stopped"

As supplied, this Situation simply raises an Alert when a Channel stops. The steps in this example modify it to issue a command to restart the Channel. The steps needed to complete this Situation are as follows:

  • Run the Situation Editor. (Click it in the Edit menu pick list, or click the Situation Editor icon in the tool bar.)
  • Use the Situation Filter button to show Situations that are associated with the monitored application (MQ, in this case).
  • Click the MQSeries_MQ_Channel_Stopped Situation and its properties will display in the right panel.
  • Click the Action tab and type the following into the Command field:
    MQ: START CHANNEL("&Channel_Statistics.Channel_Name")
    Note that you can use the Attribute Substitution button to insert the correct parameter for the Channel name; you do not have to type &Channel_Statistics.Channel_Name manually.
  • If you want to see a visual layout of how the Situation will work, click the Show Formula button. You will see a display similar to this screen shot:

Figure 1: Show Formula View

  • Click the Distribution tab. Using this tab you can specify which Queue Managers this Situation will be allocated to. Highlight the Queue Manager name in the right panel and click the left arrow button to move its name across to the Assigned panel on the left.
  • Click the Conditions tab to return to the main display.
  • Check the Run at Startup box to enable the Situation to start automatically whenever the OMEGAMON MQ Agent is started.
  • Look at the values in the Sampling Interval fields at the lower left of the panel. Reset these to a realistic value that suits your environment. (Note that the sampling interval should not be less than the sampling rate of the Agent itself.)
  • Change the State in the pick list at the lower right to the severity level at which you want this Event to be reported: "Critical", "Warning", or "Informational".
  • Click OK to save and enable the Situation.
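
The attribute substitution used in the Command field above can be pictured with a minimal Python sketch. It is only an illustration of the idea, not OMEGAMON's own implementation: the function name expand_command and the channel name QM1.TO.QM2 are hypothetical, and only the command template comes from this example.

```python
import re

# Hypothetical attribute values at the moment the Situation becomes true.
# The attribute name is taken from the example; the channel name is invented.
attributes = {"Channel_Statistics.Channel_Name": "QM1.TO.QM2"}


def expand_command(template, attrs):
    """Replace each &Group.Attribute reference with its current value."""
    def lookup(match):
        name = match.group(1)
        # Leave unknown references untouched so mistakes stay visible.
        return attrs.get(name, match.group(0))

    return re.sub(r"&([A-Za-z_]\w*\.[A-Za-z_]\w*)", lookup, template)


template = 'MQ: START CHANNEL("&Channel_Statistics.Channel_Name")'
print(expand_command(template, attributes))
```

Under these assumptions, the command that is issued names the specific Channel that stopped, so one Situation can serve every Channel on the Queue Manager.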

To test this procedure, issue a manual command to stop the Channel; the Situation should restart it for you. Although this test demonstrates that the Situation works in a test environment, it also exposes a problem that would limit the usefulness of the Situation.

Consider a production environment where you need to deliberately shut down the Channel for some time to do network maintenance. If this Situation were active, it would restart the Channel every time you stopped it and so prevent the maintenance.

Accordingly, it is advisable to add an extra test to the logic of the Situation so that it also checks whether the "Event_Qualifier" attribute has the value "Channel_Stopped_OK". This test is true if an administrator stopped the Channel deliberately and false if the Channel stopped because of an error, so the restart command should be issued only when the test is false.
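
The combined logic can be sketched in a few lines of Python. This is a hedged illustration of the decision only, not the Situation Editor's formula syntax: the function should_restart is hypothetical, and the qualifier value "Channel_Stopped" used for the error case is an assumed placeholder; only "Channel_Stopped_OK" comes from the text.

```python
def should_restart(channel_stopped, event_qualifier):
    """Return True only when the Channel stopped without administrator intent."""
    # "Channel_Stopped_OK" marks a deliberate stop (value taken from the text).
    deliberate = event_qualifier == "Channel_Stopped_OK"
    return channel_stopped and not deliberate


# Deliberate stop for maintenance: leave the Channel alone.
print(should_restart(True, "Channel_Stopped_OK"))
# Stop caused by an error ("Channel_Stopped" is an assumed value): restart.
print(should_restart(True, "Channel_Stopped"))
```

With this extra condition, a planned maintenance stop no longer triggers the restart command, while an unplanned stop still does.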

In the example shown, the command was issued directly to the Queue Manager because the "MQ:" prefix was used in the "Command" field. If the "MQ:" prefix is missing, the command is issued to the operating system on which the OMEGAMON Agent is running. This method can be used to run scripts or applications that could also help in the recovery of failed processes.
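
The effect of the prefix can be modelled with a short Python sketch. It is an assumption-laden illustration of the routing rule described here, not the Agent's actual code: the function route_command, the target labels, and the script name restart_listener.sh are all hypothetical.

```python
from typing import Tuple


def route_command(command: str) -> Tuple[str, str]:
    """Decide where a Command field entry would be sent, based on its prefix."""
    prefix = "MQ:"
    text = command.strip()
    if text.startswith(prefix):
        # With the prefix, the remainder goes to the Queue Manager.
        return ("queue_manager", text[len(prefix):].strip())
    # Without the prefix, the whole entry goes to the operating system.
    return ("operating_system", text)


print(route_command('MQ: START CHANNEL("QM1.TO.QM2")'))
print(route_command("/opt/scripts/restart_listener.sh"))  # hypothetical script
```

The same Command field can therefore drive either an MQ command or a recovery script, depending only on whether the entry starts with "MQ:".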

You can optionally specify that an Event is to be raised when your Situation becomes true. When this happens, an icon reflecting the severity level is superimposed over the Queue Manager icon in the CNP navigator panel:

  • Critical: Red triangle containing a white '!'.
  • Warning: Yellow square containing a black '!'.
  • Informational: Pink circle containing a blue '!'.

In the example shown in the screen shot, a "Warning" alert is raised, after which the command to restart the Channel is immediately issued. In a production environment, you could choose to not raise an Alert at all.

Although only the Channel Stopped Situation is used in this example, other Situations can be handled in a similar way. If you want to create a custom Situation for your own purposes, existing Situations can be copied and renamed to suit your environment.

For More Information

For more information about this tool, click on the link for this tool at the top of this page.