Get The Most Out Of The T:LAN/RIO Alarm Reporting Capabilities (Part 2)
This is the 2nd part of a series which attempts to provide a more comprehensive look at the SNMP alarm reporting capabilities of the Optima T:LAN and RIO products. You can find Part 1 here. It provides some background information and introduces measures to counter alarm floods. Part 2 continues to discuss additional measures to prevent these notorious notification storms:
USE TRAP THROTTLING.
The T:LAN/RIO supports trap throttling. Throttling enforces a strict upper limit on the number of SNMPv2 InformRequests that the T:LAN/RIO can generate during each 60s interval. Events that had to be throttled in the current minute will attempt re-transmission during an upcoming interval, based on the programmed number of repeats and the repeat interval.
This method guarantees that each T:LAN/RIO will never generate more than the pre-set number of alarms during each minute.
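The per-minute cap can be pictured with a minimal sketch. It assumes a simple queue model in which anything above the limit is held back for a later interval. The function name `throttle_minute` and its parameters are illustrative only and do not describe the actual T:LAN/RIO firmware.

```python
def throttle_minute(pending, rate_limit):
    """Split the events pending in one 60 s interval into sent and deferred.

    pending    -- events waiting to be reported, oldest first
    rate_limit -- hypothetical throttling rate (max notifications per minute)
    """
    if rate_limit < 0:
        raise ValueError("rate_limit must not be negative")
    sent = pending[:rate_limit]      # these go out in the current interval
    deferred = pending[rate_limit:]  # these count as throttled and wait
    return sent, deferred


sent, deferred = throttle_minute(["e1", "e2", "e3", "e4", "e5"], rate_limit=3)
print(sent)      # ['e1', 'e2', 'e3']
print(deferred)  # ['e4', 'e5']
```

Whatever the input, the first list never holds more than `rate_limit` entries, which is the guarantee the throttling rate provides.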
How do you select the right throttling rate?
Log into the T:LAN, then go to the RIO Trap Statistics screen. It shows the statistics for each NMS and the totals.
Calculate the average of the last 10 minutes of event reporting activity when a moderate amount of alarm events is being reported. This figure should give you a good indication as to what value to select for the trap throttling rate.
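As a rough illustration of that calculation, the sketch below averages ten per-minute event counts. The sample values are made up, and the helper name `suggested_rate` is hypothetical; the real figures have to be read from the RIO Trap Statistics screen.

```python
def suggested_rate(per_minute_counts):
    """Return the average number of events per minute over the window."""
    if not per_minute_counts:
        raise ValueError("need at least one sample")
    return sum(per_minute_counts) / len(per_minute_counts)


# Hypothetical counts noted once per minute during moderate alarm activity
samples = [12, 9, 15, 11, 10, 14, 8, 13, 12, 16]
print(suggested_rate(samples))  # 12.0
```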
How do I know if my throttling rate needs improvement?
Log into the T:LAN, then go to the RIO Trap Statistics screen. Look for the number of throttled events.
If that number is 0 or very low, then the T:LAN/RIO is rarely (if ever) forced to apply trap throttling. This might indicate that the throttling rate you picked is too generous.
On the other hand, if the number is high, it might indicate that you were too aggressive with your throttling.
In this case, you run the risk that some events are never reported to your NMS destinations. If they are throttled over and over again, they might age out of the log before the T:LAN/RIO has had a chance to report them.
How can I estimate the worst case scenario for my NMS installation(s)?
Since each T:LAN/RIO will be limited by the selected throttling rate, take this number and multiply it by the number of deployed T:LAN/RIO units in your network. This will give you the highest number of events that all your T:LAN/RIO units combined might send to your NMS every 60 seconds.
Make sure your NMS will be able to work under these (admittedly theoretical) worst case load conditions.
The THROTTLING RATE determines the maximum number of event notifications per T:LAN/RIO per minute. Theoretically, the worst case would be that these events are all generated in the first second of the 60s interval, followed by 59s of enforced pause.
But as each T:LAN/RIO operates its own 60s (non-synchronized) trap throttling timer, the receiving NMS installation(s) will see the overall event notification load distributed quite evenly during each 1 minute interval.
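The worst-case estimate is a single multiplication. The sketch below also derives a per-second average, which only applies if the load really is spread evenly across the minute as described above. The function name and the example figures (a rate of 20 and 150 units) are assumptions for illustration.

```python
def worst_case_load(throttling_rate, unit_count):
    """Return (events per minute, average events per second) for the NMS."""
    per_minute = throttling_rate * unit_count
    return per_minute, per_minute / 60


# Hypothetical network: 150 units, each throttled to 20 notifications/minute
per_minute, per_second = worst_case_load(throttling_rate=20, unit_count=150)
print(per_minute)  # 3000
print(per_second)  # 50.0
```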
What needs to be done on the NMS side?
It is vital to ensure that the destination NMS responds properly to each alarm notification. The NMS needs to answer each SNMPv2 INFORM REQUEST. This tells the T:LAN/RIO that it no longer needs to repeat an alarm notification and that the NMS acknowledges receipt.
Without having this mechanism in place, the T:LAN/RIO will attempt to re-transmit the same SNMP notification based on the user selected interval period and repeat count. For details on how to program these, see Chapter 3, RIO Trap Severity Menu in the RIO User Guide.
Why does the NMS need to send an Acknowledgement?
With SNMPv2, the InformRequest sent by the alarm sender (T:LAN/RIO) is ACKed by the recipient (NMS), which sends back a Response that echoes the contents of the processed SNMPv2 InformRequest. This way the alarm originator is assured that the recipient has indeed received and processed the event.
This eliminates the need to repeat the same event notification just to ensure proper reception, which saves bandwidth and CPU power and reduces the likelihood of event storms.
What happens if the T:LAN/RIO does not receive an Acknowledgment?
If the T:LAN/RIO does not receive an ACK within the specified repeat interval, then the T:LAN/RIO will re-transmit the SNMPv2 InformRequest (up to the maximum number of times specified for the appropriate severity level).
The T:LAN/RIO stops attempting to deliver the event notification once the maximum number of repeat attempts has been reached without an ACK.
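The repeat behaviour can be sketched as a small simulation. It assumes that the counter is incremented on every transmission and that delivery stops when the counter reaches the maximum. The name `deliver` and the `ack_on_attempt` parameter are hypothetical, and the waiting time between attempts is left out.

```python
def deliver(max_repeats, ack_on_attempt=None):
    """Simulate delivery of one event to one NMS.

    max_repeats    -- maximum number of transmissions for this severity
    ack_on_attempt -- attempt (1-based) on which the simulated NMS ACKs,
                      or None if it never answers
    Returns (transmissions made, delivered?).
    """
    current_count = 0
    while current_count < max_repeats:
        current_count += 1                   # one InformRequest goes out
        if ack_on_attempt == current_count:  # ACK arrived within the interval
            return current_count, True
        # no ACK: a real unit would wait one repeat interval here
    return current_count, False              # gave up after max_repeats


print(deliver(max_repeats=3, ack_on_attempt=2))  # (2, True)
print(deliver(max_repeats=3))                    # (3, False)
```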
How can you tell if the T:LAN/RIO is receiving Acknowledgements?
Check the RIO EVENT LOG using the [T]imer View. If the CC (Current Count) and Mx (Maximum number of repeats) column values are always equal, and the Dst (Destination) column is NOT 0 (zero), then:
- the NMS is NOT handling the SNMPv2 InformRequests properly, or
- the ACKs are NOT getting back to the T:LAN/RIO.
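The same check can be written down as a filter over log rows. This is only a sketch: it assumes the Timer View values have been copied by hand into `TimerRow` records, and both the record layout and the `event_id` field are hypothetical.

```python
from typing import NamedTuple


class TimerRow(NamedTuple):
    event_id: int  # hypothetical identifier for the log entry
    cc: int        # Current Count
    mx: int        # Maximum number of repeats
    dst: int       # Destination value


def unacknowledged(rows):
    """Return rows that used up all repeats and still have a non-zero Dst."""
    return [row for row in rows if row.cc == row.mx and row.dst != 0]


rows = [
    TimerRow(101, cc=3, mx=3, dst=5),  # all repeats used, nothing ACKed
    TimerRow(102, cc=1, mx=3, dst=0),  # ACKed on the first attempt
    TimerRow(103, cc=3, mx=3, dst=0),  # ACKed on the last attempt
    TimerRow(104, cc=2, mx=3, dst=4),  # still in progress
]
print([row.event_id for row in unacknowledged(rows)])  # [101]
```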
What does the Dst column tell you?
The Dst (Destination) column always starts out as the sum of the binary values of all the SNMP NMS destinations defined for a particular event.
For example, assume an event needs to be sent to NMS #1 (which has the binary value of 1) and NMS #3 (which has the binary value of 4).
Therefore, initially the Dst column will show:
1 + 4 = 5.
Let’s assume NMS #3 replies with a proper ACK. The corresponding bit position (with the binary value of 4) will then be cleared:
5 – 4 = 1.
From then on, the Dst column will show a value of 1.
Once NMS #1 answers as well, the Dst column will go to:
1 – 1 = 0.
As soon as the Dst column shows 0, the T:LAN/RIO will stop transmitting the corresponding event.
Dst = 0 means: All intended recipients have now acknowledged that they received the alarm notification!
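The walk-through above is plain bit arithmetic and can be reproduced with a short sketch. The values 1 and 4 come from the example; the helper names `initial_dst` and `acknowledge` are illustrative.

```python
NMS1 = 0b0001  # binary value 1, as in the example
NMS3 = 0b0100  # binary value 4, as in the example


def initial_dst(*destination_bits):
    """Combine the bit values of all destinations defined for an event."""
    dst = 0
    for bit in destination_bits:
        dst |= bit
    return dst


def acknowledge(dst, destination_bit):
    """Clear the bit of a destination that has sent its ACK."""
    return dst & ~destination_bit


dst = initial_dst(NMS1, NMS3)
print(dst)  # 5
dst = acknowledge(dst, NMS3)
print(dst)  # 1
dst = acknowledge(dst, NMS1)
print(dst)  # 0
```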
Remember, this was just a simple example to demonstrate the behaviour of the Dst column. There are a total of four NMS destinations that can be addressed by the T:LAN/RIO. To make handling all the different combinations easier, we included a handy table to decode the Dst column value:
What if the Dst column still shows a value other than 0?
Then one or more NMS systems did not ACK the SNMPv2 InformRequest before the maximum number of repeat attempts was exhausted.
How can you tell which NMS has not Acknowledged an event?
Look for the value remaining in the Dst column. Then use the table above to find out which NMS systems failed to provide the ACK(s).
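Decoding a leftover Dst value can also be sketched in a few lines. The text only states the values for NMS #1 (1) and NMS #3 (4); the sketch assumes that NMS #2 and NMS #4 follow the same bit pattern with the values 2 and 8. The function name `pending_destinations` is hypothetical.

```python
def pending_destinations(dst, destination_count=4):
    """Return the NMS numbers whose bits are still set in a Dst value.

    Assumes NMS #n has the binary value 2**(n - 1), which matches the
    values given for NMS #1 and NMS #3 and is an assumption for the rest.
    """
    if not 0 <= dst < (1 << destination_count):
        raise ValueError("Dst value out of range")
    return [n for n in range(1, destination_count + 1) if dst & (1 << (n - 1))]


print(pending_destinations(5))  # [1, 3]
print(pending_destinations(1))  # [1]
print(pending_destinations(0))  # []
```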
How many event entries does the T:LAN/RIO keep in its RIO Event Log?
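How many concurrent SNMP events can the T:LAN/RIO actively support?
The T:LAN/RIO can handle up to 250 concurrent event notifications before it has to bump older entries from the RIO Event Log.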
What happens if the log is full?
The T:LAN/RIO will discard the oldest entry in the log to make room for the newly recorded event.
What happens if the oldest entry was still active?
If the oldest entry in the log needs to be bumped to make room for a new event, then any associated retry counter/interval timer will be scrubbed as well.
The event will no longer show in the log nor will any further re-transmission attempts be made.
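The bumping behaviour can be modelled as a fixed-size FIFO. The sketch below uses a tiny capacity of 3 to keep the example short. The class `EventLog`, its fields, and the idea of storing a remaining repeat count per entry are assumptions made for illustration.

```python
from collections import deque


class EventLog:
    """Fixed-size FIFO log that drops the oldest entry when full."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = deque()
        self.retry_state = {}  # event -> repeats still outstanding

    def record(self, event, repeats_left):
        """Add an event; return the entry that was bumped, if any."""
        bumped = None
        if len(self.entries) == self.capacity:
            bumped = self.entries.popleft()     # oldest entry leaves the log
            self.retry_state.pop(bumped, None)  # its retry state goes too
        self.entries.append(event)
        self.retry_state[event] = repeats_left
        return bumped


log = EventLog(capacity=3)
for name in ["A", "B", "C"]:
    log.record(name, repeats_left=2)
print(log.record("D", repeats_left=2))  # A
print(list(log.entries))                # ['B', 'C', 'D']
print("A" in log.retry_state)           # False
```

Event "A" still had repeats outstanding when it was bumped, so in this model it is never re-transmitted.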
How likely is the possibility that an active entry will get bumped?
This depends on several factors:
- number of concurrently active alarms per T:LAN/RIO
- speed at which new events will be detected
- how fast the NMS systems acknowledge the event notifications
- network health
- available bandwidth
- trap throttling settings (see above)
As can be seen, many factors can contribute to the event buffer filling up. As stated above, the T:LAN/RIO can handle up to 250 concurrent event notifications without having to resort to bumping older entries from the RIO Event Log (FIFO).
What can be done to reduce the possibility of dropping active entries?
Balance the number of repeats, the rate at which an event is repeated, and the reporting mode. The longer it takes to finish processing an event (if no ACKs are received), the more likely it is that such an active event will be bumped out of the log when there is a flurry of alarms to report.
Also, picking ONLY MOST RECENT EVENT REPORTING mode (as already mentioned in Part 1 of this series) can dramatically reduce the number of log entries required to report active events.
Each installation is unique, and so are the average and peak numbers of alarm events that need to be reported.