
BSM (OMi) 9.2x Stream-based Event correlation Troubleshooting

Date post: 02-Jan-2016
Upload: kennan-riddle

© 2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

BSM (OMI) 9.2X STREAM-BASED EVENT CORRELATION TROUBLESHOOTING


© Copyright 2012 Hewlett-Packard Development Company, L.P.

AGENDA

Stream-based Event Correlation

SBEC – General Feature Overview

Details on Rules

Troubleshooting

Event Suppression


SBEC – GENERAL FEATURE OVERVIEW


WHAT IS STREAM-BASED EVENT CORRELATION?

Stream-based event correlation (SBEC) uses rules and filters to identify commonly occurring events or combinations of events. It helps simplify the handling of such events by automatically identifying events that can be withheld or removed, or that require a new event to be generated and displayed to the operators.

– The following types of SBEC rules can be configured:
  • Repetition Rules: Frequent repetitions of the same event may indicate a problem that requires attention.
  • Combination Rules: A combination of different events occurring together or in a particular order indicates an issue, and requires special treatment.
  • Missing Recurrence Rules: A regularly recurring event is missing, for example, a regular heartbeat event does not arrive when expected.

– SBEC rules are processed in the order defined in the rules list. Modifications are executed as soon as the rule is matched, and subsequent rules see modifications done by earlier rules.


COMBINATION RULES

– When a combination of events occurs, sometimes in a precise order, within a short period of time, this may be understood as a problem requiring corrective action, or as a scenario that may initially appear to be a problem but does not require any intervention by an operator. For example, a node-down event followed by a node-up event within 2 minutes usually means that a system reboot has occurred. This is typically viewed as not significant, as long as reboots do not occur too frequently, and does not require action other than the automatic cleaning up of these events.

– Configuring a combination rule requires at least two filters to select the events to consider, for example, one to select events with a node-down indicator and one to select events with a node-up indicator. Certain attributes must be the same for events to be regarded as originating from the same source, for example, the node CI and source CI must be the same. The related events must arrive within a short time interval, for example, a maximum of five minutes, to be considered part of the same scenario. You can also specify whether the events must occur in a particular order for the rule to be matched and executed.

– It may be considered advantageous to hold back matching events during the time interval to reduce the number of unnecessary events being sent to the Event Browser. Only when the required combination of events is received within the specified time period is it necessary to inform the operator. The rule could close or discard all events, or modify the last event to indicate that a reboot has taken place. Alternatively, a new event can be automatically generated. All matching events can be related to the new event as symptoms.
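As a rough illustration of the matching logic described above, the following Python sketch pairs a node-down event with a later node-up event from the same node inside a time window. The `Event` class, its fields, and `find_reboots` are hypothetical names chosen for this sketch; they are not part of OMi, where the rule is configured in the product rather than written as code.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """Hypothetical, simplified event; not the OMi event model."""
    node: str
    indicator: str   # "node-down" or "node-up"
    time: float      # seconds since an arbitrary reference point

def find_reboots(events, window=120.0):
    """Pair each node-down with the next node-up from the same node
    that arrives no later than `window` seconds afterwards."""
    pending_down = {}   # node -> node-down event still waiting for a node-up
    reboots = []
    for ev in sorted(events, key=lambda e: e.time):
        if ev.indicator == "node-down":
            pending_down[ev.node] = ev
        elif ev.indicator == "node-up" and ev.node in pending_down:
            down = pending_down.pop(ev.node)
            if ev.time - down.time <= window:
                reboots.append((down, ev))
    return reboots

events = [
    Event("web01", "node-down", 0.0),
    Event("db01", "node-down", 10.0),
    Event("web01", "node-up", 90.0),    # 90 s after down: inside the window
    Event("db01", "node-up", 400.0),    # 390 s after down: outside the window
]
reboots = find_reboots(events)
print([down.node for down, up in reboots])   # ['web01']
```

Only `web01` is reported as a reboot; the `db01` outage lasted longer than the window and would remain visible as a real problem.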


MISSING RECURRENCE RULES

– Events are sometimes regularly generated to inform that no problem has occurred, for example, "alive" events indicate that a system is running. As soon as the expected regular event is not received, it can be assumed that there is a problem. For example, if a system that reports an "alive" event every 10 minutes stops doing so, it has probably stopped running.

– Configuring a missing recurrence rule requires a filter to select the events to consider, for example, to select events with "node alive" in the title. Certain attributes must be the same for events to be regarded as originating from the same source, for example, the node CI and source CI must be the same. The allowable time interval before an expected event is considered to be missing must be specified, for example, a maximum of 10 minutes in our example.

– It may be considered advantageous to discard recurring events to reduce the number of unnecessary events being sent to the Event Browser.

– When the expected event is not received within the specified time period, it is necessary to inform the operator that action is required. A new event can be automatically generated. All matching events can be related to the new event as symptoms.
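The gap check behind a missing recurrence rule can be sketched in a few lines of Python. This is only an illustration of the idea; `find_missing` and its parameters are assumptions made for this sketch and say nothing about how OMi implements the check internally.

```python
def find_missing(arrival_times, interval=600.0, now=None):
    """Return the moments at which an expected recurring event became overdue.

    arrival_times: sorted timestamps (seconds) of "alive" events from one source
    interval:      maximum allowed gap between two consecutive events
    now:           optional current time, to check the gap after the last event
    """
    checkpoints = list(arrival_times)
    if now is not None:
        checkpoints.append(now)
    overdue = []
    for prev, nxt in zip(checkpoints, checkpoints[1:]):
        if nxt - prev > interval:
            overdue.append(prev + interval)   # first moment the event was late
    return overdue

alive = [0.0, 600.0, 1200.0, 2500.0]
print(find_missing(alive, interval=600.0, now=2800.0))   # [1800.0]
```

In the sample data the event expected at 1800 s never arrived, so that is the point at which a new "missing" event would be raised.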


REPETITION RULES

– The repeated generation of the same event may indicate a problem. For example, more than 10 login failures for the same account within 2 minutes is typically viewed as requiring action and should create a security alert.

– Configuring a repetition rule requires a filter to select the events to consider, for example, the text "login failed" is contained within the title. Certain attributes must be the same for events to be regarded as originating from the same source, for example, the host name of the system and the user name being used to log in must be the same. The time interval between login attempts must be short, for example, a maximum of two minutes, and there must be a minimum number of failed login attempts before the scenario is considered to be a problem.

– It may be considered advantageous to hold back matching events during the time interval to reduce the number of unnecessary events being sent to the Event Browser. Only when the number of failed login attempts reaches the specified threshold is it necessary to inform the operator that action is required. This could be to close or discard the failed login events, except for the last event, which is modified to report the series of failed logins. Alternatively, a new event can be automatically generated. All failed-login events can be related to the new event as symptoms.
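A repetition rule boils down to counting matching events per source inside a sliding time window. The sketch below shows one way to do that in Python. The class name, the `(host, user)` key, and the "at least `threshold` events" semantics are assumptions for illustration only, not OMi behavior.

```python
from collections import defaultdict, deque

class RepetitionRule:
    """Hypothetical sliding-window counter; not OMi code."""

    def __init__(self, threshold, window):
        self.threshold = threshold       # number of events that triggers the rule
        self.window = window             # time window in seconds
        self.seen = defaultdict(deque)   # (host, user) -> recent timestamps

    def on_event(self, host, user, time):
        """Record one matching event; return True if the rule triggers."""
        times = self.seen[(host, user)]
        times.append(time)
        # Drop timestamps that have fallen out of the window.
        while times and time - times[0] > self.window:
            times.popleft()
        if len(times) >= self.threshold:
            times.clear()                # start counting afresh after triggering
            return True
        return False

rule = RepetitionRule(threshold=3, window=120.0)
results = [rule.on_event("bsm92", "alice", t) for t in (0, 30, 200, 210, 220)]
print(results)   # [False, False, False, False, True]
```

The first two failures expire before the third arrives, so the rule only triggers once three failures fall inside the same 120-second window.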


Concept
REPETITION

– Purpose: Event repetition indicates a problem
– Example: More than 3 reboots within 1 hour shall create a critical event

[Figure: timeline t showing three "Node rebooted" events (1, 2, 3) within the time interval]


Concept
COMBINATION

– Purpose: Handle a combination of events a certain way
– Example: When a node is down, events about failed SiS monitors should be related to the node down event

[Figure: timeline t showing events 1 to 4 within the time interval, labeled "Node down", "SiS monitor failed", and "TCP timeout occurred"]


Concept
MISSING RECURRENCE

– Purpose: Detect that regularly-received events are no longer arriving

– Example: For auditing and compliance purposes, detailed health data and statistics are collected every day using events. If these audit events do not arrive, a critical event should be sent

[Figure: timeline t showing events 1 and 2, two expected events that do not arrive (marked "?"), and an event A]


How SBEC engine works
RULE PROCESSING

Only when receiving a new event:

For each Rule…

– In the order defined, all input filters are checked to see if they match the incoming event
– On every match of an input filter, a query is executed to check whether all conditions of the corresponding rule are matched
  • Repetition: enough events received within time frame
  • Combination: at least one event for every filter ("event set") received within time frame
– If all conditions are matched, the Actions configured in that rule are executed with immediate effect on all corresponding events
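The processing order above can be pictured as a simple loop. The Python sketch below is only a mental model: the rule structure (a dict with `filter`, `conditions`, and `action` callables) and the event fields are invented for this example. It shows why a later rule sees the modifications made by an earlier one.

```python
def process_event(event, rules):
    """Run every rule, in list order, against one incoming event.

    Each rule is a dict with hypothetical keys:
      "filter"(event)     -> bool, does the input filter match?
      "conditions"(event) -> bool, are all rule conditions met?
      "action"(event)     -> None, modifies the event in place
    """
    triggered = []
    for rule in rules:
        if not rule["filter"](event):
            continue
        if rule["conditions"](event):
            rule["action"](event)          # takes effect immediately
            triggered.append(rule["name"])
    return triggered

rules = [
    {   # Rule 1 closes matching events.
        "name": "close-reboots",
        "filter": lambda e: "rebooted" in e["title"],
        "conditions": lambda e: True,
        "action": lambda e: e.update(state="closed"),
    },
    {   # Rule 2 only looks at events that are still open.
        "name": "escalate-open",
        "filter": lambda e: e["state"] == "open",
        "conditions": lambda e: True,
        "action": lambda e: e.update(severity="critical"),
    },
]

event = {"title": "Node rebooted", "state": "open", "severity": "minor"}
print(process_event(event, rules), event["severity"])   # ['close-reboots'] minor
```

Because Rule 1 closes the event first, Rule 2's filter no longer matches. Reversing the rule order would let both rules trigger, which is why the order in the rules list matters.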


SBEC RULES – GOOD TO KNOW, BEST PRACTICES


MULTIPLE SBEC RULES

– Any number of Repetition, Combination, and Missing Recurrence Rules can be created
– Processed in defined order (visible to the user, configurable)
– Can be chained together
  • First rule that triggers can modify events (e.g. close, discard, create new)
  • Next rule in line will see event modifications
– Can filter for the same events (even use the same filter)


EFFECT OF HOLD BACK WHEN MULTIPLE RULES MATCH THE SAME EVENTS

– Note 1: If at least one rule is holding back an event, it's held back
  • Even if another rule is not holding it back
– Note 2: There is one holding area for all rules

Example

– Rule 1: combination rule: looking for node down/node up events – holding back node down as it wants to discard it and create reboot event instead
– Rule 2: combination rule: looking for node down & SiS events – not holding back the events
– Result: node down is held back as long as it is within the time window of Rule 1 (and as long as it is not released by any other rule)!
– Holding area: stored in the DB if the BSM server is stopped, but not persisted if opr-backend aborts abnormally


EFFECT OF RELEASE WHEN MULTIPLE RULES MATCH THE SAME EVENTS

– Note 3: When a rule triggers, all the corresponding input events are removed from the holding area
  • Even if another rule put them there
  • Why? The rule that triggered detected a certain situation where the input events are relevant and therefore it can be seen as the master of these events. It has the right to release or even discard them.
– Note 4: If no rule was holding back an event, release has no effect

Example

  • Rule 1: combination rule: looking for node down/node up events – holding back node down as it wants to discard it and create reboot event instead
  • Rule 2: combination rule: looking for node down & SiS events – not holding back the events
  • Rule 2 triggers after node down & one SiS monitor event was received. Releases events.
  • Result: node down is no longer held back and is correlated with the SiS event.
  • Note: Rule 1 might still trigger later and create the reboot event!


EFFECT OF DISCARD IF POSSIBLE WHEN MULTIPLE RULES MATCH THE SAME EVENTS

– Note 5: Discard if possible will only have an effect if the event is still in the holding area
  • If no rule was holding the event back, or if another rule already triggered and released the event, discard will have no effect (but the close operation is executed)
  • If discard is possible, the event will be deleted immediately. For other rules, it will look as if the event never arrived.

Example

  • Rule 1: combination rule: looking for node down/node up events – holding back node down as it wants to discard it and create reboot event instead
  • Rule 2: combination rule: looking for node down & SiS events – not holding back the events
  • Rule 2 triggers after node down & one SiS monitor event was received. Releases events.
  • Rule 1 triggers: wants to discard node down event, but this is not possible as it was already released by Rule 2
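The hold, release, and discard interplay described on the last three slides can be modeled with a small class. This is a simplified mental model in Python, assuming a single shared set of held event IDs; `HoldingArea` and its methods are hypothetical and do not correspond to OMi classes.

```python
class HoldingArea:
    """Hypothetical model of the single holding area shared by all rules."""

    def __init__(self):
        self.held = set()        # IDs of events currently held back
        self.discarded = set()   # IDs of events deleted while still held

    def hold(self, event_id):
        # One rule asking for hold-back is enough to hold the event.
        self.held.add(event_id)

    def release(self, event_id):
        # A triggering rule releases the event no matter which rule held it.
        # Releasing an event that is not held has no effect.
        self.held.discard(event_id)

    def discard_if_possible(self, event_id):
        # Discard only works while the event is still in the holding area.
        if event_id in self.held:
            self.held.remove(event_id)
            self.discarded.add(event_id)
            return True
        return False

area = HoldingArea()
area.hold("node-down-1")        # Rule 1 holds back the node-down event
area.release("node-down-1")     # Rule 2 triggers first and releases it
print(area.discard_if_possible("node-down-1"))   # False: Rule 1 is too late
```

The `False` result mirrors the example: once Rule 2 has released the node-down event, Rule 1 can no longer discard it.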


GOTCHAS & BEST PRACTICES

– Gotcha
  • It's quite easy to create a simple repetition rule like this:
    − repetition rule uses filter title contains "rebooted"
    − and creates new event with title: "system <node> rebooted 10 times in 2 hours"
    − Guess what happens...

– Best Practices
  • In a rule, don't create events that match the input filter of the rule
  • Include a check for event state in the filter: look for non-closed events only
  • Avoid filters that are too generic (like contains "rebooted")
  • Add a custom attribute (e.g. "SBECcreated=true") and check for it if you want to prevent a created event from being processed by following rules
  • If possible, avoid matching the same events. If unavoidable, make sure you understand the hold/release/discard behavior
  • When you reuse CI Hint in "Create New Event", also reuse Node Hint.
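To make the gotcha concrete, the Python sketch below compares a too-generic filter with one that follows the best practices. The event dictionaries and the way the custom attribute is stored under a `custom` key are assumptions for illustration; in OMi the filter is defined in the filter editor, not in code.

```python
def naive_filter(event):
    # Too generic: also matches the event that the rule itself creates.
    return "rebooted" in event["title"]

def safer_filter(event):
    # Skip closed events and events flagged as created by an SBEC rule.
    return (
        "rebooted" in event["title"]
        and event.get("state") != "closed"
        and event.get("custom", {}).get("SBECcreated") != "true"
    )

incoming = {"title": "Node bsm92 rebooted", "state": "open"}
created = {
    "title": "system bsm92 rebooted 10 times in 2 hours",
    "state": "open",
    "custom": {"SBECcreated": "true"},   # custom attribute set by the rule
}

print(naive_filter(created), safer_filter(created), safer_filter(incoming))
# True False True
```

With the naive filter, the newly created event counts as yet another "rebooted" event and feeds back into the same rule. The safer filter still matches genuine reboot events but ignores the rule's own output.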


EVENT SUPPRESSION


– Purpose: All events matching a filter will be discarded from the event pipeline

– Example: OMi is receiving unimportant events from a data source that is not under the control of the OpsBridge organization, so they can't be filtered out at the source

– Configurable by event suppression rules consisting of
  • Event Filter
  • Name
  • Description
  • Enable/Disable

– Suppression rules are processed in the event pipeline at an early stage
  • Right after the resolution step, before Post-Resolution-EPI
  • No further processing occurs; events will be lost and not stored in the OMi DB.
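Conceptually, suppression is a filter stage that drops matching events before anything else sees them. The Python sketch below illustrates this with rules that carry the four configurable parts listed above. The dictionary layout, the sample node names, and `apply_suppression` are hypothetical and only illustrate the idea.

```python
def apply_suppression(events, rules):
    """Drop every event matched by at least one enabled suppression rule."""
    active = [r for r in rules if r["enabled"]]
    return [e for e in events if not any(r["filter"](e) for r in active)]

rules = [
    {"name": "Drop lab noise", "description": "Events from the lab domain",
     "enabled": True, "filter": lambda e: e["node"].endswith(".lab.example")},
    {"name": "Drop normal events", "description": "Disabled for now",
     "enabled": False, "filter": lambda e: e["severity"] == "normal"},
]
events = [
    {"node": "test7.lab.example", "severity": "major"},
    {"node": "bsm92", "severity": "normal"},
]
print(apply_suppression(events, rules))
# [{'node': 'bsm92', 'severity': 'normal'}]
```

The lab event is dropped for good, while the second event survives because the rule that would match it is disabled.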


TROUBLESHOOTING


EVENT HISTORY

– An event has changed and you have no idea why? Check the event history
  • Contains information about the user or component that changed the event properties

– Common source of unexpected changes to events: Event Forwarding & Back-Sync


LOGGING / DEBUGGING

– Server: DPS
– Process: opr-backend
– Log configuration file in which to enable log level "DEBUG":
  /<HPBSM_root>/conf/core/Tools/log4j/opr-backend/opr-backend.properties
– Log files:
  • /<HPBSM_root>/log/opr-backend/opr-backend.log (default location for all logging within this process)
  • /<HPBSM_root>/log/opr-backend_boot.log (for more severe issues, e.g. unhandled exceptions; everything dumped to stdout/stderr)


HOW TO TRACK AN SBEC RULE AS EVENTS ARRIVE

1. Enable DEBUG logging and check opr-backend.log

2. Make sure the event has arrived. You will see an entry like this:

– 2013-01-17 05:51:16,726 [Thread-44] DEBUG EventChannelCiResolver.logEvent(309) - resolving event: SBEC(01b0ceb8-8189-4fdb-ae74-cf38b698b6d9), nodeHints=bsm92, relatedCiHint=bsm92, service_id=null

3. Make sure the event matches the SBEC rule:

– 2013-01-17 05:51:09,951 [Thread-44] DEBUG EventStreamCorrelator.evaluateEventInRule(95) - Event matches filter in rule 'SBEC 3 CRITICAL EVENT RULE'

– 2013-01-17 05:51:09,951 [Thread-44] DEBUG FilterConfigManagerImpl.getFilterConfig(100) - get filter configuration with id: 93c74ff4-8cd0-463a-899b-2ffd41658d0f
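When the log is large, a small script can pick out the filter-match entries. The Python sketch below parses the entry layout seen in the excerpts above and extracts the rule name. The regular expressions are inferred from those excerpts only and assume each entry sits on a single line in the real file; other versions may format entries differently.

```python
import re

# Layout inferred from the excerpts above:
# <date> <time>,<ms> [<thread>] <LEVEL> <Class>.<method>(<line>) - <message>
ENTRY = re.compile(
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]+)\] "
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<cls>\w+)\.(?P<method>\w+)\((?P<line>\d+)\) - "
    r"(?P<message>.*)"
)
RULE_MATCH = re.compile(r"Event matches filter in rule '([^']+)'")

def matched_rule(log_line):
    """Return the rule name if the entry reports a filter match, else None."""
    entry = ENTRY.match(log_line)
    if entry is None:
        return None
    hit = RULE_MATCH.search(entry["message"])
    return hit.group(1) if hit else None

sample = (
    "2013-01-17 05:51:09,951 [Thread-44] DEBUG "
    "EventStreamCorrelator.evaluateEventInRule(95) - "
    "Event matches filter in rule 'SBEC 3 CRITICAL EVENT RULE'"
)
print(matched_rule(sample))   # SBEC 3 CRITICAL EVENT RULE
```

If no entry of this kind appears for an event, the input filter of the rule did not match, and the filter definition is the first thing to check.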


4. SBEC RULE WILL BE EVALUATED

– 2013-01-17 05:51:09,958 [Thread-44] DEBUG SbecRuleEvaluatorImpl.findSbecInstances(117) - Found 1 results

– 2013-01-17 05:51:09,958 [Thread-44] DEBUG SbecRuleEvaluatorImpl.findSbecInstances(130) - New SbecInstance: com.hp.opr.common.streamcorrelation.result.SbecInstance@bfa28a9[RuleId=89299047-5a85-4755-945e-43e5b9ab837a,MatchedEvtSets=[com.hp.opr.common.streamcorrelation.result.MatchedEventSet@54837563[c43278af-044d-b827-ef0f-484e086159b7,[84ee3e3d-e19f-4f39-818f-c70be3746550]]]]

– 2013-01-17 05:51:09,958 [Thread-44] DEBUG QueryResultProcessorImpl.resultMatches(82) - Repetition Scenario evaluated: 1 of 3 events collected

– 2013-01-17 05:51:09,958 [Thread-44] DEBUG EventUpdater.storeCorrelations(247) - Storing correlations



5. AS THE SECOND EVENT ARRIVES WITHIN THE TIME FRAME, MAKE SURE IT IS STORED

– 2013-01-17 05:51:13,560 [Thread-44] DEBUG SbecRuleEvaluatorImpl.findSbecInstances(117) - Found 2 results

– 2013-01-17 05:51:13,560 [Thread-44] DEBUG SbecRuleEvaluatorImpl.findSbecInstances(130) - New SbecInstance: com.hp.opr.common.streamcorrelation.result.SbecInstance@5f5fc212[RuleId=89299047-5a85-4755-945e-43e5b9ab837a,MatchedEvtSets=[com.hp.opr.common.streamcorrelation.result.MatchedEventSet@7be5ca9[c43278af-044d-b827-ef0f-484e086159b7,[999d66c6-08cd-4b20-ac2c-6fc5dd19c11e, 84ee3e3d-e19f-4f39-818f-c70be3746550]]]]

– 2013-01-17 05:51:13,560 [Thread-44] DEBUG QueryResultProcessorImpl.resultMatches(82) - Repetition Scenario evaluated: 2 of 3 events collected

– 2013-01-17 05:51:13,560 [Thread-44] DEBUG EventUpdater.storeCorrelations(247) - Storing correlations


6. MAKE SURE THE 3RD EVENT ARRIVES

– 2013-01-17 05:51:16,753 [Thread-44] DEBUG SbecRuleEvaluatorImpl.findSbecInstances(117) - Found 3 results

– 2013-01-17 05:51:16,753 [Thread-44] DEBUG SbecRuleEvaluatorImpl.findSbecInstances(130) - New SbecInstance: com.hp.opr.common.streamcorrelation.result.SbecInstance@3dc63d85[RuleId=89299047-5a85-4755-945e-43e5b9ab837a,MatchedEvtSets=[com.hp.opr.common.streamcorrelation.result.MatchedEventSet@21f10672[c43278af-044d-b827-ef0f-484e086159b7,[01b0ceb8-8189-4fdb-ae74-cf38b698b6d9, 999d66c6-08cd-4b20-ac2c-6fc5dd19c11e, 84ee3e3d-e19f-4f39-818f-c70be3746550]]]]

– 2013-01-17 05:51:16,754 [Thread-44] DEBUG QueryResultProcessorImpl.resultMatches(82) - Repetition Scenario evaluated: 3 of 3 events collected
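The "N of M events collected" entries make it easy to see how far a repetition rule has progressed. The Python sketch below extracts those counters from log lines. The message pattern is taken from the excerpts above, while the helper name `repetition_progress` is an assumption of this sketch.

```python
import re

PROGRESS = re.compile(
    r"Repetition Scenario evaluated: (\d+) of (\d+) events collected"
)

def repetition_progress(lines):
    """Return a list of (collected, required) pairs, one per progress entry."""
    progress = []
    for line in lines:
        hit = PROGRESS.search(line)
        if hit:
            progress.append((int(hit.group(1)), int(hit.group(2))))
    return progress

log = [
    "2013-01-17 05:51:09,958 [Thread-44] DEBUG QueryResultProcessorImpl.resultMatches(82) - Repetition Scenario evaluated: 1 of 3 events collected",
    "2013-01-17 05:51:13,560 [Thread-44] DEBUG QueryResultProcessorImpl.resultMatches(82) - Repetition Scenario evaluated: 2 of 3 events collected",
    "2013-01-17 05:51:16,754 [Thread-44] DEBUG QueryResultProcessorImpl.resultMatches(82) - Repetition Scenario evaluated: 3 of 3 events collected",
]
steps = repetition_progress(log)
print(steps)                       # [(1, 3), (2, 3), (3, 3)]
print(steps[-1][0] >= steps[-1][1])   # True: the threshold was reached
```

If the counter stalls below the required number, the remaining events either did not match the filter or arrived outside the time window.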



7. ONCE THE RULE MATCHES, THE SPECIFIED ACTIONS ARE EXECUTED

– 2013-01-17 05:51:16,754 [Thread-44] DEBUG QueryResultProcessorImpl.resultMatches(82) - Repetition Scenario evaluated: 3 of 3 events collected

– 2013-01-17 05:51:16,754 [Thread-44] DEBUG QueryResultProcessorImpl.processResult(53) - Rule 'SBEC 3 CRITICAL EVENT RULE' matches. Now executing Actions!

– 2013-01-17 05:51:16,754 [Thread-44] DEBUG BSMConnectionProvider.logOpenConnection(188) - Connection has been retrieved from pool. Number of borrowed connections is now: 1


8. NEW SBEC EVENT GETS CREATED

– 2013-01-17 05:51:16,811 [Thread-44] DEBUG PipelineEventPoolImpl.insertNewEvent(330) - New event being inserted into the pipeline: com.hp.opr.common.model.Event@c8dbe6a[dbf5cae8-7a31-4093-8853-1bf208a100f7,1,SBEC received 3 Critical Evenst in aminute,<null>,OPEN,CRITICAL,<null>,<null>,<null>,<null>,bsm92,<null>,<null>,<null>,com.hp.opr.common.model.ResolutionHints@2dd02796[bsm92,<null>,<null>,<null>],com.hp.opr.common.model.ResolutionHints@3cd70059[<null>,<null>,<null>,<null>],<null>,<null>,false,-1,-1,[],{},Thu Jan 17 05:51:16 MST 2013,<null>,Thu Jan 17 05:51:16 MST 2013,0,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,false,<null>,<null>,<null>,<null>,<null>,<null>]

– 2013-01-17 05:51:16,811 [Thread-44] DEBUG EventPipeline.reinsertEvent(440) - Event dbf5cae8-7a31-4093-8853-1bf208a100f7 is now waiting for reinsertion at step PipelineEntry

– 2013-01-17 05:51:16,811 [Thread-44] DEBUG EventUpdater.storeCorrelations(247) - Storing correlations


9. TO TROUBLESHOOT, JUST KNOW THE STEPS AND HOW IT WORKS

– If any of the above steps fails, opr-backend.log (in DEBUG mode) will give you the reason why

– To find corrupt people, follow the money; to find non-working SBEC events, follow the EVENT through opr-backend.log
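"Following the event" usually means searching the log for its ID. The Python sketch below does this on a list of lines, using entries quoted in the steps above as sample data. The function name `follow_event` is hypothetical, and the sketch assumes the 23-character timestamp prefix seen in the excerpts.

```python
def follow_event(lines, event_id):
    """Return (timestamp, rest_of_entry) for every entry mentioning event_id."""
    trail = []
    for line in lines:
        if event_id in line:
            # The excerpts start with "YYYY-MM-DD HH:MM:SS,mmm" (23 characters).
            trail.append((line[:23], line[23:].strip()))
    return trail

log = [
    "2013-01-17 05:51:16,726 [Thread-44] DEBUG EventChannelCiResolver.logEvent(309) - resolving event: SBEC(01b0ceb8-8189-4fdb-ae74-cf38b698b6d9), nodeHints=bsm92, relatedCiHint=bsm92, service_id=null",
    "2013-01-17 05:51:16,811 [Thread-44] DEBUG EventPipeline.reinsertEvent(440) - Event dbf5cae8-7a31-4093-8853-1bf208a100f7 is now waiting for reinsertion at step PipelineEntry",
    "2013-01-17 05:51:16,811 [Thread-44] DEBUG EventUpdater.storeCorrelations(247) - Storing correlations",
]
trail = follow_event(log, "01b0ceb8-8189-4fdb-ae74-cf38b698b6d9")
for timestamp, rest in trail:
    print(timestamp, rest[:60])
```

To run this against a real file, pass an open file object for opr-backend.log as `lines`; the step at which the trail stops is where to look for the cause.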

