VERITAS Cluster Server for Solaris Troubleshooting
Page 1: VERITAS Cluster Server for Solaris

VERITAS Cluster Server for Solaris

Troubleshooting

Page 2: VERITAS Cluster Server for Solaris

VCS_3.5_Solaris_R3.5_20020915

14-2

Objectives

After completing this lesson, you will be able to:

• Monitor system and cluster status.
• Apply troubleshooting techniques in a VCS environment.
• Detect and solve VCS communication problems.
• Identify and solve VCS engine problems.
• Correct service group problems.
• Resolve problems with resources.
• Solve problems with agents.
• Correct resource type problems.
• Plan for disaster recovery.

Page 3: VERITAS Cluster Server for Solaris


Monitoring VCS

• VCS log files

• System log files

• The hastatus utility

• SNMP traps

• Event notification triggers

• Cluster Manager

Page 4: VERITAS Cluster Server for Solaris


VCS Log Entries

• Engine log: /var/VRTSvcs/log/engine_A.log
• View logs using the GUI or the hamsg command:

hamsg engine_A

• Example entries:

TAG_D 2001/04/03 12:17:44 VCS:11022:VCS engine (had) started
TAG_D 2001/04/03 12:17:44 VCS:10114:opening GAB library
TAG_C 2001/04/03 12:17:45 VCS:10526:IpmHandle::recv peer exited errno 10054
TAG_E 2001/04/03 12:17:52 VCS:10077:received new cluster membership
TAG_E 2001/04/03 12:17:52 VCS:10080:Membership: 0x3, Jeopardy: 0x0

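The tag letter at the start of each entry makes the engine log easy to filter with standard tools. A minimal sketch, run against a sample file in the format shown above; treating TAG_A through TAG_C as the higher-severity classes is an assumption here, and the sample file path is illustrative:

```shell
# Hypothetical sketch: pull the higher-severity entries out of an engine log.
# The sample file below reproduces the entry format shown on this slide.
cat > /tmp/engine_A.log.sample <<'EOF'
TAG_D 2001/04/03 12:17:44 VCS:11022:VCS engine (had) started
TAG_D 2001/04/03 12:17:44 VCS:10114:opening GAB library
TAG_C 2001/04/03 12:17:45 VCS:10526:IpmHandle::recv peer exited errno 10054
TAG_E 2001/04/03 12:17:52 VCS:10077:received new cluster membership
EOF
# Keep only the TAG_A..TAG_C lines (assumed to be the more severe entries).
grep -E '^TAG_[A-C] ' /tmp/engine_A.log.sample
```

On a live system the same filter would be pointed at /var/VRTSvcs/log/engine_A.log instead of the sample file.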

Page 5: VERITAS Cluster Server for Solaris


Agent Log Entries

• Agent logs are kept in /var/VRTSvcs/log.
• Log files are named AgentName_A.log.
• LogLevel attribute settings:
  – none
  – error (default setting)
  – info
  – debug
  – all

• To change log level:

hatype -modify res_type LogLevel debug

Page 6: VERITAS Cluster Server for Solaris


Troubleshooting Guide

Start by running hastatus -summary:

• Cluster communication problems are indicated by the message:

Cannot connect to server -- Retry Later

• VCS engine startup problems are indicated by systems in one of the WAIT states.

• Service group, resource, or agent problems are indicated within the hastatus display.

Page 7: VERITAS Cluster Server for Solaris


Cluster Communication Problems

Run lltconfig to determine whether LLT is running. If LLT is not running:

• Check the /etc/llttab file:
  – Verify that the node number is within range (0-31).
  – Verify that the cluster number is within range (0-255).
  – Determine whether the link directive is specified correctly (qf3 should be qfe, for example).
• Check the /etc/llthosts file:
  – Verify that node numbers are within range.
  – Verify that the system names match the entries in the llttab or sysname files.
• Check the /etc/VRTSvcs/conf/sysname file:
  – Make sure there is only one system name in the file.
  – Verify that the system name matches the entry in the llthosts file.
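The sysname and llthosts checks above can be scripted. A sketch run against sample copies of the two files in a scratch directory (the contents are illustrative, not taken from a live cluster):

```shell
# Build sample copies of the two files in a temporary directory.
dir=$(mktemp -d)
printf '0 train11\n1 train12\n' > "$dir/llthosts"
printf 'train11\n' > "$dir/sysname"

# The sysname file must contain exactly one system name...
[ "$(wc -l < "$dir/sysname")" -eq 1 ] && echo "sysname: single entry"

# ...and that name must appear in llthosts.
name=$(cat "$dir/sysname")
grep -qw "$name" "$dir/llthosts" && echo "llthosts: $name present"
```

On a real node the paths would be /etc/VRTSvcs/conf/sysname and /etc/llthosts.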

Page 8: VERITAS Cluster Server for Solaris


Problems with LLT

If LLT is running:

• Run lltstat -n to determine whether systems can see each other on the LLT link.
• Check the physical network connection(s) if LLT cannot see each node.

train11# lltconfig
LLT is running
train11# lltstat -n
LLT node information:
    Node       State    Links
  * 0 train11  OPEN     2
    1 train12  CONNWAIT 2

train12# lltconfig
LLT is running
train12# lltstat -n
LLT node information:
    Node       State    Links
    0 train11  CONNWAIT 2
  * 1 train12  OPEN     2
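The CONNWAIT state in the displays above can be spotted mechanically. A sketch that parses lltstat -n style text and prints any peer not in the OPEN state; the column layout is assumed from the sample display, not from a specification:

```shell
# Sample text mirroring the lltstat -n display on this slide.
cat > /tmp/lltstat.sample <<'EOF'
LLT node information:
    Node       State    Links
  * 0 train11  OPEN     2
    1 train12  CONNWAIT 2
EOF
# Node lines end in a numeric link count; report name and state when the
# state field (second-to-last) is anything other than OPEN.
awk '$NF ~ /^[0-9]+$/ && $(NF-1) != "OPEN" { print $(NF-2), $(NF-1) }' /tmp/lltstat.sample
```

On a live node the input would come from `lltstat -n` itself rather than the sample file.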

Page 9: VERITAS Cluster Server for Solaris


Problems with GAB

Check GAB by running gabconfig -a:
• No port a membership indicates a GAB problem.
• Check the seed number in /etc/gabtab.
• If a node is not operational, and hence the cluster is not seeded, force GAB to start:

gabconfig -x

• If GAB starts and immediately shuts down, check LLT and private network cabling.
• No port h membership indicates a VCS engine (had) startup problem.

No port memberships:

# gabconfig -a
GAB Port Memberships
========================

Port a membership only (HAD not running; GAB and LLT functioning):

# gabconfig -a
GAB Port Memberships
===================================
Port a gen 24110002 membership 01
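The two displays above differ only in which Port lines are present, so the rules on this slide can be applied to the text directly. A sketch against a sample reproducing the second display; on a live node the input would be the output of gabconfig -a itself:

```shell
# Sample text mirroring the "port a only" gabconfig -a display above.
cat > /tmp/gab.sample <<'EOF'
GAB Port Memberships
===================================
Port a gen 24110002 membership 01
EOF
# Port a present => GAB is seeded; absent => GAB problem.
grep -q '^Port a ' /tmp/gab.sample \
  && echo "port a present: GAB is seeded" \
  || echo "no port a membership: GAB problem"
# Port h present => had is running; absent => engine startup problem.
grep -q '^Port h ' /tmp/gab.sample \
  && echo "port h present: had is running" \
  || echo "no port h membership: had startup problem"
```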

Page 10: VERITAS Cluster Server for Solaris


VCS Engine Startup Problems

Check the VCS engine (HAD) by running hastatus -sum:
• Check GAB and LLT if you see this message:
  Cannot connect to server -- Retry Later
• Verify that the main.cf file is valid and that system names match llthosts and llttab:
  hacf -verify /etc/VRTSvcs/conf/config
• Check for systems in WAIT states:
  – STALE_ADMIN_WAIT: The system has a stale configuration and no other system is in a RUNNING state.
  – ADMIN_WAIT: The system cannot build or obtain a valid configuration.

Page 11: VERITAS Cluster Server for Solaris


STALE_ADMIN_WAIT

To recover from the STALE_ADMIN_WAIT state:

1. Visually inspect the main.cf file to determine whether it is valid.
2. Edit the main.cf file, if necessary.
3. Verify the syntax of main.cf, if modified:
   hacf -verify config_dir
4. Start VCS on the system with the valid main.cf file:
   hasys -force system_name

All other systems then perform a remote build from the system now running.

Page 12: VERITAS Cluster Server for Solaris


ADMIN_WAIT

• A system can be in the ADMIN_WAIT state under these circumstances:

– A .stale flag exists and the main.cf file has a syntax problem.

– A disk error occurs affecting main.cf during a local build.

– The system is performing a remote build and the last running system fails.

• Restore main.cf and use the procedure for STALE_ADMIN_WAIT.

Page 13: VERITAS Cluster Server for Solaris


Identifying Other Problems

After verifying that HAD, LLT, and GAB are functioning properly, run hastatus -sum to identify problems in other areas:
• Service groups
• Resources
• Agents and resource types

Page 14: VERITAS Cluster Server for Solaris


Service Group Problems: Group Not Configured to Start or Run

• Service group not onlined automatically when VCS starts: check the AutoStart and AutoStartList attributes:
  hagrp -display service_group
• Service group not configured to run on the system:
  – Check the SystemList attribute.
  – Verify that the system name is included.

Page 15: VERITAS Cluster Server for Solaris


Service Group AutoDisabled

• Autodisable occurs when:
  – GAB sees a system but had is not running on the system.
  – Resources of the service group are not fully probed on all systems in the SystemList.
  – A particular system is visible through disk heartbeat only.
• Make sure that the service group is offline on all systems in the SystemList attribute.
• Clear the AutoDisabled attribute:
  hagrp -autoenable service_group -sys system
• Bring the service group online.

Page 16: VERITAS Cluster Server for Solaris


Service Group Not Fully Probed

Usually a result of improperly configured resource attributes:

• Check the ProbesPending attribute:
  hagrp -display service_group
• Check which resources are not probed:
  hastatus -sum
• Check the Probes attribute for resources:
  hares -display
• To probe resources:
  hares -probe resource -sys system

Page 17: VERITAS Cluster Server for Solaris


Service Group Frozen

• Verify the value of the Frozen and TFrozen attributes:
  hagrp -display service_group
• Unfreeze the service group:
  hagrp -unfreeze group [-persistent]
• If you freeze persistently, you must unfreeze persistently.

Page 18: VERITAS Cluster Server for Solaris


Service Group Is Not Offline Elsewhere

• Determine which resources are online/offline:
  hastatus -sum
• Verify the State attribute:
  hagrp -display service_group
• Offline the group on the other system:
  hagrp -offline service_group -sys system
• Flush the service group:
  hagrp -flush service_group -sys system

Page 19: VERITAS Cluster Server for Solaris


Service Group Waiting for Resource

• Review the IState attribute of all resources to determine which resource is waiting to go online.
• Use hastatus to identify the resource.
• Make sure the resource is offline (at the operating system level).
• Clear the internal state of the service group:
  hagrp -flush service_group -sys system
• Bring all other resources in the service group offline and try to bring these resources online on another system.
• Verify that the resource works properly outside VCS.
• Check for errors in attribute values.

Page 20: VERITAS Cluster Server for Solaris


Incorrect Local Name

A service group cannot be brought online if the system name is inconsistent in the llthosts, llttab, or main.cf files.

• Check each file for consistent use of system names.
• Correct any discrepancies.
• If main.cf is changed, stop and restart VCS.
• If llthosts or llttab is changed:
  a. Stop VCS, GAB, and LLT.
  b. Restart LLT, GAB, and VCS.

Page 21: VERITAS Cluster Server for Solaris


Concurrency Violations

• Occurs when a failover service group is online or partially online on more than one system.
• Notification is provided by the violation trigger, which:
  – Is invoked on the system that caused the concurrency violation.
  – Notifies the administrator and takes the service group offline on the system causing the violation.
  – Is configured by default with the violation script in /opt/VRTSvcs/bin/triggers.
  – Can be customized to:
    • Send a message to the system log.
    • Display a warning on all cluster systems.
    • Send e-mail messages.

Page 22: VERITAS Cluster Server for Solaris


Service Group Waiting for Resource to Go Offline

• Identify which resource is not offline:
  hastatus -summary

• Check logs.

• Manually bring the resource offline, if necessary.

• Configure ResNotOff trigger for notification or action.

Page 23: VERITAS Cluster Server for Solaris


Resource Problems: Unable to Bring Resources Online

Possible causes of failure while bringing resources online:

• Waiting for child resources

• Stuck in a WAIT state

• Agent not running

Page 24: VERITAS Cluster Server for Solaris


Problems Bringing Resources Offline

• Waiting for parent resources to come offline

• Waiting for a resource to respond

• Agent not running

Page 25: VERITAS Cluster Server for Solaris


Critical Resource Faults

• Determine which critical resource has faulted:
  hastatus -summary

• Make sure that the resource is offline.

• Examine the engine log.

• Fix the problem.

• Verify that the resources work properly outside of VCS.

• Clear fault in VCS.

Page 26: VERITAS Cluster Server for Solaris


Clearing Faults

• After external problems are fixed:
  1. Clear any faults on nonpersistent resources:
     hares -clear resource -sys system
  2. Check attribute fields for incorrect or missing data.
• If the service group is partially online, flush wait states:
  hagrp -flush service_group -sys system
• Bring resources offline before bringing them online.

Page 27: VERITAS Cluster Server for Solaris


Agent Problems: Agent Not Running

• Determine whether the agent for that resource is FAULTED:
  hastatus -summary
• Use the ps command to verify that the agent process is not running.
• Check the log files for:
  – An incorrect pathname for the agent binary
  – An incorrect agent name
  – A corrupt agent binary

Page 28: VERITAS Cluster Server for Solaris


Resource Type Problems

• A corrupted type definition can cause agents to fail by passing invalid arguments.

• Verify that the agent works properly outside of VCS.

• Verify values for the ArgList and ArgListValues type attributes:
  hatype -display res_type
• Restart the agent after making changes:
  haagent -start res_type -sys system

Page 29: VERITAS Cluster Server for Solaris


Planning for Disaster Recovery

• Back up key VCS files:

– types.cf and customized types files

– main.cf

– main.cmd

– sysname

– LLT and GAB configuration files

– Customized trigger scripts

– Customized agents

• Use hagetcf to create an archive.
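If hagetcf is not at hand, the same files can be archived manually. A hedged sketch using tar; the directory, archive name, and file list are illustrative stand-ins created in a scratch directory, not the live /etc/VRTSvcs/conf/config tree, and this is not a substitute for the full hagetcf archive:

```shell
# Create stand-in copies of a few of the key files listed above.
dir=$(mktemp -d)
mkdir -p "$dir/config"
touch "$dir/config/main.cf" "$dir/config/types.cf" "$dir/sysname"
# Archive them; -C keeps the paths in the archive relative.
tar -cf "$dir/vcs-backup.tar" -C "$dir" config sysname
# List the archive contents to confirm what was captured.
tar -tf "$dir/vcs-backup.tar"
```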

Page 30: VERITAS Cluster Server for Solaris


The hasnap Utility

Use the hasnap command to take snapshots of VCS configuration files on each node in the cluster. You can also restore the configuration from a snapshot.

• hasnap -backup: Backs up files in a snapshot format.
• hasnap -restore: Restores a previously created snapshot.
• hasnap -display: Displays details of previously created snapshots.
• hasnap -sdiff: Displays files that were changed on the local system after a specific snapshot was created.
• hasnap -fdiff: Displays the differences between a file in the cluster and its copy stored in a snapshot.
• hasnap -export: Exports a snapshot from the local, predefined directory to the specified file.
• hasnap -include: Configures the list of files or directories to be included in new snapshots, in addition to those included automatically by the -backup command.
• hasnap -exclude: Configures the list of files or directories to be excluded from new snapshots when backing up the configuration using the -backup command.
• hasnap -delete: Deletes snapshots from the predefined local directory on each node.

Page 31: VERITAS Cluster Server for Solaris


Summary

You should now be able to:
• Monitor system and cluster status.
• Apply troubleshooting techniques in a VCS environment.
• Resolve communication problems.
• Identify and solve VCS engine problems.
• Correct service group problems.
• Resolve problems with resources.
• Solve problems with agents.
• Correct resource type problems.
• Plan for disaster recovery.

