Understanding Red Hat Enterprise Virtualization: Common Issues and Troubleshooting
Derrick Ornelas, Software Maintenance Engineer, Red Hat
06.28.12
INTENDED AUDIENCE
RHEV administrators
  with good working-level knowledge of the product
  who want to learn how to begin troubleshooting their own issues
  who want a better understanding of the inner workings of the product
AGENDA
Understanding RHEV
  Components
  Networking structure
  Storage structure
Troubleshooting
  Understanding Logs
  Common Issues
RHEV ARCHITECTURE
Components: Manager
What is the RHEV Manager?
Management platform for virtualization
  Single platform for managing virtual servers and desktops
Built on Red Hat Enterprise Linux and JBoss
  Runs on RHEL 6.2+ Server (physical or virtual)
  Cannot currently run on the same hosts that it manages
  RHEV-M is written in Java and runs on JBoss AS 5.1.2
  JBoss is included with the RHEV subscription but is not supported for hosting non-RHEV applications
Uses an embedded PostgreSQL database
Components: Hypervisor
Dedicated hypervisor
  The minimum OS needed to run/manage virtual machines
  Well-defined management interfaces/APIs
Small footprint
  ~120MB image size
  750MB disk space required for installation
Supports the same hardware as RHEL
  Leverages hardware certification and partner testing
Built using livecd-tools to create a RHEL 6 live image
Utilizes KVM (Kernel-based Virtual Machine)
  Includes libvirt and vdsm for virtual machine management
Components: VDSM
vdsm daemon listens for incoming commands from RHEV-M
  Operates libvirt for VM lifecycle management
  Manages Storage Domains, Pools, SPM role, metadata, VM volumes and snapshots
  Monitors storage domain availability
  Written in Python
  Communicates with RHEV-M using XML-RPC on port 54321
  Configuration in /etc/vdsm/vdsm.conf
  Used to operate and control virtual machines: Start/Stop/Restart, Migrations, Monitoring
libvirt starts, stops, pauses and migrates VMs
Components: VDSM
vdsClient
  Can be used to interact with vdsmd for troubleshooting only
  Does not update the RHEV-M database!
Examples
  Print a list of running VMs:
    vdsClient -s 0 list table
  Get VM info from host:
    vdsClient -s 0 getAllVmStats
  Start a virtual machine (for emergency situations only):
    vdsClient -s 0 create /dev/null vmId=b53eff20-7fb2-4b73-8172-76ec279f917b memSize=1024 macAddr=00:1a:4a:40:18:0b display=vnc vmName=rhel6_2 drive=pool:82e6bb7a-8c10-41c9-80c2-f947d6adac13,domain:d964e86d-ac5f-48a6-b7e4-7742b6fcf271,image:9c997323-36b1-4ce9-906f-c9a7e8ba8e08,volume:c1acf9b6-ac55-44f1-bfe6-b38c20c27bec,boot:true,format:cow bridge=rhevm
Direct access to libvirt functionality via virsh is restricted
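The drive= argument to vdsClient create packs four UUIDs into one string, so a single dropped hyphen is easy to miss. The following is a minimal sketch, not part of vdsm: drive_spec is a hypothetical helper that assembles the argument in the same key order as the example above and refuses any identifier that is not in canonical UUID form.

```python
import uuid


def drive_spec(pool, domain, image, volume, boot=True, fmt="cow"):
    """Build a drive= value for `vdsClient create` (hypothetical helper).

    Raises ValueError if any identifier is not a canonical
    8-4-4-4-12 UUID string.
    """
    ids = {"pool": pool, "domain": domain, "image": image, "volume": volume}
    for name, value in ids.items():
        # uuid.UUID() raises ValueError on non-UUID input. It tolerates
        # missing hyphens, so the round-trip comparison is what rejects
        # strings such as 'd964e86d-ac5f-48a6-b7e47742b6fcf271'.
        if str(uuid.UUID(value)) != value.lower():
            raise ValueError("%s is not a canonical UUID: %s" % (name, value))
    parts = ["%s:%s" % (key, value) for key, value in ids.items()]
    parts.append("boot:%s" % ("true" if boot else "false"))
    parts.append("format:%s" % fmt)
    return "drive=" + ",".join(parts)
```

The returned string can be pasted into the create command in place of a hand-typed drive= argument; the other arguments (vmId, memSize, and so on) are untouched by this sketch.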
NETWORKING logical networks
Each Data Center defines the logical networks that exist in its environment
These logical networks are usually assigned by functionality and physical topology, for example:
  Guest data network
  Storage network access
  Management network (may be out of band for managing the servers)
  Display network (for SPICE/VNC)
Each Cluster may have a different set of logical networks, but they must exist in the Data Center definition
All the Hosts in the Cluster should have the same network configuration
By default, RHEV-M defines only the management network, rhevm
The logical network layout does not necessarily correspond to the physical NICs on the host, but the infrastructure must support VLAN tags and bonding for that to be possible; otherwise logical network == physical NIC
NETWORKING
Support for VLAN tags and bonding
Supported bonding modes
  active-backup / mode 1
  balance-xor / mode 2
  802.3ad / mode 4
  balance-tlb / mode 5
balance-rr (mode 0) and balance-alb (mode 6) are not supported due to incompatibility with software bridges
A software bridge is created for each logical network and functions like a switch
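The supported and unsupported bonding modes listed above can be captured in a small lookup. This is only an illustrative sketch: bond_mode_status is a hypothetical helper, and it assumes the bond options are available as a space-separated string such as "mode=4 miimon=100". Any mode not named above is reported as not listed rather than guessed at.

```python
# Modes as listed above, keyed by mode number.
SUPPORTED = {"1": "active-backup", "2": "balance-xor",
             "4": "802.3ad", "5": "balance-tlb"}
UNSUPPORTED = {"0": "balance-rr", "6": "balance-alb"}


def bond_mode_status(bonding_opts):
    """Classify the mode= value found in a bonding options string."""
    opts = dict(item.split("=", 1)
                for item in bonding_opts.split() if "=" in item)
    mode = opts.get("mode")
    if mode is None:
        return "no mode given"
    for table, status in ((SUPPORTED, "supported"),
                          (UNSUPPORTED, "not supported")):
        for number, name in table.items():
            # Accept either the numeric or the named form of the mode.
            if mode in (number, name):
                return "%s (mode %s): %s" % (name, number, status)
    return "mode %s: not listed above" % mode
```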
NETWORKING ports
Note:
  SPICE/VNC clients connect directly to the host
  USB redirection clients connect directly to the guest
STORAGE definitions
Storage pool: logical equivalent to Data Center, groups storage domains together
Storage domain: physical chunk of storage that holds virtual machine disks
Storage Pool Manager: single host in the data center that is chosen to manage all storage in a storage pool
Host Storage Manager: VDSM component on each host that reads/writes messages to the SPM
STORAGE
[Diagram: RHEV-M, hosts running VDSMD with a Host Storage Manager, the Storage Pool Manager host, and the Storage Pool]
STORAGE architecture
Storage Pool
File
  NFS share
  Each domain is a share
  Volumes and Metadata are files
  Domain Types: Data, Export, ISO
Block
  iSCSI: managed via iscsiadm
  FC: direct access
  Devices are managed by LVM and multipathd
  Each LUN is a PV
  Each Domain is represented by a VG, tagged with RHAT_storage_domain
  Metadata and Volumes are represented by LVs
  Domain Types: Data
STORAGE architecture
How are virtual machines stored?
OVF file
  Holds VM description: name, NICs, CPU, memory, disks and more
  Only used when importing/exporting VMs to/from RHEV
VM disk
  Managed as an image, which is a logical group of volumes
  Volumes in an image are different versions of a disk
  Stored as files on NFS
  Stored on LVM logical volumes on iSCSI/FC
STORAGE architecture
Snapshots
  A new sparse volume is created, regardless of the type of the original volume
  QCOW2 chains the volumes together, grouped as an image
  The last volume in the chain is read-write (rw); all the others are read-only (r)
  On Block storage, all of the image's volumes/LVs must be active
Templates
  Template volume can be used as head of chain
  Template volume is always read-only in this case
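The snapshot rules above can be pictured as a simple list of volumes. This is a toy model for illustration only: the Volume and Image classes are hypothetical and do not reflect how vdsm or QCOW2 represent a chain internally.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Volume:
    name: str
    sparse: bool


@dataclass
class Image:
    """An image: a logical group of volumes chained oldest to newest."""
    volumes: List[Volume] = field(default_factory=list)

    def snapshot(self, name):
        # A snapshot always adds a new sparse volume, regardless of
        # whether the original volume was sparse or preallocated.
        self.volumes.append(Volume(name, sparse=True))

    def writable(self):
        # Only the last volume in the chain is read-write.
        return self.volumes[-1]

    def read_only(self):
        # Every earlier volume (including a template at the head
        # of the chain) is read-only.
        return self.volumes[:-1]


image = Image([Volume("base", sparse=False)])
image.snapshot("snap1")
image.snapshot("snap2")
print([v.name for v in image.read_only()], "->", image.writable().name)
```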
STORAGE architecture
Volumes are visible to all hosts in the storage pool
SPM:
  Single host that controls all storage operations
Master Data Storage Domain:
  Single storage domain that keeps all the up-to-date information about the storage pool as metadata
Metadata:
  Storage pool and domain have metadata that describes them
  Each volume also has metadata describing it
  On Block storage, volume metadata is stored on an LV
  On NFS storage, volume metadata is a file per volume with a .meta suffix
Storage Architecture
Each storage domain contains the following files/volumes for internal use:
  ids: not used
  inbox: monitored by SPM for HSM messages
  outbox: monitored by HSMs for SPM messages
  leases: SPM writes a timestamp here to prevent other hosts from becoming SPM at the same time
  master: ext3 filesystem with vms and tasks directories; only mounted on SPM, and only used on master storage domain
  metadata: contains volume metadata
On NFS: a file for each
On Block: an LV for each
Storage Metadata
Metadata: information describing the storage pool and each of its storage domains, stored on the physical storage
  Consists of a combination of text and LVM tags
Two storage domain metadata versions exist: V1 and V2
  Version 1: used by ISO and Export storage domains, and all RHEV 2.x storage domains
  Version 2: used by new data storage domains in RHEV 3.0
Block storage metadata
  V1 storage domain metadata located in the first 2k bytes of /dev//metadata
  V2 storage domain metadata is part of the VG tags
  Volume metadata located on /dev//metadata
NFS storage metadata
  Storage domain metadata located in /rhev/data-center/mnt///dom_md/metadata
  Volume metadata located in /rhev/data-center/mnt///images// .meta
Storage Structure
# tree /rhev/data-center/
  Shows the tree structure of the Storage Pool as seen by the host
  tree package is not installed by default on RHEL 6
  tree package is not available on RHEV-H
# python /usr/share/vdsm/dumpStorageTable.py
  Provides a table view of the storage
# pvs -o +tags
# vgs -o +tags
# lvs -o +tags
  Show LVM information with RHEV-related tags
TROUBLESHOOTING understanding logs
Main RHEV-M log:
  /var/log/rhevm/rhevm.log
  Timestamps will be in the timezone of the OS (e.g. localtime)
Main Hypervisor logs:
  /var/log/vdsm/vdsm.log
  /var/log/vdsm/libvirt.log
  Timestamps will be in UTC on RHEV-H
Time difference between manager and hypervisor may need to be taken into account when following task flows from RHEVM to RHEVH
rhevm.log:
  2012-01-14 17:26:17,803 INFO [org.ovirt.engine.core.bll.RunVmCommand] (pool-11-thread1425) Running command: RunVmCommand internal: false. Entities affected : ID: 570c6cfd-6fe4-4a33-8fd0-d32d5bfa2bd5 Type: VM
vdsm.log:
  Thread-200734::DEBUG::2012-01-14 07:26:19,168::clientIF::54::vds::(wrapper) [10.64.24.140]::call create with ({'bridge': 'rhevm', 'acpiEnable': 'true', 'emulatedMachine': 'rhel6.2.0', 'vmId': '570c6cfd-6fe4-4a33-8fd0-d32d5bfa2bd5' ...
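When lining up the two logs, it can help to convert the manager's local timestamps to UTC before comparing. The sketch below uses the two sample timestamps above; the +10 hour offset is an assumption inferred from those samples (17:26 on the manager against 07:26 on the hypervisor), and to_utc is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Timestamp layout shared by the rhevm.log and vdsm.log samples.
FMT = "%Y-%m-%d %H:%M:%S,%f"


def parse(stamp):
    """Parse a log timestamp such as '2012-01-14 17:26:17,803'."""
    return datetime.strptime(stamp, FMT)


def to_utc(local_stamp, utc_offset_hours):
    """Convert a manager (local time) timestamp to UTC."""
    return parse(local_stamp) - timedelta(hours=utc_offset_hours)


# Assumed offset: the manager in the sample appears to run at UTC+10.
manager_utc = to_utc("2012-01-14 17:26:17,803", utc_offset_hours=10)
hypervisor = parse("2012-01-14 07:26:19,168")
print("vdsm received the call", hypervisor - manager_utc, "after RHEV-M logged it")
```

Once both timestamps are in UTC, the gap between the RunVmCommand entry and the matching create call is a little over a second, which makes it easier to confirm that the two lines belong to the same action.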
Troubleshooting Understanding Logs
All actions are initiated by the manager
  vdsm daemon listens for incoming tasks
  Tasks are handled asynchronously by vdsm; the manager polls for status
  Response is returned to the manager when completed
Check the vdsm logs for 'Run and protect' to indicate the start (and end) of a new task
  Thread-227417::INFO::2012-01-14 07:54:39,246::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: getSpmStatus, args: ( spUUID=82e6bb7a-8c10-41c9-80c2-f947d6adac13)
  ...
  Thread-227417::INFO::2012-01-14 07:54:39,248::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: getSpmStatus, Return response: {'status': {'message': 'OK', 'code': 0}, 'spm_st': {'spmId': 3, 'spmStatus': 'SPM', 'spmLver': 8}}
Task flow can be followed in the vdsm log by looking for lines that start with the same identifier: one kind of identifier is usual for short tasks, another for long (async) tasks
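Following a task by its starting identifier can be done with a few lines of scripting. The sketch below groups vdsm.log lines by the text before the first '::' (for example Thread-227417 in the samples shown earlier) and pulls out the 'Run and protect' verbs. The function names are hypothetical, and the field layout is assumed only from those sample lines.

```python
from collections import defaultdict


def group_by_prefix(lines):
    """Group log lines by the identifier before the first '::'."""
    flows = defaultdict(list)
    for line in lines:
        prefix, sep, rest = line.partition("::")
        if sep:  # skip lines that do not follow the prefix::... layout
            flows[prefix].append(rest)
    return dict(flows)


def run_and_protect(lines):
    """Return (identifier, verb) for each 'Run and protect' line."""
    marker = "Run and protect: "
    found = []
    for line in lines:
        if marker in line:
            prefix = line.split("::", 1)[0]
            verb = line.split(marker, 1)[1].split(",", 1)[0]
            found.append((prefix, verb))
    return found
```

A verb that appears twice under the same identifier marks the start and the end of one task; an identifier with a start but no matching 'Return response' line is a candidate for a task that never completed.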
TROUBLESHOOTING checking database
Info about VMs, storage pools, domains, VM images, etc. (e.g. UUIDs and current state) can be obtained by querying the RHEV-M database
  Embedded PostgreSQL database server
  Database name: rhevm
  Connect by running 'psql rhevm rhevm'
Restoring from database backup
  pg_restore -c -d rhevm -U postgres
Graphical client such as pgadmin3 can be used for convenience
  Available in Fedora or EPEL repository
Example query
  Show all hosts and their info:
    select * from vds_static \x \g \x
TROUBLESHOOTING database
Important tables:
  images: list of all VM volumes
  image_vm_map: maps VMs to active volumes
  lun_storage_server_connection_map: maps LUNs and iSCSI connection info
  lun: list of physical storage LUNs
  storage_pool: list of all storage pools/data centers
  storage_domain_static: list of all storage domains
  storage_server_connections: list of iSCSI and NFS storage connections
  vds_static: list of all hosts/hypervisors
  vm_static: list of all virtual machines
Note: Table and column names can be confusing. A disk image is referred to as image_group, while a disk volume is referred to as image
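For repeated read-only checks, the psql invocation can be scripted. The sketch below only builds the command line and does not run it; psql_dump_table is a hypothetical helper, the table names are the ones listed above, and -x (expanded output) and -c (run one command) are standard psql options. Column names are deliberately not assumed, so every query is a plain select *.

```python
import shlex

# Table names as listed above.
TABLES = {
    "images", "image_vm_map", "lun_storage_server_connection_map", "lun",
    "storage_pool", "storage_domain_static", "storage_server_connections",
    "vds_static", "vm_static",
}


def psql_dump_table(table, db="rhevm", user="rhevm"):
    """Return an argv list that prints every row of one known table."""
    if table not in TABLES:
        # Refuse anything outside the known list rather than pass
        # arbitrary text into the SQL statement.
        raise ValueError("unknown table: %s" % table)
    return ["psql", "-x", "-c", "select * from %s" % table, db, user]


argv = psql_dump_table("vds_static")
print(" ".join(shlex.quote(arg) for arg in argv))
```

The resulting list can be handed to subprocess.run on the manager; keeping the query read-only avoids changing the state that RHEV-M relies on.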
TROUBLESHOOTING storage issues
For troubleshooting storage issues, generally concentrate on the SPM host (or the host attempting to become SPM if the problem relates to acquiring the SPM role)
The current SPM host may not have been the SPM at the time a problem occurred:
  Search rhevm.log for 'starting spm on' to find the SPM at the time of the problem
RHEV storage operations use standard RHEL commands, so typical storage troubleshooting applies:
  Storage commands can be run from the hypervisor command line
  multipath, iscsiadm, showmount/mount/rpcinfo, cat /proc/scsi/scsi, less /var/log/messages, etc.
Storage domains, VM disks, snapshots and templates on iSCSI/FC data centers are LVM volume groups / logical volumes, so typical LVM troubleshooting applies:
  vgscan, lvs, vgchange, cat /etc/lvm/{archive,backup}/, etc.
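Finding which host held the SPM role when a problem occurred amounts to taking the last 'starting spm on' entry logged at or before the time of the problem. A minimal sketch follows; it assumes rhevm.log lines are in chronological order and begin with a timestamp in the 'YYYY-MM-DD HH:MM:SS,mmm' form seen in the rhevm.log sample earlier. The helper name spm_line_before is hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"
STAMP_LEN = 23  # length of 'YYYY-MM-DD HH:MM:SS,mmm'


def spm_line_before(lines, problem_time):
    """Return the last 'starting spm on' line at or before problem_time."""
    best = None
    for line in lines:
        if "starting spm on" not in line.lower():
            continue
        try:
            stamp = datetime.strptime(line[:STAMP_LEN], FMT)
        except ValueError:
            continue  # line does not start with a timestamp
        if stamp <= problem_time:
            best = line  # later matches overwrite earlier ones
    return best
```

Typical use is to read /var/log/rhevm/rhevm.log (and any rotated copies) into a list of lines and pass the time at which the storage problem was first noticed.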
TROUBLESHOOTING certificates
CA certificate generated during rhevm-setup
  Located on manager at:
    http://:/ca.crt
    /var/lib/jbossas/server/rhevm-slimmed/deploy/ROOT.war/ca.crt
    /etc/pki/rhevm/ca.pem
  CA certificate must be in the Windows Trusted Root Certification Authorities certificate store on the client to connect via HTTPS
SSL certificates for rhevm to vdsmd communication are created during the host registration process
  Located on manager at /etc/pki/rhevm/certs/cert.pem
  Located on hosts at /etc/pki/vdsm/certs/vdsmcert.pem
SSL/TLS requires times to be in sync, else the connection will fail to be established (use NTP)
TROUBLESHOOTING certificates
WPF application code cannot run without being signed
Code-signing cert is installed by RHEV-GUI-CertificateInstaller.exe, offered by the admin portal on first access
  Located on manager at /usr/share/rhevm/rhevm.ear/rhevmanager.war/RHEV-GUI-CertificateInstaller.exe
WPF code certificate must be installed by the installer, or the admin WPF app will crash on start
TROUBLESHOOTING data center down
Data center = storage pool
Storage pool must have a master storage domain (holds storage pool metadata)
  Common problem: Cannot activate Master storage domain
Storage pool must have an SPM host managing all changes to the storage domains
  Common problem: Cannot start SPM on any hosts
Usually caused by storage or network related problems
  Missing LUN(s) and/or volume group(s) (storage domain = volume group)
  Corrupt / inconsistent storage domain metadata or LVM metadata
  Current SPM host is non-responsive and no fencing defined for that host
Troubleshooting examples:
  Situation where current SPM host is non-responsive with no power fencing: manually reboot the host and select 'Confirm host has been rebooted' in the RHEVM GUI
  Put master storage domain into maintenance and try to activate it again
  Put all hosts in maintenance (or reboot all) and try to activate just one host
    Focus on vdsm.log on that host to determine why the storage pool won't activate
TROUBLESHOOTING problematic host states
Non-responsive
  Cause: RHEVM cannot communicate with the host on vdsm port 54321
  RHEVM regularly monitors hosts, and if a host cannot be contacted after a while, errors similar to these will appear in rhevm.log:
    ResourceManager::refreshVdsRunTimeInfo::Failed to refresh VDS , vds = 0b1f2e8e-3b5a-11e1-b24a-5254005ef58b : rhevhost1.redhat.com, VDS Network Error, continuing
    ResourceManager::vdsNotResponding entered for Host 0b1f2e8e-3b5a-11e1-b24a-5254005ef58b, 10.10.1.205
  Power fencing will allow a non-responsive host to be rebooted and HA VMs to restart on another host (SPM role will transfer, too, if the host was SPM)
Non-operational
  Cause: RHEVM can still communicate with the host on port 54321, but something is wrong with the configuration/operation of the host
  The host is not able to successfully operate all the components defined for a host in its cluster:
    Storage domain (volume group) cannot be found/activated
    Metadata corruption / inconsistency (Wrong master version)
    Logical network (not rhevm network) cannot be created or is down
TROUBLESHOOTING problematic VM states
Unknown
  Usually related to a host becoming non-responsive when no power management has been defined for that host
  RHEVM can no longer monitor the VMs on that host to determine their state, so it marks them as Unknown
    Failed to run Fence script on vds:host5.redhat.com, VMs moved to UnKnown instead
Paused
  VM usually enters this state if it encounters storage problems outside the VM
  Prevents further operation of the VM to stop it trying to access/write to its disk
    libvirtEventLoop::INFO::2012-01-13 06:29:02,646::libvirtvm::1231::vm.Vm::(_onAbnormalStop) vmId=`b53eff20-7fb2-4b73-8172-76ec279f917b`::abnormal vm stop device ide0-0-0 error eio
    libvirtEventLoop::DEBUG::2012-01-13 06:29:02,646::libvirtvm::1386::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`b53eff20-7fb2-4b73-8172-76ec279f917b`::event Suspended detail 2 opaque None
  Example causes:
    Storage connection problems
    Storage domain full when trying to extend sparse VM volume
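To find every VM that was paused for a storage error, the 'abnormal vm stop' lines in vdsm.log can be matched with a regular expression. This is a sketch based only on the sample line above; other vdsm versions may word the message differently, and abnormal_stops is a hypothetical helper name.

```python
import re

# Pattern derived from the sample vdsm.log line above.
ABNORMAL_STOP = re.compile(
    r"vmId=`(?P<vm>[0-9a-f-]{36})`::abnormal vm stop "
    r"device (?P<device>\S+) error (?P<error>\S+)"
)


def abnormal_stops(lines):
    """Yield (vm_id, device, error) for each abnormal-stop line."""
    for line in lines:
        match = ABNORMAL_STOP.search(line)
        if match:
            yield match.group("vm"), match.group("device"), match.group("error")
```

The extracted vmId can then be compared with the VM's ID in the RHEV-M database to confirm which VM was affected.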
TROUBLESHOOTING can't start VM
Disk problems, examples:
  Moving a VM with multiple volumes between storage domains may have been interrupted, leaving some of the volumes in the original storage domain and some in the destination domain
  VM disk based on a template has been moved to a new storage domain but the template wasn't moved
  Solution: May need to check the DB to get a list of images/storage domains for the VM, then use LVM commands (e.g. lvs) to get an on-disk view of where its images are. If there are discrepancies, may need to run SQL update statements to update the DB view and/or vdsClient commands to modify the on-disk view
RHEVM UI problems, examples:
  VM state appears to be stuck in a state that won't allow it to be started, e.g. Image Locked or Unknown
  RHEVM DB has become out of sync with what is happening on the hypervisors
  For Image Locked, need to check if the task (e.g. creating/moving/deleting an image) has completed. For Unknown, need to check if the VM is running on any hosts
  Solution: May need to run SQL update statements against the DB to change the state of the VM (e.g. to Down) so the VM can be started again
TROUBLESHOOTING can't migrate VM
Hostname problems
  Hostname of destination needs to be DNS resolvable by source host
    migration destination error: Migration destination has an invalid hostname
Firewall problems
  Live migration needs ports 49152-49216 on the destination to be accessible
Hardware incompatibilities
  E.g. CPU on destination host doesn't match source host
Timeout problems
  Live migration timeout is 300 seconds (migrate_timeout in /etc/vdsm/vdsm.conf)
  Some factors influencing time to migrate VM memory image between hosts:
    Amount of memory in VM
    Amount of memory activity happening in the VM
    Saturation of the network used to migrate the memory image
  Live migrating many VMs at once, e.g. putting a host into maintenance that had many VMs running on it, may mean not all VMs complete their live migration before the timeout expires, hence preventing the host from going into maintenance
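A rough back-of-the-envelope calculation shows why many simultaneous migrations can exceed the 300 second timeout. The sketch below is an optimistic lower bound under stated assumptions: memory is copied exactly once (no pages re-dirtied by the guest), the link carries no other traffic, and concurrent migrations share the bandwidth equally. The function name and the example figures are illustrative and not taken from vdsm.

```python
MIGRATE_TIMEOUT_S = 300  # migrate_timeout value quoted above


def estimated_seconds(mem_gib, link_gbit, concurrent=1):
    """Lower-bound time to copy one VM's memory over a shared link."""
    bits_to_copy = mem_gib * 1024 ** 3 * 8
    # Assumption: concurrent migrations split the link evenly.
    bits_per_second = link_gbit * 1e9 / concurrent
    return bits_to_copy / bits_per_second


for vms in (1, 10):
    seconds = estimated_seconds(mem_gib=8, link_gbit=1, concurrent=vms)
    verdict = "within" if seconds <= MIGRATE_TIMEOUT_S else "exceeds"
    print("%2d VM(s) of 8 GiB on 1 Gbit/s: %.0f s (%s timeout)" % (vms, seconds, verdict))
```

In practice memory activity inside the VM pushes the real figure higher, so a result close to the timeout already suggests migrating fewer VMs at a time.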
TROUBLESHOOTING KVM and QEMU
Common Issues
Time Drift
  CPU overcommit, heavy server load
  Use NTP as much as possible
  RHEV Administration Guide, Appendix G: KVM Virtual Machine Timing Management
Performance (suggested best practices):
  Guest:
    noatime
    elevator=noop
    vcpu numbers
    Minimal services
  Host:
    elevator=deadline
    noatime
    Multipathing
    Bonding
    Avoid overallocation
      More threads than physical CPUs: threads block, which causes problems as a VM should essentially be running in real time
      Don't overcommit servers with more CPUs or memory than you actually have
TROUBLESHOOTING collecting logs
rhevm-log-collector
  Main tool used by support personnel to capture a snapshot of a customer's RHEV environment
  Collects RHEVM log files and database
    tmp/logcollector/RHEVH-and-PostgreSQL-reports/time_diff.txt
  Runs sosreports on nominated hypervisors to collect the usual RHEL related info (e.g. log files / command output) as well as RHEV specific info (e.g. VDSM/libvirt log files and vdsClient command output)
Usage:
  rhevm-log-collector [options] list
  rhevm-log-collector [options] collect
  The 'list' operation will list hosts, data centers, or clusters from which logs may be collected
  The 'collect' operation will collect the data and compress it
  This process may take some time depending on log file size
https://access.redhat.com/knowledge/techbriefs/troubleshooting-red-hatenterprise-virtualization-manager-log-collection-rhev-3
Stay connected through the Red Hat Customer Portal
Review Tech brief: Troubleshooting Host Installation for Red Hat Enterprise Virtualization 3.0
access.redhat.com