Post on 26-Dec-2015
transcript
1
6/26/2013 Cisco Live 2013
2
On average, how do your network administrators and other network IT professionals spend their time on your access (edge) switches?
3
4
5
6
7
8
9
10
Port connectivity issues arise from NIC, cable, or switch port faults. These problems may be hardware or software related. There are two different traffic-related issues: not passing any traffic and passing too much. If no traffic is passing, verify the configuration settings on both devices of the link. Move one of the devices to a known working partner and see whether the problem moves with it. Too much traffic could indicate a number of issues: broadcast storm, unicast flood, or oversubscription. A sniffer capture is usually necessary to identify the types of packets overtaking the interface. Speed/duplex issues are typically seen at the access layer, where user PCs move and change frequently. Where possible, hard code the speed/duplex setting, particularly for connections that should not change (e.g. servers and routers).
11
12
13
%LINK-4-ERROR: FastEthernet0/1 is experiencing errors
The above link error message pops up because of the following potential problems:
Excessive alignment errors -> 1 per 100 ms
Excessive FCS errors -> 1 per 100 ms
Excessive TX collisions -> 1 per 100 ms
Late collisions -> 1 per 100 ms
14
A platform-specific algorithm determines the path. Either MAC addresses or IP addresses are used for path determination. All packets for a given source/destination pair take the same path.
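The property that a given source/destination pair always takes the same path can be sketched as below. This is a minimal illustration, assuming the path is chosen among several parallel links. The XOR-and-modulo hash and the name `pick_link` are assumptions made for this example, since the real algorithm is platform specific.

```python
def pick_link(src_mac: str, dst_mac: str, num_links: int) -> int:
    """Return a link index that depends only on the address pair.

    Hypothetical hash: XOR the two MAC addresses and take the result
    modulo the number of links. Real platforms use their own algorithm.
    """
    src = int(src_mac.replace(".", ""), 16)
    dst = int(dst_mac.replace(".", ""), 16)
    return (src ^ dst) % num_links


flows = [
    ("0007.7d75.88c0", "c471.fe1e.f0c0"),
    ("0007.7d75.88c1", "c471.fe1e.f0c0"),
]
for src, dst in flows:
    # The same pair always yields the same index, so packets of one
    # flow never get reordered across links.
    print(src, "->", dst, "uses link", pick_link(src, dst, 4))
```

Because the result depends only on the two addresses, a single busy flow cannot be spread over several links; only a mix of different address pairs balances the load.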
15
Speed and duplex should be fixed on both sides of the link. See CSCtj21335 and the workaround: https://supportforums.cisco.com/docs/DOC-23267
17
TDR = Time Domain Reflectometry; available for copper interfaces up to 1GE speed. Interfaces will be brought down and up when run on active ports.
© 2012, Cisco Systems, Inc. All rights reserved. 18
19
Packets with IP options, Packets with expired TTL, Glean packets, ARP, Snooping, Software ACLs, SNMP, etc.
20
Capturing process utilization at the “right” moment is key to identifying the cause.
Processes: for example, “show tech” causes the virtual exec process to use some CPU resources.
Traffic forwarding:
Data traffic not forwarded by ASIC
Excessive control plane / management traffic:
DoS attacks (TTL=1)
SVI ping test
Requires inspecting CPU queues and ASIC
*Note: show tech causes the virtual exec process to use some CPU resources
CPU utilization sustained below 50% will not cause problems.
Example of Syslog msg for high CPU
002182: *Jul 20 04:23:36: %SYS-1-CPURISINGTHRESHOLD: Threshold:
Process CPU Utilization(Total/Intr): 9%/0%, Top 3 processes(Pid/Util):
214/3%, 153/0%, 159/0%
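A message like the one above can be split into its fields for trending or alerting. The following is a minimal sketch in Python; the function name `parse_cpu_threshold` and the returned dictionary layout are assumptions for illustration, and the regular expression only targets the rising-threshold format shown here.

```python
import re

# Matches the utilization summary and the "Top 3 processes" list.
PATTERN = re.compile(
    r"Process CPU Utilization\(Total/Intr\):\s*(\d+)%/(\d+)%,"
    r"\s*Top 3 processes\(Pid/Util\):\s*(.+)$"
)


def parse_cpu_threshold(message: str) -> dict:
    """Extract total/interrupt CPU and the top processes from the message."""
    # Collapse line wraps and repeated spaces into single spaces first.
    match = PATTERN.search(" ".join(message.split()))
    if match is None:
        raise ValueError("not a CPU threshold message")
    total, intr, top = match.groups()
    processes = []
    for item in top.split(","):
        pid, util = item.strip().rstrip("%").split("/")
        processes.append((int(pid), int(util)))
    return {"total": int(total), "interrupt": int(intr), "top": processes}


SAMPLE = """002182: *Jul 20 04:23:36: %SYS-1-CPURISINGTHRESHOLD: Threshold:
Process CPU Utilization(Total/Intr): 9%/0%, Top 3 processes(Pid/Util):
214/3%, 153/0%, 159/0%"""

print(parse_cpu_threshold(SAMPLE))
```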
Sorting the output is better than filtering the output with “exclude 0.00%”
because that will exclude processes that you want to see.
2960-S will have a higher CPU util that is considered normal.
The “normal” cpu usage depends on number of members in the stack, routing protocols, spanning tree instances, …
High CPU Utilization?
Looking for lost packets?
Use EEM to take the process usage snapshot at the right time:
event manager applet High_CPU
 event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.6 get-type next entry-op gt entry-val 90 poll-interval 10
 action 1.1 syslog msg "High CPU DETECTED. Please wait - logging Information to flash:show_proc_cpu.txt"
 action 1.2 cli command "enable"
 action 1.3 cli command "end"
 action 1.4 cli command "term exec prompt timestamp"
 action 1.5 cli command "show process cpu sorted | redirect flash:show_proc_cpu.txt"
 action 2.1 syslog msg "Capturing IP Traffics - logging Information to flash:show_ip_traffic.txt"
 action 2.2 cli command "show ip traffic | redirect flash:show_ip_traffic.txt"
 action 3.1 syslog msg "Capturing show tech. Please wait - logging Information to flash:show_tech.txt"
 action 3.2 cli command "show tech | redirect flash:show_tech.txt"
 action 3.3 syslog msg "Self-removing applet from configuration..."
 action 3.4 cli command "configure terminal"
 action 3.5 cli command "no event manager applet High_CPU"
 action 3.6 cli command "end"
20
21
Each queue has its own tuned buffers and scheduling. These are not configurable, but were highly tested and tweaked before the Catalyst 3750 was released.
rpc: Internal messaging queue
STP: STP messages
ipc: L3 internal messaging queue
Routing protocols: L3 protocols like OSPF
L2 protocol: CDP etc. other than STP
remote console: Stacking slave console
Software forwarding: Fast switching, CPU forwarding
Host to Host functions: Ping, Telnet etc.
Broadcast: L2 broadcast packets
cbt-to-spt: Used by Multicast
IGMP snooping: Used by Multicast
ICMP: Used by IP
Logging: ACL logging and Smart Logging
RPF-fail: Multicast RPF fail
Queue14: Unused
CPU heartbeat: Internal wellness check
For example, the STP queue has more buffers and can handle more STP packets, since one doesn’t want to drop them. Logging, in comparison, is a low-priority activity. The thresholds differ per queue and can vary from 1 to 100s of packets. They are dynamic and can grow. The threshold numbers may be changed, if required, depending on new feature requirements etc.
The stack ring has the capability of reserving bandwidth for priority traffic. We use it to ensure the stack messages can work even under heavy user load. This bandwidth is guaranteed by hardware even under heavy user overloading of the stack. All stack messages are handled in the rpc queue, which is tuned with larger buffers and scheduling.
Flows eligible for CPU forwarding are:
Control plane traffic
Management traffic
TCAM overflow traffic: ACL overflow, MAC entry overflow, routing table overflow
Special protocol flows; these are typically low volume and unofficially supported.
Depth of CPU queues cannot be modified
The HW (i.e. the port ASIC) will drop on queue congestion
Overload on one CPU queue should not affect other queues
STP has its own queue (Queue 1); Queue 4 is for the other L2 protocols
Values are cumulative. Use “clear controllers cpu” or repeat the command multiple times
21
22
Use the debug platform cpu-queues privileged EXEC command to enable debugging of platform central processing unit (CPU) receive queues.
software-fwd-q: Debug packets received by the software forwarding queue.
When running the debug:
        Command                                             Purpose
Step 1  configure terminal                                  Enter global configuration mode.
Step 2  no logging console                                  Disable logging to the console terminal.
Step 3  logging buffered 128000                             Enable system message logging to a local buffer, and set the buffer size to 128000 bytes.
Step 4  service timestamps debug datetime msec localtime    Configure the system to apply a timestamp to debugging messages or system logging messages.
Step 5  exit                                                Return to privileged EXEC mode.
BGL-3700-3#sh cd ne
Capability Codes: R - Router, T - Trans Bridge, B - Source Route Bridge
S - Switch, H - Host, I - IGMP, r - Repeater, P - Phone,
D - Remote, C - CVTA, M - Two-port Mac Relay
Device ID Local Intrfce Holdtme Capability Platform Port ID
BGL14-TACLAB-ASW-J08
Gig 1/0/2 158 S I WS-C3550- Fas 0/16
BGL14-TACLAB-ASW-J08
Gig 2/0/2 131 S I WS-C3550- Fas 0/40
BGL-3700-3#sh arp
Protocol Address Age (min) Hardware Addr Type Interface
Internet 14.160.38.130 - c471.fe1e.f0c0 ARPA Vlan1
Internet 14.160.38.1 1 0007.7d75.88c0 ARPA Vlan1
BGL-3700-3#
Ping with options from 14.160.38.1
*Mar 1 10:37:33.205 AEDT: SW-FWD-Q:IP packet: Local Port Fwding L3If:Vlan1 L2If:GigabitEthernet2/0/2 DI:0x2F, LT:7, Vlan:1 SrcGPN:56, SrcGID:56, ACLLogIdx:0x0, MacDA:c471.fe1e.f0c0, MacSA: 0007.7d75.88c0 IP_SA:14.160.38.1 IP_DA:14.160.38.130 IP_Proto:1 IP Opts
TPFFD:D8C00038_00010001_00A00076-0000002F_E2C50000_00000000
*Mar 1 10:37:33.205 AEDT: SW-FWD-Q:Consumed by SW-Bridging: Local Port Fwding L3If:Vlan1 L2If:GigabitEthernet2/0/2 DI:0x2F, LT:7, Vlan:1 SrcGPN:56, SrcGID:56, ACLLogIdx:0x0, MacDA:c471.fe1e.f0c0, MacSA: 0007.7d75.88c0 IP_SA:14.160.38.1 IP_DA:14.160.38.130 IP_Proto:1 IP Opts
TPFFD:D8C00038_00010001_00A00076-0000002F_E2C50000_00000000
*Mar 1 10:37:53.765 AEDT: SW-FWD-Q:IP packet: Local Port Fwding L3If:Vlan1 L2If:GigabitEthernet2/0/2 DI:0x2F, LT:7, Vlan:1 SrcGPN:56, SrcGID:56, ACLLogIdx:0x0, MacDA:c471.fe1e.f0c0, MacSA: 0007.7d75.88c0 IP_SA:14.160.38.1 IP_DA:14.160.38.130 IP_Proto:1 IP Opts
TPFFD:D8C00038_00010001_00A00076-0000002F_E2C50000_00000000
A good practice for protecting and monitoring CPU utilization is to configure the process CPU threshold and to configure the switch to control broadcast, multicast, and unicast traffic per interface.
22
23
Debug traffic received by the CPU. In the case below, “routing-protocol-q” is shown. Packet ingress intf, Dest MAC, Src MAC, Dest IP, and Src IP are shown.
24
When the number of free buffers falls below the watermark (32), throttling might occur, resulting in packet drops and slow responsiveness to network management.
25
Receives a copy of the traffic for which an ICMP packet needs to be generated. Hardware forwarding of the packet still occurs
26
(due to throttling mechanism it won’t reach 99%) of which
27
Add featureset influence over CPU --3750X - 22%- 50% (depending on number of switches)
28
Configure traffic storm control to prevent packets from flooding the LAN, creating excessive traffic, and degrading network performance.
29
30
I/O memory is not used for normal packet switching
31
Note: lowest free level since boot up
32
Memory Allocation Failure: Memory allocation failure is the condition where the system has used all available memory (temporarily or permanently), or the memory has fragmented into such small pieces that the switch cannot find a usable available block.
Memory Leak: A memory leak occurs when a process requests or allocates memory and then forgets to free (de-allocate) the memory when it is finished with that task. As a result, the memory block is reserved until the system is reloaded. Over time, more and more memory blocks are allocated by that process until there is no free memory available.
33
34
Use caution while running the command; it might cause CPU spikes. Run it multiple times to benchmark.
In the above output:
Total represents the total number of buffers in the pool, including used and unused buffers.
Permanent identifies the permanent number of allocated buffers in the pool. These buffers are always in the pool and cannot be trimmed.
In free list identifies the number of buffers currently in the pool that are available for use.
Min identifies the minimum number of buffers that the system should attempt to keep in the free list. If the number of buffers in the free list falls below the min value, the system attempts to create more buffers for that pool.
Max-allowed identifies the maximum number of buffers that are allowed in the free list.
Hits identifies the number of buffers that have been requested from the pool. The hits counter provides a mechanism to determine which pool must meet the highest demand for buffers.
Misses identifies the number of times that a buffer has been requested and the system detected that additional buffers were required in the pool. The misses counter represents the number of times the system has been forced to create additional buffers.
Trims identifies the number of buffers that the system has trimmed from the pool when the number of buffers in the free list exceeded the number of max-allowed buffers.
35
Created identifies the number of buffers that have been created in the pool.
Failures identifies the number of failed buffer requests. When IOS fails to get a Small buffer, it does not drop the packet. It increments the failures counter and falls through to the next buffer level, the Middle buffer, and requests a buffer there. If it fails to get a Middle buffer, it requests the next level, which is a Big buffer. This process continues until it reaches the Huge buffer pool. If it fails to get a Huge buffer, it drops the packet.
No memory identifies the number of failures caused by insufficient memory to create additional buffers.
Buffer misses are not necessarily a bad thing, as long as the system is able to create additional buffers. The fields to look for in the 'show buffers' output are Failures and No memory. If there are a lot of Failures and No memory counts (constantly incrementing) for any buffer pool, try to narrow down the source of the buffer failure. You can use 'show memory debug leaks' to detect I/O memory leaks as well. Remember, however, that the memory leak detector must be invoked multiple times, and only leaks that consistently appear in all reports should be interpreted as leaks. This is especially true for packet buffer leaks.
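The fall-through behaviour described for the Failures counter can be sketched as follows. This is a simplified model, not IOS code: the function name `get_buffer` and the dictionaries are hypothetical, only the four pools named in the notes are listed, and buffer creation on a miss is left out.

```python
# Pools in the order the notes describe; intermediate pools are omitted.
POOL_ORDER = ["Small", "Middle", "Big", "Huge"]


def get_buffer(free: dict, failures: dict, start: str = "Small"):
    """Try each pool from `start` upward.

    Returns the name of the pool that supplied a buffer, or None when
    even the last pool is empty, which models a dropped packet.
    """
    for pool in POOL_ORDER[POOL_ORDER.index(start):]:
        if free[pool] > 0:
            free[pool] -= 1
            return pool
        # No free buffer here: count the failure and fall through.
        failures[pool] += 1
    return None


free = {"Small": 0, "Middle": 0, "Big": 1, "Huge": 0}
failures = {name: 0 for name in POOL_ORDER}
print(get_buffer(free, failures))  # served from a larger pool
print(get_buffer(free, failures))  # every pool is empty now
print(failures)
```

In this model a rising Failures count on a small pool is harmless as long as a larger pool still serves the request; only the final `None` corresponds to a drop.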
35
36
37
38
39
40
41
42
View ASIC stats for the ingress queue (enqueued and dropped) and the supervisor queue. Output is different for the C3750X than for the C3750G.
The C2960S does not have ingress queues.
43
44
45
Why have Ingress QoS?
On the C2960-S the smallest configurable policing rate is 16 Kbps; it is 8 Kbps for everything else.
2960-X and 2960-XR will follow a similar model
48
49
Assume there’s a policer defined on the interface via a service policy
Transition to slower speed link – packets take longer to egress than ingress
Eg: Gigabit interfaces for Data Center Servers and old IP Phones
Over Subscription : Many interfaces transmitting to one egress interface
50
A small burst from the 10Gig (faster) interface causes congestion on 100Mbps (slower) interface
51
Total Passengers: 538 First Class: 9 Business Class: 80 Economy Upper Deck: 106 Economy Lower Deck: 343
53
Buffers: relative allocation among queues
Reserved: minimum percentage of buffer reserved by each queue; the extra amount is released to the common buffer
T1, T2, MAX: flexible thresholds, expressed as a percentage of the nominal queue buffer, which can be used
Each traffic class maps to a specific queue number and threshold number (T1, T2 or T3=MAX)
For example, the orange class maps to Q2-T1 and the violet class to Q2-T3 (Q2-MAX), allowing more violet packets to queue as they can use common pool buffer space
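The effect of mapping two classes to different thresholds of the same queue can be sketched as follows. The threshold percentages, the nominal buffer count, and the function name `admits` are hypothetical values chosen for illustration. The sketch also assumes the common pool always has space, which a real switch does not guarantee.

```python
# Hypothetical thresholds, as a percentage of the nominal queue buffer.
THRESHOLDS_PCT = {"T1": 50, "T2": 80, "T3": 400}


def admits(queue_depth: int, nominal_buffers: int, threshold: str) -> bool:
    """Return True if a packet mapped to `threshold` may still be queued."""
    limit = nominal_buffers * THRESHOLDS_PCT[threshold] / 100
    return queue_depth < limit


# Queue 2 currently holds 60 buffers and its nominal allocation is 100.
depth, nominal = 60, 100
print("orange (Q2-T1) admitted:", admits(depth, nominal, "T1"))
print("violet (Q2-T3) admitted:", admits(depth, nominal, "T3"))
```

With these numbers the orange class is already being dropped while the violet class can keep queuing, which is the behaviour the slide describes for Q2-T1 versus Q2-T3.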
54
Using “maps”, traffic classes mapped to Queue and threshold
55
56
57
58
59
Note: egress interface speed change at top.
60
Queue numbering is 0-based in this slide, rather than 1-based as on the previous slide (with DSCP mapping). Old IOS versions do not have detailed queue output; it starts in 12.2(46)SE.
61
Besides using the “show mls qos interface <intf> statistics” command, use the platform command to get per-interface, per-queue statistics for drops (and for successfully egressed packets).
62
Fixing drops from previous example on Q4 and T1
63
Modify the queue-set from the previous slide to prevent packet drops. The threshold maximum is 3200%.
64
65
66
See Chapter “Configuring SDM Templates” in the Catalyst Switch Configuration Guide for more information
67
68
Content Addressable Memory (CAM)
• Very high speed lookup in large tables
• Binary operation: matches based on 0 or 1 values
• Exact match returns “hit”
• Useful for lookups where the lookup key must exactly match a table entry (VLAN + MAC in bridge table)
TCAM Tables
• Ternary Content Addressable Memory (TCAM)
• Very high-speed, fixed latency lookups with wildcarding
• Ternary operation: matches based on 0, 1 or X (don’t care)
• Longest match returns “hit”
• Memory structure broken into groups of “patterns” and associated “masks”
• Masks used to “wildcard” some bits in the patterns
• Useful for lookups where not all fields of the lookup key must match (CEF, ACL lookups)
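The difference between the two lookup types can be modelled in a few lines. This is only a software sketch: the function names `cam_lookup` and `tcam_lookup`, the tuple layout, and the sample entries are assumptions for illustration, and real hardware compares all entries in parallel rather than looping.

```python
def cam_lookup(table: dict, key):
    """Exact match: the key either equals a stored entry or it misses."""
    return table.get(key)


def tcam_lookup(entries, key: int):
    """Ternary match over (pattern, mask, result) entries.

    A mask bit of 1 means the bit must match; 0 means "don't care".
    Among all matching entries, the one with the most cared-for bits
    (the longest match) wins.
    """
    best, best_bits = None, -1
    for pattern, mask, result in entries:
        if (key & mask) == (pattern & mask):
            bits = bin(mask).count("1")
            if bits > best_bits:
                best, best_bits = result, bits
    return best


# CAM: (VLAN, MAC) must match exactly, as in a bridge table.
bridge_table = {(1, "0007.7d75.88c0"): "Gi2/0/2"}
print(cam_lookup(bridge_table, (1, "0007.7d75.88c0")))

# TCAM: 10.0.0.0/8 and 10.1.0.0/16 expressed as pattern + mask.
routes = [
    (0x0A000000, 0xFF000000, "via A"),
    (0x0A010000, 0xFFFF0000, "via B"),
]
print(tcam_lookup(routes, 0x0A010203))  # 10.1.2.3 matches both entries
```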
69
70
71
72
73
Power discovery allows switches and PoE-capable devices to convey power information. LLDP-MED provides information on how the device is powered (from the line, from a backup source, from an external power source, etc.), power priority (how important is it that this device has power?), and how much power the device needs. NOTE: LLDP-MED just advertises device consumption and is not a negotiation protocol. A third-party IEEE device needs the IEEE 802.3at LLDP Power-via-MDI protocol to be able to use PoE+ power (> 15.4W).
Cisco Discovery Protocol includes separate TLVs for power requested and power available, allowing the switch and the PoE-capable device to negotiate the power used. Some Cisco IP phones can operate at multiple power settings, lowering their consumption when less power is available. Using CDP, the PD requests the worst-case power required (including the link loss). An LLDP PD requests only the power required; the PSE has to add the link loss values.
When a powered device connected to a PoE+ port restarts and sends a CDP or LLDP packet with a power TLV, the switch locks to the power-negotiation protocol of that first packet and does not respond to power requests from the other protocol. For example, if the switch is locked to CDP, it does not provide power to devices that send LLDP requests. If CDP is disabled after the switch has locked on it, the switch does not respond to LLDP power requests and can no longer power on any accessories. In this case, you should restart the powered device.
74
In the PoE controllers, there are three separate current thresholds that are used for different purposes: I(cut), I(limit), and I(short).
I(cut) is the threshold at which power is removed from the port if the PD draws more power than allocated (e.g. 15.4W).
I(limit) is the threshold at which the PoE controller starts to reduce the port voltage in order to control the current; it is not used to remove power from the port completely.
I(short) is the threshold at which the port sees a very fast current spike that must be dealt with immediately; it bypasses all the timers that are used to remove power from the port, and the port shuts down immediately.
An Imax error is reported by the PoE controller of the switch when a PoE PD misbehaves and draws more power (port current) than its specified limit. An Imax error is reported after the device is powered up, so it is an operating fault. It is raised when Iport > Icut for a time period of Tovld (50-75 milliseconds), where Iport is the port current, Icut is the port cut-off current, and Tovld is the time duration for the overload condition to be reported. The typical values of Icut in a switch vary by PoE controller component, but they are always within the IEEE range.
A Tstart error is reported when the device violates Tinrush (50 milliseconds); that is, while powering up, the device draws a port inrush current greater than Iinrush (450 mA) for at least Tinrush. Tstart is a start-up fault that occurs before the device has even reported Power Good. (Tinrush is the port start-up current monitor time; Iinrush is the start-up current.)
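The Imax condition (Iport above Icut for at least Tovld) can be modelled as a simple timer over current samples. This is a sketch only: the function name `overload_reported`, the 10 ms sampling interval, and the 350 mA Icut value are hypothetical, and a real PoE controller implements this in hardware.

```python
def overload_reported(samples_ma, icut_ma, tovld_ms, sample_interval_ms):
    """Return True if the current stays above Icut for at least Tovld.

    `samples_ma` is a sequence of port current readings in mA taken
    every `sample_interval_ms` milliseconds. A reading at or below
    Icut restarts the overload timer.
    """
    over_ms = 0
    for current in samples_ma:
        if current > icut_ma:
            over_ms += sample_interval_ms
            if over_ms >= tovld_ms:
                return True
        else:
            over_ms = 0
    return False


ICUT_MA = 350   # hypothetical cut-off current
TOVLD_MS = 50   # lower bound of the 50-75 ms range in the notes

short_spike = [300, 400, 400, 300]
sustained = [300, 400, 400, 400, 400, 400]
print(overload_reported(short_spike, ICUT_MA, TOVLD_MS, 10))
print(overload_reported(sustained, ICUT_MA, TOVLD_MS, 10))
```

The short spike lasts only 20 ms and is ignored, while the sustained draw crosses the 50 ms mark and would be reported.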
75
The workaround is present on the following platforms only: 3750-E, 3560-E, 3750-X, 3560-X, 3560-C, 2960S, 2960, 2960C. The other platforms do not support 2X power mode, and the workaround there would be to use a longer cable. See DDTS CSCsw18530.
75
show platform frontend-controller subordinate <number>: displays the statistics of errors received as reported by a subordinate. In this command output, check the state of the subordinate and look for I2C errors. If the I2C errors are non-zero, check whether they are incrementing; if they are, reload. If the issue still exists, it could be bad hardware.
76
Debug commands should always be run with care. Specific debug conditions can be used where available. A debug condition can help to keep only the debug output for interface x/x.
77
78
79
Check Major version
80
81
82
show switch stack-ring activity was introduced in 12.2(20)SE.
A new, out-of-the-box switch (one that has not joined a switch stack or has not been manually assigned a stack member number) ships with a default stack member number of 1. When it joins a switch stack, its default stack member number changes to the lowest available member number in the stack. Stack members in the same switch stack cannot have the same stack member number. Every stack member, including a standalone switch, retains its member number until you manually change the number or unless the number is already being used by another member in the stack.
If you manually change the stack member number by using the “switch current-stack-member-number renumber new-stack-member-number” global configuration command, the new number goes into effect after that stack member resets (or after you use the “reload slot stack-member-number” privileged EXEC command) and only if that number is not already assigned to any other members in the stack.
“show platform sf-asic stat ?” gives more detailed stack statistics for the 3750E.
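The numbering rule (keep the current number unless it is taken, otherwise use the lowest available one) can be sketched as follows. The function name `assign_member_number` is hypothetical, and the limit of nine members is an assumption not stated in these notes.

```python
def assign_member_number(current: int, in_use: set, max_members: int = 9) -> int:
    """Return the member number a joining switch ends up with.

    The switch keeps `current` if no other member uses it; otherwise it
    takes the lowest number that is still free. `max_members` is an
    assumed stack size limit.
    """
    if current not in in_use:
        return current
    for number in range(1, max_members + 1):
        if number not in in_use:
            return number
    raise ValueError("no free stack member number")


# A new switch (default number 1) joins a stack that uses 1, 2 and 4.
print(assign_member_number(1, {1, 2, 4}))
# A switch previously numbered 5 joins a stack that uses 1 and 2.
print(assign_member_number(5, {1, 2}))
```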
The LED on the port with the corresponding switch number will illuminate.
For example, if the switch is #4 in the stack, port 4’s LED will light up.
83
To stop stack port flap: Switch <> stack port <> en/disable
show switch stack-ports summary was introduced in 12.2(50)SE
“# Changes to LinkOK” = number of times the stack port went into Link OK
“cable length” is in centimetres
84
85
86
87
88
Note: this MAC matches the slide with the “show platform forward ...” command example
89
90
91
92
Use this command to view the egress interface for Layer2 forwarding. In this case egress is Gi1/0/4
93
----- Meeting Notes (4/11/13 15:58) ----- example 1 and 2
94
95
96
97
Notes: ios view of how things should be.
98
The switch does not make fwding decisions based on icmp values. But, the command requires them. Just put in anything in range 0-255
99
100
101
102
Some failure scenarios
103
104
105
Note: change date / time to use datetime, and not uptime. For consistency
106
107
109
110
111
112
113
114
The Catalyst 2960/3650/3750 supports four egress queues, which can be configured on a per-interface basis to operate in either 4Q3T or 1P3Q3T modes. Additionally, the Catalyst 2960/3650/3750 supports two queue-sets, allowing certain interfaces to be configured in one manner and others to be configured in a different manner. The Catalyst 2960/3650/3750 has Queue 1 (not Queue 4) as the optional priority queue; in a converged campus environment it is recommended to enable the priority queue via the priority-queue out interface command. The three remaining egress queues on the Catalyst 2960/3650/3750 are scheduled by a Shaped Round-Robin (SRR) algorithm, which can be configured to operate in shaped mode or in shared mode. In shaped mode, assigned bandwidth is limited to the defined amount; in shared mode, any unused bandwidth is shared among other classes (as needed). To make the queuing structure consistent with the previously discussed best-practice queuing principles:
Queues 2 through 4 should be set to operate in shared mode (which is the default mode of operation on Queues 2 through 4). The ratio of the shared weights determines the relative bandwidth allocations (the absolute values are meaningless). Since the PQ of the Catalyst 2960/3650/3750 is Q1 (not Q4 as in the Catalyst 3550), the entire queuing model can be flipped upside down, with Q2 representing the Critical Data queue, Q3 representing the Best Effort queue, and Q4 representing the Scavenger queue. Therefore, shared weights of 70, 25, and 5 can be assigned to Queues 2, 3, and 4, respectively.
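The statement that only the ratio of the shared weights matters can be checked with a little arithmetic. The sketch below assumes a 1000 Mbps link with all three queues congested and ignores the priority queue; the function name `shared_bandwidth` is hypothetical.

```python
def shared_bandwidth(weights: dict, link_mbps: float) -> dict:
    """Split the link bandwidth in proportion to the shared weights."""
    total = sum(weights.values())
    return {queue: link_mbps * w / total for queue, w in weights.items()}


# Weights from the notes: Q2 = 70, Q3 = 25, Q4 = 5.
print(shared_bandwidth({"Q2": 70, "Q3": 25, "Q4": 5}, 1000))

# Different absolute values with the same ratio give the same split.
print(shared_bandwidth({"Q2": 14, "Q3": 5, "Q4": 1}, 1000))
```

In shared mode these figures are minimum guarantees under congestion; a queue can use more when the others are idle.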
115
116
117
Note: add X and S.
C3560E(config)#diagnostic monitor test ?
<1-6> Test ID Number
ID Test Name [On-Demand Test Attributes]
--- -------------------------------------------
1 TestPortASICStackPortLoopback [B*N****]
2 TestPortASICLoopback [B*D*R**]
3 TestPortASICCam [B*D*R**]
4 TestPortASICRingLoopback [B*D*R**]
5 TestMicRingLoopback [B*D*R**]
6 TestPortASICMem [B*D*R**]
--- -------------------------------------------
WORD Test ID list (e.g. 1,3-6) or Test Name
all Select all test ID
C3560E(config)#
Scheduled Example:
switch(config)# diagnostic schedule switch 1 test 1 on jan 3 2003 23:32
switch(config)# diagnostic schedule switch 1 test 1 daily 14:45
switch(config)# diagnostic schedule switch 1 test all weekly Monday 3:33
switch(config)# diagnostic schedule switch 5 test 1,3-6 daily 23:55
Router# show run
Building configuration...
Current configuration : 4618 bytes
diagnostic schedule switch 1 test 1 on January 3 2003 23:32 cardindex 1
diagnostic schedule switch 1 test 2 daily 14:45 cardindex 1
diagnostic schedule switch 1 test all weekly Monday 3:33 cardindex 1
diagnostic schedule switch 5 test 1,3-6 daily 23:55 cardindex 1
117
118
119
The bottom two boxes are referencing diag test 2.
120
Here is an example of failure
Overall diagnostic result: MAJOR ERROR
Test results: (. = Pass, F = Fail, U = Untested)
1) TestPortAsicStackPortLoopback ---> .
2) TestPortAsicLoopback ------------> F
3) TestPortAsicCam -----------------> .
4) TestPortAsicRingLoopback --------> .
5) TestMicRingLoopback -------------> .
6) TestPortAsicMem ----------------->
121
Note: updated switch support for X and S
To disable OBFL: no hw-module module [switch-number] logging onboard
To clear all the OBFL data in the flash memory except for uptime and CLI commands information: clear logging onboard
- “show logging onboard status” to see if it is running and for which features
- “summary” option contains historical data (compressed)
- “continuous” option contains current data (more detailed)
- “detail” option contains summary and continuous output
- “copy logging onboard module flash:” creates a tar file in flash
- “archive tar /xtract flash:” to extract tar file contents as ASCII text
122
123
124
125
126
Assumes that Mcast routing or IGMP snooping is setup
127
128