Post on 06-Aug-2018
transcript
ACI Troubleshooting
Mioljub Jovanovic, Technical Leader
BRKACI-2102
• Introduction
• Understanding Faults and Health status
• Tools
• Troubleshooting scenarios
• Conclusion / Q&A
Agenda
# show int eth 1/1 | grep input
30 seconds input rate 97064 bits/sec, 66 packets/sec
input rate 97064 bps, 66 pps; output rate 95008 bps, 57 pps
20297397 input packets 6494649266 bytes
0 input error 0 short frame 0 overrun 0 underrun 0 ignored
0 input with dribble 72 input discard
The way we’re used to doing it
Good old CLI!!!
Example: Checking input rate on specific interface
John Chambers@CiscoLive #clus, San Diego 2015
The way we do it in APIC
Visualize interface input/output
> moquery -c eqptIngrPkts5min -f 'eqpt.IngrPkts5min.unicastRate>"1000"' -o xml
…
<eqptIngrPkts5min childAction="" cnt="18" dn="topology/pod-1/node-101/sys/phys-[eth1/34]/CDeqptIngrPkts5min" … status="" unicastAvg="10833" unicastBase="0" unicastCum="2390904" unicastLast="18809" unicastMax="31630" unicastMin="2075" unicastPer="194995" unicastRate="1089.254093" unicastSpct="0" unicastThr="" unicastTr="0" unicastTrBase="503518"/>
</imdata>
> moquery -c eqptIngrPkts5min -f 'eqpt.IngrPkts5min.unicastRate>"1000"' | egrep -e "^dn|^unicastRate"
dn : topology/pod-1/node-101/sys/phys-[eth1/34]/CDeqptIngrPkts5min
unicastRate : 1742.12
The way we can do it with ACI
Query any managed object (MO) for data we need!
example: finding interface with unicast rate > 1000
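The same threshold check can also be done offline on saved `-o xml` output. The following is a minimal Python sketch, assuming the XML layout shown above; the second sample object (eth1/1 at 12.5) is hypothetical and only there to show something being filtered out.

```python
import xml.etree.ElementTree as ET

# Trimmed sample: the first object is taken from the output above,
# the second one is a hypothetical low-rate interface.
SAMPLE = """<imdata>
<eqptIngrPkts5min dn="topology/pod-1/node-101/sys/phys-[eth1/34]/CDeqptIngrPkts5min" unicastRate="1089.254093"/>
<eqptIngrPkts5min dn="topology/pod-1/node-101/sys/phys-[eth1/1]/CDeqptIngrPkts5min" unicastRate="12.5"/>
</imdata>"""


def busy_interfaces(xml_text, min_rate=1000.0):
    """Return (dn, unicastRate) pairs whose rate is above min_rate."""
    root = ET.fromstring(xml_text)
    hits = []
    for mo in root.iter("eqptIngrPkts5min"):
        rate = float(mo.get("unicastRate", "0"))
        if rate > min_rate:
            hits.append((mo.get("dn"), rate))
    return hits


if __name__ == "__main__":
    for dn, rate in busy_interfaces(SAMPLE):
        print(dn, rate)
```

Filtering on the APIC itself (as moquery does) is usually preferable; a local filter like this is mainly useful when working from a saved dump.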
• Q: That’s cool, but how do I know which object/class to query?
• Q: It looks cryptic to me. How do I find the meaning of each field? See the next slide for the answer.
APIC Management Information Model Reference
direct URL
https://apic/doc/html/
From the WebUI
Connect to APIC
apic 3
apic 2
apic 1
APIC Cluster
CLI (ssh)
Visore
APIC UI
Web Browser
Connect to switch
spine 1 spine 2
leaf 1 leaf 2 leaf 3 leaf 4 leaf 5
ACI Fabric
We could connect directly to switches as well
- ssh or console
- visore
- REST
CLI Available at the Switch
AAA via TACACS+, RADIUS and LDAP is supported when logging in to the switch CLI console.
Configuration mode is not supported at the switch console.
There are two scenarios where administrators would log in to the switch console:
• From the APIC UI, the admin can log in remotely to the switch console
• Log in directly via the serial console port on the switch front panel, or SSH to the management IP via out-of-band or in-band
For the majority of use cases, the admin should use the APIC.
Using username "admin".
Application Policy Infrastructure Controller
admin@apic1:~> acidiag fnvread
  ID   Name     Serial Number   IP Address       Role    State      LastUpdMsgId
---------------------------------------------------------------------------------
 101   leaf1    SAL18CLUX85     10.0.40.66/32    leaf    active     0
 102   leaf2    SAL18CBRU00     10.0.64.69/32    leaf    active     0
 103   leaf3    SAL18CLHR05     10.0.40.95/32    leaf    active     0
 104   leaf4    SAL18CAMS14     10.0.40.65/32    leaf    active     0
 105   leaf5    SAL18CCHD53     10.0.112.69/32   leaf    active     0
 201   spine1   SAL18CMUC75     10.0.64.65/32    spine   active     0
 202   spine2   SAL18CFRA11     10.0.64.64/32    spine   active     0
 203   spine3   SAL18CSAN15     10.0.40.69/32    spine   inactive   0x4000000ef664f
 204   spine4   SAL18CSFO14     10.0.112.67/32   spine   inactive   0x4000000ef6650
Total 9 nodes
admin@apic1> ssh leaf1
Fabric Health Overview
Troubleshooting: Where do we start?
Statistics Faults Diagnostics
…
Faults,
Health Scores
ELAM SPAN
Atomic Counters
On-Demand
Diagnostics
Drill-Downs
Thresholds
Fabric-wide monitoring
StatsSwitch
Nxos Cli
Troubleshooting, Drill Downs
After logging in to the APIC, you’ll
see the initial ‘Dashboard’ screen.
The APIC dashboard provides you with an ‘at-a-glance’ view of the system health and fault counts.
‘System Health’ shows you a view of the
overall health of the ACI system (all nodes, tenants, etc).
Graph is plotted as per fabricOverallHealthHist5min
fabricHealthTotal
API Inspector enables us to see REST API calls (GET, DELETE, POST) from the WebUI to the APIC
admin@apic1> moquery -d "/topology/HDfabricOverallHealth5min-0"
Total Objects shown: 1

# fabric.OverallHealthHist5min
index          : 0
childAction    :
cnt            : 31
dn             : /topology/HDfabricOverallHealth5min-0
healthAvg      : 82
healthMax      : 82
healthMin      : 82
healthSpct     : 0
healthThr      :
healthTr       : 0
lastCollOffset : 310
modTs          : never
repIntvEnd     : 2015-04-10T19:24:03.530+01:00
repIntvStart   : 2015-04-10T19:18:53.442+01:00
rn             : HDfabricOverallHealth5min-0
status         :
Prefer JSON or XML instead of text in moquery?
-> No problem, just specify "-o json" or "-o xml" with moquery
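When only the default text output is at hand, it can still be turned into something scriptable. Below is a minimal Python sketch, assuming the "# class" header and "key : value" layout seen in the moquery output above; the function name `parse_moquery` is an illustration, not an APIC tool.

```python
SAMPLE = """Total Objects shown: 1

# fabric.OverallHealthHist5min
index          : 0
cnt            : 31
dn             : /topology/HDfabricOverallHealth5min-0
healthAvg      : 82
healthMax      : 82
healthMin      : 82
repIntvEnd     : 2015-04-10T19:24:03.530+01:00
"""


def parse_moquery(text):
    """Split moquery text output into one dict per managed object."""
    objects = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("Total Objects"):
            continue
        if line.startswith("#"):
            # A "# class.Name" line starts a new object.
            current = {"_class": line.lstrip("# ").strip()}
            objects.append(current)
            continue
        if current is not None and ":" in line:
            # Split on the first colon only, so timestamps stay intact.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return objects


if __name__ == "__main__":
    for mo in parse_moquery(SAMPLE):
        print(mo["_class"], mo["dn"], mo["healthAvg"])
```

With "-o json" or "-o xml" a standard parser would be the more robust choice; the text parser is only a fallback.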
82
How is topology built?
admin@apic1:~> moquery -c fabricLink
…
# fabric.Link
n1           : 203
s1           : 1
p1           : 1
n2           : 101
s2           : 1
p2           : 51
dn           : topology/pod-1/lnkcnt-101/lnk-203-1-1-to-101-1-51
lcOwn        : local
linkState    : ok
modTs        : 2015-03-13T14:26:39.526+01:00
monPolDn     : uni/fabric/monfab-default
rn           : lnk-203-1-1-to-101-1-51
status       :
wiringIssues :
admin@bdsol-aci2-apic1:~> moquery -c fabricLink | egrep -e ^dn | head -5
dn : topology/pod-1/lnkcnt-1/lnk-102-1-2-to-1-2-2
dn : topology/pod-1/lnkcnt-2/lnk-102-1-4-to-2-2-2
dn : topology/pod-1/lnkcnt-3/lnk-102-1-6-to-3-2-2
dn : topology/pod-1/lnkcnt-201/lnk-102-1-49-to-201-1-34
dn : topology/pod-1/lnkcnt-202/lnk-102-1-50-to-202-1-34
• APIC WebUI and API Inspector
• Identify which objects are used to plot the topology
• Re-using fabricLink objects to identify the links
• We could create our own tool for topology, monitoring or troubleshooting
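As a starting point for such a tool, the fabricLink DNs already encode both ends of each link. The following Python sketch builds a node adjacency map from them; it assumes the `lnk-n1-s1-p1-to-n2-s2-p2` naming seen in the output above, and reading s/p as slot/port is an interpretation rather than something the slides state.

```python
import re
from collections import defaultdict

# DNs copied from the fabricLink output above.
SAMPLE_DNS = [
    "topology/pod-1/lnkcnt-1/lnk-102-1-2-to-1-2-2",
    "topology/pod-1/lnkcnt-2/lnk-102-1-4-to-2-2-2",
    "topology/pod-1/lnkcnt-3/lnk-102-1-6-to-3-2-2",
    "topology/pod-1/lnkcnt-201/lnk-102-1-49-to-201-1-34",
    "topology/pod-1/lnkcnt-202/lnk-102-1-50-to-202-1-34",
]

LINK_RE = re.compile(r"lnk-(\d+)-(\d+)-(\d+)-to-(\d+)-(\d+)-(\d+)$")


def parse_link(dn):
    """Return ((n1, s1, p1), (n2, s2, p2)) for a fabricLink DN."""
    match = LINK_RE.search(dn)
    if match is None:
        raise ValueError("not a fabricLink DN: " + dn)
    n1, s1, p1, n2, s2, p2 = (int(x) for x in match.groups())
    return (n1, s1, p1), (n2, s2, p2)


def build_adjacency(dns):
    """Map each node ID to the set of node IDs it has links to."""
    adjacency = defaultdict(set)
    for dn in dns:
        (n1, _, _), (n2, _, _) = parse_link(dn)
        adjacency[n1].add(n2)
        adjacency[n2].add(n1)
    return adjacency


if __name__ == "__main__":
    for node, peers in sorted(build_adjacency(SAMPLE_DNS).items()):
        print(node, sorted(peers))
```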
Visore – Web based MO query and browser tool
https://<IP>/visore.html
<?xml version="1.0" encoding="UTF-8"?><imdata totalCount="1"><fabricNode
adSt="on" childAction="" delayedHeartbeat="no" dn="topology/pod-1/node-101"
fabricSt="active" id="101" lcOwn="local" modTs="2015-04-08T14:38:44.546+02:00"
model="N9K-C9396PX" monPolDn="uni/fabric/monfab-default" name="bdsol-9396px-02"
role="leaf" serial="SAL18CLUS15" status="" uid="0" vendor="Cisco Systems, Inc"
version=""/></imdata>
fabricNode
adSt on
childAction
delayedHeartbeat no
dn topology/pod-1/node-101
fabricSt active
id 101
lcOwn local
modTs 2015-04-08T14:38:44.546+02:00
model N9K-C9396PX
monPolDn uni/fabric/monfab-default
name bdsol-9396px-02
role leaf
serial SAL18CLUS15
status
uid 0
vendor Cisco Systems, Inc
version
icurl -k 'https://apic/api/node/class/fabricNode.xml?query-target-filter=and(eq(fabricNode.id,"101"))'
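The class-query URL above follows a regular pattern, so it can be assembled programmatically. A minimal Python sketch, assuming only the URL shape shown in the icurl example; the helper name `class_query_url` is hypothetical, and the result is left unencoded to mirror the slide (an HTTP client may need to percent-encode the quotes).

```python
def class_query_url(host, cls, filters=None, fmt="xml"):
    """Build an APIC class-query URL with optional equality filters.

    filters maps property names to the values they must equal.
    """
    url = f"https://{host}/api/node/class/{cls}.{fmt}"
    if filters:
        terms = ",".join(
            f'eq({cls}.{prop},"{value}")' for prop, value in filters.items()
        )
        url += f"?query-target-filter=and({terms})"
    return url


if __name__ == "__main__":
    print(class_query_url("apic", "fabricNode", {"id": "101"}))
```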
The lower half of the screen shows node and tenant health.
Move these sliders down to
show only nodes / tenants
with lower health.
On the right, you’ll see the fault
counts by domain
(e.g. access, tenant, security)…
…type
(config, environmental, etc)…
…and APIC cluster health.
How to get object DN from GUI
Health Score
∑
Number between 0 and 100
Health Score
Perfect Health Score = 100
Almost perfect score
Let me think … weighted score
I need 1 more to become perfect …
Tools and utilities
Physical Network
• ping
• traceroute
• show (interface / table / etc)
• syslog
• SPAN
Abstracted Network
• properties (EP / TEP / contract)
• health scores / faults / events / audit
• itraceroute
• atomic counters
• statistics
• diagnostics (on-demand)
• SPAN
• ELAM
Network Monitoring and Troubleshooting Tools
Faults Health Audits Events
Statistics Call-home Syslogs SNMP
UI Tools
MIT access from ishell
admin@apic1:mit> cd /mit
admin@apic1:mit> ls -1l
total 3
drw-rw---- 1 admin admin 512 Apr 24 22:48 comp
drw-rw---- 1 admin admin 512 Apr 24 22:48 dbgs
drw-rw---- 1 admin admin 512 Apr 24 22:48 expcont
drw-rw---- 1 admin admin 512 Apr 24 22:48 fwrepo
drw-rw---- 1 admin admin 512 Apr 24 22:48 topology
drw-rw---- 1 admin admin 512 Apr 24 22:48 uni
moquery – CLI based MO query tool
admin@apic1:~> moquery -c fabricNode -f 'fabric.Node.id=="1"'
Total Objects shown: 1

# fabric.Node
id               : 1
adSt             : on
delayedHeartbeat : no
dn               : topology/pod-1/node-1
fabricSt         : unknown
lcOwn            : local
modTs            : 2015-04-08T14:27:16.290+02:00
model            : APIC
monPolDn         : uni/fabric/monfab-default
name             : apic1
rn               : node-1
role             : controller
serial           : SAL18CLUS15
status           :
uid              : 0
vendor           : Cisco Systems, Inc
version          :
• Find all EPGs with access encapsulation VLAN 3399
moquery -c fvRsPathAtt -o json -f 'fv.RsPathAtt.encap=="vlan-3399"'
• Obtain AAEP based on interface policy group
moquery -c "infraAccPortGrp" | egrep "^dn" | awk '{ print "moquery -d "$3" -x query-target=children | egrep tDn" }'
• Query the actual policy group
moquery -d "uni/infra/funcprof/accportgrp-N3k_PG_ddastoli" -x query-target=children
moquery – some examples
mobrowser – CLI based MO browser tool
APIC Logs
• /var/log/dme/log
• /var/log/dme/oldlog
Switch Logs
• /var/log/dme/log
• /var/log/dme/oldlog
• /var/sysmgr/tmp_logs/
DME running on switch
NXOS Process
NXOS Process
NXOS Process
Objectstore (Shared memory)
Switch
Get logical MO from PM and push concrete MO to configure switch
Delegate local faults, events, records, health score
Atomic counters, core handling
Collect stats from NXOS and push to APIC
Opflex server for external opflex elem
acidiag – your friend at tough times
admin@apic1:~> acidiag --help
...
avread               read appliance vector
fnvread              read fabric node vector
fnvreadex            read fabric node vector (extended mode)
rvread               read replica vector
rvreadle             read replica leader summary
crashsuspecttracker  read crash suspect tracker state
validateimage        validate image
version              show ISO version
preservelogs         stash away logs in preparation for hard reboot
platform             show platform
verifyapic           run apic installation verify command
bond0test            run bond0 test
touch                touch special files
run                  run specific commands and capture output
installer            installer
start                start a service
stop                 stop a service
restart              restart a service
reboot               reboot
mkdir /tmp/tac-655555555
cd /tmp/tac-655555555
icurl -k 'https://localhost/api/class/faultInfo.xml' > faultInfo.xml
icurl -k 'https://localhost/api/class/faultRecord.xml' > faultRecord.xml
icurl -k 'https://localhost/api/class/eventRecord.xml' > eventRecord.xml
icurl -k 'https://localhost/api/class/aaaModLR.xml' > aaaModLR.xml
icurl -k 'https://localhost/api/class/aaaSessionLR.xml' > aaaSessionLR.xml
cd /tmp
tar zcvf tac-655555555.tgz tac-655555555
cp tac-655555555.tgz /data/techsupport
icurl – CLI utility for data transfer
Now you can download the file from the following URL:
https://apic/files/1/techsupport/tac-655555555.tgz
We can import and analyze active faults, fault history, events history, accounting log, login history
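Once the XML files are downloaded, a first pass over them can be scripted. This is a minimal Python sketch that tallies records by severity; it assumes each child of `<imdata>` carries a `severity` attribute, and the sample records below are hypothetical stand-ins for a real faultInfo.xml.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; a real faultInfo.xml carries many more attributes.
SAMPLE = """<imdata totalCount="3">
<faultInst code="F0001" severity="minor"/>
<faultInst code="F0002" severity="warning"/>
<faultInst code="F0003" severity="minor"/>
</imdata>"""


def severity_summary(xml_text):
    """Count the direct children of <imdata> per severity value."""
    root = ET.fromstring(xml_text)
    return Counter(mo.get("severity", "unknown") for mo in root)


if __name__ == "__main__":
    for severity, count in severity_summary(SAMPLE).most_common():
        print(f"{count:5d} {severity}")
```

For a file on disk, reading it with `open(path).read()` and passing the text to `severity_summary` is enough.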
iShell filesystem - scriptcontainer
/ - APIC root filesystem
/var/run/bashroot
…
bashroot/var/log/dme/log
…/mgmt/log/scriptcontainer.log
/ - ishell root folder
/var/log/dme/log
/debug
/aci
/mit
Linux
admin shell
Troubleshooting scenarios
2 x spine
2 x leaf N9K-9396px(48 x 1/10G SFP+)
2 x leaf N9K-93128tx(96 x 1/10G Base-T)
1 x leaf N9K-C9372px(48 x 1/10G SFP+)
3 x APIC
Topologyspine 1 spine 2
leaf 1 leaf 2 leaf 3 leaf 4 leaf 5
apic 3apic 2apic 1
10Gbps
ACI Fabric
That’s all nice, but what if I can’t connect to WebUI
Troubleshooting Web UI performance
Open the web browser’s Developer Tools Network tab
Web browser’s Developer Tools Network tab
Showing latency for each HTTP request to the APIC server
Ctrl + Shift + I or F12, or Cmd + Opt + I
REST API call without webtoken
Verify if APIC is able to process a REST API call without login / APIC-cookie
http://apic/api/aaaListDomains.xml
Double-click on the specific request to
check timing details.
10ms looks good
zegrep -A5 "aaaListDomains.xml" /var/log/dme/log/nginx.bin.log.*
nginx.bin.log.14.gz:
29701||15-05-10 23:11:05.701+02:00||nginx||DBG4||||Request received /api/aaaListDomains.xml||../common/src/rest/./Rest.cc||62 bico 56.827
29701||15-05-10 23:11:05.701+02:00||nginx||DBG4||||httpmethod=1; from 10.48.16.90; url=/api/aaaListDomains.xml; urloptions=||../common/src/rest/./Request.cc||103
29720||15-05-10 23:11:05.705+02:00||nginx||DBG4||co=doer:255:127:0xff00000003249f06:1||outCode: 200||../common/src/rest/./Worker.cc||357
29720||15-05-10 23:11:05.705+02:00||nginx||DBG4||co=doer:255:127:0xff00000003249f06:1||notifyEvent data ready 0x0||../common/src/rest/./Worker.cc||370
29701||15-05-10 23:11:05.706+02:00||nginx||DBG4||||Reply data (request 831 size 211) <?xml version="1.0" encoding="UTF-8"?><imdata totalCount="4"><aaaLoginDomain name="LOCAL"/><aaaLoginDomain name="RADIUS"/><aaaLoginDomain name="TACACS"/><aaaLoginDomain name="DefaultAuth" guiBanner=""/></imdata> Cookie: NONE||../common/src/rest/./Rest.cc||120
How does it look from APIC’s side?
We could use any other criteria for grep: IP, time stamp, etc.
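The log lines above use "||" as a field separator, which makes server-side timing easy to extract. Here is a minimal Python sketch that parses two of the lines and measures the time between them; the field names and the yy-mm-dd reading of the timestamp are assumptions based on the sample, not documented log semantics.

```python
from datetime import datetime, timedelta

# Two lines copied from the nginx.bin.log excerpt above.
REQUEST = ("29701||15-05-10 23:11:05.701+02:00||nginx||DBG4||||"
           "Request received /api/aaaListDomains.xml||"
           "../common/src/rest/./Rest.cc||62 bico 56.827")
RESULT = ("29720||15-05-10 23:11:05.705+02:00||nginx||DBG4||"
          "co=doer:255:127:0xff00000003249f06:1||outCode: 200||"
          "../common/src/rest/./Worker.cc||357")


def parse_dme_line(line):
    """Split one DME log line into named fields (names are assumed)."""
    fields = line.split("||")
    return {
        "pid": fields[0],
        "time": datetime.strptime(fields[1], "%y-%m-%d %H:%M:%S.%f%z"),
        "process": fields[2],
        "level": fields[3],
        "context": fields[4],
        "message": fields[5],
    }


def elapsed_ms(first, last):
    """Whole milliseconds between two parsed log lines."""
    return (last["time"] - first["time"]) // timedelta(milliseconds=1)


if __name__ == "__main__":
    req, res = parse_dme_line(REQUEST), parse_dme_line(RESULT)
    print(req["message"], "->", res["message"], elapsed_ms(req, res), "ms")
```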
zegrep -A5 "aaaListDomains.json" /var/log/dme/log/nginx*
Note: JSON is used by the APIC WebUI, while we used XML.
APIC DME Debug URL
http://apic1/api/nginx/debug/tacacs.xml
Debug data of DMEs is also exposed via REST
Same debug data is also accessible from ishell
admin@apic1:~> cat /debug/bdsol-aci3-apic1/nginx/tacacs/mo
RequestsDispatched : 1511
ResponsesReceived : 1498
Check all other nifty stats by executing “find /debug/* …”
Example:
admin@apic1:~> find /debug/* -print -type f -exec cat {} \;
You can also check logs matching certain criteria.
The example below looks for TACACS logs or a specific time.
zegrep TAC_ /var/log/dme/log/nginx*
zegrep TAC_ /var/sysmgr/tmp_logs/nginx*
zegrep "15-05-09 03:48" /var/log/dme/log/*
Finding changes, faults during certain timeframe
System health change
We noticed slight decrease in System health
Is the cause known?
Do we need to perform Root Cause Analysis?
Were there any known changes, maintenance, etc.?
… we’re not sure … should we call SWAT?
Déjà vu?
We’ve suddenly experienced connectivity loss … nothing has been changed …
Let’s think for a second:
What is the most common
cause of all network incidents?
Change!
aaaModLR
aaaModLR - AAA audit log record, which is automatically generated whenever a user modifies an object.
moquery -c aaaModLR -f 'aaa.ModLR.created>"2015-05-07" and aaa.ModLR.created<"2015-05-10"'
We want to check whether there were any config changes
Match audit records (aaaModLR) between 2015-05-07 and 2015-05-10
We noticed slight decrease in System health
moquery -c aaaModLR -f 'aaa.ModLR.created=="2015-05-10"'
Match only on May 10th 2015
Example looking for audit records by date / time
admin@bdsol-aci2-apic1:~> moquery -c aaaModLR -f 'aaa.ModLR.created>"2015-05-07T17:00" and aaa.ModLR.created<"2015-05-11"'
# aaa.ModLR
id          : 8589938110
affected    : uni/fabric/outofsvc/rsoosPath-[topology/pod-1/paths-101/pathep-[eth1/12]]
cause       : transition
changeSet   :
childAction :
code        : E4208269
created     : 2015-05-08T15:22:04.317+01:00
descr       : Interface topology/pod-1/paths-101/pathep-[eth1/12] enabled
dn          : subj-[uni/fabric/outofsvc/rsoosPath-[topology/pod-1/paths-101/pathep-[eth1/12]]]/mod-8589938110
ind         : deletion
modTs       : never
rn          : mod-8589938110
severity    : info
status      :
trig        : config
txId        : 10720396
user        : admin
We don’t make changes on non-business days or on the day before them, so let’s see who performed any configuration between
Thursday evening and Monday morning
admin configured interface eth1/12 on node 101
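The Thursday-evening-to-Monday window used in the query above can be computed rather than typed by hand. A minimal Python sketch, assuming the `aaa.ModLR.created` filter syntax from the moquery example; the helper name and the 17:00 cut-off default are illustrative.

```python
from datetime import date, timedelta


def weekend_audit_filter(monday, evening="17:00"):
    """Build a moquery filter covering Thursday evening up to the given Monday."""
    if monday.weekday() != 0:
        raise ValueError("expected a Monday")
    thursday = monday - timedelta(days=4)
    return (
        f'aaa.ModLR.created>"{thursday.isoformat()}T{evening}" '
        f'and aaa.ModLR.created<"{monday.isoformat()}"'
    )


if __name__ == "__main__":
    flt = weekend_audit_filter(date(2015, 5, 11))
    print(f"moquery -c aaaModLR -f '{flt}'")
```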
OK, so we found there were some admin changes on eth1/12
faultRecord in GUI
We could also check:
eventRecord
healthRecord
double click
admin@apic1:~> moquery -c faultInst | egrep -e "^descr" | sort | uniq -c
      2 descr : Configuration failed for EPG default due to Not Associated With Management Zone
      3 descr : Datetime Policy Configuration for F5clock failed due to : access-epg-not-specified
      1 descr : Failed to form relation to MO AbsGraph-VEStandAloneFuncProfile of class vnsAbsGraph
      1 descr : Failed to form relation to MO fwP-default of class nwsFwPol in context uni/infra
      1 descr : Ntp configuration on leaf leaf1 is Not Synchronized
      1 descr : Ntp configuration on leaf leaf2 is Not Synchronized
      1 descr : Ntp configuration on spine spine1 is Not Synchronized
      1 descr : Power supply shutdown. (serial number DCB18CLUS15)
Using moquery to dump/sort active faults (faultInst)
moquery -c faultInst -f 'fault.Inst.descr=="Failed to form relation to MO AbsGraph-VEStandAloneFuncProfile …"'
Now we could query all faults by criteria – such as description (fault.Inst.descr)
quickly sorts all active faults
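The `egrep | sort | uniq -c` pipeline above has a direct equivalent for saved moquery output. A minimal Python sketch, assuming the "descr : …" line layout from the listing; the sample reuses a few of the descriptions shown there.

```python
from collections import Counter

# A few descr lines in the layout moquery prints them.
SAMPLE = """descr        : Ntp configuration on leaf leaf1 is Not Synchronized
descr        : Datetime Policy Configuration for F5clock failed due to : access-epg-not-specified
descr        : Datetime Policy Configuration for F5clock failed due to : access-epg-not-specified
dn           : ignored-by-this-sketch
"""


def count_descriptions(moquery_text):
    """Count identical fault descriptions, like sort | uniq -c."""
    counts = Counter()
    for line in moquery_text.splitlines():
        if line.startswith("descr"):
            # Split on the first colon only; descriptions may contain more.
            _, _, value = line.partition(":")
            counts[value.strip()] += 1
    return counts


if __name__ == "__main__":
    for descr, count in sorted(count_descriptions(SAMPLE).items()):
        print(f"{count:7d} descr : {descr}")
```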
L4–L7 Integration debugging
Troubleshooting: APIC Faults / Visore / debug.log / LTM log
APIC Faults
https://<APIC>/visore.html
/data/devicescript/F5.BIGIP.1.1.0/logs/debug.log
/var/log/*
APIC Faults
Double click on faults
If you need more details,
copy the affected object
Example L4-L7 fault details using the Visore tool
https://apic/visore.html
Paste the affected object in “Class or DN” field
Provide full details of the issues
APIC debug.log
Locate the APIC that contains the shard configuring the BIG-IP, then go to the following location:
You will see debug.log and periodic.log
You can “tail -f debug.log” to monitor the process
admin@apic1:~> cd /data/devicescript/F5.BIGIP.1.0.0/logs
admin@apic1:logs> ls -all
-rw-r--r-- 2 nobody nobody 52688 Sep 30 11:31 debug.log
-rw-r--r-- 2 nobody nobody 35492 Sep 30 11:30 periodic.log
APIC debug.log (faults)
2014-07-25 18:04:00,675 DEBUG 139789634365184 [172.23.76.198, 8534]: Faults: []
2014-07-25 18:05:47,466 DEBUG 139789634365184 [172.23.76.198, 8543]: result: serviceAudit {'stats': {'max': 20.035178899765015, 'num': 2, 'last': 20.035178899765015, 'avg': 16.63836646080017, 'min': 13.241554021835327}, 'result': {'faults': [([], 82, "Line 100 apic/service.py::modify: Could not configure service state: Server raised fault: 'Exception caught in Networking::urn:iControl:Networking/RouteDomainV2::get_identifier()\nException: Common::OperationFailed\n\tprimary_error_code : 17237812 (0x01070734)\n\tsecondary_error_code : 0\n\terror_string : 01070734:3: Configuration error: Invalid mcpd context, folder not found (/apic_5794)'")], 'state': 3, 'health': [([], 0)]}}
2014-07-25 18:05:47,467 DEBUG 139789634365184 [172.23.76.198, 8543]: Faults: [([], 82, "Line 100 apic/service.py::modify: Could not configure service state: Server raised fault: 'Exception caught in Networking::urn:iControl:Networking/RouteDomainV2::get_identifier()\nException: Common::OperationFailed\n\tprimary_error_code : 17237812 (0x01070734)\n\tsecondary_error_code : 0\n\terror_string : 01070734:3: Configuration error: Invalid mcpd context, folder not found (/apic_5794)'")]
Example: mcpd
APIC debug.log (faults)
2014-10-07 13:09:51,166 DEBUG 140447157077760 [198.18.128.130, 76]: Faults: []
2014-10-07 13:09:51,187 DEBUG 140447157077760 [None, None]: Waiting for task
2014-10-07 13:09:53,847 DEBUG 140447148685056 [198.18.128.130, 76]: route_domain: Allocated route domain 907
2014-10-07 13:09:53,957 DEBUG 140447148685056 [198.18.128.130, 76]: route_domain: Setting route domain 907 on device BIGIP1
2014-10-07 13:09:54,140 INFO 140447148685056 [198.18.128.130, 76]: Line 664 apic/service.py::_modify_vlan: Target: : Creating VLAN '4663_16387' ID 202
2014-10-07 13:09:56,532 INFO 140447148685056 [198.18.128.130, 76]: Line 679 apic/service.py::_modify_vlan: Target: : Modifying VLAN '4663_16387' interface '1.1'
2014-10-07 13:09:57,304 DEBUG 140447148685056 [198.18.128.130, 76]: result: serviceModify {'stats': {'max': 39.48741388320923, 'num': 4, 'last': 6.139014005661011, 'avg': 21.184859931468964, 'min': 6.139014005661011}, 'result': {'faults': [([(0, '', 4663), (7, '', '2752512_16387')], 81, "Line 383 apic/handlers.py::set_interface: device: : VLAN ifc update fail: Server raised fault: 'Exception caught in Networking::urn:iControl:Networking/VLAN::add_member()\nException: Common::OperationFailed\n\tprimary_error_code : 17236569 (0x01070259)\n\tsecondary_error_code : 0\n\terror_string : 01070259:3: Requested member (1.1) is untagged on another VLAN'")], 'state': 2, 'health': []}}
Example: Tagging mismatch
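The interesting part of these long debug.log lines is the `error_string` at the end. Below is a minimal Python sketch that pulls it out with a regular expression; it assumes the escaped `\n\t` layout of the sample lines, and treating the digit after the code as a severity level is an assumption.

```python
import re

# Shortened copy of the mcpd "Faults:" line shown above (raw string keeps \n, \t literal).
SAMPLE = (
    r"""2014-07-25 18:05:47,467 DEBUG 139789634365184 [172.23.76.198, 8543]: Faults: """
    r"""[([], 82, "Line 100 apic/service.py::modify: Could not configure service state: """
    r"""Server raised fault: 'Exception caught in """
    r"""Networking::urn:iControl:Networking/RouteDomainV2::get_identifier()"""
    r"""\nException: Common::OperationFailed"""
    r"""\n\tprimary_error_code : 17237812 (0x01070734)"""
    r"""\n\tsecondary_error_code : 0"""
    r"""\n\terror_string : 01070734:3: Configuration error: """
    r"""Invalid mcpd context, folder not found (/apic_5794)'")]"""
)

ERROR_RE = re.compile(r"error_string\s*:\s*([0-9A-Fa-f]{8}):(\d+):\s*(.*?)'")


def extract_error(log_line):
    """Return code, level and text of the first error_string, or None."""
    match = ERROR_RE.search(log_line)
    if match is None:
        return None
    code, level, text = match.groups()
    return {"code": code, "level": int(level), "text": text}


if __name__ == "__main__":
    print(extract_error(SAMPLE))
```

The hexadecimal code in `error_string` matches the decimal `primary_error_code` on the same line, which is a handy cross-check when searching vendor documentation.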
BIG-IP LTM log
Jul 19 11:57:53 apic-bigip2 notice mcpd[7439]: 01070638:5: Pool /apic_5668/apic_5668_webPool member /apic_5668/192.168.10.101%1295:80 monitor status down. [ /apic_5668/apic_5668_webMonitor: down ] [ was up for 20hrs:55mins:46sec ]
Jul 19 11:57:54 apic-bigip2 notice mcpd[7439]: 01070638:5: Pool /apic_5668/apic_5668_webPool member /apic_5668/192.168.10.102%1295:80 monitor status down. [ /apic_5668/apic_5668_webMonitor: down ] [ was up for 20hrs:55mins:47sec ]
Jul 19 11:57:54 apic-bigip2 notice mcpd[7439]: 01071682:5: SNMP_TRAP: Virtual /apic_5668/apic_5668_4096_Virtual-Server has become unavailable
Jul 19 11:57:54 apic-bigip2 err tmm[9357]: 01010028:3: No members available for pool /apic_5668/apic_5668_webPool
Jul 19 11:57:54 apic-bigip2 err tmm1[9357]: 01010028:3: No members available for pool /apic_5668/apic_5668_webPool
Jul 19 11:57:54 apic-bigip2 err tmm2[9357]: 01010028:3: No members available for pool /apic_5668/apic_5668_webPool
Jul 19 11:57:54 apic-bigip2 err tmm3[9357]: 01010028:3: No members available for pool /apic_5668/apic_5668_webPool
Jul 19 12:03:02 apic-bigip2 err iprepd[6725]: 015c0004:3: failed connect to 208.87.136.155 on 443
Jul 19 12:03:03 apic-bigip2 err iprepd[6725]: 015c0004:3: Certificate verification error: 18
Jul 19 12:03:03 apic-bigip2 err iprepd[6725]: 015c0004:3: nSendReceiveSsl failed SSL handshake
Jul 19 12:04:11 apic-bigip2 info pfmand[6925]: 01660009:6: Link: 2.1 is DOWN
Jul 19 12:04:11 apic-bigip2 info pfmand[6925]: 01660009:6: Link: 2.2 is DOWN
[root@bigip:Active:In Sync] log # cd /var/log
[root@bigip:Active:In Sync] log # ls ltm*
ltm ltm.11.gz ltm.2.gz ltm.4.gz ltm.6.gz ltm.8.gz
ltm.10.gz ltm.1.gz ltm.3.gz ltm.5.gz ltm.7.gz ltm.9.gz
SSH as root into BIG-IP and go to:
Example output
Access Encap to Fabric Encap
spine 1 spine 2
leaf 3 leaf 4 leaf 5leaf 1 leaf 2
EP A to EP B - simplified
1
1
1 2
3
2
3
Regular L2 packet
iVXLAN packet
Regular L2 packet
EP A EP B
spine 1 spine 2
leaf 3 leaf 4 leaf 5leaf 1 leaf 2
How to identify VLAN mapping
VM A
MAC: 00:00:33:33:33:33
linux VM A:
connected to ACI fabric
VLAN 3399
Scenario:
VM A is unable to reach other endpoints connected to the Fabric
- ping doesn’t work
- ARP doesn’t work
leaf 1
eth 1/34
What happens when packet from EP A reaches leaf
EP A
MAC: 00:00:33:33:33:33
To Spines
Cisco ASIC
Merchant
ASIC
8/12 x 40G
To servers/blade, switches
8/12 x 40G
48/96 x 10G
leaf 1
1
2
3
The packet first arrives at the
Merchant ASIC (BCM)
It is forwarded to the destination
if the destination is known on the BCM
If the destination is not learned in the BCM forwarding table, the packet is sent to the Cisco ASIC
Linux view
VM thinks its interface is in VLAN 3399
VM MAC: 00:00:33:33:33:33
switch# bcm-shell-hw "l2 show"
mac=52:54:00:b0:c4:81 vlan=57 GPORT=0x22 modid=0 port=34/xe33 Hit
mac=58:f3:9c:24:2e:87 vlan=15 GPORT=0x2 modid=0 port=2/xe1 Hit
mac=00:00:33:33:33:33 vlan=57 GPORT=0x22 modid=0 port=34/xe33 Hit
mac=52:54:00:c3:b8:2c vlan=58 GPORT=0x22 modid=0 port=34/xe33 Hit
mac=00:22:bd:e2:e2:e2 vlan=49 GPORT=0x7f modid=2 port=127 Static
bcm-shell-hw
Broadcom says it’s
VLAN 57
checking l2 forwarding table on Broadcom
switch# show mac address-table interface ethernet 1/34
Legend:
* - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
VLAN MAC Address Type age Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------------------
* 53 0000.3333.3333 dynamic - F F eth1/34
* 53 5254.00b0.c481 dynamic - F F eth1/34
* 54 5254.00c3.b82c dynamic - F F eth1/34
MAC learning from ACI switch
iShell CLI says it’s VLAN 53
from ishell command interface
module-1# show system internal eltmc info vlan access_encap_vlan 3399
vlan_id: 53 ::: hw_vlan_id: 57
vlan_type: FD_VLAN ::: bd_vlan: 52
access_encap_type: 802.1q ::: access_encap: 3399
fabric_encap_type: VXLAN ::: fabric_encap: 9891
sclass: 16387 ::: scope: 8
bd_vnid: 9891 ::: untagged: 0
acess_encap_hex: 0xd47 ::: fabric_enc_hex: 0x26a3
so which VLAN is it?
it’s iVXLAN 9891 ??
note: we’re in vsh_lc CLI
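The eltmc output above ties the three VLAN numbers together, and it can be parsed to make that mapping explicit. A minimal Python sketch, assuming the "key: value ::: key: value" layout shown; the key names (including the `acess_encap_hex` spelling) are copied from the output as printed.

```python
# Output copied from the eltmc command above.
SAMPLE = """vlan_id: 53 ::: hw_vlan_id: 57
vlan_type: FD_VLAN ::: bd_vlan: 52
access_encap_type: 802.1q ::: access_encap: 3399
fabric_encap_type: VXLAN ::: fabric_encap: 9891
sclass: 16387 ::: scope: 8
bd_vnid: 9891 ::: untagged: 0
acess_encap_hex: 0xd47 ::: fabric_enc_hex: 0x26a3
"""


def parse_eltmc(text):
    """Turn 'key: value ::: key: value' lines into a flat dict of strings."""
    info = {}
    for line in text.splitlines():
        for part in line.split(":::"):
            key, sep, value = part.partition(":")
            if sep:
                info[key.strip()] = value.strip()
    return info


if __name__ == "__main__":
    info = parse_eltmc(SAMPLE)
    print("access encap (wire VLAN):", info["access_encap"])
    print("switch CLI VLAN         :", info["vlan_id"])
    print("Broadcom hw VLAN        :", info["hw_vlan_id"])
    print("fabric encap (VXLAN)    :", info["fabric_encap"])
```

So the answer to "which VLAN is it?" is all of them: 3399 on the wire, 53 in the switch CLI, 57 on the Broadcom ASIC, and VXLAN 9891 inside the fabric.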
Is this actually possible with ACI?
ELAM
ELAM stands for Embedded Logic Analyzer Module
It is logic present in the ASICs that can capture and display one or more packets matching user-specified criteria, taken from the stream of packets processed by the ASIC
What is ELAM?
ELAM Support in Cisco ASIC
Lookup Block
Egress Pipeline (Fabric → Front Panel)
ELAM
ELAM
Ingress Pipeline (Front Panel → Fabric)
Parser Block
Packet RW Sideband
To Fabric
From BCM
Lookup Block
ELAM
ELAM
Parser Block
Packet RW Sideband
From Fabric
To BCM
Input Select Lines
Output Select Lines
Input Select Lines
Output Select Lines
ELAM Support in North Star (Cisco ASIC)
• The North Star data path is divided into ingress and egress pipelines
• 2 ELAMs are present in each pipeline (Input ELAM and Output ELAM)
• These ELAMs sit at the beginning and end of the lookup block
• ELAMs can be configured using the available select lines
• Packets can be captured on the input ELAM based on an output condition by configuring the ELAM in “reverse” mode
Limitations
• Packets can be captured based on either input select lines or output select lines, but not both
• ELAM configuration should happen in single-user mode
ELAM Support
Input Select Lines Supported
3  Outerl2-outerl3-outerl4
4  Innerl2-innerl3-innerl4
5  Outerl2-innerl2
6  Outerl3-innerl3
7  Outerl4-innerl4
Output Select Lines Supported
0  Pktrw
5  Sideband
ELAM Support
Note: Only output select lines 0 and 5 are supported for capturing packets based on output at both output and input
ELAM Configuration
The flow during ELAM configuration:
• Init – Initialize the ELAM: select the ASIC instance, pipeline and select lines
• Config – Configure the trigger based on different fields in the packet
• Arm – Arm the trigger by setting the fields to match in hardware
• Read – Once the trigger fires, read the report
• Reset – Once the process is complete, reset the trigger to restart the process
1. Init
2. Config
5. Reset
3. Arm
4. Read
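The five phases map onto a fixed command sequence, which makes it easy to generate per capture point. The following Python sketch only assembles command strings of the kind used in the ELAM examples of this deck; the mapping of commands to phases is an interpretation, and nothing here is validated against a device.

```python
def elam_commands(asic_family, direction, header, src_mac, dst_mac, reset=True):
    """Assemble the vsh_lc command sequence for one ELAM capture.

    asic_family: "ns" on leaf or "alp" on spine, as used in this deck.
    direction:   "ingress" or "egress" pipeline.
    header:      "outer" or "inner" L2 header to match on.
    """
    if direction not in ("ingress", "egress"):
        raise ValueError("direction must be ingress or egress")
    if header not in ("outer", "inner"):
        raise ValueError("header must be outer or inner")
    cmds = ["vsh_lc", f"debug platform internal {asic_family} elam asic 0"]
    if reset:
        cmds.append("trigger reset")  # Reset: clear any previous trigger
    cmds += [
        f"trigger init {direction} in-select 3 out-select 0",  # Init
        f"set {header} l2 src_mac {src_mac}",  # Config
        f"set {header} l2 dst_mac {dst_mac}",  # Config
        "start",   # Arm
        "status",  # Check for "Armed" or a triggered state
        "report",  # Read
    ]
    return cmds


if __name__ == "__main__":
    for cmd in elam_commands("ns", "ingress", "outer",
                             "00:25:b5:aa:00:0a", "ff:ff:ff:ff:ff:ff"):
        print(cmd)
```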
Trigger
Show the trigger
The configured trigger can be verified using the show command
root@module-1(NS-elam-insel3)# show
ELAM configuration
ELAM Report Analysis
The ELAM report is very detailed and dumps many fields.
In Pktrw the important fields are
• adj_index
• ol_encap_idx
• sclass
• src_tep_idx
• sup_redirect
In Sideband the important fields are
• l2flood
• fwddrop
• bnce
ELAM Example
spine 1 spine 2
leaf 3 leaf 4 leaf 5leaf 1 leaf 2
ELAM Example
1
1
12
32
3
leaf1: input ingress
outer header
spine: input ingress
inner header
leaf4: input egress
inner header
EP A EP B
ingress
egress
spine 1 spine 2
leaf 3 leaf 4 leaf 5leaf 1 leaf 2
ELAM Example
1
1
1 leaf1: input ingress
outer header
EP A EP B
ingress
vsh_lc
debug platform internal ns elam asic 0
trigger reset
trigger init ingress in-select 3 out-select 0
set outer l2 src_mac 00:25:b5:aa:00:0a
set outer l2 dst_mac ff:ff:ff:ff:ff:ff
start
status
report
MAC: 00:25:b5:aa:00:0a MAC: 00:25:b5:bb:00:0b
outer
Note: outer header
Packet is not yet encapsulated in iVXLAN
Outer header is still the original frame from the EP
NOTE:
1) Without the "reset" command, trigger buffers are never reset other than by a reboot.
2) Users can move in and out of ELAM mode, and there will be no impact on the configured triggers.
ELAM configuration
leaf1# vsh_lc
module-1# debug platform internal ns elam asic 0
module-1(NS-elam)# trigger reset
module-1(NS-elam)# trigger init ingress in-select 3 out-select 0
module-1(NS-elam-insel3)# set outer l2 src_mac 00:25:b5:aa:00:0a
module-1(NS-elam-insel3)# set outer l2 dst_mac ff:ff:ff:ff:ff:ff
module-1(NS-elam-insel3)# start
module-1(NS-elam-insel3)# status
Status: Armed
module-1(NS-elam-insel3)# ?
report Show trigger report
…
module-1(NS-elam-insel3)# report
ELAM not triggered. No report available
We’re looking to confirm whether a broadcast packet sourced from
MAC 00:25:b5:aa:00:0a
is reaching the Cisco ASIC
module-1(NS-elam-insel3)# report | egrep ce_|ar_|drop|hg2_src
GBL_C++: [INFO] hg2_srcpid: 0A
GBL_C++: [INFO] ce_da: FFFFFFFFFFFF
GBL_C++: [INFO] ce_sa: 0025B5AA000A
GBL_C++: [INFO] ce_etype: 0806
GBL_C++: [INFO] ar_sha: 0025B5AA000A
GBL_C++: [INFO] ar_spa: 0A108030
GBL_C++: [INFO] ar_tha: 000000000000
GBL_C++: [INFO] ar_tpa: 0A108001
GBL_C++: [INFO] ar_spare: 0000000000000000000000000000
GBL_C++: [MSG] - pktrw is complete
GBL_C++: [INFO] drop: 0
GBL_C++: [INFO] hg2_srcpid: 0A
GBL_C++: [INFO] hg2_vid_lo: 63
GBL_C++: [INFO] vlan0: 063
GBL_C++: [INFO] adj_index: 000C
GBL_C++: [INFO] ol_encap_idx: 2FF6
GBL_C++: [INFO] ol_ttl: 08
GBL_C++: [INFO] ol_segid: 2A8001
GBL_C++: [INFO] sclass: C005
GBL_C++: [INFO] sup_redirect: 0
GBL_C++: [INFO] mcast: 0
module-1(NS-elam-insel3)# show platform internal ns forwarding encap 0x2FF6
TABLE INSTANCE : 0
Legend
MD: Mode (LUX & RWX)        LB: Loopback
LE: Loopback ECMP           LB-PT: Loopback Port
ML: MET Last                TD: TTL Dec Disable
DV: Dst Valid               DT-PT: Dest Port
DT-NP: Dest Port Not-PC     ET: Encap Type
OP: Override PIF Pinning    HR: Higig DstMod RW
HG-MD: Higig DstMode        KV: Keep VNTAG
------------------------------------------------------------
 M PORT L L LB MET M T D DT DT E TST O H HG K M E
POS D FTAG B E PT PTR L D V PT NP T IDX P R MD V D T Dst MAC DIP
------------------------------------------------------------
12278 0 c00 0 1 0 0 0 0 0 0 0 3 4 0 0 0 0 0 3 00:00:00:00:00:00 10.0.200.127
ELAM Report Analysis (trigger went off)
hg2_srcpid: source port on front panel
ce_sa: Source MAC address
ce_etype: Ethertype 0x806 = ARP (Address Resolution Protocol)
ar_spa: Source IP address = 10.16.128.48
ar_tpa: Destination IP address = 10.16.128.1
People that read hex on the fly appreciate this output!
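For everyone else, the hex fields of the report convert with a few lines of code. A minimal Python sketch using the values from the report above; the helper names are illustrative.

```python
import ipaddress


def hex_to_ip(hex_value):
    """Convert an 8-digit hex field such as ar_spa to dotted decimal."""
    return str(ipaddress.IPv4Address(int(hex_value, 16)))


def hex_to_mac(hex_value):
    """Convert a 12-digit hex field such as ce_sa to colon notation."""
    digits = hex_value.lower().zfill(12)
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


if __name__ == "__main__":
    print("ar_spa       :", hex_to_ip("0A108030"))
    print("ar_tpa       :", hex_to_ip("0A108001"))
    print("ce_sa        :", hex_to_mac("0025B5AA000A"))
    # ol_encap_idx is the row (POS) looked up in the encap table.
    print("ol_encap_idx :", int("2FF6", 16))
```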
VXLAN Destination TEP address derived
from encap: 10.0.200.127
acidiag fnvread | egrep 10.0.200.127
moquery -c tunnelIf -f 'tunnel.If.dest=="10.0.200.127"'
show isis dtep vrf overlay-1
We have destination TEP address, what next?
Find which switch has a specific TEP
On APIC or Switch
# show isis dtep vrf overlay-1
IS-IS Dynamic Tunnel End Point (DTEP) database:
DTEP-Address   Role    Encapsulation   Type
10.0.120.95    SPINE   N/A             PHYSICAL
10.0.200.64    SPINE   N/A             PHYSICAL,PROXY-ACAST-MAC
10.0.200.65    SPINE   N/A             PHYSICAL,PROXY-ACAST-V4
10.0.8.65      SPINE   N/A             PHYSICAL,PROXY-ACAST-V6
10.0.8.64      LEAF    N/A             PHYSICAL
10.0.200.127   LEAF    N/A             PHYSICAL
10.0.200.126   SPINE   N/A             PHYSICAL
switch output
APIC is not running the IS-IS protocol
spine 1 spine 2
leaf 3 leaf 4 leaf 5leaf 1 leaf 2
ELAM Example
1
2
EP A EP B
2 spine: input ingress
inner header
ingress
vsh_lc
debug platform internal alp elam asic 0 | 1
trigger init ingress in-select 3 out-select 0
set inner l2 src_mac 00:25:b5:aa:00:0a
set inner l2 dst_mac 00:25:b5:bb:00:0b
start
status
report
inner
Cisco ASIC in spine
MAC: 00:25:b5:aa:00:0a MAC: 00:25:b5:bb:00:0b
Packet is now encapsulated in iVXLAN, so we’re looking for inner header
Hint: don’t forget trigger reset
spine 1 spine 2
leaf 3 leaf 4 leaf 5leaf 1 leaf 2
ELAM Example
1
3
host A host B
MAC: 00:25:b5:aa:00:0a MAC: 00:25:b5:bb:00:0b
3 leaf4: input egress
inner header
egress
vsh_lc
debug platform internal ns elam asic 0
trigger init egress in-select 3 out-select 0
set inner l2 src_mac 00:25:b5:aa:00:0a
set inner l2 dst_mac 00:25:b5:bb:00:0b
start
status
report
inner
Cisco ASIC in leaf
report
*** report will be available once the trigger has gone off
Egress because we’re egressing the fabric
References
Quick Start / Videos
APIC Help pages
API Documentation
Python SDK
APIC resources
ACI Documentation - cisco.com/go/aci
Cisco.com – APIC Troubleshooting
Cisco Support Forums
Cisco DevNet
GitHub/datacenter
Online resources
GitHub – a resource for ACI scripts and tools
• ACI Toolkit:
http://datacenter.github.io/acitoolkit/
https://github.com/datacenter/acitoolkit
• ACI Diagram
https://github.com/cgascoig/aci-diagram
• ACI Endpoint Tracker
http://datacenter.github.io/acitoolkit/docsbuild/html/endpointtracker.html
Policy Driven Data Center with ACI, The: Architecture, Concepts, and Methodology
ISBN: 9781587144905
Designing Data Centers with Cisco's ACI LiveLessons--Networking Talks
ISBN: 978-1-58714-436-3
Thank you