Anomalies: Prevention, Detection and
Diagnosis with OraChk and TFA
Jared Still
2016-12-28
© 2016 Pythian. Confidential 2
About Me
•Jared Still
•At Pythian since 2011
•Oracle DBA since 1994
•Oaktable
•Oracle ACE
•Known to dabble in Perl
•Claim to fame: started Oracle-L
ORAchk
Get Proactive
ORAchk – where to find it and Documentation
• ORAchk - Health Checks for the Oracle Stack (Doc ID 1268927.2)
• Extensive documentation
• Documentation: http://docs.oracle.com/cd/E68491_01/index.htm
• Prereqs for ORAchk: http://bit.ly/2hgvjGH
• Bash 3.2
• /usr/bin/expect
[root@ora12c102rac01 ~]# yum install expect
Loaded plugins: refresh-packagekit, security
Setting up Install Process
Package expect-5.44.1.15-5.el6_4.x86_64 already installed and latest version
• ssh equivalency
• Runs as root (preferred)
▪ Especially for RAC
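The two local prerequisites above can be checked before the first run. This is a minimal sketch, assuming "Bash 3.2" means 3.2 or later; `version_ge` and `check_prereqs` are hypothetical helper names, not part of ORAchk.

```shell
# version_ge: hypothetical helper; succeeds when version $1 (major.minor)
# is greater than or equal to version $2.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        if (x[1] + 0 != y[1] + 0) exit !(x[1] + 0 > y[1] + 0)
        exit !(x[2] + 0 >= y[2] + 0)
    }'
}

# check_prereqs: report on bash and expect; returns non-zero if either fails.
check_prereqs() {
    rc=0
    # Ask the installed bash for its own major.minor version.
    bash_ver=$(bash -c 'echo "${BASH_VERSINFO[0]}.${BASH_VERSINFO[1]}"' 2>/dev/null)
    if version_ge "${bash_ver:-0.0}" 3.2; then
        echo "bash $bash_ver: OK"
    else
        echo "bash ${bash_ver:-not found}: need 3.2 or later"; rc=1
    fi
    if [ -x /usr/bin/expect ]; then
        echo "/usr/bin/expect: OK"
    else
        echo "/usr/bin/expect: missing (try: yum install expect)"; rc=1
    fi
    return $rc
}
```

Run `check_prereqs` on each node; ssh equivalency still has to be verified separately.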
What does ORAchk check?
• Engineered Systems and Features
• Oracle Database Installations
• Standalone
• RAC
• Elastic Stack Integration
• E-Business Suite
• Pre-install checks
• Much more – check the docs
• What does it look for?
• Database
▪ Patches up to date?
▪ Bug fixes
▪ Vulnerabilities
▪ Log Switch time
▪ Redo write time
▪ …
• OS – Linux
▪ ShellShock Bash bug
▪ Memory config
• Configuration
▪ Parameters
▪ Sizing
▪ …
Focus on ORAchk
• Considering single use and analysis
• This is an introduction
• Not Covered in this presentation
• Oracle Health Check Collections Manager
• APEX based catalog of ORAchk results
• EXAchk – Engineered Systems Specific
• Various other features; Application Continuity, OID,…
Installation
• Installed by default as of 11.2.0.4+
• Much more useful if upgraded from default in 11.2.0.4
• Check for new versions – upgraded quarterly
• Easy to install
• Get the file
• Unzip it
• Done
10gR2 example - Installation
• Please upgrade if you are on 10g…
• This is a test DB only on OEL 5.5
[root@ora10gR2 js01]# cd $ORACLE_HOME
[root@ora10gR2 js01]# ls -l orachk
orachk: No such file or directory
[root@ora10gR2 js01]# mkdir orachk
[root@ora10gR2 js01]# cd orachk
[root@ora10gR2 orachk]# pwd
/u01/app/oracle/product/10.2/js01/orachk
[root@ora10gR2 orachk]# unzip /tmp/orachk.zip
…
• Installation is complete
10gR2 example – Execution
• Run for all databases found
[root@ora10gR2 orachk]# ./orachk -output /tmp/orachk-01 -dball -a
Checking Status of Oracle Software Stack - Clusterware, ASM, RDBMS
. . . . . . .
Checking for prompts for root user on all nodes...
. . . . . . . .
-------------------------------------------------------------------------------------------------------
Oracle Stack Status
-------------------------------------------------------------------------------------------------------
Host Name CRS Installed ASM HOME RDBMS Installed CRS UP ASM UP RDBMS UP DB Instance Name
-------------------------------------------------------------------------------------------------------
ora10gr2 No No Yes No No Yes js01
-------------------------------------------------------------------------------------------------------
Copying plug-ins
. . . . . . . . .
*** Checking Best Practice Recommendations
Collections and audit checks log file is
/tmp/orachk-01/orachk_ora10gr2_js01_122716_155227/log/orachk.log
...
ORAchk Report
• Examine Report (orachk_ora10gr2_js01_122716_155227.html)
• Fix FAIL Item and recheck
• MAA Scorecard – log_buffer size failed
• Fix and recheck
SYSDBA> alter system set log_buffer=8388608 scope=spfile;
System altered.
-- restart the database
SYSDBA> show parameter log_buffer
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
log_buffer integer 10175488
* Oracle has rounded the size up to consume a granule of memory
See Oracle Calculation of Log_Buffer Size in 10g (Doc ID 604351.1)
10gR2 – Rerun ORAchk for just one check
• Re-run just 1 check:
• Click on Show Check Ids
• Check ID: CB02802D637C344DE0431EC0E50AE8DE
• orachk -output /tmp/orachk-01 -dball -check CB02802D637C344DE0431EC0E50AE8DE
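A small wrapper can guard against pasting a malformed ID into the re-run command. This is a sketch only: it assumes check IDs are always 32 hexadecimal characters, which holds for every ID shown in this deck but is not confirmed here as a rule. `is_check_id` and `recheck_cmd` are hypothetical names.

```shell
# is_check_id: hypothetical helper; succeeds for a 32-character hex string.
is_check_id() {
    printf '%s\n' "$1" | grep -Eq '^[0-9A-Fa-f]{32}$'
}

# recheck_cmd: print (not run) the orachk command that re-runs one check.
recheck_cmd() {
    if is_check_id "$1"; then
        echo "./orachk -output /tmp/orachk-01 -dball -check $1"
    else
        echo "not a check ID: $1" >&2
        return 1
    fi
}

recheck_cmd CB02802D637C344DE0431EC0E50AE8DE
```

Printing the command first makes it easy to review before running it as root.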
Rerun just the log_buffer check
[root@ora10gR2 orachk]# ./orachk -output /tmp/orachk-01 -dball -check CB02802D637C344DE0431EC0E50AE8DE
...
-------------------------------------------------------------------------------------------------------
Host Name CRS Installed ASM HOME RDBMS Installed CRS UP ASM UP RDBMS UP DB Instance Name
-------------------------------------------------------------------------------------------------------
ora10gr2 No No Yes No No Yes js01
-------------------------------------------------------------------------------------------------------
*** Checking Best Practice Recommendations (PASS/WARNING/FAIL) ***
...
=============================================================
Node name - ora10gr2
=============================================================
. . . . .
Collecting - Database Parameters for js01 database
10gR2 – Examine the new report
• Issue corrected
• See orachk_ora10gr2_js01_122716_164736.html
12.1.0.2 – 2 Node RAC
• Something a bit more up to date.
• TFA has been upgraded on this cluster
• (more on TFA to come)
• Server may need some cleanup on Aisle Oracle.
• Old version is still installed in different location
[root@ora12c102rac01 db_1]# find /u01/cdbrac/app/grid/product/12.1.0.2/ -name orachk -type f | xargs ls -ld
-rwxr-x---. 1 oracle oinstall 1604239 Jun 9 2014 /u01/cdbrac/app/grid/product/12.1.0.2/grid_1/suptools/orachk/orachk
-rwxr-xr-x. 1 root root 2895859 Dec 17 12:24 /u01/cdbrac/app/grid/product/12.1.0.2/grid_1/tfa/ora12c102rac01/tfa_home/ext/orachk/orachk
[root@ora12c102rac01 db_1]# $ORACLE_HOME/suptools/orachk/orachk -v
ORACHK VERSION: 2.2.5_20140530
[root@ora12c102rac01 db_1]# $ORACLE_HOME/tfa/ora12c102rac01/tfa_home/ext/orachk/orachk -v
ORACHK VERSION: 12.2.0.1.2_20161215
Troubleshooting hang issue – p1
• Hung due to NFS mount hanging
• See orachk-troubleshooting-nfs-stuck.txt
orachk -output /tmp/orachk-01 -dball -a
===>>> Hung here on localcmd.sh, which is running df -k
# ps -fluroot| grep [o]rachk
4 S root 617 25366 0 80 0 - 34277 pipe_w Dec27 pts/6 00:00:10 … orachk -dball -a
0 S root 1144 617 0 80 0 - 2325 wait Dec27 pts/6 00:00:01 bash /tmp/orachk-01/.input_122716_170752/watchdog.sh
1 S root 11443 617 0 80 0 - 34277 wait Dec27 pts/6 00:00:00 … orachk -dball -a
0 S root 11444 11443 0 80 0 - 2351 wait Dec27 pts/6 00:00:00 bash /root/.orachk/localcmd.sh
[root@ora12c102rac01 ~]# pstree -p 617
bash(617)-+-bash(1144)---sleep(5964)
`-bash(11443)---bash(11444)-+-df(11457)
`-grep(11458)
# ps -flp 11457
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 D root 11457 11444 0 80 0 - 1049 rpc_wa Dec27 pts/6 00:00:00 df
Troubleshooting hang issue – p2
• RPC Waits – nearly always NFS
• RPC runs over TCP or UDP; on Linux it is most frequently seen with NFS
• Operations error – I decommissioned NFS server while there were remote connections
• Oops
• Remove mount from /etc/fstab and reboot
• For in-depth NFS troubleshooting articles, see Resources
[root@ora12c102rac01 fd]# strace df -k
statfs("/mnt/oraback", ^C <unfinished ...>    <<<=== stuck here
[root@ora12c102rac01 fd]# mount| grep oraback
lestrade:/mnt/zpool1/oraback on /mnt/oraback type nfs (rw,bg,intr,hard,rsize=32768,wsize=32768,noac,nolock,nfsvers=3,addr=192.168.1.116)
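A hang like this can be spotted before running ORAchk by probing each NFS mount with a time limit. The sketch below assumes a Linux host with /proc/mounts and the coreutils `timeout` command; `nfs_mount_points` and `probe_nfs_mounts` are hypothetical names. On some kernels a process stuck on a hard mount may not respond even to SIGKILL, so treat this as a best-effort check.

```shell
# nfs_mount_points: read mount-table text in /proc/mounts format on stdin
# and print the mount point of every nfs or nfs4 file system.
nfs_mount_points() {
    awk '$3 ~ /^nfs4?$/ { print $2 }'
}

# probe_nfs_mounts: stat each NFS mount with a 5 second limit, so a dead
# server is reported instead of hanging the shell the way df did above.
probe_nfs_mounts() {
    nfs_mount_points < /proc/mounts | while read -r mp; do
        if timeout -s KILL 5 stat -f "$mp" > /dev/null 2>&1; then
            echo "OK    $mp"
        else
            echo "STUCK $mp"
        fi
    done
}
```

Calling `probe_nfs_mounts` prints one line per NFS mount; any STUCK line is worth fixing before starting a long ORAchk run.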
Report Analysis
• Runtime installation issues cleared up…
• Examine orachk_ora12c102rac01_js122a_122816_083224.html
• Test environment, many failures and warnings…
• First five checks have FAIL status
• “For our environment, these do not need to be corrected”
• What if you want to always exclude certain checks?
Excluding Checks
• The -excludecheck option may be used
• orachk -excludecheck check_id1,check_id2,…
• See orachk_ora12c102rac01_js122a_010217_141755.html#excluded_checks
• Or use excluded_check_ids.txt
• Place in same directory as orachk
# cat excluded_check_ids.txt
F9519FD7525B4F01E04313C0E50AC339
CB95A1BF5B1160ACE0431EC0E50A12EE
3B36CD93FCE82634E0530C98EB0A543F
429C554ED92E4F54E0530D98EB0AE367
AA8C83A023362C5EE040E50A1EC0146A
• As of orachk 12.2.0.1.2, the exclude file does not work.
• Have I created a bug SR for this? Not Yet
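Where the exclude file is not honored, the same list of IDs can be passed through -excludecheck instead. A minimal sketch follows; `exclude_arg` is a hypothetical helper, and the file is assumed to hold one check ID per line.

```shell
# exclude_arg: hypothetical helper that joins the IDs in a file (one per
# line, blank lines ignored) into a single comma-separated list.
exclude_arg() {
    awk 'NF { printf "%s%s", sep, $1; sep = "," } END { if (sep) print "" }' "$1"
}

# Example use (not executed here):
#   ./orachk -dball -a -excludecheck "$(exclude_arg excluded_check_ids.txt)"
```

This keeps excluded_check_ids.txt as the single list to maintain, whichever mechanism the installed version supports.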
excluded_check_ids.txt – Works in 12.2.0.1.4
• Verify orachk.log
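#!/bin/bash
# Build an alternation of all excluded check IDs, then list the
# "Skipping check" lines in orachk.log that mention any of them.
grepStr=''
echo 'Checks to Skip:'
while read -r chkid
do
  [ -z "$chkid" ] && continue   # ignore blank lines
  echo "$chkid"
  grepStr="${grepStr}|${chkid}"
done < excluded_check_ids.txt
# remove leading pipe
grepStr=${grepStr:1}
echo 'Checking orachk.log'
# keep field 8 onward of each matching line, then de-duplicate
grep -iE "skipping.*($grepStr)" orachk.log | cut -d' ' -f 8- | sort -u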
excluded_check_ids.txt – Works in 12.2.0.1.4
• orachk may already be skipping some of the excluded checks
Checks to Skip:
F9519FD7525B4F01E04313C0E50AC339
CB95A1BF5B1160ACE0431EC0E50A12EE
3B36CD93FCE82634E0530C98EB0A543F
429C554ED92E4F54E0530D98EB0AE367
AA8C83A023362C5EE040E50A1EC0146A
Checking orachk.log
Skipping check(429C554ED92E4F54E0530D98EB0AE367) on version 1 db_version= versions_to_run=
Skipping check(AA8C83A023362C5EE040E50A1EC0146A) on version 1 db_version= versions_to_run=
Skipping check(CB95A1BF5B1160ACE0431EC0E50A12EE) on version 4 db_version= versions_to_run=
Skipping check(CB95A1BF5B1160ACE0431EC0E50A12EE) on version 4 db_version = versions_to_run =
ORAchk Patch Recommendations
• ORAchk updated quarterly
• Knows about current patches
• See Patch Recommendation orachk_ora12c102rac01_js122a_122816_083224.html
ORAchk – some runtime options
• Use -dbserial
• Reduce load on production
• RAT_* environment variables
• RAT_DBNAMES="db1 db2 …"
• RAT_TMPDIR='/some-tmp-dir'
• RAT_OUTPUT='/tmp/outputfiles-here'
• RAT_CRS_HOME='/crs-home-location'
• Many, many more in the documentation
▪ Some are platform and feature specific
▪ Check the documentation
• CLI options
• -p patch check only
• -output directory for output
• -localonly only local node
• -diff old-report new-report
• Many more…
TFA
Trace File Analyzer
TFA – Where to find it
• TFA Collector - TFA with Database Support Tools Bundle (Doc ID 1513912.1)
• Diagnostics tool
• Gathers information from many log files
• Especially useful with RAC ▪ Checks more files than you probably know about
▪ Gathers from all nodes to one location
▪ Query tool to drill down on problem times
• Bonus – other tools installed with TFA
• ORAchk
• OS Watcher
TFA Install
Trace File Analyzer
Install or Upgrade TFA
• Installed by default as of 11.2.0.4+
• Much more useful if upgraded from default
• Simple install and upgrade
[root@ora12c102rac01 TFA]# unzip p21757377_121020_Generic.zip
Archive: p21757377_121020_Generic.zip
inflating: TFA_User_Guide_12.1.2.8.4.pdf
inflating: installTFALite
inflating: README.txt
[root@ora12c102rac01 p21757377_121020_Generic]# ./installTFALite
TFA Installation Log will be written to File : /tmp/tfa_install_30974_2016_12_17-12_22_18.log
Starting TFA installation
TFA HOME : /u01/cdbrac/app/grid/product/12.1.0.2/grid_1/tfa/ora12c102rac01/tfa_home
TFA Build Version: 121284 Build Date: 201612160926
Installed Build Version: 121284 Build Date: 201611221014
TFA is already installed. Patching /u01/cdbrac/app/grid/product/12.1.0.2/grid_1/tfa/ora12c102rac01/tfa_home...
Start TFA
[root@oravm01 ~]# /etc/rc.d/init.d/init.tfa start
Starting TFA..
Waiting up to 100 seconds for TFA to be started..
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
Successfully started TFA Process..
. . . . .
TFA Started and listening for commands
TFA – Enable Automatic Diagnostic Collection
• tfactl set autodiagcollect=ON reposizeMB=5120
• Now enabled for all nodes
[root@oravm01 ~]# tfactl set autodiagcollect=ON reposizeMB=5120
Successfully set autodiagcollect=ON
.----------------------------------------------------------.
| oravm01 |
+---------------------------------------------+------------+
| Configuration Parameter | Value |
+---------------------------------------------+------------+
| TFA version | 12.1.2.4.1 |
| Automatic diagnostic collection | ON |
| Trimming of files during diagcollection | ON |
| Repository current size (MB) | 347 |
| Repository maximum size (MB) | 10240 |
| Inventory Trace level | 1 |
| Collection Trace level | 1 |
| Scan Trace level | 1 |
| Other Trace level | 1 |
| Max Size of TFA Log (MB) | 50 |
| Max Number of TFA Logs | 10 |
...
TFA – Enable Automatic Diagnostic Collection
• ‘print config’ will show ‘auto’ enabled on only one node
tfactl print config | grep 'Automatic diag'
| Automatic diagnostic collection | ON |
| Automatic diagnostic collection | OFF |
• Diagnostics collected from all nodes.
• What happens if that node crashes?
• Next slide…
Ad hoc Collection
• If auto collection is not enabled
• tfactl diagcollect -all -from "Jan/27/2017 12:00:00" -to "Jan/27/2017 14:00:00"
• tfactl analyze -from "Jan/27/2017 12:00:00" -to "Jan/27/2017 14:00:00" > rpt.txt
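The date strings are easy to mistype, so they can be generated instead. The sketch below assumes GNU `date`; `tfa_ts` and `diagcollect_cmd` are hypothetical helpers that only print the command and never call tfactl themselves.

```shell
# tfa_ts: hypothetical helper; formats a timestamp in the
# Mon/DD/YYYY HH:MM:SS form used by -from and -to above.
# LC_ALL=C keeps the month abbreviation in English.
tfa_ts() {
    LC_ALL=C date -d "$1" '+%b/%d/%Y %H:%M:%S'
}

# diagcollect_cmd: print (not run) a collection command for a time window.
diagcollect_cmd() {
    from=$(tfa_ts "$1")
    to=$(tfa_ts "$2")
    echo "tfactl diagcollect -all -from \"$from\" -to \"$to\""
}

diagcollect_cmd '2017-01-27 12:00:00' '2017-01-27 14:00:00'
```

The same `tfa_ts` output can be reused for the `tfactl analyze` window so both commands cover identical times.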
TFA print
tfactl print -help
Usage: /u01/cdbrac/app/grid/product/12.1.0.2/grid_1/bin/tfactl print
[status|components [[component_name1] [component_name2] ...
[component_nameN]]|config|directories|hosts|actions|repository]
Prints requested details.
Options:
status Print status of TFA across all nodes in cluster
components Print the desired components in the Configuration
config Print current TFA config settings
directories Print all the directories to Inventory
hosts Print all the Hosts in the Configuration
actions Print all the Actions requested and their status
repository Print the zip file repository information
protocols Print available and restricted protocols in TFA
TFA Config
tfactl print config
.---------------------------------------------------------------.
| ora12c102rac01 |
+--------------------------------------------------+------------+
| Configuration Parameter | Value |
+--------------------------------------------------+------------+
| TFA Version | 12.1.2.8.4 |
| Java Version | 1.6 |
| Public IP Network | ON |
| Automatic diagnostic collection | OFF |
| Alert Log Scan | ON |
| Trimming of files during diagcollection | ON |
| Repository current size (MB) | 276 |
| Repository maximum size (MB) | 10240 |
| Inventory Trace level | 1 |
| Collection Trace level | 1 |
| Scan Trace level | 1 |
| Other Trace level | 1 |
| Max Size of TFA Log (MB) | 50 |
| Max Number of TFA Logs | 10 |
| Max Size of Core File (MB) | 20 |
| Max Collection Size of Core Files (MB) | 200 |
| Automatic Purging | ON |
| Minimum Age of Collections to Purge (Hours) | 12 |
| Minimum Space Free to enable Alert Log Scan (MB) | 500 |
'--------------------------------------------------+------------'
TFA Repo
# find /u01/cdbrac/app/oracle/tfa/repository -type f -name \*.dat -mmin -10 | awk -F \/ '{ print $NF }'
ora12c102rac02.jks.com_top_17.01.04.2100.dat
ora12c102rac02.jks.com_vmstat_17.01.04.2100.dat
ora12c102rac02.jks.com_iostat_17.01.04.2100.dat
ora12c102rac02.jks.com_ifconfig_17.01.04.2100.dat
ora12c102rac02.jks.com_mpstat_17.01.04.2100.dat
ora12c102rac02.jks.com_netstat_17.01.04.2100.dat
ora12c102rac02.jks.com_meminfo_17.01.04.2100.dat
ora12c102rac02.jks.com_ps_17.01.04.2100.dat
• Repo files on remote node
• Find all files modified in last 10 minutes
• So we know it is working
TFA Repo Check Repo
tfactl print repository
.--------------------------------------------------------------.
| ora12c102rac01 |
+----------------------+---------------------------------------+
| Repository Parameter | Value |
+----------------------+---------------------------------------+
| Location | /u01/cdbrac/app/oracle/tfa/repository |
| Maximum Size (MB) | 10240 |
| Current Size (MB) | 278 |
| Free Size (MB) | 9962 |
| Status | OPEN |
'----------------------+---------------------------------------'
.--------------------------------------------------------------.
| Remote Node ===>>> ora12c102rac02 |
+----------------------+---------------------------------------+
| Repository Parameter | Value |
+----------------------+---------------------------------------+
| Location | /u01/cdbrac/app/oracle/tfa/repository |
| Maximum Size (MB) | 10240 |
| Current Size (MB) | 309 |
| Free Size (MB) | 9931 |
| Status | OPEN |
'----------------------+---------------------------------------'
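The repository report can be reduced to one line per node for monitoring. This sketch parses the table layout exactly as shown above, which may differ in other TFA versions; `repo_free_mb` is a hypothetical helper name.

```shell
# repo_free_mb: read "tfactl print repository" output on stdin and print
# "<node> <free MB>" for each node table.
repo_free_mb() {
    awk -F'|' '
        # A row with a single cell is the node name banner.
        NF == 3 { node = $2; sub(/.*> */, "", node); gsub(/ /, "", node) }
        # The Free Size row carries the value in the second cell.
        $2 ~ /Free Size/ { gsub(/ /, "", $3); print node, $3 }
    '
}

# Example use (not executed here):
#   tfactl print repository | repo_free_mb
```

For the output above this would give 9962 MB free on ora12c102rac01 and 9931 MB on ora12c102rac02.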
TFA Status
tfactl print status
.-----------------------------------------------------------------------------------------------------.
| Host | Status of TFA | PID | Port | Version | Build ID | Inventory Status |
+----------------+---------------+------+------+------------+----------------------+------------------+
| ora12c102rac01 | RUNNING | 2892 | 5000 | 12.1.2.8.4 | 12128420161216092619 | COMPLETE |
| ora12c102rac02 | RUNNING | 2932 | 5000 | 12.1.2.8.4 | 12128420161216092619 | COMPLETE |
'----------------+---------------+------+------+------------+----------------------+------------------'
TFA Directories Scanned
• Small partial listing
tfactl print directories
see tfa-directories-scanned.txt
Last Rediscovery Run on ora12c102rac01 : Wed Jan 04 17:29:06 EST 2017
.---------------------------------------------------------------------------------------------------------------------.
| ora12c102rac01 |
+------------------------------------+--------------------------------------------------------+------------+----------+
| Trace Directory | Component | Permission | Added By |
+------------------------------------+--------------------------------------------------------+------------+----------+
| /etc/oracle | [CRS] | public | root |
| Collection policy : Exclusions | | | |
+------------------------------------+--------------------------------------------------------+------------+----------+
| /u01/cdbrac/app/grid/product/12.1. | [CFGTOOLS] | public | root |
| 0.2/grid_1/cfgtoollogs | | | |
| Collection policy : Exclusions | | | |
+------------------------------------+--------------------------------------------------------+------------+----------+
| /u01/cdbrac/app/grid/product/12.1. | [CFGTOOLS] | public | root |
| 0.2/grid_1/cfgtoollogs/cfgfw | | | |
| Collection policy : Exclusions | | | |
+------------------------------------+--------------------------------------------------------+------------+----------+
| /u01/cdbrac/app/grid/product/12.1. | [CFGTOOLS] | public | root |
| 0.2/grid_1/cfgtoollogs/crsconfig | | | |
| Collection policy : Exclusions | | | |
+------------------------------------+--------------------------------------------------------+------------+----------+
TFA – What has been deployed?
[root@ora12c102rac01 db_1]# tfactl toolstatus
.---------------------------------------------.
| External Support Tools |
+----------------+--------------+-------------+
| Host | Tool | Status |
+----------------+--------------+-------------+
| ora12c102rac01 | alertsummary | DEPLOYED |
| ora12c102rac01 | exachk | DEPLOYED |
| ora12c102rac01 | pstack | DEPLOYED |
| ora12c102rac01 | orachk | DEPLOYED |
| ora12c102rac01 | oratop | DEPLOYED |
...
| ora12c102rac01 | oswbb | RUNNING |
| ora12c102rac01 | dbperf | DEPLOYED |
| ora12c102rac01 | changes | DEPLOYED |
| ora12c102rac01 | events | DEPLOYED |
| ora12c102rac01 | ps | DEPLOYED |
| ora12c102rac01 | srdc | DEPLOYED |
'----------------+--------------+-------------'
TFA Usage
Trace File Analyzer
TFA Help
tfactl -help
Usage : /u01/cdbrac/app/grid/product/12.1.0.2/grid_1/bin/tfactl <command> [options]
<command> =
start Starts TFA
stop Stops TFA
enable Enable TFA Auto restart
disable Disable TFA Auto restart
print Print requested details
access Add or Remove or List TFA Users
purge Delete collections from TFA repository
directory Add or Remove or Modify directory in TFA
host Add or Remove host in TFA
diagcollect Collect logs from across nodes in cluster
collection Manage TFA Collections
analyze List events summary and search strings in alert logs.
set Turn ON/OFF or Modify various TFA features
toolstatus Prints the status of TFA Support Tools
run <tool> Run the desired support tool
start <tool> Starts the desired support tool
stop <tool> Stops the desired support tool
syncnodes Generate/Copy TFA Certificates
diagnosetfa Collect TFA Diagnostics
uninstall Uninstall TFA from this node
Options
[root@ora12c102rac01 ~]# tfactl analyze -help
Usage : /u01/cdbrac/app/grid/product/12.1.0.2/grid_1/bin/tfactl analyze [-search "pattern"]
Components
• db
• asm
• crs
• acfs
• os
• osw
• oswslabinfo
• oratop
• all
-type <error|warning|generic>
Dates, nodes and output
• [-since <n>[h|d]]
• [-from "MMM/DD/YYYY HH24:MI:SS"]
• [-to "MMM/DD/YYYY HH24:MI:SS"]
• [-for "MMM/DD/YYYY HH24:MI:SS"]
• [-node <all | local | n1,n2,..>]
• [-verbose]
• [-o <file>]
oratop
• Like 'top' - continuous updates
• Show video excerpts
tfactl analyze -comp oratop -database js122a -d
• Output once to stdout
tfactl analyze -comp oratop -database js122a
oratop character output
[root@ora12c102rac01 ~]# tfactl analyze -comp oratop -database js122a
Cycle 1 - oratop: Release 14.1.2 Production on Fri Jan 27 14:20:48 2017
Oracle 12c - js1 17:20:46 up: 1.8d, 2 ins, 0 sn, 0 us, 3.4G mt, 1.4% db
ID %CPU LOAD %DCU AAS ASC ASI ASW AST IOPS %FR PGA UTPS UCPS SSRT %DBT
-------------------------------------------------------------------------------
2 32 2 0 1 0 0 0 0 5 7 287M 0 7 340u 84.8
1 5 1 0 1 0 0 0 0 3 11 242M 0 2 326u 15.2
EVENT (C) TOT WAITS TIME(s) AVG_MS PCT WAIT_CLASS
-------------------------------------------------------------------------------
DB CPU 18750 54
control file parallel write 209974 6038 28.9 18 System I/O
db file parallel write 194182 3972 20.7 12 System I/O
db file sequential read 435013 3122 7.3 9 User I/O
enq: PS - contention 443333 2553 9.4 7 Other
ID SID SPID USR PROG S PGA SQLID/BLOCKER OPN E/T STA STE EVENT/*LA W/T
-------------------------------------------------------------------------------
1 60 13112 B/G QM03 D 1.8M a3vfsb1vvtr3s SEL 2.0d ACT WAI reliable 106m
oratop wide example
• tfactl analyze -comp oratop -f -i 2 -s -database js122a
▪ -f 132 column detail
▪ -i interval seconds
▪ -s SQL mode
Oracle 12c - Primary js122a 16:04:41 up: 2.8d, 2 ins, 0 sn, 0 us, 3.4G mt, 4% fra, 8 er, 4 pdb, 35.1% db
ID %CPU LOAD %DCU AAS ASC ASI ASW ASP AST UST MBPS IOPS IORL LOGR PHYR PHYW %FR PGA TEMP UTPS UCPS SSRT DCTR DWTR %DBT
1 43 2 25 0.6 0 0 0 1 0 0 0 28 478u 1 0 0 7 260M 0 0 50 596u 30 69 81.3
2 58 2 38 0.1 0 0 0 0 0 0 0 7 512u 0 0 0 7 291M 0 0 142 249u 66 33 18.7
EVENT (C) TOTAL WAITS TIME(s) AVG_MS PCT WAIT_CLASS
DB CPU 24549 54
control file parallel write 268246 7739 28.9 17 System I/O
db file parallel write 245596 5016 20.6 11 System I/O
db file sequential read 595676 4109 6.8 9 User I/O
enq: PS - contention 553361 3726 9.7 8 Other
ID CID USERNAME MODULE ACTION SQL_ID SQL_TEXT X ELAP CPUT IOWT WAIT EXEC ROWS BUFG DISK BH% LOAD
Explore TFA Components - db
• db with -type generic option
• -type generic can turn up interesting issues
[root@ora12c102rac01 ~]# tfactl analyze -comp db -type generic -since 6d
Unique generic messages for last ~6 day(s)
Occurrences percent server name generic
----------- ------- -------------------- -----
...
(Frequent Resize)
1 0.1% ora12c102rac02 Resize operation completed for file# 201, old size 5513216K, new size 5514240K
Resize operation completed for file# 201, old size 5514240K, new size 5515264K
Resize operation completed for file# 201, old size 5515264K, new size 5516288K
Resize operation completed for file# 201, old size 5516288K, new size 5517312K
...
Explore TFA Components - db
• db with -type generic option
• -type generic can turn up interesting issues
( underallocated memory in one node )
1 0.1% ora12c102rac02
WARNING: Heavy swapping observed on system in last 5 mins.
pct of memory swapped in [1.67%] pct of memory swapped out [1.39%].
Please make sure there is no memory pressure and the SGA and PGA
are configured correctly. Look at DBRM trace file for more details.
Errors in file /u01/cdbrac/app/oracle/diag/rdbms/_mgmtdb/-MGMTDB/trace/-MGMTDB_dbrm_8189.trc (incident=12098) (PDBNAME=CDB$ROOT):
ORA-00700: soft internal error, arguments: [kskvmstatact: excessive swapping observed], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/cdbrac/app/oracle/diag/rdbms/_mgmtdb/-MGMTDB/incident/incdir_12098/-MGMTDB_dbrm_8189_i12098.trc
Explore TFA Components - osw
INFO: analyzing host: ora12c102rac02 (content edited to fit the screen)
Report title: OSW top logs
Report date range: last ~6 day(s)
Report (default) time zone: EST - Eastern Standard Time
Analysis started at: 28-Jan-2017 06:12:59 PM EST
Elapsed analysis time: 3 second(s).
Configuration file: /u01/cdbrac/app/grid/product/12.1.0.2/grid_1/tfa/ora12c102rac02/tfa_home/ext/tnt…
Configuration group: osw
Parameter:
Total osw rec count: 5,767, from 26-Jan-2017 06:00:22 PM EST to 28-Jan-2017 06:12:38 PM EST
OSW recs matching last ~6 day(s): 5,767, from 26-Jan-2017 06:00:22 PM EST to 28-Jan-2017 06:12:38 PM EST
statistic: t first highest lowest average non zero 3rd last trend
top.cpu.util.id: % 39.8 100.0 0.0 81.6 5,572 91.4 -100%
top.cpu.util.si: % 2.4 47.4 0.0 0.9 2,928 0.0 287%
top.cpu.util.sy: % 13.3 77.8 0.0 4.5 5,722 1.2 5%
top.cpu.util.us: % 9.6 94.3 0.0 8.1 5,740 6.2 698%
top.cpu.util.wa: % 34.9 95.4 0.0 4.9 4,391 1.2 -100%
top.loadavg.last01min: 3.07 23.89 1.02 2.06 5,473 1.40 -12%
top.mem.buffers: k 156488 230100 46824 141571 5,762 130836 -16%
top.mem.free: k 336160 718324 33444 224517 5,762 235172 -85%
top.mem.used: k 4745048 5047764 4362884 4856691 5,762 4846036 6%
top.swap.cached: k 1085320 1437792 816320 1189525 5,763 1159368 8%
top.swap.free: k 2296228 2366048 2277208 2319930 5,763 2333156 1%
top.swap.used: k 1832536 1851556 1762716 1808834 5,763 1795608 -2%
...
TFA
Incident Simulation
Simulation
• “Reports of issues with system”
• Deliberately caused by this command
[root@ora12c102rac02 ~]# ip link set eth1 down
• crsctl stat res -t reports node offline
[root@ora12c102rac02 ~]# tfactl analyze -comp crs -since 1h
INFO: analyzing host: ora12c102rac02
Report title: CRS Alert Logs
Report date range: last ~1 hour(s)
Report (default) time zone: EST - Eastern Standard Time
Analysis started at: 25-Jan-2017 06:18:03 PM EST
Elapsed analysis time: 3 second(s).
…
Simulation
Unique error messages for last ~1 hour(s)
Occurrences percent server name error
----------- ------- -------------------- -----
1 33.3% ora12c102rac02 [OCSSD(19266)]CRS-1607: Node ora12c102rac01 is
being evicted in cluster incarnation 378741272; details at (:CSSNM00007:) in
/u01/cdbrac/app/oracle/diag/crs/ora12c102rac02/crs/trace/ocssd.trc.
1 33.3% ora12c102rac02 [OCSSD(19266)]CRS-1601: CSSD Reconfiguration
complete. Active nodes are ora12c102rac02 .
1 33.3% ora12c102rac02 [OCSSD(19266)]CRS-1610: Network communication
with node ora12c102rac01 (1) missing for 90% of timeout interval. Removal of this node
from cluster in 2.370 seconds
----------- -------
3 100.0%
Investigation
• eth1 is down
hmm, eth1 is down
[root@ora12c102rac02 ~]# ip a
…
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN qlen 1000
link/ether 08:00:27:03:80:e6 brd ff:ff:ff:ff:ff:ff
inet 169.254.24.84/16 brd 169.254.255.255 scope global eth1:1
• Bring interface back up and all goes back to normal
• Yes, you can do this with more manual processes
• TFA just reduces time required to diagnose
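The manual check on the interface can also be scripted. This is a sketch that parses `ip a` output in the format shown above; `down_interfaces` is a hypothetical helper name.

```shell
# down_interfaces: read "ip a" (or "ip link") output on stdin and print
# the name of every interface whose state is DOWN.
down_interfaces() {
    awk '/^[0-9]+: / && / state DOWN( |$)/ {
        sub(/[@:].*/, "", $2)   # strip the trailing ":" or "@parent:"
        print $2
    }'
}

# Example use (not executed here):
#   ip a | down_interfaces
```

On the node in this simulation it would print eth1, the interface that was deliberately taken down.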
TFA
Real Life Examples
Bad NIC
• Found Node evicted due to network issue
• Only this node – found to be NIC
• Much faster diagnosis with TFA
tfactl analyze -from "Oct/17/2014 11:00:00" -to "Oct/17/2014 12:30:00" > tfa-rpt-20141017-1100-1230.txt
From the report it was seen that all nodes reported the following issue:
CRS-1610:Network communication with node uornoldbp02 (2) missing for 90% of
timeout interval. Removal of this node from cluster in 2.190 seconds
This would coincide with the findings that a NIC had failed on the server.
There don't appear to be other contributing errors.
TFA diagnosis – disk error
Pythian client - we used TFA to find cause of node eviction
3 Node RAC
tfactl analyze -from "Oct/17/2014 08:45:00" -to "Oct/17/2014 09:15:00" > tfa-rpt-20141017-0845-0915.txt
NODE 3
This looks like the smoking gun:
1 2.9% rac3rd-node
WARNING: Write Failed. group:1 disk:1 AU:176116 offset:344064 size:8192
ASMB (ospid: 3398): terminating the instance due to error 15064
ERROR: unrecoverable error ORA-15188 raised in ASM I/O path; terminating process 3368
Termination issued to instance processes. Waiting for the processes to exit
TFA
Other troubleshooting help
Stack Trace
• Why use pstack?
• Often an easy method to locate bugs in Oracle Support
• Stack trace appears in:
• Trace files generated by errors
• System/State dumps
• Stack trace for running process
• Use pstack PID
pstack 30965
#0 0x0000003f5b40e810 in __read_nocancel () from /lib64/libpthread.so.0
#1 0x000000000cef52f0 in snttread ()
#2 0x000000000cef4785 in nttfprd ()
#3 0x000000000ced4e95 in nsbasic_brc ()
#4 0x000000000ced4c96 in nsbrecv ()
#5 0x000000000cee3a4e in nioqrc ()
#6 0x000000000cb53aad in opikndf2 ()
#7 0x0000000001baf0d2 in opitsk ()
#8 0x0000000001bb3e31 in opiino ()
#9 0x000000000cb56f5d in opiodr ()
...
Stack Trace
• Some process/program acting strangely
• Bug suspected?
• Use gv$session to get PIDs of interest.
SQL> l
1 select s.username, s.inst_id, s.sid
2 , s.sql_id, s.program, p.spid spid
3 from gv$session s
4 left outer join gv$process p on s.inst_id = p.inst_id
5 and p.addr = s.paddr
6 where s.username = 'JKSTILL'
7 and s.program like 'sqlplus@poirot%'
8* order by username, sid
11:15:28 ora12c102rac01.jks.com - jkstill@js122a1 SQL> /
SRVR
USERNAME INST SID SQL ID PROGRAM PID
---------- ----- ------ -------------- ------------------------------------------------ -----
JKSTILL 1 48 sqlplus@poirot… (TNS V1-V3) 30965
1 71 gyd5fpd63tfs0 sqlplus@poirot… (TNS V1-V3) 4777
2 88 sqlplus@poirot… (TNS V1-V3) 27669
Stack Trace
• Not necessary to logon to each server in a cluster
• TFA makes this simple – especially for RAC
[root@ora12c102rac01 ~]# tfactl pstack 30965 4777 27669
Output from host : ora12c102rac01
------------------------------
# pstack output for pid : 30965
#0 0x0000003f5b40e810 in __read_nocancel () from /lib64/libpthread.so.0
#1 0x000000000cef52f0 in snttread ()
#2 0x000000000cef4785 in nttfprd ()
• But, searches all nodes …
Stack Trace
[root@ora12c102rac01 ~]# tfactl pstack 30965 4777 27669
Output from host : ora12c102rac01
------------------------------
# pstack output for pid : 30965
#0 0x0000003f5b40e810 in __read_nocancel () from /lib64/libpthread.so.0
#1 0x000000000cef52f0 in snttread ()
...
# pstack output for pid : 4777
#0 0x0000003f5b40e810 in __read_nocancel () from /lib64/libpthread.so.0
#1 0x000000000cef52f0 in snttread ()
# pstack output for pid : 27669
Process 27669 not found.
Output from host : ora12c102rac02
------------------------------
# pstack output for pid : 30965
Process 30965 not found.
# pstack output for pid : 4777
Process 4777 not found.
# pstack output for pid : 27669
#0 0x0000003a4b60e740 in __read_nocancel () from /lib64/libpthread.so.0
#1 0x000000000ceb67d0 in snttread ()
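Because every node is asked about every PID, most of the output is "not found" noise. The following is a minimal Python sketch, not part of TFA, that reduces such output to the host where each PID actually produced a stack. It assumes the output keeps exactly the layout shown on the slide; the function name `locate_pids` and the shortened `SAMPLE` text are illustrative only.

```python
import re

# Line shapes taken from the slide output; other TFA versions may differ.
HOST_RE = re.compile(r"^Output from host : (\S+)")
PID_RE = re.compile(r"^# pstack output for pid : (\d+)")
FRAME_RE = re.compile(r"^#\d+\s")


def locate_pids(text):
    """Return {pid: (host, [stack frames])} for PIDs that produced a stack."""
    found = {}
    host = pid = None
    for line in text.splitlines():
        line = line.strip()
        m = HOST_RE.match(line)
        if m:
            host, pid = m.group(1), None
            continue
        m = PID_RE.match(line)
        if m:
            pid = int(m.group(1))
            continue
        # "Process N not found." lines are skipped; only frames are recorded.
        if host and pid is not None and FRAME_RE.match(line):
            found.setdefault(pid, (host, []))[1].append(line)
    return found


# Shortened copy of the slide output, used only to exercise the parser.
SAMPLE = """\
Output from host : ora12c102rac01
------------------------------
# pstack output for pid : 30965
#0 0x0000003f5b40e810 in __read_nocancel () from /lib64/libpthread.so.0
#1 0x000000000cef52f0 in snttread ()
# pstack output for pid : 27669
Process 27669 not found.
Output from host : ora12c102rac02
------------------------------
# pstack output for pid : 30965
Process 30965 not found.
# pstack output for pid : 27669
#0 0x0000003a4b60e740 in __read_nocancel () from /lib64/libpthread.so.0
#1 0x000000000ceb67d0 in snttread ()
"""

for pid, (host, frames) in sorted(locate_pids(SAMPLE).items()):
    print(pid, host, len(frames))
```

If the same PID number happens to exist on two nodes, this sketch keeps the first host only; keying on (host, pid) would avoid that.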
grep
• Search all alert logs
• Find all occurrences of ORA-00600 in all alert logs
[root@ora12c102rac01 ~]# tfactl grep -E 'ORA-[0]{0,2}600' alert_
Output from host : ora12c102rac01
------------------------------
Searching 'ORA-[0]{0,2}600' in alert_
Searching /u01/cdbrac/app/oracle/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
Searching /u01/cdbrac/app/oracle/diag/rdbms/_mgmtdb/-MGMTDB/trace/alert_-MGMTDB.log
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
Searching /u01/cdbrac/app/oracle/diag/rdbms/js122a/js122a1/trace/alert_js122a1.log
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
ORA-00600: internal error code, arguments: [25027], [3], [3], [3], [1], [], [], [], [], [], [], []
Output from host : ora12c102rac02
------------------------------
Similar output for ora12c102rac02
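The pattern `ORA-[0]{0,2}600` allows zero, one or two leading zeros, so it finds ORA-600, ORA-0600 and ORA-00600. The short Python sketch below shows this with the standard `re` module, on the assumption that this construct behaves the same way under `tfactl grep -E`. Because the pattern is not anchored on the right, it can also match the start of a longer error number; `STRICT_PATTERN` is a hypothetical tighter variant written for Python only, and the sample lines other than the ORA-00600 line are made up for illustration.

```python
import re

# Pattern exactly as used on the slide.
SLIDE_PATTERN = re.compile(r"ORA-[0]{0,2}600")
# Hypothetical stricter variant: reject a match that is followed by another digit.
STRICT_PATTERN = re.compile(r"ORA-[0]{0,2}600(?!\d)")

lines = [
    "ORA-00600: internal error code, arguments: [25027], [3], [3], [3], [1]",
    "ORA-600 reported by the application",   # made-up line
    "ORA-0600",                              # made-up line
    "ORA-06001: (a different error)",        # made-up line
    "ORA-07445: (a different error)",        # made-up line
]

slide_hits = [line for line in lines if SLIDE_PATTERN.search(line)]
strict_hits = [line for line in lines if STRICT_PATTERN.search(line)]

print("slide pattern matched :", len(slide_hits))
print("strict pattern matched:", len(strict_hits))
```

The slide pattern also picks up the ORA-06001 line, since "ORA-0600" is a prefix of it; the stricter variant does not.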
summary
[root@ora12c102rac01 ~]# tfactl summary
Output from host : ora12c102rac01
------------------------------
=====
Nodes
=====
ora12c102rac01
ora12c102rac02
=====
Homes
=====
.----------------------------------------------------------------------------------------------------------.
| Home | Type| Version | Database| Instance | Patches |
+----------------------------------------------+-----+------------+---------+----------+-------------------+
| /u01/cdbrac/app/grid/product/12.1.0.2/grid_1 | GI | 12.1.0.2.0 | | | |
| /u01/cdbrac/app/oracle/product/12.1.0.2/db_1 | DB | 12.1.0.2.0 | js122a | js122a1 | 23054327,23054246 |
'----------------------------------------------+-----+------------+---------+----------+-------------------'
Other nodes similar
ls
• Find all alert logs
[root@ora12c102rac01 ~]# tfactl ls alert
Output from host : ora12c102rac01
------------------------------
/u01/cdbrac/app/oracle/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log
/u01/cdbrac/app/oracle/diag/crs/ora12c102rac01/crs/trace/alert.log
/u01/cdbrac/app/oracle/diag/rdbms/_mgmtdb/-MGMTDB/trace/alert_-MGMTDB.log
/u01/cdbrac/app/oracle/diag/rdbms/js122a/js122a1/trace/alert_js122a1.log
Output from host : ora12c102rac02
------------------------------
/u01/cdbrac/app/oracle/diag/asm/+asm/+ASM2/trace/alert_+ASM2.log
/u01/cdbrac/app/oracle/diag/crs/ora12c102rac02/crs/trace/alert.log
/u01/cdbrac/app/oracle/diag/rdbms/_mgmtdb/-MGMTDB/trace/alert_-MGMTDB.log
/u01/cdbrac/app/oracle/diag/rdbms/js122a/js122a2/trace/alert_js122a2.log
param
• Listener parameters for all nodes
[root@ora12c102rac01 ~]# tfactl param .*_listener
Output from host : ora12c102rac01
------------------------------
JS122A.ora12c102rac01.jks.com.JS122A.js122a1.local_listener = (ADDRESS=(PROTOCOL=TCP)(HOST=192.168.1.233)(PORT=1521))
JS122A.ora12c102rac02.jks.com.JS122A.js122a2.local_listener = (ADDRESS=(PROTOCOL=TCP)(HOST=192.168.1.234)(PORT=1521))
JS122A.ora12c102rac01.jks.com.JS122A.js122a1.remote_listener = ora12c102rac-scan:1521
JS122A.ora12c102rac02.jks.com.JS122A.js122a2.remote_listener = ora12c102rac-scan:1521
Output from host : ora12c102rac02
------------------------------
JS122A.ora12c102rac01.jks.com.JS122A.js122a1.local_listener = (ADDRESS=(PROTOCOL=TCP)(HOST=192.168.1.233)(PORT=1521))
JS122A.ora12c102rac02.jks.com.JS122A.js122a2.local_listener = (ADDRESS=(PROTOCOL=TCP)(HOST=192.168.1.234)(PORT=1521))
JS122A.ora12c102rac01.jks.com.JS122A.js122a1.remote_listener = ora12c102rac-scan:1521
JS122A.ora12c102rac02.jks.com.JS122A.js122a2.remote_listener = ora12c102rac-scan:1521
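Each node returns the same four lines here, which is what you want to see. As a rough Python sketch (not a TFA feature), the check can be automated by parsing the `name = value` lines per host and comparing the results. The layout is assumed to be exactly as on the slide; `parse_param_output`, `nodes_agree` and the shortened `PARAM_SAMPLE` are illustrative names.

```python
def parse_param_output(text):
    """Return {host: {parameter_name: value}} from tfactl param style output."""
    hosts = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Output from host :"):
            current = hosts.setdefault(line.split(":", 1)[1].strip(), {})
        elif current is not None and " = " in line:
            # Split on the first " = " only; the values contain "=" themselves.
            name, value = line.split(" = ", 1)
            current[name.strip()] = value.strip()
    return hosts


def nodes_agree(hosts):
    """True when every host reported an identical parameter mapping."""
    values = list(hosts.values())
    return all(v == values[0] for v in values[1:])


# Shortened copy of the slide output, used only to exercise the parser.
PARAM_SAMPLE = """\
Output from host : ora12c102rac01
------------------------------
JS122A.ora12c102rac01.jks.com.JS122A.js122a1.local_listener = (ADDRESS=(PROTOCOL=TCP)(HOST=192.168.1.233)(PORT=1521))
JS122A.ora12c102rac01.jks.com.JS122A.js122a1.remote_listener = ora12c102rac-scan:1521
Output from host : ora12c102rac02
------------------------------
JS122A.ora12c102rac01.jks.com.JS122A.js122a1.local_listener = (ADDRESS=(PROTOCOL=TCP)(HOST=192.168.1.233)(PORT=1521))
JS122A.ora12c102rac01.jks.com.JS122A.js122a1.remote_listener = ora12c102rac-scan:1521
"""

parsed = parse_param_output(PARAM_SAMPLE)
print(sorted(parsed), nodes_agree(parsed))
```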
Even more
• procwatcher
• sqlt
• alertsummary
• vi
• tail
• dbglevel
• history
• RDA
• changes
TFA Miscellaneous
TFA Data Masking
• Data masked during collection
• Create tfa_home/resources/mask_strings.xml
<mask_strings>
<mask_string>
<original>WidgetNode1</original>
<replacement>Node1</replacement>
</mask_string>
<mask_string>
<original>192.168.5.1</original>
<replacement>Node1-IP</replacement>
</mask_string>
...
</mask_strings>
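With more than a handful of hostnames and addresses, the file is easier to generate than to type. A minimal Python sketch follows, using only the element names shown above; whether TFA expects anything beyond these elements has not been verified here, and `build_mask_strings` is an illustrative helper, not a TFA utility.

```python
import xml.etree.ElementTree as ET


def build_mask_strings(mapping):
    """Build a <mask_strings> tree from {original: replacement} pairs."""
    root = ET.Element("mask_strings")
    for original, replacement in mapping.items():
        entry = ET.SubElement(root, "mask_string")
        ET.SubElement(entry, "original").text = original
        ET.SubElement(entry, "replacement").text = replacement
    return root


# Same example values as on the slide.
root = build_mask_strings({
    "WidgetNode1": "Node1",
    "192.168.5.1": "Node1-IP",
})
if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
    ET.indent(root)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The printed text would then be saved as tfa_home/resources/mask_strings.xml.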
SR Data Collection
• Error Types
▪ ora600
▪ ora7445
▪ ora700
▪ ora4031
▪ ora4030
▪ ora27300
▪ ora27301
▪ ora27302
• Performance
▪ dbperf
• Internal Errors
▪ internalerror
• Does not work as root
[root@ora12c102rac01 ~]# tfactl diagcollect -srdc -help
SRDC diagostic collections must be run as an oracle privileged user - not root
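A wrapper script can catch the root restriction before tfactl does. The Python sketch below builds the command line for an SRDC collection and refuses to do so when running as root. The type names are only the ones listed on this slide (other TFA versions may offer more), and `srdc_command` with its `euid` test parameter is hypothetical.

```python
import os

# SRDC types as listed on the slide; not necessarily exhaustive.
SRDC_TYPES = (
    "ora600", "ora7445", "ora700", "ora4031", "ora4030",
    "ora27300", "ora27301", "ora27302",
    "dbperf", "internalerror",
)


def srdc_command(srdc_type, euid=None):
    """Return the argv for an SRDC collection; refuse root and unknown types."""
    if euid is None:
        euid = os.geteuid()  # Unix only, which fits the Linux hosts shown here
    if euid == 0:
        raise PermissionError(
            "SRDC collections must be run as an Oracle privileged user, not root"
        )
    if srdc_type not in SRDC_TYPES:
        raise ValueError(f"unknown SRDC type: {srdc_type!r}")
    return ["tfactl", "diagcollect", "-srdc", srdc_type]


# euid is passed explicitly here so the example does not depend on the caller.
print(" ".join(srdc_command("ora700", euid=1001)))
```

The returned list can be handed to `subprocess.run` on a host where tfactl is installed.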
SR Data Collection Examples
[oracle@ora12c102rac01 ~]$ tfactl diagcollect -srdc ora700
Enter the time of the ORA-00700 [YYYY-MM-DD HH24:MI:SS,<RETURN>=ALL] : 2017-01-29
Enter the Database Name [<RETURN>=ALL] : js122a
No events matching the timestamp Jan/28/2017 18:00:00-Jan/29/2017 06:00:00.
The timestamp must be between Dec/28/2016 11:14:08 and Jan/25/2017 16:58:35.
...
2017/01/30 18:31:10 EST : Completed collection of zip files.
...
Logs are being collected to:
/u01/cdbrac/app/oracle/tfa/repository/srdc_ora700_collection_Mon_Jan_30_15_30_16_PST_2017_node_local
/u01/cdbrac/app/oracle/tfa/repository/srdc_ora700_collection_Mon_Jan_30_15_30_16_PST_2017_node_local/ora12c102rac01.tfa_srdc_ora700_Mon_Jan_30_15_30_16_PST_2017.zip
What’s left?
Start Using Them!
ORAchk and TFA
QR Code: http://bit.ly/orachk-and-tfa
Resources
• Oracle Blog – ORAchk Overview: How to Use ORAchk to Reduce Your Risk
• https://community.oracle.com/community/support/support-blogs/database-support-blog/blog/2015/11/09/how-to-use-orachk-to-reduce-your-risk
• Oracle Blog – TFA Overview: Oracle Trace File Analyzer (TFA) - an Overview Guide
• https://community.oracle.com/community/support/support-blogs/database-support-blog/blog/2016/12/12/oracle-trace-file-analyzer-tfa-an-overview-guide
• NFS rpc_wait issues
• http://blog.tanelpoder.com/2013/02/21/peeking-into-linux-kernel-land-using-proc-filesystem-for-quickndirty-troubleshooting/
• NFS Troubleshooting
• https://wiki.archlinux.org/index.php/NFS/Troubleshooting
OS Watcher locates intermittent issue
• http://berxblog.blogspot.ca/2016/12/interconnect-fragmentation-kills-cluster.html
• Used oswatcher to find issue.
• Check the referenced Oracle Support Notes as well.
THANK YOU
ABOUT PYTHIAN
Pythian’s 400+ IT professionals help companies adopt and manage disruptive technologies to better compete
• EXPERIENCED: 11,800 systems currently managed by Pythian
• GLOBAL: 400 Pythian experts in 35 countries
• EXPERTS: 2 millennia of experience gathered and shared over 19 years
TECHNICAL EXPERTISE
• Infrastructure: Transforming and managing the IT infrastructure that supports the business
• DevOps: Providing critical velocity in software deployment by adopting DevOps practices
• Cloud: Using the disruptive nature of cloud for accelerated, cost-effective growth
• Databases: Ensuring databases are reliable, secure, available and continuously optimized
• Big Data: Harnessing the transformative power of data on a massive scale
• Advanced Analytics: Mining data for insights & business transformation using data science