Date posted: 25-Jul-2015
Category: Technology
Uploaded by: emre-baransel
1 – Overview
2 – Failure
3 – Second Failure
4 – Usable Space
5 – ASMCMD "lsdg" Output
Emre Baransel, Advanced Support Engineer, Employee ACE, Oracle
A Deep Dive into ASM Redundancy in Exadata

Storage Servers Notation
We'll consider three storage servers in the examples: Storage Server 1, Storage Server 2, and Storage Server 3.

Disks on Storage Servers
[Figure: disks numbered 1 to 12 on Storage Server 1, Storage Server 2, and Storage Server 3.]

Physical Disks
[Figure: physical disks on Storage Server 1, Storage Server 2, and Storage Server 3.]

Logical Partitions/Diskgroups
[Figure: the disks on the three storage servers divided into system partitions, DBFS DG, RECO DG, and DATA DG.]

Grid Disks (Partitions)
[Figure: grid (ASM) disks on the three storage servers, labeled system partitions, DBFS DG, RECO DG, and DATA DG.]

Disks Usage Notation
[Figure: notation for disk usage on the three storage servers: system partitions, DBFS DG, RECO DG, and DATA DG.]

Normal Redundancy Diskgroups
[Figure: a normal redundancy diskgroup across FAILGROUP 1, FAILGROUP 2, and FAILGROUP 3.]

High Redundancy Diskgroups
[Figure: a high redundancy diskgroup across FAILGROUP 1, FAILGROUP 2, and FAILGROUP 3.]

Types of Failures
- Disk Failure
  - transient disk failure
  - physical disk failure
- Storage Server Failure
This presentation examines failures in groups for the sake of clarity. There may be exceptional cases.

Transient Disk Failures
[Figure: transient failure (disk OFFLINE) shown on the three storage servers, with system partitions, DBFS DG, RECO DG, and DATA DG.]
• If the failure is corrected, or the disk is replaced, before DISK_REPAIR_TIME is exceeded, the disk is resynced with ASM Fast Mirror Resync.
• If DISK_REPAIR_TIME is exceeded, ASM drops the disks and rebalances the data, provided there is enough space.

DISK_REPAIR_TIME Attribute
• DISK_REPAIR_TIME is a diskgroup attribute.
• The default is 3.6 hours.
• Example: alter diskgroup data set attribute 'disk_repair_time' = '4.5h';
• Altering the DISK_REPAIR_TIME attribute has no effect on disks that are already offline.
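
The transient-failure rule described above (resync if the disk returns within DISK_REPAIR_TIME, otherwise drop and rebalance) can be sketched in a few lines of Python. This is only an illustrative model, not ASM's implementation: the function names are hypothetical, the handling of a disk that returns exactly at the limit is an assumption, and the "enough space" condition for the rebalance is not modeled.

```python
def parse_repair_time_hours(value: str) -> float:
    """Convert a DISK_REPAIR_TIME string such as '4.5h' or '270m' to hours."""
    text = value.strip().lower()
    if text.endswith("m"):
        return float(text[:-1]) / 60.0
    if text.endswith("h"):
        return float(text[:-1])
    return float(text)  # no unit given: interpreted as hours


def transient_failure_outcome(offline_hours: float, repair_time: str = "3.6h") -> str:
    """Hypothetical helper: what happens to a disk that was offline for offline_hours."""
    limit = parse_repair_time_hours(repair_time)
    if offline_hours <= limit:  # assumption: returning exactly at the limit still counts
        return "resync"
    # The rebalance additionally requires enough free space (not modeled here).
    return "drop and rebalance"


print(transient_failure_outcome(1.0))          # disk back after 1 hour, default 3.6h
print(transient_failure_outcome(4.0))          # longer than the default
print(transient_failure_outcome(4.0, "4.5h"))  # same outage, attribute raised beforehand
```

The third call shows why the attribute has to be raised before the disk goes offline: the limit that matters is the one in effect when the failure happens.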

Physical Disk Failures
[Figure: physical disk failure shown on the three storage servers, with system partitions, DBFS DG, RECO DG, and DATA DG.]
• ASM doesn't wait for DISK_REPAIR_TIME: it drops the disk and rebalances the data if there is enough space (Pro-Active Disk Quarantine - 11.2.1.3.1).
• When the disk is replaced:
1. Grid disks are created.
2. Rebalance starts automatically.

Auto Disk Management
AUTO DISK MANAGEMENT feature in Exadata
• Exadata Automation Manager (XDMG) initiates automation tasks and monitors all configured storage cells for state changes.
• Exadata Automation Worker (XDWK) performs automation tasks requested by XDMG.
• _AUTO_MANAGE_EXADATA_DISKS controls the auto disk management feature. To disable the feature, set this parameter to FALSE. Range of values: TRUE (default) or FALSE.
• _AUTO_MANAGE_NUM_TRIES controls the maximum number of attempts to perform an automatic operation. Range of values: 1-10. Default value is 2.
• _AUTO_MANAGE_MAX_ONLINE_TRIES controls the maximum number of attempts to ONLINE a disk. Range of values: 1-10. Default value is 3.
NOTE:1484274.1 - Auto disk management feature in Exadata

Storage Server Failures
[Figure: a storage server marked FAILED among Storage Server 1, Storage Server 2, and Storage Server 3.]
• When a storage server fails, the whole failgroup fails in ASM.
• ASM does not drop the disks before DISK_REPAIR_TIME is exceeded.
• The same applies when the storage server is rebooted.
• If the server is back before DISK_REPAIR_TIME is exceeded, the disks are resynced and there is no rebalance.
• If DISK_REPAIR_TIME is exceeded, ASM rebalances the data if there is enough space.
• When the storage server comes back, there is a second rebalance.

Second Failure / Bad Chance
In Normal Redundancy:
What happens at a second failure depends first on when it occurs.
- If it occurs after the rebalance/resync has completed, the procedure is the same as for the first failure.
- If it occurs before the rebalance/resync has completed, the outcome depends on which disk failed:
  - If the first and second failed disks are not partner disks, a new rebalance takes place if there is enough space.
  - If the first and second failed disks are partner disks, data loss occurs.
• This is a small possibility, but it needs consideration.
• Partner disks are on different storage servers (failgroups).
• The first incident doesn't have to be a failure; a storage server reboot has the same effect.
Exadata Database Machine: How to identify cell failgroups and Partner disks for a grid disk (Doc ID 1431697.1)
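
The decision tree for a second failure in normal redundancy can be written out as a small Python function. This is a sketch of the reasoning only: the function and parameter names are hypothetical, and the outcome strings simply restate the cases listed above.

```python
def second_failure_outcome(first_recovery_done: bool, disks_are_partners: bool) -> str:
    """Hypothetical helper: outcome of a second failure in a normal redundancy diskgroup.

    first_recovery_done: the rebalance/resync for the first failure has completed.
    disks_are_partners:  the first and second failed disks are partner disks.
    """
    if first_recovery_done:
        # Redundancy was already restored, so this is handled like a first failure.
        return "handled like a first failure"
    if disks_are_partners:
        # Both mirror copies of some extents are unavailable.
        return "data loss"
    return "new rebalance, if there is enough space"


for done in (True, False):
    for partners in (True, False):
        print(done, partners, "->", second_failure_outcome(done, partners))
```

Only one of the four combinations leads to data loss: the second failure hits a partner disk before the first recovery has finished.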

In High Redundancy:
There are three copies of each extent, so a second failure never causes data loss in High Redundancy.

A New Feature
"MOUNT RESTRICTED FORCE FOR RECOVERY" feature
Available in:
>= 11.2.0.4 BP16
>= 12.1.0.2 BP4
Applicable to NORMAL redundancy diskgroups only.
Potential use cases to which this procedure applies:
1. Exadata cell rolling upgrade/patching and a partner disk failure at the same time.
2. Transient disk failure in a cell, followed by a permanent partner disk failure before the first failed disk comes back online.
NOTE:1968642.1 - Recover from diskgroup failure using the 12.1.0.2 "mount restricted force for recovery" feature - An Example

Example Procedure
"MOUNT RESTRICTED FORCE FOR RECOVERY" example:
• Cell 1: CellCLI> alter cell shutdown services all;
• Cell 2: CellCLI> alter physicaldisk <disk> simulate failureType=failed; (the database crashes)
• SQL> alter diskgroup datac1 mount restricted force for recovery;
• Cell 1: CellCLI> alter cell startup services all;
• SQL> alter diskgroup datac1 online disks in failgroup CELLFG1;
• Wait until the MODE_STATUS column in v$asm_disk changes from SYNCING to ONLINE for the disks being onlined.
• Do NOT execute the subsequent steps while the MODE_STATUS column shows SYNCING. It will lead to data corruption.
• During the resync, because of the second disk failure, ARB0 cannot read some of the required extents (those on the failed second disk) and marks the missing extents with BADFDA7A.
  (ARB0 trace file => WARNING: group 1, file 258, extent 100: filling extent with BADFDA7A during recovery)
• SQL> alter diskgroup datac1 dismount;
  SQL> alter diskgroup datac1 mount;
• Start the database and perform RMAN block media recovery.

What kind of Usable Space?
In an Exadata ASM diskgroup, we can distinguish the following kinds of disk space:
Total Raw Size (TRS)
Used Raw Size (URS)
Free Raw Size (FRS)
Total Allocatable Size (TAS) = TRS / Redundancy Factor
Used Allocatable Size (UAS) = URS / Redundancy Factor
Free Allocatable Size (FAS) = FRS / Redundancy Factor
Size Needed for Disk Failure Coverage (SNDFC) = Largest Disk (or 2 Disks for High Redundancy)
Size Needed for Cell Failure Coverage (SNCFC) = Largest Cell (or 2 Cells for High Redundancy)
Total Disk Failure Safe Allocatable Size = (TRS - SNDFC) / Redundancy Factor
Total Cell Failure Safe Allocatable Size = (TRS - SNCFC) / Redundancy Factor
Free Disk Failure Safe Allocatable Size = (FRS - SNDFC) / Redundancy Factor
Free Cell Failure Safe Allocatable Size = (FRS - SNCFC) / Redundancy Factor

Calculations for Normal Redundancy and High Redundancy
Quantity                                        Normal Redundancy           High Redundancy
Total Raw Size (TRS)                            360                         360
Used Raw Size (URS)                             120                         120
Free Raw Size (FRS)                             240                         240
Total Allocatable Size (TAS)                    TRS / 2 = 180               TRS / 3 = 120
Used Allocatable Size (UAS)                     URS / 2 = 60                URS / 3 = 40
Free Allocatable Size (FAS)                     FRS / 2 = 120               FRS / 3 = 80
Size Needed for Disk Failure Coverage (SNDFC)   10                          20
Size Needed for Cell Failure Coverage (SNCFC)   120                         240
Total Disk Failure Safe Allocatable Size        (TRS - SNDFC) / 2 = 175     (TRS - SNDFC) / 3 = 113.3
Total Cell Failure Safe Allocatable Size        (TRS - SNCFC) / 2 = 120     N/A for Quarter Rack
Free Disk Failure Safe Allocatable Size         (FRS - SNDFC) / 2 = 115     (FRS - SNDFC) / 3 = 73.3
Free Cell Failure Safe Allocatable Size         (FRS - SNCFC) / 2 = 60      N/A for Quarter Rack
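
The figures above can be reproduced with a short Python sketch. The function name and dictionary keys are hypothetical. The inputs are read off the example: TRS 360, URS 120, a largest disk of 10, and a largest cell of 120. Reserving (redundancy factor - 1) disks or cells is an assumption that generalizes "largest disk, or 2 disks for high redundancy".

```python
def usable_space(total_raw, used_raw, largest_disk, largest_cell, redundancy_factor):
    """Compute the space figures for one diskgroup.

    redundancy_factor is 2 for normal redundancy and 3 for high redundancy.
    """
    free_raw = total_raw - used_raw
    failures_covered = redundancy_factor - 1  # 1 disk/cell for normal, 2 for high
    sndfc = largest_disk * failures_covered
    sncfc = largest_cell * failures_covered
    return {
        "TAS": total_raw / redundancy_factor,
        "UAS": used_raw / redundancy_factor,
        "FAS": free_raw / redundancy_factor,
        "SNDFC": sndfc,
        "SNCFC": sncfc,
        "total_disk_safe": (total_raw - sndfc) / redundancy_factor,
        "total_cell_safe": (total_raw - sncfc) / redundancy_factor,
        "free_disk_safe": (free_raw - sndfc) / redundancy_factor,
        "free_cell_safe": (free_raw - sncfc) / redundancy_factor,
    }


normal = usable_space(360, 120, 10, 120, redundancy_factor=2)
high = usable_space(360, 120, 10, 120, redundancy_factor=3)

print(normal["total_disk_safe"], normal["free_cell_safe"])                  # 175.0 60.0
print(round(high["total_disk_safe"], 1), round(high["free_disk_safe"], 1))  # 113.3 73.3
```

The function also returns cell-failure-safe values for high redundancy, but the table marks those as N/A for a Quarter Rack, so they should be ignored for that configuration.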

What we have in ASMCMD
ASMCMD> lsdg
State    Type    Rebal  Sector  Block  AU       Total_MB  Free_MB   Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  NORMAL  N      512     4096   4194304  27942912  16708892  9314304          3697294         0              N             DATAC1/
MOUNTED  NORMAL  N      512     4096   4194304  1038240   1036984   346080           345452          0              Y             DBFS_DG/
MOUNTED  NORMAL  N      512     4096   4194304  11973312  7966060   3991104          1987478         0              N             RECOC1/

Total_MB          Total Raw Size (TRS)
Free_MB           Free Raw Size (FRS)
Req_mir_free_MB   ≥11.2.0.4.9 & ≥12.1.0.2: Size Needed for Disk Failure Coverage (SNDFC)
                  <11.2.0.4.9 & <12.1.0.2: Size Needed for Cell Failure Coverage (SNCFC)
Usable_file_MB    ≥11.2.0.4.9 & ≥12.1.0.2: Free Disk Failure Safe Allocatable Size
                  <11.2.0.4.9 & <12.1.0.2: Free Cell Failure Safe Allocatable Size
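
As a quick cross-check, the sample lsdg figures can be tested against the formula Usable_file_MB = (Free_MB - Req_mir_free_MB) / Redundancy Factor. The Python sketch below uses the three NORMAL redundancy rows shown above; the variable and function names are hypothetical. It also checks that Req_mir_free_MB equals one third of Total_MB in this output. Reading that as a reserve of one cell out of three (cell failure coverage) is an inference from the numbers, not something the output itself states.

```python
# (name, Total_MB, Free_MB, Req_mir_free_MB, Usable_file_MB) from the lsdg sample
LSDG_ROWS = [
    ("DATAC1", 27942912, 16708892, 9314304, 3697294),
    ("DBFS_DG", 1038240, 1036984, 346080, 345452),
    ("RECOC1", 11973312, 7966060, 3991104, 1987478),
]


def usable_file_mb(free_mb: int, req_mir_free_mb: int, redundancy_factor: int = 2) -> int:
    """Free failure-safe allocatable size in MB (normal redundancy by default)."""
    return (free_mb - req_mir_free_mb) // redundancy_factor


for name, total_mb, free_mb, req_mir_free_mb, reported in LSDG_ROWS:
    computed = usable_file_mb(free_mb, req_mir_free_mb)
    one_third_reserved = (total_mb // 3 == req_mir_free_mb)
    print(name, computed, computed == reported, one_third_reserved)
```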

References
Oracle Exadata Database Machine Maintenance Guide
Automatic Storage Management Administrator's Guide
NOTE:1484274.1 - Auto disk management feature in Exadata
NOTE:443835.1 - ASM Fast Mirror Resync - Example To Simulate Transient Disk Failure And Restore Disk
NOTE:1431697.1 - Exadata Database Machine: How to identify cell failgroups and Partner disks for a grid disk
NOTE:1968642.1 - Recover from diskgroup failure using the 12.1.0.2 "mount restricted force for recovery" feature - An Example
NOTE:1386147.1 - How to Replace a Hard Drive in an Exadata Storage Server (Hard Failure)
NOTE:1339373.1 - Operational Steps for Recovery after Losing a Disk Group in an Exadata Environment
NOTE:1551288.1 - Understanding ASM Capacity and Reservation of Free Space in Exadata
NOTE:1319567.1 - ASM Usable Space Calculations in Exadata Environment along with cell failure considerations