Failures in RAC



Failure of Voting disk

Move voting disk to OCR_VOTE
=============================



CORRUPTION : Create a new diskgroup, since the current diskgroup is corrupted


1. Stop CRS on all nodes and start it in exclusive mode (as root)


# crsctl stop crs -f

# crsctl start crs -excl -nocrs
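
To confirm the stack came up in exclusive mode, the init resources can be checked; with -nocrs, ora.cssd should be ONLINE while ora.crsd stays OFFLINE:

# crsctl stat res -t -init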



2. Start the ASM instance using a pfile

SQL> startup pfile='/u01/app/oracle/init+ASM1.ora';
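
A minimal sketch of what such an ASM pfile might contain (the values below are assumptions; instance_type='asm' is the essential setting):

*.instance_type='asm'
*.asm_diskstring='/dev/asm-*'
*.asm_power_limit=1
*.memory_target=1024M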


If all the voting disks are corrupted, create a new diskgroup:

CREATE DISKGROUP OCR_VOTE NORMAL REDUNDANCY
     FAILGROUP controller01 DISK '/dev/asm-ocr_vote1'
     FAILGROUP controller02 DISK '/dev/asm-ocr_vote2'
     FAILGROUP controller03 DISK '/dev/asm-ocr_vote3'
     ATTRIBUTE
     'au_size'='1M',
     'compatible.asm' = '12.1';
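
Before moving the voting disks, confirm the new diskgroup is mounted:

SQL> select name, state, type, total_mb, free_mb from v$asm_diskgroup where name = 'OCR_VOTE';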

SQL> ! srvctl start diskgroup -g ocr_vote -n node2      -- mount the diskgroup on the other nodes
or
ASMCMD> lsdg                                            -- list diskgroups and their mount state
ASMCMD> mount ocr_vote                                  -- mount the diskgroup if it is not mounted

$GRID_HOME/bin/crsctl query css votedisk             -- check the current location

$GRID_HOME/bin/crsctl replace votedisk +OCR_VOTE     -- move the voting disks to OCR_VOTE

$GRID_HOME/bin/crsctl query css votedisk             -- confirm the move
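
Expected output after the replace (File Universal Ids and paths are illustrative):

##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   6e23bf84a31c4f61bf87b2c8a16f0d93 (/dev/asm-ocr_vote1) [OCR_VOTE]
 2. ONLINE   7f34c095b42d5a72c098c3d9b27a1ea4 (/dev/asm-ocr_vote2) [OCR_VOTE]
 3. ONLINE   8a45d1a6c53e6b83d1a9d4eac38b2fb5 (/dev/asm-ocr_vote3) [OCR_VOTE]
Located 3 voting disk(s).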



Otherwise, if a surviving copy exists (i.e. only 2 out of 3 voting disks are corrupted), create the new diskgroup and then drop the old one

CREATE DISKGROUP OCR_VOTE NORMAL REDUNDANCY
     FAILGROUP controller01 DISK '/dev/asm-ocr_vote1'
     FAILGROUP controller02 DISK '/dev/asm-ocr_vote2'
     FAILGROUP controller03 DISK '/dev/asm-ocr_vote3'
     ATTRIBUTE
     'au_size'='1M',
     'compatible.asm' = '12.1';


$GRID_HOME/bin/crsctl query css votedisk -- check current location

$GRID_HOME/bin/crsctl replace votedisk +OCR_VOTE -- moves to OCR_VOTE

$GRID_HOME/bin/crsctl query css votedisk

SQL> drop diskgroup #old diskgroup name# force including contents;     -- drop the old, corrupted diskgroup





3. Stop and start CRS (as root) on all nodes

# crsctl stop crs -f

# crsctl start crs   -- Run on other nodes as well

# crsctl start cluster -all



# $GRID_HOME/bin/crsctl status resource -t


Failure of OCR

Corruption : Restore OCR 
===============================

1. Check if corrupted


# ocrcheck
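
Typical ocrcheck output looks like the following (values are illustrative); a corrupted registry fails the integrity check:

Status of Oracle Cluster Registry is as follows :
         Version                  :          4
         Total space (kbytes)     :     409568
         Used space (kbytes)      :       1632
         Available space (kbytes) :     407936
         ID                       : 1422918396
         Device/File Name         : +CRS_TMP
                                    Device/File integrity check succeeded
         Cluster registry integrity check succeeded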


2. Stop CRS and start it in exclusive mode (as root)


# crsctl stop crs -f

# crsctl start crs -excl -nocrs


3. Check OCR location

$ cat /etc/oracle/ocr.loc

$GRID_HOME/log/<hostname>/client/ocrcheck_<pid>.log
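
The ocr.loc file records the current OCR location; its contents typically look like this (diskgroup name illustrative):

ocrconfig_loc=+CRS_TMP
local_only=FALSE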



4. Check latest OCR backup

$GRID_HOME/bin/ocrconfig -showbackup
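
Sample -showbackup output (node name and timestamps are illustrative; Oracle keeps automatic 4-hourly, daily, and weekly OCR backups):

node1     2016/03/10 04:05:34     /u01/app/12.1.0/grid/cdata/bhurac/backup00.ocr
node1     2016/03/10 00:05:32     /u01/app/12.1.0/grid/cdata/bhurac/backup01.ocr
node1     2016/03/09 20:05:30     /u01/app/12.1.0/grid/cdata/bhurac/day.ocr
node1     2016/03/02 08:05:22     /u01/app/12.1.0/grid/cdata/bhurac/week.ocr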


5. Restore as the root user

# ocrconfig -restore $GRID_HOME/cdata/bhurac/backup00.ocr


6. Stop and start (root user)

# crsctl stop crs -f

# crsctl start crs

# $GRID_HOME/bin/crsctl status resource -t

7. Check for corruption

# ocrcheck



No Corruption : Create new OCR_VOTE diskgroup and move OCR from +CRS_TMP to OCR_VOTE
====================================================================


/u01/app/12.1.0/grid/bin/ocrcheck

It shows Device/File Name         :   +CRS_TMP



/u01/app/12.1.0/grid/bin/ocrconfig -add +OCR_VOTE 


# /u01/app/12.1.0/grid/bin/ocrcheck


It now shows both locations:

Device/File Name         :   +CRS_TMP
Device/File Name         :   +OCR_VOTE


/u01/app/12.1.0/grid/bin/ocrconfig -delete +CRS_TMP


/u01/app/12.1.0/grid/bin/ocrcheck


It will now show only Device/File Name         :   +OCR_VOTE


Check on the other nodes as well with the ocrcheck command, as sketched below.
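
One way to do that from a single node (node names are illustrative):

# for n in node1 node2; do ssh $n /u01/app/12.1.0/grid/bin/ocrcheck; done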



Failure of VIP : VIP Failover

Failure of disk



======================== Case I : ASM detects read/write errors ==================================

ASM detects read/write errors and records them in the READ_ERRS and WRITE_ERRS columns of v$asm_disk for the affected disk


1. Check for failed disk

select path, name, mount_status, header_status from v$asm_disk where write_errs > 0;

select path, name, mount_status, header_status from v$asm_disk where read_errs > 0;


Note : header_status column may still be shown as "MEMBER"


2. Drop the disk

alter diskgroup #name# drop disk #disk name#;

select state, power, group_number, est_minutes from v$asm_operation;

Run the query until it returns no rows (the rebalance is then complete)


Note : Physically remove the disk only after the header_status for the failed disk becomes "FORMER"
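
A quick check for that (run against the ASM instance):

select path, header_status from v$asm_disk where header_status = 'FORMER';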







3. Add new disk

SELECT NVL(a.name, '[CANDIDATE]') disk_group_name,
       b.path disk_file_path,
       b.name disk_file_name,
       b.failgroup disk_file_fail_group
FROM   v$asm_diskgroup a
       RIGHT OUTER JOIN v$asm_disk b USING (group_number)
ORDER BY a.name;


ALTER DISKGROUP testdb_data1 ADD   FAILGROUP controller1 DISK '/dev/raw/raw5'
                                   FAILGROUP controller2 DISK '/dev/raw/raw6' REBALANCE POWER 11;

OR


select distinct header_status from v$asm_disk where path = '/dev/sdk1';     -- the new disk must show as CANDIDATE; filter on path, since an unassigned disk has no ASM name yet
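
The ADD DISK statement for this second approach is not shown above; a minimal sketch, assuming the same diskgroup as in the first example:

ALTER DISKGROUP testdb_data1 ADD DISK '/dev/sdk1' REBALANCE POWER 11;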



select state, power, group_number, est_minutes from v$asm_operation;

Run the query until it returns no rows (the rebalance is then complete)



================================ Case II : ASM drops the disk =============================================

When ASM drops the disk on its own, the ASM alert log shows messages such as:

ORA-27061: waiting for async I/Os failed
WARNING: IO Failed. subsys:System dg:0, diskname:/dev/sds1
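
To find such messages, search the ASM alert log (the ADR path below assumes a default 12.1 layout and an instance named +ASM1):

$ grep -i "IO Failed" /u01/app/oracle/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log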


ASM automatically rebalances the data, which can be monitored using:

select state, power, group_number, est_minutes from v$asm_operation;


Failure of Node  : Node Eviction

Failure of Instance : Instance Recovery