Failures in RAC



Failure of Voting disk

Move voting disk to OCR_VOTE
=============================



CORRUPTION : Create a new diskgroup, since the current diskgroup is corrupted


1. Stop CRS on all nodes and start it in exclusive mode (as root)


# crsctl stop crs -f

# crsctl start crs -excl -nocrs
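
To confirm the stack came up in exclusive mode, the init resources can be checked; with -nocrs, ora.cssd should be ONLINE while ora.crsd stays OFFLINE:

# crsctl stat res -t -init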



2. Start the ASM instance using a pfile

SQL> startup pfile='/u01/app/oracle/init+ASM1.ora';
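
A minimal sketch of what such an ASM pfile might contain (the values below are assumptions; instance_type='asm' is the essential setting):

*.instance_type='asm'
*.asm_diskstring='/dev/asm-*'
*.asm_power_limit=1
*.memory_target=1024M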


If all the voting disks are corrupted, create a new diskgroup:

CREATE DISKGROUP OCR_VOTE NORMAL REDUNDANCY
     FAILGROUP controller01 DISK '/dev/asm-ocr_vote1'
     FAILGROUP controller02 DISK '/dev/asm-ocr_vote2'
     FAILGROUP controller03 DISK '/dev/asm-ocr_vote3'
     ATTRIBUTE
     'au_size'='1M',
     'compatible.asm' = '12.1';
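
Before moving the voting disks, confirm the new diskgroup is mounted:

SQL> select name, state, type, total_mb, free_mb from v$asm_diskgroup where name = 'OCR_VOTE';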

SQL> ! srvctl start diskgroup -g ocr_vote -n node2      -- mount the diskgroup on the other nodes
or
ASMCMD> lsdg                                            -- list diskgroups and their mount state
ASMCMD> mount ocr_vote                                  -- mount the diskgroup if it is not mounted

$GRID_HOME/bin/crsctl query css votedisk             -- check the current location

$GRID_HOME/bin/crsctl replace votedisk +OCR_VOTE     -- move the voting disks to OCR_VOTE

$GRID_HOME/bin/crsctl query css votedisk             -- confirm the move
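
Expected output after the replace (File Universal Ids and paths are illustrative):

##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   6e23bf84a31c4f61bf87b2c8a16f0d93 (/dev/asm-ocr_vote1) [OCR_VOTE]
 2. ONLINE   7f34c095b42d5a72c098c3d9b27a1ea4 (/dev/asm-ocr_vote2) [OCR_VOTE]
 3. ONLINE   8a45d1a6c53e6b83d1a9d4eac38b2fb5 (/dev/asm-ocr_vote3) [OCR_VOTE]
Located 3 voting disk(s).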



Otherwise, if a surviving copy exists (i.e. only 2 out of 3 voting disks are corrupted), create the new diskgroup and then drop the old one

CREATE DISKGROUP OCR_VOTE NORMAL REDUNDANCY
     FAILGROUP controller01 DISK '/dev/asm-ocr_vote1'
     FAILGROUP controller02 DISK '/dev/asm-ocr_vote2'
     FAILGROUP controller03 DISK '/dev/asm-ocr_vote3'
     ATTRIBUTE
     'au_size'='1M',
     'compatible.asm' = '12.1';


$GRID_HOME/bin/crsctl query css votedisk -- check current location

$GRID_HOME/bin/crsctl replace votedisk +OCR_VOTE -- moves to OCR_VOTE

$GRID_HOME/bin/crsctl query css votedisk

SQL> drop diskgroup #old diskgroup name# force including contents;     -- drop the old, corrupted diskgroup





3. Stop and start CRS (as root) on all nodes

# crsctl stop crs -f

# crsctl start crs   -- Run on other nodes as well

# crsctl start cluster -all



# $GRID_HOME/bin/crsctl status resource -t


Failure of OCR

Corruption : Restore OCR 
===============================

1. Check if corrupted


# ocrcheck
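
Typical ocrcheck output looks like the following (values are illustrative); a corrupted registry fails the integrity check:

Status of Oracle Cluster Registry is as follows :
         Version                  :          4
         Total space (kbytes)     :     409568
         Used space (kbytes)      :       1632
         Available space (kbytes) :     407936
         ID                       : 1422918396
         Device/File Name         : +CRS_TMP
                                    Device/File integrity check succeeded
         Cluster registry integrity check succeeded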


2. Stop CRS and start it in exclusive mode (as root)


# crsctl stop crs -f

# crsctl start crs -excl -nocrs


3. Check OCR location

$ cat /etc/oracle/ocr.loc

$GRID_HOME/log/<hostname>/client/ocrcheck_<pid>.log
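
The ocr.loc file records the current OCR location; its contents typically look like this (diskgroup name illustrative):

ocrconfig_loc=+CRS_TMP
local_only=FALSE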



4. Check latest OCR backup

$GRID_HOME/bin/ocrconfig -showbackup
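
Sample -showbackup output (node name and timestamps are illustrative; Oracle keeps automatic 4-hourly, daily, and weekly OCR backups):

node1     2016/03/10 04:05:34     /u01/app/12.1.0/grid/cdata/bhurac/backup00.ocr
node1     2016/03/10 00:05:32     /u01/app/12.1.0/grid/cdata/bhurac/backup01.ocr
node1     2016/03/09 20:05:30     /u01/app/12.1.0/grid/cdata/bhurac/day.ocr
node1     2016/03/02 08:05:22     /u01/app/12.1.0/grid/cdata/bhurac/week.ocr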


5. Restore as the root user

# ocrconfig -restore $GRID_HOME/cdata/bhurac/backup00.ocr


6. Stop and start (root user)

# crsctl stop crs -f

# crsctl start crs

# $GRID_HOME/bin/crsctl status resource -t

7. Check for corruption

# ocrcheck



No Corruption : Create new OCR_VOTE diskgroup and move OCR from +CRS_TMP to OCR_VOTE
====================================================================


/u01/app/12.1.0/grid/bin/ocrcheck

It shows Device/File Name         :   +CRS_TMP



/u01/app/12.1.0/grid/bin/ocrconfig -add +OCR_VOTE 


# /u01/app/12.1.0/grid/bin/ocrcheck


It now shows both locations:

Device/File Name         :   +CRS_TMP
Device/File Name         :   +OCR_VOTE


/u01/app/12.1.0/grid/bin/ocrconfig -delete +CRS_TMP


/u01/app/12.1.0/grid/bin/ocrcheck


It will now show only Device/File Name         :   +OCR_VOTE


Check on the other nodes as well with the ocrcheck command, as sketched below.
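
One way to do that from a single node (node names are illustrative):

# for n in node1 node2; do ssh $n /u01/app/12.1.0/grid/bin/ocrcheck; done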



Failure of VIP : VIP Failover

Failure of disk



======================== Case I : ASM detects read/write errors ==================================

ASM detects read/write errors and records them in the READ_ERRS and WRITE_ERRS columns of v$asm_disk for the affected disk


1. Check for failed disk

select path, name, mount_status, header_status from v$asm_disk where write_errs > 0;

select path, name, mount_status, header_status from v$asm_disk where read_errs > 0;


Note : header_status column may still be shown as "MEMBER"


2. Drop the disk

alter diskgroup #name# drop disk #disk name#;

select state, power, group_number, est_minutes from v$asm_operation;

Run the query until it returns no rows (the rebalance is then complete)


Note : Physically remove the disk only after the header_status for the failed disk becomes "FORMER"
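
A quick check for that (run against the ASM instance):

select path, header_status from v$asm_disk where header_status = 'FORMER';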







3. Add new disk

SELECT NVL(a.name, '[CANDIDATE]') disk_group_name,
       b.path disk_file_path,
       b.name disk_file_name,
       b.failgroup disk_file_fail_group
FROM   v$asm_diskgroup a
       RIGHT OUTER JOIN v$asm_disk b USING (group_number)
ORDER BY a.name;


ALTER DISKGROUP testdb_data1 ADD   FAILGROUP controller1 DISK '/dev/raw/raw5'
                                   FAILGROUP controller2 DISK '/dev/raw/raw6' REBALANCE POWER 11;

OR


select distinct header_status from v$asm_disk where path = '/dev/sdk1';     -- the new disk must show as CANDIDATE; filter on path, since an unassigned disk has no ASM name yet
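
The ADD DISK statement for this second approach is not shown above; a minimal sketch, assuming the same diskgroup as in the first example:

ALTER DISKGROUP testdb_data1 ADD DISK '/dev/sdk1' REBALANCE POWER 11;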



select state, power, group_number, est_minutes from v$asm_operation;

Run the query until it returns no rows (the rebalance is then complete)



================================ Case II : ASM drops the disk =============================================

When ASM drops the disk on its own, the ASM alert log shows messages such as:

ORA-27061: waiting for async I/Os failed
WARNING: IO Failed. subsys:System dg:0, diskname:/dev/sds1
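
To find such messages, search the ASM alert log (the ADR path below assumes a default 12.1 layout and an instance named +ASM1):

$ grep -i "IO Failed" /u01/app/oracle/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log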


ASM automatically rebalances the data, which can be monitored using:

select state, power, group_number, est_minutes from v$asm_operation;


Failure of Node  : Node Eviction

Failure of Instance : Instance Recovery