RAC Performance Tuning

Run the awrgrpt.sql report
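
The report ships in the standard admin directory; a typical invocation from SQL*Plus (where ? expands to $ORACLE_HOME):

@?/rdbms/admin/awrgrpt.sql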

Check the following metrics:

System Statistics
System Statistics - Per Second
System Statistics - Per Transaction

SysStat and Global Messaging - RAC
SysStat and Global Messaging (per Sec)- RAC
SysStat and Global Messaging (per Tx)- RAC


Refer to the Interconnect Statistics section in the Global AWR report and check for fragmented/dropped network packets (which cause 'gc cr block lost' wait events in the database), packet timeouts due to buffer overflow, and network latency, in the following metrics (see the query after this list to confirm the interconnect configuration):

Cluster Interconnect
Ping Statistics
Interconnect Client Statistics
Interconnect Device Statistics
Dynamic Remastering Statistics
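
To confirm which interfaces the interconnect is actually running on (outside the AWR report), gv$cluster_interconnects gives a quick check; a minimal sketch:

select inst_id, name, ip_address, is_public, source
from gv$cluster_interconnects
order by inst_id;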


gv$instance_cache_transfer : Identify potential inter-instance performance problems
gv$cache_transfer          : Monitor blocks transferred by object
gv$class_cache_transfer    : Monitor block transfers by class
gv$file_cache_transfer     : Monitor blocks transferred per file
gv$temp_cache_transfer     : Monitor the transfer of temporary tablespace blocks

Global block transfer wait events : gc cr block 2-way, gc cr block 3-way, gc current block 2-way, and gc current block 3-way.
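
A minimal sketch for checking the average wait time of these events per instance from gv$system_event (AVERAGE_WAIT is in centiseconds, hence the *10):

select inst_id, event, total_waits, average_wait * 10 "AVG WAIT (ms)"
from gv$system_event
where event in ('gc cr block 2-way', 'gc cr block 3-way',
                'gc current block 2-way', 'gc current block 3-way')
order by inst_id, event;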


Sequence ORDER and CACHE : using the CACHE and ORDER attributes gives better performance for sequences accessed from multiple instances; in particular, increase the cache of the SYS.AUDSES$ sequence used by auditing (see the example after this list).
Instance affinity : try to keep related work running on a particular instance.
Hash partitioning : hash-partition tables, and create locally hash-partitioned indexes, for tables with heavy concurrent inserts.
The DB_FILE_MULTIBLOCK_READ_COUNT parameter also influences the average GCS CR block receive time.
Adjust the _LM_DLMD_PROCS parameter only in consultation with Oracle Support.

Keep undo and redo on the fastest disks.
Evaluate and reduce the number of full table scans on the database.
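
A minimal sketch of the sequence recommendation above; the cache value of 10000 is only illustrative, and my_app.order_seq is a placeholder for a hot application sequence:

-- auditing sequence owned by SYS
alter sequence sys.audses$ cache 10000;

-- application sequence that must stay ordered across instances
alter sequence my_app.order_seq cache 1000 order;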



gc current/cr block lost      : blocks lost due to interconnect or CPU problems
gc current/cr block busy      : consistent-read request delayed, most likely an I/O bottleneck
gc current/cr block congested : long run queues and/or paging due to memory deficiency
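
For the 'lost' case, the global cache loss statistics in gv$sysstat give a quick per-instance check; a minimal sketch:

select inst_id, name, value
from gv$sysstat
where name in ('gc blocks lost', 'gc blocks corrupt')
order by inst_id, name;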


Use the ASM preferred reads feature : under this mode all database read I/Os are local; only the dirty writes have to cross the network.
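
A minimal sketch of enabling preferred reads, assuming an extended cluster with a diskgroup DATA whose failure group local to node 1 is FG_SITE1 (both names are placeholders); run against the local ASM instance:

alter system set asm_preferred_read_failure_groups = 'DATA.FG_SITE1'
  scope = both sid = '+ASM1';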

One major issue in RAC is parallel query spilling across many nodes : define instance groups to control where the PX slaves run (see the sketch below).
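
A minimal sketch of instance groups, assuming a 4-node cluster and the placeholder group names OLTP and BATCH (from 11.2 onward, services and PARALLEL_FORCE_LOCAL are the usual alternatives):

-- spfile settings:
--   instances 1,2 : instance_groups = 'OLTP'
--   instances 3,4 : instance_groups = 'BATCH'

-- confine this session's PX slaves to the BATCH instances
alter session set parallel_instance_group = 'BATCH';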



Check the cluster interconnects for transfer time:

select b1.inst_id, b2.value "RECEIVED",
       b1.value "RECEIVE TIME",
       ((b1.value / b2.value) * 10) "AVG RECEIVE TIME (ms)"
from gv$sysstat b1, gv$sysstat b2
where b1.name = 'gc cr block receive time'   -- 'global cache cr block receive time' in 9i
and b2.name = 'gc cr blocks received'        -- 'global cache cr blocks received' in 9i
and b1.inst_id = b2.inst_id;



LONG_SESSIONS.SQL
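
-- non-background sessions logged on for more than 8 hours, excluding idle SQL*Net and AQ waits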


Col Sid Format 99999
Col Serial# Format 999999
Col Machine Format A15 Truncate
Col Event Format A30 Truncate
Col Inst Format 9999


Select Inst_Id Inst,Sid,Serial#,Username,Machine,Event,
Logon_Time,Sql_Id,Prev_Sql_Id
From Gv$Session
where type != 'BACKGROUND' and event not like 'SQL*Net%'
 and event not like 'Streams AQ: waiting%'
 And Nvl(24 * (Sysdate - Logon_Time),0) > 8
Order By Username;



ALL_ACTIVE.SQL 

-- active foreground sessions and the statement each one is currently executing
select distinct osuser, gv$session.inst_id, sid, username,
       substr(program, 1, 19) prog, sql_text
from gv$session, gv$sql
where status = 'ACTIVE'
and gv$session.inst_id = gv$sql.inst_id
and username is not null
and gv$session.sql_hash_value = gv$sql.hash_value
and gv$session.sql_address = gv$sql.address
and sql_text not like '%DISTINCT osuser%'
order by 2;

ASH_BY_TIME.SQL

-- ASH samples for a given time window, grouped by instance and SQL_ID
-- (the Sample_Time filter relies on the session's NLS timestamp format)
With P1 As (Select /*+ Parallel(A 6) */ Distinct A.*
            From Dba_Hist_Active_Sess_History A
            Where Sample_Time Like '22-APR-18%4.%AM')
Select Instance_Number, Sql_Id, Count(*)
From P1
Group By Instance_Number, Sql_Id
Having Count(*) > 20
Order By Count(*);


HIGH_RESOURCE.SQL

-- resource usage of statements matching the given SQL_TEXT pattern
Select Inst_Id, Sql_Id, First_Load_Time, Round(Elapsed_Time/1000000) Secs,
       Rows_Processed, Executions, Buffer_Gets, Disk_Reads, Sql_Text
From Gv$Sql
Where Upper(Sql_Text) Like '%SELECT TBD%'
And Executions > 0
Order By 1;


select * from v$instance_cache_transfer
where class = 'data block'
and instance = 1;

Check for an increase in CURRENT_CONGESTED.

SELECT
      INST_ID,
      NAME,
      FILE#,
      CLASS#,
      MAX(XNC)
FROM GV$CACHE_TRANSFER
GROUP BY INST_ID,
         NAME,
         FILE#,
         CLASS#;


SELECT
      VDF.NAME,
      VFCT.INST_ID,
      VFCT.X_2_NULL,
      VFCT.X_2_NULL_FORCED_WRITE,
      VFCT.X_2_NULL_FORCED_STALE,
      VFCT.X_2_S,
      VFCT.S_2_NULL,
      VFCT.CR_TRANSFERS,
      VFCT.CUR_TRANSFERS
FROM  GV$FILE_CACHE_TRANSFER VFCT,
      V$DATAFILE VDF
WHERE VFCT.FILE_NUMBER = VDF.FILE#
AND   VFCT.CUR_TRANSFERS > 0;





diagcollection :

/u01/app/11.2.0/grid/bin/diagcollection.pl --collect
./diagcollection.sh -help
--collect
[--crs] For collecting crs diag information
[--adr] For collecting diag information for ADR; specify ADR location
[--chmos] For collecting Cluster Health Monitor (OS) data
[--all] Default. For collecting all diag information
[--core] Unix only. Package core files with CRS data
[--afterdate] Unix only. Collects archives from the specified date
[--aftertime] Supported with -adr option. Collects archives after the specified time
[--beforetime] Supported with -adr option. Collects archives before the specified time
[--crshome] Argument that specifies the CRS Home location
[--incidenttime] Collects Cluster Health Monitor (OS) data from the specified time
[--incidentduration] Collects Cluster Health Monitor (OS) data for the specified duration

/u01/app/11.2.0/grid/bin/diagcollection.pl --clean



diagcollection.pl --collect --oh $ORACLE_HOME
./diagcollection.sh --collect --crs $GRID_HOME

diagcollection.pl --coreanalysis





TFA (Trace File Analyzer) :


# /etc/init.d/init.tfa  start
# /etc/init.d/init.tfa  stop
# /etc/init.d/init.tfa  shutdown

# /u01/app/grid/tfa/bin/tfactl print config

# /u01/app/grid/tfa/bin/tfactl print status

# /u01/app/grid/tfa/bin/tfactl set tracelevel=4 -c

# /u01/app/grid/tfa/bin/tfactl diagcollect -all -from "Oct/18/2013 00:00:00" -to "Oct/18/2013 06:00:00"

# $GRID_HOME/bin/tfactl diagcollect -all -for "Mar/22/2014"

# /u01/app/grid/tfa/bin/tfactl diagcollect -all -since 1h






Cluster Health Monitor :

CHM should be able to collect data even under CPU starvation

[grid@grac41 ~] $ $GRID_HOME/bin/crsctl status res ora.crf -init

[grid@grac41 ~]$ oclumon manage -get MASTER

$GRID_HOME/bin/crsctl start res ora.crf -init
$GRID_HOME/bin/crsctl stop res ora.crf -init

#  ps -ef | egrep "sysmond|loggerd"


Using oclumon to detect potential root causes for node evictions


$  oclumon dumpnodeview -n grac41  -last "00:15:00"





%  ~/print_sys.sh  grac41_CHMOS pcpus | egrep '#pcpus|cpuq:|03-22-14 09.55'

% ~/print_disk.sh grac41_CHMOS dm-1 pcpus dm-1 | egrep 'ior:|Device|03-22-14 09.55'

%  ~/print_top.sh grac41_CHMOS topcpu: | egrep 'topcpu|03-22-14 09.55'

% ~/print_proc.sh  grac41_CHMOS  mp_stress | egrep 'name:|03-22-14 09.55'

%  ~/print_proc.sh  grac41_CHMOS ora_lms0 | egrep 'name:|03-22-14 09.55'



http://www.hhutzler.de/blog/raccheck/


http://www.datadisk.co.uk/html_docs/rac/performance.htm