Exadata disk confinement tests

We recently had an issue where the following messages appeared in the RDBMS alert log of an Exadata database:


Errors in file /u01/app/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_pr0c_00001.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.10.11/DATA01_CD_05_exa01cel01 at offset 20905918464 for data length 294912
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:1 disk:17 AU:4984 offset:1507328 size:294912
path:o/192.168.10.11/DATA01_CD_05_exa01cel01
incarnation:0xe9691a44 asynchronous result:'I/O error'
subsys:OSS iop:0x7fd92e712100 bufp:0x7fd92d68b000 osderr:0xc9 osderr1:0x0
WARNING: failed to read mirror side 1 of virtual extent 321 logical extent 0 of file 1386 in group [1.2631526429] from disk DATA01_CD_05_EXA01CEL01 allocation unit 4984 reason error; if possible, will try another mirror side
NOTE: successfully read mirror side 2 of virtual extent 321 logical extent 1 of file 1386 in group [1.2631526429] from disk DATA01_CD_03_EXA01CEL07 allocation unit 5360
Tue May 28 18:07:14 2013
NOTE: disk 17 (DATA01_CD_05_EXA01CEL01) in group 1 (DATA01) is offline for reads
NOTE: disk 17 (DATA01_CD_05_EXA01CEL01) in group 1 (DATA01) is offline for writes
NOTE: disk 17 (RECO01_CD_05_EXA01CEL01) in group 4 (RECO01) is offline for reads
NOTE: disk 17 (RECO01_CD_05_EXA01CEL01) in group 4 (RECO01) is offline for writes
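The ORA-27603 line already tells us which cell and grid disk to look at: the path has the form o/<cell IP>/<grid disk name>. As a minimal sketch, assuming the message format is exactly as in the excerpt above, the following pulls those fields out of an alert log line. The function name parse_ora_27603 is hypothetical, not part of any Oracle tool.

```python
import re

# Matches the ORA-27603 line as it appears in the excerpt above.
# Assumption: the path is always o/<cell IP>/<grid disk name>.
ORA_27603 = re.compile(
    r"ORA-27603: .*? on disk o/(?P<cell_ip>[\d.]+)/(?P<grid_disk>\S+) "
    r"at offset (?P<offset>\d+) for data length (?P<length>\d+)"
)


def parse_ora_27603(line):
    """Return the cell IP, grid disk, offset and length, or None if no match."""
    m = ORA_27603.search(line)
    if m is None:
        return None
    return {
        "cell_ip": m.group("cell_ip"),
        "grid_disk": m.group("grid_disk"),
        "offset": int(m.group("offset")),
        "length": int(m.group("length")),
    }


sample = (
    "ORA-27603: Cell storage I/O error, I/O failed on disk "
    "o/192.168.10.11/DATA01_CD_05_exa01cel01 at offset 20905918464 "
    "for data length 294912"
)
info = parse_ora_27603(sample)
print(info["cell_ip"], info["grid_disk"])
```

Here the cell to connect to is 192.168.10.11 and the affected grid disk is DATA01_CD_05_exa01cel01.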

If we connect to the cell in question and run “cellcli -e list alerthistory”, we can also see the following:


13_1 2013-05-28T18:07:13+01:00 warning "Hard disk entered confinement offline status. The LUN 0_5 changed status to warning - confinedOffline. CellDisk changed status to normal - confinedOffline. All subsequent I/Os on this disk are failed immediately. Confinement tests will be run on the disk to determine if the disk should be dropped. Status : WARNING - CONFINEDOFFLINE Manufacturer : SEAGATE Model Number : ST360057SSUN600G Size : 600G Serial Number : 1018E0MDQC Firmware : 0B25 Slot Number : 5 Cell Disk : CD_05_exa01cel01 Grid Disk : RECO01_CD_05_exa01cel01, DATA01_CD_05_exa01cel01, SYSTEMDG_CD_05_exa01cel01 Reason for confinement : threshold for service time exceeded"
13_2 2013-05-28T18:10:47+01:00 clear "Hard disk status changed to normal. Status : NORMAL Manufacturer : SEAGATE Model Number : ST360057SSUN600G Size : 600GB Serial Number : 1018E0MDQC Firmware : 0B25 Slot Number : 5 Cell Disk : CD_05_exa01cel01 Grid Disk : RECO01_CD_05_exa01cel01, DATA01_CD_05_exa01cel01, SYSTEMDG_CD_05_exa01cel01"

So we can see from the cell alert that the disk was temporarily ring-fenced for testing, not because of media errors but because it exceeded a service time threshold. As the second alert shows, it passed the confinement tests and was returned to a normal state about three and a half minutes later. No disk replacement was needed, and there was no further issue. In a way it is a shame that this error gets passed up as far as the RDBMS alert log: it tends to show up on monitoring systems that way, even though the read was satisfied from the other mirror side and nothing needs to be done.
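To keep this kind of transient confinement from paging anyone, a monitoring script could pair each confinement warning with its matching clear entry. The sketch below assumes the alerthistory layout seen above (name, timestamp, severity, quoted message) and that related entries share the number before the underscore, as 13_1 and 13_2 do. The function name confinement_episodes is hypothetical, and the sample messages are abbreviated from the real output.

```python
import re
from datetime import datetime

# One alerthistory entry: name (<alert>_<sequence>), ISO timestamp,
# severity, then the quoted message. Layout assumed from the excerpt above.
ENTRY = re.compile(
    r'^\s*(?P<alert>\d+)_(?P<seq>\d+)\s+(?P<ts>\S+)\s+(?P<severity>\w+)\s+"(?P<msg>.*)"\s*$'
)
REASON = re.compile(r"Reason for confinement : (.+)$")


def confinement_episodes(lines):
    """Pair each confinement warning with the 'clear' entry of the same alert.

    Returns (closed episodes, alerts still open at the end of the input).
    """
    open_alerts = {}
    episodes = []
    for line in lines:
        m = ENTRY.match(line)
        if m is None:
            continue
        alert = m.group("alert")
        when = datetime.fromisoformat(m.group("ts"))
        msg = m.group("msg")
        if "entered confinement" in msg:
            reason = REASON.search(msg)
            open_alerts[alert] = (when, reason.group(1) if reason else None)
        elif m.group("severity") == "clear" and alert in open_alerts:
            started, why = open_alerts.pop(alert)
            episodes.append({
                "alert": alert,
                "reason": why,
                "seconds_confined": int((when - started).total_seconds()),
            })
    return episodes, open_alerts


# Abbreviated versions of the two entries shown above.
sample = [
    '13_1 2013-05-28T18:07:13+01:00 warning "Hard disk entered confinement '
    'offline status. Reason for confinement : threshold for service time exceeded"',
    '13_2 2013-05-28T18:10:47+01:00 clear "Hard disk status changed to normal. '
    'Status : NORMAL"',
]
episodes, still_open = confinement_episodes(sample)
print(episodes, still_open)
```

An episode that closes by itself, like this one, probably needs no action; an alert left in still_open is the case worth a closer look.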

ASM Support Guy has written an article on this new feature, which was introduced in Storage Server 11.2.3.2:

http://asmsupportguy.blogspot.co.uk/2013/05/identification-of-under-performing.html
