Two days ago, I helped a friend deal with an Oracle block corruption issue on her SAP system.
The alert log contained many error messages about corrupted blocks, so I scanned some of the data files with the dbv command. To my surprise, some of the corrupted blocks reported in the alert log did not show up in the dbv output. To get a more accurate result, I checked the files with the RMAN 'CHECK LOGICAL VALIDATE' command and got yet another set of results. This was really interesting, and I should have found the root cause before moving on to the next step.
Instead, I made the wrong decision: I tried to recover all the corrupted blocks with RMAN blockrecover. All the backups had to be restored from the tape library, so the recovery ran for several hours, and restoring the archive logs took more than two thirds of the total time.
During the restore, I learned that the volumes were mirrored across two different SAN storage systems, and that one hard disk had been replaced on the new system. Shortly after that, the block corruption started and the instance eventually crashed. The timing was so close that I had to suspect something was wrong with the new storage system. However, the storage management tool showed no related issue, and the status of the volumes was normal.
I decided to validate the whole database. After some time, the view V$DATABASE_BLOCK_CORRUPTION held more than 290K records, and a large portion of them were of the 'ALL ZERO' type, so there had to be something wrong with the storage system.
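To see how large the 'ALL ZERO' portion really is, the rows of the view can be tallied by corruption type. The following is only a minimal sketch: it assumes the rows have already been fetched as (FILE#, BLOCK#, BLOCKS, CORRUPTION_TYPE) tuples, and the sample values and the function names are hypothetical, not data from this incident.

```python
from collections import Counter

# Hypothetical sample rows shaped like (FILE#, BLOCK#, BLOCKS, CORRUPTION_TYPE).
# These values are made up for illustration only.
sample_rows = [
    (12, 1024, 8, "ALL ZERO"),
    (12, 4096, 16, "ALL ZERO"),
    (15, 200, 1, "CHECKSUM"),
    (15, 900, 4, "FRACTURED"),
]


def blocks_by_type(rows):
    """Sum the number of corrupted blocks per corruption type."""
    totals = Counter()
    for _file_no, _block_no, blocks, corruption_type in rows:
        totals[corruption_type] += blocks
    return totals


def share_of_type(rows, corruption_type):
    """Return the fraction of corrupted blocks that have the given type."""
    totals = blocks_by_type(rows)
    total = sum(totals.values())
    return totals[corruption_type] / total if total else 0.0


if __name__ == "__main__":
    for ctype, count in blocks_by_type(sample_rows).most_common():
        print(f"{ctype}: {count} blocks")
    print(f"ALL ZERO share: {share_of_type(sample_rows, 'ALL ZERO'):.0%}")
```

A high share of all-zero blocks points away from logical damage inside Oracle and toward the layer that serves the blocks, which is the reasoning used here.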
I soon learned one more thing: the mirror was not built at the storage system level but at the OS level, with mklvcopy. So if I could make the system read only from the old storage system, I could verify whether the new storage system was causing the issue. I searched the AIX documentation in the Knowledge Center and found the following link:
Scheduling policies for mirrored writes to disk
If I set the policy to 'Parallel write with sequential read-scheduling policy', that goal would be achieved. The change requires unmounting the filesystems, so I stopped the instance, changed the policy on the logical volumes, and then verified the corrupted files again. This time all the corrupted blocks had disappeared!
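The policy change can be sketched as a small helper that only prints the commands instead of running them. It assumes that `ps` is the value of the chlv `-d` flag for parallel write with sequential read, and that the copy on the old storage system is the primary one that sequential reads go to first. The logical volume names are hypothetical.

```python
import shlex

# Assumption: "ps" is the chlv -d value for "parallel write with
# sequential read"; check the chlv documentation for your AIX level.
SCHEDULE_POLICY = "ps"


def chlv_commands(logical_volumes, policy=SCHEDULE_POLICY):
    """Build one chlv argument list per logical volume, without running it."""
    return [["chlv", "-d", policy, lv] for lv in logical_volumes]


if __name__ == "__main__":
    # Hypothetical logical volume names, for illustration only.
    hypothetical_lvs = ["lvsapdata1", "lvsapdata2"]
    for argv in chlv_commands(hypothetical_lvs):
        # Print for review; run manually after unmounting the filesystems.
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review them before touching the volumes of a production system.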
So I was lucky that I had not restored the whole database (more than 1TB). I wasted several hours, but it could have been far more.
From this issue I learned two things:
- If a DBA has a system administration (SA) background, that platform experience can sometimes help resolve an issue quickly, and it also broadens your thinking.
- When creating a backup plan, especially when the target is a tape library, take care to make sure the restore is efficient.
I will write a new post about the second thing soon.