Last night a friend called me to check a network service issue after she had added two SCAN IPs to a RAC system.
She added these two IPs to resolve a load-balancing issue: the customer found that connections were not distributed evenly across the two nodes. I suggested she check the service configuration first, because I thought that even with just one SCAN IP the connections should still be distributed according to the service configuration, as in the example below.
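For example, the service placement and connection distribution can usually be checked with commands like these (the database name 'orcl' and service name 'app_svc' are placeholders, not the real names from this system):

[oracle@node01:/home/oracle]$srvctl config service -d orcl -s app_svc
[oracle@node01:/home/oracle]$srvctl status service -d orcl -s app_svc
[oracle@node01:/home/oracle]$lsnrctl services LISTENER_SCAN1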
She added the two SCAN IPs anyway and found the issue was not fixed, so she tried to restart the network service while the Clusterware stack was still running, but for some reason the network service could not be started. I asked her to stop the Clusterware stack first and then restart the network service. She did, but the network service still failed to start.
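For reference, stopping the Clusterware stack and bouncing the network service is roughly this sequence (a sketch, run as root, assuming a RHEL/OL 7 style network service):

[root@node01 ~]# crsctl stop crs
[root@node01 ~]# systemctl restart network
[root@node01 ~]# crsctl start crs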
I was confused, as the system had worked well for years, so I told her to restart the node and hoped that would clear the error. Of course it still failed, and now we could not even access the node through its public IP. Here I realized I had made a mistake: I should not have rebooted the node before fixing the network issue. Luckily we still had a workable node, and through it we logged in to the failed node using the private IP. This time I reviewed the network config file carefully and found that one line, 'DEVICE=eno52', was missing from the config file of the failed public network interface. I added it and then restarted the network service successfully. It was weird, as the config file had not been changed since 2019, so I guessed something had changed on the Linux side and this line was now required.
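For illustration, the interface config file ended up looking roughly like this (the IP addresses below are placeholders, not the real ones):

[root@node01 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eno52
DEVICE=eno52            # the line that was missing
TYPE=Ethernet
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.0.2.11       # placeholder public IP
NETMASK=255.255.255.0
GATEWAY=192.0.2.1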
Next we found that the Clusterware failed to start automatically. I tried to recover the system first, so I started the ASM instance manually, and most Clusterware resources came up except the crsd service. I checked the crsd log and found that crsd failed to access the ASM instance with a wrong-password error, and that the path of the password file pointed to an ASM disk group which had been deleted.
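Roughly, the manual steps looked like this (a sketch; the trace file path assumes a 12c ADR layout under the grid user's ORACLE_BASE and the hostname node01):

[grid@node01:/home/grid]$export ORACLE_SID=+ASM1
[grid@node01:/home/grid]$sqlplus / as sysasm
SQL> startup
SQL> exit
[grid@node01:/home/grid]$tail -100 $ORACLE_BASE/diag/crs/node01/crs/trace/crsd.trc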
We still had a workable node, so we confirmed the wrong path on it:
[grid@node01:/home/grid]$srvctl config asm
ASM home: <CRS home>
Password file: +griddata/orapwASM
Backup of Password file:
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
The bad news was that we could not even find a backup of the ASM password file.
Several months ago she had migrated the OCR and voting disks from GRIDDATA to OCRNEW and dropped the GRIDDATA disk group, so the password file stored in +griddata was lost along with it.
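(That migration would have been something along these lines; this is only a sketch, not the exact commands she ran:)

[root@node01 ~]# ocrconfig -add +OCRNEW
[root@node01 ~]# ocrconfig -delete +GRIDDATA
[root@node01 ~]# crsctl replace votedisk +OCRNEW
SQL> drop diskgroup GRIDDATA including contents;   -- from an ASM instance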
The note we found contained a scary piece of information:
If manual backup of the ASM passwordfile is not available then, please deconfigure and re-configure the clusterware by following the document "Doc ID 1377349.1"
And the note below provides much more detail for this situation.
How to recreate shared ASM password file in 12c GI cluster (Doc ID 1929673.1)
We recreated the lost ASM password file with the following commands:
[grid@node02:/home/grid]$$ORACLE_HOME/bin/ocrdump /tmp/ocr.dmp
PROT-310: Not all keys were dumped due to permissions.
[grid@node02:/home/grid]$vi /tmp/ocr.dmp
## Search for SYSTEM.ASM.CREDENTIALS.USERS.CRSUSER__ASM_001
[SYSTEM.ASM.CREDENTIALS.USERS.CRSUSER__ASM_001]
ORATEXT : cc081ab75e72ff20bf7628cfedad880d:grid
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ, OTHER_PERMISSION : PROCR_NONE, USER_NAME : grid, GROUP_NAME : oinstall}
We noted down the string 'cc081ab75e72ff20bf7628cfedad880d' and used it to retrieve the required username and password:
[grid@node02:/home/grid]$crsctl get credmaint -path /ASM/Self/cc081ab75e72ff20bf7628cfedad880d -credtype userpass -id 0 -attr user -local
CRSUSER__ASM_001
[grid@node02:/home/grid]$crsctl get credmaint -path /ASM/Self/cc081ab75e72ff20bf7628cfedad880d -credtype userpass -id 0 -attr passwd -local
nMvLqKfQr3P0FupelnGluQJEbDF02
The string 'nMvLqKfQr3P0FupelnGluQJEbDF02' was the password, and we had to reuse it when re-adding the CRSUSER__ASM_001 user below.
[grid@node01:/home/grid]$asmcmd
ASMCMD> pwget --asm
+griddata/orapwASM
ASMCMD> pwcreate --asm +OCRNEW/orapwASM 'Abcd#1234'
OPW-00010: Could not create the password file. This resource has a Password File.
ASMCMD-9454: could not create new password file
ASMCMD> pwcreate --asm +OCRNEW/orapwASM 'Abcd#1234' -f
ASMCMD> pwget --asm
+OCRNEW/orapwasm
ASMCMD> lspwusr
Username          sysdba  sysoper  sysasm
SYS               TRUE    TRUE     FALSE
ASMCMD> orapwusr --grant sysasm SYS
ASMCMD> orapwusr --add ASMSNMP
Enter password: *********
ASMCMD> orapwusr --grant sysdba ASMSNMP
ASMCMD> lspwusr
Username          sysdba  sysoper  sysasm
SYS               TRUE    TRUE     TRUE
ASMSNMP           TRUE    FALSE    FALSE
ASMCMD> orapwusr --add CRSUSER__ASM_001
Enter password: *****************************    ## must use the password obtained in the previous step
ASMCMD> lspwusr
Username          sysdba  sysoper  sysasm
SYS               TRUE    TRUE     TRUE
ASMSNMP           TRUE    FALSE    FALSE
CRSUSER__ASM_001  FALSE   FALSE    FALSE
ASMCMD> orapwusr --grant sysdba CRSUSER__ASM_001
ASMCMD> orapwusr --grant sysasm CRSUSER__ASM_001
ASMCMD> lspwusr
Username          sysdba  sysoper  sysasm
SYS               TRUE    TRUE     TRUE
ASMSNMP           TRUE    FALSE    FALSE
CRSUSER__ASM_001  TRUE    FALSE    TRUE
ASMCMD> exit
[grid@node01:/home/grid]$srvctl config asm
ASM home: <CRS home>
Password file: +OCRNEW/orapwasm
Backup of Password file:
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
Then we restarted the Clusterware, and this time it came up fine.
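For completeness, the restart and the final check were along these lines (run as root):

[root@node01 ~]# crsctl stop crs
[root@node01 ~]# crsctl start crs
[root@node01 ~]# crsctl stat res -t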
So to avoid this kind of situation, we'd better make a manual backup of the ASM password file.
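A simple way to do that is to copy the password file out of ASM with asmcmd (the backup destination below is just an example path):

[grid@node01:/home/grid]$asmcmd
ASMCMD> pwget --asm
+OCRNEW/orapwasm
ASMCMD> pwcopy +OCRNEW/orapwasm /home/grid/orapwasm_backup
ASMCMD> exit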