I never thought that mounting a shared filesystem could cause a server reboot!
One node of a three-node RAC database rebooted for no apparent reason, and I was contacted to find the possible cause.
I checked the servers and found it hard to confirm the root cause, since I had no chance to collect system information before the reboot and had to work entirely from the system and database log and trace files.
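As a rough sketch of where such information comes from (the Grid Infrastructure paths assume the standard 11.2 layout under /u01/app/11.2.0/grid that appears later in the logs, and the database alert log path assumes the default diagnostic_dest with instance prod3; adjust for your environment), the post-mortem sources boil down to something like:

# When did the server go down and come back?
last -x reboot | head

# OS-level errors leading up to the reboot
grep -iE 'cifs|sysrq|o2net|ocfs2' /var/log/messages

# Clusterware alert log and CSSD trace on the affected node
less /u01/app/11.2.0/grid/log/prod3/alertprod3.log
less /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log

# Database alert log (path depends on diagnostic_dest; this one is assumed)
less /u01/app/oracle/diag/rdbms/prod/prod3/trace/alert_prod3.log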
From /var/log/messages on the affected server, I got the following errors:
Jan 21 22:18:51 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49711
Jan 21 22:19:40 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49715
Jan 21 22:20:29 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49719
Jan 21 22:21:18 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49723
Jan 21 22:22:07 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49727
Jan 21 22:22:56 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49731
Jan 21 22:23:45 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49735
Jan 21 22:24:34 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49739
Jan 21 22:25:23 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49743
Jan 21 22:26:12 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49747
Jan 21 22:27:01 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49751
Jan 21 22:27:50 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49755
Jan 21 22:28:39 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49759
Jan 21 22:29:28 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49763
Jan 21 22:29:58 prod3 kernel: SysRq : Resetting
Jan 21 22:32:12 prod3 syslogd 1.4.1: restart.
So the server was reset at 22:29:58. Before these messages, I also found other errors:
Jan 21 21:48:49 prod3 kernel: CIFS VFS: Write2 ret -28, wrote 0
Jan 21 21:49:20 prod3 last message repeated 21269 times
Jan 21 21:50:21 prod3 last message repeated 43405 times
Jan 21 21:51:22 prod3 last message repeated 42692 times
Jan 21 21:52:23 prod3 last message repeated 43849 times
Jan 21 21:53:24 prod3 last message repeated 44724 times
Jan 21 21:54:17 prod3 last message repeated 88025 times
Jan 21 21:55:02 prod3 kernel: CIFS VFS: No response to cmd 47 mid 49592
Jan 21 21:55:02 prod3 kernel: CIFS VFS: Write2 ret -11, wrote 0
Jan 21 21:55:02 prod3 kernel: CIFS VFS: No response to cmd 47 mid 49591
Jan 21 21:55:02 prod3 kernel: CIFS VFS: Write2 ret -11, wrote 0
Jan 21 21:55:02 prod3 kernel: CIFS VFS: No response to cmd 47 mid 49593
The server mounted a large shared folder from a Windows server for backup purposes.
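For context, this was a plain CIFS mount of a Windows share. The entry below only illustrates that kind of mount; the server name, share, mount point and credentials file are made up, not taken from the actual system:

# Hypothetical /etc/fstab entry for a CIFS backup share
//winsrv/backup  /backup  cifs  credentials=/etc/cifs-backup.cred,uid=oracle,gid=oinstall,_netdev  0 0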
From the other two nodes, I got the related information:
prod1:
Jan 21 22:30:30 prod1 kernel: o2net: connection to node prod3 (num 2) at 172.20.20.62:7777 has been idle for 30.0 seconds, shutting it down.
Jan 21 22:30:30 prod1 kernel: (swapper,0,16):o2net_idle_timer:1515 here are some times that might help debug the situation: (tmr 1516545000.229633 now 1516545030.229094 dr 1516545000.229617 adv 1516545000.229647:1516545000.229648 func (49156591:504) 1516544986.229775:1516544986.229787)
Jan 21 22:30:30 prod1 kernel: o2net: no longer connected to node prod3 (num 2) at 172.20.20.62:7777
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res M0000000000000001c8c00b00000000, error -112 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_flush_asts:609 ERROR: status = -112
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b101100000000, error -112 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -112
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res P000000000000000000000000000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M000000000000000023700c00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res M0000000000000001c8c00b00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b100d00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b101100000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M000000000000000023700c00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b100d00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,18):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res M0000000000000001c8c00b00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,18):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b101100000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,6):dlm_send_proxy_ast_msg:486 ERROR: 3ADDC1A2B0634122BC64ED298F64E0CD: res M000000000000000003622c00000000, error -107 send AST to node 2
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,6):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,18):dlm_send_proxy_ast_msg:486 ERROR: 3ADDC1A2B0634122BC64ED298F64E0CD: res M000000000000000003622c00000000, error -107 send AST to node 2
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,18):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,18):dlm_send_proxy_ast_msg:486 ERROR: 3ADDC1A2B0634122BC64ED298F64E0CD: res M000000000000000003622c00000000, error -107 send AST to node 2
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,18):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:31:00 prod1 kernel: (o2net,7350,23):o2net_connect_expired:1676 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7636,18):dlm_send_proxy_ast_msg:486 ERROR: 3ADDC1A2B0634122BC64ED298F64E0CD: res S000000000000000000000200000000, error -107 send AST to node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7636,18):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:31:03 prod1 kernel: (o2hb-755B051EE3,7684,12):ocfs2_dlm_eviction_cb:98 device (253,2): dlm has evicted node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7670,8):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res S000000000000000000000200000000, error -107 send AST to node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7670,8):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7693,18):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res S000000000000000000000200000000, error -107 send AST to node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7693,18):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:31:03 prod1 kernel: (o2hb-755B051EE3,7684,0):ocfs2_dlm_eviction_cb:98 device (253,1): dlm has evicted node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7717,1):dlm_send_proxy_ast_msg:486 ERROR: 65D58ADDE3984ED79283A1DF83AEABD2: res S000000000000000000000200000000, error -107 send AST to node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7717,1):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:31:03 prod1 kernel: (o2hb-755B051EE3,7684,12):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 2
Jan 21 22:31:03 prod1 kernel: (o2hb-755B051EE3,7684,12):ocfs2_dlm_eviction_cb:98 device (253,3): dlm has evicted node 2
Jan 21 22:31:03 prod1 kernel: (o2net,7350,17):ocfs2_dlm_eviction_cb:98 device (253,1): dlm has evicted node 2
Jan 21 22:31:03 prod1 kernel: (o2net,7350,17):ocfs2_dlm_eviction_cb:98 device (253,2): dlm has evicted node 2
Jan 21 22:31:06 prod1 kernel: (o2net,7350,17):ocfs2_dlm_eviction_cb:98 device (253,3): dlm has evicted node 2
Jan 21 22:31:07 prod1 kernel: (o2net,7350,17):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 2
Jan 21 22:32:39 prod1 kernel: o2net: accepted connection from node prod3 (num 2) at 172.20.20.62:7777
Jan 21 22:32:43 prod1 kernel: o2dlm: Node 2 joins domain 3ADDC1A2B0634122BC64ED298F64E0CD
Jan 21 22:32:43 prod1 kernel: o2dlm: Nodes in domain 3ADDC1A2B0634122BC64ED298F64E0CD: 0 1 2
Jan 21 22:32:48 prod1 kernel: o2dlm: Node 2 joins domain B9CD211F85F6429AAC04FB815C6EDC22
Jan 21 22:32:48 prod1 kernel: o2dlm: Nodes in domain B9CD211F85F6429AAC04FB815C6EDC22: 0 1 2
Jan 21 22:32:52 prod1 kernel: o2dlm: Node 2 joins domain 755B051EE3AD402784A78865F351F6F4
Jan 21 22:32:52 prod1 kernel: o2dlm: Nodes in domain 755B051EE3AD402784A78865F351F6F4: 0 1 2
Jan 21 22:32:56 prod1 kernel: o2dlm: Node 2 joins domain 65D58ADDE3984ED79283A1DF83AEABD2
Jan 21 22:32:56 prod1 kernel: o2dlm: Nodes in domain 65D58ADDE3984ED79283A1DF83AEABD2: 0 1 2

prod2:
Jan 21 22:30:29 prod2 kernel: o2net: connection to node prod3 (num 2) at 172.20.20.62:7777 has been idle for 30.0 seconds, shutting it down.
Jan 21 22:30:29 prod2 kernel: (swapper,0,16):o2net_idle_timer:1515 here are some times that might help debug the situation: (tmr 1516544999.275659 now 1516545029.274388 dr 1516544999.275645 adv 1516544999.275670:1516544999.275671 func (70a70579:505) 1516544815.278433:1516544815.278436)
Jan 21 22:30:29 prod2 kernel: o2net: no longer connected to node prod3 (num 2) at 172.20.20.62:7777
Jan 21 22:30:59 prod2 kernel: (o2net,7207,11):o2net_connect_expired:1676 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
Jan 21 22:31:03 prod2 kernel: (o2hb-755B051EE3,7503,0):ocfs2_dlm_eviction_cb:98 device (253,2): dlm has evicted node 2
Jan 21 22:31:03 prod2 kernel: (o2hb-755B051EE3,7503,0):ocfs2_dlm_eviction_cb:98 device (253,1): dlm has evicted node 2
Jan 21 22:31:03 prod2 kernel: (o2hb-755B051EE3,7503,0):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 2
Jan 21 22:31:03 prod2 kernel: (o2hb-755B051EE3,7503,0):ocfs2_dlm_eviction_cb:98 device (253,3): dlm has evicted node 2
Jan 21 22:31:03 prod2 kernel: (ocfs2rec,1910,0):ocfs2_replay_journal:1182 Recovering node 2 from slot 1 on device (253,2)
I found the possible reason described in the following two notes:
CIFS Mount Soft Lockup While Transferring Large Files (Doc ID 1621616.1)
Hung CIFS processes in D state and stuck in "cifs_reopen_file" Also the logs show "kernel: CIFS VFS: No writable handles for inode" before server getting rebooted (Doc ID 2318108.1)

So yes, sometimes a CIFS share can even cause a server reboot.
According to these notes, the following workarounds can be applied:
- echo 0 > /proc/fs/cifs/OplockEnabled
- enable the 'forcedirectio' option of the mount command
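As a sketch of how these might be applied (the share //winsrv/backup, the mount point /backup and the credentials file are hypothetical; the note says 'forcedirectio', while on Linux CIFS mounts the equivalent option is typically 'directio' on older kernels or 'cache=none' on newer ones, so check your own kernel and mount.cifs documentation first):

# Disable CIFS oplocks; on newer kernels this is controlled by the
# cifs module parameter enable_oplocks instead. Not persistent across reboots.
echo 0 > /proc/fs/cifs/OplockEnabled

# Remount the backup share without client-side data caching
# (share, mount point and credentials file below are made-up examples).
umount /backup
mount -t cifs //winsrv/backup /backup -o credentials=/etc/cifs-backup.cred,directio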
In the database alert log, I did not find much useful information:
Fatal NI connect error 12170.
  VERSION INFORMATION:
        TNS for Linux: Version 11.2.0.4.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
  Time: 21-JAN-2018 22:22:45
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
TNS-12535: TNS:operation timed out
Sun Jan 21 22:34:26 2018
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
From the clusterware trace files, I could only see that the system had been working abnormally since at least 22:24, and that the node was finally evicted.
[/u01/app/11.2.0/grid/bin/oraagent.bin(8815)]CRS-5818:Aborted command 'check' for resource 'ora.prod.db'. Details at (:CRSAGF00113:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2018-01-21 22:24:17.869: [/u01/app/11.2.0/grid/bin/oraagent.bin(8815)]CRS-5832:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' was unable to process commands. Details at (:CRSAGF00128:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2018-01-21 22:24:43.241: [/u01/app/11.2.0/grid/bin/oraagent.bin(8717)]CRS-5818:Aborted command 'check' for resource 'ora.ons'. Details at (:CRSAGF00113:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_grid/oraagent_grid.log.
2018-01-21 22:24:43.400: [/u01/app/11.2.0/grid/bin/oraagent.bin(8717)]CRS-5014:Agent "/u01/app/11.2.0/grid/bin/oraagent.bin" timed out starting process "/u01/app/11.2.0/grid/opmn/bin/onsctli" for action "check": details at "(:CLSN00009:)" in "/u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_grid/oraagent_grid.log"
2018-01-21 22:24:43.420: [/u01/app/11.2.0/grid/bin/oraagent.bin(8717)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_grid/oraagent_grid.log.
2018-01-21 22:24:43.515: [/u01/app/11.2.0/grid/bin/oraagent.bin(8717)]CRS-5014:Agent "/u01/app/11.2.0/grid/bin/oraagent.bin" timed out starting process "/u01/app/11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_grid/oraagent_grid.log"
2018-01-21 22:27:17.948: [crsd(8528)]CRS-5825:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' is unresponsive and will be restarted. Details at (:CRSAGF00131:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/crsd/crsd.log.
2018-01-21 22:28:59.355: [cssd(8447)]CRS-1662:Member kill requested by node prod1 for member number 2, group DBprod
2018-01-21 22:29:00.073: [cssd(8447)]CRS-1662:Member kill requested by node prod2 for member number 2, group DBprod
2018-01-21 22:29:30.225: [cssd(8447)]CRS-1608:This node was evicted by node 1, prod1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log.
2018-01-21 22:29:30.235: [cssd(8447)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log
2018-01-21 22:29:30.235: [cssd(8447)]CRS-1652:Starting clean up of CRSD resources.
2018-01-21 22:29:30.846: [cssd(8447)]CRS-1604:CSSD voting file is offline: /oraocr/storage/vdsk; details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log.
2018-01-21 22:29:34.079: [cssd(8447)]CRS-1605:CSSD voting file is online: /oraocr/storage/vdsk; details in /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log.
2018-01-21 22:29:45.238: [cssd(8447)]CRS-1653:The clean up of the CRSD resources failed.
2018-01-21 22:29:48.309: [ohasd(8031)]CRS-2765:Resource 'ora.ctssd' has failed on server 'prod3'.
2018-01-21 22:29:48.554: [ohasd(8031)]CRS-2765:Resource 'ora.evmd' has failed on server 'prod3'.
2018-01-21 22:29:50.093: [ctssd(25158)]CRS-2402:The Cluster Time Synchronization Service aborted on host prod3. Details at (:ctss_css_init1:) in /u01/app/11.2.0/grid/log/prod3/ctssd/octssd.log.
2018-01-21 22:29:54.187: [ohasd(8031)]CRS-2878:Failed to restart resource 'ora.ctssd'
2018-01-21 22:29:54.202: [ohasd(8031)]CRS-2769:Unable to failover resource 'ora.ctssd'.
2018-01-21 22:29:55.138: [ctssd(25172)]CRS-2402:The Cluster Time Synchronization Service aborted on host prod3. Details at (:ctss_css_init1:) in /u01/app/11.2.0/grid/log/prod3/ctssd/octssd.log.
2018-01-21 22:32:58.157: [ohasd(8030)]CRS-2112:The OLR service started on node prod3.
2018-01-21 22:32:58.501: [ohasd(8030)]CRS-1301:Oracle High Availability Service started on node prod3.
My guess is that the system stopped responding from around 22:24, although I could not confirm this.