I never thought that mounting a shared filesystem could cause a server reboot!
One node of a three-node RAC database rebooted for no apparent reason, and I was contacted to find the possible cause.
I checked the servers and found it hard to confirm the root cause, since I had no chance to collect system information before the reboot and had to work entirely from the system and database log and trace files.
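As a rough sketch of where such information comes from (the Grid Infrastructure paths assume the standard 11.2 layout under /u01/app/11.2.0/grid that appears later in the logs, and the database alert log path assumes the default diagnostic_dest with instance prod3; adjust for your environment), the post-mortem sources boil down to something like:

# When did the server go down and come back?
last -x reboot | head

# OS-level errors leading up to the reboot
grep -iE 'cifs|sysrq|o2net|ocfs2' /var/log/messages

# Clusterware alert log and CSSD trace on the affected node
less /u01/app/11.2.0/grid/log/prod3/alertprod3.log
less /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log

# Database alert log (path depends on diagnostic_dest; this one is assumed)
less /u01/app/oracle/diag/rdbms/prod/prod3/trace/alert_prod3.log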
From /var/log/messages on the affected server, I got the following errors:
Jan 21 22:18:51 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49711
Jan 21 22:19:40 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49715
Jan 21 22:20:29 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49719
Jan 21 22:21:18 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49723
Jan 21 22:22:07 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49727
Jan 21 22:22:56 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49731
Jan 21 22:23:45 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49735
Jan 21 22:24:34 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49739
Jan 21 22:25:23 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49743
Jan 21 22:26:12 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49747
Jan 21 22:27:01 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49751
Jan 21 22:27:50 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49755
Jan 21 22:28:39 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49759
Jan 21 22:29:28 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49763
Jan 21 22:29:58 prod3 kernel: SysRq : Resetting
Jan 21 22:32:12 prod3 syslogd 1.4.1: restart.
So the server was reset at 22:29:58. Before these messages, I also found other errors:
Jan 21 21:48:49 prod3 kernel: CIFS VFS: Write2 ret -28, wrote 0
Jan 21 21:49:20 prod3 last message repeated 21269 times
Jan 21 21:50:21 prod3 last message repeated 43405 times
Jan 21 21:51:22 prod3 last message repeated 42692 times
Jan 21 21:52:23 prod3 last message repeated 43849 times
Jan 21 21:53:24 prod3 last message repeated 44724 times
Jan 21 21:54:17 prod3 last message repeated 88025 times
Jan 21 21:55:02 prod3 kernel: CIFS VFS: No response to cmd 47 mid 49592
Jan 21 21:55:02 prod3 kernel: CIFS VFS: Write2 ret -11, wrote 0
Jan 21 21:55:02 prod3 kernel: CIFS VFS: No response to cmd 47 mid 49591
Jan 21 21:55:02 prod3 kernel: CIFS VFS: Write2 ret -11, wrote 0
Jan 21 21:55:02 prod3 kernel: CIFS VFS: No response to cmd 47 mid 49593
The server mounted a large shared folder from a Windows server for backup purposes.
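For context, this was a plain CIFS mount of a Windows share. The entry below only illustrates that kind of mount; the server name, share, mount point and credentials file are made up, not taken from the actual system:

# Hypothetical /etc/fstab entry for a CIFS backup share
//winsrv/backup  /backup  cifs  credentials=/etc/cifs-backup.cred,uid=oracle,gid=oinstall,_netdev  0 0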
From the other two nodes, I got the related information:
prod1:
Jan 21 22:30:30 prod1 kernel: o2net: connection to node prod3 (num 2) at 172.20.20.62:7777 has been idle for 30.0 seconds, shutting it down.
Jan 21 22:30:30 prod1 kernel: (swapper,0,16):o2net_idle_timer:1515 here are some times that might help debug the situation: (tmr 1516545000.229633 now 1516545030.229094 dr 1516545000.229617 adv 1516545000.229647:1516545000.229648 func (49156591:504) 1516544986.229775:1516544986.229787)
Jan 21 22:30:30 prod1 kernel: o2net: no longer connected to node prod3 (num 2) at 172.20.20.62:7777
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res M0000000000000001c8c00b00000000, error -112 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_flush_asts:609 ERROR: status = -112
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b101100000000, error -112 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -112
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res P000000000000000000000000000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M000000000000000023700c00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res M0000000000000001c8c00b00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b100d00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b101100000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M000000000000000023700c00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b100d00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,18):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res M0000000000000001c8c00b00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,18):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b101100000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,6):dlm_send_proxy_ast_msg:486 ERROR: 3ADDC1A2B0634122BC64ED298F64E0CD: res M000000000000000003622c00000000, error -107 send AST to node 2
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,6):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,18):dlm_send_proxy_ast_msg:486 ERROR: 3ADDC1A2B0634122BC64ED298F64E0CD: res M000000000000000003622c00000000, error -107 send AST to node 2
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,18):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,18):dlm_send_proxy_ast_msg:486 ERROR: 3ADDC1A2B0634122BC64ED298F64E0CD: res M000000000000000003622c00000000, error -107 send AST to node 2
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,18):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:31:00 prod1 kernel: (o2net,7350,23):o2net_connect_expired:1676 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7636,18):dlm_send_proxy_ast_msg:486 ERROR: 3ADDC1A2B0634122BC64ED298F64E0CD: res S000000000000000000000200000000, error -107 send AST to node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7636,18):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:31:03 prod1 kernel: (o2hb-755B051EE3,7684,12):ocfs2_dlm_eviction_cb:98 device (253,2): dlm has evicted node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7670,8):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res S000000000000000000000200000000, error -107 send AST to node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7670,8):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7693,18):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res S000000000000000000000200000000, error -107 send AST to node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7693,18):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:31:03 prod1 kernel: (o2hb-755B051EE3,7684,0):ocfs2_dlm_eviction_cb:98 device (253,1): dlm has evicted node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7717,1):dlm_send_proxy_ast_msg:486 ERROR: 65D58ADDE3984ED79283A1DF83AEABD2: res S000000000000000000000200000000, error -107 send AST to node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7717,1):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:31:03 prod1 kernel: (o2hb-755B051EE3,7684,12):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 2
Jan 21 22:31:03 prod1 kernel: (o2hb-755B051EE3,7684,12):ocfs2_dlm_eviction_cb:98 device (253,3): dlm has evicted node 2
Jan 21 22:31:03 prod1 kernel: (o2net,7350,17):ocfs2_dlm_eviction_cb:98 device (253,1): dlm has evicted node 2
Jan 21 22:31:03 prod1 kernel: (o2net,7350,17):ocfs2_dlm_eviction_cb:98 device (253,2): dlm has evicted node 2
Jan 21 22:31:06 prod1 kernel: (o2net,7350,17):ocfs2_dlm_eviction_cb:98 device (253,3): dlm has evicted node 2
Jan 21 22:31:07 prod1 kernel: (o2net,7350,17):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 2
Jan 21 22:32:39 prod1 kernel: o2net: accepted connection from node prod3 (num 2) at 172.20.20.62:7777
Jan 21 22:32:43 prod1 kernel: o2dlm: Node 2 joins domain 3ADDC1A2B0634122BC64ED298F64E0CD
Jan 21 22:32:43 prod1 kernel: o2dlm: Nodes in domain 3ADDC1A2B0634122BC64ED298F64E0CD: 0 1 2
Jan 21 22:32:48 prod1 kernel: o2dlm: Node 2 joins domain B9CD211F85F6429AAC04FB815C6EDC22
Jan 21 22:32:48 prod1 kernel: o2dlm: Nodes in domain B9CD211F85F6429AAC04FB815C6EDC22: 0 1 2
Jan 21 22:32:52 prod1 kernel: o2dlm: Node 2 joins domain 755B051EE3AD402784A78865F351F6F4
Jan 21 22:32:52 prod1 kernel: o2dlm: Nodes in domain 755B051EE3AD402784A78865F351F6F4: 0 1 2
Jan 21 22:32:56 prod1 kernel: o2dlm: Node 2 joins domain 65D58ADDE3984ED79283A1DF83AEABD2
Jan 21 22:32:56 prod1 kernel: o2dlm: Nodes in domain 65D58ADDE3984ED79283A1DF83AEABD2: 0 1 2

prod2:
Jan 21 22:30:29 prod2 kernel: o2net: connection to node prod3 (num 2) at 172.20.20.62:7777 has been idle for 30.0 seconds, shutting it down.
Jan 21 22:30:29 prod2 kernel: (swapper,0,16):o2net_idle_timer:1515 here are some times that might help debug the situation: (tmr 1516544999.275659 now 1516545029.274388 dr 1516544999.275645 adv 1516544999.275670:1516544999.275671 func (70a70579:505) 1516544815.278433:1516544815.278436)
Jan 21 22:30:29 prod2 kernel: o2net: no longer connected to node prod3 (num 2) at 172.20.20.62:7777
Jan 21 22:30:59 prod2 kernel: (o2net,7207,11):o2net_connect_expired:1676 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
Jan 21 22:31:03 prod2 kernel: (o2hb-755B051EE3,7503,0):ocfs2_dlm_eviction_cb:98 device (253,2): dlm has evicted node 2
Jan 21 22:31:03 prod2 kernel: (o2hb-755B051EE3,7503,0):ocfs2_dlm_eviction_cb:98 device (253,1): dlm has evicted node 2
Jan 21 22:31:03 prod2 kernel: (o2hb-755B051EE3,7503,0):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 2
Jan 21 22:31:03 prod2 kernel: (o2hb-755B051EE3,7503,0):ocfs2_dlm_eviction_cb:98 device (253,3): dlm has evicted node 2
Jan 21 22:31:03 prod2 kernel: (ocfs2rec,1910,0):ocfs2_replay_journal:1182 Recovering node 2 from slot 1 on device (253,2)
I found the possible reason described in the following two notes:
CIFS Mount Soft Lockup While Transferring Large Files (Doc ID 1621616.1)
Hung CIFS processes in D state and stuck in "cifs_reopen_file" Also the logs show "kernel: CIFS VFS: No writable handles for inode" before server getting rebooted (Doc ID 2318108.1)

So yes, sometimes a CIFS share can even cause a server reboot.
According to these notes, the following workarounds can be applied:
- echo 0 > /proc/fs/cifs/OplockEnabled
- enable the 'forcedirectio' option of the mount command
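As a sketch of how these might be applied (the share //winsrv/backup, the mount point /backup and the credentials file are hypothetical; the note says 'forcedirectio', while on Linux CIFS mounts the equivalent option is typically 'directio' on older kernels or 'cache=none' on newer ones, so check your own kernel and mount.cifs documentation first):

# Disable CIFS oplocks; on newer kernels this is controlled by the
# cifs module parameter enable_oplocks instead. Not persistent across reboots.
echo 0 > /proc/fs/cifs/OplockEnabled

# Remount the backup share without client-side data caching
# (share, mount point and credentials file below are made-up examples).
umount /backup
mount -t cifs //winsrv/backup /backup -o credentials=/etc/cifs-backup.cred,directio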
In the database alert log, I did not find much useful information:
Fatal NI connect error 12170.
  VERSION INFORMATION:
        TNS for Linux: Version 11.2.0.4.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
  Time: 21-JAN-2018 22:22:45
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
TNS-12535: TNS:operation timed out
Sun Jan 21 22:34:26 2018
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
From the clusterware trace files, I could only see that the system had been working abnormally since at least 22:24, and that the node was finally evicted.
[/u01/app/11.2.0/grid/bin/oraagent.bin(8815)]CRS-5818:Aborted command 'check' for resource 'ora.prod.db'. Details at (:CRSAGF00113:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2018-01-21 22:24:17.869: [/u01/app/11.2.0/grid/bin/oraagent.bin(8815)]CRS-5832:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' was unable to process commands. Details at (:CRSAGF00128:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2018-01-21 22:24:43.241: [/u01/app/11.2.0/grid/bin/oraagent.bin(8717)]CRS-5818:Aborted command 'check' for resource 'ora.ons'. Details at (:CRSAGF00113:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_grid/oraagent_grid.log.
2018-01-21 22:24:43.400: [/u01/app/11.2.0/grid/bin/oraagent.bin(8717)]CRS-5014:Agent "/u01/app/11.2.0/grid/bin/oraagent.bin" timed out starting process "/u01/app/11.2.0/grid/opmn/bin/onsctli" for action "check": details at "(:CLSN00009:)" in "/u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_grid/oraagent_grid.log"
2018-01-21 22:24:43.420: [/u01/app/11.2.0/grid/bin/oraagent.bin(8717)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_grid/oraagent_grid.log.
2018-01-21 22:24:43.515: [/u01/app/11.2.0/grid/bin/oraagent.bin(8717)]CRS-5014:Agent "/u01/app/11.2.0/grid/bin/oraagent.bin" timed out starting process "/u01/app/11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_grid/oraagent_grid.log"
2018-01-21 22:27:17.948: [crsd(8528)]CRS-5825:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' is unresponsive and will be restarted. Details at (:CRSAGF00131:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/crsd/crsd.log.
2018-01-21 22:28:59.355: [cssd(8447)]CRS-1662:Member kill requested by node prod1 for member number 2, group DBprod
2018-01-21 22:29:00.073: [cssd(8447)]CRS-1662:Member kill requested by node prod2 for member number 2, group DBprod
2018-01-21 22:29:30.225: [cssd(8447)]CRS-1608:This node was evicted by node 1, prod1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log.
2018-01-21 22:29:30.235: [cssd(8447)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log
2018-01-21 22:29:30.235: [cssd(8447)]CRS-1652:Starting clean up of CRSD resources.
2018-01-21 22:29:30.846: [cssd(8447)]CRS-1604:CSSD voting file is offline: /oraocr/storage/vdsk; details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log.
2018-01-21 22:29:34.079: [cssd(8447)]CRS-1605:CSSD voting file is online: /oraocr/storage/vdsk; details in /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log.
2018-01-21 22:29:45.238: [cssd(8447)]CRS-1653:The clean up of the CRSD resources failed.
2018-01-21 22:29:48.309: [ohasd(8031)]CRS-2765:Resource 'ora.ctssd' has failed on server 'prod3'.
2018-01-21 22:29:48.554: [ohasd(8031)]CRS-2765:Resource 'ora.evmd' has failed on server 'prod3'.
2018-01-21 22:29:50.093: [ctssd(25158)]CRS-2402:The Cluster Time Synchronization Service aborted on host prod3. Details at (:ctss_css_init1:) in /u01/app/11.2.0/grid/log/prod3/ctssd/octssd.log.
2018-01-21 22:29:54.187: [ohasd(8031)]CRS-2878:Failed to restart resource 'ora.ctssd'
2018-01-21 22:29:54.202: [ohasd(8031)]CRS-2769:Unable to failover resource 'ora.ctssd'.
2018-01-21 22:29:55.138: [ctssd(25172)]CRS-2402:The Cluster Time Synchronization Service aborted on host prod3. Details at (:ctss_css_init1:) in /u01/app/11.2.0/grid/log/prod3/ctssd/octssd.log.
2018-01-21 22:32:58.157: [ohasd(8030)]CRS-2112:The OLR service started on node prod3.
2018-01-21 22:32:58.501: [ohasd(8030)]CRS-1301:Oracle High Availability Service started on node prod3.
My guess is that the system stopped responding from around 22:24, although I could not confirm this.