CIFS VFS: No response to cmd


I never thought that mounting a shared filesystem could cause a server reboot!

One node of a three-node RAC database was rebooted for no apparent reason, and I was asked to find the cause.

I checked the servers and found it hard to confirm the root cause, since I could not collect any system information from before the reboot and had to work entirely from the system and database logs and trace files.

In /var/log/messages on the affected server, I found the errors below (note the "SysRq : Resetting" line near the end, discussed after the listing):

  Jan 21 22:18:51 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49711
  Jan 21 22:19:40 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49715
  Jan 21 22:20:29 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49719
  Jan 21 22:21:18 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49723
  Jan 21 22:22:07 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49727
  Jan 21 22:22:56 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49731
  Jan 21 22:23:45 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49735
  Jan 21 22:24:34 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49739
  Jan 21 22:25:23 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49743
  Jan 21 22:26:12 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49747
  Jan 21 22:27:01 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49751
  Jan 21 22:27:50 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49755
  Jan 21 22:28:39 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49759
  Jan 21 22:29:28 prod3 kernel: CIFS VFS: No response for cmd 162 mid 49763
  Jan 21 22:29:58 prod3 kernel: SysRq : Resetting
  Jan 21 22:32:12 prod3 syslogd 1.4.1: restart.
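The "SysRq : Resetting" line just before the syslogd restart shows that the node was reset through the kernel's SysRq facility rather than by a power event. In an 11.2 RAC stack such a reset is often initiated by the clusterware once the node stops responding (the member-kill and eviction messages in the CRS log later in this post point the same way), although I could not confirm the exact trigger here. As a minimal illustration only (do not run this on a production box), the same kernel message is produced when a reboot is forced via the SysRq trigger:

  # Forcing an immediate reset through SysRq logs "SysRq : Resetting"
  # (requires sysrq to be enabled; this resets the machine instantly)
  echo 1 > /proc/sys/kernel/sysrq
  echo b > /proc/sysrq-trigger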

So the server was reset at 22:29:58. Looking further back in the log, I also found these errors:

  Jan 21 21:48:49 prod3 kernel: CIFS VFS: Write2 ret -28, wrote 0
  Jan 21 21:49:20 prod3 last message repeated 21269 times
  Jan 21 21:50:21 prod3 last message repeated 43405 times
  Jan 21 21:51:22 prod3 last message repeated 42692 times
  Jan 21 21:52:23 prod3 last message repeated 43849 times
  Jan 21 21:53:24 prod3 last message repeated 44724 times
  Jan 21 21:54:17 prod3 last message repeated 88025 times
  Jan 21 21:55:02 prod3 kernel: CIFS VFS: No response to cmd 47 mid 49592
  Jan 21 21:55:02 prod3 kernel: CIFS VFS: Write2 ret -11, wrote 0
  Jan 21 21:55:02 prod3 kernel: CIFS VFS: No response to cmd 47 mid 49591
  Jan 21 21:55:02 prod3 kernel: CIFS VFS: Write2 ret -11, wrote 0
  Jan 21 21:55:02 prod3 kernel: CIFS VFS: No response to cmd 47 mid 49593

The server mounted a large shared folder from a Windows server for backup purposes. The Write2 return codes above decode to ENOSPC (-28, no space left on device) and EAGAIN (-11), which suggests the backup writes to that share had been failing or stalling well before the reset.
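For context, the backup share was mounted over CIFS roughly like the sketch below. The server, share, mount point, and credentials file names are hypothetical, since the real mount options could not be captured before the reboot:

  # Hypothetical example of the CIFS mount used for the backup share
  # (//winbackup/orabackup, /backup and the credentials file are placeholders)
  mount -t cifs //winbackup/orabackup /backup \
        -o credentials=/etc/cifs-backup.cred,uid=oracle,gid=oinstall,rw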

From the other two nodes, I got the related information below; the DLM send errors -112 and -107 are EHOSTDOWN and ENOTCONN, meaning prod3 had stopped responding on the cluster interconnect (more on the 30-second o2net idle timeout after the listing):

  prod1:
  Jan 21 22:30:30 prod1 kernel: o2net: connection to node prod3 (num 2) at 172.20.20.62:7777 has been idle for 30.0 seconds, shutting it down.
  Jan 21 22:30:30 prod1 kernel: (swapper,0,16):o2net_idle_timer:1515 here are some times that might help debug the situation: (tmr 1516545000.229633 now 1516545030.229094 dr 1516545000.229617 adv 1516545000.229647:1516545000.229648 func (49156591:504) 1516544986.229775:1516544986.229787)
  Jan 21 22:30:30 prod1 kernel: o2net: no longer connected to node prod3 (num 2) at 172.20.20.62:7777
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res M0000000000000001c8c00b00000000, error -112 send AST to node 2
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_flush_asts:609 ERROR: status = -112
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b101100000000, error -112 send AST to node 2
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -112
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res P000000000000000000000000000000, error -107 send AST to node 2
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_flush_asts:609 ERROR: status = -107
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M000000000000000023700c00000000, error -107 send AST to node 2
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res M0000000000000001c8c00b00000000, error -107 send AST to node 2
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_flush_asts:609 ERROR: status = -107
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b100d00000000, error -107 send AST to node 2
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b101100000000, error -107 send AST to node 2
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M000000000000000023700c00000000, error -107 send AST to node 2
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b100d00000000, error -107 send AST to node 2
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,18):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res M0000000000000001c8c00b00000000, error -107 send AST to node 2
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,18):dlm_flush_asts:609 ERROR: status = -107
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b101100000000, error -107 send AST to node 2
  Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
  Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,6):dlm_send_proxy_ast_msg:486 ERROR: 3ADDC1A2B0634122BC64ED298F64E0CD: res M000000000000000003622c00000000, error -107 send AST to node 2
  Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,6):dlm_flush_asts:609 ERROR: status = -107
  Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,18):dlm_send_proxy_ast_msg:486 ERROR: 3ADDC1A2B0634122BC64ED298F64E0CD: res M000000000000000003622c00000000, error -107 send AST to node 2
  Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,18):dlm_flush_asts:609 ERROR: status = -107
  Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,18):dlm_send_proxy_ast_msg:486 ERROR: 3ADDC1A2B0634122BC64ED298F64E0CD: res M000000000000000003622c00000000, error -107 send AST to node 2
  Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,18):dlm_flush_asts:609 ERROR: status = -107
  Jan 21 22:31:00 prod1 kernel: (o2net,7350,23):o2net_connect_expired:1676 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
  Jan 21 22:31:03 prod1 kernel: (dlm_thread,7636,18):dlm_send_proxy_ast_msg:486 ERROR: 3ADDC1A2B0634122BC64ED298F64E0CD: res S000000000000000000000200000000, error -107 send AST to node 2
  Jan 21 22:31:03 prod1 kernel: (dlm_thread,7636,18):dlm_flush_asts:609 ERROR: status = -107
  Jan 21 22:31:03 prod1 kernel: (o2hb-755B051EE3,7684,12):ocfs2_dlm_eviction_cb:98 device (253,2): dlm has evicted node 2
  Jan 21 22:31:03 prod1 kernel: (dlm_thread,7670,8):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res S000000000000000000000200000000, error -107 send AST to node 2
  Jan 21 22:31:03 prod1 kernel: (dlm_thread,7670,8):dlm_flush_asts:609 ERROR: status = -107
  Jan 21 22:31:03 prod1 kernel: (dlm_thread,7693,18):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res S000000000000000000000200000000, error -107 send AST to node 2
  Jan 21 22:31:03 prod1 kernel: (dlm_thread,7693,18):dlm_flush_asts:609 ERROR: status = -107
  Jan 21 22:31:03 prod1 kernel: (o2hb-755B051EE3,7684,0):ocfs2_dlm_eviction_cb:98 device (253,1): dlm has evicted node 2
  Jan 21 22:31:03 prod1 kernel: (dlm_thread,7717,1):dlm_send_proxy_ast_msg:486 ERROR: 65D58ADDE3984ED79283A1DF83AEABD2: res S000000000000000000000200000000, error -107 send AST to node 2
  Jan 21 22:31:03 prod1 kernel: (dlm_thread,7717,1):dlm_flush_asts:609 ERROR: status = -107
  Jan 21 22:31:03 prod1 kernel: (o2hb-755B051EE3,7684,12):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 2
  Jan 21 22:31:03 prod1 kernel: (o2hb-755B051EE3,7684,12):ocfs2_dlm_eviction_cb:98 device (253,3): dlm has evicted node 2
  Jan 21 22:31:03 prod1 kernel: (o2net,7350,17):ocfs2_dlm_eviction_cb:98 device (253,1): dlm has evicted node 2
  Jan 21 22:31:03 prod1 kernel: (o2net,7350,17):ocfs2_dlm_eviction_cb:98 device (253,2): dlm has evicted node 2
  Jan 21 22:31:06 prod1 kernel: (o2net,7350,17):ocfs2_dlm_eviction_cb:98 device (253,3): dlm has evicted node 2
  Jan 21 22:31:07 prod1 kernel: (o2net,7350,17):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 2
  Jan 21 22:32:39 prod1 kernel: o2net: accepted connection from node prod3 (num 2) at 172.20.20.62:7777
  Jan 21 22:32:43 prod1 kernel: o2dlm: Node 2 joins domain 3ADDC1A2B0634122BC64ED298F64E0CD
  Jan 21 22:32:43 prod1 kernel: o2dlm: Nodes in domain 3ADDC1A2B0634122BC64ED298F64E0CD: 0 1 2
  Jan 21 22:32:48 prod1 kernel: o2dlm: Node 2 joins domain B9CD211F85F6429AAC04FB815C6EDC22
  Jan 21 22:32:48 prod1 kernel: o2dlm: Nodes in domain B9CD211F85F6429AAC04FB815C6EDC22: 0 1 2
  Jan 21 22:32:52 prod1 kernel: o2dlm: Node 2 joins domain 755B051EE3AD402784A78865F351F6F4
  Jan 21 22:32:52 prod1 kernel: o2dlm: Nodes in domain 755B051EE3AD402784A78865F351F6F4: 0 1 2
  Jan 21 22:32:56 prod1 kernel: o2dlm: Node 2 joins domain 65D58ADDE3984ED79283A1DF83AEABD2
  Jan 21 22:32:56 prod1 kernel: o2dlm: Nodes in domain 65D58ADDE3984ED79283A1DF83AEABD2: 0 1 2

  prod2:
  Jan 21 22:30:29 prod2 kernel: o2net: connection to node prod3 (num 2) at 172.20.20.62:7777 has been idle for 30.0 seconds, shutting it down.
  Jan 21 22:30:29 prod2 kernel: (swapper,0,16):o2net_idle_timer:1515 here are some times that might help debug the situation: (tmr 1516544999.275659 now 1516545029.274388 dr 1516544999.275645 adv 1516544999.275670:1516544999.275671 func (70a70579:505) 1516544815.278433:1516544815.278436)
  Jan 21 22:30:29 prod2 kernel: o2net: no longer connected to node prod3 (num 2) at 172.20.20.62:7777
  Jan 21 22:30:59 prod2 kernel: (o2net,7207,11):o2net_connect_expired:1676 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
  Jan 21 22:31:03 prod2 kernel: (o2hb-755B051EE3,7503,0):ocfs2_dlm_eviction_cb:98 device (253,2): dlm has evicted node 2
  Jan 21 22:31:03 prod2 kernel: (o2hb-755B051EE3,7503,0):ocfs2_dlm_eviction_cb:98 device (253,1): dlm has evicted node 2
  Jan 21 22:31:03 prod2 kernel: (o2hb-755B051EE3,7503,0):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 2
  Jan 21 22:31:03 prod2 kernel: (o2hb-755B051EE3,7503,0):ocfs2_dlm_eviction_cb:98 device (253,3): dlm has evicted node 2
  Jan 21 22:31:03 prod2 kernel: (ocfs2rec,1910,0):ocfs2_replay_journal:1182 Recovering node 2 from slot 1 on device (253,2)
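The other two nodes dropped their o2net connections to prod3 after 30 seconds of idle time and then evicted it from the OCFS2 DLM domains, which fits the picture of prod3 being hung rather than shut down cleanly. Those thresholds come from the o2cb cluster stack configuration; as a quick sanity check (a sketch, assuming a standard o2cb installation), they can be inspected with:

  # Show the o2cb cluster stack status and its timeout settings
  # (the 30-second value above corresponds to O2CB_IDLE_TIMEOUT_MS, default 30000)
  service o2cb status
  grep -E 'O2CB_(HEARTBEAT_THRESHOLD|IDLE_TIMEOUT_MS|KEEPALIVE_DELAY_MS|RECONNECT_DELAY_MS)' /etc/sysconfig/o2cb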

I found the possible cause in the following two MOS notes:

CIFS Mount Soft Lockup While Transferring Large Files (Doc ID 1621616.1)

Hung CIFS processes in D state and stuck in "cifs_reopen_file" Also the logs show "kernel: CIFS VFS: No writable handles for inode" before server getting rebooted (Doc ID 2318108.1)

Yes, a CIFS share can sometimes even cause a server reboot.
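The second note describes CIFS processes stuck in the uninterruptible D state in cifs_reopen_file before the reboot. If the problem recurs, a quick check like the one below (a sketch; the exact wchan names vary by kernel) shows whether backup processes are blocked on the CIFS mount before the node gets fenced:

  # List processes in uninterruptible sleep (D state) and where they are blocked
  ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
  # Dump all blocked tasks with kernel stacks to the kernel log (SysRq 'w')
  echo w > /proc/sysrq-trigger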

Based on these notes, the following mitigations could be taken (a sketch of how to apply them follows the list):

  1. Disable oplocks on the CIFS client: echo 0 > /proc/fs/cifs/OplockEnabled
  2. Enable the 'forcedirectio' option on the mount command
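A minimal sketch of applying both changes, reusing the same hypothetical share and mount point as above. Note that /proc/fs/cifs/OplockEnabled only exists on older CIFS clients and the setting does not persist across reboots:

  # 1. Disable oplocks on the CIFS client (non-persistent; re-apply after reboot)
  echo 0 > /proc/fs/cifs/OplockEnabled
  # 2. Remount the backup share with direct I/O
  #    (the notes call the option 'forcedirectio'; older Linux CIFS clients also
  #    accept 'directio', and newer ones use 'cache=none')
  umount /backup
  mount -t cifs //winbackup/orabackup /backup \
        -o credentials=/etc/cifs-backup.cred,uid=oracle,gid=oinstall,rw,forcedirectio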

In the database alert log, I did not find any useful information:

  Fatal NI connect error 12170.

  VERSION INFORMATION:
  TNS for Linux: Version 11.2.0.4.0 - Production
  Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
  TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
  Time: 21-JAN-2018 22:22:45
  Tracing not turned on.
  Tns error struct:
  ns main err code: 12535

  TNS-12535: TNS:operation timed out
  Sun Jan 21 22:34:26 2018
  Starting ORACLE instance (normal)
  LICENSE_MAX_SESSION = 0
  LICENSE_SESSIONS_WARNING = 0

From the cluster trace files, I could only tell that the system had been behaving abnormally since at least 22:24, and that the node was finally evicted.

  [/u01/app/11.2.0/grid/bin/oraagent.bin(8815)]CRS-5818:Aborted command 'check' for resource 'ora.prod.db'. Details at (:CRSAGF00113:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_oracle/oraagent_oracle.log.
  2018-01-21 22:24:17.869:
  [/u01/app/11.2.0/grid/bin/oraagent.bin(8815)]CRS-5832:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' was unable to process commands. Details at (:CRSAGF00128:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_oracle/oraagent_oracle.log.
  2018-01-21 22:24:43.241:
  [/u01/app/11.2.0/grid/bin/oraagent.bin(8717)]CRS-5818:Aborted command 'check' for resource 'ora.ons'. Details at (:CRSAGF00113:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_grid/oraagent_grid.log.
  2018-01-21 22:24:43.400:
  [/u01/app/11.2.0/grid/bin/oraagent.bin(8717)]CRS-5014:Agent "/u01/app/11.2.0/grid/bin/oraagent.bin" timed out starting process "/u01/app/11.2.0/grid/opmn/bin/onsctli" for action "check": details at "(:CLSN00009:)" in "/u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_grid/oraagent_grid.log"
  2018-01-21 22:24:43.420:
  [/u01/app/11.2.0/grid/bin/oraagent.bin(8717)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_grid/oraagent_grid.log.
  2018-01-21 22:24:43.515:
  [/u01/app/11.2.0/grid/bin/oraagent.bin(8717)]CRS-5014:Agent "/u01/app/11.2.0/grid/bin/oraagent.bin" timed out starting process "/u01/app/11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_grid/oraagent_grid.log"
  2018-01-21 22:27:17.948:
  [crsd(8528)]CRS-5825:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' is unresponsive and will be restarted. Details at (:CRSAGF00131:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/crsd/crsd.log.
  2018-01-21 22:28:59.355:
  [cssd(8447)]CRS-1662:Member kill requested by node prod1 for member number 2, group DBprod
  2018-01-21 22:29:00.073:
  [cssd(8447)]CRS-1662:Member kill requested by node prod2 for member number 2, group DBprod
  2018-01-21 22:29:30.225:
  [cssd(8447)]CRS-1608:This node was evicted by node 1, prod1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log.
  2018-01-21 22:29:30.235:
  [cssd(8447)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log
  2018-01-21 22:29:30.235:
  [cssd(8447)]CRS-1652:Starting clean up of CRSD resources.
  2018-01-21 22:29:30.846:
  [cssd(8447)]CRS-1604:CSSD voting file is offline: /oraocr/storage/vdsk; details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log.
  2018-01-21 22:29:34.079:
  [cssd(8447)]CRS-1605:CSSD voting file is online: /oraocr/storage/vdsk; details in /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log.
  2018-01-21 22:29:45.238:
  [cssd(8447)]CRS-1653:The clean up of the CRSD resources failed.
  2018-01-21 22:29:48.309:
  [ohasd(8031)]CRS-2765:Resource 'ora.ctssd' has failed on server 'prod3'.
  2018-01-21 22:29:48.554:
  [ohasd(8031)]CRS-2765:Resource 'ora.evmd' has failed on server 'prod3'.
  2018-01-21 22:29:50.093:
  [ctssd(25158)]CRS-2402:The Cluster Time Synchronization Service aborted on host prod3. Details at (:ctss_css_init1:) in /u01/app/11.2.0/grid/log/prod3/ctssd/octssd.log.
  2018-01-21 22:29:54.187:
  [ohasd(8031)]CRS-2878:Failed to restart resource 'ora.ctssd'
  2018-01-21 22:29:54.202:
  [ohasd(8031)]CRS-2769:Unable to failover resource 'ora.ctssd'.
  2018-01-21 22:29:55.138:
  [ctssd(25172)]CRS-2402:The Cluster Time Synchronization Service aborted on host prod3. Details at (:ctss_css_init1:) in /u01/app/11.2.0/grid/log/prod3/ctssd/octssd.log.
  2018-01-21 22:32:58.157:
  [ohasd(8030)]CRS-2112:The OLR service started on node prod3.
  2018-01-21 22:32:58.501:
  [ohasd(8030)]CRS-1301:Oracle High Availability Service started on node prod3.

I guess the system lost responsiveness from around 22:24, although I could not confirm this.

 
  • Posted by NeilZhang on 28/01/2018 11:29:39
  • Repost please keep this link: https://www.dbcloudsvc.com/blogs/linux/cifs-vfs-no-response-to-cmd/