This post is an extension of my previous one, "Oracle shutdown immediate encounters ORA-24324 ORA-24323 ORA-01089" (the restart there was only attempted because the database had hung). Both actually come from a case at one of our overseas factories; I split the content into two posts so that each case stays focused on a single topic. Now, on to the subject:
The database was Oracle Database 10g Release 10.2.0.4.0 on Red Hat Enterprise Linux Server release 5.7, running in a virtual machine. At the time, the alert log was full of "found dead shared server" messages, and the database had also become unreachable:
found dead shared server 'S016', pid = (35, 23)
found dead shared server 'S023', pid = (42, 1)
Fri Aug 5 10:28:48 2016
found dead shared server 'S013', pid = (32, 110)
found dead shared server 'S021', pid = (40, 1)
Fri Aug 5 10:33:53 2016
found dead shared server 'S012', pid = (31, 132)
found dead shared server 'S023', pid = (38, 3)
Fri Aug 5 10:38:55 2016
found dead shared server 'S013', pid = (32, 111)
found dead shared server 'S022', pid = (42, 3)
Fri Aug 5 10:40:53 2016
found dead shared server 'S020', pid = (39, 4)
found dead shared server 'S021', pid = (40, 3)
failed to start shared server, oer=0
Further investigation showed that the system had exhausted its memory and oom_kill had been triggered. The OOM killer (out-of-memory killer) is the Linux mechanism for handling memory exhaustion: when free memory runs out, it walks the process list, evaluates each process's memory usage together with its oom score, picks the highest-scoring process, and sends it a kill signal.
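As a side check, the score the OOM killer ranks processes by can be read straight from /proc. A minimal sketch (PID 21687 is taken from the messages below; on this 2.6-era kernel the oom_score file holds the "badness" value, and writing -17 to oom_adj exempts a process from the OOM killer):

# free -m
# cat /proc/21687/oom_score
# cat /proc/21687/oom_adj

Back to the actual evidence in /var/log/messages: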
# grep -i kill /var/log/messages | more
Aug 5 10:12:10 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:12:10 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:12:11 xxxxx kernel: Out of memory: kill process 21687 (oracle) score 2296119 or a child
Aug 5 10:12:11 xxxxx kernel: Killed process 21687 (oracle)
Aug 5 10:12:11 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:12:11 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:12:11 xxxxx kernel: Out of memory: kill process 21668 (oracle) score 2144517 or a child
Aug 5 10:12:11 xxxxx kernel: Killed process 21668 (oracle)
Aug 5 10:23:09 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:23:09 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:23:09 xxxxx kernel: Out of memory: kill process 21756 (oracle) score 2144517 or a child
Aug 5 10:23:09 xxxxx kernel: Killed process 21756 (oracle)
Aug 5 10:23:09 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:23:09 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:23:09 xxxxx kernel: Out of memory: kill process 21732 (oracle) score 2138384 or a child
Aug 5 10:23:09 xxxxx kernel: Killed process 21732 (oracle)
Aug 5 10:28:08 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:28:08 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:28:09 xxxxx kernel: Out of memory: kill process 21752 (oracle) score 2144521 or a child
Aug 5 10:28:09 xxxxx kernel: Killed process 21752 (oracle)
Aug 5 10:28:09 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:28:09 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:28:09 xxxxx kernel: Out of memory: kill process 21722 (oracle) score 2138377 or a child
Aug 5 10:28:09 xxxxx kernel: Killed process 21722 (oracle)
Aug 5 10:32:24 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:32:24 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:32:24 xxxxx kernel: Out of memory: kill process 21718 (oracle) score 2135307 or a child
Aug 5 10:32:24 xxxxx kernel: Killed process 21718 (oracle)
Aug 5 10:32:24 xxxxx kernel: gdm-rh-security invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:32:24 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:32:24 xxxxx kernel: Out of memory: kill process 22053 (oracle) score 2135300 or a child
Aug 5 10:32:24 xxxxx kernel: Killed process 22053 (oracle)
Aug 5 10:37:54 xxxxx kernel: beremote invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:37:54 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:37:54 xxxxx kernel: Out of memory: kill process 22238 (oracle) score 2134274 or a child
Aug 5 10:37:54 xxxxx kernel: Killed process 22238 (oracle)
Aug 5 10:37:54 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug 5 10:37:54 xxxxx kernel: [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug 5 10:37:54 xxxxx kernel: Out of memory: kill process 22128 (oracle) score 2133001 or a child
--More--
From the output above we can see that a large number of Oracle processes were killed, which is what caused Oracle to log errors like "found dead shared server 'S016', pid = (35, 23)". The MOS note Found Dead Shared Server Messages Reported In Alert.Log (Doc ID 760872.1) describes this behavior (the note is fairly old, but the principle still applies to this environment):
The following is being reported in the alert.log file
Mon Dec 22 16:48:31 2008
found dead shared server 'S004', pid = (13, 1)
found dead shared server 'S001', pid = (10, 1)
No further errors accompany those messages in the alert.log file.
Listener.log shows no errors.
In other words, when a session is killed, under certain circumstances the shared server process dies with it, producing the "found dead shared server" messages in the alert log. In this case the OS killed session processes in bulk, the shared server processes died, and the database consequently hung and became inaccessible.
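To confirm that an instance really is running in shared server mode and to see the state of its shared servers and dispatchers, a quick check like the following can be run as a DBA user (a sketch; these are standard dynamic performance views):

SQL> show parameter shared_server
SQL> select name, status from v$shared_server;
SQL> select name, status from v$dispatcher;

Filtering the messages file for just the kill records gives the full picture: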
$ grep "Out of memory" messages
Aug 5 10:12:11 xxxxx kernel: Out of memory: kill process 21687 (oracle) score 2296119 or a child
Aug 5 10:12:11 xxxxx kernel: Out of memory: kill process 21668 (oracle) score 2144517 or a child
Aug 5 10:23:09 xxxxx kernel: Out of memory: kill process 21756 (oracle) score 2144517 or a child
Aug 5 10:23:09 xxxxx kernel: Out of memory: kill process 21732 (oracle) score 2138384 or a child
Aug 5 10:28:09 xxxxx kernel: Out of memory: kill process 21752 (oracle) score 2144521 or a child
Aug 5 10:28:09 xxxxx kernel: Out of memory: kill process 21722 (oracle) score 2138377 or a child
Aug 5 10:32:24 xxxxx kernel: Out of memory: kill process 21718 (oracle) score 2135307 or a child
Aug 5 10:32:24 xxxxx kernel: Out of memory: kill process 22053 (oracle) score 2135300 or a child
Aug 5 10:37:54 xxxxx kernel: Out of memory: kill process 22238 (oracle) score 2134274 or a child
Aug 5 10:37:54 xxxxx kernel: Out of memory: kill process 22128 (oracle) score 2133001 or a child
Aug 5 10:38:46 xxxxx kernel: Out of memory: kill process 22227 (oracle) score 2132996 or a child
Aug 5 10:39:14 xxxxx kernel: Out of memory: kill process 22229 (oracle) score 2134280 or a child
Aug 5 10:40:57 xxxxx kernel: Out of memory: kill process 22286 (oracle) score 2135299 or a child
Aug 5 10:41:24 xxxxx kernel: Out of memory: kill process 22245 (oracle) score 2135302 or a child
Aug 5 10:41:25 xxxxx kernel: Out of memory: kill process 22485 (oracle) score 2133009 or a child
Aug 5 10:41:56 xxxxx kernel: Out of memory: kill process 21779 (oracle) score 2132880 or a child
Aug 5 10:42:08 xxxxx kernel: Out of memory: kill process 22068 (oracle) score 2132873 or a child
Aug 5 10:42:26 xxxxx kernel: Out of memory: kill process 22249 (oracle) score 2132873 or a child
Aug 5 10:42:26 xxxxx kernel: Out of memory: kill process 22278 (oracle) score 2132873 or a child
Aug 5 10:42:31 xxxxx kernel: Out of memory: kill process 21662 (oracle) score 2132872 or a child
Aug 5 10:42:47 xxxxx kernel: Out of memory: kill process 22045 (oracle) score 2132872 or a child
Aug 5 10:42:57 xxxxx kernel: Out of memory: kill process 22314 (oracle) score 2132872 or a child
Aug 5 10:43:35 xxxxx kernel: Out of memory: kill process 22336 (oracle) score 2132872 or a child
Aug 5 10:43:35 xxxxx kernel: Out of memory: kill process 22435 (oracle) score 2132870 or a child
Aug 5 10:43:55 xxxxx kernel: Out of memory: kill process 21666 (oracle) score 2132869 or a child
Aug 5 10:44:02 xxxxx kernel: Out of memory: kill process 22263 (oracle) score 2132869 or a child
Aug 5 10:44:19 xxxxx kernel: Out of memory: kill process 22405 (oracle) score 2132866 or a child
Aug 5 10:44:20 xxxxx kernel: Out of memory: kill process 22438 (oracle) score 2132866 or a child
Aug 5 10:44:20 xxxxx kernel: Out of memory: kill process 22453 (oracle) score 2132865 or a child
Aug 5 10:44:23 xxxxx kernel: Out of memory: kill process 22466 (oracle) score 2132737 or a child
Aug 5 10:44:26 xxxxx kernel: Out of memory: kill process 22499 (oracle) score 2132607 or a child
Aug 5 10:44:27 xxxxx kernel: Out of memory: kill process 21716 (oracle) score 1078417 or a child
Aug 5 10:44:27 xxxxx kernel: Out of memory: kill process 21670 (oracle) score 1066455 or a child
Aug 5 10:48:02 xxxxx kernel: Out of memory: kill process 22829 (oracle) score 2134273 or a child
Aug 5 10:49:47 xxxxx kernel: Out of memory: kill process 22900 (oracle) score 2133007 or a child
Aug 5 10:50:36 xxxxx kernel: Out of memory: kill process 22842 (oracle) score 2133095 or a child
Aug 5 10:51:25 xxxxx kernel: Out of memory: kill process 22990 (oracle) score 2134285 or a child
Aug 5 10:51:25 xxxxx kernel: Out of memory: kill process 23054 (oracle) score 2132994 or a child
Aug 5 10:51:49 xxxxx kernel: Out of memory: kill process 22933 (oracle) score 2134277 or a child
Aug 5 10:51:49 xxxxx kernel: Out of memory: kill process 23103 (oracle) score 2132996 or a child
Aug 5 10:52:52 xxxxx kernel: Out of memory: kill process 23211 (oracle) score 2134267 or a child
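To get a feel for how quickly the OOM killer was firing, the kills can be grouped per minute with a one-liner like this (a sketch against the same messages file; the awk fields assume the syslog layout shown above):

$ grep "Out of memory" messages | awk '{print $1, $2, substr($3, 1, 5)}' | sort | uniq -c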
The MOS note Oracle VM Server hangs after Invoking the OOM Killer and having hundreds of kpartx processes spawned and "state S non-preferred supports toluSnA" reported on the FC LUNs (Doc ID 2123877.1) describes the following case:
On Oracle VM Server, during normal server operation, the server suddenly hangs, with the following kind of messages being logged about invoking the Out Of Memory Killer :
Mar 19 09:17:50 ovs01 kernel: Out of memory: Kill process 7382 (xend) score 2 or sacrifice child
Stack traces are also being printed on the console. In the processes dump taken before the Out Of Memory Killer starts killing processes, there are hundreds of kpartx instances:
$ grep -c kpartx messages
Error messages like the following one are also being reported:
Mar 13 03:59:20 ovs01 kernel: sd 12:0:2:98: alua: port group 01 state S non-preferred supports toluSnA
The cause: issue(s) on the Fiber Channel storage server hosting the LUNs used by the Oracle VM Server.
To resolve this, fix the issue on the Fiber Channel storage array; the issue on the VM Server will then go away once the storage issue is fixed.
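If you suspect the same pattern on your own VM Server host, counting the kpartx instances and reviewing the multipath state is a quick first check (a sketch using standard device-mapper-multipath tooling):

# ps -e | grep -c kpartx
# multipath -ll | more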
Our case looks very similar to that one, so I suspect this may well be the same root cause; the system administrators overseas need to check and handle it. Unfortunately, they never got back to me. Sometimes you have to keep an eye on things that fall squarely within their responsibilities and still get no effective support; it is genuinely tiring!