ORACLE數據庫的systemstat dump生成trace文件雖然比較簡單,但是怎麼從trace文件中浩如煙海的信息中提煉有用信息,並作出分析診斷是一件技術活,下面收集、整理如何分析解讀systemstat dump產生的trace文件。
如果要人工去解讀systemstat dump生成的trace文件,真是一件體力活,因為這些trace文件動不動就幾百M甚至更大,它產生的跟蹤文件包含了系統中所有進程的進程狀態等信息。每個進程對應跟蹤文件中的一段內容,反映該進程的狀態信息,包括進程信息,會話信息,enqueues信息(主要是lock的信息),緩沖區的信息和該進程在SGA區中持有的(held)對象的狀態等信息。dump systemstate產生的跟蹤文件是從dump那一刻開始到dump任務完成之間一段事件內的系統內所有進程的信息。 我們需要的是找到導致問題出現的進程的信息,然後根據進程的信息判斷導致問題出現的root cause,並在分析問題後解決問題。
幸好網上有人寫了這個一個工具ass109.awk ,可以節約分析systemstat dump文件或跟蹤文件(trace file)的時間,可以將trace文件中關鍵信息梳理、整理出來,當然如果了解詳細信息,還是必須人工去解讀。下面貼上一個例子,是我在學習中的一個案例,本人也是在學習、研究過程中,如有分析不對的地方敬請指出
[oracle@db-server udump]$ awk -f ass109.awk scm2_ora_25575.trc
Starting Systemstate 1
.................................................
Starting Systemstate 2
....................................................
Starting Systemstate 3
....................................................
Ass.Awk Version 1.0.9 - Processing scm2_ora_25575.trc
System State 1
~~~~~~~~~~~~~~~~
1:
2: waiting for 'pmon timer'
3: waiting for 'rdbms ipc message'
4: waiting for 'rdbms ipc message'
5: waiting for 'rdbms ipc message'
6: waiting for 'rdbms ipc message'
7: waiting for 'rdbms ipc message'
8: last wait for 'smon timer'
9: waiting for 'rdbms ipc message'
10: waiting for 'rdbms ipc message'
11: waiting for 'rdbms ipc message'
12: waiting for 'rdbms ipc message'
13: waiting for 'SQL*Net message from client'[Latch 855675ae0]
Cmd: Select
14: waiting for 'SQL*Net message from client'[Latch 8556759a0]
Cmd: Select
15: waiting for 'SQL*Net message from client'
Cmd: Select
16: waiting for 'SQL*Net message from client'[Latch 855675720]
17: waiting for 'SQL*Net message from client'[Latch 8556755e0]
Cmd: Insert
18: waiting for 'SQL*Net message from client'[Latch 8556755e0]
Cmd: Select
19: waiting for 'SQL*Net message from client'[Latch 8556755e0]
Cmd: Select
20: waiting for 'SQL*Net message from client'[Latch 8556755e0]
Cmd: Select
21: waiting for 'SQL*Net message from client'[Latch 8556755e0]
Cmd: Insert
22: waiting for 'virtual circuit status'
Cmd: Select
23:
24:
25: waiting for 'virtual circuit status'
Cmd: Select
26:
27: waiting for 'virtual circuit status'
Cmd: Select
28:
29: waiting for 'latch: shared pool' [Latch 8556759a0]
Cmd: Select
30:
31: waiting for 'virtual circuit status'
Cmd: Select
33: waiting for 'jobq slave wait'
34:
35:
36: waiting for 'Streams AQ: qmn slave idle wait'
37: waiting for 'rdbms ipc message'
38: waiting for 'rdbms ipc message'
39: waiting for 'rdbms ipc message'
40: waiting for 'rdbms ipc message'
41: waiting for 'rdbms ipc message'
42: waiting for 'rdbms ipc message'
43: waiting for 'rdbms ipc message'
44: waiting for 'rdbms ipc message'
45: waiting for 'rdbms ipc message'
46: waiting for 'rdbms ipc message'
47: waiting for 'Streams AQ: qmn coordinator idle wait'
49: for 'Streams AQ: waiting for time management or cleanup tasks'
58:
61: waiting for 'virtual circuit status'
Cmd: Select
Blockers
~~~~~~~~
Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate.
Resource Holder State
Latch 855675ae0 ??? Blocker
Latch 8556759a0 ??? Blocker
Latch 855675720 ??? Blocker
Latch 8556755e0 ??? Blocker
Object Names
~~~~~~~~~~~~
Latch 855675ae0 Child library cache
Latch 8556759a0 Child library cache
Latch 855675720 Child library cache
Latch 8556755e0 Child library cache
System State 2
~~~~~~~~~~~~~~~~
1:
2: waiting for 'pmon timer'
3: waiting for 'rdbms ipc message'
4: waiting for 'rdbms ipc message'
5: waiting for 'rdbms ipc message'
6: waiting for 'rdbms ipc message'
7: waiting for 'rdbms ipc message'
8: waiting for 'smon timer'
9: waiting for 'rdbms ipc message'
10: waiting for 'rdbms ipc message'
11: waiting for 'rdbms ipc message'
12: waiting for 'rdbms ipc message'
13: waiting for 'SQL*Net message from client'[Latch 855675900]
Cmd: Select
14: waiting for 'SQL*Net message from client'[Latch 855675900]
Cmd: Select
15: waiting for 'SQL*Net message from client'[Latch 855675900]
Cmd: Select
16: waiting for 'SQL*Net message from client'[Latch 855675900]
Cmd: Select
17: waiting for 'SQL*Net message from client'[Latch 855675900]
Cmd: Select
18: waiting for 'SQL*Net message from client'[Latch 855675900]
Cmd: Select
19: waiting for 'SQL*Net message from client'[Latch 855675900]
Cmd: Select
20: waiting for 'SQL*Net message from client'[Latch 855675900]
Cmd: Select
21: waiting for 'SQL*Net message from client'[Latch 855675680]
Cmd: Select
22:
23:
24:
25:
26:
27:
28: waiting for 'virtual circuit status'
29:
30:
31: waiting for 'virtual circuit status'
Cmd: Select
32: waiting for 'jobq slave wait'
33: last wait for 'latch: shared pool' [Latch 600f7320]
34:
35: waiting for 'virtual circuit status'
Cmd: Select
36: waiting for 'Streams AQ: qmn slave idle wait'
37: waiting for 'rdbms ipc message'
38: waiting for 'rdbms ipc message'
39: waiting for 'rdbms ipc message'
40: waiting for 'rdbms ipc message'
41: waiting for 'rdbms ipc message'
42: waiting for 'rdbms ipc message'
43: waiting for 'rdbms ipc message'
44: waiting for 'rdbms ipc message'
45: waiting for 'rdbms ipc message'
46: waiting for 'rdbms ipc message'
47: waiting for 'Streams AQ: qmn coordinator idle wait'
48: waiting for 'library cache load lock'
49: for 'Streams AQ: waiting for time management or cleanup tasks'
50: waiting for 'library cache load lock'
58:
61: waiting for 'virtual circuit status'
Cmd: Select
Blockers
~~~~~~~~
Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate.
Resource Holder State
Latch 855675900 ??? Blocker
Latch 855675680 ??? Blocker
Latch 600f7320 ??? Blocker
Object Names
~~~~~~~~~~~~
Latch 855675900 Child library cache
Latch 855675680 Child library cache
Latch 600f7320 Child shared pool
System State 3
~~~~~~~~~~~~~~~~
1:
2: waiting for 'pmon timer'
3: waiting for 'rdbms ipc message'
4: waiting for 'rdbms ipc message'
5: waiting for 'rdbms ipc message'
6: waiting for 'rdbms ipc message'
7: waiting for 'rdbms ipc message'
8: waiting for 'smon timer'
9: waiting for 'rdbms ipc message'
10: waiting for 'rdbms ipc message'
11: waiting for 'latch: shared pool' [Latch 600f7320]
12: waiting for 'rdbms ipc message'
13: waiting for 'SQL*Net message from client'[Latch 855675540]
Cmd: Select
14: waiting for 'SQL*Net message from client'[Latch 855675540]
Cmd: Select
15: waiting for 'SQL*Net message from client'[Latch 855675b80]
Cmd: Select
16: waiting for 'SQL*Net message from client'[Latch 8556757c0]
Cmd: Select
17: waiting for 'SQL*Net message from client'[Latch 855675680]
Cmd: Select
18: waiting for 'SQL*Net message from client'[Latch 855675680]
Cmd: Select
19: waiting for 'SQL*Net message from client'[Latch 855675680]
Cmd: Select
20: waiting for 'SQL*Net message from client'
Cmd: Select
21: waiting for 'SQL*Net message from client'[Latch 855675680]
Cmd: Select
22:
23:
24:
25:
26:
27:
28: waiting for 'virtual circuit status'
29:
30:
31: waiting for 'virtual circuit status'
Cmd: Select
32: waiting for 'jobq slave wait'
33: last wait for 'latch: shared pool' [Latch 600f7320]
Cmd: Select
34:
35: waiting for 'virtual circuit status'
Cmd: Select
36: waiting for 'Streams AQ: qmn slave idle wait'
37: waiting for 'rdbms ipc message'
38: waiting for 'rdbms ipc message'
39: waiting for 'rdbms ipc message'
40: waiting for 'rdbms ipc message'
41: waiting for 'rdbms ipc message'
42: waiting for 'rdbms ipc message'
43: waiting for 'rdbms ipc message'
44: waiting for 'rdbms ipc message'
45: waiting for 'rdbms ipc message'
46: waiting for 'rdbms ipc message'
47: waiting for 'Streams AQ: qmn coordinator idle wait'
48: waiting for 'SQL*Net message from client'
49: for 'Streams AQ: waiting for time management or cleanup tasks'
50: waiting for 'latch: shared pool' [Latch 600f7320]
Cmd: Select
58:
61: waiting for 'virtual circuit status'
Cmd: Select
Blockers
~~~~~~~~
Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate.
Resource Holder State
Latch 600f7320 ??? Blocker
Latch 855675540 ??? Blocker
Latch 855675b80 ??? Blocker
Latch 8556757c0 ??? Blocker
Latch 855675680 ??? Blocker
Object Names
~~~~~~~~~~~~
Latch 600f7320 Child shared pool
Latch 855675540 Child library cache
Latch 855675b80 Child library cache
Latch 8556757c0 Child library cache
Latch 855675680 Child library cache
從輸出信息我們能判斷我當時做了3次系統狀態轉儲(實際也是執行了三次oradebug dump systemstate 266),從System State 2,我們可以看到有3個Blocker
其中的一個Blocker的latch是855675900,而且這個latch造成了進程20、17、16、19、14、21、15的waiting for 'SQL*Net message from client',從下面
看到hold住latch 855675900是oracle@xxxx (J000)進程,也就是job的進程。也就說,由於這個j000進程的異常,hold住了855675900 的latch。
我們到hold住latch 855675900是oracle@xxxxxx (J000)進程,也就是job的進程。也就說,由於這個j000進程的異常,hold住了855675900 的latch。其實這個案例跟“一個Job運行失敗導致數據庫掛死”有點類似,最後也發現這個JOB是EMD_MAINTENANCE.EXECUTE_EM_DBMS_JOB_PROCS,當然引起問題的原因更復雜,不在此處討論。另外,metalink上也有一篇關於如何解讀、理解Systemstate Dumps的文章: Reading and Understanding Systemstate Dumps (文檔 ID 423153.1),具體內容如下所示。
To be able to read a systemstate, or navigate through a systemstate in order to identify what sessions are doing and ,
in the case of a waiting session, which session(s) hold the resource it requires
This document is intended for DBAs.
另外如果要理解、解讀systemstate dump的內容,如何閱讀systemstate dump這篇文章不得不細讀。這個裡面講述了很多Detail方面的東西。非常受益!
參考資料:
http://www.oracleblog.org/working-case/database-hang-due-to-job-dead/
http://www.askmaclean.com/archives/%E8%BD%AC%E5%A6%82%E4%BD%95%E9%98%85%E8%AF%BBsystemstate-dump.html