這篇文章是記錄一次ORACLE數據庫UNDO表空間爆滿的分析過程,主要整理、梳理了同事分析的思路。具體過程如下所示:
早上收到一數據庫服務器的UNDO表空間的告警郵件,最早一封是7:55發出的(監控作業是15分鐘一次),從告警郵件分析,好像是UNDO表空間突然一下子被耗盡了。
DB
Tablespace
Allocated
Free
Used
% Free
% Used
192.168.xxx.xxx:1521
UNDOTBS1
16384
190.25
16193.75
1.16
99
使用一些SQL分析了undo表空間使用情況,以及undo segment狀態等等,非常想定位到是哪個或那些SQL耗盡了UNDO表空間,但是沒有一個SQL能實現我的想法,抑或是我不了解。
SELECT UPPER(F.TABLESPACE_NAME) AS "TABLESPACE_NAME",
ROUND(D.MAX_BYTES,2) AS "TBS_TOTAL_SIZE" ,
ROUND(D.AVAILB_BYTES ,2) AS "TABLESPACE_SIZE",
ROUND(D.MAX_BYTES - D.AVAILB_BYTES +USED_BYTES,2) AS "TBS_AVABLE_SIZE",
ROUND((D.AVAILB_BYTES - F.USED_BYTES),2) AS "TBS_USED_SIZE",
TO_CHAR(ROUND((D.AVAILB_BYTES - F.USED_BYTES) / D.AVAILB_BYTES * 100,
2),
'999.99') AS "USED_RATE(%)",
ROUND(F.USED_BYTES, 6) AS "FREE_SIZE(G)"
FROM (SELECT TABLESPACE_NAME,
ROUND(SUM(BYTES) / (1024 * 1024 * 1024), 6) USED_BYTES,
ROUND(MAX(BYTES) / (1024 * 1024 * 1024), 6) MAX_BYTES
FROM SYS.DBA_FREE_SPACE
GROUP BY TABLESPACE_NAME) F,
(SELECT DD.TABLESPACE_NAME,
ROUND(SUM(DD.BYTES) / (1024 * 1024 * 1024), 6) AVAILB_BYTES,
ROUND(SUM(DECODE(DD.MAXBYTES, 0, DD.BYTES, DD.MAXBYTES))/(1024*1024*1024),6) MAX_BYTES
FROM SYS.DBA_DATA_FILES DD
GROUP BY DD.TABLESPACE_NAME) D
HERE D.TABLESPACE_NAME = F.TABLESPACE_NAME
AND D.TABLESPACE_NAME=&UNDO_TABLESPACE_NAME
RDER BY 5 DESC;
select usn,xacts,rssize/1024/1024/1024,hwmsize/1024/1024/1024,shrinks
from v$rollstat order by rssize;
既然直接入手,無法定位,那就曲線分析,首先檢查、分析了一下redo log,發現在7點這段時間,日志切換了83次之多,橫向、縱向對比,明顯異常,如下截圖所示:
SELECT
TO_CHAR(FIRST_TIME,'YYYY-MM-DD') DAY,
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'00',1,0)),'99') "00",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'01',1,0)),'99') "01",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'02',1,0)),'99') "02",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'03',1,0)),'99') "03",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'04',1,0)),'99') "04",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'05',1,0)),'99') "05",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'06',1,0)),'99') "06",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'07',1,0)),'99') "07",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'08',1,0)),'99') "0",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'09',1,0)),'99') "09",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'10',1,0)),'99') "10",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'11',1,0)),'99') "11",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'12',1,0)),'99') "12",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'13',1,0)),'99') "13",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'14',1,0)),'99') "14",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'15',1,0)),'99') "15",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'16',1,0)),'99') "16",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'17',1,0)),'99') "17",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'18',1,0)),'99') "18",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'19',1,0)),'99') "19",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'20',1,0)),'99') "20",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'21',1,0)),'99') "21",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'22',1,0)),'99') "22",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'23',1,0)),'99') "23"
FROM
V$LOG_HISTORY
GROUP BY
TO_CHAR(FIRST_TIME,'YYYY-MM-DD')
ORDER BY 1 DESC;
生成了實例在7:00~8:00時間段的AWR報告,從下面指標我們可以看出,數據庫實例在這段時間呢,其實是非常空閒的,因為DB Time為9.74(mins)
另外,從Time Model Statistics部分來看,主要時間花在background elapsed time,而不是DB Time,我們可以判斷時間主要耗費在後台進程,而不是前台進程。另外sql execute elapsed time耗用了DB Time的70.36的時間。
然後我們來看SQL order by Gets部分信息, 第一個SQL是刪除WRH$_SQL_PLAN的記錄,當然也有刪除wrh$_sqltext、WRH$_SEG_STAT_OBJ表記錄的SQL,如下所示
DELETE
FROM WRH$_SQL_PLAN tab
WHERE (:beg_snap <= tab.snap_id
AND tab.snap_id <= :end_snap
AND dbid = :dbid)
AND NOT EXISTS
(SELECT 1
FROM WRM$_BASELINE b
WHERE (tab.dbid = b.dbid)
AND (tab.snap_id >= b.start_snap_id)
AND (tab.snap_id <= b.end_snap_id)
)
DELETE
FROM wrh$_sqltext tab
WHERE (tab.dbid = :dbid
AND :beg_snap <= tab.snap_id
AND tab.snap_id <= :end_snap
AND tab.ref_count = 0)
AND NOT EXISTS
(SELECT 1
FROM WRM$_BASELINE b
WHERE (b.dbid = :dbid2
AND tab.snap_id >= b.start_snap_id
AND tab.snap_id <= b.end_snap_id)
);
DELETE
FROM WRH$_SEG_STAT_OBJ tab
WHERE (:beg_snap <= tab.snap_id
AND tab.snap_id <= :end_snap
AND dbid = :dbid)
AND NOT EXISTS
(SELECT 1
FROM WRM$_BASELINE b
WHERE (tab.dbid = b.dbid)
AND (tab.snap_id >= b.start_snap_id)
AND (tab.snap_id <= b.end_snap_id)
);
查看SQL ordered by Reads部分信息,發現主要也是刪除系統表WRH$_SQL_PLAN記錄 (這個表是非常大的)
DELETE
FROM WRH$_SQL_PLAN tab
WHERE (:beg_snap <= tab.snap_id
AND tab.snap_id <= :end_snap
AND dbid = :dbid)
AND NOT EXISTS
(SELECT 1
FROM WRM$_BASELINE b
WHERE (tab.dbid = b.dbid)
AND (tab.snap_id >= b.start_snap_id)
AND (tab.snap_id <= b.end_snap_id)
)
然後我們查看AWR報告的Tablespace IO Stats部分,IO主要集中在SYSAUX,UNDOTBS1這兩個表空間,然後你會發現那個表WRH$_SQL_PLAN就是在SYSAUX下
所以,上面種種證據顯示,讓我們幾乎可以斷定主要是下面這個SQL導致了UNDO表空間使用的暴增。當然分析過程中,還有一些旁聽佐證。在此感覺沒有必要一一列舉了。
DELETE
FROM WRH$_SQL_PLAN tab
WHERE (:beg_snap <= tab.snap_id
AND tab.snap_id <= :end_snap
AND dbid = :dbid)
AND NOT EXISTS
(SELECT 1
FROM WRM$_BASELINE b
WHERE (tab.dbid = b.dbid)
AND (tab.snap_id >= b.start_snap_id)
AND (tab.snap_id <= b.end_snap_id)
)