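Recovering from a Damaged Voting Disk in 10g Clusterware
The voting disk is critical to RAC, whether on 10g Clusterware or 11g GI. It acts as the quorum disk: when a node in the cluster fails and drops off the network, the voting disk is used to decide whether that node should be evicted so that the rest of the cluster keeps running normally. If the voting disk is damaged, the cluster services cannot start, no cluster resources can be loaded, and the cluster ends up out of service.
The voting disk should therefore be backed up as a matter of routine. In 11g (11gR2), the voting disk and OCR are placed in an ASM disk group by default, so they need less special attention. In 10g Clusterware they cannot be placed in an ASM disk group and can only be used as raw devices, so the voting disk deserves particular attention and regular backups, for example: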
Backing up and restoring the voting disk with dd:
Backup:  dd if=/dev/raw/raw3 of=/tmp/votedisk.bak
Restore: dd if=/tmp/votedisk.bak of=/dev/raw/raw3
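The one-line dd backup can be wrapped in a small script so that every backup gets a timestamped name and a stored checksum that can be rechecked before a restore. This is only a minimal sketch: the function names backup_votedisk and verify_votedisk_backup, the .cksum sidecar file, and the directory layout are assumptions made for illustration, not part of any Oracle tooling.

```shell
#!/bin/sh
# Sketch: dd-based voting disk backup with a checksum sidecar.
# All names here (functions, file suffixes) are hypothetical.

backup_votedisk() {
    src=$1        # source device, e.g. /dev/raw/raw3
    dest_dir=$2   # directory that receives the backup file
    dest="$dest_dir/votedisk_$(date +%Y%m%d_%H%M%S).bak"

    # Copy the whole device; dd's statistics on stderr are discarded.
    dd if="$src" of="$dest" bs=4096 2>/dev/null || return 1

    # An empty backup file is treated as a failure.
    [ -s "$dest" ] || return 1

    # Record CRC and size of the backup so it can be rechecked later.
    cksum < "$dest" > "$dest.cksum" || return 1

    # Print the backup path for the caller.
    echo "$dest"
}

verify_votedisk_backup() {
    bak=$1
    # Succeeds only if the file still matches its recorded checksum.
    [ "$(cksum < "$bak")" = "$(cat "$bak.cksum")" ]
}
```

A typical call would be `backup_votedisk /dev/raw/raw3 /backup/votedisk`, followed by `verify_votedisk_backup` on the returned path before the file is ever used for a restore. The checksum covers the backup file only; it does not prove that the file still matches the live device.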
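If, unfortunately, no backup was ever taken and the voting disk was not mirrored, then once it is damaged the only option is to rebuild CRS. The walkthrough below demonstrates the process:
-- Stop CRS, then deliberately corrupt the voting disk, which here is /dev/raw/raw3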
[root@rac1 ~]# dd if=/dev/zero of=/dev/raw/raw3 bs=4096 count=12800
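When CRS is started again, it reports that it cannot start. The ocssd.log file contains entries showing that the disk is corrupt:
PS: the 10g Clusterware logs are located under $ORA_CRS_HOME/log/<hostname>/...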
[ CSSD]2015-01-16 09:37:38.327 >USER: Oracle Database 10g CSS Release 10.2.0.1.0 Production Copyright 1996, 2094 Oracle. All rights reserved.
[ CSSD]2015-01-16 09:37:38.327 >USER: CSS daemon log for node rac1, number 1, in cluster cluster
[ clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_CSSD))
[ CSSD]2015-01-16 09:37:38.332 [3059615952] >TRACE: clssscmain: local-only set to false
[ CSSD]2015-01-16 09:37:38.344 [3059615952] >TRACE: clssnmReadNodeInfo: added node 1 (rac1) to cluster
[ CSSD]2015-01-16 09:37:38.352 [3059615952] >TRACE: clssnmReadNodeInfo: added node 2 (rac2) to cluster
[ CSSD]2015-01-16 09:37:38.356 [3032808336] >TRACE: clssnm_skgxnmon: skgxn init failed, rc 1
[ CSSD]2015-01-16 09:37:38.356 [3059615952] >TRACE: clssnm_skgxnonline: Using vacuous skgxn monitor
[ CSSD]2015-01-16 09:37:38.362 [3059615952] >TRACE: clssnmDiskStateChange: state from 1 to 2 disk (0//dev/raw/raw3)
[ CSSD]2015-01-16 09:37:40.381 [3032808336] >TRACE: clssnmvDiskOpen: corrupt kill block on disk (0x09!=0x636c73536b696c4c)
[ CSSD]2015-01-16 09:37:40.381 [3032808336] >TRACE: clssnmDiskStateChange: state from 2 to 3 disk (0//dev/raw/raw3)
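Finding the relevant entry by eye in a long ocssd.log can be tedious, so a small grep wrapper may help. The sketch below searches for the "corrupt kill block" message seen in the excerpt above. The function name find_votedisk_errors is hypothetical, and other kinds of voting disk failure may well log different text, so an empty result does not prove the disk is healthy.

```shell
#!/bin/sh
# Sketch: print CSSD log lines that point to voting disk corruption.
# find_votedisk_errors is a hypothetical helper; the pattern is taken
# from the log excerpt in this article.

find_votedisk_errors() {
    logfile=$1
    # -n prefixes each match with its line number in the log file.
    grep -n 'clssnmvDiskOpen: corrupt kill block' "$logfile"
}
```

It would be called with the full path to the log, for example `find_votedisk_errors /path/to/ocssd.log`. The exit status follows grep: 0 when a match is found, non-zero otherwise.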
Rebuilding CRS is straightforward. It comes down to running two scripts as root on each node:
1. $ORA_CRS_HOME/install/rootdelete.sh
2. $ORA_CRS_HOME/install/rootdeinstall.sh
Node 1:
[root@rac1 install]# ./rootdelete.sh
Shutting down Oracle Cluster Ready Services (CRS):
Stopping resources.
Error while stopping resources. Possible cause: CRSD is down.
Stopping CSSD.
Unable to communicate with the CSS daemon.
Shutdown has begun. The daemons should exit soon.
Checking to see if Oracle CRS stack is down...
Oracle CRS stack is not running.
Oracle CRS stack is down now.
Removing script for Oracle Cluster Ready services
Updating ocr file for downgrade
Cleaning up SCR settings in '/etc/oracle/scls_scr'
[root@rac1 install]# ./rootdeinstall.sh
Removing contents from OCR device
2560+0 records in
2560+0 records out
10485760 bytes (10 MB) copied, 0.590608 seconds, 17.8 MB/s
Node 2:
[root@rac2 install]# ./rootdelete.sh
Shutting down Oracle Cluster Ready Services (CRS):
OCR initialization failed with invalid format: PROC-22: The OCR backend has an invalid format
Shutdown has begun. The daemons should exit soon.
Checking to see if Oracle CRS stack is down...
Oracle CRS stack is not running.
Oracle CRS stack is down now.
Removing script for Oracle Cluster Ready services
Updating ocr file for downgrade
Cleaning up SCR settings in '/etc/oracle/scls_scr'
[root@rac2 install]# ./rootdeinstall.sh
Removing contents from OCR device
2560+0 records in
2560+0 records out
10485760 bytes (10 MB) copied, 0.627909 seconds, 16.7 MB/s
[root@rac2 install]# dd if=/dev/zero of=/dev/raw/raw3 bs=4096 count=128000
dd: writing `/dev/raw/raw3': No space left on device
25601+0 records in
25600+0 records out
104857600 bytes (105 MB) copied, 5.40456 seconds, 19.4 MB/s
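Before going further, it may be worth confirming that the start of the voting device really has been zeroed. The following is a minimal sketch of such a check; is_zeroed is a hypothetical helper, and the 1 MiB default is an arbitrary choice for illustration.

```shell
#!/bin/sh
# Sketch: succeed if the first N bytes of a file or device are all zero.
# is_zeroed is a hypothetical helper, not an Oracle utility.

is_zeroed() {
    target=$1
    nbytes=${2:-1048576}   # default: inspect the first 1 MiB

    # Refuse to answer for a target that cannot be read.
    [ -r "$target" ] || return 2

    # Delete all NUL bytes; whatever remains is non-zero data.
    leftover=$(head -c "$nbytes" "$target" | tr -d '\000' | wc -c | tr -d ' ')
    [ "$leftover" -eq 0 ]
}
```

For example, `is_zeroed /dev/raw/raw3 && echo "raw3 is clean"` reports whether the first 1 MiB of the device contains only zeros.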
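Then run $ORA_CRS_HOME/root.sh again on the two nodes, one after the other. The software itself does not need to be reinstalled through OUI. Because rootdeinstall.sh cleared the contents of the OCR, resources that were registered there will typically have to be registered again afterwards.
If the scripts fail to remove everything cleanly, the following files and directories can be deleted by hand so that CRS can be set up again without problems: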
rm /etc/oracle/*
rm -f /etc/init.d/init.cssd
rm -f /etc/init.d/init.crs
rm -f /etc/init.d/init.crsd
rm -f /etc/init.d/init.evmd
rm -f /etc/rc2.d/K96init.crs
rm -f /etc/rc2.d/S96init.crs
rm -f /etc/rc3.d/K96init.crs
rm -f /etc/rc3.d/S96init.crs
rm -f /etc/rc5.d/K96init.crs
rm -f /etc/rc5.d/S96init.crs
rm -Rf /etc/oracle/scls_scr
rm -f /etc/inittab.crs
cp /etc/inittab.orig /etc/inittab
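Since these rm commands are destructive, it can be useful to first see which of the listed paths actually exist on a node. The sketch below is a read-only check over the same paths. The function name list_crs_leftovers and its optional prefix argument are hypothetical; the prefix is only there so the function can be tried against a scratch directory instead of the real root filesystem.

```shell
#!/bin/sh
# Sketch: list which of the CRS files named in this article still exist.
# list_crs_leftovers is a hypothetical helper and deletes nothing.

list_crs_leftovers() {
    prefix=${1:-}   # optional path prefix; empty means the real /
    for path in \
        /etc/init.d/init.cssd /etc/init.d/init.crs \
        /etc/init.d/init.crsd /etc/init.d/init.evmd \
        /etc/rc2.d/K96init.crs /etc/rc2.d/S96init.crs \
        /etc/rc3.d/K96init.crs /etc/rc3.d/S96init.crs \
        /etc/rc5.d/K96init.crs /etc/rc5.d/S96init.crs \
        /etc/oracle/scls_scr /etc/inittab.crs
    do
        # Print only the paths that are present.
        if [ -e "$prefix$path" ]; then
            echo "$prefix$path"
        fi
    done
    return 0
}
```

Running `list_crs_leftovers` with no argument on each node prints the leftovers, one per line. No output means none of these paths exist.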
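Summary:
We normally keep multiple mirrored copies of the OCR and the voting disk, and when they live on raw devices we also back them up separately with dd, so in practice they are not easily damaged or lost. If damage does occur and there is no backup, the only way to solve the problem is to rebuild CRS. It is the DBA's last resort.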