IP沖突導致的ORA-12170
開發反饋應用程序連不上數據庫,並且應用日志中報錯:
2014-03-24 09:56:03,095 [WARN] ==== http-bio-8010-exec-3: org.hibernate.util.JDBCExceptionReporter.logExceptions(77)
SQL Error: 20, SQLState: 61000
2014-03-24 09:56:03,096 [ERROR] ==== http-bio-8010-exec-3: org.hibernate.util.JDBCExceptionReporter.logExceptions(78)
The Network Adapter could not establish the connection
DB環境為RAC環境,登陸上兩台服務器分別檢查,發現RAC上的資源都顯示為正常狀態,包括instance,監聽等,並且從服務器端sqlplus / as sysdba連進去坐了些switch logfile之類的操作,都沒有問題。
於是嘗試從我的個人電腦上Plsql連接DB,半天不響應,之後報了:ORA-12170,TNS:Connect timeout occurred。看看sqlnet.log的信息:
***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for 32-bit Windows: Version 10.2.0.1.0 - Production
Windows NT TCP/IP NT Protocol Adapter for 32-bit Windows: Version 10.2.0.1.0 - Production
Time: 24-3月 -2014 09:43:08
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS: 操作超時
ns secondary err code: 12560
nt main err code: 505
TNS-00505: 操作超時
nt secondary err code: 60
nt OS err code: 0
Client address: <unknown>
***********************************************************************
各個錯誤代碼如下:
[oracle@bssdb02 ~]$ oerr ora 12170
12170, 00000, "TNS:Connect timeout occurred"
// *Cause: The server shut down because connection establishment or
// communication with a client failed to complete within the allotted time
// interval. This may be a result of network or system delays; or this may
// indicate that a malicious client is trying to cause a Denial of Service
// attack on the server.
// *Action: If the error occurred because of a slow network or system,
// reconfigure one or all of the parameters SQLNET.INBOUND_CONNECT_TIMEOUT,
// SQLNET.SEND_TIMEOUT, SQLNET.RECV_TIMEOUT in sqlnet.ora to larger values.
// If a malicious client is suspected, use the address in sqlnet.log to
// identify the source and restrict access. Note that logged addresses may
// not be reliable as they can be forged (e.g. in TCP/IP).
[oracle@bssdb02 ~]$ oerr ora 12535
12535, 00000, "TNS:operation timed out"
// *Cause: The requested operation could not be completed within the time out
// period.
// *Action: Look at the documentation on the secondary errors for possible
// remedy. See SQLNET.LOG to find secondary error if not provided explicitly.
// Turn on tracing to gather more information.
[oracle@bssdb02 ~]$ oerr ora 12560
12560, 00000, "TNS:protocol adapter error"
// *Cause: A generic protocol adapter error occurred.
// *Action: Check addresses used for proper protocol specification. Before
// reporting this error, look at the error stack and check for lower level
// transport errors. For further details, turn on tracing and reexecute the
// operation. Turn off tracing when the operation is complete.
[oracle@bssdb02 ~]$
很詭異,這種現象一般是服務名或者連接串配置有誤,但同時多個用戶突然出現這種錯誤,應該可以排除這方面的原因,防火牆都是關閉狀態,這方面影響因素也可以排除,於是做了level 16的客戶端的tns跟蹤:
trace_level_client = 16
trace_file_client = client
trace_directory_client = d:\tmp
從trace內容來看,應該是在NODE1將用戶連接轉發到NODE2之後,用戶連接在連接NODE2的時候出現問題導致的。
[24-3月 -2014 09:40:59:687] nsc2addr: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=10.0.23.135)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=RAC)(FAILOVER_MODE=(TYPE=SESSION)(METHOD=BASIC))(CID=(PROGRAM=C:\oracle\product\10.2.0\db_1\bin\sqlplus.exe)(HOST=BPIT-10003657)(USER=pengmd))))
[24-3月 -2014 09:40:59:687] nttbnd2addr: entry
[24-3月 -2014 09:40:59:687] snlinGetAddrInfo: entry
[24-3月 -2014 09:40:59:687] snlinGetAddrInfo: exit
[24-3月 -2014 09:40:59:687] nttbnd2addr: using host IP address: 10.0.23.135
[24-3月 -2014 09:40:59:687] snlinFreeAddrInfo: entry
[24-3月 -2014 09:40:59:687] snlinFreeAddrInfo: exit
[24-3月 -2014 09:40:59:687] nttbnd2addr: exit
[24-3月 -2014 09:40:59:687] nsc2addr: normal exit
[24-3月 -2014 09:40:59:688] nsopen: entry
[24-3月 -2014 09:40:59:688] nsmal: entry
[24-3月 -2014 09:40:59:688] nsmal: 736 bytes at 0x1bbf068
[24-3月 -2014 09:40:59:688] nsmal: normal exit
[24-3月 -2014 09:40:59:688] nsopenmplx: entry
[24-3月 -2014 09:40:59:688] nsmal: entry
[24-3月 -2014 09:40:59:688] nsmal: 1644 bytes at 0x1bbf350
[24-3月 -2014 09:40:59:688] nsmal: normal exit
[24-3月 -2014 09:40:59:688] nsiorini: entry
[24-3月 -2014 09:40:59:688] nsbal: entry
[24-3月 -2014 09:40:59:688] nsbgetfl: entry
[24-3月 -2014 09:40:59:688] nsbgetfl: normal exit
[24-3月 -2014 09:40:59:688] nsmal: entry
[24-3月 -2014 09:40:59:688] nsmal: 84 bytes at 0x1a2feb8
[24-3月 -2014 09:40:59:688] nsmal: normal exit
[24-3月 -2014 09:40:59:688] nsbal: normal exit
[24-3月 -2014 09:40:59:688] nsiorini: exit (0)
[24-3月 -2014 09:40:59:688] nscpxget: entry
[24-3月 -2014 09:40:59:688] nscpxget: normal exit
[24-3月 -2014 09:40:59:688] nsopenalloc_nsntx: nlhthput on mplx_ht_nsgbu:ctx=1bbf068, nsntx=1bbf350
[24-3月 -2014 09:40:59:688] nsopenmplx: normal exit
[24-3月 -2014 09:40:59:688] nsopen: opening transport...
[24-3月 -2014 09:40:59:688] nttcon: entry
[24-3月 -2014 09:40:59:688] nttcon: toc = 1
[24-3月 -2014 09:40:59:688] nttcnp: entry
[24-3月 -2014 09:40:59:688] nttcnp: creating a socket.
[24-3月 -2014 09:40:59:689] nttcnp: exit
[24-3月 -2014 09:40:59:689] nttcni: entry
[24-3月 -2014 09:40:59:689] nttcni: trying to connect to socket 288.
[24-3月 -2014 09:40:59:691] snlinGetNameInfo: entry
[24-3月 -2014 09:40:59:691] snlinGetNameInfo: Using numeric form of host's address 10.25.192.23
[24-3月 -2014 09:40:59:691] snlinGetNameInfo: exit
[24-3月 -2014 09:40:59:691] nttcni: connected on ipaddr 10.25.192.23
[24-3月 -2014 09:40:59:691] nttcni: exit
[24-3月 -2014 09:40:59:691] nttcon: NT layer TCP/IP connection has been established. <==========NODE1已經跟客戶端已經成功建立了連接
[24-3月 -2014 09:40:59:691] nttcon: set TCP_NODELAY on 288
[24-3月 -2014 09:40:59:691] nttcon: exit
[24-3月 -2014 09:40:59:691] nsopen: transport is open
[24-3月 -2014 09:40:59:703] nscall: connecting...
[24-3月 -2014 09:40:59:703] nsc2addr: entry
[24-3月 -2014 09:40:59:703] nsc2addr: (address=(protocol=tcp)(host=10.0.23.136)(port=1521)) <===========NODE1將連接轉到NODE2
[24-3月 -2014 09:40:59:703] nttbnd2addr: entry
[24-3月 -2014 09:40:59:703] snlinGetAddrInfo: entry
[24-3月 -2014 09:40:59:703] snlinGetAddrInfo: exit
[24-3月 -2014 09:40:59:703] nttbnd2addr: using host IP address: 10.0.23.136 <======用戶發起到NODE2的連接
[24-3月 -2014 09:40:59:703] snlinFreeAddrInfo: entry
[24-3月 -2014 09:40:59:703] snlinFreeAddrInfo: exit
[24-3月 -2014 09:40:59:703] nttbnd2addr: exit
[24-3月 -2014 09:40:59:703] nsc2addr: normal exit
[24-3月 -2014 09:40:59:703] nsopen: entry
[24-3月 -2014 09:40:59:703] nsmal: entry
[24-3月 -2014 09:40:59:703] nsmal: 736 bytes at 0x1bbf068
[24-3月 -2014 09:40:59:703] nsmal: normal exit
[24-3月 -2014 09:40:59:703] nsopenmplx: entry
[24-3月 -2014 09:40:59:703] nsmal: entry
[24-3月 -2014 09:40:59:703] nsmal: 1644 bytes at 0x1bbf350
[24-3月 -2014 09:40:59:703] nsmal: normal exit
[24-3月 -2014 09:40:59:703] nsiorini: entry
[24-3月 -2014 09:40:59:703] nsbal: entry
[24-3月 -2014 09:40:59:703] nsbgetfl: entry
[24-3月 -2014 09:40:59:703] nsbgetfl: normal exit
[24-3月 -2014 09:40:59:703] nsbal: normal exit
[24-3月 -2014 09:40:59:703] nsiorini: exit (0)
[24-3月 -2014 09:40:59:703] nscpxget: entry
[24-3月 -2014 09:40:59:703] nscpxget: normal exit
[24-3月 -2014 09:40:59:703] nsopenalloc_nsntx: nlhthput on mplx_ht_nsgbu:ctx=1bbf068, nsntx=1bbf350
[24-3月 -2014 09:40:59:704] nsopenmplx: normal exit
[24-3月 -2014 09:40:59:704] nsopen: opening transport...
[24-3月 -2014 09:40:59:704] nttcon: entry
[24-3月 -2014 09:40:59:704] nttcon: toc = 1
[24-3月 -2014 09:40:59:704] nttcnp: entry
[24-3月 -2014 09:40:59:704] nttcnp: creating a socket.
[24-3月 -2014 09:40:59:704] nttcnp: exit
[24-3月 -2014 09:40:59:704] nttcni: entry
[24-3月 -2014 09:40:59:704] nttcni: trying to connect to socket 288.
[24-3月 -2014 09:41:20:711] ntt2err: entry
[24-3月 -2014 09:41:20:711] ntt2err: soc 288 error - operation=1, ntresnt[0]=505, ntresnt[1]=60, ntresnt[2]=0 <====連接失敗並報錯
[24-3月 -2014 09:41:20:711] ntt2err: exit
[24-3月 -2014 09:41:20:811] nttcni: exit
[24-3月 -2014 09:41:20:811] nttcon: exit
[24-3月 -2014 09:41:20:811] nserror: entry
[24-3月 -2014 09:41:20:811] nserror: nsres: id=0, op=65, ns=12535, ns2=12560; nt[0]=505, nt[1]=60, nt[2]=0; ora[0]=0, ora[1]=0, ora[2]=0
[24-3月 -2014 09:41:20:811] nsopen: unable to open transport
[24-3月 -2014 09:41:20:811] nsiocancel: entry
[24-3月 -2014 09:41:20:811] nsiofrrg: entry
[24-3月 -2014 09:41:20:811] nsbfr: entry
[24-3月 -2014 09:41:20:811] nsbaddfl: entry
[24-3月 -2014 09:41:20:811] nsbaddfl: normal exit
[24-3月 -2014 09:41:20:811] nsbfr: normal exit
[24-3月 -2014 09:41:20:811] nsiofrrg: exit
[24-3月 -2014 09:41:20:811] nsiocancel: exit
[24-3月 -2014 09:41:20:811] nsopenfree_nsntx: nlhthdel from mplx_ht_nsgbu, ctx=1bbf068 nsntx=1bbf350
[24-3月 -2014 09:41:20:811] nsiocancel: entry
[24-3月 -2014 09:41:20:812] nsiofrrg: entry
[24-3月 -2014 09:41:20:812] nsiofrrg: exit
[24-3月 -2014 09:41:20:812] nsiocancel: exit
[24-3月 -2014 09:41:20:812] nsmfr: entry
[24-3月 -2014 09:41:20:812] nsmfr: 1644 bytes at 0x1bbf350
[24-3月 -2014 09:41:20:812] nsmfr: normal exit
[24-3月 -2014 09:41:20:812] nsmfr: entry
[24-3月 -2014 09:41:20:812] nsmfr: 736 bytes at 0x1bbf068
[24-3月 -2014 09:41:20:812] nsmfr: normal exit
[24-3月 -2014 09:41:20:812] nsopen: error exit
[24-3月 -2014 09:41:20:812] nsmfr: entry
[24-3月 -2014 09:41:20:812] nsmfr: 238 bytes at 0x1bcf1d8
[24-3月 -2014 09:41:20:812] nsmfr: normal exit
[24-3月 -2014 09:41:20:812] nsmfr: entry
[24-3月 -2014 09:41:20:812] nsmfr: 212 bytes at 0x1a2fce0
[24-3月 -2014 09:41:20:812] nsmfr: normal exit
[24-3月 -2014 09:41:20:812] nladtrm: entry
[24-3月 -2014 09:41:20:812] nladtrm: exit
[24-3月 -2014 09:41:20:812] nscall: error exit
[24-3月 -2014 09:41:20:813] nioqper: error from nscall
[24-3月 -2014 09:41:20:813] nioqper: ns main err code: 12535
[24-3月 -2014 09:41:20:813] nioqper: ns (2) err code: 12560
[24-3月 -2014 09:41:20:813] nioqper: nt main err code: 505
[24-3月 -2014 09:41:20:813] nioqper: nt (2) err code: 60
[24-3月 -2014 09:41:20:813] nioqper: nt OS err code: 0
[24-3月 -2014 09:41:20:813] niomapnserror: entry
[24-3月 -2014 09:41:20:813] niqme: entry
[24-3月 -2014 09:41:20:813] niqme: reporting NS-12535 error as ORA-12535
[24-3月 -2014 09:41:20:813] niqme: exit
[24-3月 -2014 09:41:20:813] niomapnserror: exit
[24-3月 -2014 09:41:20:813] niotns: Couldn't connect, returning 12170
大致明白了到NODE2的連接有問題,但前面步驟已經對NODE2進行了檢查,沒發現異常狀況,再看看日志,發現crsd.log持續不斷的報這種錯:
2014-03-22 15:19:58.318: [ CRSRES][1532291392] Attempting to start `ora.bssdb02.vip` on member `bssdb02`
2014-03-22 15:20:20.592: [ CRSAPP][1532291392] StartResource error for ora.bssdb02.vip error code = 1
2014-03-22 15:20:20.990: [ CRSRES][1532291392] Start of `ora.bssdb02.vip` on member `bssdb02` failed.
2014-03-22 15:25:21.452: [ CRSRES][1532291392] startRunnable: setting CLI values
/var/log/messages也是持續不斷報錯:
Mar 23 04:05:02 bssdb02 avahi-daemon[7483]: Registering new address record for 10.0.23.136 on eth2.
Mar 23 04:05:02 bssdb02 avahi-daemon[7483]: Withdrawing address record for 10.0.23.136 on eth2.
看來VIP有問題,但ifconfig顯示VIP是正常的。
[root@bssdb02 ~]# ifconfig
eth1 Link encap:Ethernet HWaddr A4:BA:DB:13:EA:C2
inet addr:192.168.127.102 Bcast:192.168.127.255 Mask:255.255.255.0
inet6 addr: fe80::a6ba:dbff:fe13:eac2/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:385015102 errors:0 dropped:0 overruns:0 frame:0
TX packets:447034056 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:231813004471 (215.8 GiB) TX bytes:322103900599 (299.9 GiB)
Interrupt:146 Memory:d8000000-d8012800
eth2 Link encap:Ethernet HWaddr A4:BA:DB:13:EA:C4
inet addr:10.0.23.132 Bcast:10.0.15.255 Mask:255.255.255.0
inet6 addr: fe80::a6ba:dbff:fe13:eac4/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:362482320 errors:0 dropped:0 overruns:0 frame:0
TX packets:322680748 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:60578409873 (56.4 GiB) TX bytes:322706932324 (300.5 GiB)
Interrupt:154 Memory:da000000-da012800
eth2:1 Link encap:Ethernet HWaddr A4:BA:DB:13:EA:C4
inet addr:10.0.23.136 Bcast:10.0.15.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Interrupt:154 Memory:da000000-da012800
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:107105620 errors:0 dropped:0 overruns:0 frame:0
TX packets:107105620 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:183491493683 (170.8 GiB) TX bytes:183491493683 (170.8 GiB)
[root@bssdb02 ~]#
既然客戶端連接NODE2有問題,那麼抓下包呢,用tcpdump配合客戶端的連接,但顯示到NODE2-VIP沒有包過來:
[root@bssdb02 ~]# tcpdump -i eth2 -Z oracle src host 10.25.192.23 and dst host 10.0.23.136 and port 1521
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth2, link-type EN10MB (Ethernet), capture size 96 bytes
0 packets captured
0 packets received by filter
0 packets dropped by kernel
修改下tcpdump設置,但從NODE2-PUBLICIP還是有網路包過來的,如下:
[root@bssdb02 ~]# tcpdump -i eth2 -Z oracle src host 10.25.192.23 and port 1521
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth2, link-type EN10MB (Ethernet), capture size 96 bytes
11:47:41.951599 IP bpit-10003657.sany.com.cn.63079 > bssdb02.ncube-lm: S 1037287069:1037287069(0) win 8192 <mss 1460,nop,wscale 2,nop,nop,sackOK>
11:47:42.451180 IP bpit-10003657.sany.com.cn.63079 > bssdb02.ncube-lm: S 1037287069:1037287069(0) win 8192 <mss 1460,nop,wscale 2,nop,nop,sackOK>
11:47:42.951154 IP bpit-10003657.sany.com.cn.63079 > bssdb02.ncube-lm: S 1037287069:1037287069(0) win 8192 <mss 1460,nop,nop,sackOK>
11:48:12.823245 IP bpit-10003657.sany.com.cn.63088 > bssdb02.ncube-lm: S 3262807939:3262807939(0) win 8192 <mss 1460,nop,wscale 2,nop,nop,sackOK>
11:48:12.823833 IP bpit-10003657.sany.com.cn.63088 > bssdb02.ncube-lm: . ack 2306706224 win 16425
11:48:20.065630 IP bpit-10003657.sany.com.cn.63088 > bssdb02.ncube-lm: . ack 2 win 16425
11:48:20.066305 IP bpit-10003657.sany.com.cn.63088 > bssdb02.ncube-lm: F 0:0(0) ack 2 win 16425
問題到這基本上可以知道原因了,繼而在客戶端telnet NODE2-VIP 1521端口不通,但telnet NODE2-PUBLICIP 1521沒問題。
於是將NODE2的VIP重啟(crs_start ora..vip),但重啟失敗。
找到網絡工程師,查了下NODE2-VIP,發現局域網內IP沖突導致這個問題。