Table of Contents
1. Symptom Description
2. Problem Diagnosis
3. Solution
4. Related Code
1. Symptom Description
External connections to port 9090 all time out, yet telnet to the port always succeeds. Watching with top, individual threads peak near 99.9% CPU, though the value fluctuates rather than staying there. After traffic was migrated away from the machine, requests could succeed again, but timeouts remained, with read timeouts outnumbering write timeouts:
# ./hbase_stress --hbase=110.13.136.207:9090 --test=2 --timeout=10
[2016-11-27 10:15:21/771][139756154767104/31562][ERROR][hbase_stress.cpp:302] TransportException(thrift://110.13.136.207:9090): EAGAIN (timed out)
[2016-11-27 10:15:31/775][139756154767104/31562][ERROR][hbase_stress.cpp:302] TransportException(thrift://110.13.136.207:9090): EAGAIN (timed out)
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
20727 zhangsan  20   0 10.843g 9.263g  26344 R  99.9 26.4   1448:00 java
20729 zhangsan  20   0 10.843g 9.263g  26344 R  99.9 26.4   1448:00 java
20730 zhangsan  20   0 10.843g 9.263g  26344 R  99.9 26.4   1449:10 java
20728 zhangsan  20   0 10.843g 9.263g  26344 R  99.8 26.4   1448:00 java
20693 zhangsan  20   0 10.843g 9.263g  26344 S   0.0 26.4   0:00.00 java
20727 zhangsan  20   0 10.843g 9.263g  26344 R  75.5 26.4   1448:06 java
20728 zhangsan  20   0 10.843g 9.263g  26344 R  75.2 26.4   1448:06 java
20729 zhangsan  20   0 10.843g 9.263g  26344 R  75.2 26.4   1448:06 java
20730 zhangsan  20   0 10.843g 9.263g  26344 R  75.2 26.4   1449:15 java
20716 zhangsan  20   0 10.843g 9.263g  26344 S  24.9 26.4  93:48.75 java
2. Problem Diagnosis
Using ps to list the threads consuming the most CPU agrees with what top shows:
$ ps -mp 20693 -o THREAD,tid,time | sort -rn
zhangsan 18.8  19    -      -   -  -  20730  1-00:11:23
zhangsan 18.7  19    -      -   -  -  20729  1-00:10:13
zhangsan 18.7  19    -      -   -  -  20728  1-00:10:13
zhangsan 18.7  19    -      -   -  -  20727  1-00:10:13
zhangsan 16.1  19    - futex_   -  -  20731    20:44:51
zhangsan  5.2  19    - futex_   -  -  20732    06:46:39
jstack then identifies these busy threads as GC threads:
"Gang worker#0 (Parallel CMS Threads)" os_prio=0 tid=0x00007fb7200d4000 nid=0x50f7 runnable
"Gang worker#1 (Parallel CMS Threads)" os_prio=0 tid=0x00007fb7200d5800 nid=0x50f8 runnable
"Gang worker#2 (Parallel CMS Threads)" os_prio=0 tid=0x00007fb7200d7800 nid=0x50f9 runnable
"Gang worker#3 (Parallel CMS Threads)" os_prio=0 tid=0x00007fb7200d9000 nid=0x50fa runnable
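The decimal thread IDs shown by top and ps match the hexadecimal nid values jstack prints; converting confirms that 20727 through 20730 are exactly the four CMS gang workers (nid=0x50f7 through 0x50fa). A minimal check:

```java
// Convert decimal thread IDs (from top/ps) to the hex "nid" form used by
// jstack, so OS threads can be matched to JVM threads.
public class NidMatch {
    public static String toNid(int tid) {
        return "0x" + Integer.toHexString(tid);
    }

    public static void main(String[] args) {
        for (int tid : new int[] {20727, 20728, 20729, 20730}) {
            System.out.println(tid + " -> " + toNid(tid)); // 20727 -> 0x50f7, ...
        }
    }
}
```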
jstat shows the GC situation is dire; the problem is indeed caused by GC:
$ jstat -gcutil 20693 1000 100
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC      FGCT       GCT
  0.00  99.67 100.00 100.00  98.08  94.41  42199  369.132  27084  34869.601  35238.733
  0.00  99.67 100.00 100.00  98.08  94.41  42199  369.132  27084  34870.448  35239.580
  0.00  99.67 100.00 100.00  98.08  94.41  42199  369.132  27084  34870.448  35239.580
  0.00  99.67 100.00 100.00  98.08  94.41  42199  369.132  27084  34870.448  35239.580

$ jstat -gccapacity 20693
 NGCMN     NGCMX      NGC      S0C      S1C       EC      OGCMN     OGCMX       OGC        OC      MCMN    MCMX      MC    CCSMN    CCSMX     CCSC    YGC    FGC
191808.0 1107520.0 1107520.0 110720.0 110720.0 886080.0 383680.0 8094144.0 8094144.0 8094144.0    0.0 1077248.0 31584.0    0.0 1048576.0  3424.0  42199  27156

$ jstat -gcold 20693
   MC       MU      CCSC     CCSU       OC         OU      YGC    FGC      FGCT       GCT
 31584.0  30978.7  3424.0   3232.7  8094144.0 8094144.0  42199  27174  34964.347  35333.479

$ jstat -gcoldcapacity 20693
 OGCMN      OGCMX       OGC        OC       YGC    FGC      FGCT       GCT
383680.0  8094144.0 8094144.0 8094144.0  42199  27192  34982.623  35351.755

$ jstat -gcnewcapacity 20693
 NGCMN     NGCMX      NGC      S0CMX     S0C      S1CMX     S1C      ECMX       EC      YGC    FGC
191808.0 1107520.0 1107520.0 110720.0 110720.0 110720.0 110720.0 886080.0  886080.0  42199  27202

$ jstat -gc 20693
  S0C      S1C      S0U      S1U       EC        EU        OC        OU       MC      MU     CCSC    CCSU    YGC     YGCT    FGC      FGCT       GCT
110720.0 110720.0   0.0  110395.9  886080.0  886080.0 8094144.0 8094144.0 31584.0 30978.7  3424.0  3232.7  42199  369.132  27206  34996.538  35365.671

$ jstat -gcnew 20693
  S0C      S1C      S0U      S1U    TT MTT    DSS       EC        EU      YGC     YGCT
110720.0 110720.0   0.0  110396.9   6   6  55360.0  886080.0  886080.0  42199  369.132
lsof shows the process holds only a modest number of connections, well within safe limits, so the likely problem is objects that cannot be collected. jmap gives the memory details; first, heap usage:
$ jmap -heap 20693
Attaching to process ID 20693, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.77-b03

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 9422503936 (8986.0MB)
   NewSize                  = 196411392 (187.3125MB)
   MaxNewSize               = 1134100480 (1081.5625MB)
   OldSize                  = 392888320 (374.6875MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 1020723200 (973.4375MB)
   used     = 1020398064 (973.1274261474609MB)
   free     = 325136 (0.3100738525390625MB)
   99.96814650632022% used
Eden Space:
   capacity = 907345920 (865.3125MB)
   used     = 907345920 (865.3125MB)
   free     = 0 (0.0MB)
   100.0% used
From Space:
   capacity = 113377280 (108.125MB)
   used     = 113052144 (107.81492614746094MB)
   free     = 325136 (0.3100738525390625MB)
   99.71322649476156% used
To Space:
   capacity = 113377280 (108.125MB)
   used     = 0 (0.0MB)
   free     = 113377280 (108.125MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 8288403456 (7904.4375MB)
   used     = 8288403424 (7904.437469482422MB)
   free     = 32 (3.0517578125E-5MB)
   99.9999996139184% used

10216 interned Strings occupying 934640 bytes.
Drilling further into the objects with a histogram:
$ jmap -histo 20693

 num     #instances         #bytes  class name
----------------------------------------------
   1:      72835212     2518411456  [B
   2:      49827147     1993085880  java.util.TreeMap$Entry
   3:      12855993      617087664  java.util.TreeMap
   4:       4285217      445662568  org.apache.hadoop.hbase.client.ClientScanner
   5:       4285222      377099536  org.apache.hadoop.hbase.client.Scan
   6:       4284875      377069000  org.apache.hadoop.hbase.client.ScannerCallable
   7:       4285528      342921344  [Ljava.util.HashMap$Node;
   8:       4284880      308511360  org.apache.hadoop.hbase.client.ScannerCallableWithReplicas
   9:       8570671      274261472  java.util.LinkedList
  10:       4285579      205707792  java.util.HashMap
  11:       4285283      205693584  org.apache.hadoop.hbase.client.RpcRetryingCaller
  12:       3820914      152836560  org.apache.hadoop.hbase.filter.SingleColumnValueFilter
  13:       4291904      137340928  java.util.concurrent.ConcurrentHashMap$Node
  14:       8570636      137130176  java.util.TreeMap$EntrySet
  15:       4285278      137128896  org.apache.hadoop.hbase.io.TimeRange
  16:       8570479      137127664  java.util.concurrent.atomic.AtomicBoolean
  17:       2891409       92525088  org.apache.hadoop.hbase.NoTagsKeyValue
  18:       4286540       68584640  java.lang.Integer
  19:       4285298       68564768  java.util.TreeMap$KeySet
  20:       4285275       68564400  java.util.TreeSet
  21:       4285006       68560096  java.util.HashSet
  22:       4284851       68557616  java.util.HashMap$KeySet
  23:       3176118       50817888  org.apache.hadoop.hbase.filter.BinaryComparator
  24:           109       33607600  [Ljava.util.concurrent.ConcurrentHashMap$Node;
  25:        418775       18479112  [Lorg.apache.hadoop.hbase.Cell;
  26:        671443       17693224  [C
  27:        418781       16751240  org.apache.hadoop.hbase.client.Result
  28:        669739       16073736  java.lang.String
  29:        644796       15475104  org.apache.hadoop.hbase.filter.SubstringComparator
  30:        419134       10059216  java.util.LinkedList$Node
To keep the system serviceable, a stopgap (treating the symptom, not the cause) was deployed first: monitor GC and restart the HBase Thrift2 process on a schedule, while the root cause was hunted down.
From the jmap output above, the suspicion is that scanners are not being closed. That can happen in two ways: the client program has a bug and simply never closes them (a leak), or the client hits an exception and never gets the chance to close them. Inspecting the client source confirms that when openScanner throws, the scanner is indeed not closed. Beyond that, a client that is killed or loses power can never release its scanners either, and that case has to be handled by HBase Thrift2 itself.
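The client-side fix for the exception path is the standard try-with-resources pattern (HBase's ResultScanner implements Closeable, so it qualifies). A minimal sketch with a stand-in scanner type, not the real HBase API, showing that the scanner is closed even when the scan throws:

```java
import java.io.Closeable;

// StubScanner is an illustrative stand-in for a Closeable scanner; it
// records whether close() was called so the guarantee can be observed.
public class ScannerCloseDemo {
    static class StubScanner implements Closeable {
        boolean closed = false;
        void next() { throw new RuntimeException("scan failed"); }
        @Override public void close() { closed = true; }
    }

    // try-with-resources closes the scanner before the catch block runs,
    // so even a failing scan cannot leak it.
    public static boolean scanOnce(StubScanner scanner) {
        try (StubScanner s = scanner) {
            s.next();
            return true;
        } catch (RuntimeException e) {
            return false; // failure handled; scanner already closed
        }
    }

    public static void main(String[] args) {
        StubScanner s = new StubScanner();
        boolean ok = scanOnce(s);
        System.out.println("ok=" + ok + " closed=" + s.closed); // ok=false closed=true
    }
}
```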
3. Solution
Address each of the problems identified above:
1) the client guarantees that every scanner it opens is released;
2) HBase Thrift2 automatically releases scanners that have gone unused for too long;
3) alternatively, use getScannerResults instead of getScannerRows to sidestep the problem.
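Fix 2) amounts to tracking a last-access timestamp per scanner and periodically evicting stale entries. The sketch below is illustrative only (names and structure are assumptions, not the actual HBASE-17182 patch), with a String standing in for ResultScanner:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Registry that remembers when each scanner was last used and evicts
// entries idle longer than a threshold; evictIdle() would be driven by a
// scheduled thread in a real server.
public class IdleScannerEviction {
    static class Entry {
        final String scanner;          // stands in for ResultScanner
        volatile long lastAccessMs;
        Entry(String scanner, long now) { this.scanner = scanner; this.lastAccessMs = now; }
    }

    private final Map<Integer, Entry> scanners = new ConcurrentHashMap<>();

    void register(int id, String scanner, long nowMs) {
        scanners.put(id, new Entry(scanner, nowMs));
    }

    void touch(int id, long nowMs) {   // call on every getScannerRows
        Entry e = scanners.get(id);
        if (e != null) e.lastAccessMs = nowMs;
    }

    int evictIdle(long nowMs, long idleLimitMs) {
        int evicted = 0;
        for (Iterator<Map.Entry<Integer, Entry>> it = scanners.entrySet().iterator(); it.hasNext();) {
            if (nowMs - it.next().getValue().lastAccessMs > idleLimitMs) {
                // a real implementation would call scanner.close() here
                it.remove();
                evicted++;
            }
        }
        return evicted;
    }

    int size() { return scanners.size(); }
}
```

A scanner abandoned by a dead client simply stops being touched, so the next eviction pass reclaims it.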
Patch:
https://issues.apache.org/jira/browse/HBASE-17182
4. Related Code

private final Map<Integer, ResultScanner> scannerMap
    = new ConcurrentHashMap<Integer, ResultScanner>();

@Override
public int openScanner(ByteBuffer table, TScan scan) throws TIOError, TException {
  Table htable = getTable(table);
  ResultScanner resultScanner = null;
  try {
    resultScanner = htable.getScanner(scanFromThrift(scan));
  } catch (IOException e) {
    throw getTIOError(e);
  } finally {
    closeTable(htable);
  }
  // Put the scanner into scannerMap. If the client never calls closeScanner,
  // the scanner leaks and GC cannot reclaim that memory.
  return addScanner(resultScanner);
}

/**
 * Assigns a unique ID to the scanner and adds the mapping to an internal HashMap.
 * @param scanner to add
 * @return Id for this Scanner
 */
private int addScanner(ResultScanner scanner) {
  int id = nextScannerId.getAndIncrement();
  scannerMap.put(id, scanner); // register the scanner in scannerMap
  return id;
}

/**
 * Returns the Scanner associated with the specified Id.
 * @param id of the Scanner to get
 * @return a Scanner, or null if the Id is invalid
 */
private ResultScanner getScanner(int id) {
  return scannerMap.get(id);
}

@Override
public void closeScanner(int scannerId) throws TIOError, TIllegalArgument, TException {
  LOG.debug("scannerClose: id=" + scannerId);
  ResultScanner scanner = getScanner(scannerId);
  if (scanner == null) {
    String message = "scanner ID is invalid";
    LOG.warn(message);
    TIllegalArgument ex = new TIllegalArgument();
    ex.setMessage("Invalid scanner Id");
    throw ex;
  }
  scanner.close();          // close the scanner
  removeScanner(scannerId); // drop it from scannerMap
}

/**
 * Removes the scanner associated with the specified ID from the internal HashMap.
 * @param id of the Scanner to remove
 * @return the removed Scanner, or <code>null</code> if the Id is invalid
 */
protected ResultScanner removeScanner(int id) {
  return scannerMap.remove(id); // remove the scanner from scannerMap
}