During the Spring Festival, two php-cgi core dumps occurred in production. The investigation went as follows:
I. Core information
file core.xxx
bug.php-cgi.3611.1296586902: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'php-cgi'
gdb ~/php5/bin/php-cgi core.xxx
Core was generated by `~/php5/bin/php-cgi --fpm --fpm-config ~/php5/etc/php-fpm.co'.
Program terminated with signal 4, Illegal instruction.
(gdb) bt
#0 0x0000000001000707 in ?? ()
#1 0x00000000006b1402 in zend_hash_destroy (ht=0x7fbffff4f8)
at ~/self/xxx/soft/source/src/php/php-5.2.8/Zend/zend_hash.c:526
#2 0x0000000000732b2e in fcgi_close (req=0x7fbfffd4c0, force=0, destroy=Variable "destroy" is not available.
)
at ~/self/xxx/soft/source/src/php/php-5.2.8/sapi/cgi/fastcgi.c:894
#3 0x0000000000732d24 in fcgi_finish_request (req=0x7fbfffd4c0)
at ~/self/xxx/soft/source/src/php/php-5.2.8/sapi/cgi/fastcgi.c:1248
#4 0x0000000000732d49 in fcgi_accept_request (req=0x7fbfffd4c0)
at ~/self/xxx/soft/source/src/php/php-5.2.8/sapi/cgi/fastcgi.c:944
#5 0x00000000007352b8 in main (argc=4, argv=0x7fbffff698)
at ~/self/xxx/soft/source/src/php/php-5.2.8/sapi/cgi/cgi_main.c:2224
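The backtrace shows that the crash happened while php-fpm was accepting a new request: it dumped core while releasing the resources of the previous request (which may have terminated abnormally). In production, PHP is served through Apache + FastCGI + PHP. Walking the stack frame by frame: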
1) f 0
This frame has been overwritten; it holds no useful information.
2) f 1
Print the argument of zend_hash_destroy:
(gdb) p *ht
$5 = {nTableSize = 16779009, nTableMask = 0, nNumOfElements = 16779009, nNextFreeElement = 16779009,
pInternalPointer = 0x1000701, pListHead = 0x1000701, pListTail = 0x1000701, arBuckets = 0x1000701,
pDestructor = 0x1000701, persistent = 1 '\001', nApplyCount = 7 '\a', bApplyProtection = 0 '\0'}
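Descriptions of PHP's HashTable data structure are easy to find online. This hashtable has been corrupted: every pointer field holds 0x1000701, an address that gdb cannot access. Still nothing useful here.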
3) f 2
Check the source and print the argument of fcgi_close:
(gdb) p *req
$6 = {listen_socket = 0, fd = 11, id = 1, keep = 0, in_len = 0, in_pad = 0, out_hdr = 0x0,
out_pos = 0x7fbffffcf8 "\001\003",
out_buf = "\001\a\000\001\037鳿000\000PHP Warning: simplexml_load_string() [<a href='function.simplexml-load-string'>function.simplexml-load-string</a>]: Entity: line 1: parser error : Start tag expected, '<' not found in /hom"..., reserved = "\001\a\000\001\000\000\000\000\001\a\000\001\000\000\000", env = {nTableSize = 16779009,
nTableMask = 0, nNumOfElements = 16779009, nNextFreeElement = 16779009, pInternalPointer = 0x1000701,
pListHead = 0x1000701, pListTail = 0x1000701, arBuckets = 0x1000701, pDestructor = 0x1000701,
persistent = 1 '\001', nApplyCount = 7 '\a', bApplyProtection = 0 '\0'}}
(gdb) ptype req
type = struct _fcgi_request {
int listen_socket;
int fd;
int id;
int keep;
int in_len;
int in_pad;
fcgi_header *out_hdr;
unsigned char *out_pos;
unsigned char out_buf[8192];
unsigned char reserved[16];
HashTable env;
} *
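The object destroyed by the call zend_hash_destroy(&req->env) is env, a member of req. This member is a hashtable that had already been corrupted by the previous request, so the process dumped core when the new request tried to release the previous one.
req->out_buf is the memory buffer through which php-cgi and Apache communicate. A quick look shows that out_buf is currently filled entirely with the simplexml_load_string() PHP Warning. Error messages like this appear in out_buf because PHP has to send them over the FastCGI protocol so that they end up in Apache's error_log. The req->out_pos pointer points to the current end of the data in the buffer.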
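The repeated value 0x1000701 (16779009 in decimal) in env is worth a second look. The FastCGI specification defines an 8-byte record header: version, type, a big-endian request id, a big-endian content length, padding length and a reserved byte, with version 1 and type 7 (FCGI_STDERR) for the error stream. The sketch below builds such a header for an empty STDERR record with request id 1 (the id shown in the dump) and reads it back as a little-endian machine word. The helper names are made up for illustration, and reading the corruption pattern as leftover STDERR headers is an inference from the dump, not something the analysis above states.

```c
#include <assert.h>
#include <stdint.h>

#define FCGI_VERSION_1 1   /* protocol version, per the FastCGI spec */
#define FCGI_STDERR    7   /* record type of the error stream */

/* Fill an 8-byte FastCGI record header (hypothetical helper):
 * version, type, requestId (big-endian), contentLength (big-endian),
 * paddingLength, reserved. */
static void make_header(unsigned char hdr[8], int type, int request_id,
                        int content_len, int padding_len)
{
    assert(content_len >= 0 && content_len <= 0xFFFF);
    hdr[0] = FCGI_VERSION_1;
    hdr[1] = (unsigned char)type;
    hdr[2] = (unsigned char)((request_id >> 8) & 0xFF);
    hdr[3] = (unsigned char)(request_id & 0xFF);
    hdr[4] = (unsigned char)((content_len >> 8) & 0xFF);
    hdr[5] = (unsigned char)(content_len & 0xFF);
    hdr[6] = (unsigned char)padding_len;
    hdr[7] = 0;
}

/* Interpret 8 bytes as a little-endian word, the way gdb on x86-64
 * displays a pointer field that was overwritten with these bytes. */
static uint64_t as_little_endian_word(const unsigned char b[8])
{
    uint64_t v = 0;
    int i;
    for (i = 7; i >= 0; i--)
        v = (v << 8) | b[i];
    return v;
}

/* Word left behind by an empty STDERR header for the given request id. */
static uint64_t empty_stderr_header_word(int request_id)
{
    unsigned char hdr[8];
    make_header(hdr, FCGI_STDERR, request_id, 0, 0);
    return as_little_endian_word(hdr);
}
```

Calling empty_stderr_header_word(1) gives 0x1000701, the value seen in pListHead, arBuckets and pDestructor. The same bytes (\001\a\000\001 followed by zeros) appear twice in the reserved field, and out_buf itself starts with \001\a\000\001. This suggests, without proving it, that the memory after out_buf was overwritten by STDERR record headers.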
(gdb) p req->out_pos - req->out_buf
$2 = 8312
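The end position of the buffer is already past the declared size of 8192 (8312 - 8192 = 120 bytes beyond the end), so we can conclude that the env member that follows it was corrupted while out_buf was being written. PHP has an important global variable, sapi_globals. Reading the PHP source shows that the sapi_globals data for a new request is filled in by init_request_info, which runs after fcgi_accept_request completes, so the sapi_globals currently in memory still holds the leftover data of the previous request: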
(gdb) p sapi_globals
The data identifies the culprit behind the core: the URL of one particular feature in production.
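Returning to the out_pos arithmetic from frame 2, the following is a minimal sketch of how far an offset of 8312 reaches past out_buf. It uses a mock struct that copies the field order printed by ptype req; it is not the real PHP header. The env member is replaced by a pointer-aligned stand-in, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Mock of the fields printed by `ptype req` (assumption: same order and
 * sizes as the real struct on this build). */
struct mock_fcgi_request {
    int listen_socket;
    int fd;
    int id;
    int keep;
    int in_len;
    int in_pad;
    void *out_hdr;               /* fcgi_header * in the real struct */
    unsigned char *out_pos;
    unsigned char out_buf[8192];
    unsigned char reserved[16];
    void *env_stand_in[1];       /* placeholder for HashTable env */
};

/* Bytes beyond the end of out_buf for a given out_pos offset (0 if in bounds). */
static long bytes_past_out_buf(long pos_offset)
{
    long size = (long)sizeof(((struct mock_fcgi_request *)0)->out_buf);
    assert(pos_offset >= 0);
    return pos_offset > size ? pos_offset - size : 0;
}

/* How far the same offset reaches into env (0 if it stops before env). */
static long bytes_into_env(long pos_offset)
{
    long env_start = (long)(offsetof(struct mock_fcgi_request, env_stand_in)
                          - offsetof(struct mock_fcgi_request, out_buf));
    assert(pos_offset >= 0);
    return pos_offset > env_start ? pos_offset - env_start : 0;
}

/* Print both figures for an offset such as the 8312 reported by gdb. */
static void report_overflow(long pos_offset)
{
    printf("past out_buf: %ld bytes, into env: %ld bytes\n",
           bytes_past_out_buf(pos_offset), bytes_into_env(pos_offset));
}
```

With the value from gdb, report_overflow(8312) reports 120 bytes past out_buf. The first 16 of those cover reserved, so the write position ends up 104 bytes past the start of env in this mock layout.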
II. FastCGI source analysis
(1) Source locations
fastcgi source: php5/sapi/cgi/fastcgi.c
cgi_main source: php5/sapi/cgi/cgi_main.c
(2) Struct overview
First, take a look at the fcgi_request struct:
typedef struct _fcgi_request {
int listen_socket;
#ifdef _WIN32
int tcp;
#endif
int fd;
int id;
int keep;
int in_len;
int in_pad;
fcgi_header *out_hdr;
unsigned char *out_pos;
unsigned char out_buf[1024*8];