Hash鏈表的應用比較常見,其目的就是為了將不同的值映射到不同的位置,查找的時候直接找到相應的位置,而不需要傳統的順序遍歷或是二分查找,從而達到減少查詢時間的目的。常規的hash是預定義一定的桶(bucket),規定一個hash函數,然後進行散列。然而Mysql中的hash沒有固定的bucket,hash函數也是動態變化的,本文就進行非深入介紹。
基本結構體
Hash的結構體定義以及相關的函數接口定義在include/hash.h和mysys/hash.c兩個文件中。下面是HASH結構體的定義。
typedef struct st_hash {
size_t key_offset,key_length; /* Length of key if const length */
size_t blength;
ulong records;
uint flags;
DYNAMIC_ARRAY array; /* Place for hash_keys */
my_hash_get_key get_key;
void (*free)(void *);
CHARSET_INFO *charset;
} HASH;
成員名 說明 key_offset hash時key的offset,在不指定hash函數的情況下有意義 key_length key的長度,用於計算key值 blength 非常重要的輔助結構,初始為1,動態變化,用於hash函數計算,這裡理解為bucket length(其實不是真實的bucket數) records 實際的記錄數 flags 是否允許存在相同的元素,取值為HASH_UNIQUE(1)或者0 array 存儲元素的數組 get_key 用戶定義的hash函數,可以為NULL free 析構函數,可以為NULL charset 字符集
<font size="4"> </font><font size="3">HASH結構體裡面包含了一個動態數組結構體DYNAMIC_ARRAY,這裡就一並介紹了。其定義在<em><strong>include/my_sys.h</strong></em>中。</font>
typedef struct st_dynamic_array
{
uchar *buffer;
uint elements,max_element;
uint alloc_increment;
uint size_of_element;
} DYNAMIC_ARRAY;
成員名 說明 buffer 一塊連續的地址空間,用於存儲數據,可以看成一個數組空間 elements 元素個數 max_element 元素個數上限 alloc_increment 當元素達到上限時,即buffer滿時,按照alloc_increment進行擴展 size_of_element 每個元素的長度
初始化函數
Hash初始化函數對外提供兩個,my_hash_init和my_hash_init2,其區別即是否定義了growth_size(用於設置DYNAMIC_ARRAY的alloc_increment)。代碼在mysys/hash.c中。
#define my_hash_init(A,B,C,D,E,F,G,H) \
_my_hash_init(A,0,B,C,D,E,F,G,H)
#define my_hash_init2(A,B,C,D,E,F,G,H,I) \
_my_hash_init(A,B,C,D,E,F,G,H,I)
/**
@brief Initialize the hash
@details
Initialize the hash, by defining and giving valid values for
its elements. The failure to allocate memory for the
hash->array element will not result in a fatal failure. The
dynamic array that is part of the hash will allocate memory
as required during insertion.
@param[in,out] hash The hash that is initialized
@param[in] charset The charater set information
@param[in] size The hash size
@param[in] key_offest The key offset for the hash
@param[in] key_length The length of the key used in
the hash
@param[in] get_key get the key for the hash
@param[in] free_element pointer to the function that
does cleanup
@return inidicates success or failure of initialization
@retval 0 success
@retval 1 failure
*/
my_bool
_my_hash_init(HASH *hash, uint growth_size, CHARSET_INFO *charset,
ulong size, size_t key_offset, size_t key_length,
my_hash_get_key get_key,
void (*free_element)(void*), uint flags)
{
DBUG_ENTER("my_hash_init");
DBUG_PRINT("enter",("hash: 0x%lx size: %u", (long) hash, (uint) size));
hash->records=0;
hash->key_offset=key_offset;
hash->key_length=key_length;
hash->blength=1;
hash->get_key=get_key;
hash->free=free_element;
hash->flags=flags;
hash->charset=charset;
DBUG_RETURN(my_init_dynamic_array_ci(&hash->array,
sizeof(HASH_LINK), size, growth_size));
}
<font size="3"> 可以看到,_my_hash_init函數主要是初始化HASH結構體和hash->array(DYNAMIC_ARRAY結構體)。</font>
動態HASH函數
<font size="3"> 我們首先來看下hash函數的定義:</font>
static inline char*
my_hash_key(const HASH *hash, const uchar *record, size_t *length,
my_bool first)
{
if (hash->get_key)
return (char*) (*hash->get_key)(record,length,first);
*length=hash->key_length;
return (char*) record+hash->key_offset;
}
static uint my_hash_mask(my_hash_value_type hashnr, size_t buffmax,
size_t maxlength)
{
if ((hashnr & (buffmax-1)) < maxlength) return (hashnr & (buffmax-1));
return (hashnr & ((buffmax >> 1) -1));
}
my_hash_key參數 說明 hash HASH鏈表結構 record 帶插入的元素的值 length 帶插入元素的值長度 first 輔助參數
<font size=
"3"
> </font>
my_hash_mask參數 說明 hashnr my_hash_key的計算結果 buffmax hash結構體中的blength maxlength 實際桶的個數
你可能要問我怎麼有兩個?其實這和我們平時使用的差不多,第一個函數my_hash_key是根據我們的值進行Hash Key計算,一般我們在計算後,會對hash key進行一次模運算,以便計算結果在我們的bucket中。即my_hash_key的結果作為my_hash_mask的第一個輸入參數。其實到這裡都是非常好理解的,唯一讓我蛋疼的是my_hash_mask的實現,其計算結果是和第二和第三個參數有關,即Hash結構體中的blength和records有關。動態變化的,我去..
看到這裡我迷惑了,我上網經過各種百度,谷歌,終於讓我找到了一封Mysql Expert的回信:
Hi!
"Yan" == Yan Yu <[email protected]> writes:
Yan> Dear MySQL experts:
Yan> Thank you so much for your reply to my previous Qs, they are very
Yan> helpful!
Yan> Could someone please help me understand function my_hash_insert()
Yan> in mysys/hash.cc?
Yan> what are lines 352 -429 trying to achieve? Are they just some
Yan> optimization to shuffle existing
Yan> hash entries in the table (since the existing hash entries may be in
Yan> the wrong slot due to chaining
Yan> in the case of hash collision)?
<strong><font color="#ff0000">The hash algorithm is based on dynamic hashing without empty slots.</font></strong>
This means that when you insert a new key, in some cases a small set
of old keys needs to be moved to other buckets. This is what the code
does.
Regards,
Monty
紅色注釋的地方是重點,dynamic hash,原來如此,動態hash,第一次聽說,在網上下了個《Dynamic Hash Tables》的論文,下面圖解下基本原理。
<a href=http://up.2cto.com/2011/1214/20111214021339214.png"><img title="image" style="border-top-width: 0px; display: block; border-left-width: 0px; float: none; border-bottom-width: 0px; margin-left: auto; margin-right: auto; border-right-width: 0px" height="910" alt="image" src=http://www.bkjia.com/uploads/allimg/131125/053A0KM-0.png" width="570" border="0"></a>
<font size="3"></font>
動態Hash的本質是Hash函數的設計,圖中給出的動態hash函數只是論文中提到的一個例子。下面就具體解讀下Mysql中的hash插入——my_hash_insert
my_hash_insert非深入解析
首先給出my_hash_insert的源代碼,代碼在mysys/hash.c中。
my_bool my_hash_insert(HASH *info, const uchar *record)
{
int flag;
size_t idx,halfbuff,first_index;
my_hash_value_type hash_nr;
uchar *UNINIT_VAR(ptr_to_rec),*UNINIT_VAR(ptr_to_rec2);
HASH_LINK *data,*empty,*UNINIT_VAR(gpos),*UNINIT_VAR(gpos2),*pos;
if (HASH_UNIQUE & info->flags)
{
uchar *key= (uchar*) my_hash_key(info, record, &idx, 1);
if (my_hash_search(info, key, idx))
return(TRUE); /* Duplicate entry */
}
flag=0;
if (!(empty=(HASH_LINK*) alloc_dynamic(&info->array)))
return(TRUE); /* No more memory */
data=dynamic_element(&info->array,0,HASH_LINK*);
halfbuff= info->blength >> 1;
idx=first_index=info->records-halfbuff;
if (idx != info->records) /* If some records */
{
do
{
pos=data+idx;
hash_nr=rec_hashnr(info,pos->data);
if (flag == 0) /* First loop; Check if ok */
if (my_hash_mask(hash_nr, info->blength, info->records) != first_index)
break;
if (!(hash_nr & halfbuff))
{ /* Key will not move */
if (!(flag & LOWFIND))
{
if (flag & HIGHFIND)
{
flag=LOWFIND | HIGHFIND;
/* key shall be moved to the current empty position */
gpos=empty;
ptr_to_rec=pos->data;
empty=pos; /* This place is now free */
}
else
{
flag=LOWFIND | LOWUSED; /* key isn't changed */
gpos=pos;
ptr_to_rec=pos->data;
}
}
else
{
if (!(flag & LOWUSED))
{
/* Change link of previous LOW-key */
gpos->data=ptr_to_rec;
gpos->next= (uint) (pos-data);
flag= (flag & HIGHFIND) | (LOWFIND | LOWUSED);
}
gpos=pos;
ptr_to_rec=pos->data;
}
}
else
{ /* key will be moved */
if (!(flag & HIGHFIND))
{
flag= (flag & LOWFIND) | HIGHFIND;
/* key shall be moved to the last (empty) position */
gpos2 = empty; empty=pos;
ptr_to_rec2=pos->data;
}
else
{
if (!(flag & HIGHUSED))
{
/* Change link of previous hash-key and save */
gpos2->data=ptr_to_rec2;
gpos2->next=(uint) (pos-data);
flag= (flag & LOWFIND) | (HIGHFIND | HIGHUSED);
}
gpos2=pos;
ptr_to_rec2=pos->data;
}
}
}
while ((idx=pos->next) != NO_RECORD);
if ((flag & (LOWFIND | LOWUSED)) == LOWFIND)
{
gpos->data=ptr_to_rec;
gpos->next=NO_RECORD;
}
if ((flag & (HIGHFIND | HIGHUSED)) == HIGHFIND)
{
gpos2->data=ptr_to_rec2;
gpos2->next=NO_RECORD;
}
}
/* Check if we are at the empty position */
idx= my_hash_mask(rec_hashnr(info, record), info->blength, info->records + 1);
pos=data+idx;
if (pos == empty)
{
pos->data=(uchar*) record;
pos->next=NO_RECORD;
}
else
{
/* Check if more records in same hash-nr family */
empty[0]=pos[0];
gpos= data + my_hash_rec_mask(info, pos, info->blength, info->records + 1);
if (pos == gpos)
{
pos->data=(uchar*) record;
pos->next=(uint) (empty - data);
}
else
{
pos->data=(uchar*) record;
pos->next=NO_RECORD;
movelink(data,(uint) (pos-data),(uint) (gpos-data),(uint) (empty-data));
}
}
if (++info->records == info->blength)
info->blength+= info->blength;
return(0);
}
同時給出動態hash函數如下:
static uint my_hash_mask(my_hash_value_type hashnr, size_t buffmax,
size_t maxlength)
{
if ((hashnr & (buffmax-1)) < maxlength) return (hashnr & (buffmax-1));
return (hashnr & ((buffmax >> 1) -1));
}
<font size="3"> </font>
可以看出,hash函數是hash key與buffmax的模運算,buffmax即HASH結構中的blength,由my_hash_insert中最後幾行代碼可知:info->blength+= info->blength; 其初始值為1,即blength = 2^n,而且blengh始終是大於records。這個動態hash函數的基本意思是key%(2^n)。依然用圖解這個動態hash函數。
hash函數基本清楚了,但是mysql的具體實現還是比較值得探討的。那封回信中也提到了without empty slots,是的,它這種實現方式是根據實際的數據量進行桶數的分配。我這裡大概說下代碼的流程(有興趣的還需要大家自己仔細琢磨)。
摘自 心中無碼