程式師世界 >> 數據庫知識 >> MYSQL數據庫 >> MySQL綜合教程 >> MYSQL入門學習之三：全文本搜索

MYSQL入門學習之三：全文本搜索

編輯：MySQL綜合教程

MYSQL入門學習之三：全文本搜索一、理解全文本搜索 www.2cto.com 1、MyISAM支持全文本搜索，而InnoDB不支持。 2、在使用全文本搜索時，MySQL不需要分別查看每個行，不需要分別分析和處理每個詞。MySQL創建指定列中各詞的一個索引，搜索可以針對這些詞進行。這樣MySQL可以快速有效地決定哪些詞匹配，哪些詞不匹配，它們匹配的頻率，等等。二、使用全文本搜索 1、為了進行全文本搜索，必須索引被搜索的列，而且要隨著數據的改變不斷地重新索引。在對表列進行適當設計後，MySQL會自動進行所有的索引和重新索引。在索引之後，SELECT可與Match()和Against()一起使用以實際執行搜索。 2、一般在創建表時啟用全文本搜索。 [sql] create table productnotes ( note_id int not nullauto_increment, note_text text null, primary key(note_id), fulltext(note_text) )engine=MyISAM; 在定義之後，MySQL自動維護該索引。在增加、更新或刪除行時，索引隨之自動更新。 3、不要在導入數據時使用FULLTEXT。 www.2cto.com 4、進行全文本搜索 Match()指定被搜索的列，Against()指定要使用的搜索表達式。 [sql] mysql> select * from productnotes -> whereMatch(note_text) Against('designed'); +---------+--------------------------------------------------------------------- ------------------------------------------------------+ | note_id | note_text | +---------+--------------------------------------------------------------------- ------------------------------------------------------+ | 6 | LimsLink isdesigned to interface output from chromatography data sy stems (CDSs) to LIMS. | | 5 | This line ofproprietary reagents, containers, and automation tools is designed for genomics and drug discovery research. | +---------+--------------------------------------------------------------------- ------------------------------------------------------+ 2 rows in set (0.03 sec) 5、傳遞給Match()的值必須與FULLTEXT()定義中的相同。如果指定多個列，則必須列出它們（而且次序正確）。 6、除非使用BINARY方式，否則全文本搜索不區分大小寫。 [sql] mysql> select * from productnotes -> where BINARYMatch(note_text) Against('line'); +---------+--------------------------------------------------------------------- ------------------------------------------------------+ | note_id | note_text | +---------+--------------------------------------------------------------------- ------------------------------------------------------+ | 5 | This line ofproprietary reagents, containers, and automation tools is designed for genomics and drug discovery research. | +---------+--------------------------------------------------------------------- ------------------------------------------------------+ 1 row in set (0.05 sec) 7、全文本搜索的一個重要部分就是對結果排序。具有較高等級的行先返回。等級由MySQL根據行中詞的數目、唯一詞的數目、整個索引中詞的總數以及包含該詞的行的數目計算出來。文本中詞先前的行的等級值比詞靠後的行的等級值高。 [sql] mysql> select note_id, Match(note_text) Against('This line')as rank,note_text -> fromproductnotes -> whereMatch(note_text) Against('This line'); +---------+------------------+-------------------------------------------------- ----------------------------------------------------------------------------+ | note_id | rank | note_text | +---------+------------------+-------------------------------------------------- ----------------------------------------------------------------------------+ | 5 |0.81339610830754 | This line of proprietary reagents,. containers, a nd automation tools is designed. for genomics and drugdiscovery .research. | | 7 |0.76517958501676 | specificities include both alpha–beta and beta– beta. This line from chromatography .data systems (CDSs) and toLIMS. | +---------+------------------+-------------------------------------------------- ----------------------------------------------------------------------------+ 2 rows in set (0.00 sec) 8、查詢擴展 www.2cto.com 在使用查詢擴展時，MySQL對數據和索引進行兩遍掃描來完成搜索。首先，進行一個基本的全文本搜索，找出與搜索條件匹配的所有行；其次，MySQL檢查這些匹配行並選擇所有有用的詞；再次，MySQL再次進行全文本搜索，這次不僅使用原來的條件，而且還使用所有有用的詞。利用查詢擴展，能找出可能相關的結果，即使它們並不精確包含所查找的詞。表中的行越多，使用查詢擴展返回的結果越好。查詢擴展功能在MySQL4.1.1中引入。 [sql] mysql> select note_id, Match(note_text) Against('This line')as rank,note_text -> fromproductnotes -> where Match(note_text)Against('This line' with query expansion); +---------+------------------+-------------------------------------------------- ----------------------------------------------------------------------------+ | note_id | rank | note_text | +---------+------------------+-------------------------------------------------- ----------------------------------------------------------------------------+ | 5 | 0.81339610830754| This line of proprietary reagents,. containers, a nd automation tools is designed. for genomics and drugdiscovery .research. | | 7 |0.76517958501676 | specificities include both alpha–beta and beta– beta. This line from chromatography .data systems (CDSs) and toLIMS. | | 3 | 0 | Human S-100. monoclonal.and polyclonal specifici ties include both alpha–beta and beta–beta isoforms. | | 6 | 0 | LimsLink is .designed to interfaceoutput. from c hromatography .data systems (CDSs) and to LIMS. | | 1 | 0 | PepTool allows users tostore, manage. analyze, a nd visualize protein data. | +---------+------------------+-------------------------------------------------- ----------------------------------------------------------------------------+ 5 rows in set (0.00 sec) 9、布爾文本搜索（boolean mode）以布爾方式，可以提供關於如下內容的細節：要匹配的詞； www.2cto.com 要排斥的詞；排列提示；（指定某些詞比其他詞更重要）表達式分組；另外一些內容。 [sql] mysql> select note_id,note_text -> fromproductnotes -> whereMatch(note_text) Against('line' in boolean mode); +---------+--------------------------------------------------------------------- ---------------------------------------------------------+ | note_id | note_text | +---------+--------------------------------------------------------------------- ---------------------------------------------------------+ | 5 | This line ofproprietary reagents,. containers, and automation tools is designed. for genomicsand drug discovery .research. | | 7 | specificitiesinclude both alpha–beta and beta–beta. This line fro m chromatography .data systems (CDSs) and to LIMS. | +---------+--------------------------------------------------------------------- ---------------------------------------------------------+ 2 rows in set (0.00 sec) 即使沒有FULLTEXT索引也可以使用布爾文本搜索。但是非常緩慢。 mysql> select note_id,note_text/*匹配line且不包含systems*/ -> fromproductnotes -> whereMatch(note_text) Against('line -systems*' in boolean mode); +---------+--------------------------------------------------------------------- ---------------------------------------------------------+ | note_id | note_text | +---------+--------------------------------------------------------------------- ---------------------------------------------------------+ | 5 | This line ofproprietary reagents,. containers, and automation tools is designed. forgenomics and drug discovery .research. | +---------+--------------------------------------------------------------------- ---------------------------------------------------------+ 1 row in set (0.00 sec) mysql> select note_id,note_text/*匹配line且匹配systems*/ -> fromproductnotes -> whereMatch(note_text) Against('+line +systems' in boolean mode); +---------+--------------------------------------------------------------------- ---------------------------------------------------+ | note_id | note_text | +---------+--------------------------------------------------------------------- ---------------------------------------------------+ | 7 | specificitiesinclude both alpha–beta and beta–beta. This line fro m chromatography .data systems (CDSs) and to LIMS. | +---------+--------------------------------------------------------------------- ---------------------------------------------------+ 1 row in set (0.00 sec) mysql> select note_id,note_text/*匹配line或匹配systems*/ -> fromproductnotes -> whereMatch(note_text) Against('line systems' in boolean mode); +---------+--------------------------------------------------------------------- ---------------------------------------------------------+ | note_id | note_text | +---------+--------------------------------------------------------------------- ---------------------------------------------------------+ | 5 | This line ofproprietary reagents,. containers, and automation tools is designed. forgenomics and drug discovery .research. | | 6 | LimsLink is.designed to interface output. from chromatography .data systems (CDSs) and toLIMS. | | 7 | specificitiesinclude both alpha–beta and beta–beta. This line fro m chromatography .data systems (CDSs) and to LIMS. | +---------+--------------------------------------------------------------------- ---------------------------------------------------------+ 3 rows in set (0.00 sec) mysql> select note_id,note_text/*匹配短語*/ -> fromproductnotes -> whereMatch(note_text) Against('"This line"' in boolean mode); +---------+--------------------------------------------------------------------- ---------------------------------------------------------+ | note_id | note_text | +---------+--------------------------------------------------------------------- ---------------------------------------------------------+ | 5 | This line ofproprietary reagents,. containers, and automation tools is designed. forgenomics and drug discovery .research. | | 7 | specificitiesinclude both alpha–beta and beta–beta. This line fro m chromatography .data systems (CDSs) and to LIMS. | +---------+--------------------------------------------------------------------- ---------------------------------------------------------+ 2 rows in set (0.00 sec) 10、使用說明 l 在索引全文本數據時，短詞被忽略且從索引中排除。短詞的定義為那些具有3個或臉上以下字符的詞（如果需要，這個數目可以更新）。 l MySQL帶有一個內建的非用詞（stopword）列表，這些詞在索引全文本數據時總是被忽略。如果需要，可以覆蓋這個列表。 l MySQL規定了一條50%規則，如果一個詞出現在50%以上的行中，則將它作為一個非用詞忽略。50%規則不用於IN BOOLEAN MODE。 l 如果表中的行數少於3行，則全文本搜索不返回結果（因為每個詞或者不出現，或者至少出現在50%的行中）。 l 忽略詞中的單引號。如，don’t索引為dont。 l 不具有詞分隔符的語言不能恰當地返回全文本搜索結果。