MYSQL入門學習之三:全文本搜索
一、理解全文本搜索
www.2cto.com
1、MyISAM支持全文本搜索,而InnoDB不支持。
2、在使用全文本搜索時,MySQL不需要分別查看每個行,不需要分別分析和處理每個詞。MySQL創建指定列中各詞的一個索引,搜索可以針對這些詞進行。這樣MySQL可以快速有效地決定哪些詞匹配,哪些詞不匹配,它們匹配的頻率,等等。
二、使用全文本搜索
1、為了進行全文本搜索,必須索引被搜索的列,而且要隨著數據的改變不斷地重新索引。在對表列進行適當設計後,MySQL會自動進行所有的索引和重新索引。
在索引之後,SELECT可與Match()和Against()一起使用以實際執行搜索。
2、一般在創建表時啟用全文本搜索。
[sql]
create table productnotes
(
note_id int not nullauto_increment,
note_text text null,
primary key(note_id),
fulltext(note_text)
)engine=MyISAM;
在定義之後,MySQL自動維護該索引。在增加、更新或刪除行時,索引隨之自動更新。
3、不要在導入數據時使用FULLTEXT。
www.2cto.com
4、進行全文本搜索
Match()指定被搜索的列,Against()指定要使用的搜索表達式。
[sql]
mysql> select * from productnotes
-> whereMatch(note_text) Against('designed');
+---------+---------------------------------------------------------------------
------------------------------------------------------+
| note_id | note_text
|
+---------+---------------------------------------------------------------------
------------------------------------------------------+
| 6 | LimsLink isdesigned to interface output from chromatography data sy
stems (CDSs) to LIMS. |
| 5 | This line ofproprietary reagents, containers, and automation tools
is designed for genomics and drug discovery research. |
+---------+---------------------------------------------------------------------
------------------------------------------------------+
2 rows in set (0.03 sec)
5、傳遞給Match()的值必須與FULLTEXT()定義中的相同。如果指定多個列,則必須列出它們(而且次序正確)。
6、除非使用BINARY方式,否則全文本搜索不區分大小寫。
[sql]
mysql> select * from productnotes
-> where BINARYMatch(note_text) Against('line');
+---------+---------------------------------------------------------------------
------------------------------------------------------+
| note_id | note_text
|
+---------+---------------------------------------------------------------------
------------------------------------------------------+
| 5 | This line ofproprietary reagents, containers, and automation tools
is designed for genomics and drug discovery research. |
+---------+---------------------------------------------------------------------
------------------------------------------------------+
1 row in set (0.05 sec)
7、全文本搜索的一個重要部分就是對結果排序。具有較高等級的行先返回。
等級由MySQL根據行中詞的數目、唯一詞的數目、整個索引中詞的總數以及包含該詞的行的數目計算出來。文本中詞先前的行的等級值比詞靠後的行的等級值高。
[sql]
mysql> select note_id, Match(note_text) Against('This line')as rank,note_text
-> fromproductnotes
-> whereMatch(note_text) Against('This line');
+---------+------------------+--------------------------------------------------
----------------------------------------------------------------------------+
| note_id | rank | note_text
|
+---------+------------------+--------------------------------------------------
----------------------------------------------------------------------------+
| 5 |0.81339610830754 | This line of proprietary reagents,. containers, a
nd automation tools is designed. for genomics and drugdiscovery .research. |
| 7 |0.76517958501676 | specificities include both alpha–beta and beta–
beta. This line from chromatography .data systems (CDSs) and toLIMS. |
+---------+------------------+--------------------------------------------------
----------------------------------------------------------------------------+
2 rows in set (0.00 sec)
8、查詢擴展 www.2cto.com
在使用查詢擴展時,MySQL對數據和索引進行兩遍掃描來完成搜索。
首先,進行一個基本的全文本搜索,找出與搜索條件匹配的所有行;
其次,MySQL檢查這些匹配行並選擇所有有用的詞;
再次,MySQL再次進行全文本搜索,這次不僅使用原來的條件,而且還使用所有有用的詞。
利用查詢擴展,能找出可能相關的結果,即使它們並不精確包含所查找的詞。
表中的行越多,使用查詢擴展返回的結果越好。
查詢擴展功能在MySQL4.1.1中引入。
[sql]
mysql> select note_id, Match(note_text) Against('This line')as rank,note_text
-> fromproductnotes
-> where Match(note_text)Against('This line' with query expansion);
+---------+------------------+--------------------------------------------------
----------------------------------------------------------------------------+
| note_id | rank | note_text
|
+---------+------------------+--------------------------------------------------
----------------------------------------------------------------------------+
| 5 | 0.81339610830754| This line of proprietary reagents,. containers, a
nd automation tools is designed. for genomics and drugdiscovery .research. |
| 7 |0.76517958501676 | specificities include both alpha–beta and beta–
beta. This line from chromatography .data systems (CDSs) and toLIMS. |
| 3 | 0 | Human S-100. monoclonal.and polyclonal specifici
ties include both alpha–beta and beta–beta isoforms. |
| 6 | 0 | LimsLink is .designed to interfaceoutput. from c
hromatography .data systems (CDSs) and to LIMS. |
| 1 | 0 | PepTool allows users tostore, manage. analyze, a
nd visualize protein data. |
+---------+------------------+--------------------------------------------------
----------------------------------------------------------------------------+
5 rows in set (0.00 sec)
9、布爾文本搜索(boolean mode)
以布爾方式,可以提供關於如下內容的細節:
要匹配的詞; www.2cto.com
要排斥的詞;
排列提示;(指定某些詞比其他詞更重要)
表達式分組;
另外一些內容。
[sql]
mysql> select note_id,note_text
-> fromproductnotes
-> whereMatch(note_text) Against('line' in boolean mode);
+---------+---------------------------------------------------------------------
---------------------------------------------------------+
| note_id | note_text
|
+---------+---------------------------------------------------------------------
---------------------------------------------------------+
| 5 | This line ofproprietary reagents,. containers, and automation tools
is designed. for genomicsand drug discovery .research. |
| 7 | specificitiesinclude both alpha–beta and beta–beta. This line fro
m chromatography .data systems (CDSs) and to LIMS. |
+---------+---------------------------------------------------------------------
---------------------------------------------------------+
2 rows in set (0.00 sec)
即使沒有FULLTEXT索引也可以使用布爾文本搜索。但是非常緩慢。
mysql> select note_id,note_text/*匹配line且不包含systems*/
-> fromproductnotes
-> whereMatch(note_text) Against('line -systems*' in boolean mode);
+---------+---------------------------------------------------------------------
---------------------------------------------------------+
| note_id | note_text
|
+---------+---------------------------------------------------------------------
---------------------------------------------------------+
| 5 | This line ofproprietary reagents,. containers, and automation tools
is designed. forgenomics and drug discovery .research. |
+---------+---------------------------------------------------------------------
---------------------------------------------------------+
1 row in set (0.00 sec)
mysql> select note_id,note_text/*匹配line且匹配systems*/
-> fromproductnotes
-> whereMatch(note_text) Against('+line +systems' in boolean mode);
+---------+---------------------------------------------------------------------
---------------------------------------------------+
| note_id | note_text
|
+---------+---------------------------------------------------------------------
---------------------------------------------------+
| 7 | specificitiesinclude both alpha–beta and beta–beta. This line fro
m chromatography .data systems (CDSs) and to LIMS. |
+---------+---------------------------------------------------------------------
---------------------------------------------------+
1 row in set (0.00 sec)
mysql> select note_id,note_text/*匹配line或匹配systems*/
-> fromproductnotes
-> whereMatch(note_text) Against('line systems' in boolean mode);
+---------+---------------------------------------------------------------------
---------------------------------------------------------+
| note_id | note_text
|
+---------+---------------------------------------------------------------------
---------------------------------------------------------+
| 5 | This line ofproprietary reagents,. containers, and automation tools
is designed. forgenomics and drug discovery .research. |
| 6 | LimsLink is.designed to interface output. from chromatography .data
systems (CDSs) and toLIMS. |
| 7 | specificitiesinclude both alpha–beta and beta–beta. This line fro
m chromatography .data systems (CDSs) and to LIMS. |
+---------+---------------------------------------------------------------------
---------------------------------------------------------+
3 rows in set (0.00 sec)
mysql> select note_id,note_text/*匹配短語*/
-> fromproductnotes
-> whereMatch(note_text) Against('"This line"' in boolean mode);
+---------+---------------------------------------------------------------------
---------------------------------------------------------+
| note_id | note_text
|
+---------+---------------------------------------------------------------------
---------------------------------------------------------+
| 5 | This line ofproprietary reagents,. containers, and automation tools
is designed. forgenomics and drug discovery .research. |
| 7 | specificitiesinclude both alpha–beta and beta–beta. This line fro
m chromatography .data systems (CDSs) and to LIMS. |
+---------+---------------------------------------------------------------------
---------------------------------------------------------+
2 rows in set (0.00 sec)
10、使用說明
l 在索引全文本數據時,短詞被忽略且從索引中排除。短詞的定義為那些具有3個或臉上以下字符的詞(如果需要,這個數目可以更新)。
l MySQL帶有一個內建的非用詞(stopword)列表,這些詞在索引全文本數據時總是被忽略。如果需要,可以覆蓋這個列表。
l MySQL規定了一條50%規則,如果一個詞出現在50%以上的行中,則將它作為一個非用詞忽略。50%規則不用於IN BOOLEAN MODE。
l 如果表中的行數少於3行,則全文本搜索不返回結果(因為每個詞或者不出現,或者至少出現在50%的行中)。
l 忽略詞中的單引號。如,don’t索引為dont。
l 不具有詞分隔符的語言不能恰當地返回全文本搜索結果。