Translator's Preface
A while ago a friend mentioned that to fetch a single random record from MySQL all you need is SELECT * FROM table ORDER BY RAND() LIMIT 1. In fact this statement has a serious performance problem and is extremely slow on large tables. Let's look at how MySQL itself explains it:
The statement cannot use any index and has to resort to a temporary table and a filesort; on a MyISAM table with 150,000 rows it takes roughly 0.3 seconds, which is already quite slow. For how to optimize it, read on.
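For reference, this is roughly what EXPLAIN reports for the naive query (the output below is illustrative, using the 1,000,000-row random table from the examples that follow; the key point is the full table scan plus Using temporary; Using filesort):

> EXPLAIN SELECT name FROM random ORDER BY RAND() LIMIT 1;
+----+-------------+--------+------+---------------+------+---------+------+---------+---------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows    | Extra                           |
+----+-------------+--------+------+---------------+------+---------+------+---------+---------------------------------+
|  1 | SIMPLE      | random | ALL  | NULL          | NULL | NULL    | NULL | 1000000 | Using temporary; Using filesort |
+----+-------------+--------+------+---------------+------+---------+------+---------+---------------------------------+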
For the first example we assume that the IDs start at 1 and that there are no gaps between 1 and the maximum ID.
Moving the work into the application

A first idea: if we can compute the ID in the application ahead of time, the whole task becomes much simpler.
SELECT MAX(id) FROM random;
## generate a random id in the application
SELECT name FROM random WHERE id = <random-id>;
Since MAX(id) == COUNT(id), we simply generate a random number between 1 and MAX(id), pass it to the database, and fetch the random row with it.
The first SELECT above is essentially a no-op that gets optimized away; the second is an eq_ref lookup against a constant and is therefore very fast as well.
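A minimal sketch of this two-step flow, simulated here entirely inside a MySQL session with user variables (in practice the random number between 1 and MAX(id) would be computed in the application; @max_id and @rand_id are just illustrative names):

> SELECT MAX(id) INTO @max_id FROM random;
# the application would compute: rand_id = floor(1 + rand() * max_id)
> SET @rand_id = FLOOR(1 + RAND() * @max_id);
> SELECT name FROM random WHERE id = @rand_id;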
Moving the work into the database

But do we really have to involve the application at all? Can't we do the whole thing inside the database?
# generate a random ID
> SELECT RAND() * MAX(id) FROM random;
+------------------+
| RAND() * MAX(id) |
+------------------+
|  689.37582507297 |
+------------------+

# oh, that is a float, but we need an integer
> SELECT CEIL(RAND() * MAX(id)) FROM random;
+------------------------+
| CEIL(RAND() * MAX(id)) |
+------------------------+
|                1000000 |
+------------------------+

# better. But what about the performance?
> EXPLAIN SELECT CEIL(RAND() * MAX(id)) FROM random;
+----+-------------+--------+-------+---------+-------------+
| id | select_type | table  | type  | rows    | Extra       |
+----+-------------+--------+-------+---------+-------------+
|  1 | SIMPLE      | random | index | 1000000 | Using index |
+----+-------------+--------+-------+---------+-------------+

## an index scan? The MAX() was not optimized away.
> EXPLAIN SELECT CEIL(RAND() * (SELECT MAX(id) FROM random));
+----+-------------+-------+------+------+------------------------------+
| id | select_type | table | type | rows | Extra                        |
+----+-------------+-------+------+------+------------------------------+
|  1 | PRIMARY     | NULL  | NULL | NULL | No tables used               |
|  2 | SUBQUERY    | NULL  | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+------+------------------------------+

## a simple subquery gets the performance back.
OK, now we know how to generate a random ID, but how do we fetch the corresponding row?
> EXPLAIN SELECT name FROM random
   WHERE id = (SELECT CEIL(RAND() * (SELECT MAX(id) FROM random)));
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows    | Extra                        |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
|  1 | PRIMARY     | random | ALL  | NULL          | NULL | NULL    | NULL | 1000000 | Using where                  |
|  3 | SUBQUERY    | NULL   | NULL | NULL          | NULL | NULL    | NULL |    NULL | Select tables optimized away |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+

> show warnings;
+-------+------+------------------------------------------+
| Level | Code | Message                                  |
+-------+------+------------------------------------------+
| Note  | 1249 | Select 2 was reduced during optimization |
+-------+------+------------------------------------------+
Oh no. Don't go down this road. It looks intuitive, but it is also the easiest mistake to make. The reason: the SELECT in the WHERE clause is re-executed for every row the outer SELECT examines. That can be anywhere from 0 to 4091 rows, depending on your luck.
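A quick way to see this re-evaluation in action (a sketch, not from the original article): count the "matches" of the naive query. Because the RAND() expression is evaluated again for every row examined, the count differs from run to run:

> SELECT COUNT(*) FROM random
   WHERE id = (SELECT CEIL(RAND() * (SELECT MAX(id) FROM random)));
# may report 0, 1, or several "matching" rows, changing on every execution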
We need a way to make sure the random ID is generated only once:
> EXPLAIN SELECT name
    FROM random
    JOIN (SELECT CEIL(RAND() * (SELECT MAX(id) FROM random)) AS id) AS r2
   USING (id);
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table      | type   | rows | Extra                        |
+----+-------------+------------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | system |    1 |                              |
|  1 | PRIMARY     | random     | const  |    1 |                              |
|  2 | DERIVED     | NULL       | NULL   | NULL | No tables used               |
|  3 | SUBQUERY    | NULL       | NULL   | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+
The inner SELECT produces a constant TEMPORARY (derived) table, and the JOIN selects exactly one row. Perfect.
No sorting, no application involvement, and most of the query gets optimized away.
Allowing gaps in the numbering

To make the final solution general, we have to account for the possibility of gaps, for example when rows have been DELETEd.
> EXPLAIN SELECT name
    FROM random AS r1
    JOIN (SELECT (RAND() * (SELECT MAX(id) FROM random)) AS id) AS r2
   WHERE r1.id >= r2.id
   ORDER BY r1.id ASC
   LIMIT 1;
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table      | type   | rows | Extra                        |
+----+-------------+------------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | system |    1 |                              |
|  1 | PRIMARY     | r1         | range  |  689 | Using where                  |
|  2 | DERIVED     | NULL       | NULL   | NULL | No tables used               |
|  3 | SUBQUERY    | NULL       | NULL   | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+
The JOIN now includes every ID greater than or equal to our random number and, when there is no direct match, picks only the closest one. As soon as a row is found we stop (LIMIT 1), and we read the rows via the index (ORDER BY id ASC). Because we use >= instead of =, we can also drop the CEIL and get the same result while saving a tiny bit of work.
Even distribution

As soon as the IDs are no longer evenly distributed, our choice of rows is no longer truly random either.
> select * from holes;
+----+----------------------------------+----------+
| id | name                             | Accesses |
+----+----------------------------------+----------+
|  1 | d12b2551c6cb7d7a64e40221569a8571 |      107 |
|  2 | f82ad6f29c9a680d7873d1bef822e3e9 |       50 |
|  4 | 9da1ed7dbbdcc6ec90d6cb139521f14a |      132 |
|  8 | 677a196206d93cdf18c3744905b94f73 |      230 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa |      481 |
+----+----------------------------------+----------+
Any value the RAND() function produces between, say, 9 and 15 leads to id 16 being selected.
There is no real solution for this problem yet, but if your data is mostly static you can add a mapping table that maps row numbers to IDs:
> create table holes_map ( row_id int not NULL primary key, random_id int not null);
> SET @id = 0;
> INSERT INTO holes_map SELECT @id := @id + 1, id FROM holes;
> select * from holes_map;
+--------+-----------+
| row_id | random_id |
+--------+-----------+
|      1 |         1 |
|      2 |         2 |
|      3 |         4 |
|      4 |         8 |
|      5 |        16 |
+--------+-----------+
row_id now has no gaps, so we can run our random query again:
SELECT name
  FROM holes
  JOIN (SELECT r1.random_id
          FROM holes_map AS r1
          JOIN (SELECT (RAND() * (SELECT MAX(row_id) FROM holes_map)) AS row_id) AS r2
         WHERE r1.row_id >= r2.row_id
         ORDER BY r1.row_id ASC
         LIMIT 1) AS rows
    ON (id = random_id);
After 1,000 lookups we see an even distribution:
> select * from holes;
+----+----------------------------------+----------+
| id | name                             | Accesses |
+----+----------------------------------+----------+
|  1 | d12b2551c6cb7d7a64e40221569a8571 |      222 |
|  2 | f82ad6f29c9a680d7873d1bef822e3e9 |      187 |
|  4 | 9da1ed7dbbdcc6ec90d6cb139521f14a |      195 |
|  8 | 677a196206d93cdf18c3744905b94f73 |      207 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa |      189 |
+----+----------------------------------+----------+

Maintaining a table with gaps using triggers
Let's take the tables from before:
DROP TABLE IF EXISTS r2;
CREATE TABLE r2 (
  id SERIAL,
  name VARCHAR(32) NOT NULL UNIQUE
);

DROP TABLE IF EXISTS r2_equi_dist;
CREATE TABLE r2_equi_dist (
  id SERIAL,
  r2_id bigint unsigned NOT NULL UNIQUE
);
Whenever we change something in r2, we want r2_equi_dist to be updated as well.
DELIMITER $$

DROP TRIGGER IF EXISTS tai_r2$$
CREATE TRIGGER tai_r2
AFTER INSERT ON r2
FOR EACH ROW
BEGIN
  DECLARE m BIGINT UNSIGNED DEFAULT 1;

  SELECT MAX(id) + 1 FROM r2_equi_dist INTO m;
  SELECT IFNULL(m, 1) INTO m;

  INSERT INTO r2_equi_dist (id, r2_id) VALUES (m, NEW.id);
END$$

DELIMITER ;

DELETE FROM r2;
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );

SELECT * FROM r2;
+----+----------------------------------+
| id | name                             |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  2 | a09a3959d68187ce48f4fe7e388926a9 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
|  4 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+

SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |     1 |
|  2 |     2 |
|  3 |     3 |
|  4 |     4 |
+----+-------+
INSERT is simple, but on DELETE we have to update the equi-dist IDs to keep the sequence contiguous.
DELIMITER $$

DROP TRIGGER IF EXISTS tad_r2$$
CREATE TRIGGER tad_r2
AFTER DELETE ON r2
FOR EACH ROW
BEGIN
  DELETE FROM r2_equi_dist WHERE r2_id = OLD.id;
  UPDATE r2_equi_dist SET id = id - 1 WHERE r2_id > OLD.id;
END$$

DELIMITER ;

DELETE FROM r2 WHERE id = 2;

SELECT * FROM r2;
+----+----------------------------------+
| id | name                             |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
|  4 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+

SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |     1 |
|  2 |     3 |
|  3 |     4 |
+----+-------+
UPDATE is quite straightforward: we just have to maintain the foreign-key relationship:
DELIMITER $$

DROP TRIGGER IF EXISTS tau_r2$$
CREATE TRIGGER tau_r2
AFTER UPDATE ON r2
FOR EACH ROW
BEGIN
  UPDATE r2_equi_dist SET r2_id = NEW.id WHERE r2_id = OLD.id;
END$$

DELIMITER ;

UPDATE r2 SET id = 25 WHERE id = 4;

SELECT * FROM r2;
+----+----------------------------------+
| id | name                             |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
| 25 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+

SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |     1 |
|  2 |     3 |
|  3 |    25 |
+----+-------+

Fetching multiple rows at once
If you want to fetch more than one row at a time, you can:
- run the query several times
- write a stored procedure that executes the query and stores the results in a temporary table
- do a UNION (a sketch of this variant follows the stored-procedure example below)

Stored procedures

Stored procedures give you some constructs that are useful in general-purpose languages:
- loops
- control structures
- procedures
- ...

For this task we only need a LOOP:
DELIMITER $$

DROP PROCEDURE IF EXISTS get_rands$$
CREATE PROCEDURE get_rands(IN cnt INT)
BEGIN
  DROP TEMPORARY TABLE IF EXISTS rands;
  CREATE TEMPORARY TABLE rands ( rand_id INT );

  loop_me: LOOP
    IF cnt < 1 THEN
      LEAVE loop_me;
    END IF;

    INSERT INTO rands
       SELECT r1.id
         FROM random AS r1
         JOIN (SELECT (RAND() * (SELECT MAX(id) FROM random)) AS id) AS r2
        WHERE r1.id >= r2.id
        ORDER BY r1.id ASC
        LIMIT 1;

    SET cnt = cnt - 1;
  END LOOP loop_me;
END$$

DELIMITER ;

CALL get_rands(4);

SELECT * FROM rands;
+---------+
| rand_id |
+---------+
|  133716 |
|  702643 |
|  112066 |
|  452400 |
+---------+
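For completeness, the UNION option from the list above can be sketched like this for a small, fixed number of rows (not part of the original article; note that UNION removes duplicate rows, so if two picks happen to hit the same record you get fewer rows back):

> (SELECT name FROM random AS r1
     JOIN (SELECT (RAND() * (SELECT MAX(id) FROM random)) AS id) AS r2
    WHERE r1.id >= r2.id ORDER BY r1.id ASC LIMIT 1)
  UNION
  (SELECT name FROM random AS r1
     JOIN (SELECT (RAND() * (SELECT MAX(id) FROM random)) AS id) AS r2
    WHERE r1.id >= r2.id ORDER BY r1.id ASC LIMIT 1)
  UNION
  (SELECT name FROM random AS r1
     JOIN (SELECT (RAND() * (SELECT MAX(id) FROM random)) AS id) AS r2
    WHERE r1.id >= r2.id ORDER BY r1.id ASC LIMIT 1);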
The following problems are left for the reader to solve:
- use dynamic SQL and pass in the name of the temporary table
- put a UNIQUE index on the table and catch the duplicate-key error to eliminate possible duplicates in the result set

Performance

Now let's look at the performance side. We have three different queries that solve the problem:
Q1. ORDER BY RAND()
Q2. RAND() * MAX(ID)
Q3. RAND() * MAX(ID) + ORDER BY ID

Q1 is expected to cost on the order of N * log2(N), while Q2 and Q3 are close to constant.
The table below shows the benchmark results for tables of N rows (from 100 up to 1,000,000), with each query executed 1,000 times.
            100        1,000       10,000      100,000    1,000,000
Q1    0:00.718s    0:02.092s    0:18.684s    2:59.081s   58:20.000s
Q2    0:00.519s    0:00.607s    0:00.614s    0:00.628s    0:00.637s
Q3    0:00.570s    0:00.607s    0:00.614s    0:00.628s    0:00.637s
As you can see, plain ORDER BY RAND() already falls behind the optimized queries on a table of just 100 rows.
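As a rough sanity check of those cost estimates against the measured totals (logarithms rounded; 58:20 min is about 3500 s):

\[
\frac{10^6 \log_2 10^6}{10^3 \log_2 10^3} \approx \frac{10^6 \cdot 19.9}{10^3 \cdot 10.0} \approx 2000,
\qquad
\frac{T_{Q1}(10^6)}{T_{Q1}(10^3)} \approx \frac{3500\ \mathrm{s}}{2.09\ \mathrm{s}} \approx 1700,
\]

which is roughly the growth that N * log2(N) predicts, while Q2 and Q3 stay at about 0.6 s regardless of N, i.e. effectively constant.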
A more detailed analysis of these queries is available separately.