程序師世界是廣大編程愛好者互助、分享、學習的平台,程序師世界有你更精彩!
首頁
編程語言
C語言|JAVA編程
Python編程
網頁編程
ASP編程|PHP編程
JSP編程
數據庫知識
MYSQL數據庫|SqlServer數據庫
Oracle數據庫|DB2數據庫
 程式師世界 >> 數據庫知識 >> MYSQL數據庫 >> 關於MYSQL數據庫 >> 從MySQL隨機取1條記錄

從MySQL隨機取1條記錄

編輯:關於MYSQL數據庫

譯者序
之前有位朋友提到從MySQL隨機取1條記錄其實只要SELECT * FROM table ORDER BY RAND() LIMIT 1即可。其實這個語句有很大的性能問題,對於大表的效率是非常低下的。我們可以看一下MySQL對其的解釋:

EXPLAIN SELECT *
FROM `money_logs`
ORDER BY RAND( )
LIMIT 1 id select_type table type possible_keys key key_len ref rows Extra 1 SIMPLE table ALL NULL NULL NULL NULL 173784 Using temporary; Using filesort

這個SQL語句無法使用任何索引,還必須使用臨時表和文件排序,在一個15萬條記錄的MyISAM表需要花大約0.3秒。已經是相當慢的了。如何優化,請往下看:

第一個例子我們先假設ID是從1開始並且1和ID的最大值之間沒有任何空檔。

將工作移入應用程序

第一個想法:如果我們可以事先在應用程序中計算出ID,那麼就可以簡化整個工作。

SELECT MAX(id) FROM random;
## 在應用程序中生成隨機id
SELECT name FROM random WHERE id = <random-id>

由於MAX(id) == COUNT(id) 我們只要生成從1到MAX(id)之間一個隨機數,並將其傳給數據庫並取回隨機行。

上面第一個SELECT基本上是一個可以被優化掉的空操作。第二個是一個針對常量的eq_ref查詢,同樣也很快。

將任務放入數據庫

不過有必要將其放入應用程序嗎?難道我們不能在數據庫裡完成?

# 生成一個隨機 ID
> SELECT RAND() * MAX(id) FROM random;
+------------------+
| RAND() * MAX(id) |
+------------------+
|  689.37582507297 |
+------------------+
# 喔,這是一個浮點數,不過我們需要整數
> SELECT CEIL(RAND() * MAX(id)) FROM random;
+-------------------------+
| CEIL(RAND() * MAX(id)) |
+-------------------------+
|                    1000000  |
+-------------------------+
# 好多了。不過性能如何?
> EXPLAIN
SELECT CEIL(RAND() * MAX(id)) FROM random;
+----+-------------+-------+-------+------+-------------+
| id | select_type | table | type  | rows | Extra       |
+----+-------------+-------+-------+------+-------------+
|  1 | SIMPLE      | random  | index | 1000000  | Using index |
+----+-------------+-------+-------+------+-------------+
## 一個索引掃描?我們沒有對MAX()進行優化
> EXPLAIN
SELECT CEIL(RAND() * (SELECT MAX(id) FROM random));
+----+-------------+-------+------+------+------------------------------+
| id | select_type | table | type | rows | Extra                        |
+----+-------------+-------+------+------+------------------------------+
|  1 | PRIMARY     | NULL  | NULL | NULL | No tables used               |
|  2 | SUBQUERY    | NULL  | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+------+------------------------------+
## 一個簡單的子查詢給我們將性能找了回來。

OK,現在我們知道如何生成隨機ID了,不過如何獲取記錄行?

> EXPLAIN
SELECT name
FROM random
WHERE id = (SELECT CEIL(RAND() *
(SELECT MAX(id)
FROM random));
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows    | Extra                        |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
|  1 | PRIMARY     | random | ALL  | NULL          | NULL | NULL    | NULL | 1000000 | Using where                  |
|  3 | SUBQUERY    | NULL   | NULL | NULL          | NULL | NULL    | NULL |    NULL | Select tables optimized away |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
> show warnings;
+-------+------+------------------------------------------+
| Level | Code | Message                                  |
+-------+------+------------------------------------------+
| Note  | 1249 | Select 2 was reduced during optimization |
+-------+------+------------------------------------------+

哦,不!不要走這條路。雖然它很直觀,但是也是最容易犯的錯。理由是:在WHERE子句中的SELECT會針對外部SELECT取出的每一行執行一次。這可能會是0到4091行,看你的運氣了。

我們必須用一種方法確保隨機ID只被生成一次:

SELECT name
FROM random JOIN
(SELECT CEIL(RAND() *
(SELECT MAX(id)
FROM random)) AS id
) AS r2
USING (id);
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table      | type   | rows | Extra                        |
+----+-------------+------------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | system |    1 |                              |
|  1 | PRIMARY     | random     | const  |    1 |                              |
|  2 | DERIVED     | NULL       | NULL   | NULL | No tables used               |
|  3 | SUBQUERY    | NULL       | NULL   | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+

內部的 SELECT 生成了一個常數臨時(TEMPORARY)表並且聯接(JOIN)只選擇了一行。完美。

沒有排序、沒有應用程序介入,查詢的大部分都被優化了。

在數字中加入空檔

為了使最終的解決方案通用化,我們必須考慮空檔的可能性,如當你刪除(DELETE)了記錄行。

SELECT name
FROM random AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1;
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table      | type   | rows | Extra                        |
+----+-------------+------------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | system |    1 |                              |
|  1 | PRIMARY     | r1         | range  |  689 | Using where                  |
|  2 | DERIVED     | NULL       | NULL   | NULL | No tables used               |
|  3 | SUBQUERY    | NULL       | NULL   | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+

JOIN現在加入了所有大於等於我們隨機數的ID,並且當直接匹配不存在的時候,只選擇最臨近的記錄。不過一旦找到了某一行,我們就立刻停止(LIMIT 1)。同時我們根據索引(ORDER BY id ASC)讀取記錄。由於我們使用了>=而非=,所以我們可以削去CEIL同時還能獲取同樣的結果,節省了一點點開銷。

平均分布

一旦ID的分布不再是平均的了,那麼我們對行的選擇也不是完全隨機的了。

> select * from holes;
+----+----------------------------------+----------+
| id | name                             | Accesses |
+----+----------------------------------+----------+
|  1 | d12b2551c6cb7d7a64e40221569a8571 |      107 |
|  2 | f82ad6f29c9a680d7873d1bef822e3e9 |       50 |
|  4 | 9da1ed7dbbdcc6ec90d6cb139521f14a |      132 |
|  8 | 677a196206d93cdf18c3744905b94f73 |      230 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa |      481 |
+----+----------------------------------+----------+

RAND函數生成諸如9到15之間的ID都會讓id 16被選擇。

目前還沒有針對這個問題的真正的解決方案,不過你的數據是大部分不變的話可以添加一個將行號映射到ID的映射表:

> create table holes_map ( row_id int not NULL primary key, random_id int not null);
> SET @id = 0;
> INSERT INTO holes_map SELECT @id := @id + 1, id FROM holes;
> select * from holes_map;
+--------+-----------+
| row_id | random_id |
+--------+-----------+
|      1 |         1 |
|      2 |         2 |
|      3 |         4 |
|      4 |         8 |
|      5 |        16 |
+--------+-----------+

row_id現在則是沒有空檔的,我們就可以再次運行我們的隨機查詢了:

SELECT name FROM holes
JOIN (SELECT r1.random_id
FROM holes_map AS r1
JOIN (SELECT (RAND() *
(SELECT MAX(row_id)
FROM holes_map)) AS row_id)
AS r2
WHERE r1.row_id >= r2.row_id
ORDER BY r1.row_id ASC
LIMIT 1) as rows ON (id = random_id);

1000次查找之後,我們可以看到一個平均分布:

> select * from holes;
+----+----------------------------------+----------+
| id | name                             | Accesses |
+----+----------------------------------+----------+
|  1 | d12b2551c6cb7d7a64e40221569a8571 |      222 |
|  2 | f82ad6f29c9a680d7873d1bef822e3e9 |      187 |
|  4 | 9da1ed7dbbdcc6ec90d6cb139521f14a |      195 |
|  8 | 677a196206d93cdf18c3744905b94f73 |      207 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa |      189 |
+----+----------------------------------+----------+
使用觸發器維護有空檔的表

讓我們就用前面的幾個表:

DROP TABLE IF EXISTS r2;
CREATE TABLE r2 (
id SERIAL,
name VARCHAR(32) NOT NULL UNIQUE
);
DROP TABLE IF EXISTS r2_equi_dist;
CREATE TABLE r2_equi_dist (
id SERIAL,
r2_id bigint unsigned NOT NULL UNIQUE
);

一旦我們在r2中改動了某些東西,我們希望r2_equi_dist也被更新。

DELIMITER $$
DROP TRIGGER IF EXISTS tai_r2$$
CREATE TRIGGER tai_r2
AFTER INSERT ON r2 FOR EACH ROW
BEGIN
DECLARE m BIGINT UNSIGNED DEFAULT 1;
SELECT MAX(id) + 1 FROM r2_equi_dist INTO m;
SELECT IFNULL(m, 1) INTO m;
INSERT INTO r2_equi_dist (id, r2_id) VALUES (m, NEW.id);
END$$
DELIMITER ;
DELETE FROM r2;
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
SELECT * FROM r2;
+----+----------------------------------+
| id | name                             |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  2 | a09a3959d68187ce48f4fe7e388926a9 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
|  4 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |     1 |
|  2 |     2 |
|  3 |     3 |
|  4 |     4 |
+----+-------+

INSERT很簡單,但在DELETE時我們需要更新equi-dist-id來維持id的連續。

DELIMITER $$
DROP TRIGGER IF EXISTS tad_r2$$
CREATE TRIGGER tad_r2
AFTER DELETE ON r2 FOR EACH ROW
BEGIN
DELETE FROM r2_equi_dist WHERE r2_id = OLD.id;
UPDATE r2_equi_dist SET id = id - 1 WHERE r2_id > OLD.id;
END$$
DELIMITER ;
DELETE FROM r2 WHERE id = 2;
SELECT * FROM r2;
+----+----------------------------------+
| id | name                             |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
|  4 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |     1 |
|  2 |     3 |
|  3 |     4 |
+----+-------+

UPDATE就非常直觀了。我們只要維護一下外鍵約束:

DELIMITER $$
DROP TRIGGER IF EXISTS tau_r2$$
CREATE TRIGGER tau_r2
AFTER UPDATE ON r2 FOR EACH ROW
BEGIN
UPDATE r2_equi_dist SET r2_id = NEW.id WHERE r2_id = OLD.id;
END$$
DELIMITER ;
UPDATE r2 SET id = 25 WHERE id = 4;
SELECT * FROM r2;
+----+----------------------------------+
| id | name                             |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
| 25 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |     1 |
|  2 |     3 |
|  3 |    25 |
+----+-------+
一次多行

如果你想一次取回多行,你可以:

執行多次查詢 寫一個存儲過程執行查詢並將結果存入一個臨時表 進行一個UNION 存儲過程

存儲過程為你提供了通用語言中很有用的一些結構:

循環 控制結構 過程 …

這個任務中我們只需要一個LOOP:

DELIMITER $$
DROP PROCEDURE IF EXISTS get_rands$$
CREATE PROCEDURE get_rands(IN cnt INT)
BEGIN
DROP TEMPORARY TABLE IF EXISTS rands;
CREATE TEMPORARY TABLE rands ( rand_id INT );
loop_me: LOOP
IF cnt < 1 THEN
LEAVE loop_me;
END IF;
INSERT INTO rands
SELECT r1.id
FROM random AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1;
SET cnt = cnt - 1;
END LOOP loop_me;
END$$
DELIMITER ;
CALL get_rands(4);
SELECT * FROM rands;
+---------+
| rand_id |
+---------+
|  133716 |
|  702643 |
|  112066 |
|  452400 |
+---------+

下面的問題留給讀者解決:

使用動態SQL並傳入臨時表的名字 在表上使用一個UNIQUE索引並捕獲UNIQUE鍵沖突來消除結果集中可能的重復記錄。 性能

現在讓我們看看性能方面發生了什麼變化。我們有三個不同的查詢來解決這個問題。

Q1. ORDER BY RAND() Q2. RAND() * MAX(ID) Q3. RAND() * MAX(ID) + ORDER BY ID

Q1預期消耗N * log2(N),Q2和Q3接近常數。

下標是評測的結果,針對N行的表(從一千行到一百萬行),每個查詢執行1000次。

   100        1.000      10.000     100.000    1.000.000
Q1  0:00.718s  0:02.092s  0:18.684s  2:59.081s  58:20.000s
Q2  0:00.519s  0:00.607s  0:00.614s  0:00.628s   0:00.637s
Q3  0:00.570s  0:00.607s  0:00.614s  0:00.628s   0:00.637s

如你所見,普通的ORDER BY RAND()從僅100行的表開始便落後於優化過的查詢了。

關於這些查詢更詳細的分析可以在查閱.


  1. 上一頁:
  2. 下一頁:
Copyright © 程式師世界 All Rights Reserved