您现在的位置：程式師世界 >> 編程語言 > >> 更多編程語言 >> Python

Python CUDA 編程 - 6 - 共享內存

編輯：Python

CUDA編程中內存分為主機內存（內存條）與設備內存（顯存），為提高計算效率，需要設計程序降低內存的數據搬運，或使用快速的內存寄存數據。

共享內存

CPU和GPU組成異構計算架構，如果想從內存上優化程序，我們必須盡量減少主機與GPU設備間的數據拷貝，並將更多計算從主機端轉移到GPU設備端，我們要盡量在設備端初始化數據，並計算中間數據，並盡量不做無意義的數據回寫。

GPU的內存結構如圖所示：GPU的計算核心都在Streaming Multiprocessor（SM）上，SM裡有計算核心可直接訪問的寄存器（Register）和共享內存（Shared Memory）；多個SM可以讀取顯卡上的顯存，包括全局內存（Global Memory）。

每個SM上的Shared Memory相當於該SM上的一個緩存，一般都很小，Telsa V100的Shared Memory也只有96KB。注意，Shared Memory和Global Memory的字面上都有共享的意思，但是不要將兩者的概念混淆，Shared Memory離計算核心更近，延遲很低；Global Memory是整個顯卡上的全局內存，延遲高。

從軟件角度來看，CUDA的線程可以訪問不同級別的存儲，每個Thread有獨立的私有內存；每個Block中多個Thread都可以在該Block的Shared Memory中讀寫數據；整個Grid中所有Thread都可以讀寫Global Memory。Shared Memory的讀寫訪問速度會遠高於Global Memory。內存優化一般主要利用Shared Memory技術。下文將以矩陣乘法為例，展示如何使用Shared Memory來優化程序。

普通矩陣乘法

一個C = AB的矩陣乘法運算，需要我們把A的某一行與B的某一列的所有元素一一相乘，求和後，將結果存儲到結果矩陣C的(row, col)上。在這種實現中，每個線程都要讀取A的一整行和B的一整列，共計算M行*P列。以計算第row行為例，計算C[row, 0]、C[row, 1]…C[row, p-1]這些點時都需要從顯存的Global Memory中把整個第row行讀取一遍。可以算到，A矩陣中的每個點需要被讀 B.width 次，B矩陣中的每個點需要被讀 A.height 次。這樣比較浪費時間。因此，可以將多次訪問的數據放到Shared Memory中，減少重復讀取的次數，並充分利用Shared Memory的延遲低的優勢。

from numba import cuda
import numpy as np
import math
from time import time
@cuda.jit
def matmul(A, B, C):
""" 矩陣乘法 C = A * B
"""
# Numba庫提供了更簡易的計算方法
# x, y = cuda.grid(2)
# 具體計算公式如下
row = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x
col = cuda.threadIdx.y + cuda.blockDim.y * cuda.blockIdx.y
if row < C.shape[0] and col < C.shape[1]:
tmp = 0.
for k in range(A.shape[1]):
tmp += A[row, k] * B[k, col]
C[row, col] = tmp
def main():
# 初始化矩陣
M = 6000
N = 4800
P = 4000
A = np.random.random((M, N)) # 隨機生成的 [M x N] 矩陣
B = np.random.random((N, P)) # 隨機生成的 [N x P] 矩陣
start = time()
A = cuda.to_device(A)
B = cuda.to_device(B)
C_gpu = cuda.device_array((M, P))
# 執行配置
threads_per_block = (16, 16)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(math.ceil(B.shape[1] / threads_per_block[1]))
blocksPerGrid = (blocks_per_grid_x, blocks_per_grid_y)
# 啟動核函數
matmul[blocksPerGrid, threads_per_block](A, B, C_gpu)
# 數據拷貝
C = C_gpu.copy_to_host()
cuda.synchronize()
print("gpu matmul time :" + str(time() - start))
start = time()
C_cpu = np.empty((M, P), np.float)
np.matmul(A, B, C_cpu)
print("cpu matmul time :" + str(time() - start))
# 驗證正確性
if np.allclose(C_cpu, C):
print("gpu result correct")
if __name__ == "__main__":
main()

基於Shared Memory的矩陣乘法

接下來的程序利用了Shared Memory來做矩陣乘法。這個實現中，跟未做優化的版本相同的是，每個Thread計算結果矩陣中的一個元素，不同的是，每個CUDA Block會以一個 BLOCK_SIZE * BLOCK_SIZE 子矩陣為基本的計算單元。具體而言，需要聲明Shared Memory區域，數據第一次會從Global Memory拷貝到Shared Memory上，接下來可多次重復利用Shared Memory上的數據。

from numba import cuda, float32
import numpy as np
import math
from time import time
# thread per block
# 每個block有 BLOCK_SIZE x BLOCK_SIZE 個元素
BLOCK_SIZE = 16
@cuda.jit
def matmul(A, B, C):
""" 矩陣乘法 C = A * B
"""
row = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x
col = cuda.threadIdx.y + cuda.blockDim.y * cuda.blockIdx.y
if row < C.shape[0] and col < C.shape[1]:
tmp = 0.
for k in range(A.shape[1]):
tmp += A[row, k] * B[k, col]
C[row, col] = tmp
@cuda.jit
def matmul_shared_memory(A, B, C):
"""
使用Shared Memory的矩陣乘法 C = A * B
"""
# 在Shared Memory中定義向量
# 向量可被整個Block的所有Thread共享
# 必須聲明向量大小和數據類型
sA = cuda.shared.array(shape=(BLOCK_SIZE, BLOCK_SIZE), dtype=float32)
sB = cuda.shared.array(shape=(BLOCK_SIZE, BLOCK_SIZE), dtype=float32)
tx = cuda.threadIdx.x
ty = cuda.threadIdx.y
row = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x
col = cuda.threadIdx.y + cuda.blockDim.y * cuda.blockIdx.y
if row >= C.shape[0] and col >= C.shape[1]:
# 當(x, y)越界時退出
return
tmp = 0.
# 以一個 BLOCK_SIZE x BLOCK_SIZE 為單位
for m in range(math.ceil(A.shape[1] / BLOCK_SIZE)):
sA[tx, ty] = A[row, ty + m * BLOCK_SIZE]
sB[tx, ty] = B[tx + m * BLOCK_SIZE, col]
# 線程同步，等待Block中所有Thread預加載結束
# 該函數會等待所有Thread執行完之後才執行下一步
cuda.syncthreads()
# 此時已經將A和B的子矩陣拷貝到了sA和sB
# 計算Shared Memory中的向量點積
# 直接從Shard Memory中讀取數據的延遲很低
for n in range(BLOCK_SIZE):
tmp += sA[tx, n] * sB[n, ty]
# 線程同步，等待Block中所有Thread計算結束
cuda.syncthreads()
# 循環後得到每個BLOCK的點積之和
C[row, col] = tmp
def main():
# 初始化矩陣
M = 6000
N = 4800
P = 4000
A = np.random.random((M, N)) # 隨機生成的 [M x N] 矩陣
B = np.random.random((N, P)) # 隨機生成的 [N x P] 矩陣
A_device = cuda.to_device(A)
B_device = cuda.to_device(B)
C_device = cuda.device_array((M, P)) # [M x P] 矩陣
# 執行配置
threads_per_block = (BLOCK_SIZE, BLOCK_SIZE)
blocks_per_grid_x = int(math.ceil(A.shape[0] / BLOCK_SIZE))
blocks_per_grid_y = int(math.ceil(B.shape[1] / BLOCK_SIZE))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
start = time()
matmul[blocks_per_grid, threads_per_block](A_device, B_device, C_device)
cuda.synchronize()
print("matmul time :" + str(time() - start))
start = time()
matmul_shared_memory[blocks_per_grid, threads_per_block](A_device, B_device, C_device)
cuda.synchronize()
print("matmul with shared memory time :" + str(time() - start))
C = C_device.copy_to_host()
if __name__ == "__main__":
main()

進行Shared Memory優化後，計算部分的耗時減少了近一半：

matmul time :1.4370720386505127
matmul with shared memory time :0.7994928359985352

補充說明

聲明Shared Memory。這裡使用了cuda.shared.array(shape,type)，shape為這塊數據的向量維度大小，type為Numba數據類型，例如是int32還是float32。這個函數只能在設備端使用。定義好後，這塊數據可被同一個Block的所有Thread共享。需要注意的是，這塊數據雖然在核函數中定義，但它不是單個Thread的私有數據，它可被同Block中的所有Thread讀寫。
數據加載。每個Thread會將A中的一個元素加載到sA中，一個Block的 BLOCK_SIZE x BLOCK_SIZE 個Thread可以把sA填充滿。cuda.syncthreads()會等待Block中所有Thread執行完之後才執行下一步。所以，當執行完這個函數的時候，sA和sB的數據已經拷貝好了。
數據復用。A中的某個點，只會被讀取 B.width / BLOCK_SIZE 次；B中的某個點，只會被讀 A.height / BLOCK_SIZE 次。for n in range(BLOCK_SIZE)這個循環做子矩陣向量乘法時，可多次復用sA和sB的數據。
子矩陣的數據匯總。我們以一個 BLOCK_SIZE x BLOCK_SIZE 的子矩陣為單位分別對A從左到右，對B從上到下平移並計算，共循環 A.width / BLOCK_SIZE 次。在某一步平移，會得到子矩陣的點積。for m in range(math.ceil(A.shape[1] / BLOCK_SIZE))這個循環起到了計算A從左到右與B從上到下點積的過程。