In this tutorial, you'll learn how to run a Python script as part of an Azure Data Factory pipeline that uses Azure Batch.
The following example runs a Python script that receives CSV input from a Blob storage container, performs a data manipulation process, and writes the output to a separate Blob storage container.
If you don't have an Azure subscription, create a free account before you begin.
The azure-storage-blob and pandas pip packages, for testing the script locally.
Sign in to the Azure portal at https://portal.azure.com.
For this example, you need to provide credentials for your Batch and Storage accounts. A straightforward way to get the necessary credentials is in the Azure portal. (You can also get these credentials by using the Azure APIs or command-line tools; a sketch that does so follows the steps below.)
Select All services > Batch accounts, and then select the name of your Batch account.
To see the Batch credentials, select Keys. Copy the values of Batch account, URL, and Primary access key to a text editor.
To see the storage account name and keys, select Storage account. Copy the values of Storage account name and Key1 to a text editor.
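If you prefer code over the portal, the following sketch retrieves the same keys with the management SDKs (azure-identity, azure-mgmt-batch, and azure-mgmt-storage). The resource group and account names are placeholders you must fill in, and call names can differ slightly between SDK versions.
Python
# Minimal sketch: fetch Batch and Storage keys with the Azure management SDKs.
from azure.identity import DefaultAzureCredential
from azure.mgmt.batch import BatchManagementClient
from azure.mgmt.storage import StorageManagementClient

credential = DefaultAzureCredential()
subscription_id = "<subscription-id>"

# Batch account keys (the access keys shown on the Keys blade).
batch_mgmt = BatchManagementClient(credential, subscription_id)
batch_keys = batch_mgmt.batch_account.get_keys("<resource-group>", "<batch-account-name>")
print("Batch primary access key:", batch_keys.primary)

# Storage account keys (key1/key2 shown on the Access keys blade).
storage_mgmt = StorageManagementClient(credential, subscription_id)
storage_keys = storage_mgmt.storage_accounts.list_keys("<resource-group>", "<storage-account-name>")
print("Storage key1:", storage_keys.keys[0].value)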
In this section, you'll use Batch Explorer to create the Batch pool that your Azure Data Factory pipeline will use.
Choose an ID and display name. This example uses custom-activity-pool.
Choose Standard_f2s_v2 as the virtual machine size.
Enable the start task and add the command cmd /c "pip install azure-storage-blob pandas".
The user identity can remain as the default Pool user. (A programmatic sketch of the same pool configuration is shown below.)
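If you'd rather create the pool in code than in Batch Explorer, the following sketch uses the azure-batch package. The image reference fields are placeholders: pick an image that already has Python available, such as the Data Science VM image selected above, and the node count is an assumption you can adjust.
Python
# Minimal sketch: create a pool equivalent to the Batch Explorer steps above.
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("<batch-account-name>", "<batch-account-key>")
batch_client = BatchServiceClient(credentials, batch_url="<batch-account-url>")

pool = batchmodels.PoolAddParameter(
    id="custom-activity-pool",
    vm_size="Standard_f2s_v2",
    target_dedicated_nodes=2,  # assumed node count; size the pool to your workload
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="<image-publisher>",  # use an image with Python preinstalled,
            offer="<image-offer>",          # such as the Data Science VM image
            sku="<image-sku>",
        ),
        node_agent_sku_id="batch.node.windows amd64",
    ),
    # Install the packages the script needs on each node as it joins the pool.
    start_task=batchmodels.StartTask(
        command_line='cmd /c "pip install azure-storage-blob pandas"',
        wait_for_success=True,
    ),
)
batch_client.pool.add(pool)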
Here, you'll create the blob containers that will store the input and output files for the Batch job.
In this example, the input container is named input and the output container is named output.
Upload iris.csv to the input container.
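If you don't want to use Storage Explorer for this step, the following sketch creates the two containers and uploads iris.csv with the azure-storage-blob package, using the connection string copied from the portal earlier.
Python
# Minimal sketch: create the input/output containers and upload the dataset.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")

for name in ("input", "output"):
    service.create_container(name)  # raises ResourceExistsError if the container already exists

with open("iris.csv", "rb") as data:
    service.get_blob_client(container="input", blob="iris.csv").upload_blob(data, overwrite=True)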
The following Python script loads the iris.csv dataset from the input container, performs a data manipulation process, and saves the results back to the output container.
Python
# Load libraries
from azure.storage.blob import BlobClient
import pandas as pd
# Define parameters
connectionString = "<storage-account-connection-string>"
containerName = "output"
outputBlobName = "iris_setosa.csv"
# Establish connection with the blob storage account
blob = BlobClient.from_connection_string(conn_str=connectionString, container_name=containerName, blob_name=outputBlobName)
# Load iris dataset from the task node
df = pd.read_csv("iris.csv")
# Take a subset of the records
df = df[df['Species'] == "setosa"]
# Save the subset of the iris dataframe locally in task node
df.to_csv(outputBlobName, index = False)
with open(outputBlobName, "rb") as data:
    blob.upload_blob(data)
Save the script as main.py and upload it to the Azure Storage input container. Be sure to test and validate its functionality locally before uploading it to your blob container:
Bash
python main.py
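After the local run, you can optionally confirm that the script worked end to end by reading the result back from the output container. This small sketch assumes the run above succeeded and uses the same connection string placeholder.
Python
# Minimal sketch: read the filtered CSV back from the output container.
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    container_name="output",
    blob_name="iris_setosa.csv",
)
print(blob.download_blob().readall().decode("utf-8")[:200])  # first few lines of the result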
In this section, you'll create and validate a pipeline that uses your Python script. (An optional programmatic sketch follows these steps.)
Follow the steps in the "Create a data factory" section of this article to create a data factory.
In the Factory Resources box, select the + (plus) button, and then select Pipeline.
In the General tab, set the name of the pipeline as "Run Python".
In the Activities box, expand Batch Service. Drag the custom activity from the Activities toolbox to the pipeline designer surface. Fill out the following tabs for the custom activity:
In the General tab, specify testPipeline for Name.
In the Azure Batch tab, add the Batch account that was created in the previous steps, and then select Test connection to ensure that it is successful.
In the Settings tab, specify python main.py as the Command.
Click Validate on the pipeline toolbar above the canvas to validate the pipeline settings. Confirm that the pipeline has been successfully validated. To close the validation output, select the >> (right arrow) button.
Click Debug to test the pipeline and ensure that it works correctly.
Click Publish to publish the pipeline.
Click Trigger to run the Python script as part of a batch process.
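The same pipeline can also be defined in code. The sketch below uses azure-mgmt-datafactory; the linked service name "AzureBatchLinkedService", the factory and resource group placeholders, and the exact model signatures are assumptions that can vary between SDK versions, so treat the UI steps above as the authoritative path.
Python
# Rough sketch: define the "Run Python" pipeline with the Data Factory management SDK.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import CustomActivity, LinkedServiceReference, PipelineResource

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# The custom activity runs "python main.py" through the Azure Batch linked service.
activity = CustomActivity(
    name="testPipeline",
    command="python main.py",
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureBatchLinkedService"
    ),
)
pipeline = PipelineResource(activities=[activity])
client.pipelines.create_or_update("<resource-group>", "<data-factory-name>", "Run Python", pipeline)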
If warnings or errors are produced while your script runs, you can check stdout.txt or stderr.txt for more information about the output that was logged. A sketch that fetches these files with the Batch SDK follows the steps below.
Assuming that you named your pool custom-activity-pool, select the job adfv2-custom-activity-pool. View stdout.txt and stderr.txt to investigate and diagnose the problem.
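You can also pull the same log files with the azure-batch package instead of Batch Explorer. In the sketch below, the job ID adfv2-custom-activity-pool and the credential placeholders are assumptions taken from the steps above.
Python
# Minimal sketch: print stdout.txt and stderr.txt for the first task in the job.
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

credentials = SharedKeyCredentials("<batch-account-name>", "<batch-account-key>")
batch_client = BatchServiceClient(credentials, batch_url="<batch-account-url>")

job_id = "adfv2-custom-activity-pool"               # job created by the data factory
task = next(iter(batch_client.task.list(job_id)))   # first task in the job

for file_name in ("stdout.txt", "stderr.txt"):
    stream = batch_client.file.get_from_task(job_id, task.id, file_name)
    content = b"".join(stream)                      # the call returns an iterator of byte chunks
    print(f"--- {file_name} ---")
    print(content.decode("utf-8", errors="replace"))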
Although jobs and tasks themselves are not charged, compute nodes are. Therefore, it's recommended that you allocate pools only as needed. When you delete the pool, all task output on the nodes is deleted. However, the input and output files remain in the storage account. When they're no longer needed, you can also delete the Batch account and the storage account.
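As a convenience, the following sketch performs that cleanup in code, assuming the same account names and containers used above.
Python
# Minimal sketch: delete the Batch pool and the blob containers created in this tutorial.
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
from azure.storage.blob import BlobServiceClient

credentials = SharedKeyCredentials("<batch-account-name>", "<batch-account-key>")
batch_client = BatchServiceClient(credentials, batch_url="<batch-account-url>")
batch_client.pool.delete("custom-activity-pool")  # stops billing for the pool's compute nodes

blob_service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")
for container in ("input", "output"):
    blob_service.delete_container(container)      # removes the input and output files from storage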
In this tutorial, you learned how to run a Python script as part of an Azure Data Factory pipeline by using Azure Batch.
To learn more about Azure Data Factory, see:
Azure Data Factory overview
Use a data factory to copy data from one location in Azure Blob storage to another.
Learn about the compute environments that you can use with Azure Data Factory and Synapse Analytics pipelines (such as Azure HDInsight) to transform or process data.
Learn how to troubleshoot external control activities in Azure Data Factory and Azure Synapse Analytics pipelines.
Learn how to create custom activities by using .NET, and then use those activities in Azure Data Factory or Azure Synapse Analytics pipelines.