PySpark - Python Package Management
PySpark provides a way to strip the Python environment out of the container image. As is well known, reducing image size is a standard Docker optimization: it saves resources and improves efficiency, and shipping the Python environment separately optimizes the image "to a great extent".
In this post I mainly use conda to package the Python environment and try submitting jobs to a k8s cluster in both client and cluster mode.
# python=XXX, where XXX is the Python version to pin
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack python=XXX
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
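If conda is not available, the official PySpark package-management guide describes an equivalent flow with venv and venv-pack; a minimal sketch (the package list simply mirrors the conda example above):
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install pyarrow pandas venv-pack
venv-pack -o pyspark_venv.tar.gz
The sample application submitted below (app.py) computes a per-group mean with a pandas UDF: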
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql import SparkSession
def main(spark):
    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
        ("id", "v"))

    @pandas_udf("double")
    def mean_udf(v: pd.Series) -> float:
        return v.mean()

    print(df.groupby("id").agg(mean_udf(df['v'])).collect())

if __name__ == "__main__":
    main(SparkSession.builder.getOrCreate())
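For reference, the collect() call should return something like [Row(id=1, mean_udf(v)=1.5), Row(id=2, mean_udf(v)=6.0)], since 1.5 is the mean of 1.0 and 2.0 and 6.0 is the mean of 3.0, 5.0 and 10.0; the exact column name may vary with the Spark version.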
In client mode, both PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON must be set.
PYSPARK_PYTHON: the Python path inside the executor pod after the pyspark_conda_env.tar.gz uploaded via --archives has been decompressed.
PYSPARK_DRIVER_PYTHON: the Python path after pyspark_conda_env.tar.gz has been decompressed in the local container.
In client mode, the Python path given by PYSPARK_DRIVER_PYTHON is only used on the driver side, so it only needs to be available in the container being started locally; it does not need to exist on other nodes.
# Decompress!
# root@<driver-container>:/ppml/trusted-big-data-ml# mkdir -p pyspark_conda_env && tar -zxvf pyspark_conda_env.tar.gz -C pyspark_conda_env
export PYSPARK_DRIVER_PYTHON=/ppml/trusted-big-data-ml/pyspark_conda_env/bin/python # Do not set this in cluster mode.
export PYSPARK_PYTHON=./pyspark_conda_env/bin/python
# Submit the Spark job
${SPARK_HOME}/bin/spark-submit \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--deploy-mode client \
--conf spark.driver.host=${LOCAL_HOST} \
--master ${RUNTIME_SPARK_MASTER} \
--conf spark.kubernetes.executor.podTemplateFile=/ppml/trusted-big-data-ml/spark-executor-template.yaml \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.kubernetes.executor.deleteOnTermination=false \
--archives ./pyspark_conda_env.tar.gz#pyspark_conda_env \
local:///ppml/trusted-big-data-ml/app.py
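The #pyspark_conda_env suffix on --archives is the directory name the archive is unpacked under in each container's working directory, which is why PYSPARK_PYTHON points at the relative path ./pyspark_conda_env/bin/python. --archives is the command-line form of the spark.archives configuration (spark.yarn.dist.archives on YARN), so the same dependency could equally be declared as a conf entry; a sketch of just that fragment:
--conf spark.archives=./pyspark_conda_env.tar.gz#pyspark_conda_env \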
In cluster mode, only PYSPARK_PYTHON needs to be set.
PYSPARK_PYTHON: the Python path after pyspark_conda_env.tar.gz has been decompressed in the pod. pyspark_conda_env.tar.gz itself is delivered to the driver through the shared file system specified by spark.kubernetes.file.upload.path.
export PYSPARK_PYTHON=./pyspark_conda_env/bin/python
# Submit the Spark job
${SPARK_HOME}/bin/spark-submit \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--deploy-mode cluster \
--master ${RUNTIME_SPARK_MASTER} \
--conf spark.kubernetes.executor.podTemplateFile=/ppml/trusted-big-data-ml/spark-executor-template.yaml \
--conf spark.kubernetes.driver.podTemplateFile=/ppml/trusted-big-data-ml/spark-driver-template.yaml \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.kubernetes.executor.deleteOnTermination=false \
--conf spark.kubernetes.file.upload.path=/ppml/trusted-big-data-ml/work/data/shaojie \
--archives ./pyspark_conda_env.tar.gz#pyspark_conda_env \
local:///ppml/trusted-big-data-ml/work/data/shaojie/app.py
Running the job in cluster mode is a good opportunity to look at what spark.kubernetes.file.upload.path actually does. Here, --archives takes a comma-separated list of dependency packages (tar, jar, zip, and so on), which are decompressed into the working directory of each executor.
22-07-29 02:05:58 INFO SparkContext:57 - Added archive file:/ppml/trusted-big-data-ml/work/data/shaojie/spark-upload-06550aaa-76f6-4f6e-a123-d34c71dbce5c/pyspark_conda_env.tar.gz#pyspark_conda_env at spark://app-py-36a4238247b3d72e-driver-svc.default.svc:7078/files/pyspark_conda_env.tar.gz with timestamp 1659060357319
22-07-29 02:05:58 INFO Utils:57 - Copying /ppml/trusted-big-data-ml/work/data/shaojie/spark-upload-06550aaa-76f6-4f6e-a123-d34c71dbce5c/pyspark_conda_env.tar.gz to /tmp/spark-6c702f67-0531-49f8-9656-48d4e110eea1/pyspark_conda_env.tar.gz
INFO fork chmod is forbidden !!!/tmp/spark-6c702f67-0531-49f8-9656-48d4e110eea1/pyspark_conda_env.tar.gz
22-07-29 02:05:58 INFO SparkContext:57 - Unpacking an archive file:/ppml/trusted-big-data-ml/work/data/shaojie/spark-upload-06550aaa-76f6-4f6e-a123-d34c71dbce5c/pyspark_conda_env.tar.gz#pyspark_conda_env from /tmp/spark-6c702f67-0531-49f8-9656-48d4e110eea1/pyspark_conda_env.tar.gz to /var/data/spark-3024b9ad-8e4d-4b2a-b51a-aee8f54d5a46/spark-8acb95d2-599d-4e2f-8203-c1f3455c4c7f/userFiles-994a18bf-12cd-4d98-b3e9-1035f741fe67/pyspark_conda_en
From the logs above, the SparkContext first adds the archives uploaded under the spark.kubernetes.file.upload.path path to the driver [the path specified by spark.kubernetes.file.upload.path must be a shared file system that every node can reach, such as HDFS or NFS], then copies and unpacks the archive to the specified location on the driver. Two questions remain: why does submitting with --archives in client mode not require spark.kubernetes.file.upload.path to be specified? Is the archive mainly there for starting the driver? The documentation states that the value of this configuration should be a remote store.
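Read that way, spark.kubernetes.file.upload.path would normally point at remote storage rather than a path baked into the image; a hypothetical fragment using HDFS (the path below is illustrative only):
--conf spark.kubernetes.file.upload.path=hdfs:///tmp/spark-upload \
--archives ./pyspark_conda_env.tar.gz#pyspark_conda_env \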