PySpark - Python Package Management
PySpark provides a way to strip the Python environment out of the container image. As is well known, reducing image size is a standard Docker optimization: it saves resources and improves efficiency, and shipping the Python environment separately optimizes the image "to a great extent".
In this post I mainly use conda to package the Python environment and try submitting jobs to a k8s cluster in both client and cluster mode.
# python=XXX, where XXX is the Python version to pin
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack python=XXX
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
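If conda is not available, the official PySpark package-management guide describes an equivalent flow with venv and venv-pack; a minimal sketch (the package list simply mirrors the conda example above):
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install pyarrow pandas venv-pack
venv-pack -o pyspark_venv.tar.gz
The sample application submitted below (app.py) computes a per-group mean with a pandas UDF: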
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql import SparkSession
def main(spark):
    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
        ("id", "v"))

    @pandas_udf("double")
    def mean_udf(v: pd.Series) -> float:
        return v.mean()

    print(df.groupby("id").agg(mean_udf(df['v'])).collect())

if __name__ == "__main__":
    main(SparkSession.builder.getOrCreate())
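For reference, the collect() call should return something like [Row(id=1, mean_udf(v)=1.5), Row(id=2, mean_udf(v)=6.0)], since 1.5 is the mean of 1.0 and 2.0 and 6.0 is the mean of 3.0, 5.0 and 10.0; the exact column name may vary with the Spark version.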
In client mode, both PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON must be set.
PYSPARK_PYTHON: the Python path inside the executor pod after the pyspark_conda_env.tar.gz uploaded via --archives has been decompressed.
PYSPARK_DRIVER_PYTHON: the Python path after pyspark_conda_env.tar.gz has been decompressed in the local container.
In client mode, the Python path given by PYSPARK_DRIVER_PYTHON is only used on the driver side, so it only needs to be available in the container being started locally; it does not need to exist on other nodes.
# Decompress!
# root@<driver-container>:/ppml/trusted-big-data-ml# mkdir -p pyspark_conda_env && tar -zxvf pyspark_conda_env.tar.gz -C pyspark_conda_env
export PYSPARK_DRIVER_PYTHON=/ppml/trusted-big-data-ml/pyspark_conda_env/bin/python # Do not set this in cluster mode.
export PYSPARK_PYTHON=./pyspark_conda_env/bin/python
# Submit the Spark job
${SPARK_HOME}/bin/spark-submit \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--deploy-mode client \
--conf spark.driver.host=${LOCAL_HOST} \
--master ${RUNTIME_SPARK_MASTER} \
--conf spark.kubernetes.executor.podTemplateFile=/ppml/trusted-big-data-ml/spark-executor-template.yaml \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.kubernetes.executor.deleteOnTermination=false \
--archives ./pyspark_conda_env.tar.gz#pyspark_conda_env \
local:///ppml/trusted-big-data-ml/app.py
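The #pyspark_conda_env suffix on --archives is the directory name the archive is unpacked under in each container's working directory, which is why PYSPARK_PYTHON points at the relative path ./pyspark_conda_env/bin/python. --archives is the command-line form of the spark.archives configuration (spark.yarn.dist.archives on YARN), so the same dependency could equally be declared as a conf entry; a sketch of just that fragment:
--conf spark.archives=./pyspark_conda_env.tar.gz#pyspark_conda_env \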
In cluster mode, only PYSPARK_PYTHON needs to be set.
PYSPARK_PYTHON: the Python path after pyspark_conda_env.tar.gz has been decompressed in the pod. pyspark_conda_env.tar.gz itself is delivered to the driver through the shared file system specified by spark.kubernetes.file.upload.path.
export PYSPARK_PYTHON=./pyspark_conda_env/bin/python
# Submit the Spark job
${SPARK_HOME}/bin/spark-submit \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--deploy-mode cluster \
--master ${RUNTIME_SPARK_MASTER} \
--conf spark.kubernetes.executor.podTemplateFile=/ppml/trusted-big-data-ml/spark-executor-template.yaml \
--conf spark.kubernetes.driver.podTemplateFile=/ppml/trusted-big-data-ml/spark-driver-template.yaml \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.kubernetes.executor.deleteOnTermination=false \
--conf spark.kubernetes.file.upload.path=/ppml/trusted-big-data-ml/work/data/shaojie \
--archives ./pyspark_conda_env.tar.gz#pyspark_conda_env \
local:///ppml/trusted-big-data-ml/work/data/shaojie/app.py
Running the job in cluster mode is a good opportunity to look at what spark.kubernetes.file.upload.path actually does. Here, --archives takes a comma-separated list of dependency packages (tar, jar, zip, and so on), which are decompressed into the working directory of each executor.
22-07-29 02:05:58 INFO SparkContext:57 - Added archive file:/ppml/trusted-big-data-ml/work/data/shaojie/spark-upload-06550aaa-76f6-4f6e-a123-d34c71dbce5c/pyspark_conda_env.tar.gz#pyspark_conda_env at spark://app-py-36a4238247b3d72e-driver-svc.default.svc:7078/files/pyspark_conda_env.tar.gz with timestamp 1659060357319
22-07-29 02:05:58 INFO Utils:57 - Copying /ppml/trusted-big-data-ml/work/data/shaojie/spark-upload-06550aaa-76f6-4f6e-a123-d34c71dbce5c/pyspark_conda_env.tar.gz to /tmp/spark-6c702f67-0531-49f8-9656-48d4e110eea1/pyspark_conda_env.tar.gz
INFO fork chmod is forbidden !!!/tmp/spark-6c702f67-0531-49f8-9656-48d4e110eea1/pyspark_conda_env.tar.gz
22-07-29 02:05:58 INFO SparkContext:57 - Unpacking an archive file:/ppml/trusted-big-data-ml/work/data/shaojie/spark-upload-06550aaa-76f6-4f6e-a123-d34c71dbce5c/pyspark_conda_env.tar.gz#pyspark_conda_env from /tmp/spark-6c702f67-0531-49f8-9656-48d4e110eea1/pyspark_conda_env.tar.gz to /var/data/spark-3024b9ad-8e4d-4b2a-b51a-aee8f54d5a46/spark-8acb95d2-599d-4e2f-8203-c1f3455c4c7f/userFiles-994a18bf-12cd-4d98-b3e9-1035f741fe67/pyspark_conda_en
From the logs above, the SparkContext first adds the archives uploaded under the spark.kubernetes.file.upload.path path to the driver [the path specified by spark.kubernetes.file.upload.path must be a shared file system that every node can reach, such as HDFS or NFS], then copies and unpacks the archive to the specified location on the driver. Two questions remain: why does submitting with --archives in client mode not require spark.kubernetes.file.upload.path to be specified? Is the archive mainly there for starting the driver? The documentation states that the value of this configuration should be a remote store.
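Read that way, spark.kubernetes.file.upload.path would normally point at remote storage rather than a path baked into the image; a hypothetical fragment using HDFS (the path below is illustrative only):
--conf spark.kubernetes.file.upload.path=hdfs:///tmp/spark-upload \
--archives ./pyspark_conda_env.tar.gz#pyspark_conda_env \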