pyspark: How to Use Multiple Python Files in Spark on Yarn

Author: 紫菜包饭哟嘻 | Published 2017-06-16 14:33

    Requirement

    Split the main program into several submodules for easier reuse: util.py, module1.py, module2.py, main.py.
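
    As a minimal sketch (the module contents and function names below are assumptions, not taken from the original post), main.py could import and use the submodules like this:

    # main.py -- minimal sketch; util.clean, module1.transform, module2.report are hypothetical
    from pyspark import SparkConf, SparkContext

    import util      # shared helpers
    import module1   # first processing step
    import module2   # second processing step / reporting

    if __name__ == "__main__":
        sc = SparkContext(conf=SparkConf().setAppName("multi-module-demo"))
        data = sc.parallelize(range(10))
        result = data.map(util.clean).map(module1.transform).collect()
        module2.report(result)
        sc.stop()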

    Solution

    The modules main.py depends on (util.py, module1.py, module2.py) must first be packed into a single .zip file and shipped to YARN via spark-submit's --py-files option; only then can main.py import these submodules. The command is as follows:

    $ spark-submit --master=yarn --deploy-mode=cluster --jars elasticsearch-hadoop-5.3.1.jar --py-files deps.zip main.py
    
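    The post does not show how deps.zip is built. One straightforward way (an assumption, not part of the original article) is a small helper script using Python's standard zipfile module, which also keeps the .py files at the top level of the archive as required:

    # make_deps.py -- hypothetical helper that packages the submodules for --py-files
    import zipfile

    DEPS = ["util.py", "module1.py", "module2.py"]

    with zipfile.ZipFile("deps.zip", "w") as zf:
        for path in DEPS:
            zf.write(path, arcname=path)  # arcname keeps each file at the zip's top level

    An equivalent shell one-liner, zip deps.zip util.py module1.py module2.py, produces the same layout.
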

    Oozie spark-action

    Add the corresponding --py-files option inside the <spark-opts> element:

    <spark-opts>
      ${OTHER_OPTS} --py-files hdfs://${HDFS_HOST}:${HDFS_PORT}/${DEP_PATH}/deps.zip
    </spark-opts>
    
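
    This assumes deps.zip has already been uploaded to HDFS at the path referenced above. One way to do that (again an assumption, not shown in the original post) is to call the hdfs CLI from the packaging script:

    # upload_deps.py -- hypothetical helper that pushes deps.zip to the path used in <spark-opts>
    import subprocess

    # replace the placeholders with your cluster's actual host, port, and directory
    hdfs_dest = "hdfs://<HDFS_HOST>:<HDFS_PORT>/<DEP_PATH>/deps.zip"

    # -f overwrites any existing copy of deps.zip at the destination
    subprocess.check_call(["hdfs", "dfs", "-put", "-f", "deps.zip", hdfs_dest])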

    Software Versions

    Tested OK with Spark 1.6.2 and Oozie 4.2.

    References

    Third party packages: If you require a third party package to be installed on the executors, do so when setting up the node (for example with pip).
    Scripts: If you require custom scripts, create a *.zip (or *.egg) file containing all of your dependencies and ship it to the executors using the --py-files command line option. Two things to look out for:
    Make sure that you also include ‘nested dependencies’
    Make sure that your *.py files are at the top level of the *.zip file.
