Developing PySpark in PyCharm

Author: wangqiaoshi | Published 2018-01-10 20:06

    Download the Spark package
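    A minimal sketch of this step, assuming the Spark 2.0.2 / Hadoop 2.7 build (the py4j-0.10.3 path used below suggests a Spark 2.0.x release; the exact version and mirror URL are assumptions, adjust to match your setup):

    # Download and unpack a Spark release (version and URL are assumptions)
    wget https://archive.apache.org/dist/spark/spark-2.0.2/spark-2.0.2-bin-hadoop2.7.tgz
    tar -xzf spark-2.0.2-bin-hadoop2.7.tgz
    # ${spark_dir} in the steps below refers to this unpacked directory
    export spark_dir=$PWD/spark-2.0.2-bin-hadoop2.7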

    Configure parameters

    Configure Spark parameters
    vim ${spark_dir}/conf/spark-env.sh
    export SPARK_LOCAL_IP=$(ifconfig | grep -1a en0 | grep netmask | awk '{print $2}')
    export HADOOP_CONF_DIR=$SPARK_HOME/conf
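    The pipeline above scrapes en0's address out of ifconfig output. If it comes back empty on your machine, macOS can also report an interface's address directly (an alternative, not from the original post):

    # Alternative on macOS: ask for en0's address directly
    export SPARK_LOCAL_IP=$(ipconfig getifaddr en0)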

    vim ${spark_dir}/conf/spark-defaults.conf
    spark.master local
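    Here spark.master local runs everything in a single thread. For local development you may prefer local[*], which starts one worker thread per CPU core (a common variant, not part of the original config):

    spark.master local[*]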

    Configure the system environment
    vim ~/.bash_profile
    SPARK_HOME=${spark_dir}
    export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH
    export PYSPARK_DRIVER_PYTHON=ipython
    export PYSPARK_PYTHON=python
    export SPARK_HOME
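    To confirm the environment works outside PyCharm, a quick sanity check (it assumes the paths above match your Spark release; the py4j zip name varies by version):

    source ~/.bash_profile
    # Should print the Spark version instead of raising ImportError
    python -c "import pyspark; print(pyspark.__version__)"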

    PyCharm does read .bash_profile, but when it runs your code it overrides PYTHONPATH with its own value.
    So PYTHONPATH has to be set inside PyCharm as well, via the project interpreter's paths:

    Preferences -> Project Interpreter -> Show All -> select the interpreter, open its paths list, and add the $SPARK_HOME/python and py4j entries from PYTHONPATH above.


    image.png
    image.png image.png

    With this in place, you can develop Spark jobs locally:

    from __future__ import print_function
    import sys
    from random import random
    from operator import add
    
    from pyspark.sql import SparkSession
    
    if __name__ == "__main__":
        """
            Usage: pi [partitions]
        """
        spark = SparkSession.builder.appName("PythonPi").getOrCreate()
    
        partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
        n = 100000 * partitions
    
        def f(_):
            # Sample a point uniformly from the 2x2 square centered on
            # the origin; score 1 if it lands inside the unit circle.
            x = random() * 2 - 1
            y = random() * 2 - 1
            return 1 if x ** 2 + y ** 2 <= 1 else 0
    
        # Monte Carlo estimate: the fraction of points inside the circle
        # approximates pi/4.
        count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
        print("Pi is roughly %f" % (4.0 * count / n))
    
        spark.stop()
    
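    To run it, either hit Run in PyCharm or launch it from a shell. A hedged example (pi.py is an assumed filename; use whatever you saved the script as):

    # Run with the interpreter configured above...
    python pi.py 4
    # ...or go through spark-submit, which also reads spark-defaults.conf
    $SPARK_HOME/bin/spark-submit pi.py 4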
