01-pySpark Installation

Author: 过桥 | Published 2019-10-31 19:02

    Problem downloading Spark on Linux

    gzip: stdin: not in gzip format

    Download URL

    [mongodb@mongodb02 software]$ sudo curl -O https://www.apache.org/dyn/closer.lua/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 29675    0 29675    0     0  15922      0 --:--:--  0:00:01 --:--:-- 15920
    [mongodb@mongodb02 software]$ ll
    total 32
    -rw-r--r--. 1 root root 29675 Oct 29 16:51 spark-2.4.4-bin-hadoop2.7.tgz
    

    The downloaded archive fails to extract (note the file is only about 29 KB, far too small for a full Spark distribution):

    [mongodb@mongodb02 software]$ tar zxvf spark-2.4.4-bin-hadoop2.7.tgz
    
    gzip: stdin: not in gzip format
    tar: Child returned status 1
    tar: Error is not recoverable: exiting now
    

    Troubleshooting step 1: confirm the file downloaded completely.

    Troubleshooting step 2: extract without the explicit gzip flag (GNU tar auto-detects compression, so the result is the same):

    [mongodb@mongodb02 software]$ tar -xvf spark-2.4.4-bin-hadoop2.7.tgz
    
    gzip: stdin: not in gzip format
    tar: Child returned status 1
    tar: Error is not recoverable: exiting now
    

    Troubleshooting step 3: check the file format. It turns out the download URL points to a web page (the Apache mirror-selection page), not the archive itself:

    [mongodb@mongodb02 software]$ file spark-2.4.4-bin-hadoop2.7.tgz
    spark-2.4.4-bin-hadoop2.7.tgz: HTML document, ASCII text, with very long lines
    
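    The same check can be scripted: a gzip stream always begins with the magic bytes 0x1f 0x8b, while a saved HTML error page begins with ASCII text. A minimal sketch (the filename below is the one from this article; adjust the path as needed):

```python
# Check whether a downloaded "archive" is really gzip data or a saved HTML page.
# A gzip stream always starts with the two magic bytes 0x1f 0x8b.

def looks_like_gzip(path):
    with open(path, "rb") as f:
        magic = f.read(2)
    return magic == b"\x1f\x8b"

# Example usage (hypothetical local file):
# if not looks_like_gzip("spark-2.4.4-bin-hadoop2.7.tgz"):
#     print("Not a gzip archive; the URL probably returned an HTML page.")
```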

    Re-download the Spark archive from an actual mirror

    Download URL

    [mongodb@mongodb02 software]$ sudo curl -O http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
    

    Check the compression format

    [mongodb@mongodb02 software]$ file spark-2.4.4-bin-hadoop2.7.tgz
    spark-2.4.4-bin-hadoop2.7.tgz: gzip compressed data, from Unix, last modified: Wed Aug 28 05:30:23 2019
    

    Extract the archive

    [mongodb@mongodb02 software]$ sudo tar zxvf spark-2.4.4-bin-hadoop2.7.tgz
    
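    The verify-then-extract steps above can also be done with Python's standard library: tarfile.is_tarfile rejects a saved HTML page before any extraction is attempted. A sketch, with hypothetical paths:

```python
import tarfile

def extract_spark(archive_path, dest_dir="."):
    """Extract a .tgz archive after verifying it really is a tar file."""
    if not tarfile.is_tarfile(archive_path):
        raise ValueError(f"{archive_path} is not a valid tar archive "
                         "(possibly a saved HTML page)")
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest_dir)

# Example usage (hypothetical paths):
# extract_spark("spark-2.4.4-bin-hadoop2.7.tgz", "/opt")
```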

    Test

    Start pyspark

    [mongodb@mongodb02 spark-2.4.4-bin-hadoop2.7]$ ./bin/pyspark
    Python 2.7.5 (default, Oct 30 2018, 23:45:53) 
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    19/10/30 16:33:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
          /_/
    
    Using Python version 2.7.5 (default, Oct 30 2018 23:45:53)
    SparkSession available as 'spark'.
    >>> 
    

    Run some test code

    >>> rdd = sc.parallelize([1,2,3,4,5])
    >>> rdd.reduce(lambda x,y:x+y)
    15                                                                              
    >>> rdd.map(lambda x:x+1)
    PythonRDD[4] at RDD at PythonRDD.scala:53
    >>> rdd.reduce(lambda x,y:x+y)
    15
    >>> rdd.map(lambda x:x+1).reduce(lambda x,y:x+y)
    20
    >>> exit()
    [mongodb@mongodb02 spark-2.4.4-bin-hadoop2.7]$ 
    
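    The same arithmetic can be reproduced with plain Python builtins, which confirms the expected results (this is ordinary local Python, not Spark):

```python
from functools import reduce

data = [1, 2, 3, 4, 5]

# Equivalent of rdd.reduce(lambda x, y: x + y)
total = reduce(lambda x, y: x + y, data)
print(total)  # 15

# Equivalent of rdd.map(lambda x: x + 1).reduce(lambda x, y: x + y)
total_plus = reduce(lambda x, y: x + y, map(lambda x: x + 1, data))
print(total_plus)  # 20
```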

    As the code shows, Spark evaluates RDDs lazily: if no action is executed, the transformations are not actually applied to the RDD. Spark only records the order and type of each transformation, and runs them all together when an action is finally invoked.

    Transformations: operations such as map and filter that define a new RDD; they are recorded but not executed immediately.

    Actions: operations such as reduce, collect, and count that return a result to the driver; they trigger execution of all recorded transformations.
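    This lazy behaviour can be illustrated with a small pure-Python sketch (a toy model for illustration only, not PySpark's actual implementation): map only appends to a plan, and the whole plan runs when reduce is called.

```python
from functools import reduce

class LazyRDD:
    """Toy model of Spark's lazy evaluation (illustration, not real PySpark)."""

    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []          # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: record the step, return a new "RDD", compute nothing.
        return LazyRDD(self._data, self._plan + [fn])

    def reduce(self, fn):
        # Action: replay all recorded transformations, then reduce the result.
        items = self._data
        for step in self._plan:
            items = [step(x) for x in items]
        return reduce(fn, items)

rdd = LazyRDD([1, 2, 3, 4, 5])
mapped = rdd.map(lambda x: x + 1)         # nothing computed yet
print(rdd.reduce(lambda x, y: x + y))     # 15
print(mapped.reduce(lambda x, y: x + y))  # 20
```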

    Windows Installation

    Download the package and extract it to a directory, e.g.:

    D:\S_Software\spark-2.4.4-bin-hadoop2.7
    

    Add the bin directory to the system PATH environment variable

    My Computer -> right-click -> Properties -> Advanced system settings

    Path -> Edit -> add D:\S_Software\spark-2.4.4-bin-hadoop2.7\bin

    Test

    Run -> cmd -> pyspark

    For test code, see the Linux section above.
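    Whether the bin directory actually made it onto PATH can be checked from Python with shutil.which (a generic sketch; "pyspark" is the command added above, and a new terminal must be opened so the updated PATH is picked up):

```python
import shutil

def on_path(command):
    """Return True if `command` can be found on the current PATH."""
    return shutil.which(command) is not None

# Example usage (run in a freshly opened terminal):
# print(on_path("pyspark"))
```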
