Spark Hello World: loading local files and HDFS files

Author: 牛马风情 | Published: 2017-04-12 23:41

    This section covers:
    Using spark-shell to load local files and HDFS files

    The files Spark processes may live on the local file system or on a distributed file system
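
Which file system `sc.textFile()` reads from follows from the path's URI scheme: `file://` points at the local file system, while a path without a scheme is resolved against Hadoop's configured default file system (HDFS on this sandbox). The sketch below is plain Scala with no Spark dependency and only inspects the scheme. The `PathSchemes` object, the `schemeOf` helper, and the `namenode:8020` host and port are hypothetical names used for illustration.

```scala
import java.net.URI

object PathSchemes {
  // Returns the URI scheme of a path string, or None when the path carries no scheme.
  def schemeOf(path: String): Option[String] =
    Option(new URI(path).getScheme)

  def main(args: Array[String]): Unit = {
    val paths = Seq(
      "file:///home/guest/erhuan/hellospark",        // explicit local file system
      "/user/erhuan/hellospark",                     // no scheme: default file system is used
      "hdfs://namenode:8020/user/erhuan/hellospark"  // hypothetical NameNode host and port
    )
    paths.foreach { p =>
      println(s"$p -> ${schemeOf(p).getOrElse("(default file system)")}")
    }
  }
}
```

Writing the scheme explicitly, as the local example later in this article does, avoids depending on how the cluster's default file system is configured.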

    Loading a local file

    Create a test file

    [root@sandbox home]# cd /home/guest/
    
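    // Create a folder under the guest directory
    [root@sandbox guest]# mkdir erhuan
    // Create a test file inside the new folder
    [root@sandbox guest]# cd erhuan/
    [root@sandbox erhuan]# vi hellospark
    // The file holds a single line: this is a hello word txt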
    
    

    Start spark-shell

    [root@sandbox erhuan]# spark-shell
    Spark assembly has been built with Hive, including Datanucleus jars on classpath
    17/04/12 14:45:41 INFO SecurityManager: Changing view acls to: root
    17/04/12 14:45:41 INFO SecurityManager: Changing modify acls to: root
    17/04/12 14:45:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
    17/04/12 14:45:41 INFO HttpServer: Starting HTTP Server
    17/04/12 14:45:41 INFO Utils: Successfully started service 'HTTP class server' on port 47623.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 1.2.1
          /_/
    
    // (output omitted)
    

    Load the local file

    Spark context available as sc.
    // Load the file with sc.textFile()
    scala>  val textFile = sc.textFile("file:///home/guest/erhuan/hellospark")
    
    // (output omitted)
    textFile: org.apache.spark.rdd.RDD[String] = file:///home/guest/erhuan/hellospark MappedRDD[1] at textFile at <console>:12
    // Run an action
    scala> textFile.first()
    // (output omitted)
    17/04/12 14:53:27 WARN DomainSocketFactory: The short-circuit local reads feature cannot be 
    17/04/12 14:53:27 INFO DAGScheduler: Job 0 finished: first at <console>:15, took 0.306226 s
    res0: String = this is a hello word txt
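    // Spark only records the preceding operations without running them; the recorded work starts when an action is invoked
    
    Save the result to the local file system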
    scala> textFile.saveAsTextFile("file:///home/guest/erhuan/wordres")
    17/04/12 14:59:31 INFO DefaultExecutionContext: Starting job: saveAsTextFile at <console>:15
    17/04/12 14:59:31 INFO DAGScheduler: Got job 6 (saveAsTextFile at <console>:15) with 2 output partitions (allowLocal=false)
    // (output omitted)
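
The note above, that Spark records operations and runs them only when an action is invoked, can be pictured with a plain Scala `Iterator`. This is an analogy only, not Spark itself: an `Iterator` pipeline is not an RDD, and the `LazyAnalogy` object and its `evaluations` counter are hypothetical names introduced for this sketch.

```scala
object LazyAnalogy {
  // Counts how many times the mapping function has actually run.
  var evaluations = 0

  // Builds a lazy pipeline over the lines; defining it computes nothing.
  def upperCased(lines: Seq[String]): Iterator[String] =
    lines.iterator.map { line =>
      evaluations += 1
      line.toUpperCase
    }

  def main(args: Array[String]): Unit = {
    val pipeline = upperCased(Seq("this is a hello word txt", "second line"))
    println(s"after defining the pipeline: $evaluations evaluation(s)")

    // Requesting an element plays the role of an action such as first().
    val first = pipeline.next()
    println(s"after next(): $evaluations evaluation(s), result = $first")
  }
}
```

In the session above, `sc.textFile(...)` corresponds to defining the pipeline, while `first()` and `saveAsTextFile(...)` are the calls that start a job.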
    

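    Exit spark-shell and inspect the contents of the "/home/guest/erhuan/wordres" directory

    // Exit spark-shell (newer Scala REPLs no longer accept a bare exit; use :quit there)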
    scala> exit
    [root@sandbox erhuan]# cd wordres/
    [root@sandbox wordres]# ll
    total 4
    -rw-r--r-- 1 root root 25 2017-04-12 14:59 part-00000
    -rw-r--r-- 1 root root  0 2017-04-12 14:59 part-00001
    -rw-r--r-- 1 root root  0 2017-04-12 14:59 _SUCCESS
    [root@sandbox wordres]# more part-00000
    this is a hello word txt
    // This completes loading a local file with Spark and writing it back
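
As the listing above shows, `saveAsTextFile` writes a directory, not a single file: one `part-NNNNN` file per RDD partition, plus a `_SUCCESS` marker. Here the second partition happened to be empty. A minimal sketch, in plain Scala without Spark, of reading such a directory back as one list of lines; the `ReadPartFiles` object and `readOutput` helper are hypothetical names, and the default path is simply the one used in this article.

```scala
import java.io.File
import scala.io.Source

object ReadPartFiles {
  // Reads every part-* file in an output directory, in file-name order,
  // and returns all lines. Other files such as _SUCCESS are ignored.
  def readOutput(dir: File): List[String] = {
    val parts = Option(dir.listFiles()).getOrElse(Array.empty[File])
      .filter(f => f.isFile && f.getName.startsWith("part-"))
      .sortBy(_.getName)
      .toList
    parts.flatMap { f =>
      val src = Source.fromFile(f, "UTF-8")
      try src.getLines().toList finally src.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Defaults to the output directory written earlier in this article.
    val dir = new File(args.headOption.getOrElse("/home/guest/erhuan/wordres"))
    readOutput(dir).foreach(println)
  }
}
```

Sorting by file name keeps the lines in partition order, which matters once the output spans more than one non-empty part file.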
    
    

    Loading an HDFS file

    // First put the file on HDFS; to avoid permission problems, move it to the /tmp directory beforehand
    [root@sandbox tmp]# mv /home/guest/erhuan/hellospark /tmp
    [hdfs@sandbox tmp]$ hadoop fs -mkdir -p /user/erhuan
    [hdfs@sandbox tmp]$ hadoop fs -put /tmp/hellospark /user/erhuan
    // The two hadoop commands above, run as the hdfs user, create a directory on HDFS and upload the local file into it
    [root@sandbox erhuan]# spark-shell
    Spark assembly has been built with Hive, including Datanucleus jars on classpath
    17/04/12 14:45:41 INFO SecurityManager: Changing view acls to: root
    17/04/12 14:45:41 INFO SecurityManager: Changing modify acls to: root
    17/04/12 14:45:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
    17/04/12 14:45:41 INFO HttpServer: Starting HTTP Server
    17/04/12 14:45:41 INFO Utils: Successfully started service 'HTTP class server' on port 47623.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 1.2.1
          /_/
    
    // (output omitted)
    

    Load the HDFS file

    Spark context available as sc.
    // Load the file with sc.textFile(); the path has no scheme, so it is read from HDFS here
    scala> val textFile = sc.textFile("/user/erhuan/hellospark")
    17/04/12 15:33:29 INFO MemoryStore: ensureFreeSpace(277063) called with curMem=684755,
    // (output omitted)
    
    // Run an action to check that the load succeeded
    scala> textFile.first()
    // (output omitted)
    17/04/12 15:33:32 INFO DAGScheduler: Job 0 finished: first at <console>:15, took 0.543566 s
    res3: String = this is a hello word txt
    
    // Write the result back to HDFS
    
    scala> textFile.saveAsTextFile("/user/erhuan/res")
    17/04/12 15:36:34 INFO DefaultExecutionContext: Starting job: saveAsTextFile at <console>:15
    17/04/12 15:36:34 INFO DAGScheduler: Got job 1 (saveAsTextFile at <console>:15) with 2 output partitions (allowLocal=false)
    // (output omitted)
    
    // Exit spark-shell
    // Check the result
    [hdfs@sandbox tmp]$ hadoop fs -ls /user/erhuan/res
    Found 3 items
    -rw-r--r--   1 hdfs hdfs          0 2017-04-12 15:36 /user/erhuan/res/_SUCCESS
    -rw-r--r--   1 hdfs hdfs         25 2017-04-12 15:36 /user/erhuan/res/part-00000
    -rw-r--r--   1 hdfs hdfs          0 2017-04-12 15:36 /user/erhuan/res/part-00001
    [hdfs@sandbox tmp]$ hadoop fs -cat /user/erhuan/res/part-00000
    this is a hello word txt
    
    

    Article link: https://www.haomeiwen.com/subject/wgjvattx.html