Java-Spark Series 5: Introduction to Spark SQL

Author: 只是甲 | Published 2021-09-26 18:06

    1. Overview of Spark SQL

    1.1 Where Spark SQL Came From

    Hive is the de facto data warehouse standard in the big data world today.



    Hive's SQL model closely resembles that of an RDBMS, so it is easy to pick up. Its main drawback is that it runs on MapReduce underneath, which makes execution slow.

    In the Spark 0.x era, Shark was introduced. Shark was tightly coupled to Hive: much of its lower layer still depended on Hive, but its memory management, physical planning, and execution modules were rewritten to run on Spark's in-memory compute model, giving many-fold performance gains over Hive.

    Shark was retired in the Spark 1.x era: at the Spark Summit on July 1, 2014, Databricks announced the end of Shark development and shifted its focus to Spark SQL.

    After Shark ended, two branches emerged:

    1). Hive on Spark
    Maintained by the Hive community; the source lives in Hive.

    2). Spark SQL (Spark on Hive)
    Maintained by the Spark community; the source lives in Spark. It supports many data sources and optimization techniques, and is far more extensible.

    Since Spark SQL's source lives in Spark and includes a great deal of additional optimization code, it is generally the faster option. Hive on Spark suits teams that must stay within an existing Hive deployment, for example for ad-hoc data analysis; when performance matters, as with scheduled production reports, Spark SQL should be used.

    1.2 What Spark SQL Looks Like in Code

    Let's compare word count implemented with the Spark RDD API, the DataFrame API, and SQL:


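    Since the original figure showing the three implementations side by side is not reproduced here, the sketch below redoes the comparison in Java (the input path is hypothetical):

    package org.example;

    import java.util.Arrays;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    import scala.Tuple2;

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.explode;
    import static org.apache.spark.sql.functions.split;

    public class WordCountCompare {
        public static void main(String[] args) {
            SparkSession spark = SparkSession
                    .builder()
                    .appName("WordCountCompare")
                    .getOrCreate();

            String path = "file:///tmp/words.txt"; // hypothetical input file

            // 1) RDD API: functional transformations, no optimizer involved
            JavaRDD<String> lines = spark.read().textFile(path).javaRDD();
            lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                 .mapToPair(word -> new Tuple2<>(word, 1))
                 .reduceByKey(Integer::sum)
                 .collect()
                 .forEach(t -> System.out.println(t._1() + ": " + t._2()));

            // 2) DataFrame API: declarative, optimized by Catalyst
            Dataset<Row> words = spark.read().text(path)
                    .select(explode(split(col("value"), " ")).as("word"));
            words.groupBy("word").count().show();

            // 3) SQL: produces the same optimized plan as the DataFrame version
            words.createOrReplaceTempView("words");
            spark.sql("SELECT word, count(*) AS cnt FROM words GROUP BY word").show();

            spark.stop();
        }
    }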

    As the comparison shows, the Spark SQL version reads just like SQL against a relational database. It also illustrates Spark SQL's main features:
    1). Integrated
    Spark programs written with Spark SQL or the DataFrame API are simpler to write and faster to run.


    Under the hood, both Spark SQL and the DataFrame API compile down to RDD operations.

    2). Uniform data access
    DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources (see the sketch after this list).


    3). Hive integration
    Run SQL or HiveQL queries on existing warehouses.

    4). Standard connectivity
    Server mode provides industry-standard JDBC and ODBC connectivity for business-intelligence tools.
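    As a sketch of that uniform access (the file paths and JDBC credentials below are hypothetical):

    Dataset<Row> json    = spark.read().json("file:///tmp/people.json");
    Dataset<Row> parquet = spark.read().parquet("file:///tmp/people.parquet");
    Dataset<Row> jdbc    = spark.read().format("jdbc")
            .option("url", "jdbc:mysql://localhost:3306/test")
            .option("dbtable", "people")
            .option("user", "root")
            .option("password", "***")
            .load();

    // The same DataFrame operations apply regardless of origin,
    // and data can even be joined across sources:
    json.join(jdbc, "name").show();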

    1.3 Spark SQL from a Performance Perspective

    Benchmarks of these APIs across languages (the original figure is not reproduced here) show two things:
    1). RDD code in Python runs over twice as slow as the same code in Java/Scala.
    2). DataFrame code performs almost identically no matter which language drives it.

    So why is the RDD API so slow from Python?
    Python RDD code runs in separate Python worker processes alongside the JVM executors, so records must be serialized and shuttled between the two runtimes, paying for context switches and data transfer on every operation.

    Spark SQL also executes as RDDs underneath, but the intermediate execution plan is first optimized inside Spark's own optimization engine, which is why DataFrame performance is nearly identical across languages.
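    You can watch the optimizer at work by asking a DataFrame for its plans. A small sketch (the JSON path is hypothetical):

    import static org.apache.spark.sql.functions.col;

    Dataset<Row> people = spark.read().json("file:///tmp/people.json");
    // explain(true) prints the parsed, analyzed, and optimized logical plans plus the physical plan
    people.filter(col("age").gt(21)).select("name").explain(true);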

    2. Spark SQL Data Abstractions

    Spark SQL provides two new abstractions: DataFrame and Dataset.

    A Dataset is a distributed collection of data. It is a new interface added in Spark 1.6 that combines the benefits of RDDs (strong typing, the ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated with functional transformations (map, flatMap, filter, and so on). The Dataset API is available in Scala and Java. Python does not support the Dataset API, but because of Python's dynamic nature many of its benefits are already available (for example, you can access a row's fields by name as row.columnName). The situation is similar for R.

    A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources: structured data files, Hive tables, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows: in the Scala API, DataFrame is simply a type alias for Dataset[Row], while in the Java API users use Dataset<Row> to represent a DataFrame.

    2.1 DataFrame

    DataFrame's predecessor was SchemaRDD, which was renamed to DataFrame in Spark 1.3. It no longer inherits from RDD; instead it reimplements most of RDD's functionality itself.

    Like an RDD, a DataFrame is a distributed dataset:
    1). A DataFrame can be viewed as a distributed collection of Row objects that carries detailed schema information about its columns, which is what enables optimization; it not only offers more operators than an RDD but also supports execution-plan optimization.

    2). A DataFrame is closer to a two-dimensional table in a traditional database: besides the data itself, it records the structure of that data, i.e. the schema.

    3). DataFrames also support nested data types (struct, array, and map).

    4). The DataFrame API is a set of high-level relational operations that is more heavily optimized, and has a lower barrier to entry, than the functional RDD API.

    5). The downside of DataFrames is the lack of compile-time type-safety checks, so errors only surface at runtime.


    2.2 Dataset

    Dataset is a new interface added in Spark 1.6. Compared with an RDD it carries more descriptive information and is conceptually equivalent to a two-dimensional table in a relational database; compared with a DataFrame it retains type information, is strongly typed, and provides compile-time checking.

    Calling methods on a Dataset produces a logical plan, which Spark's optimizer rewrites before finally generating a physical plan that is submitted to the cluster for execution.

    Dataset subsumes the functionality of DataFrame; the two were unified in Spark 2.0, with DataFrame represented as Dataset[Row], i.e. a specialized Dataset.
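    A short Java sketch of the relationship (the Person bean is hypothetical; the Encoder argument is what carries the type information that makes a Dataset strongly typed):

    package org.example;

    import java.io.Serializable;
    import java.util.Arrays;

    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DatasetVsDataFrame {
        // Hypothetical JavaBean used to build a typed Dataset
        public static class Person implements Serializable {
            private String name;
            private int age;
            public String getName() { return name; }
            public void setName(String name) { this.name = name; }
            public int getAge() { return age; }
            public void setAge(int age) { this.age = age; }
        }

        public static void main(String[] args) {
            SparkSession spark = SparkSession
                    .builder()
                    .appName("DatasetVsDataFrame")
                    .getOrCreate();

            Person alice = new Person();
            alice.setName("Alice");
            alice.setAge(30);

            // Typed Dataset: the element type is checked at compile time
            Dataset<Person> ds = spark.createDataset(Arrays.asList(alice), Encoders.bean(Person.class));
            ds.map((MapFunction<Person, String>) Person::getName, Encoders.STRING()).show();

            // A DataFrame is just Dataset<Row>: columns are resolved by name at runtime,
            // so a misspelled column name here would fail only when the job runs
            Dataset<Row> df = ds.toDF();
            df.select("name", "age").show();

            spark.stop();
        }
    }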


    3. Using Spark SQL with Databases

    3.1 Querying Hive with Spark SQL

    The entry point into all Spark functionality is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder():

    import org.apache.spark.sql.SparkSession;
    
    SparkSession spark = SparkSession
      .builder()
      .appName("Java Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate();
    

    The full example code can be found at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.

    SparkSession in Spark 2.0 provides built-in support for Hive features, including writing queries in HiveQL, accessing Hive UDFs, and reading data from Hive tables. You do not need an existing Hive setup to use these features.
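    To opt in explicitly, enable Hive support on the builder. A minimal sketch; the CDH cluster used in the test runs below appears to pick up Hive automatically from the hive-site.xml on its classpath, but calling enableHiveSupport() makes the dependency explicit:

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession
      .builder()
      .appName("Java Spark Hive Example")
      .enableHiveSupport()   // HiveQL, Hive UDFs, and metastore-backed tables
      .getOrCreate();

    spark.sql("SHOW DATABASES").show();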

    3.1.1 Creating DataFrames

    With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.

    Here is an example that builds a DataFrame from the contents of a text file.
    Code:

    package org.example;
    
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    
    
    public class SparkSQLTest1 {
        public static void main(String[] args){
            SparkSession spark = SparkSession
                    .builder()
                    .appName("SparkSQLTest1")
                    .config("spark.some.config.option", "some-value")
                    .getOrCreate();
    
            // Each line of the text file becomes a Row with a single string column named "value"
            Dataset<Row> df = spark.read().text("file:///home/pyspark/idcard.txt");
            df.show();
            
            spark.stop();
        }
    }
    
    

    Test run:

    [root@hp2 javaspark]# spark-submit \
    >   --class org.example.SparkSQLTest1 \
    >   --master local[2] \
    >   /home/javaspark/SparkStudy-1.0-SNAPSHOT.jar
    21/08/10 14:41:54 INFO spark.SparkContext: Running Spark version 2.4.0-cdh6.3.1
    21/08/10 14:41:54 INFO logging.DriverLogger: Added a local log appender at: /tmp/spark-7bdf3b0f-d374-4b47-b78f-bdea057849cf/__driver_logs__/driver.log
    21/08/10 14:41:54 INFO spark.SparkContext: Submitted application: SparkSQLTest1
    21/08/10 14:41:54 INFO spark.SecurityManager: Changing view acls to: root
    21/08/10 14:41:54 INFO spark.SecurityManager: Changing modify acls to: root
    21/08/10 14:41:54 INFO spark.SecurityManager: Changing view acls groups to: 
    21/08/10 14:41:54 INFO spark.SecurityManager: Changing modify acls groups to: 
    21/08/10 14:41:54 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
    21/08/10 14:41:55 INFO util.Utils: Successfully started service 'sparkDriver' on port 36352.
    21/08/10 14:41:55 INFO spark.SparkEnv: Registering MapOutputTracker
    21/08/10 14:41:55 INFO spark.SparkEnv: Registering BlockManagerMaster
    21/08/10 14:41:55 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
    21/08/10 14:41:55 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
    21/08/10 14:41:55 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-59f07625-d934-48f5-a4d8-30f14fba5274
    21/08/10 14:41:55 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
    21/08/10 14:41:55 INFO spark.SparkEnv: Registering OutputCommitCoordinator
    21/08/10 14:41:55 INFO util.log: Logging initialized @1548ms
    21/08/10 14:41:55 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: 2018-09-05T05:11:46+08:00, git hash: 3ce520221d0240229c862b122d2b06c12a625732
    21/08/10 14:41:55 INFO server.Server: Started @1623ms
    21/08/10 14:41:55 INFO server.AbstractConnector: Started ServerConnector@b93aad{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
    21/08/10 14:41:55 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
    ....snip....
    21/08/10 14:41:55 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://hp2:4040
    21/08/10 14:41:55 INFO spark.SparkContext: Added JAR file:/home/javaspark/SparkStudy-1.0-SNAPSHOT.jar at spark://hp2:36352/jars/SparkStudy-1.0-SNAPSHOT.jar with timestamp 1628577715383
    21/08/10 14:41:55 INFO executor.Executor: Starting executor ID driver on host localhost
    21/08/10 14:41:55 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40459.
    21/08/10 14:41:55 INFO netty.NettyBlockTransferService: Server created on hp2:40459
    21/08/10 14:41:55 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
    21/08/10 14:41:55 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, hp2, 40459, None)
    21/08/10 14:41:55 INFO storage.BlockManagerMasterEndpoint: Registering block manager hp2:40459 with 366.3 MB RAM, BlockManagerId(driver, hp2, 40459, None)
    21/08/10 14:41:55 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, hp2, 40459, None)
    21/08/10 14:41:55 INFO storage.BlockManager: external shuffle service port = 7337
    21/08/10 14:41:55 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, hp2, 40459, None)
    21/08/10 14:41:55 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@245a060f{/metrics/json,null,AVAILABLE,@Spark}
    21/08/10 14:41:56 INFO scheduler.EventLoggingListener: Logging events to hdfs://nameservice1/user/spark/applicationHistory/local-1628577715422
    21/08/10 14:41:56 INFO spark.SparkContext: Registered listener com.cloudera.spark.lineage.NavigatorAppListener
    21/08/10 14:41:56 INFO logging.DriverLogger$DfsAsyncWriter: Started driver log file sync to: /user/spark/driverLogs/local-1628577715422_driver.log
    21/08/10 14:41:56 INFO internal.SharedState: loading hive config file: file:/etc/hive/conf.cloudera.hive/hive-site.xml
    21/08/10 14:41:56 INFO internal.SharedState: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('/user/hive/warehouse').
    21/08/10 14:41:56 INFO internal.SharedState: Warehouse path is '/user/hive/warehouse'.
    21/08/10 14:41:56 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5922d3e9{/SQL,null,AVAILABLE,@Spark}
    21/08/10 14:41:56 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7d57dbb5{/SQL/json,null,AVAILABLE,@Spark}
    21/08/10 14:41:56 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5f95f1e1{/SQL/execution,null,AVAILABLE,@Spark}
    21/08/10 14:41:56 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@672a1c62{/SQL/execution/json,null,AVAILABLE,@Spark}
    21/08/10 14:41:56 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6015a4a5{/static/sql,null,AVAILABLE,@Spark}
    21/08/10 14:41:57 INFO state.StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
    21/08/10 14:41:58 INFO datasources.FileSourceStrategy: Pruning directories with: 
    21/08/10 14:41:58 INFO datasources.FileSourceStrategy: Post-Scan Filters: 
    21/08/10 14:41:58 INFO datasources.FileSourceStrategy: Output Data Schema: struct<value: string>
    21/08/10 14:41:58 INFO execution.FileSourceScanExec: Pushed Filters: 
    21/08/10 14:41:58 INFO codegen.CodeGenerator: Code generated in 185.453035 ms
    21/08/10 14:41:59 INFO codegen.CodeGenerator: Code generated in 11.048328 ms
    21/08/10 14:41:59 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 342.0 KB, free 366.0 MB)
    21/08/10 14:41:59 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 32.3 KB, free 365.9 MB)
    21/08/10 14:41:59 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on hp2:40459 (size: 32.3 KB, free: 366.3 MB)
    21/08/10 14:41:59 INFO spark.SparkContext: Created broadcast 0 from show at SparkSQLTest1.java:17
    21/08/10 14:41:59 INFO execution.FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
    21/08/10 14:41:59 INFO spark.SparkContext: Starting job: show at SparkSQLTest1.java:17
    21/08/10 14:41:59 INFO scheduler.DAGScheduler: Got job 0 (show at SparkSQLTest1.java:17) with 1 output partitions
    21/08/10 14:41:59 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (show at SparkSQLTest1.java:17)
    21/08/10 14:41:59 INFO scheduler.DAGScheduler: Parents of final stage: List()
    21/08/10 14:41:59 INFO scheduler.DAGScheduler: Missing parents: List()
    21/08/10 14:41:59 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at show at SparkSQLTest1.java:17), which has no missing parents
    21/08/10 14:41:59 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.4 KB, free 365.9 MB)
    21/08/10 14:41:59 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.9 KB, free 365.9 MB)
    21/08/10 14:41:59 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on hp2:40459 (size: 3.9 KB, free: 366.3 MB)
    21/08/10 14:41:59 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1164
    21/08/10 14:41:59 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at show at SparkSQLTest1.java:17) (first 15 tasks are for partitions Vector(0))
    21/08/10 14:41:59 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
    21/08/10 14:41:59 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 8309 bytes)
    21/08/10 14:41:59 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
    21/08/10 14:41:59 INFO executor.Executor: Fetching spark://hp2:36352/jars/SparkStudy-1.0-SNAPSHOT.jar with timestamp 1628577715383
    21/08/10 14:41:59 INFO client.TransportClientFactory: Successfully created connection to hp2/10.31.1.124:36352 after 35 ms (0 ms spent in bootstraps)
    21/08/10 14:41:59 INFO util.Utils: Fetching spark://hp2:36352/jars/SparkStudy-1.0-SNAPSHOT.jar to /tmp/spark-7bdf3b0f-d374-4b47-b78f-bdea057849cf/userFiles-ddafbf49-ca6a-4c28-83fe-9e4ff2765c5e/fetchFileTemp442678599817678402.tmp
    21/08/10 14:41:59 INFO executor.Executor: Adding file:/tmp/spark-7bdf3b0f-d374-4b47-b78f-bdea057849cf/userFiles-ddafbf49-ca6a-4c28-83fe-9e4ff2765c5e/SparkStudy-1.0-SNAPSHOT.jar to class loader
    21/08/10 14:41:59 INFO datasources.FileScanRDD: Reading File path: file:///home/pyspark/idcard.txt, range: 0-209, partition values: [empty row]
    21/08/10 14:41:59 INFO codegen.CodeGenerator: Code generated in 9.418721 ms
    21/08/10 14:41:59 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1441 bytes result sent to driver
    21/08/10 14:41:59 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 256 ms on localhost (executor driver) (1/1)
    21/08/10 14:41:59 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
    21/08/10 14:41:59 INFO scheduler.DAGScheduler: ResultStage 0 (show at SparkSQLTest1.java:17) finished in 0.366 s
    21/08/10 14:41:59 INFO scheduler.DAGScheduler: Job 0 finished: show at SparkSQLTest1.java:17, took 0.425680 s
    21/08/10 14:41:59 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.cloudera.hive/hive-site.xml
    21/08/10 14:41:59 INFO hive.HiveUtils: Initializing HiveMetastoreConnection version 2.1 using Spark classes.
    21/08/10 14:42:00 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.cloudera.hive/hive-site.xml
    21/08/10 14:42:00 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/1be0271b-776b-42c6-9660-d8f73c14ef2f
    21/08/10 14:42:00 INFO session.SessionState: Created local directory: /tmp/root/1be0271b-776b-42c6-9660-d8f73c14ef2f
    21/08/10 14:42:00 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/1be0271b-776b-42c6-9660-d8f73c14ef2f/_tmp_space.db
    21/08/10 14:42:00 INFO client.HiveClientImpl: Warehouse location for Hive client (version 2.1.1) is /user/hive/warehouse
    ....snip....
    21/08/10 14:42:00 INFO hive.metastore: HMS client filtering is enabled.
    21/08/10 14:42:00 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
    21/08/10 14:42:00 INFO hive.metastore: Opened a connection to metastore, current connections: 1
    21/08/10 14:42:00 INFO hive.metastore: Connected to metastore.
    21/08/10 14:42:01 INFO metadata.Hive: Registering function getdegree myUdf.getDegree
    +------------------+
    |             value|
    +------------------+
    |440528*******63016|
    |350525*******60813|
    |120102*******10789|
    |452123*******30416|
    |440301*******22322|
    |441421*******54614|
    |440301*******55416|
    |232721*******40630|
    |362204*******88412|
    |430281*******91015|
    |420117*******88355|
    +------------------+
    
    21/08/10 14:42:01 INFO spark.SparkContext: Invoking stop() from shutdown hook
    21/08/10 14:42:01 INFO server.AbstractConnector: Stopped Spark@b93aad{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
    21/08/10 14:42:01 INFO ui.SparkUI: Stopped Spark web UI at http://hp2:4040
    21/08/10 14:42:01 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    21/08/10 14:42:01 INFO memory.MemoryStore: MemoryStore cleared
    21/08/10 14:42:01 INFO storage.BlockManager: BlockManager stopped
    21/08/10 14:42:01 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
    21/08/10 14:42:01 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    21/08/10 14:42:01 INFO spark.SparkContext: Successfully stopped SparkContext
    21/08/10 14:42:01 INFO util.ShutdownHookManager: Shutdown hook called
    21/08/10 14:42:01 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-45aa512d-c4a1-44f4-8152-2273f1e78bda
    21/08/10 14:42:01 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-7bdf3b0f-d374-4b47-b78f-bdea057849cf
    [root@hp2 javaspark]# 
    

    3.1.2 Running SQL Queries Programmatically

    Code:

    package org.example;
    
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    
    public class SparkSQLTest2 {
        public static void main(String[] args){
            SparkSession spark = SparkSession
                    .builder()
                    .appName("SparkSQLTest2")
                    .config("spark.some.config.option", "some-value")
                    .getOrCreate();
    
            // Run SQL directly against a Hive table; the result comes back as a DataFrame
            Dataset<Row> sqlDF = spark.sql("SELECT * FROM test.ods_fact_sale limit 100");
            sqlDF.show();
            
            spark.stop();
        }
    
    }
    
    

    Test run:

    [14:49:40] [root@hp2 javaspark]# spark-submit \
    [14:49:40] >   --class org.example.SparkSQLTest2 \
    [14:49:40] >   --master local[2] \
    [14:49:41] >   /home/javaspark/SparkStudy-1.0-SNAPSHOT.jar
    [14:49:42] 21/08/10 14:49:46 INFO spark.SparkContext: Running Spark version 2.4.0-cdh6.3.1
    [14:49:42] 21/08/10 14:49:46 INFO logging.DriverLogger: Added a local log appender at: /tmp/spark-7c425e78-967b-4dbc-9552-1917c8d38b2f/__driver_logs__/driver.log
    [14:49:42] 21/08/10 14:49:46 INFO spark.SparkContext: Submitted application: SparkSQLTest2
    [14:49:42] 21/08/10 14:49:46 INFO spark.SecurityManager: Changing view acls to: root
    [14:49:42] 21/08/10 14:49:46 INFO spark.SecurityManager: Changing modify acls to: root
    [14:49:42] 21/08/10 14:49:46 INFO spark.SecurityManager: Changing view acls groups to: 
    [14:49:42] 21/08/10 14:49:46 INFO spark.SecurityManager: Changing modify acls groups to: 
    ....snip....
    21/08/10 14:49:56 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 237) in 53 ms on localhost (executor driver) (1/1)
    21/08/10 14:49:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
    21/08/10 14:49:56 INFO scheduler.DAGScheduler: ResultStage 1 (show at SparkSQLTest2.java:16) finished in 0.064 s
    21/08/10 14:49:56 INFO scheduler.DAGScheduler: Job 0 finished: show at SparkSQLTest2.java:16, took 3.417211 s
    +---+--------------------+---------+---------+
    | id|           sale_date|prod_name|sale_nums|
    +---+--------------------+---------+---------+
    |  1|2011-08-16 00:00:...|    PROD4|       28|
    |  2|2011-11-06 00:00:...|    PROD6|       19|
    |  3|2011-04-25 00:00:...|    PROD8|       29|
    |  4|2011-09-12 00:00:...|    PROD2|       88|
    |  5|2011-05-15 00:00:...|    PROD5|       76|
    |  6|2011-02-23 00:00:...|    PROD6|       64|
    |  7|2012-09-26 00:00:...|    PROD2|       38|
    |  8|2012-02-14 00:00:...|    PROD6|       45|
    |  9|2010-04-22 00:00:...|    PROD8|       57|
    | 10|2010-10-31 00:00:...|    PROD5|       65|
    | 11|2010-10-24 00:00:...|    PROD2|       33|
    | 12|2011-02-11 00:00:...|    PROD9|       27|
    | 13|2012-07-10 00:00:...|    PROD8|       48|
    | 14|2011-02-23 00:00:...|    PROD6|       46|
    | 15|2010-08-10 00:00:...|    PROD4|       50|
    | 16|2011-05-02 00:00:...|    PROD6|       22|
    | 17|2012-07-20 00:00:...|    PROD2|       56|
    | 18|2012-07-12 00:00:...|    PROD9|       57|
    | 19|2011-11-18 00:00:...|    PROD6|       58|
    | 20|2010-04-22 00:00:...|    PROD4|        7|
    +---+--------------------+---------+---------+
    only showing top 20 rows
    
    21/08/10 14:49:56 INFO spark.SparkContext: Invoking stop() from shutdown hook
    21/08/10 14:49:56 INFO server.AbstractConnector: Stopped Spark@51768776{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
    21/08/10 14:49:56 INFO ui.SparkUI: Stopped Spark web UI at http://hp2:4040
    21/08/10 14:49:56 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    21/08/10 14:49:56 INFO memory.MemoryStore: MemoryStore cleared
    21/08/10 14:49:56 INFO storage.BlockManager: BlockManager stopped
    21/08/10 14:49:56 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
    21/08/10 14:49:56 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    21/08/10 14:49:56 INFO spark.SparkContext: Successfully stopped SparkContext
    21/08/10 14:49:56 INFO util.ShutdownHookManager: Shutdown hook called
    21/08/10 14:49:56 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-0d720459-d9a4-4bb5-b9ce-521f524c87c8
    21/08/10 14:49:56 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-7c425e78-967b-4dbc-9552-1917c8d38b2f
    [root@hp2 javaspark]# 
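    Besides querying Hive tables directly, any DataFrame can be registered as a temporary view and queried with SQL. A minimal sketch, reusing the text file from section 3.1.1:

    Dataset<Row> df = spark.read().text("file:///home/pyspark/idcard.txt");
    df.createOrReplaceTempView("idcard");
    // A temporary view lives only for the lifetime of this SparkSession
    spark.sql("SELECT value FROM idcard WHERE length(value) = 18").show();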
    

    3.2 Querying MySQL with Spark SQL

    Spark SQL can query not only Hive but also remote relational databases such as MySQL over JDBC.

    Code:

    package org.example;
    
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    
    public class SparkSQLTest3 {
        public static void main(String[] args){
            SparkSession spark = SparkSession
                    .builder()
                    .appName("SparkSQLTest3")
                    .config("spark.some.config.option", "some-value")
                    .getOrCreate();
    
            // Read from MySQL over JDBC; "dbtable" accepts a table name or a parenthesized subquery with an alias
            Dataset<Row> jdbcDF = spark.read()
                    .format("jdbc")
                    .option("url", "jdbc:mysql://10.31.1.123:3306/test")
                    .option("dbtable", "(SELECT * FROM EMP) tmp")
                    .option("user", "root")
                    .option("password", "abc123")
                    .load();
    
            jdbcDF.printSchema();
            jdbcDF.show();
    
            spark.stop();
        }
    
    }
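    Note that the MySQL JDBC driver jar must be on the classpath when submitting (for example via --jars). For large tables, the read can also be split into parallel partitions. A sketch, with hypothetical partition bounds:

    Dataset<Row> jdbcDF = spark.read()
            .format("jdbc")
            .option("url", "jdbc:mysql://10.31.1.123:3306/test")
            .option("dbtable", "EMP")
            .option("user", "root")
            .option("password", "abc123")
            .option("partitionColumn", "empno") // numeric column used to split the read
            .option("lowerBound", "7000")       // min/max of empno across partitions
            .option("upperBound", "8000")
            .option("numPartitions", "4")       // four concurrent JDBC connections
            .load();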
    
    

    Test run:

    [root@hp2 javaspark]# spark-submit \
    >   --class org.example.SparkSQLTest3 \
    >   --master local[2] \
    >   /home/javaspark/SparkStudy-1.0-SNAPSHOT.jar
    21/08/10 15:04:01 INFO spark.SparkContext: Running Spark version 2.4.0-cdh6.3.1
    21/08/10 15:04:01 INFO logging.DriverLogger: Added a local log appender at: /tmp/spark-5fdcedf0-eeb8-4089-961f-42a1048761c6/__driver_logs__/driver.log
    21/08/10 15:04:01 INFO spark.SparkContext: Submitted application: SparkSQLTest3
    21/08/10 15:04:01 INFO spark.SecurityManager: Changing view acls to: root
    21/08/10 15:04:01 INFO spark.SecurityManager: Changing modify acls to: root
    21/08/10 15:04:01 INFO spark.SecurityManager: Changing view acls groups to: 
    21/08/10 15:04:01 INFO spark.SecurityManager: Changing modify acls groups to: 
    21/08/10 15:04:01 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
    21/08/10 15:04:02 INFO util.Utils: Successfully started service 'sparkDriver' on port 44266.
    21/08/10 15:04:02 INFO spark.SparkEnv: Registering MapOutputTracker
    21/08/10 15:04:02 INFO spark.SparkEnv: Registering BlockManagerMaster
    21/08/10 15:04:02 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
    21/08/10 15:04:02 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
    21/08/10 15:04:02 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-1781c08b-9d73-427d-9bc0-2e7e092a2f16
    21/08/10 15:04:02 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
    21/08/10 15:04:02 INFO spark.SparkEnv: Registering OutputCommitCoordinator
    21/08/10 15:04:02 INFO util.log: Logging initialized @1556ms
    21/08/10 15:04:02 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: 2018-09-05T05:11:46+08:00, git hash: 3ce520221d0240229c862b122d2b06c12a625732
    21/08/10 15:04:02 INFO server.Server: Started @1633ms
    21/08/10 15:04:02 INFO server.AbstractConnector: Started ServerConnector@b93aad{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
    21/08/10 15:04:02 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
    ....snip....
    21/08/10 15:04:02 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://hp2:4040
    21/08/10 15:04:02 INFO spark.SparkContext: Added JAR file:/home/javaspark/SparkStudy-1.0-SNAPSHOT.jar at spark://hp2:44266/jars/SparkStudy-1.0-SNAPSHOT.jar with timestamp 1628579042508
    21/08/10 15:04:02 INFO executor.Executor: Starting executor ID driver on host localhost
    21/08/10 15:04:02 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43617.
    21/08/10 15:04:02 INFO netty.NettyBlockTransferService: Server created on hp2:43617
    21/08/10 15:04:02 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
    21/08/10 15:04:02 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, hp2, 43617, None)
    21/08/10 15:04:02 INFO storage.BlockManagerMasterEndpoint: Registering block manager hp2:43617 with 366.3 MB RAM, BlockManagerId(driver, hp2, 43617, None)
    21/08/10 15:04:02 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, hp2, 43617, None)
    21/08/10 15:04:02 INFO storage.BlockManager: external shuffle service port = 7337
    21/08/10 15:04:02 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, hp2, 43617, None)
    21/08/10 15:04:02 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@245a060f{/metrics/json,null,AVAILABLE,@Spark}
    21/08/10 15:04:03 INFO scheduler.EventLoggingListener: Logging events to hdfs://nameservice1/user/spark/applicationHistory/local-1628579042554
    21/08/10 15:04:03 INFO spark.SparkContext: Registered listener com.cloudera.spark.lineage.NavigatorAppListener
    21/08/10 15:04:03 INFO logging.DriverLogger$DfsAsyncWriter: Started driver log file sync to: /user/spark/driverLogs/local-1628579042554_driver.log
    21/08/10 15:04:03 INFO internal.SharedState: loading hive config file: file:/etc/hive/conf.cloudera.hive/hive-site.xml
    21/08/10 15:04:03 INFO internal.SharedState: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('/user/hive/warehouse').
    21/08/10 15:04:03 INFO internal.SharedState: Warehouse path is '/user/hive/warehouse'.
    21/08/10 15:04:03 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5922d3e9{/SQL,null,AVAILABLE,@Spark}
    21/08/10 15:04:03 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7d57dbb5{/SQL/json,null,AVAILABLE,@Spark}
    21/08/10 15:04:03 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5f95f1e1{/SQL/execution,null,AVAILABLE,@Spark}
    21/08/10 15:04:03 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@672a1c62{/SQL/execution/json,null,AVAILABLE,@Spark}
    21/08/10 15:04:03 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6015a4a5{/static/sql,null,AVAILABLE,@Spark}
    21/08/10 15:04:04 INFO state.StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
    root
     |-- empno: integer (nullable = true)
     |-- ename: string (nullable = true)
     |-- job: string (nullable = true)
     |-- mgr: integer (nullable = true)
     |-- hiredate: date (nullable = true)
     |-- sal: decimal(7,2) (nullable = true)
     |-- comm: decimal(7,2) (nullable = true)
     |-- deptno: integer (nullable = true)
    
    21/08/10 15:04:06 INFO codegen.CodeGenerator: Code generated in 194.297923 ms
    21/08/10 15:04:06 INFO codegen.CodeGenerator: Code generated in 32.1523 ms
    21/08/10 15:04:06 INFO spark.SparkContext: Starting job: show at SparkSQLTest3.java:24
    21/08/10 15:04:06 INFO scheduler.DAGScheduler: Got job 0 (show at SparkSQLTest3.java:24) with 1 output partitions
    21/08/10 15:04:06 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (show at SparkSQLTest3.java:24)
    21/08/10 15:04:06 INFO scheduler.DAGScheduler: Parents of final stage: List()
    21/08/10 15:04:06 INFO scheduler.DAGScheduler: Missing parents: List()
    21/08/10 15:04:06 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at show at SparkSQLTest3.java:24), which has no missing parents
    21/08/10 15:04:06 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 10.9 KB, free 366.3 MB)
    21/08/10 15:04:06 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.1 KB, free 366.3 MB)
    21/08/10 15:04:06 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on hp2:43617 (size: 5.1 KB, free: 366.3 MB)
    21/08/10 15:04:06 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1164
    21/08/10 15:04:06 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at show at SparkSQLTest3.java:24) (first 15 tasks are for partitions Vector(0))
    21/08/10 15:04:06 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
    21/08/10 15:04:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7690 bytes)
    21/08/10 15:04:07 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
    21/08/10 15:04:07 INFO executor.Executor: Fetching spark://hp2:44266/jars/SparkStudy-1.0-SNAPSHOT.jar with timestamp 1628579042508
    21/08/10 15:04:07 INFO client.TransportClientFactory: Successfully created connection to hp2/10.31.1.124:44266 after 37 ms (0 ms spent in bootstraps)
    21/08/10 15:04:07 INFO util.Utils: Fetching spark://hp2:44266/jars/SparkStudy-1.0-SNAPSHOT.jar to /tmp/spark-5fdcedf0-eeb8-4089-961f-42a1048761c6/userFiles-651b34b1-fee3-4d6b-b332-9246d1f6e35b/fetchFileTemp859077269082454607.tmp
    21/08/10 15:04:07 INFO executor.Executor: Adding file:/tmp/spark-5fdcedf0-eeb8-4089-961f-42a1048761c6/userFiles-651b34b1-fee3-4d6b-b332-9246d1f6e35b/SparkStudy-1.0-SNAPSHOT.jar to class loader
    21/08/10 15:04:07 INFO jdbc.JDBCRDD: closed connection
    21/08/10 15:04:07 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1884 bytes result sent to driver
    21/08/10 15:04:07 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 252 ms on localhost (executor driver) (1/1)
    21/08/10 15:04:07 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
    21/08/10 15:04:07 INFO scheduler.DAGScheduler: ResultStage 0 (show at SparkSQLTest3.java:24) finished in 0.577 s
    21/08/10 15:04:07 INFO scheduler.DAGScheduler: Job 0 finished: show at SparkSQLTest3.java:24, took 0.632211 s
    21/08/10 15:04:07 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.cloudera.hive/hive-site.xml
    21/08/10 15:04:07 INFO hive.HiveUtils: Initializing HiveMetastoreConnection version 2.1 using Spark classes.
    21/08/10 15:04:07 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.cloudera.hive/hive-site.xml
    21/08/10 15:04:07 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/75efdbe5-01fb-4f39-b04a-82982a3b5298
    21/08/10 15:04:07 INFO session.SessionState: Created local directory: /tmp/root/75efdbe5-01fb-4f39-b04a-82982a3b5298
    21/08/10 15:04:07 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/75efdbe5-01fb-4f39-b04a-82982a3b5298/_tmp_space.db
    21/08/10 15:04:07 INFO client.HiveClientImpl: Warehouse location for Hive client (version 2.1.1) is /user/hive/warehouse
    21/08/10 15:04:08 INFO hive.metastore: HMS client filtering is enabled.
    21/08/10 15:04:08 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
    21/08/10 15:04:08 INFO hive.metastore: Opened a connection to metastore, current connections: 1
    21/08/10 15:04:08 INFO hive.metastore: Connected to metastore.
    21/08/10 15:04:08 INFO metadata.Hive: Registering function getdegree myUdf.getDegree
    +-----+------+---------+----+----------+-------+-------+------+
    |empno| ename|      job| mgr|  hiredate|    sal|   comm|deptno|
    +-----+------+---------+----+----------+-------+-------+------+
    | 7369| SMITH|    CLERK|7902|1980-12-17| 800.00|   null|    20|
    | 7499| ALLEN| SALESMAN|7698|1981-02-20|1600.00| 300.00|    30|
    | 7521|  WARD| SALESMAN|7698|1981-02-22|1250.00| 500.00|    30|
    | 7566| JONES|  MANAGER|7839|1981-04-02|2975.00|   null|    20|
    | 7654|MARTIN| SALESMAN|7698|1981-09-28|1250.00|1400.00|    30|
    | 7698| BLAKE|  MANAGER|7839|1981-05-01|2850.00|   null|    30|
    | 7782| CLARK|  MANAGER|7839|1981-06-09|2450.00|   null|    10|
    | 7788| SCOTT|  ANALYST|7566|1987-06-13|3000.00|   null|    20|
    | 7839|  KING|PRESIDENT|null|1981-11-17|5000.00|   null|    10|
    | 7844|TURNER| SALESMAN|7698|1981-09-08|1500.00|   0.00|    30|
    | 7876| ADAMS|    CLERK|7788|1987-06-13|1100.00|   null|    20|
    | 7900| JAMES|    CLERK|7698|1981-12-03| 950.00|   null|    30|
    | 7902|  FORD|  ANALYST|7566|1981-12-03|3000.00|   null|    20|
    | 7934|MILLER|    CLERK|7782|1982-01-23|1300.00|   null|    10|
    +-----+------+---------+----+----------+-------+-------+------+
    
    21/08/10 15:04:08 INFO server.AbstractConnector: Stopped Spark@b93aad{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
    21/08/10 15:04:08 INFO ui.SparkUI: Stopped Spark web UI at http://hp2:4040
    21/08/10 15:04:08 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    21/08/10 15:04:08 INFO memory.MemoryStore: MemoryStore cleared
    21/08/10 15:04:08 INFO storage.BlockManager: BlockManager stopped
    21/08/10 15:04:08 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
    21/08/10 15:04:08 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    21/08/10 15:04:08 INFO spark.SparkContext: Successfully stopped SparkContext
    21/08/10 15:04:08 INFO util.ShutdownHookManager: Shutdown hook called
    21/08/10 15:04:08 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-5fdcedf0-eeb8-4089-961f-42a1048761c6
    21/08/10 15:04:08 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-6b5f511b-1bb4-420c-b26f-737dc764fea1
    [root@hp2 javaspark]# 
    

    References:

    1. http://spark.apache.org/docs/2.4.2/sql-getting-started.html
    2. https://www.jianshu.com/p/ad6dc9467a6b
    3. https://blog.csdn.net/u011412768/article/details/93426353
    4. https://blog.csdn.net/luoganttcc/article/details/88791460
