
SparkSQL Code for Reading a Hive Database (Part 1)

Author: 大胖圆儿小姐 | Published 2021-12-09 17:29

I. Code Overview

Run Spark locally on Windows in local mode, reading table data from a Hive database. The job computes the distance between pairs of coordinates (longitude/latitude) and inserts the results into a new table. To run this on Windows you need to download the Windows Hadoop shim and use winutils, copy the Hive configuration file hive-site.xml into the resources directory, and declare Spark dependencies in the pom built for Scala 2.12 (Spark 3.0.0 here).

II. Downloading the Hadoop Package for Windows

  1. Download address: https://github.com/4ttty/winutils. You can only download the whole repository; there is no way to fetch a single version. I used hadoop-2.8.3: copy it to any directory you like and note the path.
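Before going further, it may be worth sanity-checking that winutils.exe actually sits under the bin directory of the copied folder. A throwaway check like the one below (using the same path that hadoop.home.dir is pointed at later in this article; adjust it to your own directory) is enough:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WinutilsCheck {
    public static void main(String[] args) {
        // Same directory that hadoop.home.dir is set to in the Spark code below
        Path winutils = Paths.get("E:\\git\\spark\\hadoop-2.8.3", "bin", "winutils.exe");
        System.out.println(Files.exists(winutils)
                ? "winutils.exe found: " + winutils
                : "winutils.exe missing; Spark will not start properly on Windows without it");
    }
}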

III. Creating the Project

  1. Create a Maven project with IntelliJ IDEA. This is straightforward, so I won't describe it in detail.
  2. Copy the Hive configuration file hive-site.xml from the Linux environment into the resources directory of the project.


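The real file should come from your Linux Hive installation. For orientation only, a minimal hive-site.xml that lets Spark reach a remote metastore often needs little more than the metastore URI; in the sketch below, thrift://hadoop01:9083 is a placeholder built from this article's hostname and Hive's default metastore port, not the actual configuration:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- Placeholder metastore address: substitute your own host and port (9083 is the Hive default) -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://hadoop01:9083</value>
    </property>
</configuration>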
  3. Configure the pom file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>spark</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- Match the Spark version in the pom to the version of Spark installed on the Linux cluster -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <!-- Library used for computing the distance between coordinates -->
        <dependency>
            <groupId>org.gavaghan</groupId>
            <artifactId>geodesy</artifactId>
            <version>1.1.3</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.4.1</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>util.Microseer</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
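With the assembly plugin configured as above, running mvn clean package should produce a fat jar (by default target/spark-1.0-SNAPSHOT-jar-with-dependencies.jar) that bundles the dependencies, which is convenient when the same program is later submitted to the Linux cluster.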
  4. Create an entity class for the result table t_sitenumber_distance:
/**
 * @author DongJing
 * @date 2021/12/9 16:39
 */
public class DistanceMeter {

    private String siteNumber;
    private double distance;
    private int flag;

    public String getSiteNumber() {
        return siteNumber;
    }

    public void setSiteNumber(String siteNumber) {
        this.siteNumber = siteNumber;
    }

    public double getDistance() {
        return distance;
    }

    public void setDistance(double distance) {
        this.distance = distance;
    }

    public int getFlag() {
        return flag;
    }

    public void setFlag(int flag) {
        this.flag = flag;
    }
}
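The article does not show the DDL of the result table. A sketch consistent with the INSERT ... SELECT siteNumber, flag, distance statement used below might look like this, run once from the same SparkSession (the column names and types are my assumptions, not the actual table definition):

        // Hypothetical DDL, inferred from the column order of the INSERT below
        spark.sql("CREATE TABLE IF NOT EXISTS t_sitenumber_distance (" +
                "site_number STRING, flag INT, distance DOUBLE)");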
  5. Read the Hive table through Spark and write the results back:
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.*;
import org.gavaghan.geodesy.Ellipsoid;
import org.gavaghan.geodesy.GeodeticCalculator;
import org.gavaghan.geodesy.GeodeticCurve;
import org.gavaghan.geodesy.GlobalCoordinates;

/**
 * @author DongJing
 * @date 2021/12/9 16:39
 */
public class SparkSqlTest {

    public static void main(String[] args) {
        // Point at a simulated Hadoop environment on Windows; on Linux this line can be commented out (leaving it in also causes no problems)
        System.setProperty("hadoop.home.dir","E:\\git\\spark\\hadoop-2.8.3");
        // Obtain the SparkSession
        SparkSession spark = SparkSession
                .builder()
                .appName("HiveSupport")
                .master("local")
                .enableHiveSupport()
                .getOrCreate();
        spark.sql("show databases").show();
        spark.sql("use sitelight");
        spark.sql("show tables").show();
        Dataset<Row> rowDataset = spark.sql("select t1.site_number, t1.longitude j1,t2.longitude j2,t1.latitude w1,t2.latitude w2 " +
                "from t_site_formal t1 inner join geo_site_info t2 on t1.site_number = t2.number where t1.del_flag=0 and t1.sign=0");
        Encoder<DistanceMeter> rowEncoder = Encoders.bean(DistanceMeter.class);
        // Split and reassemble the data via map, returning a DistanceMeter per row
        Dataset<DistanceMeter> distanceMeterDataset = rowDataset.map((MapFunction<Row,DistanceMeter>) row->{
            DistanceMeter distanceMeter = new DistanceMeter();
            distanceMeter.setSiteNumber(row.get(0).toString());
            Double j1 = Double.valueOf(row.get(1).toString());
            Double j2 = Double.valueOf(row.get(2).toString());
            Double w1 = Double.valueOf(row.get(3).toString());
            Double w2 = Double.valueOf(row.get(4).toString());
            // Note: GlobalCoordinates takes latitude first, then longitude
            GlobalCoordinates source = new GlobalCoordinates(w1, j1);
            GlobalCoordinates target = new GlobalCoordinates(w2, j2);
            double distance = getDistanceMeter(source, target, Ellipsoid.Sphere);
            // flag is 0 when the two points are within 500 m of each other, 1 otherwise
            int flag = distance <= 500 ? 0 : 1;
            distanceMeter.setDistance(distance);
            distanceMeter.setFlag(flag);
            return distanceMeter;
        }, rowEncoder);
        // Register the dataset as a temporary view and perform the insert through Spark SQL
        // (createOrReplaceTempView replaces the deprecated registerTempTable)
        distanceMeterDataset.createOrReplaceTempView("tmp");
        spark.sql("INSERT INTO t_sitenumber_distance SELECT siteNumber, flag, distance FROM tmp");
        spark.close();
    }

    /**
     * Distance between two coordinates.
     *
     * @param gpsFrom   starting coordinate
     * @param gpsTo     ending coordinate
     * @param ellipsoid reference ellipsoid to calculate on
     * @return distance in meters
     */
    public static double getDistanceMeter(GlobalCoordinates gpsFrom, GlobalCoordinates gpsTo, Ellipsoid ellipsoid){
        // Create a GeodeticCalculator and compute the geodetic curve between the two points on the given ellipsoid
        GeodeticCurve geoCurve = new GeodeticCalculator().calculateGeodeticCurve(ellipsoid, gpsFrom, gpsTo);
        return geoCurve.getEllipsoidalDistance();
    }

}
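To sanity-check the distance helper in isolation, a small driver like the one below can be used. The coordinates are rough sample values for Beijing and Shanghai (latitude first, then longitude), not data from this project:

import org.gavaghan.geodesy.Ellipsoid;
import org.gavaghan.geodesy.GlobalCoordinates;

public class DistanceSmokeTest {
    public static void main(String[] args) {
        // Approximate coordinates (latitude, longitude)
        GlobalCoordinates beijing  = new GlobalCoordinates(39.9042, 116.4074);
        GlobalCoordinates shanghai = new GlobalCoordinates(31.2304, 121.4737);
        double meters = SparkSqlTest.getDistanceMeter(beijing, shanghai, Ellipsoid.Sphere);
        // Expect a value on the order of a thousand kilometers
        System.out.printf("distance = %.1f m (~%.1f km)%n", meters, meters / 1000);
    }
}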

IV. Problems I Ran Into

  1. Exception thrown when running the main method:
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: hadoop01
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:320)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.hive.execution.SaveAsHiveFile.getStagingDir(SaveAsHiveFile.scala:218)
    at org.apache.spark.sql.hive.execution.SaveAsHiveFile.getStagingDir$(SaveAsHiveFile.scala:213)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.getStagingDir(InsertIntoHiveTable.scala:68)
    at org.apache.spark.sql.hive.execution.SaveAsHiveFile.getExternalScratchDir(SaveAsHiveFile.scala:210)
    at org.apache.spark.sql.hive.execution.SaveAsHiveFile.newVersionExternalTempPath(SaveAsHiveFile.scala:192)
    at org.apache.spark.sql.hive.execution.SaveAsHiveFile.getExternalTmpPath(SaveAsHiveFile.scala:131)
    at org.apache.spark.sql.hive.execution.SaveAsHiveFile.getExternalTmpPath$(SaveAsHiveFile.scala:100)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.getExternalTmpPath(InsertIntoHiveTable.scala:68)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:98)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
    at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3614)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
    at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:606)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601)
    at SparkSqlTest.main(SparkSqlTest.java:44)
Caused by: java.net.UnknownHostException: hadoop01
    ... 40 more

Cause: the job reads data stored on the Hadoop cluster, and Windows cannot resolve the cluster's hostname, so a host mapping has to be configured.
Fix: edit the hosts file in C:\Windows\System32\drivers\etc and append the line 172.16.100.26 hadoop01.
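The entry is just an IP address and a hostname separated by whitespace, for example:

# C:\Windows\System32\drivers\etc\hosts
172.16.100.26    hadoop01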

V. Successful Execution

  1. The main method runs to completion.


  2. On the Linux server, open the Hive client or the Hadoop web UI to verify the result.
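    For example, a quick query such as select count(*) from t_sitenumber_distance; in the Hive client should return the number of newly inserted rows.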
