Reading Hive Tables with SparkSQL (Part 1)

Author: 大胖圆儿小姐 | Published 2021-12-09 17:29

    I. Code Overview

    The job runs Spark locally on Windows in local mode, reads data from Hive tables, computes the distance between pairs of coordinates, and inserts the results into a new table. Running on Windows requires the Windows Hadoop shim (winutils), the Hive configuration file hive-site.xml copied into the resources directory, and the Spark 3.0.0 dependencies (built for Scala 2.12) declared in the pom.

    II. Downloading the Hadoop Package for Windows

    1. Download it from https://github.com/4ttty/winutils. The repository can only be downloaded as a whole, not one version at a time. I used hadoop-2.8.3; copy it to any directory and note the path (see the verification sketch below).
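
    Before creating the SparkSession it is worth checking that the shim is actually in place. A minimal sketch (the directory E:\git\spark\hadoop-2.8.3 is just the path used later in this article; adjust it to wherever you unpacked the download):

    import java.io.File;

    public class WinutilsCheck {

        public static void main(String[] args) {
            // Same property the Spark job sets before building the SparkSession
            String hadoopHome = "E:\\git\\spark\\hadoop-2.8.3";
            System.setProperty("hadoop.home.dir", hadoopHome);
            // winutils.exe must sit under <hadoop.home.dir>\bin
            File winutils = new File(hadoopHome, "bin\\winutils.exe");
            System.out.println("winutils.exe found: " + winutils.exists());
        }
    }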

    III. Project Setup

    1. Create a Maven project with IntelliJ IDEA; this is straightforward, so I won't go through it in detail.
    2. Copy the Hive configuration file hive-site.xml from the Linux machine into the project's resources directory. This file is how Spark finds the Hive metastore (typically via the hive.metastore.uris property).


    3. Configure the pom; with the assembly plugin below, mvn package will also produce a jar-with-dependencies:
    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>org.example</groupId>
        <artifactId>spark</artifactId>
        <version>1.0-SNAPSHOT</version>
    
        <dependencies>
            <!-- Match the Spark version here to the Spark version installed on the Linux cluster -->
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.12</artifactId>
                <version>3.0.0</version>
            </dependency>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-sql_2.12</artifactId>
                <version>3.0.0</version>
            </dependency>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-hive_2.12</artifactId>
                <version>3.0.0</version>
            </dependency>
            <!-- Library used to compute the distance between two coordinates -->
            <dependency>
                <groupId>org.gavaghan</groupId>
                <artifactId>geodesy</artifactId>
                <version>1.1.3</version>
            </dependency>
        </dependencies>
    
        <build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-assembly-plugin</artifactId>
                    <version>2.4.1</version>
                    <configuration>
                        <descriptorRefs>
                            <descriptorRef>jar-with-dependencies</descriptorRef>
                        </descriptorRefs>
                        <archive>
                            <manifest>
                                <mainClass>util.Microseer</mainClass>
                            </manifest>
                        </archive>
                    </configuration>
                    <executions>
                        <execution>
                            <id>make-assembly</id>
                            <phase>package</phase>
                            <goals>
                                <goal>single</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <configuration>
                        <source>8</source>
                        <target>8</target>
                    </configuration>
                </plugin>
            </plugins>
        </build>
    </project>
    
    4. Create a bean class for the result table t_sitenumber_distance:
    /**
      * @author DongJing
      * @date 2021/12/9 16:39
     */
    public class DistanceMeter {
    
        private String siteNumber;
        private double distance;
        private int flag;
    
        public String getSiteNumber() {
            return siteNumber;
        }
    
        public void setSiteNumber(String siteNumber) {
            this.siteNumber = siteNumber;
        }
    
        public double getDistance() {
            return distance;
        }
    
        public void setDistance(double distance) {
            this.distance = distance;
        }
    
        public int getFlag() {
            return flag;
        }
    
        public void setFlag(int flag) {
            this.flag = flag;
        }
    }
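
    Since Encoders.bean is applied to this class later on, the column names of the resulting Dataset come from the getter/setter pairs, i.e. siteNumber, distance and flag; those are the names the later "SELECT siteNumber, flag, distance FROM tmp" statement relies on. A small throwaway sketch (not part of the original code) that prints the derived schema:

    import org.apache.spark.sql.Encoder;
    import org.apache.spark.sql.Encoders;

    public class SchemaPeek {

        public static void main(String[] args) {
            // Encoders.bean derives a schema from the DistanceMeter bean properties
            Encoder<DistanceMeter> encoder = Encoders.bean(DistanceMeter.class);
            // Bean encoders list the properties alphabetically:
            // distance (double), flag (integer), siteNumber (string)
            System.out.println(encoder.schema().treeString());
        }
    }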
    
    5. Read the Hive table with Spark, compute the distances, and insert the results:
    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.*;
    import org.gavaghan.geodesy.Ellipsoid;
    import org.gavaghan.geodesy.GeodeticCalculator;
    import org.gavaghan.geodesy.GeodeticCurve;
    import org.gavaghan.geodesy.GlobalCoordinates;
    
    /**
     * @author DongJing
     * @date 2021/12/9 16:39
     */
    public class SparkSqlTest {
    
        public static void main(String[] args) {
            // On Windows, point at the local Hadoop shim; on Linux this line can be removed (leaving it in also does no harm)
            System.setProperty("hadoop.home.dir","E:\\git\\spark\\hadoop-2.8.3");
            // Build a SparkSession with Hive support
            SparkSession spark = SparkSession
                    .builder()
                    .appName("HiveSupport")
                    .master("local")
                    .enableHiveSupport()
                    .getOrCreate();
            spark.sql("show databases").show();
            spark.sql("use sitelight");
            spark.sql("show tables").show();
            Dataset<Row> rowDataset = spark.sql("select t1.site_number, t1.longitude j1,t2.longitude j2,t1.latitude w1,t2.latitude w2 " +
                    "from t_site_formal t1 inner join geo_site_info t2 on t1.site_number = t2.number where t1.del_flag=0 and t1.sign=0");
            Encoder<DistanceMeter> rowEncoder = Encoders.bean(DistanceMeter.class);
            // Split and reassemble each row via map, returning a DistanceMeter per row
            Dataset<DistanceMeter> distanceMeterDataset = rowDataset.map((MapFunction<Row,DistanceMeter>) row->{
                DistanceMeter distanceMeter = new DistanceMeter();
                distanceMeter.setSiteNumber(row.get(0).toString());
                Double j1 = Double.valueOf(row.get(1).toString());
                Double j2 = Double.valueOf(row.get(2).toString());
                Double w1 = Double.valueOf(row.get(3).toString());
                Double w2 = Double.valueOf(row.get(4).toString());
                // GlobalCoordinates expects (latitude, longitude), so the latitude goes first
                GlobalCoordinates source = new GlobalCoordinates(w1, j1);
                GlobalCoordinates target = new GlobalCoordinates(w2, j2);
                double distance = getDistanceMeter(source, target, Ellipsoid.Sphere);
                int flag = distance <= 500 ? 0 : 1;   // 0 when the two points are within 500 meters of each other
                distanceMeter.setDistance(distance);
                distanceMeter.setFlag(flag);
                return distanceMeter;
            }, rowEncoder);
            // Register the dataset as a temporary view and run the insert through SparkSQL
            // (createOrReplaceTempView replaces the long-deprecated registerTempTable)
            distanceMeterDataset.createOrReplaceTempView("tmp");
            spark.sql("INSERT INTO t_sitenumber_distance SELECT siteNumber, flag, distance FROM tmp");
            spark.close();
        }
    
        /**
         * Compute the distance in meters between two coordinates.
         *
         * @param gpsFrom   starting coordinate
         * @param gpsTo     ending coordinate
         * @param ellipsoid reference ellipsoid used for the calculation
         * @return distance in meters
         */
        public static double getDistanceMeter(GlobalCoordinates gpsFrom, GlobalCoordinates gpsTo, Ellipsoid ellipsoid){
            // Create a GeodeticCalculator and compute the geodetic curve between the two points on the given ellipsoid
            GeodeticCurve geoCurve = new GeodeticCalculator().calculateGeodeticCurve(ellipsoid, gpsFrom, gpsTo);
            return geoCurve.getEllipsoidalDistance();
        }
    
    }
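
    As a side note, the temp-view-plus-INSERT step at the end of main could equally be written with the DataFrameWriter API. Below is a sketch under the assumption that the columns of t_sitenumber_distance are ordered siteNumber, flag, distance (the same order the INSERT statement above uses); insertInto resolves columns by position, so select() puts them in that order explicitly:

    import org.apache.spark.sql.Dataset;

    public class HiveWriteAlternative {

        public static void writeDistances(Dataset<DistanceMeter> distanceMeterDataset) {
            distanceMeterDataset
                    .select("siteNumber", "flag", "distance") // positional match against the target table
                    .write()
                    .mode("append")
                    .insertInto("t_sitenumber_distance");
        }
    }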
    
    

    IV. Problems I Ran Into

    1. Exception thrown when running the main method:
    Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: hadoop01
        at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
        at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:320)
        at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.hive.execution.SaveAsHiveFile.getStagingDir(SaveAsHiveFile.scala:218)
        at org.apache.spark.sql.hive.execution.SaveAsHiveFile.getStagingDir$(SaveAsHiveFile.scala:213)
        at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.getStagingDir(InsertIntoHiveTable.scala:68)
        at org.apache.spark.sql.hive.execution.SaveAsHiveFile.getExternalScratchDir(SaveAsHiveFile.scala:210)
        at org.apache.spark.sql.hive.execution.SaveAsHiveFile.newVersionExternalTempPath(SaveAsHiveFile.scala:192)
        at org.apache.spark.sql.hive.execution.SaveAsHiveFile.getExternalTmpPath(SaveAsHiveFile.scala:131)
        at org.apache.spark.sql.hive.execution.SaveAsHiveFile.getExternalTmpPath$(SaveAsHiveFile.scala:100)
        at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.getExternalTmpPath(InsertIntoHiveTable.scala:68)
        at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:98)
        at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
        at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
        at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
        at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
        at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3614)
        at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
        at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
        at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:606)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601)
        at SparkSqlTest.main(SparkSqlTest.java:44)
    Caused by: java.net.UnknownHostException: hadoop01
        ... 40 more
    

    Cause: the job reads data stored on the Hadoop cluster, but Windows cannot resolve the cluster hostname hadoop01, so a host mapping has to be configured.
    Fix: edit the hosts file under C:\Windows\System32\drivers\etc and add the line 172.16.100.26 hadoop01 at the end (the small check below confirms the mapping is picked up).
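
    A quick, optional sketch for verifying the hosts entry before re-running the Spark job (plain JDK, no Spark needed):

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class HostCheck {

        public static void main(String[] args) {
            try {
                // Resolves via the local hosts file once the mapping has been added
                InetAddress addr = InetAddress.getByName("hadoop01");
                System.out.println("hadoop01 resolves to " + addr.getHostAddress());
            } catch (UnknownHostException e) {
                System.out.println("hadoop01 still cannot be resolved: " + e.getMessage());
            }
        }
    }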

    V. Successful Execution

    1. The main method now runs to completion.


    2. On the Linux server, verify the result from the hive client (for example, select count(*) from t_sitenumber_distance;) or through the Hadoop web UI.
