Some issues with Spark SQL in Spark 2.1

Author: 会长大的幸福_8bf9 | Published 2018-06-07 20:34


While using Spark SQL in Spark 2.1 recently, I ran into the following issues.

Because of project deadlines I did not save the error logs, so only the conclusions are recorded here.

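1. When operating on Hive tables, Spark SQL 2.x (at least the 2.1 release I used) does not support the ALTER TABLE ... ADD COLUMNS statement, that is, adding columns to an existing table.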

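Workaround: run the update through HiveServer over JDBC. According to other blogs, you can also talk to the metastore through Thrift directly. Since Spark SQL itself already goes through Thrift, that approach would not depend on any additional service. My attempt at it failed, though: I could not find a method for updating the table schema.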

The code from my failed attempt is below; feel free to try modifying this class.

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.RetryingMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Database;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.thrift.TException;
import org.slf4j.Logger;

import java.util.List;

// Thin read-only wrapper around the Hive metastore Thrift client.
public class HiveClient {

    protected static final Logger logger = org.slf4j.LoggerFactory.getLogger(HiveClient.class);

    IMetaStoreClient client;

    public HiveClient() {
        try {
            HiveConf hiveConf = new HiveConf();
            // addResource(String) looks the name up on the classpath.
            hiveConf.addResource("/hive-site.xml");
            client = RetryingMetaStoreClient.getProxy(hiveConf);
        } catch (MetaException ex) {
            logger.error(ex.getMessage());
        }
    }

    // Names of all databases, or null if the metastore call fails.
    public List<String> getAllDatabases() {
        List<String> databases = null;
        try {
            databases = client.getAllDatabases();
        } catch (TException ex) {
            logger.error(ex.getMessage());
        }
        return databases;
    }

    // Metadata of one database, or null if the metastore call fails.
    public Database getDatabase(String db) {
        Database database = null;
        try {
            database = client.getDatabase(db);
        } catch (TException ex) {
            logger.error(ex.getMessage());
        }
        return database;
    }

    // Columns of db.table as stored in the metastore, or null on failure.
    public List<FieldSchema> getSchema(String db, String table) {
        List<FieldSchema> schema = null;
        try {
            schema = client.getSchema(db, table);
        } catch (TException ex) {
            logger.error(ex.getMessage());
        }
        return schema;
    }

    // Names of all tables in a database, or null on failure.
    public List<String> getAllTables(String db) {
        List<String> tables = null;
        try {
            tables = client.getAllTables(db);
        } catch (TException ex) {
            logger.error(ex.getMessage());
        }
        return tables;
    }

    // Storage location of db.table, or null on failure.
    public String getLocation(String db, String table) {
        String location = null;
        try {
            location = client.getTable(db, table).getSd().getLocation();
        } catch (TException ex) {
            logger.error(ex.getMessage());
        }
        return location;
    }
}
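
For the HiveServer/JDBC workaround mentioned above, a minimal sketch could look like the following. It is not the code from my project: the class name HiveAddColumns, the helper names buildAddColumnsSql and addColumns, and the database, table and column names are all made up for illustration. It also assumes the Hive JDBC driver is on the classpath and that HiveServer2 is reachable at a URL of the form jdbc:hive2://<host>:<port>/<db>.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: builds an ALTER TABLE ... ADD COLUMNS statement
// and sends it to HiveServer2 over JDBC instead of through Spark SQL.
final class HiveAddColumns {

    // columns maps column name -> Hive type, in the order they should be added.
    // Names are backtick-quoted; types are inserted verbatim, so they must be trusted input.
    static String buildAddColumnsSql(String db, String table, Map<String, String> columns) {
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("no columns to add");
        }
        StringJoiner cols = new StringJoiner(", ", "(", ")");
        for (Map.Entry<String, String> e : columns.entrySet()) {
            cols.add("`" + e.getKey() + "` " + e.getValue());
        }
        return "ALTER TABLE `" + db + "`.`" + table + "` ADD COLUMNS " + cols;
    }

    // Executes the statement against HiveServer2.
    // Assumption: a Hive JDBC driver is available to DriverManager.
    static void addColumns(String jdbcUrl, String user, String password,
                           String db, String table, Map<String, String> columns)
            throws SQLException {
        String sql = buildAddColumnsSql(db, table, columns);
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             Statement stmt = conn.createStatement()) {
            stmt.execute(sql);
        }
    }

    public static void main(String[] args) throws SQLException {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("new_col", "STRING"); // placeholder column
        System.out.println(buildAddColumnsSql("default", "demo_table", columns));
        // Only touch a real server when a JDBC URL is passed explicitly.
        if (args.length == 1) {
            addColumns(args[0], "", "", "default", "demo_table", columns);
        }
    }
}
```

Keeping the SQL construction separate from the JDBC call makes it possible to check the generated statement without a running HiveServer2.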

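2. For tables that Spark SQL creates while connected to Hive, Spark caches the table schema, and even sparkSession.catalog.refreshTable does not pick up the real schema stored in Hive. I spent several days debugging this and never found the cause.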

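To summarize: if you manipulate Hive metadata through Spark, for example when the CREATE TABLE statement is issued by Spark, you can run SHOW CREATE TABLE xx in Hive and see that the resulting statement sets a large number of default properties, including the table schema. As a result, when Spark reads the schema it does not notice that another tool has changed the table structure. Combine that with Spark being unable to alter the schema and unable to refresh its cached table structure, and it really is frustrating.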

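One option is to find the cache and clear it. I looked through several blogs for this, but there are few solutions, and with the newer API the relevant code can no longer be overridden

(see https://blog.csdn.net/cjuexuan/article/details/72236694), because some classes were changed to private.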

The second option: do not create tables through Spark. Create them through Hive instead, and route every metadata update through Hive.
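
One way to enforce this second option in application code is a small routing step that sends metadata statements to Hive and everything else to Spark. The sketch below is only an illustration under my own assumptions: StatementRouter, Target and route are hypothetical names, and the keyword list (CREATE, ALTER, DROP) is a deliberately simple guess at what counts as a metadata statement.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical router: decides which engine should execute a SQL statement
// so that all metadata changes go through Hive rather than Spark.
final class StatementRouter {

    enum Target { HIVE, SPARK }

    // Assumption: statements starting with these keywords change metadata.
    // Note that CREATE TABLE ... AS SELECT is also sent to Hive by this rule.
    private static final Set<String> METADATA_KEYWORDS =
            new HashSet<>(Arrays.asList("CREATE", "ALTER", "DROP"));

    // Classifies by the first keyword only; leading SQL comments are not handled.
    static Target route(String sql) {
        String trimmed = sql.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("empty statement");
        }
        String firstWord = trimmed.split("\\s+", 2)[0].toUpperCase(Locale.ROOT);
        return METADATA_KEYWORDS.contains(firstWord) ? Target.HIVE : Target.SPARK;
    }

    public static void main(String[] args) {
        System.out.println(route("ALTER TABLE t ADD COLUMNS (c INT)"));
        System.out.println(route("SELECT * FROM t"));
    }
}
```

Statements routed to Target.HIVE would then be executed over a HiveServer JDBC connection, while queries stay on the SparkSession.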