一.从数据库读数据

1.导入jar包

image.png

在spark-hadoop包下的jars中导入对应数据库驱动的jar包

如

image.png

我所用的是oracle数据库,则导入ojdbc6-11.2.0.jar

2.数据库配置

我的数据库配置采用的 ini 配置文件的方式(此步可省略,手写链接配置也可以)

image.png

获取配置的方法:

#dbtype为[]中的名称,config_path为配置文件的地址
def get_db_config(dbtype,config_path='/home/ap/cognos/JRJY_Rec/config/db_config.ini'):
    import configparser
    #读取ini配置文件
    cf = configparser.ConfigParser()
    cf.read(config_path)
    url = cf.get(dbtype,'url')
    user = cf.get(dbtype,'user')
    password = cf.get(dbtype,'password')
    driver = cf.get(dbtype,'driver')
    prop = {'user': user,'password': password,'driver': driver}
    return prop,url

prop,url = get_db_config('oracle-hasdb')
#prop中为用户名,密码,驱动
#url为jdbc链接

3.从数据库导出数据到pyspark的dataframe

df = spark.read.jdbc(url=url,table='table_name',properties=prop)
# url jdbc连接
# table 数据库表名,也可以是查询语句,如:select * from table_name where ....
# properties 配置信息,也可以手动填写,如:properties={'user':'username','password':'password','driver':'driver'}

二.dataframe写入数据到数据库

prop,url = get_db_config('oracle-hasdb')
df.write.jdbc(url=url, table='table_name', mode='append', properties=prop)
# 配置文件和读数据库配置一样
# table table为数据库建立的表,如果不存在,spark会为df建立表
# mode append为追加写人数据