-
Invoking from a shell script
- Example: spark-sql -e "select * from student;"
-
Invoking from a shell script (with runtime parameters)
- Example: spark-sql --executor-cores 2 --executor-memory 4g --driver-memory 8g -e "select * from student;"
- --driver-memory MEM: Memory for driver (e.g. 1000M, 2G) (Default: 1024M)
- --executor-memory MEM: Memory per executor (e.g. 1000M, 2G) (Default: 1G)
- --executor-cores NUM: Number of cores per executor. (Default: 1 in YARN mode, or all available cores on the worker in standalone mode)
-
Local-mode invocation (with runtime parameters, not submitted to YARN)
- Example: spark-sql --master local[10] -e "select * from student;"
- This is known as Local[N] mode: it simulates Spark's distributed computation with multiple threads on a single machine. Everything runs locally, which makes it convenient for debugging and for verifying that an application's logic is correct.
- N is the number of threads that may be used, each holding one core. If N is omitted, the default on Linux is a single thread with one core (on Windows the default thread count equals the number of cores).
- local[*] starts as many executors as there are CPUs: "Run Spark locally with as many worker threads as logical cores on your machine"
-
Difference between explode and explode_outer
- explode: after exploding, the row whose array column is null is dropped
- explode_outer: after exploding, the row whose array column is null is kept (with a null value)
- For example, given the following DataFrame:

  id | name | likes
  -------------------------------
  1  | Luke | [baseball, soccer]
  2  | Lucy | null
- Result of explode:

  id | name | likes
  -------------------------------
  1  | Luke | baseball
  1  | Luke | soccer
- Result of explode_outer:

  id | name | likes
  -------------------------------
  1  | Luke | baseball
  1  | Luke | soccer
  2  | Lucy | null
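- The semantics above can be sketched in plain Python without a Spark cluster. This is only a simulation of the two functions' behavior; `explode_sim` and `explode_outer_sim` are hypothetical helper names, not Spark APIs:

```python
# Pure-Python sketch of Spark's explode vs explode_outer semantics
# (hypothetical helpers, for illustration only).

def explode_sim(rows, col):
    # One output row per array element; rows with a null/empty
    # array column are dropped, like Spark's explode.
    out = []
    for row in rows:
        values = row[col]
        if values:
            for v in values:
                out.append({**row, col: v})
    return out

def explode_outer_sim(rows, col):
    # Like explode, but a null/empty array column still yields
    # one row with a null element, like Spark's explode_outer.
    out = []
    for row in rows:
        values = row[col]
        if values:
            for v in values:
                out.append({**row, col: v})
        else:
            out.append({**row, col: None})
    return out

rows = [
    {"id": 1, "name": "Luke", "likes": ["baseball", "soccer"]},
    {"id": 2, "name": "Lucy", "likes": None},
]

print(explode_sim(rows, "likes"))        # Lucy's row is dropped
print(explode_outer_sim(rows, "likes"))  # Lucy's row is kept, likes=None
```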
-
A pitfall of spark.createDataFrame(list, Bo.class)
- In local mode, once the list holds enough objects, the call fails with a Java heap out-of-memory error (java.lang.OutOfMemoryError: Java heap space).
- Workaround: write the objects in the list to a JSON file and load that file with spark.read().json() instead.
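- A minimal sketch of the workaround's first half, assuming one JSON object per line (the format spark.read().json() expects by default). The file name `students.json` and the record shape are illustrative; the Spark side is shown only as a comment:

```python
import json

# Step 1: dump the large object list to a JSON-lines file instead of
# handing it all to createDataFrame in driver memory.
records = [{"id": i, "name": f"student{i}"} for i in range(3)]
with open("students.json", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Step 2 (on the Spark side, illustrative):
#   Dataset<Row> df = spark.read().json("students.json");
# spark.read().json reads one JSON object per line by default.
```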