1. Data source
RAVI kumar
Anish kumar
Rakesh jha
Vishal kumar
Ananya ghosh
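To reproduce the walkthrough, the sample above needs to exist as a plain text file at the path used in step 3. A minimal sketch for writing it (the path comes from the LOAD DATA statement below; adjust it for your environment):

rows = [
    "RAVI kumar",
    "Anish kumar",
    "Rakesh jha",
    "Vishal kumar",
    "Ananya ghosh",
]
with open("/data/pyspark/program/auto_report/zhengyuan/test/data.txt", "w") as f:
    f.write("\n".join(rows) + "\n")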
2. Create the table
CREATE TABLE zhengyuan.mytable (
  `fname` string,
  `lname` string
);
No ROW FORMAT is specified, so Hive uses its default field delimiter (\001, Ctrl-A). Because the input file is space-separated, each whole line will land in fname and lname will be NULL; the transform script in step 4 relies on this by splitting the line itself.
3. Load the data
load data local inpath '/data/pyspark/program/auto_report/zhengyuan/test/data.txt' into table zhengyuan.mytable; -- load the data
select * from zhengyuan.mytable; -- check the load
4. Write the UDF
#!/usr/bin/python
import sys

# Read rows from stdin, lowercase the last name, and emit tab-separated columns.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    fname, lname = line.split(' ')
    print('\t'.join([fname, lname.lower()]))
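Before registering the script in Hive, it is worth a quick local smoke test. A minimal sketch, assuming Python 3.7+ and that `python` on the PATH can run iteblog.py; the mixed-case input line is made up here just to show the lowercasing:

import subprocess

out = subprocess.run(
    ["python", "iteblog.py"],
    input="RAVI Kumar\n",          # one made-up mixed-case row on stdin
    capture_output=True, text=True, check=True,
)
print(repr(out.stdout))            # 'RAVI\tkumar\n'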
5. Add the UDF
add FILE /data/pyspark/program/auto_report/zhengyuan/test/iteblog.py;
6. Use the UDF
TRANSFORM takes either of these forms:

SELECT TRANSFORM(stuff)
USING 'script'
AS thing1, thing2

or, with explicit output types:

SELECT TRANSFORM(stuff)
USING 'script'
AS (thing1 INT, thing2 INT)

Applied to our table:

select TRANSFORM (fname) USING "python iteblog.py" as (fname, lname) from zhengyuan.mytable;
7. Verify the results
select * from zhengyuan.mytable;
select lname
from (
  select transform (fname) using "python iteblog.py" as (fname, lname)
  from zhengyuan.mytable
) a;
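The same pipeline can also be mirrored outside Hive to see what the query should return: pipe the sample file through the transform script and keep only the second column. A sketch under the same assumptions as the smoke test above (data.txt and iteblog.py in the current directory):

import subprocess

with open("data.txt") as f:
    result = subprocess.run(
        ["python", "iteblog.py"],
        stdin=f, capture_output=True, text=True, check=True,
    )

for row in result.stdout.splitlines():
    fname, lname = row.split("\t")
    print(lname)   # prints kumar, kumar, jha, kumar, ghosh, one per line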