PySpark Study Notes (Part 2)

Author: 深思海数_willschang | Published 2016-12-15 10:50
    from pyspark import SparkConf, SparkContext
    
    # local[2] runs Spark locally with two worker threads
    conf = SparkConf().setAppName('RDD2').setMaster('local[2]')
    sc = SparkContext(conf=conf)
    
    print sc.version
    
    
    2.0.2
    


    sortBy

    sortBy(keyfunc, ascending=True, numPartitions=None)

    Sorts this RDD by the given keyfunc.

    x = sc.parallelize(['wills', 'kris', 'april', 'chang'])
    def sortByFirstLetter(s): return s[0]
    def sortBySecondLetter(s): return s[1]
    
    y = x.sortBy(sortByFirstLetter).collect()
    yy = x.sortBy(sortBySecondLetter).collect()
    
    print 'Sorted by first letter: {}'.format(y)
    print 'Sorted by second letter: {}'.format(yy)
    
    Sorted by first letter: ['april', 'chang', 'kris', 'wills']
    Sorted by second letter: ['chang', 'wills', 'april', 'kris']
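
    sortBy also takes ascending and numPartitions arguments. A minimal sketch of sorting in reverse, reusing the RDD above (the expected output is inferred from the same data):

    z = x.sortBy(sortByFirstLetter, ascending=False).collect()
    print 'Sorted descending by first letter: {}'.format(z)
    # expected: ['wills', 'kris', 'chang', 'april']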
    

    cartesian

    cartesian(other)

    Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.

    rdd1 = sc.parallelize([1,2,3])
    rdd2 = sc.parallelize([11,22,33])
    res = rdd1.cartesian(rdd2)
    print 'Cartesian product: {}'.format(res)
    
    Cartesian product: org.apache.spark.api.java.JavaPairRDD@5af637a2
    
    print 'Cartesian product: {}'.format(res.collect())
    
    Cartesian product: [(1, 11), (1, 22), (1, 33), (2, 11), (3, 11), (2, 22), (2, 33), (3, 22), (3, 33)]
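
    Note that cartesian produces len(rdd1) * len(rdd2) pairs, so the result grows quickly on large inputs. A quick check on the RDDs above:

    print res.count()
    # 9, i.e. 3 * 3 pairs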
    

    groupBy

    groupBy(f, numPartitions=None, partitionFunc=<function portable_hash>)

    Return an RDD of grouped items.

    rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
    result = rdd.groupBy(lambda x: x % 2).collect()
    res = sorted([(x, sorted(y)) for (x, y) in result])
    print 'Result of groupBy: {}'.format(result)
    print 'Materialized and sorted result: {}'.format(res)
    
    Result of groupBy: [(0, <pyspark.resultiterable.ResultIterable object at 0x7f79600a9990>), (1, <pyspark.resultiterable.ResultIterable object at 0x7f79600a9dd0>)]
    Materialized and sorted result: [(0, [2, 8]), (1, [1, 1, 3, 5])]
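
    The grouped values come back as ResultIterable objects. A common idiom is to materialize them on the cluster side with mapValues(list) before collecting (the order inside each group is not guaranteed):

    res2 = rdd.groupBy(lambda x: x % 2).mapValues(list).collect()
    print sorted(res2)
    # [(0, [2, 8]), (1, [1, 1, 3, 5])]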
    

    pipe

    pipe(command, env=None, checkCode=False)

    Return an RDD created by piping elements to a forked external process.
    Parameters: checkCode – whether or not to check the return value of the shell command.

    rdd = sc.parallelize(['wills', 'kris', 'april'])
    rdd2 = rdd.pipe('grep -i "r"')
    print 'Data after pipe: {}'.format(rdd2.collect())
    # plain grep is case-sensitive, so "W" matches nothing here
    print rdd.pipe('grep "W"').collect()
    
    Data after pipe: [u'kris', u'april']
    []
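
    With the default checkCode=False, a non-zero exit status from the command is ignored. grep exits with status 1 when nothing matches, so passing checkCode=True in the second call above should fail the job rather than return an empty list (a sketch based on the docstring, not verified here):

    # rdd.pipe('grep "W"', checkCode=True).collect()  # raises on grep's exit status 1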
    

    foreach

    foreach(f)

    Applies a function to all elements of this RDD.

    def f(x): print(x)
    sc.parallelize([1, 2, 3, 4, 5]).foreach(f)
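
    No output appears in the driver session because foreach runs on the executors; the print output goes to the worker logs. To fold a side effect back to the driver, an accumulator can be used (a minimal sketch):

    acc = sc.accumulator(0)
    sc.parallelize([1, 2, 3, 4, 5]).foreach(lambda v: acc.add(v))
    print acc.value
    # 15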
    

    max, min, sum, count

    x = sc.parallelize([1, 2, 3, 4, 5])
    print 'Max: {}'.format(x.max())
    print 'Min: {}'.format(x.min())
    print 'Sum: {}'.format(x.sum())
    print 'Count: {}'.format(x.count())
    
    Max: 5
    Min: 1
    Sum: 15
    Count: 5
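
    Each of these triggers a separate job; stats() computes count, mean, stdev, max and min in a single pass over the data:

    print x.stats()
    # e.g. (count: 5, mean: 3.0, stdev: 1.41421356237, max: 5.0, min: 1.0)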
    

    mean, variance, sampleVariance, stdev, sampleStdev

    x = sc.parallelize([1, 2, 3, 4, 5])
    print 'Mean: {}'.format(x.mean())
    print 'Variance: {}'.format(x.variance())
    print 'Sample variance: {}'.format(x.sampleVariance())
    print 'Population standard deviation: {}'.format(x.stdev())
    print 'Sample standard deviation: {}'.format(x.sampleStdev())
    
    Mean: 3.0
    Variance: 2.0
    Sample variance: 2.5
    Population standard deviation: 1.41421356237
    Sample standard deviation: 1.58113883008
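
    The two variances differ only by Bessel's correction: sampleVariance() rescales the population variance by n / (n - 1). A quick check on the data above:

    n = x.count()
    print x.variance() * n / (n - 1.0)
    # 2.5, matching sampleVariance()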
    

    countByKey, countByValue

    countByKey()

    Count the number of elements for each key, and return the result to the master as a dictionary.

    countByValue()

    Return the count of each unique value in this RDD as a dictionary of (value, count) pairs.

    rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1),("b", 2), ("a", 2)])
    print '按key计算:{0}'.format(sorted(rdd.countByKey().items()))
    print '按value计算:{0}'.format(sorted(sc.parallelize([1, 2, 1, 2, 2], 2).\
                                       countByValue().items()))
    
    
    Count by key: [('a', 3), ('b', 2)]
    Count by value: [(1, 2), (2, 3)]
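
    Both methods pull the full result back to the driver as a dictionary, so they only suit small key sets. The distributed equivalent keeps the counts as an RDD (a sketch using the same rdd):

    counts = rdd.map(lambda kv: (kv[0], 1)).reduceByKey(lambda a, b: a + b)
    print sorted(counts.collect())
    # [('a', 3), ('b', 2)]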
    

    first, top, take, takeOrdered

    first()

    Return the first element in this RDD.

    top(num, key=None)

    Get the top N elements from an RDD.

    take(num)

    Take the first num elements of the RDD.

    takeOrdered(num, key=None)

    Get the N elements from an RDD ordered in ascending order or as specified by the optional key function.

    x = sc.parallelize(range(20))
    
    print 'First element: {}'.format(x.first())
    print 'Top N (descending by default): {}'.format(x.top(3))
    print 'Take N: {}'.format(x.take(5))
    print 'Take N by a key function: {}'.format(x.takeOrdered(3, key=lambda x: -x))
    
    First element: 0
    Top N (descending by default): [19, 18, 17]
    Take N: [0, 1, 2, 3, 4]
    Take N by a key function: [19, 18, 17]
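
    top() accepts the same optional key function; with key=lambda n: -n it returns the smallest elements, mirroring takeOrdered's default order:

    print x.top(3, key=lambda n: -n)
    # [0, 1, 2]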
    

    collectAsMap, keys, values

    collectAsMap()

    Return the key-value pairs in this RDD to the master as a dictionary.

    rdd = sc.parallelize([('wills', 2), ('chang', 4), ('kris', 28)])
    res = rdd.collectAsMap()
    print 'As a map: {}'.format(res)
    print 'keys: {}'.format(rdd.keys().collect())
    print 'values: {}'.format(rdd.values().collect())
    
    As a map: {'wills': 2, 'chang': 4, 'kris': 28}
    keys: ['wills', 'chang', 'kris']
    values: [2, 4, 28]
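
    collectAsMap keeps a single value per key; with duplicate keys the value from the later element wins:

    dup = sc.parallelize([('k', 1), ('k', 2)])
    print dup.collectAsMap()
    # {'k': 2}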
    
    
    
