Spark Notes

Author: 开水的杯子 | Published 2017-03-24 13:13
  • It was designed to solve what MapReduce failed to address: performance problems caused by having no way to reuse data between computations.
    • Iterative jobs (popular in machine learning algorithms; see the logistic regression sketch after this list)
    • Interactive analytics (ad hoc exploratory queries)
  • Resilient distributed dataset (RDD): a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. RDDs can be cached and reused in multiple parallel operations.
  • Fault tolerance achieved through lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to rebuild just that partition.
    • A handle to an RDD contains enough information to compute the RDD starting from data in reliable storage.
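
As a concrete illustration of the iterative pattern, here is a minimal logistic regression sketch in the spirit of the example from the original paper, written against the modern SparkContext API; the input path, the parsing, and the 1-D model are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogisticRegressionSketch {
  case class Point(x: Double, y: Double) // y is the +1/-1 label

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lr-sketch"))

    // Parse once, then keep the points in memory across iterations.
    val points = sc.textFile("hdfs://.../points.txt")
      .map { line =>
        val Array(x, y) = line.split(' ').map(_.toDouble)
        Point(x, y)
      }
      .cache()

    var w = 0.0 // 1-D weight for brevity; the paper uses a D-dimensional vector
    for (_ <- 1 to 10) {
      // Each pass reads the cached RDD instead of re-reading HDFS.
      val gradient = points
        .map(p => (1.0 / (1.0 + math.exp(-p.y * w * p.x)) - 1.0) * p.y * p.x)
        .reduce(_ + _)
      w -= gradient
    }
    println(s"final w = $w")
  }
}
```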

Constructing RDDs

  • From a file in HDFS
  • Parallelizing a Scala collection
  • Transforming an existing RDD
  • Change the persistence of an RDD (all of the above are sketched below)
    • Cache: lazy; leaves the dataset in cache after it is first computed. A hint only: not enforced if there is not enough memory.
    • Save: writes the dataset to the file system
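
A minimal sketch of these construction and persistence options, as one might type into the Spark shell (where `sc`, a SparkContext, is predefined; the paths are illustrative):

```scala
val fromFile    = sc.textFile("hdfs://.../input.txt")      // from a file in HDFS
val fromScala   = sc.parallelize(1 to 1000)                // from a Scala collection
val transformed = fromScala.map(_ * 2).filter(_ % 3 == 0)  // from an existing RDD

transformed.cache()                          // hint: keep in memory after first computation
transformed.saveAsTextFile("hdfs://.../out") // write the dataset to the file system
```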

Parallel Operations

  • Reduce: combines dataset elements using an associative function to produce a result at the driver program; reduce results are collected at a single process
  • Collect: sends all elements of the dataset to the driver program
  • Foreach: passes each element through a user-provided function (a short sketch of all three follows)
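
A short sketch of the three operations, assuming the `sc` from the previous example:

```scala
val nums = sc.parallelize(1 to 100)

val total      = nums.reduce(_ + _) // associative combine; the result arrives at the driver
val everything = nums.collect()     // ships all elements back to the driver
nums.foreach(n => println(n))       // runs on the workers (output goes to worker stdout)
```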

Shared Variables

  • Broadcast variables: distribute a large piece of read-only data to all workers once, rather than packaging it with every closure.
  • Accumulators: workers can only add to them; only the driver can read the value. (Both are sketched below.)
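
A minimal sketch of both kinds using the modern API names (`sc.broadcast` and `sc.longAccumulator`, which postdate the paper); the lookup table and data are illustrative:

```scala
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2)) // read-only; shipped once per worker
val misses = sc.longAccumulator("misses")          // workers add; the driver reads

sc.parallelize(Seq("a", "b", "c", "a")).foreach { k =>
  if (!lookup.value.contains(k)) misses.add(1)     // workers may only add
}
println(s"misses = ${misses.value}")               // reading is only meaningful at the driver
```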

Implementation

  • What is Mesos?!
  • Spark is built on top of Mesos [16, 15], a “cluster operating system” that lets multiple parallel applications share a cluster in a fine-grained manner and provides an API for applications to launch tasks on a cluster

RDD Implementation

  • Internally, each RDD object implements the same simple interface, which consists of three operations (sketched after this list):
    • getPartitions: returns a list of partition IDs.
    • getIterator(partition): iterates over a partition.
    • getPreferredLocations(partition): used for task scheduling to achieve data locality.
  • Delay scheduling: send each task to one of its preferred locations.
  • If a node fails, its partitions are re-read from their parent datasets and eventually cached on other nodes.
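
A sketch of that internal interface as a Scala trait; the method names follow the paper, while the concrete types are my own assumptions:

```scala
trait SimpleRDD[T] {
  def getPartitions: Seq[Int]                             // list of partition IDs
  def getIterator(partition: Int): Iterator[T]            // stream the elements of one partition
  def getPreferredLocations(partition: Int): Seq[String]  // candidate hosts, for data locality
}
```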

Shared Variables Implementation

  • Broadcast variables and accumulators are implemented using classes with custom serialization formats.
  • A broadcast variable is saved to the filesystem once, then fetched and cached on each worker node.
  • An accumulator is serialized along with its zero value; each worker node updates its own copy starting from zero and sends the updates back to the driver for a global merge (a schematic sketch follows).
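
A schematic sketch (not Spark's actual code) of the accumulator serialization trick: the copy a worker deserializes starts from zero, so only the local deltas need to be sent back for the driver to merge:

```scala
import java.io.ObjectInputStream

class Accum[T](zero: T, add: (T, T) => T) extends Serializable {
  @transient var value: T = zero                 // live value; not serialized

  def +=(v: T): Unit = { value = add(value, v) } // workers may only add

  // On deserialization at a worker, restart from zero; the runtime would
  // later ship `value` (the local delta) back to the driver for merging.
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = zero
  }
}
```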

Interpreter Integration

  • Scala compiles a class for each line typed by the user, including a singleton object that contains the variables and functions defined on that line.
  • Previous lines are referenced via the class's getInstance method.
  • Spark changed this to output the compiled classes to a shared filesystem and to reference the singleton objects directly (a simplified sketch follows).
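
A simplified sketch of the wrapper objects the interpreter generates (the real generated names differ):

```scala
// user types:  val x = 5
object Line1 { object read { val x = 5 } }

// user types:  val y = x + 1
object Line2 {
  object read {
    // Standard interpreter: reach the previous line through getInstance-style
    // indirection. Spark instead references Line1.read directly and writes the
    // compiled class files to a shared filesystem, so worker nodes can load
    // them and resolve x the same way the driver does.
    val y = Line1.read.x + 1
  }
}
```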

Performance benchmarks

  • Logistic regression runs about 10x faster than on MapReduce.
  • Interactive queries are much faster after the first query (e.g., from 35 s down to 0.5 s).

Related Work

  • Distributed Shared Memory
    • Fault tolerance can be achieved via checkpointing or lineage; lineage is better.
    • Lineage: only the lost partitions need to be recomputed, and that can be done in parallel on different nodes, without requiring the program to revert to a checkpoint. No overhead if no nodes fail.
  • Language Integration
    • Unlike DryadLINQ, Spark allows RDDs to persist in memory across parallel operations. (DryadLINQ compiles language-integrated queries into Dryad jobs, whose intermediate results are not kept in memory for reuse across operations.)
    • In addition, Spark enriches the language integration model by supporting shared variables (broadcast variables and accumulators), implemented using classes with custom serialized forms.

Future work — was this achieved?

  1. Formally characterize the properties of RDDs and Spark’s other abstractions, and their suitability for various classes of applications and workloads.
  2. Enhance the RDD abstraction to allow programmers to trade between storage cost and re-construction cost.
  3. Define new operations to transform RDDs, including a “shuffle” operation that repartitions an RDD by a given key. Such an operation would allow us to implement group-bys and joins.
  4. Provide higher-level interactive interfaces on top of the Spark interpreter, such as SQL and R [4] shells.
