Spark Notes

Author: 开水的杯子 | Published 2017-03-24 13:13
  • It was designed to solve what MapReduce failed to address: performance problems caused by having no way to reuse data between computations.
    • Iterative jobs (popular in machine learning algorithms; see the logistic regression sketch after this list)
    • Interactive analytics (ad hoc exploratory queries)
  • Resilient distributed dataset (RDD): a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. RDDs can be cached and reused in multiple parallel operations.
  • Fault tolerance achieved through lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to rebuild just that partition.
    • A handle to an RDD contains enough information to compute the RDD starting from data in reliable storage.
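
As a concrete illustration of the iterative pattern, here is a minimal logistic regression sketch in the spirit of the example from the original paper, written against the modern SparkContext API; the input path, the parsing, and the 1-D model are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogisticRegressionSketch {
  case class Point(x: Double, y: Double) // y is the +1/-1 label

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lr-sketch"))

    // Parse once, then keep the points in memory across iterations.
    val points = sc.textFile("hdfs://.../points.txt")
      .map { line =>
        val Array(x, y) = line.split(' ').map(_.toDouble)
        Point(x, y)
      }
      .cache()

    var w = 0.0 // 1-D weight for brevity; the paper uses a D-dimensional vector
    for (_ <- 1 to 10) {
      // Each pass reads the cached RDD instead of re-reading HDFS.
      val gradient = points
        .map(p => (1.0 / (1.0 + math.exp(-p.y * w * p.x)) - 1.0) * p.y * p.x)
        .reduce(_ + _)
      w -= gradient
    }
    println(s"final w = $w")
  }
}
```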

Constructing RDDs

  • From a file in HDFS
  • Parallelizing a Scala collection
  • Transforming an existing RDD
  • Change the persistence of an RDD (all of the above are sketched below)
    • Cache: lazy; leaves the dataset in cache after it is first computed. A hint only: not enforced if there is not enough memory.
    • Save: writes the dataset to the file system
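
A minimal sketch of these construction and persistence options, as one might type into the Spark shell (where `sc`, a SparkContext, is predefined; the paths are illustrative):

```scala
val fromFile    = sc.textFile("hdfs://.../input.txt")      // from a file in HDFS
val fromScala   = sc.parallelize(1 to 1000)                // from a Scala collection
val transformed = fromScala.map(_ * 2).filter(_ % 3 == 0)  // from an existing RDD

transformed.cache()                          // hint: keep in memory after first computation
transformed.saveAsTextFile("hdfs://.../out") // write the dataset to the file system
```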

Parallel Operations

  • Reduce: combines dataset elements using an associative function to produce a result at the driver program; reduce results are collected at a single process
  • Collect: sends all elements of the dataset to the driver program
  • Foreach: passes each element through a user-provided function (a short sketch of all three follows)
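
A short sketch of the three operations, assuming the `sc` from the previous example:

```scala
val nums = sc.parallelize(1 to 100)

val total      = nums.reduce(_ + _) // associative combine; the result arrives at the driver
val everything = nums.collect()     // ships all elements back to the driver
nums.foreach(n => println(n))       // runs on the workers (output goes to worker stdout)
```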

Shared Variables

  • Broadcast variables: distribute a large piece of read-only data to all workers once, rather than packaging it with every closure.
  • Accumulators: workers can only add to them; only the driver can read the value. (Both are sketched below.)
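
A minimal sketch of both kinds using the modern API names (`sc.broadcast` and `sc.longAccumulator`, which postdate the paper); the lookup table and data are illustrative:

```scala
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2)) // read-only; shipped once per worker
val misses = sc.longAccumulator("misses")          // workers add; the driver reads

sc.parallelize(Seq("a", "b", "c", "a")).foreach { k =>
  if (!lookup.value.contains(k)) misses.add(1)     // workers may only add
}
println(s"misses = ${misses.value}")               // reading is only meaningful at the driver
```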

Implementation

  • What is Mesos?!
  • Spark is built on top of Mesos [16, 15], a “cluster operating system” that lets multiple parallel applications share a cluster in a fine-grained manner and provides an API for applications to launch tasks on a cluster

RDD Implementation

  • Internally, each RDD object implements the same simple interface, which consists of three operations (sketched after this list):
    • getPartitions: returns a list of partition IDs.
    • getIterator(partition): iterates over a partition.
    • getPreferredLocations(partition): used for task scheduling to achieve data locality.
  • Delay scheduling: send each task to one of its preferred locations.
  • If a node fails, its partitions are re-read from their parent datasets and eventually cached on other nodes.
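
A sketch of that internal interface as a Scala trait; the method names follow the paper, while the concrete types are my own assumptions:

```scala
trait SimpleRDD[T] {
  def getPartitions: Seq[Int]                             // list of partition IDs
  def getIterator(partition: Int): Iterator[T]            // stream the elements of one partition
  def getPreferredLocations(partition: Int): Seq[String]  // candidate hosts, for data locality
}
```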

Shared Variables Implementation

  • Broadcast variables and accumulators are implemented using classes with custom serialization formats.
  • A broadcast variable is saved to the filesystem once, then fetched and cached on each worker node.
  • An accumulator is serialized along with its zero value; each worker node updates its own copy starting from zero and sends the updates back to the driver for a global merge (a schematic sketch follows).
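
A schematic sketch (not Spark's actual code) of the accumulator serialization trick: the copy a worker deserializes starts from zero, so only the local deltas need to be sent back for the driver to merge:

```scala
import java.io.ObjectInputStream

class Accum[T](zero: T, add: (T, T) => T) extends Serializable {
  @transient var value: T = zero                 // live value; not serialized

  def +=(v: T): Unit = { value = add(value, v) } // workers may only add

  // On deserialization at a worker, restart from zero; the runtime would
  // later ship `value` (the local delta) back to the driver for merging.
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = zero
  }
}
```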

Interpreter Integration

  • Scala compiles a class for each line typed by the user, including a singleton object that contains the variables and functions defined on that line.
  • Previous lines are referenced via the class's getInstance method.
  • Spark changed this to output the compiled classes to a shared filesystem and to reference the singleton objects directly (a simplified sketch follows).
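
A simplified sketch of the wrapper objects the interpreter generates (the real generated names differ):

```scala
// user types:  val x = 5
object Line1 { object read { val x = 5 } }

// user types:  val y = x + 1
object Line2 {
  object read {
    // Standard interpreter: reach the previous line through getInstance-style
    // indirection. Spark instead references Line1.read directly and writes the
    // compiled class files to a shared filesystem, so worker nodes can load
    // them and resolve x the same way the driver does.
    val y = Line1.read.x + 1
  }
}
```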

Performance benchmarks

  • Logistic regression runs about 10x faster than on MapReduce.
  • Interactive queries are much faster after the first query (e.g., from 35 s down to 0.5 s).

Related Work

  • Distributed Shared Memory
    • Fault tolerance can be achieved via checkpointing or lineage; lineage is better.
    • Lineage: only the lost partitions need to be recomputed, and that can be done in parallel on different nodes, without requiring the program to revert to a checkpoint. No overhead if no nodes fail.
  • Language Integration
    • Unlike DryadLINQ, Spark allows RDDs to persist in memory across parallel operations. (DryadLINQ compiles language-integrated queries into Dryad jobs, whose intermediate results are not kept in memory for reuse across operations.)
    • In addition, Spark enriches the language integration model by supporting shared variables (broadcast variables and accumulators), implemented using classes with custom serialized forms.

Future work — was this achieved?

  1. Formally characterize the properties of RDDs and Spark’s other abstractions, and their suitability for various classes of applications and workloads.
  2. Enhance the RDD abstraction to allow programmers to trade between storage cost and re-construction cost.
  3. Define new operations to transform RDDs, including a “shuffle” operation that repartitions an RDD by a given key. Such an operation would allow us to implement group-bys and joins.
  4. Provide higher-level interactive interfaces on top of the Spark interpreter, such as SQL and R [4] shells.
