
6.824 2015 Lecture 1: Introduction and lab overview

6.824: Distributed Systems Engineering

What is a distributed system?

multiple networked cooperating computers

Examples: Internet E-Mail, Athena file server, Google MapReduce, etc.

Why distribute?

to connect physically separate entities

to achieve security via physical isolation

to tolerate faults via replication at separate sites

to increase performance via parallel CPUs/mem/disk/net

But:

complex, hard to debug

new classes of problems, e.g. partial failure (did server accept my e-mail?)

advice: don't distribute if a central system will work

Why take this course?

interesting -- hard problems, non-obvious solutions

active research area -- lots of progress + big unsolved problems

used by real systems -- driven by the rise of big Web sites

hands-on -- you'll build a real system in the labs

COURSE STRUCTURE

http://pdos.csail.mit.edu/6.824

Course components:

Lectures about big ideas, papers, labs

Readings: research papers as case studies

please read papers before class

otherwise boring, and you can't pick it up by listening

each paper has a question for you to answer

and you should think of a question you would like to have answered

submit question&answer before class, one or two paragraphs

Mid-term exam in class, and final exam

Labs: build increasingly sophisticated fault-tolerant services

First lab is due on Monday

PhD students can substitute a small research project for the 5th lab

talk to us

TAs: Steven Allen, Rohan Mahajan, Steven Valdez

answer questions about material

help you with labs

will post office hours

MAIN TOPICS

Example:

a shared file system, so users can cooperate, like Athena's AFS

lots of client computers

[diagram: clients, network, vague set of servers]

Topic: architecture

What interface?

Clients talk to servers -- what do they say?

File system (files, file names, directories, &c)?

Disk blocks, with FS in client?

Separate naming + file servers?

Separate FS + block servers?

Single machine room or unified wide area system?

Wide-area more difficult.

Transparent?

i.e. should it act exactly like a local disk file system?

or is it OK if apps/users have to cope with distribution,

e.g. know what server files are on, or deal with failures.

Client/server or peer-to-peer?

All these interact w/ performance, usefulness, fault behavior.

Topic: implementation

How to simplify network communication?

Can be messy (msg formatting, re-transmission, host names, &c)

Frameworks can help: RPC, MapReduce, &c

How to cope with inherent concurrency?

Threads, locks, &c.
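
  a minimal Go sketch of "threads, locks" (not lab code; names are made up):
  several goroutines update a shared map under a mutex

  package main

  import (
      "fmt"
      "sync"
  )

  func main() {
      var mu sync.Mutex
      counts := map[string]int{} // shared state touched by many goroutines

      var wg sync.WaitGroup
      for _, f := range []string{"f1", "f2", "f1", "f3"} {
          wg.Add(1)
          go func(name string) {
              defer wg.Done()
              mu.Lock() // without the lock this is a data race on the map
              counts[name]++
              mu.Unlock()
          }(f)
      }
      wg.Wait()
      fmt.Println(counts) // map[f1:2 f2:1 f3:1]
  }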

Topic: performance

Distribution can hurt: network b/w and latency bottlenecks

Lots of tricks, e.g. caching, concurrency, pre-fetch

Distribution can help: parallelism, pick server near client

Idea: scalable design

Nx servers -> Nx total performance

Need a way to divide the load by N

Divide data over many servers ("sharding" or "partitioning")

By hash of file name?

By user?

Move files around dynamically to even out load?

"Stripe" each file's blocks over the servers?

Performance scaling is rarely perfect

Some operations are global and hit all servers (e.g. search)

Nx servers -> 1x performance

Load imbalance

Everyone wants to get at a single popular file

-> one server 100%, added servers mostly idle

-> Nx servers -> 1x performance
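
  to make "divide data by hash of file name" concrete, a small Go sketch
  (the server addresses are invented):

  package main

  import (
      "fmt"
      "hash/fnv"
  )

  // pickServer maps a file name to one of N servers by hashing the name.
  func pickServer(file string, servers []string) string {
      h := fnv.New32a()
      h.Write([]byte(file))
      return servers[h.Sum32()%uint32(len(servers))]
  }

  func main() {
      servers := []string{"fs1:8000", "fs2:8000", "fs3:8000"} // hypothetical
      for _, f := range []string{"notes.txt", "grades.xls", "lab1.go"} {
          fmt.Println(f, "->", pickServer(f, servers))
      }
  }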

Topic: fault tolerance

Big system (1000s of servers, complex net) -> always something broken

We might want:

Availability -- I can keep using my files despite failures

Durability -- my files will come back to life someday

Availability idea: replicate

Servers form pairs, each file on both servers in the pair

Client sends every operation to both

If one server down, client can proceed using the other

Opportunity: operate from both "replicas" independently if partitioned?

Opportunity: can 2 servers yield 2x availability AND 2x performance?
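
  a toy Go sketch of "send every operation to both" (in-memory stand-ins,
  not real servers or RPC): the client succeeds as long as one replica
  accepts the write; note the down replica has now missed an operation,
  which is exactly the consistency problem below

  package main

  import (
      "errors"
      "fmt"
  )

  // replica is a stand-in for one file server in a pair.
  type replica struct {
      up    bool
      files map[string]string
  }

  func (r *replica) write(name, data string) error {
      if !r.up {
          return errors.New("replica down")
      }
      r.files[name] = data
      return nil
  }

  // clientWrite sends the operation to both replicas; it succeeds
  // as long as at least one replica accepted it.
  func clientWrite(a, b *replica, name, data string) error {
      errA := a.write(name, data)
      errB := b.write(name, data)
      if errA != nil && errB != nil {
          return errors.New("both replicas failed")
      }
      return nil
  }

  func main() {
      a := &replica{up: true, files: map[string]string{}}
      b := &replica{up: false, files: map[string]string{}} // one server is down
      fmt.Println(clientWrite(a, b, "notes.txt", "hello")) // <nil>: still available
  }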

Topic: consistency

Assume a contract w/ apps/users about meaning of operations

e.g. "read yields most recently written value"

Consistency is about fulfilling the contract

despite failure, replication/caching, concurrency, &c

Problem: keep replicas identical

If one is down, it will miss operations

Must be brought up to date after reboot

If net is broken, *both* replicas maybe live, and see different ops

Delete file, still visible via other replica

"split brain" -- usually bad

Problem: clients may see updates in different orders

Due to caching or replication

I make 6.824 directory private, then TA creates grades file

What if the operations run in different order on different replicas?

Consistency often hurts performance (communication, blocking)

Many systems cut corners -- "relaxed consistency"

Shifts burden to applications
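
  to make the directory/grades ordering example concrete, a toy Go sketch
  (names invented): two replicas apply the same two operations in different
  orders and end up disagreeing

  package main

  import "fmt"

  type dir struct {
      private bool
      files   []string
  }

  func makePrivate(d *dir) { d.private = true }

  func createFile(d *dir, name string) {
      if d.private {
          return // no permission: the directory is already private
      }
      d.files = append(d.files, name)
  }

  func main() {
      var r1, r2 dir

      // replica 1 sees: make private, then create grades
      makePrivate(&r1)
      createFile(&r1, "grades")

      // replica 2 sees the same operations in the other order
      createFile(&r2, "grades")
      makePrivate(&r2)

      fmt.Println(r1.files) // []       -- grades was rejected
      fmt.Println(r2.files) // [grades] -- the replicas now disagree
  }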

LABS

focus: fault tolerance and consistency -- central to distrib sys

lab 1: MapReduce

labs 2 through 5: storage servers

progressively more sophisticated (tolerate more kinds of faults)

progressively harder too!

patterned after real systems, e.g. MongoDB

end up with core of a real-world design for 1000s of servers

what you'll learn from the labs

easy to listen to lecture / read paper and think you understand

building forces you to really understand

you'll have to do some design yourself

we supply skeleton, requirements, and tests

you'll have substantial scope to solve problems your own way

you'll get experience debugging distributed systems

tricky due to concurrency, unreliable messages

we've tried to ensure that the hard problems have to do w/ distrib sys

not e.g. fighting against language, libraries, &c

thus Go (type-safe, garbage collected, slick RPC library)

thus fairly simple services (mapreduce, key/value store)

grades depend on how many test cases you pass

we give you the tests, so you know whether you'll do well

careful: if it usually passes, but occasionally fails,

chances are it will fail when we run it

code review

look at someone else's lab solution

perhaps learn about another approach

send feedback

and receive feedback

Lab 1: MapReduce

framework for parallel programming on 1000s of computers

help you get up to speed on Go and distributed programming

first exposure to some fault tolerance

motivation for better fault tolerance in later labs

motivating app for many papers

popular distributed programming framework

with many intellectual children

MapReduce computational model

programmer defines Map and Reduce functions

input is key/value pairs, divided into splits

perhaps lots of files, k/v is filename/content

Input Map -> a,1  b,7  c,9
Input Map ->      b,2
Input Map -> a,3       c,7
              |    |    |
                   |    -> Reduce -> c,16
                   -----> Reduce -> b,9

MR framework calls Map() on each split, produces set of k2,v2

MR framework gathers all Maps' v2's for a given k2,

and passes them to a Reduce call

final output is set of pairs from Reduce()

Example: word count

input is thousands of text files

Map(k, v)

split v into words

for each word w

emit(w, "1")

Reduce(k, v)

emit(len(v))
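
  a rough Go version of the word-count pseudocode above; the KeyValue type
  and function signatures are simplified stand-ins, not necessarily the
  exact interfaces in the lab skeleton

  package main

  import (
      "fmt"
      "strconv"
      "strings"
      "unicode"
  )

  type KeyValue struct {
      Key   string
      Value string
  }

  // Map splits its chunk of input into words and emits (word, "1") for each.
  func Map(value string) []KeyValue {
      words := strings.FieldsFunc(value, func(r rune) bool { return !unicode.IsLetter(r) })
      var kvs []KeyValue
      for _, w := range words {
          kvs = append(kvs, KeyValue{w, "1"})
      }
      return kvs
  }

  // Reduce receives all the "1"s emitted for one word and returns the count.
  func Reduce(key string, values []string) string {
      return strconv.Itoa(len(values))
  }

  func main() {
      fmt.Println(Map("a b"))                      // [{a 1} {b 1}]
      fmt.Println(Reduce("b", []string{"1", "1"})) // 2
  }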

What does MR framework do for word count?

[master, input files, map workers, map output, reduce workers, output files]

input files:

f1: a b

f2: b c

send "f1" to map worker 1

Map("f1", "a b") -> (a,1) (b,1)

send "f2" to map worker 2

Map("f2", "b c") -> (b,1) (c,1)

framework waits for Map jobs to finish

workers sort Map output by key

framework tells each reduce worker what key to reduce

worker 1: a

worker 2: b

worker 2: c

each reduce worker pulls needed Map output from Map workers

worker 1 pulls "a" Map output from every worker

each reduce worker calls Reduce once for each of its keys

worker 1: Reduce("a", [1]) -> 1

worker 2: Reduce("b", [1, 1]) -> 2

Reduce("c", [1]) -> 1
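
  a minimal sequential Go sketch of what the framework does in this
  walk-through; in the real framework the Map and Reduce calls run on
  different workers and the intermediate data crosses the network

  package main

  import (
      "fmt"
      "strconv"
      "strings"
  )

  type KeyValue struct{ Key, Value string }

  func Map(contents string) []KeyValue {
      var kvs []KeyValue
      for _, w := range strings.Fields(contents) {
          kvs = append(kvs, KeyValue{w, "1"})
      }
      return kvs
  }

  func Reduce(key string, values []string) string { return strconv.Itoa(len(values)) }

  func main() {
      // the two "input files" from the walk-through
      inputs := map[string]string{"f1": "a b", "f2": "b c"}

      // map phase: call Map on each split, gather values by key
      grouped := map[string][]string{}
      for _, contents := range inputs {
          for _, kv := range Map(contents) {
              grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
          }
      }

      // reduce phase: one Reduce call per key
      for key, values := range grouped {
          fmt.Println(key, Reduce(key, values)) // a 1, b 2, c 1 (in some order)
      }
  }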

Why is the MR framework convenient?

* programmer only needs to think about the core work,

the Map and Reduce functions, and does not have to worry about

network communication, failure, &c.

* the grouping by key between Map and Reduce fits

some applications well (e.g., word count), since

it brings together data needed by the Reduce.

* but some applications don't fit well, because MR

only allows the one type of communication between

different parts of the application.

e.g. word count but sort by frequency.

Why might MR have good performance?

Map and Reduce functions run in parallel on different workers

Nx workers -> divide run-time by N

But rarely quite that good:

move map output to reduce workers

stragglers

read/write network file system

What about failures?

People use MR with 1000s of workers and vast inputs

Suppose each worker only crashes once per year

That's about 3 per day (1000 workers / 365 days)!

So a big MR job is very likely to suffer worker failures

Other things can go wrong:

Worker may be slow

Worker CPU may compute incorrectly

Master may crash

Parts of the network may fail, lose packets, &c

Map or Reduce or framework may have bugs in software

Tools for dealing with failure?

retry -- if worker fails, run its work on another worker

replicate -- run each Map and Reduce on *two* workers

replace -- for long-term health

MapReduce uses all of these

Puzzles for retry

how do we know when to retry?

can we detect when Map or Reduce worker is broken?

can we detect incorrect worker output?

can we distinguish worker failure from worker up, network lossy?

why is retry correct?

what if Map produces some output, then crashes?

will we get duplicate output?

what if we end up with two of the same Map running?

in general, calling a function twice is not the same as calling it once

why is it OK for Map and Reduce?

Helpful assumptions

One must make assumptions, otherwise too hard

No bugs in software

No incorrect computation: worker either produces correct output,

or nothing -- assuming fail-stop.

Master doesn't crash

Map and Reduce are strict functions of their arguments

they don't secretly read/write files, talk to each other,

send/receive network messages, &c

lab 1 has three parts:

Part I: just Map() and Reduce() for word count

Part II: we give you most of a distributed multi-server framework,

you fill in the master code that hands out the work

to a set of worker threads.

Part III: make master cope with crashed workers by re-trying.

Part I: main/wc.go

stubs for Map and Reduce

you fill them out to implement word count

Map argument is a string, a big chunk of the input file

demo of solution to Part I

./wc master kjv12.txt sequential

more mrtmp.kjv12.txt-1-2

more mrtmp.kjv12.txt

Part I sequential framework: mapreduce/mapreduce.go RunSingle()

split, maps, reduces, merge

Part II parallel framework:

master

workers...

shared file system

our code splits the input before calling your master,

and merges the output after your master returns

our code only tells the master the number of map and reduce splits (jobs)

each worker sends Register RPC to master

your master code must maintain a list of registered workers

master sends DoJob RPCs to workers

if 10 map jobs and 3 workers,

send out 3, wait until one worker says it's done,

send it another, until all 10 done

then the same for reduces

master only needs to send job # and map vs reduce to worker

worker reads input from files

so your master code only needs to know the number of

map and reduce jobs!

which it can find from the "mr" argument
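
  one possible shape for the master's dispatch loop, using a channel of idle
  workers; callDoJob stands in for the lab's DoJob RPC, and the argument
  details here are made up

  package main

  import (
      "fmt"
      "sync"
  )

  // callDoJob stands in for sending a DoJob RPC to a worker; it would
  // return false if the worker failed (Part III: the master then retries
  // the job on another worker).
  func callDoJob(worker string, phase string, jobNumber int) bool {
      fmt.Printf("worker %s: %s job %d\n", worker, phase, jobNumber)
      return true
  }

  // runPhase hands out nJobs jobs of one phase (map or reduce) to whatever
  // workers are available on idle, and returns when all jobs have finished.
  func runPhase(phase string, nJobs int, idle chan string) {
      var wg sync.WaitGroup
      for i := 0; i < nJobs; i++ {
          wg.Add(1)
          go func(job int) {
              defer wg.Done()
              for {
                  w := <-idle // wait for an idle (registered) worker
                  if callDoJob(w, phase, job) {
                      idle <- w // worker is idle again
                      return
                  }
                  // worker failed: leave it out of the pool, try another worker
              }
          }(i)
      }
      wg.Wait()
  }

  func main() {
      // in the lab, workers announce themselves with a Register RPC;
      // here we just pre-load the idle channel with three worker names
      idle := make(chan string, 100)
      for _, w := range []string{"w1", "w2", "w3"} {
          idle <- w
      }

      runPhase("map", 10, idle)   // all map jobs must finish...
      runPhase("reduce", 3, idle) // ...before the reduce jobs start
  }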

Thursday:

master and workers talk via RPC, which hides network complexity

more about RPC on Thursday
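
  as a preview, a minimal Go net/rpc example (unrelated to the lab code)
  showing how an RPC call looks like a local function call

  package main

  import (
      "fmt"
      "log"
      "net"
      "net/rpc"
  )

  type Args struct{ A, B int }

  type Arith struct{}

  // Add is an RPC method: exported, takes args and a reply pointer,
  // returns error. The library handles marshalling and the network.
  func (t *Arith) Add(args *Args, reply *int) error {
      *reply = args.A + args.B
      return nil
  }

  func main() {
      // server side
      if err := rpc.Register(new(Arith)); err != nil {
          log.Fatal(err)
      }
      l, err := net.Listen("tcp", "127.0.0.1:0") // any free port
      if err != nil {
          log.Fatal(err)
      }
      go rpc.Accept(l)

      // client side: dial the server and call a remote method
      client, err := rpc.Dial("tcp", l.Addr().String())
      if err != nil {
          log.Fatal(err)
      }
      var sum int
      if err := client.Call("Arith.Add", &Args{A: 3, B: 4}, &sum); err != nil {
          log.Fatal(err)
      }
      fmt.Println(sum) // 7
  }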
