TODO: Basics + step by step + TinyURL + Pastebin
Basics:
A/C: 看Doc 理解基本概念,google + 实际工作经验 消化理解。
ETA 7号
实际完成时间:14号
Why System Design Interviews?
Points:
-
high-level design
-
guide and move the conversation forward
-
discussion with interviewer --> core
-
gather all requirements 因为interviewer 不会告诉你
-
Leading the conversation: candidate leads the discussion to go broad and deep -> take the interviewer with you step by step
-
Solving by breaking down: top-down and modularization
6.1 break problems into modules and solve them independently
6.2 each component -> sub-problem -> algorithm
6.3 NOTE:
大多数情况是没有以前6.2这么理想的情况的(得到具体solution)
真正重要的是:how you make progress on 解决问题和采用哪种strategy -
handle bottleneck:
7.1 each solution is a kind of trade-off
7.2 talk about these trade-offs and to measure their impact on the system keeping all the constraints and use cases in mind -
Steps:
8.1 Scoping the problem: No assumptions, Ask, 限制,use cases
8.2 Abstract design: blocks of the system and the relationships between them.
8.3 Identify and address the bottlenecks by using the fundamental principles of scalable system design. -
Know the preference of the interviewer
9.1 focus on the right things while discussing the problem 心里有大局掌控的同时 要多问面试官是否想要多讨论一下 当前/某个 部分
Must-do.
- 花几分钟跟interviewer搞清楚 the full scope of the system
- 在完成High-level design之后,make sure that interviewer is OK with that. then move on to details --> scale
Super Important:
- NEVER assume things that are not stated!!!
LB:
-
LB is to distribute load to multiple resources
-
Where to ADD:
2.1 user <--> web server
2.2 web server <--> internal platform layer(e.g. app server or cache server)
2.3 internal platform layer <--> DB
image.png -
Implement
3.1 Smart Clients
In a word: 纯自己develop,可以在各个layer,包括user和db server
3.2 Hardware Load Balancers
In a word: 纯硬件
3.3 Software Load Balancers
In a word: 用LB的软件,但是有时候需要software + hardware结合(如果如果无法控制user的host的port)
Cache
1. 各种Cache
1.1 Application server cache
request 到哪个node,哪个node就cache一下
pros: straightforward
cons: 如果LB是random分的request,会miss cache
1.2 Distributed cache
增加一个node来cache "cache是否available",利用hashing function,每个node上面只存自己需要负责存的cache。当查到自己这儿有相应的cache的时候,(在接到request的node去database那儿拿data前)发一个request告诉接到这个request 的node他这儿有cache
pros: 解决了application server cache的cons
cons: 如果有Missing node会很麻烦,即使利用"让多份copy of data到不同的nodes上"来解决这个con也会使整个cache变得很复杂
1.3 Global Cache
有两种global cache:
第一种的好处不用说
第二种的对于有的情况会有好处,比如:(1)cache的file很大 (2)cache的东西是static的,不希望被evict
1.4 Content Distribution Network (CDN)
个人理解(To be changed if it's wrong) 本质就是global cache
2. Cache Invalidation
三种strategies:
2.1 Write-through cache
两边(cache + DB)一起更新
cons: 特别慢,load大
2.2 Write-around cache
只存DB
cons: 慢, miss cache
2.3 Write-back cache
只存cache,定期更新
cons: 如果cache在DB更新前突然崩了,data可能会丢失
3. Cache eviction policies
FIFO, LIFO, LRU, MRU, LFU, RR
Sharding or Data Partitioning
-> break up a big database (DB) into many smaller parts
-
Partition Methods
1.1 Horizontal partitioning: range-based
cons: 分布不均
1.2 Vertical Partitioning: column-based
cons: 一般都得需要再partition
1.3 Directory Based Partitioning: mapping -
Partitioning Criteria
2.1 Key or Hash-based partitioning:
hash -> consistent hashing
2.2 List partitioning:
Only contains the content with specific values
2.3 Round-robin partitioning
2.4 Composite partitioning -
Common Problems of Sharding
a. Joins and Denormalization:
problem: 不能join
solution: denormalization -> con: have to deal with data inconsistency(caused by denormalization)
b. Referential integrity:
problem: not support foreign key constraint
solution: application needs to (1)handle itself (2)clean up dangling references
c. Rebalancing:
problem: 有时候各种原因会导致各个shard间unbalance
solution: Hash-based partitioning with consistent hashing
Index
- used to improve the speed of data retrieval operations on the data store
- It's a data structure
Proxies
Used to filter requests or log requests or transform requests
?As a cache?
Collapse the same requests from a system-wide perspective
Collapse the requests that are 空间相同(spatially close together,如同一个DB)
在(1)high load的情况下(2)cache 有限的情况下 非常有用
Queues
LB有时候也会让request分配不均,产生可以本避免的latency。
用了Queue, client不需要就在那儿等着某个Server的response。
好处:
- asynchronously run tasks
- 可以有更灵活的retry机制, fault tolerance.
Queue的限制: size of data and the number of outstanding requests
Redundancy and Replication
Backup important data or service.
- Failover
- shared-nothing architecture
SQL vs. NoSQL
Differ in (1) the way they were built (2) the kind of info they stored (3) how they store it
SQL: structured, pre-defined schema(例子:电话簿 姓名-电话-地址)
NoSQL: unstructured, distributed, dynamic schema(例子:folder 姓名-一切关于这个人的东西,如地址、电话、FB点赞数)
NoSQL types:
- Key-Value Stores:
Stored in key-value pairs.
e.g. Redis, Dynamo - Document Databases:
Stored in documents. Documents are grouped by collections. Each document 可以有完全不同的数据结构。
e.g. MongoDB - Wide-Column Databases:
Colum families --> container of rows. 不需要提前了解columns, 每个row 不需要有同样的column.
常用于large dataset.
e.g. Cassandra, HBase. - Graph Databases:
用于存那些关系容易用graph来表示的数据
High level differences between SQL and NoSQL
- Storage:
SQL: row --> entity
NoSQL: different models - Schema
SQL: fixed schema <-- whole database
NoSQL: dynamic - Querying
SQL: SQL -> data
NoSQL: focus on collections, different DB --> different syntax - Scalability
SQL: vertically scalable
NoSQL: horizontally scalable - Reliability or ACID Compliancy
SQL: Good
NoSQL: Bad
SQL VS. NoSQL - Which one to use?
- When SQL:
(1) ensure ACID compliance
(2)data is structured and unchanging - When NoSQL:
(1) data with no or little structure.
(2) Used in Cloud Computing. Scalable.
(3) Rapid development.
CAP Theorem
三个中最多满足两个。
CAP
Consistent Hashing
如果想不起来了可以参考wiki 或者
https://www.educative.io/collection/page/5668639101419520/5649050225344512/5709068098338816
上面的flash
Long-Polling vs WebSockets vs Server-Sent Events
Ajax Polling
client 发request --> server 收到然后处理发response(不管有没有能回的都要回,所以会产生empty response)
一般都是固定频率的发request
cons: too many requests
Long polling
client 发一个request,但是NOT expect 马上回复 --> server收到request后,每当有东西需要回的时候再回 --> client收到response再马上发一个request
WebSocket
persistent connection between client and server
双方想什么时候发data就什么时候发
Server-Sent Events
只server想什么时候给client发就什么时候发
如果client想给server发data需要用别的protocol
网友评论