美文网首页
(GeekBand)系统设计与实践 案例分析

(GeekBand)系统设计与实践 案例分析

作者: Linary_L | 来源:发表于2017-01-10 12:58 被阅读0次

    案例

    • News Feeds
    • Stats Server
    • Web Crawler
    • Amazon Product Page

    News feed(信息流)

    Define feed

    Organize
    • aggregate(分类)
    • dedup(去重)
    • sort(排序)

    Level1.0

    Database Schema:
    • User
    • Friendship
    • News
    GetNewsfeed:
    • merge news
    • Newsfeed vs News

    Why bad?

    100+ friends
    1Query-->Get friends list
    
    1Query-->
    

    SELECT news

    WHERE timestamp>xxx
    AND sourceid IN friend list
    LIMIT 1000
    

    IN is slow

    Either Sequential scan or 100+ index queries
    

    Level 2.0

    Pull vs Push

    Pull:Get news from each friend,merge them together.(NewsFeed generated when user request)

    Push:NewsFeed generated when news generated.(we have another table to store newsfeed,may cause duplicate news)

    Push:
    1Query to select latest 1000 newsfeed.
    100+ insert queries(Async)
    

    Disadvantage:News Delay.

    Level 3.0

    Popular star(Justin Bieber)

    Flowers 13M+

    Async Push may cause over 30 minutes(13M+ insertions,delay too long)

    Push+Pull

    for popular star,don't push news to flowers

    for every newfeed reqiest,merge non-popular user newfeed(push) and popular users newsfeed(pull)

    Level 4.0

    Push disadvantage
    • Realtime
    • Storage(Duplicate)
    • Edit
    Go back to PULL:
    • Cache users' latest (14days) news
    • Broadcast multiple request to multiple servers(Shard by userld).
    • Merge & sort newsfeed
    • Cache newsfeeds for this user with timestamp

    Click Stats Server

    How are click stats stored

    A poor candidate will suggest write-back to a data store on every click

    A good candidate will suggest some form of aggregation tier that accepts clickstream data,aggregates it,and writes back a persistent data store periodically

    A great candidate will suggest alow-latecy messaging system to bugger the click data and transfer it to the aggregation tier.

    If daily,storing in hdfs and running map/reduce jobs to compute stats is a reasonable approach

    If near real-time,the aggregation logic should compute stats

    PS:要如何统计鼠标点击的次数以及相关区域呢?普通的程序员会将每次点击的数据(log)直接存储在数据库一层。比较好的程序员会在前段与数据库间加一个中间层,为点击的数据流做一次聚合,每隔一段时间(1分钟或10分钟)做一次刷新,存储到数据库,大大减轻了后端的压力。优秀的程序员综合以上的两种情况,对于数据量很大,实时性效果不高的情况下,可以通过分布式的批处理方式,将刷新聚合层的时间定位一天。对于时效性强的要适当缩短刷新时间。

    Cache Requirement

    • When a request comes look it up in the cache and if it hits then return the response from here and do not pass the request to the system.
    • If the request is not found in the cache then pass it on to the system.
    • Since cache can only store the last n requests,Insert the n+1th request in the cache and delete one of the older requests from the cache
    • Design one cache such that all operations can be done in O(1)-lookup,delete and insert.
    PS:如何设计cache(LRU设计相关):
    • 在层中缓存部分请求的处理方式,如果接收的请求在层中存在对应的处理方式,则无需把请求发送到后端系统
    • 如果在层中找不到对应处理,则发送需求到后端
    • 以定长队列的形式缓存,缓存最近的n个需求,头进尾出
    • 将层中的匹配操作算法控制在O(1)范围

    Web Crawler

    爬虫

    Amazon Product Page

    The product page includes information such as
    • product information
    • user information
    • recommended products(what do other customers buy after viewing this item,recommendations for you like this product,etc)
    Reference
    • http://highscalability.com
    • The Log:What every software engineer should know about real-time data's unifying abstraction
    • Job Interviews:How should I prepare system design questions for Goole/Facebook Interview?
    • HOW TO ACE A SYSTEMS DESIGN INTERVIEW
    • <Design Pattern>
    • <Design_Oatterns_For_Dummies.pdf>
    • http://www.hiredintech.com/app

    相关文章

      网友评论

          本文标题:(GeekBand)系统设计与实践 案例分析

          本文链接:https://www.haomeiwen.com/subject/sgpkbttx.html