The most important thing is to ask clarifying questions up front: the number of daily users (i.e., confirm the scale), use the scale to pick the DB, and only then move on to the specific application.
The shape of the data (to decide between SQL and NoSQL).
Platform-level software: The common firmware, kernel, operating system distribution, and libraries expected to be present in all individual servers to abstract the hardware of a single machine and provide a basic machine abstraction layer.
Cluster-level infrastructure: The collection of distributed systems software that manages resources and provides services at the cluster level. Ultimately, we consider these services as an operating system for a data center.
Application-level software: Software that implements a specific service. It is often useful to further divide application-level software into online services and offline computations.
Examples of online services are Google Search and Gmail. Offline computations are typically used in large-scale data analysis or as part of the pipeline that generates the data used in online services, for example, building an index of the web or processing satellite images to create map tiles for the online service.
Monitoring and development software: Software that keeps track of system health and availability by monitoring application performance, identifying system bottlenecks, and measuring cluster health.

A virtual machine provides a concise and portable interface to manage both the security and performance isolation of a customer's application, and allows multiple guest operating systems to co-exist with limited additional complexity.
Containers are an alternative popular abstraction that allows for isolation across multiple workloads on a single OS instance. Because each container shares the host OS kernel and associated binaries and libraries, containers are more lightweight compared to VMs, smaller in size, and much faster to start.
2.3 Cluster-level infrastructure software
Much like an operating system layer is needed to manage resources and provide basic services in a single computer, a system composed of thousands of computers, networking, and storage also requires a layer of software that provides analogous functionality at a larger scale. We call this layer the cluster-level infrastructure. Three broad groups of infrastructure software make up this layer.
2.3.1 Resource management
This is perhaps the most indispensable component of the cluster-level infrastructure layer. It controls the mapping of user tasks to hardware resources, enforces priorities and quotas, and provides basic task management services. In its simplest form, it is an interface to manually (and statically) allocate groups of machines to a given user or job. A more useful version would present a higher level of abstraction, automate allocation of resources, and allow resource sharing at a finer level of granularity. Users of such systems would be able to specify their job requirements at a relatively high level (for example, how much CPU performance, memory capacity, and networking bandwidth) and have the scheduler translate those requirements into an appropriate allocation of resources.
Kubernetes is a popular open-source system that fills this role, orchestrating these functions for container-based workloads. (Note: so Kubernetes is the cluster-level "operating system," used mainly to manage and allocate resources.)
Kubernetes provides a family of APIs and controllers that allow users to specify tasks in the popular Open Container Initiative format (which derives from Docker containers). Several patterns of workloads are offered, from horizontally scaled stateless applications to critical stateful applications like databases.
Users define their workloads' resource needs and Kubernetes finds the best machines on which to run them.
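To make this concrete, here is a minimal sketch (not from the book) of the kind of declarative resource request a user might hand to Kubernetes, written as a Python dict that mirrors a Deployment manifest; the workload name, image, replica count, and CPU/memory quantities are all hypothetical.

```python
import json

# Hypothetical Deployment manifest expressed as a Python dict: the user declares
# what the workload needs, and the Kubernetes scheduler decides where it runs.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web-frontend"},                    # assumed workload name
    "spec": {
        "replicas": 3,                                       # horizontally scaled, stateless pattern
        "selector": {"matchLabels": {"app": "web-frontend"}},
        "template": {
            "metadata": {"labels": {"app": "web-frontend"}},
            "spec": {
                "containers": [{
                    "name": "server",
                    "image": "example.com/web-frontend:1.0", # OCI container image (placeholder)
                    "resources": {
                        # The scheduler translates these requirements into a
                        # placement decision on machines in the cluster.
                        "requests": {"cpu": "500m", "memory": "256Mi"},
                        "limits": {"cpu": "1", "memory": "512Mi"},
                    },
                }],
            },
        },
    },
}

print(json.dumps(deployment, indent=2))  # could be serialized to YAML and submitted to the cluster
```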
Colossus (successor to GFS), Dynamo, and Chubby are examples of reliable storage and lock services developed at Google and Amazon for large clusters.
Many tasks that are amenable to manual processes in a small deployment require a significant amount of infrastructure for efficient operations in large-scale systems.
2.3.3 Application framework
The entire infrastructure described in the preceding paragraphs simplifies the deployment and efficient usage of hardware resources, but it does not fundamentally hide the inherent complexity of a large-scale system as a target for the average programmer. From a programmer's standpoint, hardware clusters have a deep and complex memory/storage hierarchy.
Some types of higher-level operations or subsets of problems are common enough in large-scale services that it pays off to build targeted programming frameworks that simplify the development of new products. Flume, MapReduce, Spanner, BigTable, and Dynamo are good examples of pieces of infrastructure software that greatly improve programmer productivity by automatically handling data partitioning, distribution, and fault tolerance within their respective domains. Equivalents of such software exist for the cloud, such as Google Kubernetes Engine (GKE).
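To see why such frameworks help, here is a toy, single-process sketch of the map/reduce programming model in Python. The function names and partitioning scheme are my own illustration; a real framework like MapReduce or Flume provides the distribution, data partitioning, and fault tolerance behind a similar interface.

```python
from collections import defaultdict

def word_count_map(doc_id, text):
    # User-supplied map function: emit (key, value) pairs for each input record.
    for word in text.lower().split():
        yield word, 1

def word_count_reduce(key, values):
    # User-supplied reduce function: combine all values emitted for one key.
    return key, sum(values)

def run_job(documents, mapper, reducer, num_partitions=4):
    # Shuffle: group intermediate pairs by key, split into partitions the way a
    # real framework would spread them across worker machines.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for doc_id, text in documents.items():
        for key, value in mapper(doc_id, text):
            partitions[hash(key) % num_partitions][key].append(value)
    # Reduce each partition independently (in a cluster, these run in parallel).
    results = {}
    for partition in partitions:
        for key, values in partition.items():
            out_key, out_value = reducer(key, values)
            results[out_key] = out_value
    return results

docs = {1: "new york restaurants", 2: "restaurants in new york city"}
print(run_job(docs, word_count_map, word_count_reduce))
# counts: new=2, york=2, restaurants=2, in=1, city=1 (dict order may vary)
```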
2.4 Application-level software
2.4.1 Workload diversity
Web search was one of the first large-scale internet services to gain widespread popularity, as the amount of web content exploded in the mid-1990s, and organizing this massive amount of information went beyond what could be accomplished with available human-managed directory services.
2.4.2 Web search
This is the quintessential "needle in a haystack" problem. Although it is hard to accurately determine the size of the web at any point in time, it is safe to say that it consists of an enormous number of individual documents and continues to grow.
A lexicon structure associates an ID with every term in the repository. The termID identifies a list of documents in which the term occurs, called a posting list, and some contextual information about it, such as position and various other attributes (for example, whether the term is in the document title).
The size of the resulting inverted index depends on the specific implementation, but it tends to be on the same order of magnitude as the original repository. The typical search query consists of a sequence of terms, and the system's task is to find the documents that contain all of the terms (an AND query) and decide which of those documents are most likely to satisfy the user. (Note: so in the end it still comes down to an AND query; I remember Wu Jun covered this point.) Queries can optionally contain special operators to indicate alternation (OR operators) or to restrict the search to occurrences of the terms in a particular sequence (phrase operators). For brevity we focus on the more common AND query in the example below.
Consider a query such as [new york restaurants]. The search algorithm must traverse the posting lists for each term (new, york, restaurants) until it finds all documents contained in all three posting lists. At that point it ranks the documents found using a variety of parameters, such as the overall importance of the document (in Google's case, the PageRank score) as well as other properties such as the number of occurrences of the terms in the documents, their positions, and so on, and returns the highest-ranked documents to the user.
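A small sketch of that flow (toy documents and a made-up ranking heuristic; a real engine combines many more signals, as noted above):

```python
from collections import defaultdict

documents = {
    1: "best new york restaurants and bars",
    2: "new york subway map",
    3: "cheap restaurants in new york city",
}

# Lexicon: term -> termID. Inverted index: termID -> posting list of
# (doc_id, positions), i.e., the documents containing the term plus
# contextual attributes such as word positions.
lexicon = {}
inverted_index = defaultdict(list)
for doc_id, text in documents.items():
    positions = defaultdict(list)
    for pos, term in enumerate(text.split()):
        positions[term].append(pos)
    for term, pos_list in positions.items():
        term_id = lexicon.setdefault(term, len(lexicon))
        inverted_index[term_id].append((doc_id, pos_list))

def and_query(terms):
    # Intersect the posting lists: keep only documents that contain every term.
    doc_sets = []
    for term in terms:
        term_id = lexicon.get(term)
        if term_id is None:
            return []
        doc_sets.append({doc_id for doc_id, _ in inverted_index[term_id]})
    hits = set.intersection(*doc_sets)

    def score(doc_id):
        # Toy ranking: total occurrences of the query terms in the document.
        return sum(len(pos_list)
                   for term in terms
                   for d, pos_list in inverted_index[lexicon[term]]
                   if d == doc_id)

    return sorted(hits, key=score, reverse=True)

print(and_query(["new", "york", "restaurants"]))  # documents 1 and 3 contain all three terms
```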
Given the massive size of the index, the search algorithm needs to run across a few thousand machines. That is accomplished by splitting (or sharding) the index into load-balanced subfiles and distributing them across all of the machines. Index partitioning can be done by document or by term. The user query is received by a front-end web server and distributed to all of the machines in the index cluster. As necessary for throughput or fault tolerance, multiple copies of index subfiles can be placed in different machines, in which case only a subset of the machines is involved in a given query.
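A minimal sketch of that scatter-gather structure, with an assumed shard count and toy corpus (nothing here reflects the actual production design):

```python
NUM_SHARDS = 4  # stands in for thousands of index machines

def shard_for(doc_id):
    # Partition by document: each document's postings live on exactly one shard.
    return doc_id % NUM_SHARDS

class IndexShard:
    def __init__(self):
        self.postings = {}  # term -> set of local doc_ids

    def add(self, doc_id, text):
        for term in text.split():
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, terms):
        # Local AND query over this shard's documents only.
        doc_sets = [self.postings.get(term, set()) for term in terms]
        return set.intersection(*doc_sets) if doc_sets else set()

shards = [IndexShard() for _ in range(NUM_SHARDS)]
corpus = {1: "new york restaurants", 5: "york hotels", 6: "new york restaurants downtown"}
for doc_id, text in corpus.items():
    shards[shard_for(doc_id)].add(doc_id, text)

def front_end_query(terms):
    # Scatter: in a real cluster these lookups fan out in parallel over the network.
    partial_results = [shard.search(terms) for shard in shards]
    # Gather: the final merge is the only step where the shards' results meet.
    return sorted(set().union(*partial_results))

print(front_end_query(["new", "york", "restaurants"]))  # -> [1, 6]
```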
However, high throughput is also a key performance metric because a popular service may need to support many thousands of queries per second. The index is updated frequently, but in the time granularity of handling a single query, it can be considered a read-only structure. Also, because there is no need for index lookups in different machines to communicate with each other except for the final merge step, the computation is very efficiently parallelized. Finally, further parallelism is available by exploiting the fact that there are no logical interactions across different web search queries.
If the index is sharded by doc_ID, this workload has relatively
2.4.4 Scholarly article similarity
Services that respond to user requests provide many examples of large-scale computations required for the operation of internet services. These computations are typically data-parallel workloads needed to prepare or package the data that is subsequently used by the online services. For example, computing PageRank or creating inverted index files from a web repository fall in this category. But in this section, we use a different example: Article similarity relationships complement keyword-based search systems as another way to find relevant information; after finding an article of interest, a user can ask the service to display other articles that are strongly related to the original article.
There are several ways to compute similarity scores, and it is often appropriate to use multiple methods and combine the results. Here we consider co-citation. The underlying idea is to count every article that cites articles A and B as a vote for the similarity between A and B. After that is done for all articles and appropriately normalized, we obtain a numerical score for the (co-citation) similarity between all pairs of articles and create a data structure that, for each article, returns an ordered list (by co-citation score) of similar articles.
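A small sketch of the co-citation computation on toy data (the normalization choice here is mine for the example, not taken from the book):

```python
from collections import defaultdict
from itertools import combinations

# Citing article -> articles it cites (toy data).
citations = {
    "P1": ["A", "B", "C"],
    "P2": ["A", "B"],
    "P3": ["B", "C"],
    "P4": ["A", "C"],
}

# Every article that cites both A and B casts one vote for the pair (A, B).
co_citation = defaultdict(int)
cited_count = defaultdict(int)
for cited in citations.values():
    for article in cited:
        cited_count[article] += 1
    for a, b in combinations(sorted(set(cited)), 2):
        co_citation[(a, b)] += 1

def similar_articles(article):
    # Ordered list of (other_article, normalized score); here the vote count is
    # divided by the geometric mean of the two citation counts as one plausible
    # normalization.
    scores = []
    for (a, b), votes in co_citation.items():
        if article in (a, b):
            other = b if a == article else a
            norm = (cited_count[article] * cited_count[other]) ** 0.5
            scores.append((other, votes / norm))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(similar_articles("A"))  # e.g., [('B', 0.67), ('C', 0.67)]
```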