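The explanations here are not rigorous; they are mainly meant to be easy to remember so you can pass the certification exam. For rigorous explanations, consult the official AWS documentation.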
1. AWS Kinesis Data Streams
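Delivers stream data to the analysis side with low latency and a high degree of customization. When a question mentions streaming data and the data is analyzed directly, strongly suspect this option. Take care to distinguish it from Kinesis Data Firehose.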
Example
- Your organization is looking for a solution that can help the business with streaming data. Several services will require access to read and process the same stream concurrently. What AWS service meets the business requirements?
A. Amazon Kinesis Firehose
B. Amazon Kinesis Streams
C. Amazon CloudFront
D. Amazon SQS
- Your application generates a 1 KB JSON payload that needs to be queued and delivered to EC2 instances for applications. At the end of the day, the application needs to replay the data for the past 24 hours. In the near future, you also need the ability for other multiple EC2 applications to consume the same stream concurrently. What is the best solution for this?
A. Kinesis Data Streams
B. Kinesis Firehose
C. SNS
D. SQS
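Both questions above turn on the same property: a stream keeps records for a retention period, and reading a record does not delete it, so several consumers can read the same data and replay it later. Here is a minimal in-memory sketch of that behavior. `ToyStream` and its methods are hypothetical teaching names, not the Kinesis API, and the 24-hour retention value is only an assumption taken from the question.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ToyStream:
    """In-memory stand-in for a stream; NOT the Kinesis API."""
    retention_seconds: float = 24 * 3600
    _records: list = field(default_factory=list)  # (timestamp, payload) pairs

    def put(self, payload, now=None):
        """Append a record; `now` lets the example use fixed timestamps."""
        ts = time.time() if now is None else now
        self._records.append((ts, payload))

    def read_since(self, start_ts, now=None):
        """Return payloads at or after start_ts that are still retained."""
        now = time.time() if now is None else now
        oldest_kept = now - self.retention_seconds
        cutoff = max(start_ts, oldest_kept)
        return [payload for ts, payload in self._records if ts >= cutoff]


stream = ToyStream()
stream.put({"id": 1}, now=0)
stream.put({"id": 2}, now=3600)

# Two independent consumers read the same records; reading deletes nothing.
consumer_a = stream.read_since(0, now=7200)
consumer_b = stream.read_since(0, now=7200)

# End-of-day replay: read the last 24 hours again.
replay = stream.read_since(0, now=23 * 3600)

# Once the retention window has passed, the oldest record is gone.
after_25h = stream.read_since(0, now=25 * 3600)
```

A queue such as SQS is usually used the other way around: a message is processed and then deleted, which is why "replay" and "multiple consumers of the same stream" point toward Kinesis Data Streams in these questions.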
2. Amazon Kinesis Firehose
Stores stream data somewhere in AWS, such as S3, Elasticsearch Service, or Redshift. Any subsequent analysis is based on the stored data.
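As a rough sketch of the producer side, the helper below sends records to a delivery stream in batches. It assumes `client` behaves like `boto3.client("firehose")` and uses that client's `put_record_batch` call. The batch size of 500 reflects the per-call record limit as I understand it, so check the current quota before relying on it. `FakeFirehose` and the stream name `my-delivery-stream` are hypothetical, included only so the example runs without AWS credentials.

```python
def send_to_firehose(client, delivery_stream, records, batch_size=500):
    """Send byte records in batches; return how many records failed.

    `client` is assumed to behave like boto3.client("firehose").
    """
    failed = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        response = client.put_record_batch(
            DeliveryStreamName=delivery_stream,
            Records=[{"Data": data} for data in batch],
        )
        failed += response.get("FailedPutCount", 0)
    return failed


class FakeFirehose:
    """Hypothetical test double that records calls instead of calling AWS."""

    def __init__(self):
        self.calls = []

    def put_record_batch(self, DeliveryStreamName, Records):
        self.calls.append((DeliveryStreamName, len(Records)))
        return {"FailedPutCount": 0}


fake = FakeFirehose()
events = [b'{"n": %d}\n' % i for i in range(1200)]
failed = send_to_firehose(fake, "my-delivery-stream", events)
```

Where the data lands (S3, Redshift, and so on) is configured on the delivery stream itself rather than in the producer code, which is the "least management" angle the exam tends to test.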
Example
- Your organization needs to ingest a big data stream into their data lake on Amazon S3. The data may stream in at a rate of hundreds of megabytes per second. What AWS service will accomplish the goal with the least amount of management?
A. Amazon Kinesis Firehose
B. Amazon Kinesis Streams
C. Amazon CloudFront
D. Amazon SQS
Reference
AWS Kinesis Data Streams vs Kinesis Data Firehose
3. Protobuf RecordIO Format
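Protobuf RecordIO is a data format that AWS repeatedly emphasizes as a way to speed up training. If a question involves data formats or training speed, watch closely for the Protobuf RecordIO option among the answers. Based on the AWS documentation, I put together a table of file formats vs. built-in algorithms. It has two fewer rows than the AWS table, which makes it easier to memorize for the exam.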
ContentType | Algorithm |
---|---|
application/x-image, image/jpeg, image/png | Object Detection Algorithm, Semantic Segmentation |
application/x-recordio | Object Detection Algorithm |
application/x-recordio-protobuf, text/csv | K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF |
application/x-recordio-protobuf | Factorization Machines, Sequence-to-Sequence |
text/csv, text/libsvm | XGBoost |
application/jsonlines | BlazingText, DeepAR |
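For self-quizzing, the table above can be turned into a small lookup. This is only a sketch that mirrors my simplified study table, so it is neither exhaustive nor authoritative. The names `TABLE`, `content_types_for`, and `supports_recordio_protobuf` are made up for this example.

```python
# Each row: (content types, algorithms), copied from the study table above.
TABLE = [
    (["application/x-image", "image/jpeg", "image/png"],
     ["Object Detection", "Semantic Segmentation"]),
    (["application/x-recordio"], ["Object Detection"]),
    (["application/x-recordio-protobuf", "text/csv"],
     ["K-Means", "k-NN", "Latent Dirichlet Allocation",
      "Linear Learner", "NTM", "PCA", "RCF"]),
    (["application/x-recordio-protobuf"],
     ["Factorization Machines", "Sequence-to-Sequence"]),
    (["text/csv", "text/libsvm"], ["XGBoost"]),
    (["application/jsonlines"], ["BlazingText", "DeepAR"]),
]


def content_types_for(algorithm):
    """Collect every content type listed for an algorithm, in table order."""
    found = []
    for content_types, algorithms in TABLE:
        if algorithm in algorithms:
            for content_type in content_types:
                if content_type not in found:
                    found.append(content_type)
    return found


def supports_recordio_protobuf(algorithm):
    """True if the study table lists Protobuf RecordIO for the algorithm."""
    return "application/x-recordio-protobuf" in content_types_for(algorithm)


object_detection_types = content_types_for("Object Detection")
pca_ok = supports_recordio_protobuf("PCA")
xgboost_ok = supports_recordio_protobuf("XGBoost")
```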
Reference
4. Amazon QuickSight
A BI tool that provides dashboards.
1. Amazon QuickSight ML Insights
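A data visualization tool that AWS tailored for ML. Since it is one of the outlets for ML results and a product AWS promotes heavily, it will very likely show up on the exam.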
https://aws.amazon.com/quicksight/features-ml/?nc=sn&loc=2&dn=2
5. SageMaker Built-in ML Algorithms
Algorithm | Comments |
---|---|
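BlazingText | Word2vec, text classification |
DeepAR Forecasting | RNN-based forecasting for one-dimensional time series; supervised |
Factorization Machines | Finds interactions in high-dimensional sparse data |
Image Classification Algorithm | Image classification; supervised |
IP Insights | Analyzes data that may be associated with IPv4 addresses |
K-Means Algorithm | K-means; groups data into discrete clusters; unsupervised |
K-Nearest Neighbors (k-NN) Algorithm | Groups data using already-labeled examples; supervised (unlike K-means) |
Latent Dirichlet Allocation (LDA) | Unsupervised; groups documents by topic |
Linear learner algorithm | Supervised; linear models for classification and regression |
Neural Topic Model (NTM) Algorithm | Unsupervised; groups documents by topic |
Object2Vec | Supervised; used for feature engineering; replaces high-dimensional features with dense low-dimensional ones |
Object Detection Algorithm | Supervised; object detection in images |
Principal Component Analysis (PCA) Algorithm | Unsupervised; dimensionality reduction |
Random Cut Forest (RCF) Algorithm | Unsupervised; anomaly detection |
Semantic Segmentation | Image processing; fine-grained (pixel-level) |
Sequence to Sequence (seq2seq) | Supervised; sequence generation, not limited to text |
XGBoost Algorithm | Supervised; regression, classification, ranking |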
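The K-Means vs. k-NN distinction in the table is easy to mix up, so here is a toy illustration on one-dimensional data. These are plain-Python teaching versions, not the SageMaker built-in implementations. The function names, the sample points, and the starting centers are all assumptions made for the example.

```python
def knn_predict(labeled_points, x, k=3):
    """Supervised: majority vote among the k labeled points closest to x."""
    nearest = sorted(labeled_points, key=lambda p: abs(p[0] - x))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def kmeans_1d(points, centers, iterations=10):
    """Unsupervised: alternate assignment and mean-update steps."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for x in points:
            idx = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[idx].append(x)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# k-NN needs labels up front.
labeled = [(1.0, "low"), (1.5, "low"), (2.0, "low"),
           (8.0, "high"), (9.0, "high"), (9.5, "high")]
prediction_near_2 = knn_predict(labeled, 2.2)
prediction_near_8 = knn_predict(labeled, 8.4)

# K-means sees only the raw values and discovers the groups itself.
values = [x for x, _ in labeled]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
```

The exam cue is the input: if the question gives you labeled examples, think k-NN; if it asks you to find groups in unlabeled data, think K-Means.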
Reference
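6. How SageMaker Reads Training Data
For exam purposes, remember that SageMaker reads training data only from S3. It cannot read directly from somewhere like a database, so data that is not yet in S3 must be stored there first. The Glue service can help with that step.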
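A minimal sketch of the "put it in S3 first" step, assuming the data has already been exported to a local file (a Glue job or other ETL would handle the database export). The helper assumes `s3_client` behaves like `boto3.client("s3")` and uses its `upload_file` call. The bucket name, key, and `FakeS3` class are hypothetical, included so the example runs without AWS credentials.

```python
def stage_training_data(s3_client, local_path, bucket, key):
    """Upload a local file to S3 and return the s3:// URI to train from.

    `s3_client` is assumed to behave like boto3.client("s3").
    """
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


class FakeS3:
    """Hypothetical test double that records uploads instead of calling AWS."""

    def __init__(self):
        self.uploads = []

    def upload_file(self, Filename, Bucket, Key):
        self.uploads.append((Filename, Bucket, Key))


fake_s3 = FakeS3()
train_uri = stage_training_data(
    fake_s3, "data.csv", "my-ml-bucket", "train/data.csv"
)
```

The returned URI is what you would then pass to the training job as its input data location.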

7. AWS ML Related Services
- Kinesis : Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. Amazon Kinesis offers key capabilities to cost-effectively process streaming data at any scale, along with the flexibility to choose the tools that best suit the requirements of your application.
- Athena : Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
- Glue : AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
- QuickSight : Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud.
- Redshift : No other data warehouse makes it as easy to gain new insights from all your data. With Redshift, you can query and combine exabytes of structured and semi-structured data across your data warehouse, operational database, and data lake using standard SQL.
- Lex : Amazon Lex is a service for building conversational interfaces into any application using voice and text.
- Polly : Amazon Polly is a service that turns text into lifelike speech, allowing you to create applications that talk, and build entirely new categories of speech-enabled products.
- Transcribe : Amazon Transcribe makes it easy for developers to add speech to text capabilities to their applications. Audio data is virtually impossible for computers to search and analyze.