Google Cloud Data Engineer Exam - Dataproc Review Notes

Author: 塞小娜 | Published 2018-08-07 20:02

Dataproc Summary

Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning.

How to load data?

Dataproc connects to BigQuery

Option 1:

BigQuery does not natively know how to work with a Hadoop file system.

Cloud Storage can act as an intermediary between BigQuery and Dataproc.

You export the data from BigQuery into Cloud Storage as sharded files.

The worker nodes in Dataproc then read the sharded files.

Symmetrically, if the Dataproc job produces output, it can be written to Cloud Storage in a format that BigQuery can load.

Appropriate for periodic or infrequent transfers.
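The round trip in Option 1 can be sketched with the `google-cloud-bigquery` Python client. This is a minimal sketch, not the only way to do it: the bucket, prefix, and table names are placeholders, Avro is an arbitrary choice of exchange format, and the client library is imported lazily so the URI helper works without it.

```python
def sharded_uri(bucket, prefix, extension="avro"):
    """Build a wildcard URI; on export BigQuery replaces '*' with a shard number."""
    return f"gs://{bucket}/{prefix}/part-*.{extension}"


def export_to_gcs(table_id, destination_uri):
    """Export a BigQuery table to Cloud Storage as sharded Avro files."""
    from google.cloud import bigquery  # needs the google-cloud-bigquery package

    client = bigquery.Client()
    config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.AVRO
    )
    # result() blocks until the extract job finishes.
    client.extract_table(table_id, destination_uri, job_config=config).result()


def load_from_gcs(source_uri, table_id):
    """Load files written by a Dataproc job from Cloud Storage into BigQuery."""
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
    client.load_table_from_uri(source_uri, table_id, job_config=config).result()


if __name__ == "__main__":
    # Placeholder names; replace with a real bucket and table before running.
    uri = sharded_uri("my-bucket", "exports/events")
    export_to_gcs("my-project.my_dataset.events", uri)
```

The Dataproc job then reads the same `gs://` prefix, and its output directory can be passed to `load_from_gcs` in the other direction.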

Option 2:

Another option is to set up the BigQuery connector on the Dataproc cluster. The connector is a Java library that gives Spark and Hadoop direct read and write access to BigQuery.

The connector reads tables, not query results, so a BigQuery query result must be saved as a table first.

![Screen Shot 2018-07-15 at 12.48.01 am.png](https://img.haomeiwen.com/i9976001/6fcaa78c38c1d404.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240) ![Screen Shot 2018-07-15 at 12.50.02 am.png](https://img.haomeiwen.com/i9976001/9a1b2c9c68b70469.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
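From PySpark, reading a table through the connector might look like the sketch below. The configuration keys and class names are the ones I recall from the Hadoop BigQuery connector's documented PySpark example, so treat them as assumptions and check them against the connector version installed on your cluster. The project, bucket, and table names are placeholders.

```python
def bigquery_input_conf(project, bucket, temp_path, in_project, dataset, table):
    """Hadoop configuration for reading one BigQuery table via the connector."""
    return {
        # Project and bucket the connector uses for its own staging work.
        "mapred.bq.project.id": project,
        "mapred.bq.gcs.bucket": bucket,
        "mapred.bq.temp.gcs.path": temp_path,
        # The table to read (a saved table, not an ad hoc query result).
        "mapred.bq.input.project.id": in_project,
        "mapred.bq.input.dataset.id": dataset,
        "mapred.bq.input.table.id": table,
    }


def read_table(sc, conf):
    """Return an RDD of (row id, JSON string) pairs; sc is a SparkContext."""
    return sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    )


if __name__ == "__main__":
    from pyspark import SparkContext  # available on a Dataproc cluster

    sc = SparkContext()
    conf = bigquery_input_conf(
        "my-project", "my-bucket", "gs://my-bucket/tmp/bq_input",
        "my-project", "my_dataset", "saved_query_result",
    )
    print(read_table(sc, conf).take(5))
```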

Option 3:

When you want to process data in memory for speed, load it into a pandas DataFrame.

In-memory processing is fast, but the data size is limited by available memory.
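A minimal sketch of Option 3, assuming the `google-cloud-bigquery` client with pandas installed. `bounded_query` is a hypothetical helper that caps the row count so the result stays small enough for memory; the table and column names are placeholders.

```python
def bounded_query(table, columns, max_rows):
    """Build a query that selects only the needed columns and caps the rows."""
    cols = ", ".join(columns)
    return f"SELECT {cols} FROM `{table}` LIMIT {max_rows}"


def to_dataframe(sql):
    """Run the query and pull the full result into a pandas DataFrame."""
    from google.cloud import bigquery  # needs google-cloud-bigquery and pandas

    client = bigquery.Client()
    return client.query(sql).to_dataframe()


if __name__ == "__main__":
    sql = bounded_query("my-project.my_dataset.events", ["user_id", "ts"], 100000)
    df = to_dataframe(sql)
    print(df.describe())
```

Everything returned by the query is held in the memory of one machine, which is why the row cap and column selection matter here.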

Creating a Dataproc cluster

Ways:
Deployment Manager template (Deployment Manager is the infrastructure automation service in Google Cloud)
CLI commands (gcloud)
Google Cloud console

Keys:

0 Create a cluster specifically for one job

1 Match your data location to the compute location
-> better performance
-> also able to shut down the cluster when it is not processing jobs

2 Use Cloud Storage instead of HDFS
-> reduces the complexity of disk provisioning
-> data persists outside the cluster, so you can shut the cluster down when it is not processing a job

3 Use custom machine types to closely match the resources that the job requires

4 On non-critical jobs requiring huge clusters, use preemptible VMs to get results faster and cut costs at the same time
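The keys above can be combined into a one-job cluster lifecycle: create, submit, delete. The sketch below only builds the `gcloud` command lines and runs them with `subprocess`; the cluster name, region, job path, and machine shape are placeholders. It uses `--num-secondary-workers`, whose workers are preemptible by default in current gcloud releases; older releases named this flag `--num-preemptible-workers`.

```python
import subprocess


def custom_machine_type(vcpus, memory_mb):
    """Name of a custom machine type, e.g. custom-6-23040."""
    return f"custom-{vcpus}-{memory_mb}"


def ephemeral_cluster_commands(cluster, region, job_uri, workers=2,
                               preemptible_workers=0, machine_type=None):
    """Return [create, submit, delete] gcloud commands for a one-job cluster."""
    create = ["gcloud", "dataproc", "clusters", "create", cluster,
              f"--region={region}", f"--num-workers={workers}"]
    if machine_type:
        create.append(f"--worker-machine-type={machine_type}")
    if preemptible_workers:
        create.append(f"--num-secondary-workers={preemptible_workers}")
    # The job script lives in Cloud Storage, not on the cluster's HDFS.
    submit = ["gcloud", "dataproc", "jobs", "submit", "pyspark", job_uri,
              f"--cluster={cluster}", f"--region={region}"]
    delete = ["gcloud", "dataproc", "clusters", "delete", cluster,
              f"--region={region}", "--quiet"]
    return [create, submit, delete]


def run_job(commands):
    """Create the cluster, run the job, and always delete the cluster."""
    create, submit, delete = commands
    subprocess.run(create, check=True)
    try:
        subprocess.run(submit, check=True)
    finally:
        subprocess.run(delete, check=True)


if __name__ == "__main__":
    cmds = ephemeral_cluster_commands(
        "one-job-cluster", "us-central1", "gs://my-bucket/jobs/job.py",
        preemptible_workers=4, machine_type=custom_machine_type(6, 23040),
    )
    run_job(cmds)
```

Choose a region that matches the bucket's location so that compute sits next to the data (key 1).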