Public Dataset Resources on Google Cloud

Author: SeanCheney | Published: 2019-03-07 19:52

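Google Cloud hosts a number of public datasets that can be queried with BigQuery, with a free quota of 1 TB per month. Some Kaggle competitions also draw their data from these public datasets.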
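To get a feel for the 1 TB monthly allowance, the arithmetic can be sketched as below. This is only an illustration: `free_quota_fraction` is a hypothetical helper, and it assumes the allowance is counted in bytes scanned by queries and that 1 TB means 2**40 bytes.

```python
# Assumption: the free allowance is measured in bytes scanned by queries,
# and 1 TB is treated here as 2**40 bytes.
FREE_BYTES_PER_MONTH = 2 ** 40


def free_quota_fraction(bytes_processed):
    """Return the share of the monthly free allowance that a query would use."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must be non-negative")
    return bytes_processed / FREE_BYTES_PER_MONTH


# A query that scans 2**39 bytes would use half of the allowance.
print(free_quota_fraction(2 ** 39))
```

With the Python client used later in this post, the number of bytes a finished query scanned should be available as `query_job.total_bytes_processed`, which can be passed to this helper.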

The available public datasets are listed here:
https://www.reddit.com/r/bigquery/wiki/datasets

Follow the instructions at the link below to create a key file (the file that "GOOGLE_APPLICATION_CREDENTIALS" will point to):
https://cloud.google.com/docs/authentication/getting-started

This produces a JSON file like the following (the private key is redacted here; never publish a real one):

{
  "type": "service_account",
  "project_id": "t-skyline-231233803",
  "private_key_id": "10f9ce0f8f0c29ec254b54234452345234623b4b9737595084296",
  "private_key": "-----BEGIN PRIVATE KEY-----\n(redacted)\n-----END PRIVATE KEY-----\n",
  "client_email": "eth-token@t-skyline-233803.iam.gserviceaccount.com",
  "client_id": "107722234248508205191129",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/eth-token%40t-skyline-233803.iam.gserviceaccount.com"
}

"GOOGLE_APPLICATION_CREDENTIALS" can be exported as an environment variable in the shell, or set from Python through os.environ.
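The os route can be sketched roughly as follows. `use_service_account_key` is a hypothetical helper, not part of any Google library, and the field list it checks is simply an assumption based on the sample key file shown above. It uses only the standard library, so it can run before the BigQuery client is installed.

```python
import json
import os

# Assumption: these fields, taken from the sample key file above, are the
# ones worth checking before using the file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def use_service_account_key(path):
    """Sanity-check a key file and point GOOGLE_APPLICATION_CREDENTIALS at it.

    Returns the project_id stored in the file.
    """
    with open(path, encoding="utf-8") as f:
        key = json.load(f)
    missing = [name for name in REQUIRED_FIELDS if name not in key]
    if missing:
        raise ValueError(f"key file is missing fields: {missing}")
    if key["type"] != "service_account":
        raise ValueError(f"unexpected key type: {key['type']!r}")
    # The variable must be set before the client object is created.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.abspath(path)
    return key["project_id"]
```

The shell equivalent is a single line such as export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json, where the path is a placeholder for wherever the key file was saved.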

Install the BigQuery client with pip install google-cloud-bigquery. After that, data can be retrieved by writing SQL:

from google.cloud import bigquery
import pandas as pd
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/Users/seancheney/cm_project/google-cloud/My Project 33905-a3d655e962d9.json"

client = bigquery.Client()

# Perform a query.
QUERY = '''
SELECT 
  SUM(value/POWER(10,18)) AS sum_tx_ether,
  AVG(gas_price*(receipt_gas_used/POWER(10,18))) AS avg_tx_gas_cost,
  DATE(timestamp) AS tx_date
FROM
  `bigquery-public-data.crypto_ethereum.transactions` AS transactions,
  `bigquery-public-data.crypto_ethereum.blocks` AS blocks
WHERE TRUE
  AND transactions.block_number = blocks.number
  AND receipt_status = 1
  AND value > 0
GROUP BY tx_date
HAVING tx_date >= '2018-01-01' AND tx_date <= '2018-12-31'
ORDER BY tx_date
    '''
query_job = client.query(QUERY)  # API request
iterator = query_job.result()  # Waits for query to finish
rows = list(iterator)
# Build the DataFrame from row dicts so that an empty result does not raise an IndexError.
df = pd.DataFrame([dict(row.items()) for row in rows])
print(df)
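The two divisions by POWER(10,18) in the query are unit conversions. Assuming, as is standard for Ethereum, that value and gas_price are stored in wei (1 ether = 10**18 wei), the same arithmetic can be sketched in plain Python. The function names below are hypothetical and exist only for illustration.

```python
from decimal import Decimal

# Same divisor as POWER(10,18) in the query: 1 ether = 10**18 wei.
WEI_PER_ETHER = Decimal(10) ** 18


def wei_to_ether(wei):
    """Mirror value/POWER(10,18): convert an amount in wei to ether."""
    return Decimal(wei) / WEI_PER_ETHER


def tx_gas_cost_ether(gas_price_wei, gas_used):
    """Mirror gas_price*(receipt_gas_used/POWER(10,18)): fee paid, in ether."""
    return Decimal(gas_price_wei) * Decimal(gas_used) / WEI_PER_ETHER


# Example: a transfer of 10**18 wei that uses 21000 gas
# at a gas price of 20 * 10**9 wei.
print(wei_to_ether(10 ** 18))
print(tx_gas_cost_ether(20 * 10 ** 9, 21000))
```

Decimal is used here instead of float so that the very large wei integers are not rounded during the conversion.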
