美文网首页
跑通MetricFlow官方示例项目:jaffle_shop

跑通MetricFlow官方示例项目:jaffle_shop

作者: 猫留下你走吧 | 来源:发表于2024-07-13 23:24 被阅读0次
    背景

    MetricFlow 目前国内的资料比较少,在学习官方文档过程中给了经典的煎饼店的 GitHub 示例代码。在Windows环境按其 Readme 说明跑通比较困难。因此主要记录自己将项目运行起来的过程。

    环境信息

    数据库:PostgreSQL 15
    Python版本:3.12.4
    官方文档:https://docs.getdbt.com/docs/build/about-metricflow
    官方示例项目:https://github.com/dbt-labs/jaffle-sl-template

    运行流程

    特别说明:参考官方文档的Readme

    项目下载

    方式一:通过Git下载
    在文件系统shift + 右键 打开菜单,打开 PowerShell 或使用Win + X打开CMD
    克隆项目:git clone https://github.com/dbt-labs/jaffle-sl-template.git
    方式二:下载Zip包并解压(推荐)

    GitHub
    安装 Metricflow

    采用虚拟环境安装,打开CMD命令行并切换到的项目目录下:jaffle-sl-template-main
    1)创建虚拟环境:python -m venv .env
    2)切换到虚拟环境下:.env\Scripts\active.sh

    cmd
    前面显示(.env) 即成功。
    3)安装dbt和postgres适配器:pip install dbt-core dbt-postgres
    4)安装metricflow根据官方readme的说明:pip install "dbt-metricflow[<YOUR_DBT_ADAPTER_NAME>]" 我是用PostgreSQL适配器,因此执行执行:pip install "dbt-metricflow[postgres]"
    5)验证dbt版本:dbt --version
    (.env) C:\Users\admin\Downloads\jaffle-sl-template-main>dbt --version
    Core:
      - installed: 1.8.3
      - latest:    1.8.3 - Up to date!
    
    Plugins:
      - postgres: 1.8.2 - Up to date!
    

    6)验证metricflow版本:mf --version

    (.env) C:\Users\admin\Downloads\jaffle-sl-template-main>mf --version
    mf, version 0.7.0
    
    连接测试

    根据dbt_project文件知道profile是:snowflake
    因此在用户目录下,找到:.dbt\profiles.yml文件,添加项目的连接信息。

    snowflake:
      target: dev
      outputs:
        dev:
          type: postgres
          host: 192.168.19.128 # PG数据库地址
          user: postgres # PG数据库用户名
          password: postgres # PG数据库密码
          port: 5432 # PG数据库端口号
          dbname: jaffle_shop # PG数据库名称
          schema: public # PG数据库模式
          threads: 4
          connect_timeout: 10
    

    需提前在PG数据库创建jaffle_shop库:create database jaffle_shop
    运行dbt debug 测试连通性:

    (.env) C:\Users\admin\Downloads\jaffle-sl-template-main>dbt debug
    13:14:59  Running with dbt=1.8.3
    13:14:59  dbt version: 1.8.3
    13:14:59  python version: 3.12.4
    13:14:59  python path: C:\Users\admin\Downloads\jaffle-sl-template-main\.env\Scripts\python.exe
    13:14:59  os info: Windows-10-10.0.19045-SP0
    13:14:59  Using profiles dir at C:\Users\admin\.dbt
    13:14:59  Using profiles.yml file at C:\Users\admin\.dbt\profiles.yml
    13:14:59  Using dbt_project.yml file at C:\Users\admin\Downloads\jaffle-sl-template-main\dbt_project.yml
    13:14:59  adapter type: postgres
    13:14:59  adapter version: 1.8.2
    13:14:59  Configuration:
    13:14:59    profiles.yml file [OK found and valid]
    13:14:59    dbt_project.yml file [OK found and valid]
    13:14:59  Required dependencies:
    13:14:59   - git [OK found]
    
    13:14:59  Connection:
    13:14:59    host: 192.168.19.128
    13:14:59    port: 5432
    13:14:59    user: postgres
    13:14:59    database: jaffle_shop
    13:14:59    schema: public
    13:14:59    connect_timeout: 10
    13:14:59    role: None
    13:14:59    search_path: None
    13:14:59    keepalives_idle: 0
    13:14:59    sslmode: None
    13:14:59    sslcert: None
    13:14:59    sslkey: None
    13:14:59    sslrootcert: None
    13:14:59    application_name: dbt
    13:14:59    retries: 1
    13:14:59  Registered adapter: postgres=1.8.2
    13:14:59    Connection test: [OK connection ok]
    
    13:14:59  All checks passed!
    
    安装dbt依赖

    1)下载项目依赖的2个dbt包:dbt_utils 和 dbt_date:dbt deps
    发现运行到这里就卡住了,日志说让更新到最新版本:

    (.env) C:\Users\admin\Downloads\jaffle-sl-template-main>dbt deps
    13:17:03  Running with dbt=1.8.3
    13:17:04  Installing dbt-labs/dbt_utils
    13:17:05  Installed from version 1.0.0
    13:17:05  Updated version available: 1.2.0
    13:17:05  Installing calogica/dbt_date
    13:17:06  Installed from version 0.8.1
    13:17:06  Updated version available: 0.10.1
    13:17:06
    13:17:06  Updates available for packages: ['dbt-labs/dbt_utils', 'calogica/dbt_date']
    Update your versions in packages.yml, then run dbt deps
    

    2(因此根据提示的版本修改了package.yml 文件:

    packages:
      - package: dbt-labs/dbt_utils
        version: 1.2.0
      - package: calogica/dbt_date
        version: 0.10.1
    

    再次运行:dbt deps

    (.env) C:\Users\admin\Downloads\jaffle-sl-template-main>dbt deps
    13:20:35  Running with dbt=1.8.3
    13:20:36  Updating lock file in file path: C:\Users\admin\Downloads\jaffle-sl-template-main/package-lock.yml
    13:20:36  Installing dbt-labs/dbt_utils
    13:20:37  Installed from version 1.2.0
    13:20:37  Up to date!
    13:20:37  Installing calogica/dbt_date
    13:20:38  Installed from version 0.10.1
    13:20:38  Up to date!
    
    加载数据

    执行:dbt seed

    (.env) C:\Users\admin\Downloads\jaffle-sl-template-main>dbt seed
    13:25:15  Running with dbt=1.8.3
    13:25:15  Registered adapter: postgres=1.8.2
    13:25:15  Unable to do partial parsing because saved manifest not found. Starting full parse.
    13:25:16  [WARNING]: Deprecated functionality
    The `tests` config has been renamed to `data_tests`. Please see
    https://docs.getdbt.com/docs/build/data-tests#new-data_tests-syntax for more
    information.
    13:25:17  Found 10 models, 6 seeds, 18 data tests, 15 sources, 17 metrics, 664 macros, 1 group, 5 semantic models, 3 saved queries
    13:25:17
    13:25:17  Concurrency: 4 threads (target='dev')
    13:25:17
    13:25:17  1 of 6 START seed file public.raw_customers .................................... [RUN]
    13:25:17  2 of 6 START seed file public.raw_items ........................................ [RUN]
    13:25:17  3 of 6 START seed file public.raw_orders ....................................... [RUN]
    13:25:17  4 of 6 START seed file public.raw_products ..................................... [RUN]
    13:25:22  1 of 6 OK loaded seed file public.raw_customers ................................ [INSERT 939 in 4.09s]
    13:25:22  4 of 6 OK loaded seed file public.raw_products ................................. [INSERT 10 in 4.55s]
    13:25:22  5 of 6 START seed file public.raw_stores ....................................... [RUN]
    13:25:22  6 of 6 START seed file public.raw_supplies ..................................... [RUN]
    13:25:28  5 of 6 OK loaded seed file public.raw_stores ................................... [INSERT 5 in 5.26s]
    13:25:28  6 of 6 OK loaded seed file public.raw_supplies ................................. [INSERT 65 in 5.03s]
    13:25:55  2 of 6 OK loaded seed file public.raw_items .................................... [INSERT 95368 in 37.87s]
    13:26:03  3 of 6 OK loaded seed file public.raw_orders ................................... [INSERT 59652 in 45.51s]
    13:26:03
    13:26:03  Finished running 6 seeds in 0 hours 0 minutes and 45.77 seconds (45.77s).
    13:26:03
    13:26:03  Completed successfully
    13:26:03
    13:26:03  Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6
    

    将jaffle-data目录下的csv文件加载到PG数据库中:


    PostgreSQL
    运行你的 dbt 项目并查询指标

    1)运行:dbt build --exclude path:jaffle-data
    执行爆了很多错误,但是同时类似的。节选其中一个:

    13:29:37    Database Error in model stg_supplies (models\staging\stg_supplies.sql)
      relation "dbt_sl_test.raw_supplies" does not exist
      LINE 11:     select * from "jaffle_shop"."dbt_sl_test"."raw_supplies"
    

    从报错的select语句看出来,我们没有dbt_sl_test这个 schema。前面测试连接的时候我们PG数据库的 schema 的是:public
    根据报错定位到模型里面用了 source 。发现 source 的配置写死了 schema,因此需要调整 source 的 schema。
    文件路径:models\staging\_source.yml

    sources:
      - name: ecom
        schema: public # dbt_sl_test
        description: E-commerce data
    

    再次执行:dbt build --exclude path:jaffle-data

    13:40:51  Finished running 6 view models, 4 table models, 18 data tests, 3 saved queries in 0 hours 0 minutes and 2.44 seconds (2.44s).
    13:40:51
    13:40:51  Completed successfully
    13:40:51
    13:40:51  Done. PASS=31 WARN=0 ERROR=0 SKIP=0 TOTAL=31
    

    2)运行:mf validate-configs
    不出意外,报错:

    • ERROR: with metric `revenue`  - Unable to query metric `revenue`.
    Received following error from data warehouse:
    Database Error
      type "datetime" does not exist
      LINE 15:       DATE_TRUNC('day', cast(ordered_at as DATETIME)) AS met...
    

    看样子是PG数据库不支持使用cast将 timestamp 强转转为 datetime 类型。测试了一下,可以不需要强转。
    查找代码,发现是 models\marts\customer360\order_items.yml 的语义模型的维度字段 ordered_at 定义的。修改expr

    semantic_models:
      - name: order_item
          ... ...
        dimensions:
          - name: ordered_at
            expr: ordered_at  # cast(ordered_at as DATETIME)
            type: time
    

    保存后重新构建:dbt build --exclude path:jaffle-data
    再次执行:mf validate-configs

    (.env) C:\Users\admin\Downloads\jaffle-sl-template-main>mf validate-configs
    (To see warnings and future-errors, run again with flag `--show-all`)
    v 🎉 Successfully parsed manifest from dbt project
    v 🎉 Successfully validated the semantics of built manifest (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
    v 🎉 Successfully validated semantic models against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
    v 🎉 Successfully validated dimensions against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
    v 🎉 Successfully validated entities against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
    v 🎉 Successfully validated measures against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
    v 🎉 Successfully validated metrics against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
    

    成功!根据官方文档查询指标large_orders:mf query --metrics large_orders
    不出意外,又报错了:

    (.env) C:\Users\admin\Downloads\jaffle-sl-template-main>mf query --metrics large_orders
    | Initiating query…
    ERROR: Got errors while resolving the query.
    
    Error #1:
      Message:
    
        The given input does not exactly match any known metrics.
    
        Suggestions:
          ['large_order', 'orders', 'food_orders', 'order_cost', 'customers_with_orders', 'order_total']
    
      Query Input:
    
        large_orders
    
      Issue Location:
    
        [Resolve Query(['large_orders'])]
    Log file: C:\Users\admin\Downloads\jaffle-sl-template-main\logs\metricflow.log
    

    看了一下,命令的字面意思是查询指标:large_orders,但是从报错的来看,压根就没有large_orders,只有large_order
    文档错误???改命令执行:mf query --metrics large_order

    (.env) C:\Users\admin\Downloads\jaffle-sl-template-main>mf query --metrics large_order
    v Success 🦄 - query completed after 0.30 seconds
      large_order
    -------------
            10504
    

    成功!!!
    初学dbt,如有不对请指正

    相关文章

      网友评论

          本文标题:跑通MetricFlow官方示例项目:jaffle_shop

          本文链接:https://www.haomeiwen.com/subject/ucuqhjtx.html