
Lesson 5 ETL

作者: 可苟可远 | Published 2021-01-08 00:09

    The importance of ETL

    OLTP vs OLAP: frequent DML vs multi-dimensional data reads

    Tables in OLTP DBs --ETL--> Data Warehouse --> Data Mart --> Decision making

    dimension table vs fact table

    Based on the relationship between fact tables and dimension tables, the common models fall into the star schema (denormalized/redundant data) and the snowflake schema.
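
    To make the star schema concrete, here is a minimal sketch using SQLite from Python; the table and column names (dim_date, dim_product, fact_sales) are illustrative assumptions rather than anything prescribed by the lesson.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Dimension tables: descriptive attributes, deliberately denormalized in a star schema.
    cur.execute("""
        CREATE TABLE dim_date (
            date_id   INTEGER PRIMARY KEY,
            full_date TEXT,
            year      INTEGER,
            month     INTEGER
        )""")
    cur.execute("""
        CREATE TABLE dim_product (
            product_id   INTEGER PRIMARY KEY,
            product_name TEXT,
            category     TEXT
        )""")

    # Fact table: numeric measures plus foreign keys pointing at each dimension.
    cur.execute("""
        CREATE TABLE fact_sales (
            sale_id    INTEGER PRIMARY KEY,
            date_id    INTEGER REFERENCES dim_date(date_id),
            product_id INTEGER REFERENCES dim_product(product_id),
            quantity   INTEGER,
            amount     REAL
        )""")
    conn.commit()

    In a snowflake schema the dimensions would themselves be normalized into further tables (for example, dim_product split into separate product and category tables).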

    Row Oriented DB

    001:10,Smith,Joe,60000;

    002:12,Jones,Mary,80000;

    003:11,Johnson,Cathy,94000;

    004:22,Jones,Bob,55000;

    Column Oriented DB --> Aggregation/Compression

    10:001,12:002,11:003,22:004;

    Smith:001,Jones:002,Johnson:003,Jones:004;

    Joe:001,Mary:002,Cathy:003,Bob:004;

    60000:001,80000:002,94000:003,55000:004;
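
    A small Python sketch of how the same four employee records serialize under each layout; the helper functions below are purely illustrative and not part of any database engine.

    # Each record: (RowId, EmpId, Lastname, Firstname, Salary)
    rows = [
        ("001", 10, "Smith",   "Joe",   60000),
        ("002", 12, "Jones",   "Mary",  80000),
        ("003", 11, "Johnson", "Cathy", 94000),
        ("004", 22, "Jones",   "Bob",   55000),
    ]

    def row_oriented(records):
        # Row store: all fields of one record are kept together.
        return [f"{rid}:{emp},{last},{first},{sal};"
                for rid, emp, last, first, sal in records]

    def column_oriented(records):
        # Column store: all values of one field are kept together, keyed by RowId,
        # which is what makes per-column aggregation and compression cheap.
        return [",".join(f"{rec[i]}:{rec[0]}" for rec in records) + ";"
                for i in range(1, 5)]

    print(row_oriented(rows))     # matches the row-oriented serialization above
    print(column_oriented(rows))  # matches the column-oriented serialization above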

    Common column-oriented databases:

    Redshift BigQuery Snowflake

    ETL Solutions: ETL Tools vs. Hand-Written SQL

    ETL Skills for the Data Scientist:

    ·      Don’t do everything in Python/R – if you are doing data manipulation, use a database tool

    ·      Run the extract in the memory of the source database

    ·      Limit the transaction extraction with a variable set to include only rows greater than the start date/time of the last successful run (see the sketch after this list)

    ·      Transfer to destination database prior to joins

    ·      Execute joins/transformations in destination database

    ·      Build any aggregations needed to speed performance

    ·      Build appropriate checks into the process. Every ETL will eventually fail. By setting up the last-successful-run variable, the ETL should be set up so that it automatically catches up. A full re-build may be necessary, and that is just setting the date/time of the start of the last good run back to the date/time of the first transaction.

    ·      Build the join tables assuming you will re-use them. Aggregations and transformations may vary over time, so do those in a separate step.

    ·      Unless you are working with very large data sets, use as many intermediate tables on your destination database as you need. You’ll get better performance results and the extra storage space is no big deal.
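
    Below is a minimal sketch of the "last successful run" pattern referred to in the list above, assuming a source table named transactions with an updated_at column and a local file that stores the watermark; all of these names are illustrative assumptions.

    import sqlite3
    from datetime import datetime, timezone
    from pathlib import Path

    WATERMARK_FILE = Path("last_successful_run.txt")  # illustrative location

    def read_watermark() -> str:
        # Default to the epoch so a missing watermark triggers a full extract.
        if WATERMARK_FILE.exists():
            return WATERMARK_FILE.read_text().strip()
        return "1970-01-01T00:00:00+00:00"

    def extract_incremental(source_conn: sqlite3.Connection):
        # Filter in the source database: only rows newer than the last successful run.
        return source_conn.execute(
            "SELECT * FROM transactions WHERE updated_at > ?",
            (read_watermark(),),
        ).fetchall()

    def mark_success() -> None:
        # Advance the watermark only after the load has fully succeeded, so a
        # failed run is automatically caught up on the next execution.
        WATERMARK_FILE.write_text(datetime.now(timezone.utc).isoformat())

    A full rebuild is then just resetting the watermark to the date/time of the first transaction.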

    Common Tasks - SQL, Pandas
    SQL

    • Upper/lower convert

    • Char/date/number convert

    • Regexp

    • Nvl/nvl2/coalesce

    • Join

    • Group by

    • DML
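
    A sketch of several of the SQL tasks above, run from Python against an in-memory SQLite database; the employees/departments tables are invented for illustration, and COALESCE stands in for Oracle-style NVL/NVL2 since SQLite lacks those functions.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
        CREATE TABLE employees (emp_id INTEGER, name TEXT, salary TEXT, dept_id INTEGER);
        INSERT INTO departments VALUES (10, 'Sales'), (11, 'Engineering');
        INSERT INTO employees VALUES (1, 'smith', '60000', 10), (2, 'jones', NULL, 11);
    """)

    rows = conn.execute("""
        SELECT UPPER(e.name)             AS name_upper,    -- upper/lower convert
               CAST(e.salary AS INTEGER) AS salary_num,    -- char/number convert
               COALESCE(e.salary, '0')   AS salary_filled, -- NVL-style null handling
               d.dept_name
        FROM employees e
        JOIN departments d ON d.dept_id = e.dept_id        -- join
    """).fetchall()

    headcount = conn.execute("""
        SELECT d.dept_name, COUNT(*) AS n                  -- group by
        FROM employees e
        JOIN departments d ON d.dept_id = e.dept_id
        GROUP BY d.dept_name
    """).fetchall()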


    Pandas

    • Slice/iloc vs loc/apply

    • Index/column convert

    • Data type convert

    • Format convert

    • isnull/fillna/dropna/replace

    • append/join/concat

    • sort_values/groupby/pivot_table
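
    A compact pandas sketch covering most of the operations above; the small DataFrame is invented for illustration.

    import pandas as pd

    df = pd.DataFrame({
        "name":   ["Smith", "Jones", "Johnson", None],
        "dept":   ["Sales", "Eng", "Eng", "Sales"],
        "salary": ["60000", "80000", None, "55000"],
    })

    # Slicing: iloc is positional, loc is label/boolean based.
    first_two = df.iloc[:2]
    eng_only  = df.loc[df["dept"] == "Eng"]

    # Data type / format conversion.
    df["salary"] = pd.to_numeric(df["salary"])

    # Missing data handling.
    df["name"] = df["name"].fillna("unknown")
    df = df.dropna(subset=["salary"])

    # Combining, sorting, and reshaping.
    more     = pd.DataFrame({"name": ["Bob"], "dept": ["Sales"], "salary": [70000]})
    combined = pd.concat([df, more], ignore_index=True)
    ranked   = combined.sort_values("salary", ascending=False)
    summary  = combined.pivot_table(index="dept", values="salary", aggfunc="mean")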

    The six main areas of sklearn: classification, regression, clustering, dimensionality reduction, data preprocessing, and feature extraction.

    The commonly used sklearn modules are preprocessing, classification, regression, clustering, dimensionality reduction, and model selection.

    Preprocessing: feature extraction and normalization

    Commonly used modules: preprocessing, feature_extraction

    Typical applications: converting input data (such as text) into a form usable by machine learning algorithms.

    Classification: identifying which category an object belongs to

    Common algorithms: SVM (support vector machine), nearest neighbors, random forest

    Typical applications: spam detection, image recognition.

    Regression: predicting a continuous-valued attribute associated with an object

    Common algorithms: SVR (support vector regression), ridge regression, Lasso

    Typical applications: drug response, stock price prediction.

    Clustering: automatically grouping similar objects

    Common algorithms: k-Means, spectral clustering, mean-shift

    Typical applications: customer segmentation, grouping experimental outcomes.

    Dimensionality Reduction: reducing the number of random variables to consider

    Common algorithms: PCA (principal component analysis), feature selection, non-negative matrix factorization

    Typical applications: visualization, improved efficiency.

    Model Selection: comparing, validating, and choosing parameters and models

    Commonly used modules: grid search, cross validation, metrics

    The goal is to improve accuracy through parameter tuning.
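
    A minimal sketch that strings the preprocessing, classification, and model selection pieces above into one scikit-learn pipeline; the dataset and parameter grid are illustrative choices.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Preprocessing (scaling) + classification (SVM) chained in one pipeline.
    pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])

    # Model selection: grid search with cross-validation over the SVM's C parameter.
    search = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=5)
    search.fit(X_train, y_train)

    print(search.best_params_, search.score(X_test, y_test))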


    Appendix: algorithm selection path (figure)

    petl (200+ functions...)
    >>> from petl import *

    >>> table1=fromcsv('example1.csv')

    >>> table2=convert(table1,'foo','upper')

    >>> table3=convert(table2,'bar',int)

    >>> table4=convert(table3,'baz',float)

    >>> table5=addfield(table4,'quux',expr('{bar} * {baz}'))

    ...no actual transformation work will be done until data are requested from table5 or any of the other row containers returned by the intermediate steps. So, in effect, a 5-step transformation pipeline has been set up.

    >>> look(table5)
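
    Writing the result out is another way to force the lazy pipeline to run: petl's tocsv pulls every row of table5 through all five steps (the output filename here is just an illustrative choice).

    >>> tocsv(table5, 'example1_transformed.csv')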

    Homework:

    Create a Python script on AWS EC2 to fetch data from Finnhub, and load it into AWS RDS:

    • You can choose any data;

    • You can use any of these libs: requests, finnhub-python, petl, pandas, etc.

    • Create a shell script to run the above job automatically and on a recurring schedule:

    • You can use any job scheduling tool; crontab is preferred

    • Implement a simple alert/notification system:

    • Notify on the ETL job's completion status, either via email or SMS

    • Send an alert during the ETL job if anything needs immediate attention
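
    A minimal sketch of the fetch-and-load job, assuming Finnhub's quote endpoint, an API token in a FINNHUB_TOKEN environment variable, and an RDS connection string in RDS_URL; these names, and the quotes table, are assumptions made for illustration, not requirements of the assignment.

    import os

    import pandas as pd
    import requests
    from sqlalchemy import create_engine

    FINNHUB_TOKEN = os.environ["FINNHUB_TOKEN"]  # assumed environment variable
    RDS_URL = os.environ["RDS_URL"]              # e.g. postgresql://user:pass@host:5432/db

    def fetch_quote(symbol: str) -> dict:
        # Finnhub's quote endpoint returns current price fields for one symbol.
        resp = requests.get(
            "https://finnhub.io/api/v1/quote",
            params={"symbol": symbol, "token": FINNHUB_TOKEN},
            timeout=10,
        )
        resp.raise_for_status()
        data = resp.json()
        data["symbol"] = symbol
        return data

    def load_to_rds(records: list) -> None:
        # Append the fetched quotes into an RDS table via SQLAlchemy/pandas.
        engine = create_engine(RDS_URL)
        pd.DataFrame(records).to_sql("quotes", engine, if_exists="append", index=False)

    if __name__ == "__main__":
        load_to_rds([fetch_quote(s) for s in ["AAPL", "MSFT"]])

    A crontab entry on the EC2 instance can then invoke this script on a schedule, with an email or SMS notification step appended once the load finishes.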

    Implement CI/CD with GitHub Actions

    Tasks: 

    1.7 local IDE; Finnhub registration; API usage and format; get first data

    1.8 API data fetching done; connect to RDS from local machine; DB design and creation;

    1.9 local done; GitHub pull and push from/to AWS; AWS deploy;

    1.10 crontab; email alert; 

    1.11 sms alert;  

    1.12 GitHub Actions

    https://www.linkedin.com/pulse/fintech-blog-1-step-by-step-guide-analyzing-stock-data-zachary-blitz/

    Setting Up VSCode For Python Programming

    Once you dive into data science, it is deep as the sea; from then on, leisure is a stranger.
