The importance of ETL
OLTP vs OLAP: frequent DML vs multi-dimensional data reads
Tables in OLTP DBs --ETL--> Data Warehouse --> Data Mart --> Decision making
Dimension table vs fact table: based on the relationship between fact and dimension tables, the common models are the star schema (denormalized/redundant data) and the snowflake schema.
Row-Oriented DB
001:10,Smith,Joe,60000;
002:12,Jones,Mary,80000;
003:11,Johnson,Cathy,94000;
004:22,Jones,Bob,55000;
Column-Oriented DB --> Aggregation/Compression
10:001,12:002,11:003,22:004;
Smith:001,Jones:002,Johnson:003,Jones:004;
Joe:001,Mary:002,Cathy:003,Bob:004;
60000:001,80000:002,94000:003,55000:004;
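A minimal sketch of why the two layouts differ for analytics, using plain Python lists and dicts that mirror the example records above (no actual database is assumed):

```python
# Row-oriented: each record is stored together (good for OLTP point lookups / DML).
rows = [
    ("001", 10, "Smith", "Joe", 60000),
    ("002", 12, "Jones", "Mary", 80000),
    ("003", 11, "Johnson", "Cathy", 94000),
    ("004", 22, "Jones", "Bob", 55000),
]

# Column-oriented: each attribute is stored together (good for OLAP scans/aggregation,
# and each homogeneous array compresses well).
columns = {
    "rowid":  ["001", "002", "003", "004"],
    "empid":  [10, 12, 11, 22],
    "last":   ["Smith", "Jones", "Johnson", "Jones"],
    "first":  ["Joe", "Mary", "Cathy", "Bob"],
    "salary": [60000, 80000, 94000, 55000],
}

# An aggregate such as AVG(salary) touches every field of every row in the
# row layout, but only the single "salary" array in the column layout.
avg_salary_rowwise = sum(r[4] for r in rows) / len(rows)
avg_salary_colwise = sum(columns["salary"]) / len(columns["salary"])
print(avg_salary_rowwise, avg_salary_colwise)  # 72250.0 72250.0
```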
Common column-oriented databases: Amazon Redshift, Vertica, ClickHouse, Apache HBase.
ETL Solutions: ETL Tools vs. Hand Written SQL
ETL Skills for the Data Scientist:
· Don’t do everything in Python/R – when you are doing data manipulation, use a database tool
· Run the extract in the memory of the source database
· Limit the extraction to transactions newer than the start date/time of the last successful run (tracked in a variable)
· Transfer to destination database prior to joins
· Execute joins/transformations in destination database
· Build any aggregations needed to speed performance
· Build appropriate checks into the process. Every ETL will eventually fail. With the last-successful-run variable in place, the ETL should be set up to catch up automatically after a failure. A full rebuild is then just a matter of resetting that variable to the date/time of the first transaction.
· Build the join tables assuming you will re-use them. Aggregations and transformations may change over time, so do those in a separate step.
· Unless you are working with very large data sets, use as many intermediate tables on your destination database as you need. You’ll get better performance results and the extra storage space is no big deal.
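The incremental-extract pattern above can be sketched as follows, using sqlite3 as a stand-in for both the source and destination databases; the table and column names (transactions, staging, etl_state, txn_time) are illustrative:

```python
import sqlite3

# Source database with some transactions.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE transactions (id INTEGER, txn_time TEXT, amount REAL)")
src.executemany("INSERT INTO transactions VALUES (?, ?, ?)", [
    (1, "2024-01-01 09:00", 10.0),
    (2, "2024-01-02 09:00", 20.0),
    (3, "2024-01-03 09:00", 30.0),
])

# Destination database: a staging table plus the last-successful-run watermark.
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE staging (id INTEGER, txn_time TEXT, amount REAL)")
dst.execute("CREATE TABLE etl_state (last_good_run TEXT)")
dst.execute("INSERT INTO etl_state VALUES ('2024-01-01 12:00')")

def run_etl(src, dst):
    # 1. Read the watermark: start of the last successful run.
    (watermark,) = dst.execute("SELECT last_good_run FROM etl_state").fetchone()
    # 2. Extract, in the source database, only transactions newer than the watermark.
    new_rows = src.execute(
        "SELECT id, txn_time, amount FROM transactions WHERE txn_time > ?",
        (watermark,),
    ).fetchall()
    # 3. Load into the destination before any joins/transformations.
    dst.executemany("INSERT INTO staging VALUES (?, ?, ?)", new_rows)
    # 4. Advance the watermark only after the load succeeds, so a failed run
    #    is automatically caught up on the next invocation.
    if new_rows:
        dst.execute("UPDATE etl_state SET last_good_run = ?",
                    (max(r[1] for r in new_rows),))
    dst.commit()
    return len(new_rows)

print(run_etl(src, dst))  # 2
```

Running the function a second time extracts nothing, because the watermark has advanced; a full rebuild is just resetting `etl_state` to the first transaction's timestamp.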
Common Tasks - SQL, Pandas
SQL
• Upper/lower convert
• Char/date/number convert
• Regexp
• Nvl/nvl2/coalesce
• Join
• Group by
• DML
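Several of the SQL tasks above (case conversion, NULL handling, JOIN, GROUP BY) in one runnable sketch via sqlite3; note that NVL/NVL2 are Oracle-style functions, and the portable equivalent used here is COALESCE. Table and column names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE emp (name TEXT, dept_id INTEGER, salary REAL, bonus REAL);
    CREATE TABLE dept (dept_id INTEGER, dept_name TEXT);
    INSERT INTO emp VALUES ('joe', 10, 60000, NULL), ('mary', 10, 80000, 5000),
                           ('cathy', 20, 94000, NULL);
    INSERT INTO dept VALUES (10, 'sales'), (20, 'eng');
""")

rows = con.execute("""
    SELECT UPPER(d.dept_name)                   AS dept,       -- upper/lower convert
           COUNT(*)                             AS headcount,  -- GROUP BY aggregate
           SUM(e.salary + COALESCE(e.bonus, 0)) AS total_comp  -- NULL handling
    FROM emp e
    JOIN dept d ON d.dept_id = e.dept_id                       -- join
    GROUP BY d.dept_name
    ORDER BY dept
""").fetchall()
print(rows)  # [('ENG', 1, 94000.0), ('SALES', 2, 145000.0)]
```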
Pandas
• Slice/iloc vs loc/apply
• Index/column convert
• Data type convert
• Format convert
• isnull/fillna/dropna/replace
• append/join/concat
• sort_values/groupby/pivot_table
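A few of the pandas tasks above in one sketch: iloc vs loc, dtype conversion, fillna, concat, groupby, and sort_values. The data is illustrative:

```python
import pandas as pd

a = pd.DataFrame({"dept": ["sales", "sales"], "salary": ["60000", "80000"]})
b = pd.DataFrame({"dept": ["eng"], "salary": [None]})

df = pd.concat([a, b], ignore_index=True)    # append/concat
df["salary"] = pd.to_numeric(df["salary"])   # data type convert (str -> float)
df["salary"] = df["salary"].fillna(0)        # isnull/fillna
first_row = df.iloc[0]                       # position-based selection
sales = df.loc[df["dept"] == "sales"]        # label/boolean selection

summary = (df.groupby("dept")["salary"]      # groupby aggregation
             .sum()
             .sort_values(ascending=False))  # sort_values
print(summary.to_dict())  # {'sales': 140000.0, 'eng': 0.0}
```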
The six main areas of sklearn: preprocessing (including feature extraction), classification, regression, clustering, dimensionality reduction, and model selection.
Preprocessing: feature extraction and normalization
Common modules: preprocessing, feature_extraction
Typical application: converting input data (such as text) into data usable by machine-learning algorithms.
Classification: identifying which category an object belongs to
Common algorithms: SVM (support vector machines), nearest neighbors, random forest
Typical applications: spam detection, image recognition.
Regression: predicting a continuous-valued attribute associated with an object
Common algorithms: SVR (support vector regression), ridge regression, Lasso
Typical applications: drug response, stock-price prediction.
Clustering: automatic grouping of similar objects
Common algorithms: k-Means, spectral clustering, mean-shift
Typical applications: customer segmentation, grouping experiment outcomes.
Dimensionality reduction: reducing the number of random variables to consider
Common algorithms: PCA (principal component analysis), feature selection, non-negative matrix factorization
Typical applications: visualization, improved efficiency.
Model selection: comparing, validating, and choosing parameters and models
Common modules: grid search, cross-validation, metrics
Goal: improve accuracy via parameter tuning.
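Model selection as described above can be sketched with grid search plus cross-validation; the dataset (iris) and the hyperparameter grid here are just illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7]},  # candidate parameters to compare
    cv=5,                                      # 5-fold cross-validation
    scoring="accuracy",                        # metric used to rank candidates
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```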
Appendix: algorithm selection path
(The scikit-learn algorithm cheat-sheet chart is excellent and well worth studying.)

petl (200+ functions)
>>> from petl import *
>>> table1 = fromcsv('example1.csv')
>>> table2 = convert(table1, 'foo', 'upper')
>>> table3 = convert(table2, 'bar', int)
>>> table4 = convert(table3, 'baz', float)
>>> table5 = addfield(table4, 'quux', expr('{bar} * {baz}'))
No actual transformation work is done until data are requested from table5 (or any of the other row containers returned by the intermediate steps). In effect, a lazy five-step transformation pipeline has been set up.
>>> look(table5)
Homework:
Create a Python script on AWS EC2 to fetch data from Finnhub and load it into AWS RDS:
• You can choose any data;
• You can use any of these libs: requests, finnhub-python, petl, pandas, etc.
Create a shell script to run the above job automatically and recurrently:
• You can use any job-scheduling tool; crontab is preferred.
Implement a simple alert/notification system:
• Notify the ETL job's finished status, either via email or SMS;
• Send an alert during the ETL job if anything needs immediate attention.
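A minimal sketch of the email-notification requirement; the SMTP host, addresses, and job name are placeholders, and sending is kept in a separate function so the message can be built and inspected without a mail server:

```python
import smtplib
from email.message import EmailMessage

def build_notification(job_name: str, status: str, detail: str) -> EmailMessage:
    """Compose the ETL status email."""
    msg = EmailMessage()
    msg["Subject"] = f"[ETL] {job_name} finished: {status}"
    msg["From"] = "etl-bot@example.com"   # placeholder sender
    msg["To"] = "oncall@example.com"      # placeholder recipient
    msg.set_content(detail)
    return msg

def send_notification(msg: EmailMessage, host: str = "localhost") -> None:
    """Send via SMTP; assumes a reachable relay on `host`."""
    with smtplib.SMTP(host) as s:
        s.send_message(msg)

msg = build_notification("finnhub_load", "SUCCESS", "Loaded 1,234 rows into RDS.")
print(msg["Subject"])  # [ETL] finnhub_load finished: SUCCESS
# send_notification(msg)  # uncomment once an SMTP host is configured
```

For scheduling, a crontab entry along the lines of `0 * * * * /usr/bin/python3 /home/ec2-user/etl.py >> etl.log 2>&1` (paths hypothetical) would run the job hourly and capture its output.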
Tasks:
1.7 local IDE; Finnhub registration; API usage and format; get first data
1.8 API data fetch done; connect to RDS from local; DB design and creation
1.9 local done; GitHub pull/push from/to AWS; AWS deploy
1.10 crontab; email alert
1.11 SMS alert
1.12 GitHub Actions
https://www.linkedin.com/pulse/fintech-blog-1-step-by-step-guide-analyzing-stock-data-zachary-blitz/
Once you dive into DS, it is a sea with no bottom; from then on, leisure is a stranger.