A Collection of Data Science Tak

Author: Echo_1cc5 | Published: 2019-01-21 11:17 | Views: 0

1⃣️ How would you improve engagement on Facebook? 【How do you increase X on site Y?】

1. Define the metric

- Based on the company's mission

- Has to be measurable (e.g., the proportion of users who take at least one action (like, post, upload) per day; the percentage of questions that get at least one response within a day; the percentage of responses that get 3 or more upvotes within the first day)

2. Pick the variables

- User characteristics: sex, age, country, number of friends

- Behavior: device, acquisition channel (ads, SEO (search engine optimization), or direct link), session time

3. Pick a model

- Random forest: high accuracy; works well in high dimensions, with categorical variables, and with outliers

- Get model insights from partial dependence and variable importance plots

4. Come up with one good and one bad scenario (realistic segments)

- e.g., users from Argentina are not very engaged; Indian users under 30 are very engaged

5. Define next step

- Check the Spanish translation. Would a more localized version help?

- Reach more young Indian users via ads or other marketing campaigns.
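The metric from step 1 and the segments from step 4 can be sketched in a few lines. This is only an illustration: the user table, the event log, and the function name `daily_engagement_by_segment` are made up for the example, and a real analysis would fit a model such as the random forest mentioned above rather than compare raw rates.

```python
from collections import defaultdict
from datetime import date

# Hypothetical user table: user_id -> country (includes users with no activity).
USERS = {"u1": "AR", "u2": "IN", "u3": "IN", "u4": "AR", "u5": "AR", "u6": "IN"}

# Hypothetical event log: (user_id, day, action).
EVENTS = [
    ("u1", date(2019, 1, 1), "like"),
    ("u2", date(2019, 1, 1), "post"),
    ("u2", date(2019, 1, 1), "upload"),
    ("u3", date(2019, 1, 1), "like"),
]


def daily_engagement_by_segment(users, events, day):
    """Share of users per country who took at least one action on `day`."""
    active = {user_id for user_id, event_day, _ in events if event_day == day}
    totals = defaultdict(int)
    engaged = defaultdict(int)
    for user_id, country in users.items():
        totals[country] += 1
        if user_id in active:
            engaged[country] += 1
    return {country: engaged[country] / totals[country] for country in totals}


rates = daily_engagement_by_segment(USERS, EVENTS, date(2019, 1, 1))
```

In this toy data one of three Argentinian users and two of three Indian users are active, which is the kind of gap that would motivate the next steps listed above.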

2⃣️【Can't split randomly】Outline a testing strategy to see if the new app is better?

***It's difficult to design an A/B test in a marketplace or social network because users are connected.

1. You can't split users randomly, because the treatment spills over to connected users and affects the control group.

2. Test by market instead: pair up comparable markets (ones whose main metrics are expected to be similar), then treat one market in each pair.

3. Choose sample size

- Precision based:

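- Power based: $n = f(a, b)\,\dfrac{2s^2}{\delta^2}$, where $a$ is the significance level, $b$ is the power, $s$ is the standard deviation, and $\delta$ is the smallest difference to detect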

4. Check if the result is significant

[Bonus] Check for a novelty effect (wait a couple of weeks and see whether the improvement persists)
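The power-based sample size in step 3 can be sketched as follows. The notes do not spell out the function of significance level and power, so this example assumes the common normal approximation for a two-sided, two-sample comparison with equal group sizes, where that function is the squared sum of two normal quantiles. The function name and the example inputs are illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(alpha, power, sd, min_diff):
    """Units needed in each group (normal approximation, two-sided test).

    Assumes f(alpha, power) = (z_{1 - alpha/2} + z_{power})^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * sd ** 2 / min_diff ** 2)


# Detect a difference of half a standard deviation at 5% significance, 80% power.
n = sample_size_per_group(alpha=0.05, power=0.8, sd=1.0, min_diff=0.5)
print(n)  # 63
```

Halving the smallest difference to detect quadruples the required sample, which is why the choice of that difference matters more than any other input.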

3⃣️ An A/B test wins with a significant p-value, but you choose not to make the change

1. Human labor cost: 1) engineers 2) PMs 3) customer service 4) opportunity cost

2. Risk of bugs

3. Future maintenance cost

4. Inferential statistics: with a large enough sample, even a tiny difference yields a significant p-value

Check the effect size instead (Cohen's d = (x - x0)/SD, or Pearson's r)

5. The win may be a novelty effect
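Point 4 is easier to see with numbers. The sketch below computes Cohen's d using a pooled standard deviation (one common choice; the notes only say "SD") and then shows, under a normal approximation with equal group sizes, how the same tiny effect goes from clearly non-significant to highly significant as the sample grows. Function names and inputs are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample SD."""
    n1, n2 = len(treatment), len(control)
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


def two_sided_p_value(d, n_per_group):
    """Approximate p-value of a two-sample z-test for effect size d."""
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


d = cohens_d([2, 4, 6], [1, 3, 5])  # means 4 vs 3, pooled SD 2

# The same negligible effect (d = 0.01) at two sample sizes.
p_small_sample = two_sided_p_value(0.01, 100)
p_large_sample = two_sided_p_value(0.01, 1_000_000)
```

With a million users per group the p-value is far below 0.001 even though the effect is one hundredth of a standard deviation, so the p-value alone says little about whether the change is worth shipping.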

4⃣️【Missing Values】Will Uber trips without rider review be better, worse or same?

- The values are not missing at random, so you can't assume they follow the same distribution as the non-missing ones

- Predict the missing ratings with supervised learning, using features such as waiting time, trip duration, cost, driver and rider info, time of day...

- The company should keep running experiments to reduce this issue, and incentivize users to leave a review (coupons etc.)
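One simple way to check the first point is to compare an observed feature between trips with and without a review. The sketch below is a minimal illustration with made-up trips; the field names `wait_min` and `rating` are assumptions for the example, and a gap in the means only suggests, rather than proves, that the missingness is non-random.

```python
from statistics import mean

# Hypothetical trips: rating is None when the rider left no review.
TRIPS = [
    {"wait_min": 3, "rating": 5},
    {"wait_min": 4, "rating": 5},
    {"wait_min": 5, "rating": 4},
    {"wait_min": 12, "rating": None},
    {"wait_min": 15, "rating": None},
    {"wait_min": 6, "rating": None},
]


def mean_by_missingness(trips, feature):
    """Mean of `feature` for rated trips and for unrated trips."""
    rated = [t[feature] for t in trips if t["rating"] is not None]
    unrated = [t[feature] for t in trips if t["rating"] is None]
    return mean(rated), mean(unrated)


rated_wait, unrated_wait = mean_by_missingness(TRIPS, "wait_min")
```

Here unrated trips waited 11 minutes on average against 4 for rated ones, which would hint that the unrated trips were worse and that a model trained on observed features is a better guess than assuming identical distributions.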

5⃣️ Jeans are not selling well: is it a demand or a supply problem?

- Run a campaign about the jeans. A high CTR (click-through rate) means high demand

- Look at the conversion rate. (Remove noise: only consider people who used filters or whose session time is above 5 minutes)

- If it looks like a supply problem, check filter usage: is the price too high?
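The noise filter on the conversion rate can be sketched as below. The session records and field names are invented for the example; the 5-minute threshold is the one from the notes.

```python
# Hypothetical sessions on the jeans category page.
SESSIONS = [
    {"used_filter": True, "minutes": 8, "purchased": True},
    {"used_filter": False, "minutes": 1, "purchased": False},
    {"used_filter": False, "minutes": 2, "purchased": False},
    {"used_filter": True, "minutes": 3, "purchased": False},
    {"used_filter": False, "minutes": 7, "purchased": True},
    {"used_filter": False, "minutes": 6, "purchased": False},
]


def conversion_rate(sessions):
    """Fraction of sessions that ended in a purchase."""
    if not sessions:
        return 0.0
    return sum(s["purchased"] for s in sessions) / len(sessions)


def high_intent(sessions, min_minutes=5):
    """Keep sessions that used a filter or lasted longer than min_minutes."""
    return [s for s in sessions if s["used_filter"] or s["minutes"] > min_minutes]


raw_rate = conversion_rate(SESSIONS)
filtered_rate = conversion_rate(high_intent(SESSIONS))
```

If the filtered rate is still low among shoppers who clearly wanted jeans, that points toward supply (assortment, sizes, price) rather than demand.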

6⃣️ Drawbacks of using supervised learning to predict fraud

- The vast majority of transactions are legitimate, so a model can reach high accuracy simply by classifying everything as legitimate

- Remedies: change the model's internal loss to penalize false negatives more, use an extremely aggressive cut-off point (flag anything scored > 0.1), or reweight the training events. All of these require a massive amount of data with positive cases

- Fraud that was not detected in the past is labeled as legitimate in the training data, which negatively reinforces the model

- There is always a time lag, because people keep coming up with new techniques to cheat

- Combine anomaly detection with supervised ML. (Problem with anomaly detection: in high dimensions it tends to treat every transaction as an outlier, and it needs a massive investment of time)

**False positive: Type I error (declaring an innocent person guilty)

False negative: Type II error (declaring a guilty person innocent)
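The effect of an aggressive cut-off on the two error types can be illustrated with a handful of scored transactions. The scores and labels below are made up, and `confusion_counts` is a hypothetical helper; in practice the scores would come from the trained classifier.

```python
# Hypothetical (predicted fraud probability, is_fraud) pairs.
SCORED = [
    (0.02, False), (0.05, False), (0.08, False), (0.15, False), (0.30, False),
    (0.12, True), (0.40, True), (0.70, True),
]


def confusion_counts(scored, cutoff):
    """Return (false positives, false negatives) when flagging scores > cutoff."""
    false_pos = sum(1 for p, is_fraud in scored if p > cutoff and not is_fraud)
    false_neg = sum(1 for p, is_fraud in scored if p <= cutoff and is_fraud)
    return false_pos, false_neg


default_fp, default_fn = confusion_counts(SCORED, 0.5)
aggressive_fp, aggressive_fn = confusion_counts(SCORED, 0.1)
```

In this toy data, lowering the cut-off from 0.5 to 0.1 removes both missed frauds at the price of two legitimate transactions being flagged, which is the trade-off between Type II and Type I errors described above.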

Article link: https://www.haomeiwen.com/subject/onmqjqtx.html