The Data Scientist in My Eyes


Author: abrocod | Published 2016-02-22 07:17 | Read 64 times

    # The Data Scientist in My Eyes

    ## 1. Computer Science Foundation

    ### Data Structure

    ### Algorithm

    - LeetCode questions (t)

    ### OO Programming

    - Design Patterns (book) (l)

    - Refactoring (book) (l)

    ### Functional Programming

    - Scala (l)

    ### Debugging

    ### Linux / Shell script

    - Common Linux commands (l)

    > ls, cp, (l)

    - Networking knowledge (scp, curl, ftp, http, ...) (l)

    - Awk, Vim (l)

    - Shell script (l)

    ### Makefile / CMake (l)

    ### Parallel Programming (l)

    ### C++

    - C++ Primer (l)

    - Accelerated C++ (l)

    - 50 Principles of C++ (l)

    ### Database Management

    - SQL (make a cheat sheet for it!) (m)

    - Database optimization (listing this on a resume is more concrete than just saying 'database management') (l)
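As a possible seed for the SQL cheat sheet, here is a minimal sketch that runs a few core statements through Python's built-in `sqlite3` module. The `orders` table, its columns, and the index name are hypothetical, made up purely for illustration. The last query only hints at the optimization topic: it asks SQLite for its query plan after an index has been created.

```python
import sqlite3


def run_cheatsheet():
    """Run a few core SQL statements against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # CREATE TABLE and INSERT (hypothetical 'orders' table)
    cur.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    cur.executemany(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)",
        [("alice", 10.0), ("bob", 25.0), ("alice", 5.0)],
    )

    # SELECT ... WHERE ... ORDER BY
    big_orders = cur.execute(
        "SELECT customer, amount FROM orders WHERE amount >= 10 ORDER BY amount DESC"
    ).fetchall()

    # GROUP BY with an aggregate
    totals = cur.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
    ).fetchall()

    # CREATE INDEX, then ask the engine how it would run a lookup
    cur.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
    plan = cur.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'alice'"
    ).fetchall()

    conn.close()
    return big_orders, totals, plan
```

Calling `run_cheatsheet()` returns the filtered rows, the per-customer totals, and the raw query plan rows, which can be printed and compared before and after adding the index.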

    -----

    ## 2. Python Skill

    ### Core Python knowledge

    ### Python Standard Library

    ### Numpy/SciPy/Matplotlib/pandas

    - Pandas DataFrame cheatsheet

    - Introduction to Numpy

    > make a cheatsheet of Numpy ! (t)

    ### Python's Machine Learning Package

    - Theano

    - Scikit-learn

    > Scikit-learn's interface is not well designed for statistical analysis. It is mainly for ML (making predictions, classification).

    > How does Scikit-learn implement regression? Does it use SGD or direct matrix operations? How does it differ from the regression implementation in statsmodels (which I am pretty sure just uses matrix operations)?

    >- Worth spending some time reading its source code

    - statsmodels

    > Understand how it implements the basic OLS and GLM models

    > Learn how a statistical package is designed by reading the source code
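To make the "gradient-based vs. matrix operation" question about regression concrete, here is a minimal pure-Python sketch of both approaches for a one-feature least-squares fit. It is not taken from the source code of Scikit-learn or statsmodels; the function names are hypothetical, and the iterative version uses plain full-batch gradient descent rather than true SGD.

```python
def ols_closed_form(xs, ys):
    """Fit y = intercept + slope * x using the closed-form least-squares solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def ols_gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Fit the same model by full-batch gradient descent on the mean squared error."""
    n = len(xs)
    intercept, slope = 0.0, 0.0
    for _ in range(steps):
        residuals = [intercept + slope * x - y for x, y in zip(xs, ys)]
        grad_intercept = 2.0 / n * sum(residuals)
        grad_slope = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        intercept -= lr * grad_intercept
        slope -= lr * grad_slope
    return intercept, slope
```

On well-conditioned data both functions should agree closely. The closed form gets there in one pass, while the iterative version needs a suitable learning rate and enough steps, which is one reason a statistics package might prefer direct linear algebra for OLS.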

    ### Python's interface to Hadoop (Impyla, Happybase, etc...)

    - impyla (spend some time to study the source code of impyla)

    - happybase

    ----

    ## 3. Hadoop / AWS

    ### Hadoop Software

    - HBase / Impala

    - Buy "Hadoop: The Definitive Guide" (book)

    ### MapReduce
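A minimal single-process sketch of the MapReduce idea, using word count as the example. This only imitates the map, shuffle, and reduce phases in plain Python; it is not Hadoop's API, and all function names here are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle_phase(pairs))
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network; here everything runs in one process just to show the data flow.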

    ### AWS

    - Set up instances / accounts / systems

    - Use AWS as a MapReduce tool for data analysis (EMR)

    - AWS programming

    ----

    ## 4. Machine Learning (except deep learning)

    ### R language

    - Cheat sheet of R's core syntax

    - Use R for data mining (data cleaning, preprocessing, machine learning)

    - Use R for big data? (Interface to Hadoop?)

    ### Common models

    - SVM

    - Logistic Regression

    - Random Forest

    - Boosted trees / boosting methods

    ### Optimization / Numerical Method

    - Gradient Descent / SGD

    - Newton's method

    - How they are applied to solve ML problems, and how to program them
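As one way to see how these optimizers apply to an ML problem, the sketch below fits a one-feature logistic regression twice: once with full-batch gradient descent and once with Newton's method. It is a toy illustration in pure Python under my own assumptions (mean negative log-likelihood as the loss, no regularization, no line search, data that is not perfectly separable); the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_and_hessian(w0, w1, xs, ys):
    """Gradient and Hessian of the mean negative log-likelihood
    for the model p = sigmoid(w0 + w1 * x)."""
    n = len(xs)
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w0 + w1 * x)
        g0 += p - y
        g1 += (p - y) * x
        s = p * (1.0 - p)
        h00 += s
        h01 += s * x
        h11 += s * x * x
    return (g0 / n, g1 / n), (h00 / n, h01 / n, h11 / n)


def fit_gradient_descent(xs, ys, lr=0.5, steps=5000):
    """First-order method: step against the gradient with a fixed learning rate."""
    w0 = w1 = 0.0
    for _ in range(steps):
        (g0, g1), _ = gradient_and_hessian(w0, w1, xs, ys)
        w0 -= lr * g0
        w1 -= lr * g1
    return w0, w1


def fit_newton(xs, ys, steps=25):
    """Second-order method: solve H d = g for the step d (2x2 system by hand)."""
    w0 = w1 = 0.0
    for _ in range(steps):
        (g0, g1), (h00, h01, h11) = gradient_and_hessian(w0, w1, xs, ys)
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        w0 -= d0
        w1 -= d1
    return w0, w1
```

Both routines minimize the same convex loss, so they should land on the same weights. Newton's method typically needs far fewer iterations but pays for forming and solving with the Hessian, which is the usual trade-off when choosing between the two.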

    ### Learn through examples and practice

    - Study Kaggle examples and the Kaggle blog

    + Study an example blog post on email spam and learn how they do such a task

    + Study an example blog post on text classification

    ----

    ## 5. Deep Learning

    ### Area:

    - Image Processing

    - NLP

    - Finance

    ### Model:

    - CNN

    - RNN

    + RNN's theory and implementation (Torch or Caffe)

    + RNN's application (and preprocessing)

    + Watch Oxford's online course (on using Torch + RNN)

    + etc...

    - Unsupervised Learning

    - Reinforcement Learning

    ### Feature Engineering

    ### Tools:

    - Caffe

    > TO DO:

    > Besides images, what else can Caffe do?

    > Considering my future career, I am probably not very interested in image data

    > I am more interested in NLP, finance, and other applications. Is Caffe still a good choice for them?

    > Does Caffe have a good RNN implementation?

    - Torch

    - Python/Theano

    ### Theory of Neural Network

    > Forward/backward computation

    > Loss function selection

    > Optimization method (Alex has a good paper that covers this theory)

    > How loss functions and optimization apply to different domains (NLP, etc.)
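For the forward/backward computation item, a minimal sketch with a single sigmoid neuron and a cross-entropy loss may help. This is my own toy setup, not tied to Caffe, Torch, or Theano, and the function names are hypothetical. The backward pass is checked against a finite-difference estimate, which is a common way to debug hand-written gradients.

```python
import math


def forward(w, b, x, y):
    """Forward pass: pre-activation z, activation a, and cross-entropy loss."""
    z = w * x + b
    a = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))
    return z, a, loss


def backward(w, b, x, y):
    """Backward pass via the chain rule.
    For sigmoid + cross-entropy, dLoss/dz simplifies to (a - y)."""
    _, a, _ = forward(w, b, x, y)
    dz = a - y
    return dz * x, dz  # (dLoss/dw, dLoss/db)


def numerical_gradient(w, b, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    dw = (forward(w + eps, b, x, y)[2] - forward(w - eps, b, x, y)[2]) / (2 * eps)
    db = (forward(w, b + eps, x, y)[2] - forward(w, b - eps, x, y)[2]) / (2 * eps)
    return dw, db
```

A deeper network repeats the same pattern layer by layer: store the intermediate values on the way forward, then multiply local derivatives on the way back.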

    ----

    ## 6. Statistics / Math

    ### A/B test

    - when to use which test

    - Read the book "Introduction to Biostatistics"
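One common A/B testing case is comparing two conversion rates, where a two-proportion z-test is a typical choice. The sketch below is a minimal pure-Python version under the usual large-sample assumptions (independent samples, pooled standard error, two-sided alternative); the function and parameter names are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    conv_a / conv_b: number of conversions in each variant
    n_a / n_b: number of visitors in each variant
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

For example, `two_proportion_z_test(100, 1000, 150, 1000)` compares a 10% rate against a 15% rate. With small samples or very rare conversions the normal approximation becomes unreliable, and an exact test is usually the safer choice.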

    ### Mathematical Statistics (2nd year)

    ### ANOVA

    - How to interpret the results and the underlying concepts
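To help with interpreting ANOVA, here is a minimal sketch of the one-way F statistic computed by hand. The function name is hypothetical, and the sketch stops at the F value; turning it into a p-value needs the F distribution with (k - 1, N - k) degrees of freedom, which a library routine such as `scipy.stats.f_oneway` provides.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: how far group means sit from the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of observations around their own mean
    ss_within = sum(
        sum((v - m) ** 2 for v in g) for g, m in zip(groups, group_means)
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within
```

A large F means the variation between group means is large relative to the variation inside the groups, which is evidence against the hypothesis that all group means are equal. It does not say which groups differ.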

    ### Probability Theory

    ### Stochastic Process (important for finance)

    ----

    ## 7. Domain Knowledge

    ### Finance

    ### Signal Processing

    -----

    ## 8. Data Scientist Job Seeking

    - Keep reading data scientist/engineer job requirements from different industries

    - Keep reading interview questions and feedback on Glassdoor

    > Do they focus more on statistics or on engineering/CS?

    > Do they require a coding test?

    > What kind of statistics questions do they ask?

    > What kind of CS background do they require?

    > Which projects are useful for getting interviews?

    > How to connect data science with finance?

    > How to connect data science with future career and business?

    - Do coding challenges (LeetCode, others, ...)

    ## 9. Career Development

    -------------

    ##### ==========================================

    ## Priority list:

    - t: top priority

    - m: middle priority

    - l: low priority
