美文网首页我爱编程
Introduction to Data Science (st

Introduction to Data Science (st

作者: 黄瓜CBS | 来源:发表于2017-11-14 18:34 被阅读24次

    Introduction to Data Science

    Before we start

    Libraries that we are using

    There's bunch of library that data scientist will use during the research, here are some important libraries:

    • NumPy: provides multidimensional array with basic operation, linear algebra function
    • SciPy: provides numerical algorithms and domain-specific toolbox. Signal processing, optimization, statistics...
    • Matplotlib: data visualization
    • SCIKIT-Learn: machine learning library, classification, regression, clustering, dimensiality reduction, model selection, and preprocessing.
    • PANDAS: data analysis library. Importing CSV, text file, xls, SQL databases, and some other files. It can handling missing data and intelligent data alignment.

    Python version and import toolboxes

    This book use Anaconda distribution of python. It has integrated some of the most imoportant package that we will use. Detail about installing will not be covered in this note. And we are using Jupyter Notebook Environment. All the code was on python 2.7. The reason why we don't use python 3.6 is the poor 3rd party module support. Everything runs fine with 2.7.

    Begin with python

    Remember to import the toolboxes that we will use in the program

    import pandas as pd
    import numpy as np
    import matplotlib.pylot as plt
    #so that we can use short cut pd, np, plt
    

    To run the code in just one cell, use Cell -> Run or just press Ctrl+Enter
    To add a cell below, use Insert -> Insert Cell Below or just press Ctrl+B

    Data input and output

    The data structure of Pandas called DataFrame object. It's basiclly a tabular data structure, that's rows and columns.
    Colums are called series, consist of a list of serval values.
    e.g.

    data = { 'year': [
    2010, 2011, 2012,
    2010, 2011, 2012,
    2010, 2011, 2012
    ],
    'team': [
    'FCBarcelona', 'FCBarcelona',
    'FCBarcelona', 'RMadrid',
    'RMadrid', 'RMadrid',
    'ValenciaCF', 'ValenciaCF',
    'ValenciaCF'
    ],
    'wins':     [30, 28, 32, 29, 32, 26, 21, 17, 19],
    'draws':    [6, 7, 4, 5, 4, 7, 8, 10, 8],
    'losses':   [2, 3, 2, 4, 2, 5, 9, 11, 11]
    }
    
    football = pd.DataFrame(data, columns = [
    'year', 'team', 'wins', 'draws', 'losses'
    ]
    )
    
    index year team wins draws losses
    0 2010 FCBarcelona 30 6 2
    1 2011 FCBarcelona 28 7 3
    2 2012 FCBarcelona 32 4 2
    3 2010 RMadrid 29 5 4
    4 2011 RMadrid 32 4 2
    5 2012 RMadrid 26 7 5
    6 2010 ValenciaCF 21 8 9
    7 2011 ValenciaCF 17 10 11
    8 2012 ValenciaCF 19 8 11

    this is what the result looks like

    index year team wins draws losses
    0 2010 FCBarcelona 30 6 2
    1 2011 FCBarcelona 28 7 3
    2 2012 FCBarcelona 32 4 2
    3 2010 RMadrid 29 5 4
    4 2011 RMadrid 32 4 2
    5 2012 RMadrid 26 7 5
    6 2010 ValenciaCF 21 8 9
    7 2011 ValenciaCF 17 10 11
    8 2012 ValenciaCF 19 8 11

    We can not only use dictionary to input data, but also a CSV file.
    e.g.

    edu = pd.read_csv('files/Data.csv',
    na_values = ':',
    usecols = ["A","B","C"])
    edu
    

    na_values set the character that represents NaN
    usecols set the name of the columns.
    You can also use Pandas to read some other files, by using read_excel(), read_hdf(), read_table(), read_clipboard()
    by calling edu.head() edu.tail() we can see the first 5 rows and the last 5 rows of the Data in edu.
    If we want to know the name of the column or the index or the value, call attribute columns, index, value.
    If you want a quick view how this statistic is like. call edu.describe()
    It give us the count, mean, standard diviation, min, max, 25% 50% 75% quantile of each column or series

    Selecting Data

    Select or filter some of the data normally use square bracket [ ].

    # select just the column name "Value"
    edu[`Value`]
    
    #select some rows
    edu[10: 14]
    
    # select some rows of some suluns
    edu.i[90: 94, ['TIME', 'GEO']]
    

    We can also fiter out some of the data that we don't want. Criteria for filtering can use all the Boolean operator.

    # we want the Value larger than 6.5, and just last 5 row of the data
    edu[edu['Value'] > 6.5].tail()
    

    to represent NaN using isnull()

    edu[edu["Value"].isnull]
    

    Manipulating Data

    Aggregation function

    We can call the aggregation functions to get some information about the data that we want to know about.
    Here're some commenly used functions:count(), sum(), mean(), median(), max(), min(), std(), var(), prod() -> product of values
    They have a common parameter axis.
    axis = 0 means function apply to the rows for each column.
    axis = 1 means function apply to the columns for each row. (Normally this don't make sence)
    Note that the max function of Pandas and of Python are different.
    In Pandas, NaN are exclude.

    max function of python actually regard NaN as maximum

    print "Pandas max function:" edu['Value'].max()  #return 8.81
    print "Python max function:" max(edu['Value'])  #return NaN
    

    we can also apply functions in NumPy by append. apply(np.functionname)
    e.g.

    s = edu["Value"].apply(np.sqrt)
    s.head()
    

    we can also apply a lambda function

    s = edu["Value"].apply(lambda d: d**2)
    

    Lambda function is a easy and better way to define a simple function that with no name and will only be use one time.

    def f(x):
    return x**2
    print f(8) # result: 64
    
    #is equal to
    g = lambda x: x**2
    print g(8) # result: 64
    

    Insert and remove

    We can simply use assign operator over a DataFrame.

    edu['ValueNorm'] = edu['Value']/edu['Value'].max()
    

    Some of the data should be removed. Parameter axis shows whether we are removing the row or the column. Parameter inplace shows whether we are changing the duplicate or the original. If you don't want the old data , set True

    edu.drop('ValueNorm', axis = 1, inplace = True)
    

    Inssert new row at bottom using append.
    remember to set ignore_index to True, otherwise it will be set to index 0, which will lead to an error if index 0 already exist.

    edu = edu.append({"TIME": 2000, "Value": 5.00, "GEO": 'a'}, ignore_index = True)
    

    deleting the last row would be easy if you write like this.

    edu.drop(max(edu.index), axis = 0, inplace = True)
    

    To delete NaN we can use isnull() function and set axis to 0 to delate the whole row.

    eduDrop = edu.drop(edu['Value'].isnull(), axis = 0)
    

    or we can use specific dropna() function

    eduDrop = edu.dropna(how = 'any', subset = ['Value'])
    

    If we want to set a default value to all the NaN we can use fillna()

    eduFilled = edu.fillna(value = {"Value": 0})
    

    Sorting

    Data that sorted by column are commenly seen. We can use sort function.

    edu.sort_values(by = 'Value', ascending = False,
    inplace = True)
    

    note that if the inplace was True, there will be no new DataFrame return.
    Sorted data have their oringinal index, so we can just use sort_index to return the original order

    edu.sort_index(axis = 0, ascending = True, inplace = True)
    

    Grouping Data

    We want to group data that meet some commen criteria. Pandas has the groupby function to do this.

    # shows the mean of the value of each country
    group = edu[["GEO", "Value"]].groupby("GEO").mean()
    
    - Value
    GEO
    Austria 5.618333
    Belgium 6.189091
    Bulgaria 4.093333
    Cyprus 7.023333
    Czech Republic 4.16833

    result:

    - Value
    GEO
    Austria 5.618333
    Belgium 6.189091
    Bulgaria 4.093333
    Cyprus 7.023333
    Czech Republic 4.16833

    Rearranging Data

    Sometimes we need to make the index meaningful by rearranging data. The result is normally a new spreadsheet containing the data that we want and well arranged.

    filtered_data = edu[edu["TIME"] > 2005]
    pivedu = pd.pivot_table(filtered_data, values = 'Value',
    index = ['GEO'],
    columns = ['Timer'])
    pivedu.head()
    
    TIME 2006 2007 2008 2009 2010 2011
    GEO
    Austria ... ... ... ... .. ..
    Belgium .... ... ... ... ... ...
    Bulgaria ... ... .. ... ... ..
    Cyprus ... .. .. .. ... ..
    Czech Republic ... .... ... .... .. ...

    result:

    TIME 2006 2007 2008 2009 2010 2011
    GEO
    Austria ... ... ... ... .. ..
    Belgium .... ... ... ... ... ...
    Bulgaria ... ... .. ... ... ..
    Cyprus ... .. .. .. ... ..
    Czech Republic ... .... ... .... .. ...

    and we can directly select specific row and column by lable of the new spreadsheet that we just made.

    pivedu.ix[['Spain','Portugal'], [2006, 2011]]
    

    Ranking

    Ranking number is really useful. Forexample, we want to know the rank of each country by year.
    in the example, we first rename the Germany that with a long long description. Then we drop all the ``NaN`. Finally we do rank.

    pivedu = pivedu.rename(index = {'Germany(until 1990 former territory of the FRG)': Germany})
    pivedu = pivedu.dropna()
    pivedu.rank(ascending = False, method = 'first').head()
    

    if we want to see the rank from top to bottom, we can first rank then sort.

    totalSum = pivedu.sum(axis = 1)
    totalSum.rank(ascending = False, method = 'dense')
    .sort_values().head()
    
    GEO -
    Denmark 1
    Cyprus 2
    Finland 3
    Malta 4
    Belgium 5

    result:

    GEO -
    Denmark 1
    Cyprus 2
    Finland 3
    Malta 4
    Belgium 5

    Plotting

    By using plot function, we can plot the graphic.

    totalSum = pivedu.sum(axis = 1)
    .sort_values(ascending = False)
    totalSum.plot(kind = 'bar', style = 'b', alpha = 0.4, title = "Total Values for Country")
    

    kind define which kind of graphic are we using.
    Style refer to the style properties, in bar gragh means color of the bar.
    alpha refer to percentage transparency of the color
    if we want to add a legend, a description of the graph, use legend function.

    myplot.legend(loc = 'center left', # set reletive location of the legend with respect to the plot
    bbox_to_anchor = (1, .5)) # set a absolute position with respect to the plot
    

    Basic of statistics in Python

    First we introduce two concept. Population and sample. Population is a collection ob object, a sample is a part of the population that is observed.
    Before the statistics we should prepare the data that suit our need.

    Data Prepareation

    Data prepareation include following operation:

    1. Obtaining the data: e.g. read a file, scraping from internet
    2. Parsing the data: to unwrap the datapackage
    3. Cleaning the data: deal with incomplete records
    4. Building data structure: e.g. a database
    file = open('files/ch03/adult.data','r')
    data = [] # set a empty list
    for line in file:
    data1 = line.split(', ')
    if len(data1) == 15:
    data.append([.., .., .., .., .., .., .., .., ..., ..., .., .., .., .., ..])
    
    df = pd.DataFrame(data)
    df.columns = ['..', ... ,'..']
    # give them a data lable in column property
    

    after fit it in DataFrame, we can command shape to know the exact number of the data samples, in rows and column.

    df.shape
    # return (32561, 15)
    

    we can also group the dataFrame like we did earlier.

    ml50 = df[(df.sex == 'Male') & (df.income == '>50K\n')]
    

    Summerize some statistic value

    • Mean
      $$\mu=\frac{1}{n}\sum_{i=1}^nx_i$$
    ml['age'].mean()
    
    • Sample Variance
      $$\sigma2=\frac{1}{n}\sum_{i}(x_i-\mu)2$$
      σ is the standard deviation
    ml['age'].var()
    ml['age'].std()
    
    • Sample Median: ml['age'].median()
    • Quantil: a fraction p of the data value is less or equal to $x_p$
    • Data distribution: commenly use histogram to show the distribution
    maAge = ml['age']
    mlAge.hist(normed = 0, histtype = 'stepfilled', bins = 20)
    # bins: width of the bin
    # normed: 1 to show the value of the density function at the bin.(0-1)
    #         0 to show the number of the sample in the bin(normally a real number)
    

    to make two histogram compareable, we can overlapping each other in a same graphic

    import seaborn as sns
    fmAge.hist(normed = 0, histtype = 'stepfilled', alpha = .5, bins = 20)
    mlAge.hist(normed = 0, histtype = 'stepfilled', alpha = 0.5,
    color = sns.desaturate("indianred", .75),
    bin = 10)
    

    if we want to show the cumulative histogram, add a parameter with cumulative = True

    Outlier Treatment

    First define the standard of the outlier. Normally the samples that far from the median and whose value exceed the mean by 2 or 3 standard deviation.

    df2 = df.drop(df.index[
    (df.income == '>50K\n') &
    (df['age'] > df['age'].median() + 35) &
    (df['age'] > df['age'].median() - 15)
    ])
    

    you can check the removal of your data by overlapping the graphic before and after the removing.
    histogram() returns

    • hist: value of the bar(according to normed), Array
    • bin.edge: edge of the bin, array of float

    Measureing Asymmetry: Skewness and Pearson's Median Skewness Coefficient

    Skewness: how asymmetric the distribution is.
    Skewness of a set of n data samples, $x_i$:
    $$ g_1 = \frac{1}{n}\frac{\sum_{i}(x_i - \mu3)}{\sigma3} $$
    nagative skewness means the distribution "skews left". A normal distribution have a 0 skewness.
    we can define a function to calculate the skewness of a data set

    def skewness()
    res = 0
    m = x.mean()
    s = x.std()
    for i in x:
    res += (i-m) * (i-m) * (i-m)
    res /= (len(x) * s * s * s)
    return res
    
    

    Normal Distribution

    Also called the Gaussian distribution,
    $$ P(x) = \frac{1}{\sqrt{2\pi\sigma2}}e{- \frac{({x-\mu})2}{2\sigma2}} $$
    really important in the statistics

    Kernel Density

    We don't really get every distribution's parameter of every data set.
    Instead, we use kernel density estimation. We use Gaussian kernel to generate the density around the data. The sum of these Gaussian distribution can give us a continuous function. So that we can have a continuous representation of the original distribution

    x1 = np.random.normal(-1, 0.5, 15)
    # random.normal have 3 parameter
    # loc: center of the normal distribution
    # scale: standard deviation
    # size: output shape, canbe a tuple of int or int
    x2 = np.random.normal(6, 1, 10)
    y = np.r_[x1, x2]
    # r_[] merging all the input to a new Array.
    x = np.linspace(min(y), max(y), 100)
    # linspace return evenly distribute number, array
    
    s = 0.4 # smoothing parameter
    
    #calculate the kernels
    kernels = np.transpose([norm.pdf(x, yi, s) for yi in y])
    #norm.pdf(x, loc, scale)
    #pdf stands for probability density function
    plt.plot(x, kernels, 'k:')
    plt.plot(x, kernels.sum(1), 'r')
    plt.plot(y, np.zeros(len(y)), 'bo', ms = 10)
    

    we can also call the build in function of SciPy

    from scipy.stats import kde
    density = kde.gaussian_kde(y)
    # kde.gaussian_kde Representation of a kernel-density estimate using Gaussian kernels.
    xgrid = np.linspace(x.min(), x.max(), 200)
    plt.hist(y, bins = 28, normed = True)
    plt.plot(xgrid, density(xgrid), 'r-')
    

    Estimation

    Mean

    we guess the mean $\mu$ of the distribution using the sample mean. This process called estimation and the statistic called estimator.
    The sample mean minimize the following mean squared error
    $$ MSE = \frac{1}{n}\sum{(\bar{x}-\mu)^2} $$

    NTs = 200
    mu = 0.0
    var = 1.0
    err = 0.0
    NPs = 1000
    for i in range(NTs)
    x = np.random.normal(mu, var, NPs)
    err += (x.mean()-mu)**2
    print 'MSE: ', err/NTests
    
    Variance

    If the sample number is large enough, we can use sample variance estimator
    $$ \bar{\sigma}^2 = \frac{1}{n}\sum(x_i-\bar{x})^2 $$
    if we don't have enough sample, this estimator won't works well, it will be biased. So better estimator is
    $$ \bar{\sigma}^2 = \frac{1}{n-1}\sum(x_i-\bar{x})^2 $$

    Standard Score

    $$ z_i = \frac{(x_i-\mu)}{\sigma} $$
    It's a normalize number set. After the normalizing, it has a distribution, mean of 0 and variace of 1. It inherits the shape of the orincginal dataset.

    Covariance, Rank Correlation

    Covarience only will be mention when two variable share the same tendency.
    We center the data with respect to their mean, and look for the difference.
    $$ Cov(X, Y) = \frac{1}{n}\sum_{i=1}^ndx_idy_i $$
    Pearson's correlation is after we normalized the data and multiply them together.
    $$ \rho_i = \frac{x_i-\mu_X}{\rho_X}\frac{y_i-\mu_Y}{\rho_Y} $$
    $$ \rho = \frac{1}{n}\sum_{i=1}^n{\rho_i}=\frac{Cov(X,Y)}{\sigma_X\sigma_Y} $$
    Pearson's correlation is always between -1 and +1. The magnitude shows the correlation. The sign shows positive or negative relation
    $\rho = 0$ dose not mean that they have no correlation, it only shows the correlation of first order. And note that it dosen't work well in the presence of outliers.
    Spearmans Rank Correlation is a robust solution especially when the data contain outliers. Use rank to sort sample data.
    $$ r_s = 1-\frac{6\sum_i{d_i2}}{n\cdot(n2-1)} $$
    with $$ d_i = rg(x_i)-rg(y_i)$$
    is the difference of the rank of $x$ and $y$

    Statistical Inference

    Theres two way to address the problem of statistical inference.

    • frequentist approach
    • bayesian approach:

    The frequentist approach

    If we accept this asumption, that means we concern about the population parameters from analysis of sample.
    There's three things we mostly concern about:

    • point estimation: a particular value that approximates some parameter of interest. e.g. mean or variance
    • confidence intervals, or set estimattes: range of the value that best represent the parameter of insterest
    • hypothesis: accept or reject

    Estimation

    Remember estimation is not truth. More data the estimation will be better.

    Point estimation

    Sampling

    data = pd.read_csv("files/aaa.csv")
    data['Date'] = data[u'Dia de mes'].apply(lambda x: str(x)) +
    '-' +
    data[u'Mes de any'].apply(lambda x: str(x))
    data['Date'] = pd.to_datetime(data['Date'])
    accidents = data.groupby(['Date']).size()
    print accidents.mean()
    

    How we estimate the sampling distribution of the sample mean

    1. Draw $s$ (a large number) independent samples $ { x^1,..., x^s} $ from the population where each element $x^j$ is composed of ${x_i^j}_{i=1,...,n}$.
    2. Evaluate the sample mean $\hat{\mu}j=\frac{1}{n}\sum_{i=1}nx_i^j$ of each sample
    3. Estimate the sampling distribution of $\hat{\mu}$ by the empirical distribution of the sample replications.
    # population
    df = accidents.to_frame() # convert accidents to DataFrame
    nTest = 10000
    elements = 200
    
    # mean array of samples
    means = [0] * nTest # build a sero array with nTest 0
    
    #sample generation
    for i in range(nTest):
    rows = np.random.choice(df.index.values, elements)
    sampled_df = df.ix[rows]
    means[i] = sampled_df.mean()
    

    Traditional Approach

    But actually we can't always get access to the population, so we have to solve the problem by using some theoretical results from traditonal statistics.
    Standard error
    $$ SE = \frac{\sigma_x}{\sqrt{n}} $$
    This formula use the standard deviation $\sigma_x$, which is not known. If the population distribution is not skewed and estimation is sufficiently good if $n>30$. THis allow us to use the standard error of the sample value if we can't get access to the population.

    rows = np.random.choice(df.index.values, 200)
    sampledDf = df.ix[rows]
    estSigmaMean = sampledDf.std()/math.sqrt(200)
    

    Computationally intensive approach

    When the full dataset has super large population, a alternative to the traditional.
    In the bootstrap, we draw n observation with replacement from the oringinal data to create a bootstrap sample or resample. Then, we can calculate the mean for this resample. By repeating this process more times, we can get a good approximation of the mean sampling distribution.

    def meanBootStrap (X, numberb):
    x = [0] * numberb
    for i in range (numberb):
    sample = [X[j]
    for j in np.random.randint(len(X), size = len(X)
    ]
    x[i] = np.mean(sample)
    return x
    m = meanBootstrap(accidents, 10000)
    np.mean(m)
    

    Confidence Interval

    Confidence Level 90% 95% 99% 99.9%
    z Value 1.65 1.96 2.58 3.291

    A point estimate $\Theta$, is hardly to be perfect. So we want to provide a plausible range of value of the parameter. This range called confidence interval.
    We care about the interval and the plausibility .
    If the point estimate follow the normal model with standard error SE, then a confidence interval for the population parameter is $$\Theta \pm z \times SE$$

    Confidence Level 90% 95% 99% 99.9%
    z Value 1.65 1.96 2.58 3.291

    Use bootstrapping to calculate the 95% confidence interval

    m= meanBootstrap(accidents, 10000)
    sampleMean = np.mean(m)
    sampleSe = np.std(m)
    
    # Confidence Interval
    ci = [np.percentile(m, 2.5), np.precentile(m, 97.5)]
    

    95% confidence means:

    In 95% of the cases, when I compute the 95% confidence interval from this sample, the true mean of the population will fall within the interval define by the bounds: $\pm1.96\times SE$

    Hypothesis Testing

    Two kinds of hypotheses:

    • $H_0$: Two distribution is the same. Null hypothesis
    • $H_1$: Two distribution is different. Alternative hypothesis
    n = len(counts2013)
    mean = counts2013.mean()
    s = counts2013.std()
    ci = [ mean - s*1.96/np.sqrt(n), mean + s*1.96/np.sqrt()]
    

    Interpretation of CI Test
    The result canbe reject, or refuse to reject.

    if we use a 95% confidence interval to test a problem where the null hypothesis is true, we will make an error whenever the point estimate is at least $1.96\sigma$ away from the population parameter. This hapen about 5% of the time.

    相关文章

      网友评论

        本文标题:Introduction to Data Science (st

        本文链接:https://www.haomeiwen.com/subject/nnifvxtx.html