Simulating from the Null Hypothesis

Author: 兀o | Published: 2018-12-30 17:19
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    
    %matplotlib inline
    np.random.seed(42)
    
    full_data = pd.read_csv('coffee_dataset.csv')
    sample_data = full_data.sample(200)
    
    1. If you were interested in whether the average height for coffee drinkers is the same as for non-coffee drinkers, what would the null and alternative hypotheses be? Place them in the cell below, and use your answer to answer the first quiz question below.

      Since this statement has no directional component, a two-sided ("not equal to") alternative seems most reasonable.

      $H_0: \mu_{coff} - \mu_{no} = 0$

      $H_1: \mu_{coff} - \mu_{no} \neq 0$

      $\mu_{coff}$ and $\mu_{no}$ are the population mean heights for coffee drinkers and non-coffee drinkers, respectively.

    2. If you were interested in whether the average height for coffee drinkers is less than that for non-coffee drinkers, what would the null and alternative hypotheses be? Place them in the cell below, and use your answer to answer the second quiz question below.

      In this case the question has a direction: is the average height for coffee drinkers less than that for non-coffee drinkers? Below is one way to write the null and alternative. Since the mean for coffee drinkers is listed first, the alternative states that the difference is negative.

      $H_0: \mu_{coff} - \mu_{no} \geq 0$

      $H_1: \mu_{coff} - \mu_{no} < 0$

      $\mu_{coff}$ and $\mu_{no}$ are the population mean heights for coffee drinkers and non-coffee drinkers, respectively.

    3. For 10,000 iterations: bootstrap the sample data, calculate the mean height for coffee drinkers and for non-coffee drinkers, and calculate the difference in means for each bootstrap sample. You will want three arrays at the end of the iterations: one for each group mean and one for the difference in means. Use the resulting sampling distribution to answer the third quiz question below.

    nocoff_means, coff_means, diffs = [], [], []
    
    for _ in range(10000):
        # resample the 200 rows of the sample with replacement
        bootsamp = sample_data.sample(200, replace = True)
        coff_mean = bootsamp[bootsamp['drinks_coffee'] == True]['height'].mean()
        nocoff_mean = bootsamp[bootsamp['drinks_coffee'] == False]['height'].mean()
        # store both group means and their difference for this bootstrap sample
        coff_means.append(coff_mean)
        nocoff_means.append(nocoff_mean)
        diffs.append(coff_mean - nocoff_mean)
    
    np.std(nocoff_means) # the standard deviation of the sampling distribution for nocoff
    
    np.std(coff_means) # the standard deviation of the sampling distribution for coff
    
    np.std(diffs) # the standard deviation for the sampling distribution for difference in means
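
As a rough cross-check on `np.std(diffs)`, the bootstrap standard deviation of a difference in means should land close to the usual analytic standard error, sqrt(s_c^2/n_c + s_n^2/n_n). The sketch below is only illustrative: it uses synthetic heights (the group means, spreads, and 60% coffee share are made-up assumptions, not values from `coffee_dataset.csv`) and plain NumPy index resampling in place of `DataFrame.sample`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for sample_data (assumed values, not the real dataset).
n = 200
drinks_coffee = rng.random(n) < 0.6
height = np.where(drinks_coffee, rng.normal(68, 3, n), rng.normal(66, 3, n))

# Bootstrap the difference in group means by resampling row indices.
n_boot = 2000
diffs = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)  # draw n row indices with replacement
    h, d = height[idx], drinks_coffee[idx]
    diffs[b] = h[d].mean() - h[~d].mean()

boot_se = diffs.std()

# Analytic standard error of a difference between two independent sample means.
coff, nocoff = height[drinks_coffee], height[~drinks_coffee]
analytic_se = np.sqrt(coff.var(ddof=1) / coff.size + nocoff.var(ddof=1) / nocoff.size)

print(f"bootstrap SE: {boot_se:.3f}, analytic SE: {analytic_se:.3f}")
```

If the two numbers differ a lot on real data, that usually points to a very small group or a heavily skewed variable rather than a problem with the bootstrap loop itself.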
    
    plt.hist(nocoff_means, alpha = 0.5);
    plt.hist(coff_means, alpha = 0.5); # They look pretty normal to me!
    
    plt.hist(diffs, alpha = 0.5); # again normal - this is by the central limit theorem
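
A histogram that "looks pretty normal" can be backed up with a few numbers. The sketch below bootstraps a mean and then computes skewness, excess kurtosis, and the share of values within one standard deviation. The `heights` array is a hypothetical stand-in for one group of the sample, not the real coffee data, and these summaries are informal checks rather than a formal normality test.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for 200 sampled heights (assumed mean and spread).
heights = rng.normal(67, 3, size=200)

# Bootstrap the sample mean 5,000 times.
boot_means = np.array([
    rng.choice(heights, size=heights.size, replace=True).mean()
    for _ in range(5000)
])

# Standardize, then compute skewness and excess kurtosis;
# both are 0 for an exactly normal distribution.
z = (boot_means - boot_means.mean()) / boot_means.std()
skewness = np.mean(z ** 3)
excess_kurtosis = np.mean(z ** 4) - 3

# Share of draws within one standard deviation; roughly 0.683 for a normal.
within_one_sd = np.mean(np.abs(z) <= 1)

print(skewness, excess_kurtosis, within_one_sd)
```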
    
    # 10,000 draws from the sampling distribution of the difference under the null:
    # centered at 0 (the null value), with the spread estimated from the bootstrap differences
    null_vals = np.random.normal(0, np.std(diffs), 10000)
    
    plt.hist(null_vals); # the sampling distribution of the difference under the null
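
One way to use `null_vals` is to compare them with the difference actually observed in the sample. In the sketch below, `diffs` and `obs_diff` are hypothetical placeholders (made-up numbers, not results from the coffee data); in the notebook, `obs_diff` would be the coffee-minus-no-coffee difference in mean height computed from `sample_data`. Which tail gets counted follows the alternative hypothesis from questions 1 and 2.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical placeholders for the bootstrap differences and the observed difference.
diffs = rng.normal(1.3, 0.5, size=10_000)
obs_diff = 1.3

# Sampling distribution under the null: centered at 0, same spread as diffs.
null_vals = rng.normal(0, np.std(diffs), size=10_000)

# Question 1 (alternative: mu_coff - mu_no != 0): count both tails.
p_two_sided = np.mean(np.abs(null_vals) >= abs(obs_diff))

# Question 2 (alternative: mu_coff - mu_no < 0): count the lower tail only.
p_less = np.mean(null_vals <= obs_diff)

print(f"two-sided p-value: {p_two_sided:.4f}")
print(f"one-sided (less) p-value: {p_less:.4f}")
```

With these placeholder numbers the observed difference is positive, so the two-sided p-value is small while the "less than" p-value is large: a positive difference is no evidence that coffee drinkers are shorter.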
    


Article link: https://www.haomeiwen.com/subject/cpwvlqtx.html