5.22 ② distributions and samplin

Author: 钊钖 | Published: 2018-05-23 11:31

    odds = p/(1-p)
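
    A minimal sketch of the conversion above. The helper name odds_from_probability is an illustrative assumption and does not come from the original notes.

```python
def odds_from_probability(p):
    """Convert a probability p (0 <= p < 1) into odds p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


print(odds_from_probability(0.5))   # 1.0 (even odds)
print(odds_from_probability(0.75))  # 3.0 (three to one)
```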

    Find how many standard deviations .18 is from the mean of large_sample. Assign the result to deviations_from_mean.

    Find how many probabilities in large_sample are greater than or equal to .18. Assign the result to over_18_count.

    import numpy
    large_sample_std = numpy.std(large_sample)
    avg = numpy.mean(large_sample)
    
    deviations_from_mean = (.18 - avg)/ large_sample_std
    
    over_18_count = len([p for p in large_sample if p >= .18])
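
    The snippet above relies on a large_sample list defined elsewhere in the course. As a self-contained sketch, the version below builds a hypothetical stand-in for large_sample (the simulated data and the probability 0.15 are assumptions for illustration) and uses the standard library's statistics module instead of numpy.

```python
import random
import statistics

random.seed(1)

# Hypothetical stand-in for large_sample: 1000 simulated proportions,
# each the fraction of successes in 100 trials with success chance 0.15.
large_sample = [
    sum(random.random() < 0.15 for _ in range(100)) / 100
    for _ in range(1000)
]

avg = statistics.fmean(large_sample)
# pstdev is the population standard deviation, which is also what
# numpy.std computes by default (ddof=0).
std = statistics.pstdev(large_sample)

# Distance of .18 from the mean, measured in standard deviations.
deviations_from_mean = (0.18 - avg) / std
# Number of simulated proportions at or above .18.
over_18_count = sum(1 for p in large_sample if p >= 0.18)

print(deviations_from_mean, over_18_count)
```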
    
    

    Sample counties

    Use the select_random_sample function to pick 1000 random samples of 100 counties each from the income data. Find the mean of the median_income column for each sample.

    Plot a histogram with 20 bins of all the mean median incomes.

    import pandas as pd
    import matplotlib.pyplot as plt
    import random
    income = pd.read_csv('us_income.csv')
    # print(income.head())
    
    # This is the mean median income in any US county.
    mean_median_income = income['median_income'].mean()
    
    # Mean median income for one contiguous slice of rows, start to end.
    def get_sample_mean(start,end):
        return income['median_income'][start:end].mean()
    
    # Walk through the data in consecutive blocks of row_step rows,
    # starting at row 0 (0, row_step, row_step*2, etc.),
    # and collect the mean median income of each block.
    def find_mean_incomes(row_step):
        mean_median_sample_incomes=[]
        for i in range(0,income.shape[0],row_step):
            mean_median_sample_incomes.append(
                get_sample_mean(i, i+ row_step))
        return mean_median_sample_incomes
    
    
    non_random_sample = find_mean_incomes(100)
    plt.hist(non_random_sample,20)
    plt.show()
    
    # What you're seeing above is the result of biased sampling.
    # Instead of selecting randomly, we selected counties that were 
    # next to each other in the data.
    
    # This picked counties in the same state more often than not, 
    # and created means that didn't represent the whole country.
    
    # This is the danger of not using random sampling -- 
    # you end up with samples that don't reflect 
    # the entire population.
    
    # This gives you a distribution that isn't normal.
    
    
    
    # Draw one random sample: a DataFrame with `count` randomly chosen rows.
    def select_random_sample(count):
        # Pick `count` distinct row positions at random (no repeats).
        random_indices = random.sample(range(0, income.shape[0]), count)
        # Return the rows at those positions.
        return income.iloc[random_indices]
    
    
    # Use the select_random_sample function to pick 1000 random samples 
    # of 100 counties each from the income data. Find the mean of the
    # median_income column for each sample.
    random.seed(1)
    # Make 1000 random samples and record the mean of each.
    random_sample = [select_random_sample(100)['median_income'].mean() for _ in range(1000)]
    plt.hist(random_sample,20)
    plt.show()
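
    The contrast between the two histograms can also be checked numerically. The sketch below does not use us_income.csv. It builds a hypothetical dataset (30 "states" of 100 "counties" each, with invented income figures) stored state by state, then compares the spread of means from consecutive blocks with the spread of means from random samples.

```python
import random
import statistics

random.seed(1)

# Hypothetical data: 30 "states" of 100 "counties" each, stored state by state.
# Counties in the same state share a state-level income baseline.
incomes = []
for _ in range(30):
    state_level = random.gauss(50000, 8000)
    incomes.extend(random.gauss(state_level, 3000) for _ in range(100))

# Biased sampling: consecutive blocks of 100 rows, so each block is one state.
block_means = [
    statistics.fmean(incomes[i:i + 100]) for i in range(0, len(incomes), 100)
]

# Random sampling: 1000 samples of 100 rows, drawn without replacement
# from the whole list each time.
random_means = [
    statistics.fmean(random.sample(incomes, 100)) for _ in range(1000)
]

print("spread of block means: ", statistics.pstdev(block_means))
print("spread of random means:", statistics.pstdev(random_means))
```

    In this made-up setup the block means vary far more than the random-sample means, because each block reflects a single state and not the whole country.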
        
    

    An experiment

    def select_random_sample(count):
        random_indices = random.sample(range(0,income.shape[0]),count)
        return income.iloc[random_indices]
    
    
    # Select 1000 random samples of 100 counties each
    # from the income data using the
    # select_random_sample function.
    random.seed(1)
    mean_ratios = []
    
    # For each sample:
    #   Divide the median_income_hs column by
    #   median_income_college to get ratios.
    #   Then, find the mean of all the ratios in the sample.
    #   Add it to the list mean_ratios.
    for i in range(1000):
        sample = select_random_sample(100)
        ratios = sample['median_income_hs'] / sample['median_income_college']
        mean_ratios.append(ratios.mean())
    
    plt.hist(mean_ratios,20)
    plt.show()
    

    Statistical significance

    After 5 years, we determine that the mean ratio in our random sample of 100 counties is .675 -- that is, high school graduates on average earn 67.5% of what college graduates do.

    Now that we have our result, how do we know if our hypothesis is correct? Remember, our hypothesis was about the whole population, not about the sample.

    Statistical significance is used to judge whether a result observed in a sample is likely to hold for the whole population. You usually set a significance level beforehand; after conducting the experiment, you compare your result against that level to decide whether it supports your hypothesis.

    A common significance level is .05. This means: "only 5% or less of the time will the result have been due to chance".

    In our case, chance could be that the high school graduates in the county changed income some way other than through our program -- maybe some higher paying factory jobs came to town, or there were some other educational initiatives around.

    In order to test for significance, we compare our result ratio with the mean ratios we found in the last section.

    Determine how many values in mean_ratios are greater than or equal to .675.
    Divide by the total number of items in mean_ratios to get the significance value.
    Assign the result to significance_value.

    significance_value = None
    
    mean_higher = len([m for m in mean_ratios if m >= .675])
    significance_value = mean_higher / len(mean_ratios)
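
    The same calculation can be wrapped in a small helper and compared against the .05 level. This is a sketch only: the function name fraction_at_least and the five example ratios are hypothetical and stand in for the real mean_ratios list.

```python
def fraction_at_least(values, threshold):
    """Return the share of values that are greater than or equal to threshold."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(1 for v in values if v >= threshold) / len(values)


# Hypothetical mean ratios, for illustration only.
example_ratios = [0.61, 0.64, 0.66, 0.68, 0.70]

significance_value = fraction_at_least(example_ratios, 0.675)
# The result counts as significant only if the value is at or below the level.
is_significant = significance_value <= 0.05

print(significance_value)  # 0.4, since 2 of the 5 values are >= .675
print(is_significant)      # False
```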
    
    

    Final result

    Our significance value was .014. Among the 1000 random samples drawn from the whole population, only 1.4% had a mean ratio of .675 or higher, so a result like ours would rarely occur on its own. Because .014 is below our significance level of .05 (lower means more significant), the result is statistically significant. Thus, our experiment indicates that the program did improve the wages of high school graduates relative to college graduates.

    You may have noticed earlier that the more samples in our trials, the "steeper" the histograms of outcomes get (look back on the probability of rolling one with the die if you need a refresher). This "steepness" arose because the more trials we have, the less likely the value is to vary from the "true" value.

    This same principle applies to significance testing. You need a larger deviation from the mean to have something be "significant" if your sample size is smaller. The larger the trial, the smaller the deviation needs to be to get a significant result.

    You may be asking at this point how we can determine statistical significance without knowing the population values upfront. In a lot of cases, like drug trials, you don't have the capability to measure everyone in the world to compare against your sample.

    Statistics gives us tools to deal with this, and we'll learn about them in the next missions.

    # This is "steeper" than the graph from before, because it has 500 items in each sample.
    random.seed(1)
    mean_ratios = []
    for i in range(1000):
        sample = select_random_sample(500)
        ratios = sample["median_income_hs"] / sample["median_income_college"]
        mean_ratios.append(ratios.mean())
        
    plt.hist(mean_ratios, 20)
    plt.show()
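
    The "steepness" described above can be measured directly as the spread of the sample means. The sketch below uses a hypothetical population of 3000 simulated ratios (not the real income data) and compares samples of 100 with samples of 500. The helper name spread_of_sample_means is an assumption for illustration.

```python
import random
import statistics

random.seed(1)

# Hypothetical population of 3000 ratios centred near 0.6.
population = [random.gauss(0.6, 0.1) for _ in range(3000)]


def spread_of_sample_means(sample_size, trials=1000):
    """Standard deviation of the means of `trials` random samples."""
    means = [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(means)


spread_100 = spread_of_sample_means(100)
spread_500 = spread_of_sample_means(500)

# Larger samples give means that cluster more tightly around the true value.
print(spread_100, spread_500)
```

    A narrower spread means a smaller deviation from the mean is enough to stand out, which is why larger trials need smaller deviations to reach significance.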
    

    Article link: https://www.haomeiwen.com/subject/eafljftx.html