5.22 ② distributions and sampling

Author: 钊钖 | Published: 2018-05-23 11:31 | Read 0 times


odds = p/(1-p)
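
As a small illustration of the formula above, the sketch below converts a probability to odds and back again. The function names `probability_to_odds` and `odds_to_probability` are made up for this example and are not part of the original exercise.

```python
def probability_to_odds(p):
    """Convert a probability p (0 <= p < 1) to odds p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def odds_to_probability(odds):
    """Invert the conversion: p = odds / (1 + odds)."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)


print(probability_to_odds(0.5))  # 1.0, i.e. even odds
print(odds_to_probability(3.0))  # 0.75
```

Odds of 1.0 mean the event is as likely to happen as not; odds above 1.0 mean it is more likely than not.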

Find how many standard deviations away from the mean of large_sample .18 is. Assign the result to deviations_from_mean.

Find how many probabilities in large_sample are greater than or equal to .18. Assign the result to over_18_count.

import numpy
large_sample_std = numpy.std(large_sample)
avg = numpy.mean(large_sample)

deviations_from_mean = (.18 - avg)/ large_sample_std

over_18_count = len([p for p in large_sample if p >= .18])
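
The snippet above depends on `large_sample`, which the exercise environment supplies and which is not shown in these notes. The sketch below repeats the same two calculations on a simulated stand-in so it can be run on its own. The way `large_sample` is built here (1000 proportions, each from 100 draws with a true probability of 0.15) is purely an assumption for illustration. It uses the standard library instead of numpy; `statistics.pstdev` is the population standard deviation, which is also what `numpy.std` computes by default.

```python
import random
import statistics

random.seed(1)

# Hypothetical stand-in for large_sample: 1000 simulated proportions,
# each the share of "successes" among 100 draws with true probability 0.15.
large_sample = [
    sum(random.random() < 0.15 for _ in range(100)) / 100
    for _ in range(1000)
]

avg = statistics.mean(large_sample)
# Population standard deviation (same convention as numpy.std's default).
large_sample_std = statistics.pstdev(large_sample)

# How many standard deviations .18 lies above the mean.
deviations_from_mean = (0.18 - avg) / large_sample_std

# How many simulated proportions are at least .18.
over_18_count = sum(1 for p in large_sample if p >= 0.18)

print(deviations_from_mean, over_18_count)
```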

Sample counties

Use the select_random_sample function to pick 1000 random samples of 100 counties each from the income data. Find the mean of the median_income column for each sample.

Plot a histogram with 20 bins of all the mean median incomes.

import pandas as pd
import matplotlib.pyplot as plt
import random
income = pd.read_csv('us_income.csv')
# print(income.head())

# this is the mean median income in any US county
mean_median_income = income['median_income'].mean()

# mean median income for one contiguous block of rows
def get_sample_mean(start,end):
    return income['median_income'][start:end].mean()

# Step through the data in blocks of row_step rows,
# starting at row 0 (0, row_step, row_step*2, etc.),
# and collect the mean median income of each block.
def find_mean_incomes(row_step):
    mean_median_sample_incomes=[]
    for i in range(0,income.shape[0],row_step):
        mean_median_sample_incomes.append(
            get_sample_mean(i, i+ row_step))
    return mean_median_sample_incomes


non_random_sample = find_mean_incomes(100)
plt.hist(non_random_sample,20)
plt.show()

# What you're seeing above is the result of biased sampling.
# Instead of selecting randomly, we selected counties that were 
# next to each other in the data.

# This picked counties in the same state more often than not, 
# and created means that didn't represent the whole country.

# This is the danger of not using random sampling -- 
# you end up with samples that don't reflect 
# the entire population.

# This gives you a distribution that isn't normal.



# Draw one random sample of `count` counties, without replacement.
def select_random_sample(count):
    # pick `count` distinct row positions at random
    random_indices = random.sample(range(0, income.shape[0]), count)
    # return the rows at those positions
    return income.iloc[random_indices]


# Use the select_random_sample function to pick 1000 random samples 
# of 100 counties each from the income data. Find the mean of the
# median_income column for each sample.
random.seed(1)
# make 1000 random samples and record the mean median income of each
random_sample = [select_random_sample(100)['median_income'].mean() for _ in range(1000)]
plt.hist(random_sample,20)
plt.show()
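
The code above needs us_income.csv. To see the same contrast without the file, here is a minimal sketch on made-up data. The population (30 "states" of 100 "counties" each, stored state by state) and all of its numbers are assumptions chosen for illustration, not values from the income data.

```python
import random
import statistics

random.seed(1)

# Hypothetical population: 30 "states" of 100 "counties" each, stored state by state.
# Each state has its own income level, so neighbouring rows are similar.
population = []
for _ in range(30):
    state_level = random.uniform(30000, 70000)
    population.extend(random.gauss(state_level, 3000) for _ in range(100))

# Non-random sampling: consecutive blocks of 100 rows (one state per block).
block_means = [
    statistics.fmean(population[i:i + 100])
    for i in range(0, len(population), 100)
]

# Random sampling: 100 rows drawn without replacement from the whole population.
random_means = [
    statistics.fmean(random.sample(population, 100))
    for _ in range(1000)
]

print(statistics.pstdev(block_means))   # spread of the block means
print(statistics.pstdev(random_means))  # spread of the random-sample means
```

The block means scatter widely because each block mostly reflects one state, while the random-sample means cluster around the population mean.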

An experiment

def select_random_sample(count):
    random_indices = random.sample(range(0,income.shape[0]),count)
    return income.iloc[random_indices]


# Select 1000 random samples of 100 counties each 
# from the income data using the
#  select_random_sample method.
random.seed(1)
mean_ratios=[]

# For each sample:
# Divide the median_income_hs column by
# median_income_college to get ratios.
# Then, find the mean of all the ratios in the sample.
# Add it to the list, mean_ratios.
for i in range(1000):
    sample = select_random_sample(100)
    ratios = sample['median_income_hs'] / sample['median_income_college']
    mean_ratios.append(ratios.mean())

plt.hist(mean_ratios,20)
plt.show()

Statistical significance

After 5 years, we determine that the mean ratio in our random sample of 100 counties is .675 -- that is, high school graduates on average earn 67.5% of what college graduates do.

Now that we have our result, how do we know if our hypothesis is correct? Remember, our hypothesis was about the whole population, not about the sample.

Statistical significance is used to judge whether a result found in a sample is likely to hold for the whole population. You usually set a significance level before the experiment. After conducting it, you compare your result against that level to decide whether the result supports your hypothesis.

A common significance level is .05. This means that, if only chance were at work, a result at least as extreme as ours would occur 5% of the time or less.

In our case, chance could be that the high school graduates in the county saw their income change in some way other than through our program -- maybe some higher-paying factory jobs came to town, or there were other educational initiatives around.

In order to test for significance, we compare our result ratio with the mean ratios we found in the last section.

Determine how many values in mean_ratios are greater than or equal to .675.
Divide by the total number of items in mean_ratios to get the significance value.
Assign the result to significance_value.

significance_value = None

mean_higher = len([m for m in mean_ratios if m >= .675])
significance_value = mean_higher / len(mean_ratios)
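
The same calculation can be wrapped in a small helper. This is only a sketch: the name `empirical_significance` and the toy list of ratios are made up for illustration, and the real `mean_ratios` comes from the 1000 random samples drawn earlier.

```python
def empirical_significance(null_values, observed):
    """Fraction of values in null_values that are >= observed."""
    if not null_values:
        raise ValueError("null_values must not be empty")
    at_least_as_large = sum(1 for v in null_values if v >= observed)
    return at_least_as_large / len(null_values)


# Tiny made-up example: 2 of the 8 values are >= 0.675.
toy_mean_ratios = [0.61, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.70]
print(empirical_significance(toy_mean_ratios, 0.675))  # 0.25
```

A smaller returned fraction means the observed value sits further out in the tail of the chance-only distribution.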

Final result

Our significance value was .014. If only chance were at work, a mean ratio of .675 or higher would turn up in only 1.4% of random samples of 100 counties. That is below the common .05 significance level (lower means more significant), so the result is statistically significant. Thus, our experiment indicates that the program did improve the wages of high school graduates relative to college graduates.

You may have noticed earlier that the more samples in our trials, the "steeper" the histograms of outcomes get (look back on the probability of rolling one with the die if you need a refresher). This "steepness" arose because the more trials we have, the less likely the value is to vary from the "true" value.

This same principle applies to significance testing. You need a larger deviation from the mean to have something be "significant" if your sample size is smaller. The larger the trial, the smaller the deviation needs to be to get a significant result.

You may be asking at this point how we can determine statistical significance without knowing the population values upfront. In a lot of cases, like drug trials, you don't have the capability to measure everyone in the world to compare against your sample.

Statistics gives us tools to deal with this, and we'll learn about them in the next missions.

# This is "steeper" than the graph from before, because it has 500 items in each sample.
random.seed(1)
mean_ratios = []
for i in range(1000):
    sample = select_random_sample(500)
    ratios = sample["median_income_hs"] / sample["median_income_college"]
    mean_ratios.append(ratios.mean())
    
plt.hist(mean_ratios, 20)
plt.show()
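
To put a number on "steeper" without the income file, the sketch below measures the spread of the sample means for two sample sizes. The population of ratios is simulated (3000 values around 0.65), which is an assumption for illustration only, and `spread_of_sample_means` is a hypothetical helper name.

```python
import random
import statistics

random.seed(1)

# Hypothetical population of 3000 ratios centred near 0.65.
population = [random.gauss(0.65, 0.1) for _ in range(3000)]


def spread_of_sample_means(sample_size, trials=1000):
    """Standard deviation of the means of `trials` random samples."""
    means = [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(means)


spread_100 = spread_of_sample_means(100)
spread_500 = spread_of_sample_means(500)
print(spread_100, spread_500)
```

The spread for samples of 500 comes out smaller than for samples of 100, which is why a smaller deviation from the mean is enough to be significant when the sample is larger.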

Article link: https://www.haomeiwen.com/subject/eafljftx.html
