Confidence Intervals for a Population Mean
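Questions:
- What is the proportion of coffee drinkers in the sample? What is the proportion of individuals who don't drink coffee?
- Of the individuals who drink coffee, what is the average height? Of the individuals who do not drink coffee, what is the average height?
- Simulate 200 "new" individuals from your original sample of 200. In this bootstrap sample (drawn with replacement), what is the proportion of coffee drinkers? What is the proportion of individuals who don't drink coffee?
- Now simulate your bootstrap sample 10,000 times and take the mean height of the non-coffee drinkers in each sample. Each bootstrap sample should be drawn from the first sample of 200 data points. Plot the distribution, and pull the values necessary for a 95% confidence interval. What do you notice about the sampling distribution of the mean in this example?
- Did your interval capture the actual average height of non-coffee drinkers in the population? Look at the average in the population and the two bounds provided by your 95% confidence interval, and then answer the final quiz question below.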
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(42)
coffee_full = pd.read_csv('coffee_dataset.csv')
coffee_red = coffee_full.sample(200)  # This is the only data you might actually get in the real world.
coffee_red.head()
- What is the proportion of coffee drinkers in the sample? What is the proportion of individuals that don't drink coffee?
coffee_red['drinks_coffee'].mean() # Drink Coffee
1 - coffee_red['drinks_coffee'].mean() # Don't Drink Coffee
- Of the individuals who do not drink coffee, what is the average height?
coffee_red[coffee_red['drinks_coffee'] == False]['height'].mean()
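The question list also asks for the average height of the coffee drinkers, which the cell above does not compute. A minimal sketch of getting both group means with boolean masks follows; the `toy` DataFrame and its values are hypothetical stand-ins for `coffee_red`, not data from `coffee_dataset.csv`.

```python
import pandas as pd

# Hypothetical stand-in for coffee_red (made-up values for illustration only).
toy = pd.DataFrame({
    'drinks_coffee': [True, True, False, False, True],
    'height': [68.0, 70.0, 64.0, 66.0, 72.0],
})

# Boolean mask selects coffee drinkers; ~ inverts it to select non-drinkers.
is_drinker = toy['drinks_coffee']
drinkers_mean = toy.loc[is_drinker, 'height'].mean()
non_drinkers_mean = toy.loc[~is_drinker, 'height'].mean()

print(drinkers_mean, non_drinkers_mean)
```

The same two lines should work on `coffee_red` directly, assuming `drinks_coffee` is a boolean column as the code above suggests.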
- Simulate 200 "new" individuals from your original sample of 200. What is the proportion of coffee drinkers in your bootstrap sample? How about individuals that don't drink coffee?
bootsamp = coffee_red.sample(200, replace=True)
bootsamp['drinks_coffee'].mean()  # Proportion who drink coffee; 1 minus this gives the proportion who don't
- Now simulate your bootstrap sample 10,000 times and take the mean height of the non-coffee drinkers in each sample. Plot the distribution, and pull the values necessary for a 95% confidence interval. What do you notice about the sampling distribution of the mean in this example?
boot_means = []
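for _ in range(10000):
    # Resample 200 rows with replacement from the original sample
    bootsamp = coffee_red.sample(200, replace=True)
    # Mean height of the non-coffee drinkers in this bootstrap sample
    boot_mean = bootsamp[bootsamp['drinks_coffee'] == False]['height'].mean()
    boot_means.append(boot_mean)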
plt.hist(boot_means); # Looks pretty normal
np.percentile(boot_means, 2.5), np.percentile(boot_means, 97.5)
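The percentile step above can also be wrapped in a small reusable function. The sketch below is not the notebook's method: it resamples a single group of heights directly with NumPy, whereas the loop above resamples all 200 rows and then filters, so the number of non-drinkers varies between resamples. The function name `bootstrap_mean_ci`, its parameters, and the synthetic `heights` array are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of a 1-D array (illustrative helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Draw all resamples at once: each row of idx indexes one bootstrap sample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # For level=0.95 this picks the 2.5th and 97.5th percentiles.
    alpha = (1 - level) / 2
    lower, upper = np.percentile(boot_means, [100 * alpha, 100 * (1 - alpha)])
    return lower, upper, boot_means

# Synthetic heights, not taken from coffee_dataset.csv.
heights = np.random.default_rng(1).normal(loc=66.0, scale=3.0, size=80)
lower, upper, boot_means = bootstrap_mean_ci(heights)
print(lower, upper)
```

Vectorizing the resampling this way avoids building 10,000 intermediate DataFrames, at the cost of holding an `n_boot` by `n` index array in memory.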
- Did your interval capture the actual average height of non-coffee drinkers in the population? Look at the average in the population and the two bounds provided by your 95% confidence interval, and then answer the final quiz question below.
coffee_full[coffee_full['drinks_coffee'] == False]['height'].mean()
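Answering the last question comes down to checking whether the population mean falls between the two percentile bounds. A minimal sketch of that check is shown here; `interval_captures` is a hypothetical helper, and the numbers passed to it are made up, not results from the dataset.

```python
def interval_captures(lower, upper, true_value):
    """Return True when true_value lies inside the closed interval [lower, upper]."""
    return lower <= true_value <= upper

# Made-up numbers for illustration only; substitute the two percentile bounds
# and the population mean computed from the data.
print(interval_captures(65.7, 67.0, 66.4))  # value inside the interval
print(interval_captures(65.7, 67.0, 67.5))  # value outside the interval
```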