
Author: 兀o | Published 2018-12-22 12:57



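  1. What proportion of the sample drinks coffee? What proportion does not?
  2. Among coffee drinkers, what is the average height? Among non-coffee drinkers, what is the average height?
  3. Simulate 200 "new" individuals from the original sample of 200. In this bootstrap sample (a sample drawn with replacement), what proportion drinks coffee? What proportion does not?
  4. Now simulate the bootstrap sampling 10,000 times and take the mean height of the non-coffee drinkers in each sample. Each bootstrap sample should be drawn from the first sample of 200 data points. Plot the distribution and pull the values needed for a 95% confidence interval. What do you notice about the sampling distribution of the mean in this example?
  5. Did your interval capture the actual average height of non-coffee drinkers in the population? Look at the population mean and the two bounds given by your 95% confidence interval, then answer the final quiz question below.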
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


coffee_full = pd.read_csv('coffee_dataset.csv')
coffee_red = coffee_full.sample(200) #this is the only data you might actually get in the real world.
  1. What is the proportion of coffee drinkers in the sample? What is the proportion of individuals that don't drink coffee?
coffee_red['drinks_coffee'].mean() # Drink Coffee
1 - coffee_red['drinks_coffee'].mean() # Don't Drink Coffee
  2. Of the individuals who do not drink coffee, what is the average height?
coffee_red[coffee_red['drinks_coffee'] == False]['height'].mean()
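The question list also asks for the average height of coffee drinkers. A minimal sketch of getting both group means in one call with `groupby` follows. The `demo` frame is synthetic stand-in data (made-up proportions and heights, not values from `coffee_dataset.csv`), and only the column names `drinks_coffee` and `height` are taken from the code above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 200-row sample; all values are made up for illustration.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'drinks_coffee': rng.random(200) < 0.6,   # boolean column, roughly 60% True
    'height': rng.normal(67, 3, size=200),    # arbitrary height-like numbers
})

# One mean per group: index False = non-coffee drinkers, True = coffee drinkers.
mean_height_by_group = demo.groupby('drinks_coffee')['height'].mean()
print(mean_height_by_group)
```

On the real data, replacing `demo` with `coffee_red` should give the non-drinker mean computed above alongside the drinker mean.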
  3. Simulate 200 "new" individuals from your original sample of 200. What is the proportion of coffee drinkers in your bootstrap sample? How about individuals that don't drink coffee?
bootsamp = coffee_red.sample(200, replace = True)
bootsamp['drinks_coffee'].mean() # proportion of coffee drinkers; 1 minus this value is the proportion of non-drinkers
  4. Now simulate your bootstrap sample 10,000 times and take the mean height of the non-coffee drinkers in each sample. Plot the distribution, and pull the values necessary for a 95% confidence interval. What do you notice about the sampling distribution of the mean in this example?
boot_means = []
for _ in range(10000):
    bootsamp = coffee_red.sample(200, replace = True)
    boot_mean = bootsamp[bootsamp['drinks_coffee'] == False]['height'].mean()
    boot_means.append(boot_mean) # store the non-drinker mean height for this resample
plt.hist(boot_means); # Looks pretty normal
np.percentile(boot_means, 2.5), np.percentile(boot_means, 97.5)
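The percentile interval above can also be sketched without a Python-level loop. The version below is an illustration under stated assumptions, not the notebook's method: `bootstrap_mean_ci` is a hypothetical helper, the `heights` array is made-up data, and it resamples the non-drinkers' heights directly (fixed group size) rather than resampling all 200 rows and filtering afterwards as the loop above does.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of a 1-D array (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Each row of idx holds the indices of one resample of size n, drawn with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = values[idx].mean(axis=1)
    # Cut off equal tails on both sides, e.g. 2.5% and 97.5% for a 95% interval.
    alpha = (1 - level) / 2
    lower, upper = np.percentile(boot_means, [100 * alpha, 100 * (1 - alpha)])
    return lower, upper, boot_means

# Made-up heights standing in for the non-coffee drinkers in the sample.
heights = np.random.default_rng(1).normal(66, 3, size=80)
lower, upper, boot_means = bootstrap_mean_ci(heights)
print(lower, upper)
```

Drawing all index arrays at once trades memory for speed; for large samples or many resamples, the explicit loop may be the more practical choice.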
  5. Did your interval capture the actual average height of non-coffee drinkers in the population? Look at the average in the population and the two bounds provided by your 95% confidence interval, and then answer the final quiz question below.
coffee_full[coffee_full['drinks_coffee'] == False]['height'].mean() 
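Whether the interval captured the population mean is a simple bounds check. A small sketch, assuming a hypothetical helper `interval_captures` and using illustrative numbers only (they are not results from `coffee_dataset.csv`):

```python
def interval_captures(lower, upper, value):
    """Return True when value lies inside the closed interval [lower, upper]."""
    return lower <= value <= upper

# Illustrative numbers only; substitute the two percentile bounds
# and the population mean computed above.
print(interval_captures(65.9, 67.1, 66.4))  # True: 66.4 lies between the bounds
print(interval_captures(65.9, 67.1, 67.5))  # False: 67.5 is above the upper bound
```

Because `coffee_red` is drawn without a fixed `random_state`, the bounds change from run to run, so a 95% interval can be expected to miss the population mean in some runs.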



