Chi-Square Independence Test: a hypothesis test for deciding whether two categorical factors are related or independent
Step1: Chi-Square Independence Test - What Is It?
The chi-square independence test is a procedure for testing if two categorical variables are related in some population.
Example: a scientist wants to know if education level and marital status are related for all people in some country. The first rows of the sample data look like this:
Name | Marital Status | Education Level |
---|---|---|
Cameron | Never Married | PhD or higher |
Benjamin | Married | Middle school or lower |
Camden | Divorced | Bachelors |
Brody | Widowed | Masters |
Connor | Married | PhD or higher |
Step2: Chi-Square Test - Observed Frequencies
A good first step for these data is inspecting the contingency table of marital status by education. Such a table -shown below- displays the frequency distribution of marital status for each education category separately. So let's take a look at it.
The table crosses 4 marital status categories with 5 education levels.
Marital Status | Middle School or Lower | High School | Bachelor's | Masters | PhD or Higher | Total |
---|---|---|---|---|---|---|
Never Married | 18 | 36 | 21 | 9 | 6 | 90 |
Married | 12 | 36 | 45 | 36 | 21 | 150 |
Divorced | 6 | 9 | 9 | 3 | 3 | 30 |
Widowed | 3 | 9 | 9 | 6 | 3 | 30 |
Total | 39 | 90 | 84 | 54 | 33 | 300 |
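A contingency table is just a tally of how often each combination of categories occurs. As a minimal sketch, the five example respondents from Step1 can be counted with the standard library; the helper name `crosstab` is made up for illustration, and the full table above comes from all 300 respondents, not from these five.

```python
from collections import Counter

# The five example respondents listed in Step1 (name, marital status, education).
records = [
    ("Cameron", "Never Married", "PhD or higher"),
    ("Benjamin", "Married", "Middle school or lower"),
    ("Camden", "Divorced", "Bachelors"),
    ("Brody", "Widowed", "Masters"),
    ("Connor", "Married", "PhD or higher"),
]


def crosstab(rows):
    """Count how often each (marital status, education) pair occurs."""
    return Counter((marit, edu) for _name, marit, edu in rows)


counts = crosstab(records)
for (marit, edu), n in sorted(counts.items()):
    print(f"{marit:15s} {edu:25s} {n}")
```

Each distinct pair becomes one cell of the contingency table; pairs that never occur simply have a count of zero.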
The same table as column percentages (each education column adds up to 100%, apart from rounding; the last row shows the number of respondents per column):
Marital Status | Middle School or Lower | High School | Bachelor's | Masters | PhD or Higher | Total |
---|---|---|---|---|---|---|
Never Married | 46% | 40% | 25% | 17% | 18% | 30% |
Married | 31% | 40% | 54% | 67% | 64% | 50% |
Divorced | 15% | 10% | 11% | 6% | 9% | 10% |
Widowed | 8% | 10% | 11% | 11% | 9% | 10% |
Total (N) | 39 | 90 | 84 | 54 | 33 | 300 |
In our sample, more highly educated respondents are married more often than less educated respondents.
[Figure: Chi-Square Test - Stacked Bar Chart]
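The column percentages can be recomputed from the observed counts. The sketch below assumes the counts from the observed frequency table in Step2; the function name `column_percentages` is illustrative only.

```python
# Observed counts from Step2. Columns: Middle School or Lower, High School,
# Bachelor's, Masters, PhD or Higher.
observed = {
    "Never Married": [18, 36, 21, 9, 6],
    "Married": [12, 36, 45, 36, 21],
    "Divorced": [6, 9, 9, 3, 3],
    "Widowed": [3, 9, 9, 6, 3],
}


def column_percentages(table):
    """Express every cell as a rounded percentage of its column total."""
    col_totals = [sum(col) for col in zip(*table.values())]
    return {
        status: [round(100 * n / total) for n, total in zip(row, col_totals)]
        for status, row in table.items()
    }


percentages = column_percentages(observed)
for status, row in percentages.items():
    print(status, row)
```

Comparing the "Married" row across columns is what suggests the association: the share rises from 31% to 64-67% as education increases.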
Step3: Chi-Square Test - Null Hypothesis
The null hypothesis for a chi-square independence test is that two categorical variables are independent in some population.
Independence means that the relative frequencies of one variable are identical over all levels of some other variable.
Step4: Expected Frequencies
Expected frequencies are the frequencies we expect in our sample if the null hypothesis holds.
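These expected frequencies are calculated as
$$e_{ij} = \frac{o_i \cdot o_j}{N}$$
where
$e_{ij}$ is an expected frequency;
$o_i$ is a marginal column frequency;
$o_j$ is a marginal row frequency;
$N$ is the total sample size.
For example, for never married respondents with middle school or lower, $e = \frac{39 \cdot 90}{300} = 11.7$.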
We now have 2 contingency tables: one with the observed frequencies we found in our sample, and one with the expected frequencies we should have found in our sample if the variables are really independent.
The expected frequencies for our data are:
Marital Status | Middle School or Lower | High School | Bachelor's | Masters | PhD or Higher | Total |
---|---|---|---|---|---|---|
Never Married | 11.7 | 27 | 25.2 | 16.2 | 9.9 | 90 |
Married | 19.5 | 45 | 42 | 27 | 16.5 | 150 |
Divorced | 3.9 | 9 | 8.4 | 5.4 | 3.3 | 30 |
Widowed | 3.9 | 9 | 8.4 | 5.4 | 3.3 | 30 |
Total | 39 | 90 | 84 | 54 | 33 | 300 |
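The expected frequencies follow directly from the row totals, column totals and sample size. A minimal sketch, assuming the observed counts from Step2 (the function name `expected_frequencies` is illustrative):

```python
# Observed counts from Step2, one list per marital status.
observed = [
    [18, 36, 21, 9, 6],    # Never Married
    [12, 36, 45, 36, 21],  # Married
    [6, 9, 9, 3, 3],       # Divorced
    [3, 9, 9, 6, 3],       # Widowed
]


def expected_frequencies(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
for row in expected:
    print([round(e, 1) for e in row])
```

The expected table keeps the same row and column totals as the observed table; only the distribution inside the table changes.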
Step5: Residuals
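A residual is the difference between an observed and an expected frequency: $r_{ij} = o_{ij} - e_{ij}$. For our example, this results in (5 * 4 =) 20 residuals. Larger (absolute) residuals indicate a larger difference between our data and the null hypothesis. The raw residuals always add up to zero, so we square each one, divide it by its expected frequency and add up the results into a single number: the χ² (pronounced “chi-square”) test statistic.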
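As a sketch of this step, the residuals can be computed as observed minus expected for every cell. The counts are those from Step2; the variable names are illustrative. The example also shows that the raw residuals cancel out, which is why they are not summed directly.

```python
# Observed counts from Step2, one list per marital status.
observed = [
    [18, 36, 21, 9, 6],    # Never Married
    [12, 36, 45, 36, 21],  # Married
    [6, 9, 9, 3, 3],       # Divorced
    [3, 9, 9, 6, 3],       # Widowed
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Residual = observed frequency minus expected frequency under independence.
residuals = [
    [o - r * c / n for o, c in zip(row, col_totals)]
    for row, r in zip(observed, row_totals)
]

n_residuals = sum(len(row) for row in residuals)
residual_sum = sum(sum(row) for row in residuals)
print(n_residuals, round(residual_sum, 9))
```

Positive and negative residuals cancel, so their plain sum carries no information about how far the data are from independence.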
Step6: Test Statistic
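The chi-square test statistic is calculated as:
$$\chi^2 = \sum \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
so for our data:
$$\chi^2 = \frac{(18 - 11.7)^2}{11.7} + \frac{(36 - 27)^2}{27} + \dots + \frac{(3 - 3.3)^2}{3.3} = 23.57$$
The contribution of each cell to this sum is: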
Marital Status | Middle School or Lower | High School | Bachelor's | Masters | PhD or Higher | Total |
---|---|---|---|---|---|---|
Never Married | 3.392307692 | 3 | 0.7 | 3.2 | 1.53636364 | |
Married | 2.884615385 | 1.8 | 0.21428571 | 3 | 1.22727273 | |
Divorced | 1.130769231 | 0 | 0.04285714 | 1.06666667 | 0.02727273 | |
Widowed | 0.207692308 | 0 | 0.04285714 | 0.06666667 | 0.02727273 | |
Total | | | | | | 23.56689977 |
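The test statistic can be recomputed from the observed counts alone. A minimal sketch, assuming the Step2 table; `chi_square_statistic` is an illustrative name, not a library function.

```python
# Observed counts from Step2, one list per marital status.
observed = [
    [18, 36, 21, 9, 6],    # Never Married
    [12, 36, 45, 36, 21],  # Married
    [6, 9, 9, 3, 3],       # Divorced
    [3, 9, 9, 6, 3],       # Widowed
]


def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for row, r in zip(table, row_totals):
        for o, c in zip(row, col_totals):
            e = r * c / n  # expected frequency under independence
            total += (o - e) ** 2 / e
    return total


chi2 = chi_square_statistic(observed)
print(round(chi2, 2))
```

Cells whose observed count equals the expected count (for example Divorced by High School, 9 versus 9) contribute exactly zero.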
Step7: Chi-Square Test Assumptions
The assumptions for a chi-square independence test are:
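1. Independent observations.
2. For a 2 by 2 table, all expected frequencies > 5. For a larger table, all expected frequencies > 1 and no more than 20% of all cells may have expected frequencies < 5.
In our example, 4 of the 20 cells (20%) have expected frequencies below 5 and the smallest is 3.3, so the second assumption is only just met.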
If these assumptions hold, our χ² test statistic approximately follows a χ² distribution. It's this distribution that tells us the probability of finding χ² ≥ 23.57 if the null hypothesis is true.
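The rule about expected frequencies can be checked mechanically. The sketch below encodes the thresholds exactly as listed in this step; `expected_counts_ok` is a hypothetical helper, and independence of observations cannot be verified from the table itself.

```python
# Observed counts from Step2, one list per marital status.
observed = [
    [18, 36, 21, 9, 6],    # Never Married
    [12, 36, 45, 36, 21],  # Married
    [6, 9, 9, 3, 3],       # Divorced
    [3, 9, 9, 6, 3],       # Widowed
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]


def expected_counts_ok(table):
    """Apply the expected-frequency rule from this step to a table."""
    cells = [e for row in table for e in row]
    is_2x2 = len(table) == 2 and all(len(row) == 2 for row in table)
    if is_2x2:
        return all(e > 5 for e in cells)
    share_small = sum(e < 5 for e in cells) / len(cells)
    return min(cells) > 1 and share_small <= 0.20


n_small = sum(e < 5 for row in expected for e in row)
print(n_small, expected_counts_ok(expected))
```

For this table the check passes at the boundary: the four small cells are the Middle School and PhD columns of the Divorced and Widowed rows.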
Step8: Chi-Square Test - Degrees of Freedom
We'll get the p-value we're after from the chi-square distribution if we give it 2 numbers:
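1. the χ² value (23.57)
2. the degrees of freedom (df)
The degrees of freedom are calculated as
$$df = (i - 1) \cdot (j - 1)$$
where
i is the number of rows in our contingency table;
j is the number of columns;
so in our example $df = (4 - 1) \cdot (5 - 1) = 12$.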
The degrees of freedom is basically a number that determines the exact shape of our distribution. The figure below illustrates this point.
[Figure: Chi-Square Distributions with Different DF]
And with df = 12, the probability of finding χ² ≥ 23.57 ≈ 0.023. This is our 1-tailed significance. It basically means there's a 0.023 (or 2.3%) chance of finding an association at least this strong in our sample if it is zero in our population.
[Figure: Chi-Square Distribution with 1-Tailed P-Value]
**Since this is a small chance, we no longer believe our null hypothesis of our variables being independent in our population.**
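The p-value can be reproduced without a statistics package, because the upper tail of a chi-square distribution has a closed form when df is even: P(X ≥ x) = exp(-x/2) · Σ (x/2)^k / k! for k = 0 … df/2 - 1. The sketch below relies on that identity, so it only covers even df (which includes our df = 12); `chi2_sf_even_df` is an illustrative name, not a library function.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers even, positive df")
    half = x / 2
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


rows, cols = 4, 5                 # marital status by education level
df = (rows - 1) * (cols - 1)
chi2 = 23.56689977                # test statistic from Step6
p_value = chi2_sf_even_df(chi2, df)
print(df, round(p_value, 3))
```

For odd df, or simply for convenience, a library routine such as SciPy's `scipy.stats.chi2.sf` gives the same tail probability.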
Now, keep in mind that our p-value of 0.023 only tells us that the association between our variables is probably not zero. It doesn't say anything about the strength of this association: the effect size.
Cramér’s V - Formula
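Cramér’s V is a number between 0 and 1 that indicates how strongly two categorical variables are associated.
If we want to know whether 2 categorical variables are associated, our first option is the chi-square independence test. A p-value close to zero means that our variables are very unlikely to be completely unassociated in some population. However, this does not mean that the variables are strongly associated: a weak association in a large sample may also result in p = 0.000.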
A measure that does indicate the strength of the association is Cramér’s V, defined as:
$$\phi_c = \sqrt{\frac{\chi^2}{N(k - 1)}}$$
where
$\phi_c$ denotes Cramér’s V ($\phi$ refers to the "phi coefficient", a special case of Cramér’s V);
$\chi^2$ is the Pearson chi-square statistic from the test;
$N$ is the sample size involved in the test;
$k$ is the lesser number of categories of either variable.
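In the example above:
$$\phi_c = \sqrt{\frac{23.57}{300 \cdot (4 - 1)}} \approx 0.16$$
In summary: the variables are associated, but the association is fairly weak.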
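As a minimal sketch of the Cramér’s V calculation for this example, assuming the test statistic from Step6, N = 300 and a 4 by 5 table (the function name `cramers_v` is illustrative):

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: sqrt(chi2 / (N * (k - 1))), k = smaller number of categories."""
    k = min(n_rows, n_cols)
    return math.sqrt(chi2 / (n * (k - 1)))


# 4 marital status categories, 5 education levels, 300 respondents.
v = cramers_v(23.56689977, 300, 4, 5)
print(round(v, 2))
```

The result is roughly 0.16, which is much closer to 0 than to 1 and matches the reading of a statistically significant but weak association.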
https://zhuanlan.zhihu.com/p/158156773
https://www.spss-tutorials.com/chi-square-independence-test/