Chi-Square & Cramér’s V

Author: 7f0a92cda77c | Published: 2021-07-05 15:00

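The chi-square independence test is a hypothesis test for deciding whether two categorical factors are associated or independent of each other.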

Step1: Chi-Square Independence Test - What Is It?

The chi-square independence test is a procedure for testing if two categorical variables are related in some population.

Example: a scientist wants to know if education level and marital status are related for all people in some country. The data look like this:

Name	Marit	Edu
Cameron	Never Married	PhD or higher
Benjamin	Married	Middle school or lower
Camden	Divorced	Bachelors
Brody	Widowed	Masters
Connor	Married	PhD or higher

Step2: Chi-Square Test - Observed Frequencies

A good first step for these data is inspecting the contingency table of marital status by education. Such a table, shown below, displays the frequency distribution of marital status for each education category separately. So let's take a look at it.

The table has 4 marital status categories and 5 education levels.

Marital Status	Middle School or Lower	High School	Bachelor's	Masters	PhD or Higher	Total
Never Married	18	36	21	9	6	90
Married	12	36	45	36	21	150
Divorced	6	9	9	3	3	30
Widowed	3	9	9	6	3	30
Total	39	90	84	54	33	300

Marital Status	Middle School or Lower	High School	Bachelor's	Masters	PhD or Higher	Total
Never Married	46%	40%	25%	17%	18%	30%
Married	31%	40%	54%	67%	64%	50%
Divorced	15%	10%	11%	6%	9%	10%
Widowed	8%	10%	11%	11%	9%	10%
Total (N)	39	90	84	54	33	300
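The column percentages can be reproduced from the observed counts. The following is a minimal sketch in plain Python; the variable names (`observed`, `col_totals`, `col_percent`) are illustrative choices, not part of the original tutorial.

```python
# Observed counts from the contingency table above
# (rows: marital status, columns: education level, lowest to highest).
observed = {
    "Never Married": [18, 36, 21, 9, 6],
    "Married": [12, 36, 45, 36, 21],
    "Divorced": [6, 9, 9, 3, 3],
    "Widowed": [3, 9, 9, 6, 3],
}

# Column totals: number of respondents per education level.
col_totals = [sum(col) for col in zip(*observed.values())]

# Share of each marital status within each education level, in whole percent.
col_percent = {
    status: [round(100 * count / total) for count, total in zip(counts, col_totals)]
    for status, counts in observed.items()
}

for status, percents in col_percent.items():
    print(status, percents)
```

Dividing by the column totals (rather than the row totals) is what makes the education groups comparable despite their different sizes.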

More highly educated respondents marry more often than less educated respondents.

Figure: Chi-Square Test - Stacked Bar Chart

Step3: Chi-Square Test - Null Hypothesis

The null hypothesis for a chi-square independence test is that two categorical variables are independent in some population.

Independence means that the relative frequencies of one variable are identical over all levels of some other variable.

Step4: Expected Frequencies

Expected frequencies are the frequencies we expect in our sample if the null hypothesis holds.

These expected frequencies are calculated as
$e_{ij} = \frac{o_i \cdot o_j}{N}$

where

$e_{ij}$ is an expected frequency;

$o_i$ is a marginal column frequency;

$o_j$ is a marginal row frequency;

$N$ is the total sample size.

So we now have two tables:

a contingency table with the observed frequencies we found in our sample;
a contingency table with the expected frequencies we should have found in our sample if the variables are really independent.

The computed expected frequencies:

Marital Status	Middle School or Lower	High School	Bachelor's	Masters	PhD or Higher	Total
Never Married	11.7	27	25.2	16.2	9.9	90
Married	19.5	45	42	27	16.5	150
Divorced	3.9	9	8.4	5.4	3.3	30
Widowed	3.9	9	8.4	5.4	3.3	30
Total	39	90	84	54	33	300
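As a rough illustration, the expected frequencies in this table can be recomputed from the marginal totals of the observed table. This sketch uses plain Python lists; the names `observed`, `row_totals`, `col_totals` and `expected` are assumptions made for the example.

```python
# Observed counts (rows: marital status, columns: education level).
observed = [
    [18, 36, 21, 9, 6],    # Never Married
    [12, 36, 45, 36, 21],  # Married
    [6, 9, 9, 3, 3],       # Divorced
    [3, 9, 9, 6, 3],       # Widowed
]

row_totals = [sum(row) for row in observed]        # marginal row frequencies
col_totals = [sum(col) for col in zip(*observed)]  # marginal column frequencies
n = sum(row_totals)                                # total sample size

# Expected frequency of a cell = (its row total * its column total) / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 1) for e in row])
```

The Divorced and Widowed rows come out identical because both have the same row total (30).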

Step5: Residuals

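A residual is the difference between an observed frequency and the corresponding expected frequency:

$r_{ij} = o_{ij} - e_{ij}$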

For our example, this results in (5 * 4 =) 20 residuals. Larger (absolute) residuals indicate a larger difference between our data and the null hypothesis. The raw residuals always sum to zero, so we cannot simply add them up. Instead, we square each residual, divide it by its expected frequency and add up the results, which gives a single number: the χ2 (pronounced “chi-square”) test statistic.
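The residuals for the example can be sketched as follows. The code also checks that the raw residuals sum to zero (up to floating-point rounding), which is why the test statistic is built from squared residuals. All variable names are illustrative.

```python
observed = [
    [18, 36, 21, 9, 6],    # Never Married
    [12, 36, 45, 36, 21],  # Married
    [6, 9, 9, 3, 3],       # Divorced
    [3, 9, 9, 6, 3],       # Widowed
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Residual of a cell = observed frequency - expected frequency.
residuals = [
    [o - e for o, e in zip(o_row, e_row)]
    for o_row, e_row in zip(observed, expected)
]

n_residuals = sum(len(row) for row in residuals)
largest = max(abs(r) for row in residuals for r in row)
residual_sum = sum(r for row in residuals for r in row)

print(n_residuals)             # number of cells
print(round(largest, 1))       # largest absolute residual
print(abs(residual_sum) < 1e-9)  # raw residuals cancel out
```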

Step6: Test Statistic

The chi-square test statistic is calculated as:
$\chi^2=\sum \frac{(o_{ij}-e_{ij})^2}{e_{ij}}$

so for our data:
$\chi^2=\frac{(18-11.7)^2}{11.7}+\frac{(36-27)^2}{27}+\dots+\frac{(3-3.3)^2}{3.3}=23.57$

Marital Status	Middle School or Lower	High School	Bachelor's	Masters	PhD or Higher	Total
Never Married	3.392307692	3	0.7	3.2	1.53636364
Married	2.884615385	1.8	0.21428571	3	1.22727273
Divorced	1.130769231	0	0.04285714	1.06666667	0.02727273
Widowed	0.207692308	0	0.04285714	0.06666667	0.02727273
Total						23.56689977
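The per-cell contributions in this table and their total can be checked with a short script. This is a sketch in plain Python under the same data as above; `contributions` and `chi_square` are names chosen for the example.

```python
observed = [
    [18, 36, 21, 9, 6],    # Never Married
    [12, 36, 45, 36, 21],  # Married
    [6, 9, 9, 3, 3],       # Divorced
    [3, 9, 9, 6, 3],       # Widowed
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Contribution of each cell: (observed - expected)^2 / expected.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(o_row, e_row)]
    for o_row, e_row in zip(observed, expected)
]

# The test statistic is the sum of all cell contributions.
chi_square = sum(sum(row) for row in contributions)
print(round(chi_square, 2))
```

Looking at the individual contributions shows which cells drive the result: here the largest ones sit in the Never Married and Married rows.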

Step7: Chi-Square Test Assumptions

The assumptions for a chi-square independence test are:

1. Independent observations.

2. For a 2 by 2 table, all expected frequencies > 5. For a larger table, all expected frequencies > 1 and no more than 20% of all cells may have expected frequencies < 5.

If these assumptions hold, our χ2 test statistic follows a χ2 distribution. It's this distribution that tells us the probability of finding χ2 ≥ 23.57 if the null hypothesis is true.
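The expected-frequency rule for larger tables can be checked mechanically. The sketch below applies it to the example table; it cannot check the independence of observations, which depends on how the data were collected. The variable names are illustrative.

```python
observed = [
    [18, 36, 21, 9, 6],    # Never Married
    [12, 36, 45, 36, 21],  # Married
    [6, 9, 9, 3, 3],       # Divorced
    [3, 9, 9, 6, 3],       # Widowed
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected_cells = [r * c / n for r in row_totals for c in col_totals]

n_cells = len(expected_cells)
min_expected = min(expected_cells)
small_cells = sum(1 for e in expected_cells if e < 5)

# Rule for tables larger than 2 by 2: every expected frequency > 1
# and at most 20% of the cells with an expected frequency < 5.
# Integer arithmetic (5 * small <= total) avoids float comparison issues.
assumptions_ok = min_expected > 1 and 5 * small_cells <= n_cells

print(n_cells, small_cells, round(min_expected, 1), assumptions_ok)
```

In this example exactly 4 of the 20 cells fall below 5, so the table sits right at the 20% limit.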

Step8: Chi-Square Test - Degrees of Freedom

We'll get the p-value we're after from the chi-square distribution if we give it 2 numbers:

1. the $\chi^2$ value (23.57);

2. the degrees of freedom (df).

$df = (i-1) \cdot (j-1) = (4-1) \cdot (5-1) = 12$

where $i$ is the number of rows in our contingency table and $j$ is the number of columns.

The degrees of freedom is basically a number that determines the exact shape of our distribution. The figure below illustrates this point.

Figure: Chi-Square Distributions with Different DF

And with df = 12, the probability of finding χ2 ≥ 23.57 is approximately 0.023. This is our 1-tailed significance. It basically means there's a 0.023 (or 2.3%) chance of finding an association at least this strong in our sample if it is zero in our population.

Figure: Chi-Square Distribution with 1-Tailed P-Value
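The p-value can be reproduced without a statistics library, because for an even number of degrees of freedom the upper tail of the chi-square distribution has a closed form: $P(X \ge x) = e^{-x/2} \sum_{k=0}^{df/2-1} \frac{(x/2)^k}{k!}$. The sketch below relies on that identity; the helper name `chi2_sf_even_df` is hypothetical, and for odd df a statistics package would be needed instead.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) of a chi-square variable.

    Only valid for positive even df, where the tail reduces to a finite sum.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms


rows, cols = 4, 5
df = (rows - 1) * (cols - 1)          # degrees of freedom
p_value = chi2_sf_even_df(23.57, df)  # 1-tailed significance

print(df, round(p_value, 3))
```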

**Since this is a small chance, we no longer believe our null hypothesis of our variables being independent in our population.**

Now, keep in mind that our p-value of 0.023 only tells us that the association between our variables is probably not zero. It doesn't say anything about the strength of this association: the effect size.

Cramér’s V - Formula

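Cramér’s V is a number between 0 and 1 that indicates how strongly two categorical variables are associated.

If we want to know whether 2 categorical variables are associated, our first option is the chi-square independence test. A p-value close to zero means that our variables are very unlikely to be completely unassociated in some population. However, this does not mean that the variables are strongly associated: a weak association in a large sample may also result in p = 0.000.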

A measure that does indicate the strength of the association is Cramér’s V, defined as:

$\phi_c = \sqrt{\frac{\chi^2}{N(\kappa-1)}}$

$\phi_c$ denotes Cramér’s V; $\phi$ refers to the "phi coefficient", a special case of Cramér’s V for 2 by 2 tables;

$\chi^2$ is the Pearson chi-square statistic from the test;

$N$ is the sample size involved in the test;

$\kappa$ is the smaller of the numbers of categories of the two variables.

In the example above, $\kappa = \min(4, 5) = 4$, so:

$\phi_c = \sqrt{\frac{\chi^2}{N(\kappa-1)}} = \sqrt{\frac{23.57}{300 \cdot (4-1)}} = 0.162$

In summary: the variables are associated, but the association is fairly weak.
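As a quick check of the arithmetic, Cramér’s V for the example can be computed in a few lines. The function name `cramers_v` and its parameters are assumptions made for this sketch.

```python
import math


def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramér's V from a chi-square statistic, sample size and table shape."""
    k = min(n_rows, n_cols)  # smaller number of categories
    return math.sqrt(chi_square / (n * (k - 1)))


# Values from the marital status by education example.
v = cramers_v(chi_square=23.57, n=300, n_rows=4, n_cols=5)
print(round(v, 3))
```

Because N appears in the denominator, the same χ2 value from a larger sample yields a smaller V, which is how the measure separates effect size from statistical significance.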

https://zhuanlan.zhihu.com/p/158156773

https://www.spss-tutorials.com/chi-square-independence-test/

Article link: https://www.haomeiwen.com/subject/sgfiultx.html