Chi-Square Test of Independence: AI & ML Applications

Understand the Chi-Square test of independence for analyzing relationships between categorical variables in AI & Machine Learning. Learn when to apply this non-parametric test.

22.4.3 Chi-Square Test of Independence

The Chi-Square Test of Independence is a non-parametric statistical test used to determine if there is a statistically significant association between two categorical variables. It assesses whether the observed distribution of data in a contingency table differs from the distribution that would be expected if the variables were independent.

When to Use the Chi-Square Test of Independence

This test is appropriate when:

  • You have two categorical variables. For example, Gender (Male, Female) and Preference (Option A, Option B).
  • You want to test whether the two variables are independent of each other or associated.
  • The data is typically presented in a contingency table (also known as a cross-tabulation or two-way table).
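
As an illustration of how a contingency table arises from raw data, the following minimal sketch cross-tabulates paired categorical observations using only the Python standard library. The function name `contingency_table` and the sample responses are hypothetical, chosen here to mirror the Gender and Preference example.

```python
from collections import Counter

def contingency_table(pairs):
    """Cross-tabulate (row_category, column_category) pairs into observed counts."""
    counts = Counter(pairs)
    # Sort the labels so the row and column order is deterministic.
    row_labels = sorted({r for r, _ in pairs})
    col_labels = sorted({c for _, c in pairs})
    # A Counter returns 0 for combinations that never occur.
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table

# Hypothetical survey responses: (gender, preference).
responses = [
    ("Male", "Option A"), ("Male", "Option B"), ("Female", "Option A"),
    ("Female", "Option A"), ("Male", "Option A"), ("Female", "Option B"),
]
rows, cols, observed = contingency_table(responses)
print(rows)      # row labels
print(cols)      # column labels
print(observed)  # one list of counts per row
```

Each inner list is one row of the table, so the counts can be passed directly to a routine that computes row and column totals.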

How the Chi-Square Test of Independence Works

The core idea of the test is to compare the observed frequencies in each cell of the contingency table with the expected frequencies that would occur if the null hypothesis of independence were true.

  1. State the Hypotheses:

    • Null Hypothesis ($H_0$): The two categorical variables are independent.
    • Alternative Hypothesis ($H_1$): The two categorical variables are dependent (associated).
  2. Calculate Expected Frequencies: For each cell in the contingency table, the expected frequency ($E$) is calculated assuming independence:

    $E = \frac{(\text{Row Total}) \times (\text{Column Total})}{\text{Grand Total}}$

  3. Calculate the Chi-Square Statistic ($\chi^2$): The test statistic is calculated by summing the squared differences between observed ($O$) and expected ($E$) frequencies, divided by the expected frequencies, across all cells:

    $\chi^2 = \sum \frac{(O - E)^2}{E}$

  4. Determine Degrees of Freedom (df): The degrees of freedom are calculated based on the dimensions of the contingency table:

    $df = (r - 1) \times (c - 1)$

    Where:

    • $r$ = number of rows in the contingency table
    • $c$ = number of columns in the contingency table
  5. Determine the P-value: Using the calculated $\chi^2$ statistic and the degrees of freedom, a p-value is obtained from the Chi-square distribution. This p-value represents the probability of observing a test statistic as extreme as, or more extreme than, the one calculated, assuming the null hypothesis is true.
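
The five steps above can be sketched in plain Python. This is an illustrative implementation, not a reference one: the function names `chi2_sf` and `chi_square_independence` and the 2x3 table are assumptions made for the example, and the code assumes every row and column total is positive. The upper-tail probability uses the closed-form finite series that holds for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction series.
    tail = math.erfc(math.sqrt(x / 2))
    term = math.sqrt(2 * x / math.pi) * math.exp(-x / 2)  # first series term
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return tail + total

def chi_square_independence(observed):
    """Return (chi2, df, p_value, expected) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Step 2: expected count = row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    # Step 3: sum (O - E)^2 / E over every cell.
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # Step 4: degrees of freedom from the table dimensions.
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    # Step 5: upper-tail p-value from the chi-square distribution.
    return chi2, df, chi2_sf(chi2, df), expected

# Hypothetical 2x3 table of observed counts.
chi2, df, p_value, expected = chi_square_independence([[10, 20, 30], [20, 20, 20]])
print(chi2, df, p_value)
```

In practice a statistics library would normally supply the distribution function; the hand-written series is used here only to keep the sketch self-contained.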

Interpretation of Results

The p-value is compared to a pre-determined significance level $\alpha$ (commonly set at 0.05).

  • If p-value < $\alpha$: Reject the null hypothesis ($H_0$). There is statistically significant evidence to conclude that the two categorical variables are associated (dependent).
  • If p-value $\ge \alpha$: Fail to reject the null hypothesis ($H_0$). The data do not provide sufficient evidence of an association between the two categorical variables. This does not prove that the variables are independent; it only means the test did not detect a dependence.
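
The decision rule can be written as a small helper. This is a minimal sketch; the function name `decide` and its return strings are hypothetical.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when p_value is strictly below alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value < alpha:
        return "reject H0: evidence of an association"
    return "fail to reject H0: no significant evidence of an association"

print(decide(0.003))
print(decide(0.05))  # a p-value equal to alpha does not lead to rejection
```

The significance level should be fixed before looking at the data, which is why it appears as a parameter rather than being chosen from the result.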

Example: Smoking Status and Lung Disease

Scenario: A researcher wants to investigate if there is an association between smoking status and the presence of lung disease.

  • Variable 1: Smoking Status (Categorical: Smoker, Non-smoker)
  • Variable 2: Lung Disease (Categorical: Yes, No)

A contingency table is constructed with the observed counts:

             | Lung Disease: Yes | Lung Disease: No | Row Total
Smoker       | 60                | 20               | 80
Non-smoker   | 15                | 45               | 60
Column Total | 75                | 65               | 140

Steps:

  1. Hypotheses:

    • $H_0$: Smoking status and lung disease are independent.
    • $H_1$: Smoking status and lung disease are dependent.
  2. Calculate Expected Frequencies:

    • Expected (Smoker, Yes) = (80 * 75) / 140 = 42.86
    • Expected (Smoker, No) = (80 * 65) / 140 = 37.14
    • Expected (Non-smoker, Yes) = (60 * 75) / 140 = 32.14
    • Expected (Non-smoker, No) = (60 * 65) / 140 = 27.86
  3. Calculate $\chi^2$ Statistic:

    $\chi^2 = \frac{(60-42.86)^2}{42.86} + \frac{(20-37.14)^2}{37.14} + \frac{(15-32.14)^2}{32.14} + \frac{(45-27.86)^2}{27.86}$

    $\chi^2 \approx 6.86 + 7.91 + 9.14 + 10.55 \approx 34.46$

  4. Degrees of Freedom: $df = (2 - 1) \times (2 - 1) = 1 \times 1 = 1$

  5. P-value: Using a Chi-square distribution table or statistical software with $\chi^2 = 34.46$ and $df = 1$, the p-value is extremely small (less than 0.0001, far below 0.05).

Interpretation: Since the p-value is less than 0.05, we reject the null hypothesis. This suggests that there is a statistically significant association between smoking status and the presence of lung disease.
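
The arithmetic in this example can be checked with a short script. This is a sketch using only the standard library, with variable names chosen for illustration. Carried out without intermediate rounding, the statistic comes to about 34.46. For $df = 1$ the upper tail of the Chi-square distribution reduces to the complementary error function, which is what the script uses for the p-value.

```python
import math

observed = [[60, 20],   # Smoker:     Yes, No
            [15, 45]]   # Non-smoker: Yes, No

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell under independence.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all four cells.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# With df = 1, P(X >= chi2) equals erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print([[round(e, 2) for e in row] for row in expected])
print(round(chi2, 2), p_value)
```

Note that some library routines, such as SciPy's `chi2_contingency`, apply Yates' continuity correction to 2x2 tables by default, so they report a somewhat smaller statistic unless `correction=False` is passed.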

Key Considerations and Assumptions

  • Independence of Observations: Each observation should be independent of all other observations.
  • Categorical Data: Both variables must be categorical (nominal or ordinal).
  • Expected Cell Counts: A common rule of thumb is that every expected cell count should be 5 or greater. A widely used relaxation tolerates up to 20% of cells with an expected count below 5, provided no expected count falls below 1. When these conditions are not met, the Chi-square approximation may not be accurate, and an alternative such as Fisher's Exact Test should be considered, especially for 2x2 tables.
  • Sample Size: A sufficiently large sample size is generally needed for the Chi-square distribution to be a good approximation.
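
The expected-count guideline can be checked programmatically before running the test. The sketch below uses hypothetical function names and a made-up 2x2 table: it reports the smallest expected count and the share of cells below a threshold, then computes a two-sided Fisher's exact p-value for a 2x2 table by summing the hypergeometric probabilities of all tables no more likely than the observed one. That two-sided convention is one common choice, not the only one.

```python
from math import comb

def expected_count_check(observed, threshold=5):
    """Return (minimum expected count, fraction of cells with expected count below threshold)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [r * c / n for r in row_totals for c in col_totals]
    small = sum(e < threshold for e in expected)
    return min(expected), small / len(expected)

def fisher_exact_2x2(table):
    """Two-sided Fisher's exact p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k in the top-left cell with all margins fixed.
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum over every feasible table that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))

# Hypothetical small-sample table where every expected count is below 5.
small_table = [[3, 1], [1, 3]]
min_e, frac = expected_count_check(small_table)
print(min_e, frac)
print(fisher_exact_2x2(small_table))
```

When the check flags small expected counts, the exact p-value is generally the safer number to report for a 2x2 table.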