Chi-Square Distribution in Statistics & Machine Learning

Explore the Chi-Square distribution, a key concept in statistical hypothesis testing. Learn its applications in ML, variance testing, and categorical data analysis.

Chi-Square Distribution

The Chi-Square distribution is a continuous probability distribution that is fundamental to statistical hypothesis testing. It is particularly important for testing population variance and assessing the independence of categorical variables.

What is a Chi-Square Distribution?

A Chi-Square distribution arises from the sum of the squares of independent standard normal variables. Mathematically, if $Z_1, Z_2, \dots, Z_k$ are independent standard normal variables (each with a mean of 0 and a variance of 1), then the sum of their squares follows a Chi-Square distribution:

$$X = Z_1^2 + Z_2^2 + \dots + Z_k^2$$

The number of standard normal variables being squared and summed, $k$, is called the degrees of freedom, and the resulting distribution is written as:

$$X \sim \chi^2(k)$$

Where:

  • $Z_i$ are independent standard normal variables.
  • $k$ is the degrees of freedom.
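The definition can be checked empirically. The sketch below is only an illustration: the seed, the choice of $k = 4$, and the number of draws are arbitrary assumptions. It builds $X$ by squaring and summing standard normal draws, then compares the result with SciPy's Chi-Square distribution using a Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, used only for reproducibility
k = 4                           # degrees of freedom (illustrative choice)
n_draws = 50_000                # number of simulated values of X

# Draw k independent standard normals per row, square them, and sum each row.
z = rng.standard_normal(size=(n_draws, k))
x = (z ** 2).sum(axis=1)

# Compare the simulated values with the chi2(k) CDF.
ks_stat, p_value = stats.kstest(x, "chi2", args=(k,))

print(f"Simulated mean: {x.mean():.3f} (theory: {k})")
print(f"KS statistic against chi2({k}): {ks_stat:.4f}")
```

A small KS statistic indicates that the empirical distribution of the simulated sums stays close to the theoretical Chi-Square CDF.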

Key Characteristics

  • Degrees of Freedom (df): This is the single parameter that defines the shape of the Chi-Square distribution. It represents the number of independent variables squared and summed.
  • Shape: The distribution is asymmetric, skewed to the right, especially for small degrees of freedom. As the degrees of freedom increase, the distribution becomes more symmetric and approximates a normal distribution.
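One way to quantify the shape claim is through skewness, which for a Chi-Square distribution equals $\sqrt{8/k}$ and therefore shrinks toward 0 (the skewness of a normal distribution) as $k$ grows. The following sketch, with an arbitrary selection of degrees of freedom, reads the skewness from `scipy.stats.chi2.stats` and compares it with that closed form.

```python
import numpy as np
from scipy.stats import chi2

# Degrees of freedom to inspect (an arbitrary, illustrative selection).
dfs = [1, 5, 20, 100]

# moments="s" asks SciPy for the skewness only.
skews = {k: float(chi2.stats(k, moments="s")) for k in dfs}

for k, skew in skews.items():
    print(f"df={k:>3}: skewness={skew:.4f}, sqrt(8/k)={np.sqrt(8 / k):.4f}")
```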

Properties of the Chi-Square Distribution

The Chi-Square distribution has well-defined mathematical properties:

  • Mean: The mean of a Chi-Square distribution is equal to its degrees of freedom. $$ \text{Mean} = E[X] = k $$
  • Variance: The variance of a Chi-Square distribution is twice the degrees of freedom. $$ \text{Variance} = Var(X) = 2k $$
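Both results follow from the definition. For a standard normal variable $Z$, $E[Z^2] = Var(Z) = 1$ and $E[Z^4] = 3$, so $Var(Z^2) = E[Z^4] - (E[Z^2])^2 = 3 - 1 = 2$. Because the $Z_i$ are independent, the means and variances of the squared terms simply add:

$$ E[X] = \sum_{i=1}^k E[Z_i^2] = k, \qquad Var(X) = \sum_{i=1}^k Var(Z_i^2) = 2k $$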

Example: Mean and Variance Calculation

Let's demonstrate these properties using Python with NumPy:

import numpy as np

# Define degrees of freedom
df = 5

# Generate a large number of Chi-Square samples
num_samples = 10000
samples = np.random.chisquare(df, size=num_samples)

# Calculate the sample mean and the unbiased sample variance (ddof=1)
sample_mean = np.mean(samples)
sample_variance = np.var(samples, ddof=1)

print(f"Degrees of Freedom (df): {df}")
print(f"Theoretical Mean (df): {df}")
print(f"Sample Mean: {sample_mean:.4f}")
print(f"Theoretical Variance (2*df): {2 * df}")
print(f"Sample Variance: {sample_variance:.4f}")

Example Output (exact values vary between runs because the samples are random):

Degrees of Freedom (df): 5
Theoretical Mean (df): 5
Sample Mean: 4.9987
Theoretical Variance (2*df): 10
Sample Variance: 9.9855

The sample mean and variance land close to the theoretical values of $k$ and $2k$, and the gap shrinks as the number of samples grows.

Applications of the Chi-Square Distribution

The Chi-Square distribution is a versatile tool used in various statistical tests:

1. Goodness-of-Fit Test

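This test assesses how well observed frequencies (data) match the frequencies expected under a hypothesized distribution. If the hypothesis is true and the expected counts are not too small, the test statistic approximately follows a Chi-Square distribution with (number of categories - 1) degrees of freedom.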

Formula:

The Chi-Square test statistic is calculated as:

$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$

Where:

  • $O$ = Observed frequency
  • $E$ = Expected frequency

Example:

import numpy as np

# Observed frequencies from 100 rolls of a 6-sided die
observed = np.array([16, 18, 16, 14, 18, 18])
# Expected frequencies if the die is fair: the 100 rolls split evenly over 6 faces.
# The expected counts must sum to the same total as the observed counts.
expected = np.full(6, observed.sum() / 6)

# Calculate the Chi-Square statistic
chi_square_stat = np.sum((observed - expected)**2 / expected)

print(f"Observed frequencies: {observed}")
print(f"Expected frequencies: {np.round(expected, 2)}")
print(f"Chi-Square statistic: {chi_square_stat:.4f}")

Example Output:

Observed frequencies: [16 18 16 14 18 18]
Expected frequencies: [16.67 16.67 16.67 16.67 16.67 16.67]
Chi-Square statistic: 0.8000
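A statistic on its own does not say whether the fit is acceptable; it has to be converted into a p-value. The sketch below uses `scipy.stats.chisquare` for that step. The counts are hypothetical (60 rolls, chosen for illustration and not taken from the example above), and the 0.05 significance level is an assumed convention.

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical counts from 60 rolls of a die (illustrative data only).
observed = np.array([8, 12, 9, 11, 13, 7])
expected = np.full(6, observed.sum() / 6)  # fair die: 10 rolls per face

# chisquare returns the statistic and a p-value based on
# len(observed) - 1 degrees of freedom.
result = chisquare(f_obs=observed, f_exp=expected)
stat, p_value = result.statistic, result.pvalue

alpha = 0.05  # assumed significance level
print(f"Chi-Square statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print("Reject the fair-die hypothesis" if p_value < alpha
      else "No evidence against a fair die")
```

Note that the observed and expected counts share the same total, which the test requires.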

2. Test of Independence

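This test is used with contingency tables (cross-tabulations) to determine whether there is a statistically significant association between two categorical variables, for example between gender and preference for a certain product. For a table with $r$ rows and $c$ columns, the statistic is compared against a Chi-Square distribution with $(r-1)(c-1)$ degrees of freedom.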
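As a minimal sketch of this test, the code below runs `scipy.stats.chi2_contingency` on a made-up 2x3 table. The group and product labels and all counts are hypothetical. The function also returns the counts expected under independence, each computed as (row total × column total) / grand total.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table: rows are two customer groups,
# columns are counts preferring products A, B, and C.
table = np.array([
    [30, 20, 10],
    [20, 30, 40],
])

chi2_stat, p_value, dof, expected = chi2_contingency(table)

print(f"Chi-Square statistic: {chi2_stat:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {p_value:.4f}")
print("Expected counts under independence:")
print(np.round(expected, 2))
```

A small p-value suggests that the row and column variables are associated rather than independent.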

3. Variance Analysis

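This is used to compare the variance of a sample to a known or hypothesized population variance. Comparing the variances of two different samples relies on the related F-distribution, which is built from the ratio of two scaled Chi-Square variables.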

Generating Chi-Square Samples with NumPy

The numpy.random.chisquare function is used to generate random samples from a Chi-Square distribution.

Syntax:

numpy.random.chisquare(df, size=None)

  • df: The degrees of freedom for the distribution.
  • size: The shape of the output array. If None (default), a single value is returned.
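NumPy's documentation recommends the `Generator` interface, created with `np.random.default_rng`, for new code, while `np.random.chisquare` belongs to the legacy interface. The sketch below shows the equivalent `Generator.chisquare` call with the same `df` and `size` parameters. The seed value is an arbitrary choice made only so the results are reproducible.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed for reproducibility

single_value = rng.chisquare(df=5)          # size=None: a single value
vector = rng.chisquare(df=5, size=10)       # 1-D array of 10 samples
matrix = rng.chisquare(df=5, size=(3, 4))   # 3x4 array of samples

print("Single value:", single_value)
print("Vector shape:", vector.shape)
print("Matrix shape:", matrix.shape)
```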

Example:

import numpy as np

# Generate 10 Chi-Square samples with 5 degrees of freedom
df = 5
samples = np.random.chisquare(df, size=10)
print("Generated Chi-Square samples:", samples)

Example Output (shown rounded to two decimals; values vary between runs):

Generated Chi-Square samples: [3.94 3.61 8.09 1.63 2.26 3.74 10.88 1.98 3.81 10.83]

Visualizing Chi-Square PDF with SciPy

The Probability Density Function (PDF) of the Chi-Square distribution can be evaluated with scipy.stats.chi2.pdf() and plotted with Matplotlib. This helps show how the shape changes with the degrees of freedom.
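For reference, the density for $k$ degrees of freedom has the closed form:

$$ f(x; k) = \frac{x^{k/2 - 1} e^{-x/2}}{2^{k/2} \, \Gamma(k/2)}, \quad x > 0 $$

where $\Gamma$ is the gamma function. For $k = 2$ this reduces to $\frac{1}{2} e^{-x/2}$, an exponential density, which is why the $k = 2$ curve starts at 0.5 and decays monotonically instead of rising to a peak.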

Example:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

# Define the range of x values
x = np.linspace(0, 20, 500)

# Define different degrees of freedom to plot
dfs = [2, 4, 6, 8]

plt.figure(figsize=(10, 6))
for df in dfs:
    plt.plot(x, chi2.pdf(x, df), label=f"df={df}")

plt.title("Chi-Square Distribution PDF")
plt.xlabel("Value")
plt.ylabel("Probability Density")
plt.legend()
plt.grid(True)
plt.show()

This code generates a plot with one Chi-Square curve per value of the degrees of freedom, illustrating the shift from a strongly right-skewed shape to a more symmetric one as the degrees of freedom increase.

Real-World Simulation: Quality Control

Scenario: A manufacturing company wants to test if the variance in the diameter of a product exceeds an acceptable limit.

  • Observed Variance: The variance calculated from a recent sample of products.
  • Sample Size: The number of products in the sample.
  • Population Variance: The acceptable variance limit specified by quality standards.

The Chi-Square statistic for testing variance is calculated as:

$$ \chi^2 = \frac{(n-1) s^2}{\sigma^2} $$

Where:

  • $n$ = Sample size
  • $s^2$ = Sample variance (observed_variance)
  • $\sigma^2$ = Population variance (acceptable variance limit)

Example:

# Scenario parameters
sample_size = 20
observed_variance = 4.5
population_variance = 4.0

# Calculate the Chi-Square statistic for variance test
chi_square_stat = (sample_size - 1) * observed_variance / population_variance

print(f"Sample Size (n): {sample_size}")
print(f"Observed Variance (s^2): {observed_variance}")
print(f"Population Variance (sigma^2): {population_variance}")
print(f"Chi-Square statistic: {chi_square_stat:.4f}")

Example Output:

Sample Size (n): 20
Observed Variance (s^2): 4.5
Population Variance (sigma^2): 4.0
Chi-Square statistic: 21.3750

This statistic can then be compared against a critical value from the Chi-Square distribution (with $n-1$ degrees of freedom) to determine if the observed variance is significantly higher than the acceptable limit.
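The comparison can be sketched with `scipy.stats.chi2`, reusing the scenario numbers above. The 0.05 significance level is an assumption (a common convention, not part of the scenario), and the test is right-tailed because the question is whether the variance exceeds the limit.

```python
from scipy.stats import chi2

# Scenario parameters from the example above
sample_size = 20
observed_variance = 4.5
population_variance = 4.0
alpha = 0.05  # assumed significance level

dof = sample_size - 1
chi_square_stat = dof * observed_variance / population_variance

# Right-tailed test: reject if the statistic exceeds the (1 - alpha) quantile.
critical_value = chi2.ppf(1 - alpha, dof)
p_value = chi2.sf(chi_square_stat, dof)  # P(X >= statistic) under the null

reject = chi_square_stat > critical_value
print(f"Chi-Square statistic: {chi_square_stat:.4f}")
print(f"Critical value (alpha={alpha}, df={dof}): {critical_value:.4f}")
print(f"p-value: {p_value:.4f}")
print("Variance exceeds the limit" if reject
      else "No significant evidence that variance exceeds the limit")
```

With these numbers the statistic (21.375) falls below the critical value (about 30.14), so the test does not conclude that the variance exceeds the acceptable limit.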

Summary of Key Formulas

  • Chi-Square Random Variable (Sum of Squared Standard Normals): $$ X = \sum_{i=1}^k Z_i^2 $$
  • Mean of Chi-Square Distribution: $$ \text{Mean} = k $$
  • Variance of Chi-Square Distribution: $$ \text{Variance} = 2k $$
  • Chi-Square Goodness-of-Fit Test Statistic: $$ \chi^2 = \sum \frac{(O - E)^2}{E} $$ Where $O$ = Observed frequency, $E$ = Expected frequency.
  • Chi-Square Test Statistic for Variance: $$ \chi^2 = \frac{(n-1) s^2}{\sigma^2} $$ Where $n$ = sample size, $s^2$ = sample variance, $\sigma^2$ = population variance.