Chi-Square Distribution
The Chi-Square distribution is a continuous probability distribution that is fundamental to statistical hypothesis testing. It is particularly important for testing population variance and assessing the independence of categorical variables.
What is a Chi-Square Distribution?
A Chi-Square distribution arises from the sum of the squares of independent standard normal variables. Mathematically, if $Z_1, Z_2, \dots, Z_k$ are independent standard normal variables (each with a mean of 0 and a variance of 1), then the sum of their squares follows a Chi-Square distribution:
$$X = Z_1^2 + Z_2^2 + \dots + Z_k^2$$
The number of independent standard normal variables being squared and summed is known as the degrees of freedom ($k$).
$$X \sim \chi^2(k)$$
Where:
- $Z_i$ are independent standard normal variables.
- $k$ is the degrees of freedom.
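To see this definition in action, here is a minimal sketch (assuming NumPy is available) that builds Chi-Square samples by summing the squares of $k$ independent standard normal draws and compares them against NumPy's built-in generator:
import numpy as np
# Degrees of freedom: number of squared standard normal variables per sample
k = 3
num_samples = 100000
# Sum of squares of k independent standard normal variables
z = np.random.standard_normal(size=(num_samples, k))
manual_samples = np.sum(z**2, axis=1)
# Samples drawn directly from the Chi-Square distribution for comparison
direct_samples = np.random.chisquare(k, size=num_samples)
print(f"Mean of summed squares: {manual_samples.mean():.3f}")
print(f"Mean of np.random.chisquare: {direct_samples.mean():.3f}")
Both means should be close to $k = 3$, since the two approaches sample from the same distribution.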
Key Characteristics
- Degrees of Freedom (df): This is the single parameter that defines the shape of the Chi-Square distribution. It represents the number of independent standard normal variables squared and summed.
- Shape: The distribution is defined only for non-negative values and is right-skewed, especially for small degrees of freedom. As the degrees of freedom increase, the distribution becomes more symmetric and approaches a normal distribution.
Properties of the Chi-Square Distribution
The Chi-Square distribution has well-defined mathematical properties:
- Mean: The mean of a Chi-Square distribution is equal to its degrees of freedom. $$ \text{Mean} = E[X] = k $$
- Variance: The variance of a Chi-Square distribution is twice the degrees of freedom. $$ \text{Variance} = Var(X) = 2k $$
Example: Mean and Variance Calculation
Let's demonstrate these properties using Python with NumPy:
import numpy as np
# Define degrees of freedom
df = 5
# Generate a large number of Chi-Square samples
num_samples = 10000
samples = np.random.chisquare(df, size=num_samples)
# Calculate and print the sample mean and variance
sample_mean = np.mean(samples)
sample_variance = np.var(samples)
print(f"Degrees of Freedom (df): {df}")
print(f"Theoretical Mean (df): {df}")
print(f"Sample Mean: {sample_mean:.4f}")
print(f"Theoretical Variance (2*df): {2 * df}")
print(f"Sample Variance: {sample_variance:.4f}")
Example Output:
Degrees of Freedom (df): 5
Theoretical Mean (df): 5
Sample Mean: 4.9987
Theoretical Variance (2*df): 10
Sample Variance: 9.9855
As you can see, the sample mean and variance are close to their theoretical values, especially with a large number of samples.
Applications of the Chi-Square Distribution
The Chi-Square distribution is a versatile tool used in various statistical tests:
1. Goodness-of-Fit Test
This test assesses how well observed frequencies (the data) match the frequencies expected under a hypothesized theoretical distribution. It is used to determine whether a sample follows a known distribution.
Formula:
The Chi-Square test statistic is calculated as:
$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$
Where:
- $O$ = Observed frequency
- $E$ = Expected frequency
Example:
import numpy as np
# Observed frequencies of a 6-sided die roll
observed = np.array([16, 18, 16, 14, 18, 18])
# Expected frequencies if the die is fair: the total rolls split evenly across the 6 faces
expected = np.full(6, observed.sum() / 6)
# Calculate the Chi-Square statistic
chi_square_stat = np.sum((observed - expected)**2 / expected)
print(f"Observed frequencies: {observed}")
print(f"Expected frequencies: {expected}")
print(f"Chi-Square statistic: {chi_square_stat:.4f}")
Example Output:
Observed frequencies: [16 18 16 14 18 18]
Expected frequencies: [16.66666667 16.66666667 16.66666667 16.66666667 16.66666667 16.66666667]
Chi-Square statistic: 0.8000
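In practice, scipy.stats.chisquare computes the same statistic together with its p-value in a single call. The sketch below (assuming SciPy is installed) applies it to the same die-roll counts; by default it compares the observed counts against a uniform expectation:
import numpy as np
from scipy.stats import chisquare
# Observed frequencies of a 6-sided die roll
observed = np.array([16, 18, 16, 14, 18, 18])
# Default expectation: counts split evenly across the six faces
result = chisquare(observed)
print(f"Chi-Square statistic: {result.statistic:.4f}")
print(f"p-value: {result.pvalue:.4f}")
A large p-value here means the deviations from a fair die are easily explained by chance, so the null hypothesis of fairness is not rejected.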
2. Test of Independence
Used with contingency tables (cross-tabulations) to determine if there is a statistically significant association between two categorical variables. For example, testing if there's a relationship between gender and preference for a certain product.
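A minimal sketch of this test using scipy.stats.chi2_contingency is shown below; the 2x2 contingency table of gender versus product preference is made up purely for illustration:
import numpy as np
from scipy.stats import chi2_contingency
# Hypothetical contingency table: rows = gender, columns = product preference
table = np.array([[30, 20],
                  [25, 35]])
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(f"Chi-Square statistic: {chi2_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies under independence:")
print(expected)
A p-value below the chosen significance level (commonly 0.05) would indicate a statistically significant association between the two categorical variables.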
3. Variance Analysis
This is used to compare the variance of a single sample against a hypothesized (known) population variance, as in the quality control example later in this article. Comparing the variances of two different samples is instead handled by the F-test, which is built from a ratio of Chi-Square variables.
Generating Chi-Square Samples with NumPy
The numpy.random.chisquare function is used to generate random samples from a Chi-Square distribution.
Syntax:
numpy.random.chisquare(df, size=None)
- df: The degrees of freedom for the distribution.
- size: The shape of the output array. If None (default), a single value is returned.
Example:
import numpy as np
# Generate 10 Chi-Square samples with 5 degrees of freedom
df = 5
samples = np.random.chisquare(df, size=10)
print("Generated Chi-Square samples:", samples)
Example Output:
Generated Chi-Square samples: [3.94 3.61 8.09 1.63 2.26 3.74 10.88 1.98 3.81 10.83]
Visualizing Chi-Square PDF with SciPy
The Probability Density Function (PDF) of the Chi-Square distribution can be visualized using scipy.stats.chi2.pdf(). This helps show how the shape of the distribution changes with the degrees of freedom.
Example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2
# Define the range of x values
x = np.linspace(0, 20, 500)
# Define different degrees of freedom to plot
dfs = [2, 4, 6, 8]
plt.figure(figsize=(10, 6))
for df in dfs:
    plt.plot(x, chi2.pdf(x, df), label=f"df={df}")
plt.title("Chi-Square Distribution PDF")
plt.xlabel("Value")
plt.ylabel("Probability Density")
plt.legend()
plt.grid(True)
plt.show()
This code will generate a plot showing multiple Chi-Square curves, each representing a different degree of freedom, illustrating the shift from a right-skewed shape to a more symmetric one.
Real-World Simulation: Quality Control
Scenario: A manufacturing company wants to test if the variance in the diameter of a product exceeds an acceptable limit.
- Observed Variance: The variance calculated from a recent sample of products.
- Sample Size: The number of products in the sample.
- Population Variance: The acceptable variance limit specified by quality standards.
The Chi-Square statistic for testing variance is calculated as:
$$ \chi^2 = \frac{(n-1) s^2}{\sigma^2} $$
Where:
- $n$ = Sample size
- $s^2$ = Sample variance (observed_variance)
- $\sigma^2$ = Population variance (acceptable variance limit)
Example:
# Scenario parameters
sample_size = 20
observed_variance = 4.5
population_variance = 4.0
# Calculate the Chi-Square statistic for variance test
chi_square_stat = (sample_size - 1) * observed_variance / population_variance
print(f"Sample Size (n): {sample_size}")
print(f"Observed Variance (s^2): {observed_variance}")
print(f"Population Variance (sigma^2): {population_variance}")
print(f"Chi-Square statistic: {chi_square_stat:.4f}")
Example Output:
Sample Size (n): 20
Observed Variance (s^2): 4.5
Population Variance (sigma^2): 4.0
Chi-Square statistic: 21.3750
This statistic can then be compared against a critical value from the Chi-Square distribution (with $n-1$ degrees of freedom) to determine if the observed variance is significantly higher than the acceptable limit.
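To carry out that comparison, the critical value and p-value can be obtained from scipy.stats.chi2; the sketch below assumes a significance level of 0.05 purely for illustration:
from scipy.stats import chi2
# Values from the quality control scenario above
sample_size = 20
chi_square_stat = 21.375
alpha = 0.05
dof = sample_size - 1
# Critical value for a right-tailed test at the chosen significance level
critical_value = chi2.ppf(1 - alpha, dof)
# p-value: probability of a statistic at least this large if the true variance meets the limit
p_value = chi2.sf(chi_square_stat, dof)
print(f"Critical value (alpha={alpha}, df={dof}): {critical_value:.4f}")
print(f"p-value: {p_value:.4f}")
Here the statistic (21.375) falls below the critical value (about 30.14 for 19 degrees of freedom), so the observed variance would not be judged significantly higher than the acceptable limit at the 5% level.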
Summary of Key Formulas
- Chi-Square Statistic (Sum of Squares): $$ X = \sum_{i=1}^k Z_i^2 $$
- Mean of Chi-Square Distribution: $$ \text{Mean} = k $$
- Variance of Chi-Square Distribution: $$ \text{Variance} = 2k $$
- Chi-Square Goodness-of-Fit Test Statistic: $$ \chi^2 = \sum \frac{(O - E)^2}{E} $$ Where $O$ = Observed frequency, $E$ = Expected frequency.
- Chi-Square Test Statistic for Variance: $$ \chi^2 = \frac{(n-1) s^2}{\sigma^2} $$ Where $n$ = sample size, $s^2$ = sample variance, $\sigma^2$ = population variance.