
22. Choosing the Right Statistical Test

This document outlines various statistical tests and their applications, providing guidance on selecting the appropriate method for your data analysis.

22.1 Assumptions of Hypothesis Testing

Before conducting most hypothesis tests, it's crucial to understand and verify certain assumptions about your data. Violating these assumptions can lead to inaccurate results.

22.1.1 Skewness

Skewness measures the asymmetry of the probability distribution of a real-valued random variable about its mean.

  • Positive Skew (Right Skew): The tail on the right side of the distribution is longer or fatter than the left side. The mean is typically greater than the median.
  • Negative Skew (Left Skew): The tail on the left side of the distribution is longer or fatter than the right side. The mean is typically less than the median.
  • Zero Skew: The distribution is perfectly symmetrical.

Understanding skewness helps in deciding if data transformations are needed or if non-parametric tests might be more appropriate.
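
As a quick check, sample skewness can be estimated directly from its moment definition (the third standardized moment). The sketch below uses simulated data purely for illustration.

# A minimal sketch: sample skewness as the third standardized moment.
# The simulated data below are illustrative only.
set.seed(1)
x_right_skewed <- rexp(1000, rate = 1)   # exponential data: right-skewed
x_symmetric    <- rnorm(1000)            # normal data: roughly symmetric

sample_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3/2)
}

sample_skewness(x_right_skewed)  # clearly positive
sample_skewness(x_symmetric)     # close to zero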

22.1.2 Kurtosis

Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. It describes whether the shape of the data distribution is like a normal distribution in terms of its peakedness and tails.

  • Mesokurtic: A distribution with kurtosis similar to a normal distribution (kurtosis = 3, or excess kurtosis = 0).
  • Leptokurtic: A distribution with heavier tails and a sharper peak than a normal distribution (kurtosis > 3, or excess kurtosis > 0). This indicates more outliers.
  • Platykurtic: A distribution with lighter tails and a flatter peak than a normal distribution (kurtosis < 3, or excess kurtosis < 0). This indicates fewer outliers.

Kurtosis helps assess the likelihood of extreme values and informs decisions about the robustness of statistical methods.
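
Similarly, excess kurtosis can be estimated as the fourth standardized moment minus 3. The sketch below, again on simulated data, contrasts a heavy-tailed t distribution with a normal distribution.

# A minimal sketch: sample excess kurtosis (fourth standardized moment minus 3).
# The simulated data below are illustrative only.
set.seed(2)
x_heavy_tailed <- rt(1000, df = 3)   # t distribution with 3 df: leptokurtic
x_normal       <- rnorm(1000)        # normal data: approximately mesokurtic

sample_excess_kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / (mean((x - m)^2))^2 - 3
}

sample_excess_kurtosis(x_heavy_tailed)  # noticeably greater than 0
sample_excess_kurtosis(x_normal)        # close to 0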

22.2 Correlation

Correlation measures the statistical relationship between two variables. It indicates the extent to which two variables change together.

  • Correlation Coefficient: A numerical value that quantifies the strength and direction of a linear relationship between two variables. It ranges from -1 to +1.
    • +1: Perfect positive linear correlation. As one variable increases, the other increases proportionally.
    • -1: Perfect negative linear correlation. As one variable increases, the other decreases proportionally.
    • 0: No linear correlation.
  • Correlation vs. Causation: It's a common misconception that correlation implies causation. Correlation simply indicates an association, while causation implies that one variable directly influences another. There might be a third, unobserved variable (confounder) causing both to change.
  • Pearson Correlation: A common measure of linear correlation between two continuous variables. Significance tests of the Pearson coefficient assume the data are approximately normally distributed.
  • Covariance vs. Correlation:
    • Covariance: Measures the direction of the linear relationship between two variables. Its value is not standardized and can range from negative infinity to positive infinity, making it difficult to compare across different datasets.
    • Correlation: A standardized version of covariance. It normalizes the covariance by the product of the standard deviations of the two variables, resulting in a value between -1 and +1. This standardization makes it easier to interpret the strength of the relationship irrespective of the variables' scales, as illustrated in the sketch below.
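
A minimal sketch of the distinction, using base R's cov() and cor() on simulated data: rescaling a variable changes the covariance but leaves the Pearson correlation unchanged.

# A minimal sketch: covariance vs. Pearson correlation on simulated data.
set.seed(3)
x <- rnorm(100)
y <- 2 * x + rnorm(100)          # y is linearly related to x, plus noise

cov(x, y)                        # unstandardized; depends on the variables' scales
cor(x, y, method = "pearson")    # standardized to the range [-1, +1]

# Rescaling x changes the covariance but not the correlation
cov(100 * x, y)
cor(100 * x, y)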

22.3 Regression Analysis

Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables.

22.3.1 t-Test

While often used independently, t-tests are fundamental in regression analysis for testing the significance of individual coefficients.

  • Purpose: To determine if the mean of a group differs significantly from a specific value or the mean of another group.
  • Application in Regression: In linear regression, t-tests assess whether the regression coefficients (slopes and intercept) differ significantly from zero. A significant t-test indicates that the corresponding independent variable contributes meaningfully to explaining the dependent variable, as shown in the sketch below.
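
The sketch below, on simulated data, fits a simple linear regression with lm(); in the summary() output, the "t value" and "Pr(>|t|)" columns report the t-test of each coefficient against zero.

# A minimal sketch: coefficient t-tests in a simple linear regression.
# The data are simulated for illustration.
set.seed(4)
x <- rnorm(50)
y <- 3 + 1.5 * x + rnorm(50)

model <- lm(y ~ x)
summary(model)   # each coefficient row reports an estimate, its t value, and Pr(>|t|)

# For comparison, a stand-alone two-sample t-test
group_a <- rnorm(30, mean = 5)
group_b <- rnorm(30, mean = 6)
t.test(group_a, group_b)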

22.3.2 ANOVAs (Analysis of Variance)

ANOVA is a statistical technique used to test the differences between the means of two or more groups.

22.3.2.1 One-Way ANOVA

  • Purpose: To compare the means of three or more independent groups based on one factor.
  • Example: Comparing the average test scores of students from three different teaching methods.

22.3.2.2 Two-Way ANOVA

  • Purpose: To compare the means of groups based on two factors (independent variables) simultaneously. It also allows for the examination of interaction effects between the factors.
  • Example: Analyzing the effect of both fertilizer type and watering frequency on plant growth.

22.3.2.3 ANOVA in R

# Example: One-way ANOVA in R
# Assuming 'data' is a data frame with a 'group' column and a 'value' column
set.seed(123)  # for reproducible simulated data

# Example data
data <- data.frame(
  group = rep(c("A", "B", "C"), each = 10),
  value = c(rnorm(10, mean = 5), rnorm(10, mean = 6), rnorm(10, mean = 7))
)

# Perform one-way ANOVA
anova_result <- aov(value ~ group, data = data)

# Summarize the results
summary(anova_result)

# Post-hoc test (e.g., Tukey's HSD) if ANOVA is significant
if (summary(anova_result)[[1]][["Pr(>F)"]][1] < 0.05) {
  print(TukeyHSD(anova_result))
}

# Example: Two-way ANOVA in R
# Assuming 'data' has 'group1', 'group2', and 'value' columns
# Example data
data_two_way <- data.frame(
  group1 = rep(c("X", "Y"), each = 15),
  group2 = rep(c("P", "Q", "R"), each = 5, times = 2),
  value = c(rnorm(5, mean = 5), rnorm(5, mean = 6), rnorm(5, mean = 7),
            rnorm(5, mean = 6), rnorm(5, mean = 7), rnorm(5, mean = 8))
)

# Perform two-way ANOVA with both main effects and their interaction
anova_two_way_result <- aov(value ~ group1 + group2 + group1:group2, data = data_two_way)

# Summarize the results
summary(anova_two_way_result)

22.4 Chi-Square ($\chi^2$) Tests

Chi-square tests are non-parametric tests used to analyze categorical data. They are commonly used to determine if there is a significant association between two categorical variables or if observed frequencies differ from expected frequencies.

22.4.1 Overview of the Chi-Square Test

The $\chi^2$ test works by comparing the observed frequencies in different categories to the frequencies that would be expected under a null hypothesis of no association or no difference.
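
For $k$ categories (or cells) with observed counts $O_i$ and expected counts $E_i$, the test statistic is

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},$$

which is compared against a $\chi^2$ distribution with the appropriate degrees of freedom; larger values indicate a greater divergence between observed and expected frequencies.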

22.4.2 Chi-Square Goodness-of-Fit Test

  • Purpose: To determine if the observed frequency distribution of a single categorical variable matches an expected frequency distribution.
  • Example: Testing if a die is fair by comparing the observed counts of each face appearing after rolling it many times against the expected equal frequency for each face (see the sketch below).
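
A minimal sketch with base R's chisq.test(), using hypothetical counts from 600 die rolls and equal expected proportions of 1/6 per face:

# A minimal sketch: chi-square goodness-of-fit test for a fair die.
# The observed counts are hypothetical (600 rolls in total).
observed <- c(95, 102, 110, 98, 90, 105)
chisq.test(observed, p = rep(1/6, 6))   # null hypothesis: all six faces are equally likely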

22.4.3 Chi-Square Test of Independence

  • Purpose: To determine if there is a statistically significant association between two categorical variables in a population.
  • Example: Testing if there is a relationship between a person's smoking status (smoker/non-smoker) and their likelihood of developing a certain respiratory illness (yes/no). The test assesses whether the proportion of people with the illness is the same across smokers and non-smokers (see the sketch below).
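
A minimal sketch with chisq.test() on a hypothetical 2x2 contingency table of smoking status versus illness; the counts are invented for illustration.

# A minimal sketch: chi-square test of independence on a hypothetical 2x2 table.
smoking_illness <- matrix(c(40, 60,    # smokers:     illness yes / no
                            20, 80),   # non-smokers: illness yes / no
                          nrow = 2, byrow = TRUE,
                          dimnames = list(status  = c("smoker", "non-smoker"),
                                          illness = c("yes", "no")))
chisq.test(smoking_illness)   # tests whether illness rates differ by smoking status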