Hypergeometric Mean & Variance | Probability


15.2 Mean and Variance of the Hypergeometric Distribution

The Hypergeometric Distribution is a discrete probability distribution that describes the probability of obtaining a specific number of successes in a sample drawn without replacement from a finite population. Understanding its mean and variance is crucial for analyzing the central tendency and spread of potential outcomes in such sampling scenarios.

I. Mean (Expected Value)

The mean, often referred to as the expected value, of the Hypergeometric Distribution represents the average number of successes you can anticipate in a sample of a given size. It's calculated based on the proportion of successes in the entire population.

Formula for Mean:

$$ \mu = n \times \frac{k}{N} $$

Where:

  • $\mu$ (mu): The mean, or expected number of successes in the sample.
  • $n$: The sample size (the number of items drawn from the population).
  • $k$: The total number of "success" items in the population.
  • $N$: The total population size.

Explanation:

The expected number of successes in the sample is the sample size multiplied by $k/N$, the marginal probability that any single draw is a success. The draws are dependent because items are not replaced, but expectation is linear, so this dependence does not change the mean.
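One standard way to justify the formula is with indicator variables. The symbols $X$ and $X_i$ below are introduced here for illustration and are not part of the original notation: $X$ is the number of successes in the sample, and $X_i$ equals 1 if the $i$-th draw is a success and 0 otherwise. By symmetry, every draw has the same marginal success probability $k/N$, so:

$$ X = \sum_{i=1}^{n} X_i, \qquad E[X_i] = P(X_i = 1) = \frac{k}{N} $$

$$ \mu = E[X] = \sum_{i=1}^{n} E[X_i] = n \times \frac{k}{N} $$

Linearity of expectation holds whether or not the $X_i$ are independent, which is why the mean matches that of a Binomial distribution with $p = k/N$.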

Example:

Suppose a bag contains 50 marbles ($N=50$), and 10 of them are red ($k=10$). If you draw a sample of 5 marbles without replacement ($n=5$), the expected number of red marbles in your sample is:

$$ \mu = 5 \times \frac{10}{50} = 5 \times 0.2 = 1 $$

So, you would expect to draw, on average, 1 red marble.
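As a check on the example, the mean can also be computed directly from the probability mass function. This is a minimal sketch using only the Python standard library; the helper name `hypergeom_pmf` is hypothetical and chosen for illustration.

```python
from math import comb

def hypergeom_pmf(x, N, k, n):
    """P(X = x): exactly x successes in n draws without replacement."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

# Marble example: population size, red marbles, sample size
N, k, n = 50, 10, 5

# Possible success counts run from max(0, n - (N - k)) to min(n, k)
support = range(max(0, n - (N - k)), min(n, k) + 1)

# Mean as a probability-weighted sum over the support
mean_from_pmf = sum(x * hypergeom_pmf(x, N, k, n) for x in support)

# Mean from the closed-form formula n * k / N
mean_from_formula = n * k / N

print(mean_from_pmf, mean_from_formula)
```

Both values should agree up to floating-point rounding, which is a quick way to confirm that the closed-form formula summarizes the full distribution.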

II. Variance

The variance quantifies the spread or dispersion of the possible number of successes in a sample around the mean. A higher variance indicates that the number of successes in the sample is likely to deviate more significantly from the expected value.

Formula for Variance:

$$ \sigma^2 = n \times \frac{k}{N} \times \frac{N-k}{N} \times \frac{N-n}{N-1} $$

Where:

  • $\sigma^2$ (sigma squared): The variance of the Hypergeometric Distribution.
  • $n$: The sample size.
  • $k$: The total number of successes in the population.
  • $N$: The total population size.

Explanation:

The variance formula is an extension of the variance of a Binomial distribution, incorporating a finite population correction factor of $\frac{N-n}{N-1}$.

  • The first three factors ($n \times \frac{k}{N} \times \frac{N-k}{N}$) are the variance of a Binomial distribution, where $\frac{k}{N}$ is the probability of success ($p$) and $\frac{N-k}{N}$ is the probability of failure ($1-p$).
  • The finite population correction factor $\frac{N-n}{N-1}$ accounts for the fact that sampling is done without replacement. As the sample size ($n$) approaches the population size ($N$), this factor approaches 0, reducing the variance. This makes sense because if you sample the entire population, there's no variability in the number of successes – you'll get exactly $k$ successes.
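The decomposition above can be sketched in code by computing the Binomial variance and the correction factor separately. The function names (`binomial_variance`, `fpc`, `hypergeom_variance`) are hypothetical, and the sample sizes in the loop are arbitrary choices for illustration.

```python
def binomial_variance(n, p):
    """Variance of a Binomial(n, p): sampling with replacement."""
    return n * p * (1 - p)

def fpc(N, n):
    """Finite population correction factor (N - n) / (N - 1)."""
    return (N - n) / (N - 1)

def hypergeom_variance(N, k, n):
    """Hypergeometric variance as Binomial variance times the correction."""
    return binomial_variance(n, k / N) * fpc(N, n)

N, k = 50, 10
print("n  binomial  fpc  hypergeometric")
for n in (1, 5, 25, 49, 50):
    print(
        n,
        round(binomial_variance(n, k / N), 4),
        round(fpc(N, n), 4),
        round(hypergeom_variance(N, k, n), 4),
    )
```

With $n = 1$ the factor is 1 and the two variances coincide; with $n = N$ the factor is 0 and the Hypergeometric variance vanishes.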

Example (Continuing from above):

Using the same bag of 50 marbles with 10 red ones, and drawing a sample of 5 marbles ($N=50, k=10, n=5$):

$$ \sigma^2 = 5 \times \frac{10}{50} \times \frac{50-10}{50} \times \frac{50-5}{50-1} $$

$$ \sigma^2 = 5 \times 0.2 \times \frac{40}{50} \times \frac{45}{49} $$

$$ \sigma^2 = 1 \times 0.8 \times \frac{45}{49} $$

$$ \sigma^2 = 0.8 \times 0.918367... $$

$$ \sigma^2 \approx 0.7347 $$

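The variance is expressed in squared units. Its square root, the standard deviation $\sigma = \sqrt{0.7347} \approx 0.857$, gives the typical distance between the number of red marbles in a sample of 5 and the expected value of 1.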
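A simulation offers an independent check on these numbers. The sketch below repeatedly draws 5 marbles without replacement using `random.sample` and compares the empirical mean and variance with the formulas. The number of trials and the seed are arbitrary assumptions, and the simulated values will only approximate the theoretical ones.

```python
import random

random.seed(0)  # fixed seed so repeated runs give the same draws

N, k, n = 50, 10, 5
# Encode the bag: 1 = red marble (success), 0 = any other marble
population = [1] * k + [0] * (N - k)

trials = 50_000
# random.sample draws n distinct items, i.e. without replacement
counts = [sum(random.sample(population, n)) for _ in range(trials)]

sim_mean = sum(counts) / trials
sim_var = sum((c - sim_mean) ** 2 for c in counts) / trials

# Theoretical values from the formulas in this section
theory_mean = n * k / N
theory_var = n * (k / N) * ((N - k) / N) * ((N - n) / (N - 1))

print("simulated:  ", round(sim_mean, 3), round(sim_var, 3))
print("theoretical:", round(theory_mean, 3), round(theory_var, 3))
```

Increasing `trials` should bring the simulated values closer to the theoretical ones, at the cost of a longer run time.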

Applications

These formulas are fundamental for analyzing situations involving sampling without replacement from a finite population. Common applications include:

  • Quality Control: Determining the probability of finding defective items in a batch inspected without replacement.
  • Lottery Selection: Calculating the chances of winning based on drawn numbers from a finite set.
  • Biological Studies: Analyzing population genetics, where samples are taken from a limited gene pool.
  • Survey Sampling: Estimating population characteristics from samples drawn without replacement.
  • Card Games: Calculating probabilities related to drawing specific cards from a shuffled deck.

The following questions and answers review the key points about the mean and variance of the Hypergeometric Distribution:

  • What is the formula for the mean of a Hypergeometric Distribution?
    • $\mu = n \times (k/N)$
  • How do you calculate the expected number of successes in a Hypergeometric Distribution?
    • By using the mean formula $\mu = n \times (k/N)$.
  • What is the significance of variance in the Hypergeometric Distribution?
    • It measures the spread or variability of the number of successes in a sample around the mean, indicating how much outcomes are likely to deviate from the expectation.
  • How does sampling without replacement affect the mean and variance?
    • It doesn't affect the mean formula. However, it necessitates the finite population correction factor in the variance formula, which reduces the variance compared to sampling with replacement.
  • Compare the variance of the Hypergeometric and Binomial Distributions.
    • The Hypergeometric variance is the Binomial variance multiplied by the finite population correction factor $(\frac{N-n}{N-1})$, which equals 1 when $n = 1$ and is strictly less than 1 for $n > 1$. The Hypergeometric variance is therefore never larger than the Binomial variance, and the gap widens as the sample becomes a larger fraction of the population.
  • In which real-world scenarios is the Hypergeometric mean formula applicable?
    • Any scenario where you draw a sample from a finite, defined population without putting items back, and you're interested in the expected number of items of a specific type. Examples include quality checks on a limited production run, analyzing lottery draws, or sampling from a finite group of individuals.
  • Why does the Hypergeometric variance include the correction factor $(\frac{N-n}{N-1})$?
    • This factor accounts for the dependence between draws introduced by sampling without replacement from a finite population. Each success drawn leaves fewer successes available for later draws, so the draws are negatively correlated, which lowers the variance of their sum. The effect strengthens as the sample size grows relative to the population size.
  • What are the key parameters needed to calculate mean and variance in the Hypergeometric Distribution?
    • Population size ($N$), number of successes in the population ($k$), and sample size ($n$).
  • How does the population size ($N$) influence the variance of the Hypergeometric Distribution?
    • A larger population size ($N$) relative to the sample size ($n$) means the finite population correction factor $(\frac{N-n}{N-1})$ will be closer to 1, making the Hypergeometric variance closer to the Binomial variance. Conversely, a smaller $N$ relative to $n$ significantly reduces the variance.
  • Can you explain the impact of increasing sample size on the expected value and variance?
    • Expected Value: Increases proportionally to the sample size ($n$).
    • Variance: For fixed $N$ and $k$, the variance is proportional to $n(N-n)$. It rises with $n$ until about $n = N/2$, then falls, reaching zero at $n = N$. The finite population correction factor is what dampens and eventually reverses the growth.
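To see how the variance changes with sample size, the following sketch evaluates the variance formula for every possible $n$ in the marble example and locates the sample size at which it peaks. The function name `hypergeom_variance` and the variable names are hypothetical.

```python
def hypergeom_variance(N, k, n):
    """Variance of the number of successes in n draws without replacement."""
    p = k / N
    return n * p * (1 - p) * (N - n) / (N - 1)

N, k = 50, 10

# Variance for every sample size from 0 (no draws) to N (whole population)
variances = {n: hypergeom_variance(N, k, n) for n in range(N + 1)}

# Sample size at which the variance is largest
peak_n = max(variances, key=variances.get)

for n in (0, 5, 10, 25, 40, 45, 50):
    print(n, round(variances[n], 4))
print("variance peaks at n =", peak_n)
```

The mean $n \times k/N$ keeps growing linearly over the same range, so only the variance shows this rise-and-fall pattern.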