20.8 Confidence Intervals Explained: Statistics for AI

20.8 Confidence Intervals

A confidence interval (CI) is a range of values, derived from sample statistics, that is likely to contain the true value of a population parameter (such as the mean or proportion). It is a fundamental concept in inferential statistics, used to quantify the uncertainty associated with an estimate derived from a sample.

Definition

A confidence interval provides a range within which we expect the true population parameter to lie, based on our sample data. The associated confidence level (for example, 95%) describes how reliably intervals constructed this way capture the actual population value.

For example, a 95% confidence interval means: "We are 95% confident that the calculated interval contains the true population parameter."

It is crucial to understand that a confidence interval does not imply a 95% probability that the parameter itself lies within that specific interval. The population parameter is a fixed, albeit unknown, value. Instead, the confidence refers to the long-run frequency of the method used to construct the interval. If we were to repeatedly draw samples and construct confidence intervals, approximately 95% of those intervals would capture the true population parameter.
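The long-run interpretation can be checked with a small simulation. The sketch below is illustrative only: the function name `coverage_rate`, the population values (mean 100, standard deviation 15), the sample size, and the seed are all assumptions chosen for the demonstration, and the population is assumed to be normal with a known standard deviation.

```python
import random
from statistics import NormalDist, fmean

def coverage_rate(mu=100.0, sigma=15.0, n=36, confidence=0.95,
                  trials=10_000, seed=0):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    # Two-sided critical value for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        # Draw a fresh sample and build an interval around its mean.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        sample_mean = fmean(sample)
        if sample_mean - margin <= mu <= sample_mean + margin:
            hits += 1
    return hits / trials

rate = coverage_rate()
print(f"Fraction of intervals containing the true mean: {rate:.3f}")
```

Each individual interval either contains the true mean or does not; it is the fraction across many repetitions that should land close to the stated confidence level.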

Formulas

The calculation of a confidence interval for a population mean depends on whether the population standard deviation is known or unknown.

1. When Population Standard Deviation ($\sigma$) is Known

When the population standard deviation ($\sigma$) is known, we use the Z-distribution.

$\text{CI} = \bar{x} \pm Z \cdot \frac{\sigma}{\sqrt{n}}$

2. When Population Standard Deviation ($\sigma$) is Unknown

When the population standard deviation ($\sigma$) is unknown, we use the sample standard deviation ($s$) and the t-distribution.

$\text{CI} = \bar{x} \pm t \cdot \frac{s}{\sqrt{n}}$

Key Components:

  • $\bar{x}$ (x-bar): The sample mean, which is our best point estimate of the population mean.
  • $\sigma$ (sigma): The population standard deviation. This measures the dispersion of the population.
  • $s$: The sample standard deviation. This measures the dispersion of the sample data.
  • $n$: The sample size. A larger sample size generally leads to a narrower and more precise confidence interval.
  • Z: The critical value (Z-score) from the standard normal distribution for the desired confidence level, for example 1.96 for 95%.
  • t: The t-score from Student’s t-distribution. This value is dependent on the confidence level and the degrees of freedom (usually $n-1$).
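The two formulas can be sketched as small helper functions. This is a minimal illustration, not a reference implementation: the names `z_interval` and `t_interval` and the sample values are hypothetical, and the t critical value is passed in by the caller (here 2.262, the standard t-table value for 95% confidence and 9 degrees of freedom) because the Python standard library has no t-distribution quantile function.

```python
from statistics import NormalDist, fmean, stdev

def z_interval(sample, sigma, confidence=0.95):
    """CI for the mean when the population standard deviation is known."""
    n = len(sample)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    margin = z * sigma / n ** 0.5
    mean = fmean(sample)
    return mean - margin, mean + margin

def t_interval(sample, t_crit):
    """CI for the mean when sigma is unknown; t_crit comes from a t-table."""
    n = len(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    margin = t_crit * s / n ** 0.5
    mean = fmean(sample)
    return mean - margin, mean + margin

# Hypothetical sample of 10 observations, for illustration only.
sample = [98, 103, 95, 110, 101, 99, 104, 97, 106, 102]

print("Z interval (sigma assumed known = 15):", z_interval(sample, sigma=15))
print("t interval (95%, df = 9):", t_interval(sample, t_crit=2.262))
```

Both intervals are centered on the sample mean; they differ only in the critical value and in which standard deviation feeds the standard error.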

Z-Scores for Common Confidence Levels

| Confidence Level | Z-Score |
| --- | --- |
| 90% | 1.645 |
| 95% | 1.96 |
| 99% | 2.576 |
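These critical values do not need to be memorized; they follow from the inverse CDF of the standard normal distribution. A short sketch using Python's standard library is shown below (the helper name `z_critical` is an illustrative choice). For a two-sided interval with confidence level $c$, the relevant quantile is $0.5 + c/2$.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided Z critical value for a given confidence level."""
    # Half of the excluded probability sits in each tail.
    return NormalDist().inv_cdf(0.5 + confidence / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}  {z_critical(level):.3f}")
```

Rounded to three decimal places, the printed values agree with the table above.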

Example: Confidence Interval Calculation

Let's calculate a confidence interval for the population mean when the population standard deviation is known.

Given:

  • Sample Mean ($\bar{x}$) = 100
  • Population Standard Deviation ($\sigma$) = 15
  • Sample Size ($n$) = 36
  • Confidence Level = 95%

From the table above, the Z-score for a 95% confidence level is 1.96.

Calculation:

  1. Calculate the Standard Error: Standard Error = $\sigma / \sqrt{n} = 15 / \sqrt{36} = 15 / 6 = 2.5$

  2. Calculate the Margin of Error: Margin of Error = $Z \times$ Standard Error $= 1.96 \times 2.5 = 4.9$

  3. Construct the Confidence Interval: CI = $\bar{x} \pm$ Margin of Error $= 100 \pm 4.9$

    Lower Bound = $100 - 4.9 = 95.1$
    Upper Bound = $100 + 4.9 = 104.9$

Confidence Interval = [95.1, 104.9]
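The same arithmetic can be reproduced in a few lines of Python. This is only a sketch of the worked example; the variable names are illustrative, and the Z-score is taken from the table rather than computed.

```python
import math

x_bar = 100   # sample mean
sigma = 15    # known population standard deviation
n = 36        # sample size
z = 1.96      # Z-score for 95% confidence, from the table

standard_error = sigma / math.sqrt(n)    # 15 / 6
margin_of_error = z * standard_error     # 1.96 * 2.5

lower = x_bar - margin_of_error
upper = x_bar + margin_of_error
print(f"95% CI: [{lower:.1f}, {upper:.1f}]")
```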

Interpretation of the Example

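We are 95% confident that the true population mean lies between 95.1 and 104.9. The width of the interval reflects the sampling variability of the mean: with $\sigma = 15$ and $n = 36$, sample means vary around the population mean with a standard error of 2.5.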

Summary Table

| Term | Explanation |
| --- | --- |
| Confidence Interval | The range of values within which the true population parameter is likely to lie. |
| Sample Mean ($\bar{x}$) | The central value of the sample data; used as a point estimate. |
| Z or t Score | A multiplier determined by the confidence level and sample size (via degrees of freedom for t). |
| Margin of Error | The "plus or minus" part of the confidence interval ($Z \times \frac{\sigma}{\sqrt{n}}$ or $t \times \frac{s}{\sqrt{n}}$). |

When to Use Z-Distribution vs. t-Distribution

  • Use the Z-distribution: When the population standard deviation ($\sigma$) is known. When $\sigma$ is unknown, Z is only an approximation that becomes acceptable for large samples; $n > 30$ is a common rule of thumb, because the t-distribution approaches the Z-distribution as $n$ grows.
  • Use the t-distribution: When the population standard deviation ($\sigma$) is unknown and must be estimated using the sample standard deviation ($s$). It is valid at any sample size in that case and matters most for smaller samples ($n \le 30$), where its critical values are noticeably larger than the Z-scores. The t-distribution accounts for the additional uncertainty introduced by estimating $\sigma$.
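To see how quickly the t critical value approaches the Z-score, the sketch below computes 95% t critical values for several sample sizes. It uses only the standard library, so the t quantile is obtained by numerically integrating the t density and bisecting on the result. That approach, and the helper names `t_pdf`, `t_cdf`, and `t_critical`, are choices made for this illustration; in practice a statistics library would normally supply the quantile.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0, via Simpson's rule on [0, x] (steps is even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(confidence, df):
    """Two-sided t critical value, found by bisection on the CDF."""
    target = 0.5 + confidence / 2
    lo, hi = 0.0, 50.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z = NormalDist().inv_cdf(0.975)
for n in (5, 10, 30, 100, 1000):
    t = t_critical(0.95, n - 1)
    print(f"n={n:>4}  t={t:.3f}  z={z:.3f}  t/z={t / z:.3f}")
```

The ratio t/z shows how much wider the t-based interval is than a Z-based interval built from the same standard error; it shrinks toward 1 as $n$ increases.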

Frequently Asked Questions (FAQs)

  • What is a confidence interval and why is it important in statistics? A confidence interval is a range that likely contains the true population parameter. It's important because it quantifies the uncertainty of our estimates derived from sample data, providing a more informative picture than a single point estimate.

  • How do you interpret a 95% confidence interval? You interpret it as: "We are 95% confident that the true population parameter falls within this calculated range."

  • What is the difference between using a Z-distribution and a t-distribution for confidence intervals? The Z-distribution is used when the population standard deviation is known. The t-distribution is used when the population standard deviation is unknown and is estimated by the sample standard deviation. The t-distribution has heavier tails than the Z-distribution, which produces wider intervals that account for the extra uncertainty from estimating the population standard deviation.

  • When should you use the t-distribution instead of the Z-distribution? You should use the t-distribution when the population standard deviation is unknown and you are using the sample standard deviation to estimate it, particularly with smaller sample sizes.

  • How do you calculate the margin of error in a confidence interval? The margin of error is calculated by multiplying the appropriate critical value (Z-score or t-score) by the standard error of the statistic (e.g., $\sigma/\sqrt{n}$ or $s/\sqrt{n}$).

  • What does the sample mean represent in the confidence interval formula? The sample mean ($\bar{x}$) is the center point of the confidence interval and serves as the best point estimate for the unknown population mean.

  • How does sample size affect the width of a confidence interval? As the sample size ($n$) increases, the standard error ($\sigma/\sqrt{n}$ or $s/\sqrt{n}$) decreases. This leads to a smaller margin of error and a narrower confidence interval, indicating a more precise estimate.

  • Can you explain why the confidence interval does not imply the probability of the parameter being within the interval? The population parameter is a fixed value. It either is or is not within a specific interval. The confidence level (e.g., 95%) refers to the long-run success rate of the method used to construct the interval. If we were to take many samples and construct many intervals, about 95% of those intervals would contain the true parameter.
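The answer to the sample-size question above can be made concrete with a quick calculation. The sketch below reuses $\sigma = 15$ and the 95% Z-score from the worked example; the function name `ci_width` and the chosen sample sizes are illustrative assumptions.

```python
import math

Z_95 = 1.96  # Z-score for 95% confidence, from the table

def ci_width(sigma, n, z=Z_95):
    """Full width of the interval: twice the margin of error."""
    return 2 * z * sigma / math.sqrt(n)

for n in (36, 144, 576):
    print(f"n={n:>3}  width={ci_width(15, n):.2f}")
```

Because the standard error scales with $1/\sqrt{n}$, quadrupling the sample size halves the width of the interval.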