Python Statistics Module Tutorial for AI & Data Science

Master Python's `statistics` module for AI, ML, and data analysis. Learn to calculate averages, medians, standard deviations & more with this essential guide.

5.6 Python statistics Module Tutorial

The Python statistics module provides a suite of functions for performing statistical operations such as calculating averages, medians, standard deviations, and more. It is particularly useful for data analysis and quick statistical evaluations without requiring external libraries.

Importing the Module

To use any function from the statistics module, you must first import it:

import statistics

1. Measures of Central Tendency

These functions help identify the central or "typical" value in a dataset.

statistics.mean(data)

Returns the arithmetic mean (average) of the given data.

values = [10, 15, 25, 30]
print(statistics.mean(values))
# Output: 20

statistics.fmean(data) (Python 3.8+)

Converts the data to floats and returns the arithmetic mean as a float. It is faster than mean() and always returns a float, whereas mean() keeps exact results for int, Fraction, and Decimal inputs at the cost of speed.

print(statistics.fmean([1, 2, 3, 4]))
# Output: 2.5
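To make the difference in return types concrete, here is a small sketch comparing the two functions on the same integer data. The `readings` list is made up for illustration.

```python
import statistics

readings = [3, 5, 8, 8]  # illustrative data

exact = statistics.mean(readings)   # ints in, exact result, so an int comes back
fast = statistics.fmean(readings)   # data is converted to float first

print(exact, fast)                                # 6 6.0
print(type(exact).__name__, type(fast).__name__)  # int float
```

If downstream code expects floats everywhere, fmean() is usually the simpler choice.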

statistics.median(data)

Returns the median or middle value. For datasets with an even number of elements, it returns the average of the two middle values.

print(statistics.median([5, 2, 8, 1, 7]))
# Output: 5

print(statistics.median([1, 3, 5, 7]))
# Output: 4.0
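The rule for odd and even lengths can be sketched by hand. The helper name `median_sketch` is hypothetical and only mirrors the description above; it is not the module's actual implementation.

```python
import statistics

def median_sketch(data):
    """Hypothetical helper: sort, then pick or average the middle."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("median_sketch needs at least one value")
    mid = n // 2
    if n % 2 == 1:
        # Odd length: the single middle value.
        return ordered[mid]
    # Even length: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_sketch([5, 2, 8, 1, 7]))  # 5
print(median_sketch([1, 3, 5, 7]))     # 4.0
```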

statistics.median_low(data)

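Returns the low median. For an odd number of data points this is the middle value; for an even number it is the lower of the two middle values. The result is always a member of the dataset.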

print(statistics.median_low([1, 3, 5, 7]))
# Output: 3

statistics.median_high(data)

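Returns the high median. For an odd number of data points this is the middle value; for an even number it is the higher of the two middle values. The result is always a member of the dataset.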

print(statistics.median_high([1, 3, 5, 7]))
# Output: 5
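A quick side-by-side may help when choosing among the three median functions. The `ratings` data is illustrative; the point is that median() can return a value that never occurs in the data, while median_low() and median_high() cannot.

```python
import statistics

ratings = [1, 3, 5, 7]  # even number of values

low = statistics.median_low(ratings)    # an actual data point
high = statistics.median_high(ratings)  # an actual data point
middle = statistics.median(ratings)     # interpolated, not in the data

print(low, high, middle)  # 3 5 4.0

# With an odd number of values all three agree on the middle element.
odd_ratings = [1, 3, 5]
all_agree = (
    statistics.median_low(odd_ratings)
    == statistics.median_high(odd_ratings)
    == statistics.median(odd_ratings)
)
print(all_agree)  # True
```

This makes median_low() and median_high() a reasonable fit for discrete data such as ratings, where an interpolated value would not be meaningful.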

statistics.mode(data)

Returns the most frequently occurring value.

print(statistics.mode([2, 3, 3, 5, 7, 3, 2]))
# Output: 3

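Note: Since Python 3.8, mode() returns the first mode encountered when several values tie for the highest frequency; earlier versions raised a StatisticsError in that case. mode() still raises a StatisticsError for empty data. Use multimode() to get every mode.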

statistics.multimode(data) (Python 3.8+)

Returns a list of all modes (values that occur with the highest frequency).

print(statistics.multimode([1, 1, 2, 3, 3]))
# Output: [1, 3]
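The following sketch shows how mode() and multimode() behave on tied and empty data, assuming Python 3.8 or later. The `votes` list is made up; both functions also work on non-numeric data such as strings.

```python
import statistics

votes = ["red", "blue", "blue", "red", "green"]  # "red" and "blue" tie

first_mode = statistics.mode(votes)      # first tied value seen in the data
all_modes = statistics.multimode(votes)  # every tied value, in first-seen order
empty_modes = statistics.multimode([])   # empty input gives an empty list

print(first_mode)   # red
print(all_modes)    # ['red', 'blue']
print(empty_modes)  # []

# mode() has no sensible answer for empty data, so it raises instead.
try:
    statistics.mode([])
    mode_raised = False
except statistics.StatisticsError:
    mode_raised = True
print(mode_raised)  # True
```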

2. Measures of Spread (Dispersion)

These functions indicate how much the data varies from the central value.

statistics.variance(data, xbar=None)

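Returns the sample variance of the data. Use it when the data is a sample from a larger population: it divides the sum of squared deviations by n - 1 and requires at least two data points. If you have already computed the mean, pass it as xbar to avoid recalculating it.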

data = [4, 7, 13, 16]
print(statistics.variance(data))
# Output: 30

statistics.pvariance(data, mu=None)

Returns the population variance. Use it when the data represents the entire population: it divides the sum of squared deviations by n. If you have already computed the mean, pass it as mu.

print(statistics.pvariance([4, 7, 13, 16]))
# Output: 22.5
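The only difference between the two variances is the divisor. A by-hand sketch on the same data makes that visible; the variable names are illustrative.

```python
import statistics

data = [4, 7, 13, 16]
n = len(data)
mean = sum(data) / n                                      # 10.0
squared_deviations = sum((x - mean) ** 2 for x in data)   # 36 + 9 + 9 + 36

sample_variance = squared_deviations / (n - 1)   # divide by n - 1
population_variance = squared_deviations / n     # divide by n

print(sample_variance, population_variance)  # 30.0 22.5
```

Dividing by n - 1 gives a slightly larger value, which compensates for a sample tending to underestimate the spread of the population it came from.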

statistics.stdev(data, xbar=None)

Returns the sample standard deviation, which is the square root of the sample variance.

print(statistics.stdev([4, 7, 13, 16]))
# Output: 5.477225575051661

statistics.pstdev(data, mu=None)

Returns the population standard deviation, which is the square root of the population variance.

print(statistics.pstdev([4, 7, 13, 16]))
# Output: 4.743416490252569
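As a sanity check, the relationship between the standard deviation and variance functions can be verified directly. This is a minimal sketch; math.isclose is used rather than == because floating-point square roots need not match to the last bit.

```python
import math
import statistics

data = [4, 7, 13, 16]

sample_sd = statistics.stdev(data)
population_sd = statistics.pstdev(data)

# Each standard deviation is the square root of the matching variance.
matches_sample = math.isclose(sample_sd, math.sqrt(statistics.variance(data)))
matches_population = math.isclose(population_sd, math.sqrt(statistics.pvariance(data)))

print(matches_sample, matches_population)  # True True
print(sample_sd > population_sd)           # True
```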

3. Other Statistical Functions

statistics.harmonic_mean(data) (Python 3.6+)

Useful for rates and ratios, this function returns the harmonic mean. The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the data.

print(statistics.harmonic_mean([40, 60]))
# Output: 48.0
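A typical use is averaging speeds over legs of equal distance. The sketch below assumes a hypothetical trip with two 120 km legs and checks the harmonic mean against total distance divided by total time.

```python
import math
import statistics

speeds = [40, 60]   # km/h on two legs of equal distance
leg_km = 120        # hypothetical leg length

harmonic = statistics.harmonic_mean(speeds)

# Definition: reciprocal of the arithmetic mean of the reciprocals.
by_definition = len(speeds) / sum(1 / s for s in speeds)

# Physical check: total distance over total time.
total_hours = sum(leg_km / s for s in speeds)         # 3.0 + 2.0
average_speed = (leg_km * len(speeds)) / total_hours  # 240 / 5.0

print(harmonic, average_speed)  # 48.0 48.0
```

The arithmetic mean of the speeds would give 50, which overstates the true average because more time is spent on the slower leg.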

statistics.geometric_mean(data) (Python 3.8+)

Used for multiplicative or exponential datasets. The geometric mean is the n-th root of the product of n numbers.

print(statistics.geometric_mean([10, 100]))
# Output: approximately 31.6228 (the last digits can differ between Python versions)
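One common use is averaging growth factors. The sketch below uses made-up yearly factors and compares the library result with the n-th root of the product, using math.isclose because the two computations can differ in the last few digits.

```python
import math
import statistics

growth_factors = [1.10, 1.20, 0.90]  # hypothetical: +10%, +20%, -10%

average_factor = statistics.geometric_mean(growth_factors)

# Definition: n-th root of the product of n numbers.
by_definition = math.prod(growth_factors) ** (1 / len(growth_factors))

# Applying the average factor every year reproduces the total growth.
compounded = average_factor ** len(growth_factors)

print(math.isclose(average_factor, by_definition))          # True
print(math.isclose(compounded, math.prod(growth_factors)))  # True
```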

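statistics.quantiles(data, *, n=4, method='exclusive') (Python 3.8+)

Divides the data into n continuous intervals with equal probability and returns a list of the n - 1 cut points separating them. The default n=4 gives quartiles; n=10 gives deciles and n=100 gives percentiles.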

data = [10, 20, 30, 40, 50, 60]
print(statistics.quantiles(data, n=4))
# Output: [17.5, 35.0, 52.5]

Methods for quantiles:

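  • 'exclusive' (default): Use when the data is a sample from a population that may contain values more extreme than those in the sample.
  • 'inclusive': Use when the data describes a whole population, or when the sample is known to include the population's most extreme values. The minimum is treated as the 0th percentile and the maximum as the 100th percentile.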
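To see how the method argument changes the cut points, here is a short comparison on a small illustrative dataset. The `scores` list is not from the text above.

```python
import statistics

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # illustrative data

# Default: allows for population values beyond the observed range.
exclusive_cuts = statistics.quantiles(scores, n=4)

# Inclusive: treats the min and max as the 0th and 100th percentiles.
inclusive_cuts = statistics.quantiles(scores, n=4, method="inclusive")

print(exclusive_cuts)  # [2.5, 5.0, 7.5]
print(inclusive_cuts)  # [3.0, 5.0, 7.0]
```

The inclusive cut points sit closer to the middle because the method assumes no data lies outside the observed minimum and maximum.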

Error Handling

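Most functions raise statistics.StatisticsError when the data is empty. variance() and stdev() also raise it when given fewer than two data points, and harmonic_mean() raises it for negative values. multimode() is an exception: it returns an empty list for empty input. Non-numeric data passed to the numeric functions typically raises a TypeError instead.

Functions accept any sequence or iterable, including lists, tuples, and generators.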
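One way to handle these errors is to catch StatisticsError and fall back to a default. The helper `safe_stdev` below is hypothetical, and returning None is just one possible design choice.

```python
import statistics

def safe_stdev(data):
    """Hypothetical helper: return None when stdev() cannot be computed."""
    try:
        return statistics.stdev(data)
    except statistics.StatisticsError:
        # Raised for empty data and for fewer than two data points.
        return None

print(safe_stdev([]))              # None
print(safe_stdev([5]))             # None
print(safe_stdev([4, 7, 13, 16]))  # about 5.477
```

Catching only StatisticsError keeps unrelated problems, such as a TypeError from non-numeric input, visible instead of silently hiding them.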

Summary Table of Key Functions

Function                Purpose
mean(data)              Arithmetic mean
fmean(data)             Fast float mean (Python 3.8+)
median(data)            Median value
median_low(data)        Lower median value
median_high(data)       Higher median value
mode(data)              Most frequent value
multimode(data)         All most frequent values
variance(data)          Sample variance
pvariance(data)         Population variance
stdev(data)             Sample standard deviation
pstdev(data)            Population standard deviation
harmonic_mean(data)     Harmonic mean (for rates/ratios)
geometric_mean(data)    Geometric mean (for multiplicative datasets)
quantiles(data, n=4)    Quantiles (e.g., quartiles)

Example: Basic Statistical Summary in Python

import statistics

values = [15, 20, 25, 20, 30]

print("Mean:", statistics.mean(values))
print("Median:", statistics.median(values))
print("Mode:", statistics.mode(values))
print("Standard Deviation:", statistics.stdev(values))

# Output:
# Mean: 22
# Median: 20
# Mode: 20
# Standard Deviation: approximately 5.7009

Conclusion

The statistics module is a convenient built-in tool in Python for descriptive statistics without external dependencies. It is well suited to quick insights, teaching statistics, and prototyping data processing scripts.

For larger datasets and more advanced analytical capabilities, consider using libraries like NumPy or Pandas. However, for most basic to intermediate statistical needs, the statistics module is more than sufficient.

Interview Questions

  • What is the Python statistics module used for?
  • How do you calculate the mean and median using the statistics module?
  • What is the difference between mode() and multimode() functions?
  • How do you compute variance and standard deviation in Python using this module?
  • Explain the difference between sample variance and population variance in the context of the module's functions.
  • What is the harmonic mean and when would you typically use it?
  • How does geometric_mean() differ from the arithmetic mean()?
  • How can you compute quantiles or quartiles using the statistics module?
  • What kind of errors does the statistics module raise for invalid inputs?
  • When should you consider using NumPy or Pandas instead of the statistics module?