Python Statistics Module Tutorial for AI & Data Science
Master Python's `statistics` module for AI, ML, and data analysis. Learn to calculate averages, medians, standard deviations & more with this essential guide.
5.6 Python `statistics` Module Tutorial
The Python `statistics` module provides functions for common statistical calculations such as averages, medians, and standard deviations. It is particularly useful for data analysis and quick statistical evaluations without requiring external libraries.
Importing the Module
To use any function from the `statistics` module, you must first import it:
import statistics
1. Measures of Central Tendency
These functions help identify the central or "typical" value in a dataset.
statistics.mean(data)
Returns the arithmetic mean (average) of the given data.
values = [10, 15, 25, 30]
print(statistics.mean(values))
# Output: 20 (an int, because the inputs are ints and the result is a whole number)
statistics.fmean(data)
(Python 3.8+)
Converts the data to floats and returns the arithmetic mean as a float. It is faster than `mean()` and always returns a float, whereas `mean()` preserves the input type (int, Fraction, Decimal) where it can.
print(statistics.fmean([1, 2, 3, 4]))
# Output: 2.5
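The difference in return types between `mean()` and `fmean()` is easiest to see side by side. The following is a minimal sketch, assuming Python 3.8 or later; the sample values are made up for illustration.

```python
import statistics
from fractions import Fraction

ints = [10, 15, 25, 30]
fracs = [Fraction(1, 4), Fraction(3, 4)]

mean_int = statistics.mean(ints)      # keeps the int type when the result is whole
fmean_int = statistics.fmean(ints)    # always a float
mean_frac = statistics.mean(fracs)    # exact Fraction arithmetic
fmean_frac = statistics.fmean(fracs)  # float result

print(mean_int, fmean_int)    # 20 20.0
print(mean_frac, fmean_frac)  # 1/2 0.5
```

If exact results for Fraction or Decimal data matter, `mean()` is the safer choice; if speed matters and floats are acceptable, `fmean()` is usually preferable.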
statistics.median(data)
Returns the median or middle value. For datasets with an even number of elements, it returns the average of the two middle values.
print(statistics.median([5, 2, 8, 1, 7]))
# Output: 5
print(statistics.median([1, 3, 5, 7]))
# Output: 4.0
statistics.median_low(data)
Returns the lower of the two middle values in an even-sized dataset.
print(statistics.median_low([1, 3, 5, 7]))
# Output: 3
statistics.median_high(data)
Returns the higher of the two middle values in an even-sized dataset.
print(statistics.median_high([1, 3, 5, 7]))
# Output: 5
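To compare the three median functions on one dataset, here is a small sketch using made-up readings. It illustrates that `median()` can return a value that does not occur in the data, while `median_low()` and `median_high()` always return an actual data point.

```python
import statistics

readings = [7, 1, 5, 3]  # sorted: 1, 3, 5, 7

mid = statistics.median(readings)        # average of 3 and 5
low = statistics.median_low(readings)    # lower middle value
high = statistics.median_high(readings)  # upper middle value

print(mid, low, high)   # 4.0 3 5
print(mid in readings)  # False: 4.0 is not one of the readings
print(low in readings, high in readings)  # True True
```

This makes `median_low()` and `median_high()` a reasonable fit for discrete data where an interpolated value would be meaningless.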
statistics.mode(data)
Returns the most frequently occurring value.
print(statistics.mode([2, 3, 3, 5, 7, 3, 2]))
# Output: 3
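Note: In Python 3.8 and later, if several values tie for the highest frequency, `mode()` returns the first one encountered in the data. Earlier versions raised a `StatisticsError` in that case. Use `multimode()` to get every mode.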
statistics.multimode(data)
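Returns a list of all modes (values that occur with the highest frequency), in the order they first appear in the data. Returns an empty list if the data is empty. (Python 3.8+)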
print(statistics.multimode([1, 1, 2, 3, 3]))
# Output: [1, 3]
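The following sketch contrasts `mode()` and `multimode()` on data with a tie, assuming Python 3.8 or later. The `votes` list is a made-up example; both functions also work on non-numeric data such as strings.

```python
import statistics

votes = ["red", "blue", "blue", "red", "green"]  # "red" and "blue" tie

first_mode = statistics.mode(votes)      # first of the tied values in the data
all_modes = statistics.multimode(votes)  # every tied value, in first-seen order
no_modes = statistics.multimode([])      # empty input gives an empty list

print(first_mode)  # red
print(all_modes)   # ['red', 'blue']
print(no_modes)    # []
```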
2. Measures of Spread (Dispersion)
These functions indicate how much the data varies from the central value.
statistics.variance(data, xbar=None)
Returns the sample variance of the data. Sample variance is used when the data is a sample from a larger population.
data = [4, 7, 13, 16]
print(statistics.variance(data))
# Output: 30
statistics.pvariance(data, mu=None)
Returns the population variance. Population variance is used when the data represents the entire population.
print(statistics.pvariance([4, 7, 13, 16]))
# Output: 22.5
statistics.stdev(data, xbar=None)
Returns the sample standard deviation, which is the square root of the sample variance.
print(statistics.stdev([4, 7, 13, 16]))
# Output: 5.477225575051661
statistics.pstdev(data, mu=None)
Returns the population standard deviation, which is the square root of the population variance.
print(statistics.pstdev([4, 7, 13, 16]))
# Output: 4.743416490252569
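The sample and population versions differ only in the divisor: n - 1 for a sample, n for a population. The sketch below recomputes the values by hand for the same dataset and checks them against the module's functions. The variable names are illustrative.

```python
import math
import statistics

data = [4, 7, 13, 16]
n = len(data)
center = statistics.mean(data)                 # 10
sum_sq = sum((x - center) ** 2 for x in data)  # 36 + 9 + 9 + 36 = 90

sample_var = sum_sq / (n - 1)  # divide by n - 1
population_var = sum_sq / n    # divide by n

print(sample_var, population_var)                     # 30.0 22.5
print(sample_var == statistics.variance(data))        # True
print(population_var == statistics.pvariance(data))   # True
print(math.isclose(math.sqrt(sample_var), statistics.stdev(data)))       # True
print(math.isclose(math.sqrt(population_var), statistics.pstdev(data)))  # True

# A precomputed mean can be passed as xbar so it is not calculated again.
var_with_xbar = statistics.variance(data, xbar=center)
```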
3. Other Statistical Functions
statistics.harmonic_mean(data)
Useful for rates and ratios, this function returns the harmonic mean. The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the data.
print(statistics.harmonic_mean([40, 60]))
# Output: 48.0
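A common use of the harmonic mean is averaging speeds over equal distances. The sketch below uses a hypothetical trip with two 120 km legs to show that the harmonic mean matches total distance divided by total time, while the arithmetic mean overstates it.

```python
import statistics

# Hypothetical trip: two legs of equal distance at different speeds.
leg_km = 120
speeds_kmh = [40, 60]

total_distance = leg_km * len(speeds_kmh)         # 240 km
total_time = sum(leg_km / s for s in speeds_kmh)  # 3 h + 2 h = 5 h
true_average = total_distance / total_time

harmonic = statistics.harmonic_mean(speeds_kmh)
arithmetic = statistics.mean(speeds_kmh)

print(true_average)  # 48.0
print(harmonic)      # 48.0
print(arithmetic)    # 50, which overstates the real average speed
```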
statistics.geometric_mean(data)
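Used for multiplicative or exponential datasets, such as growth rates. The geometric mean is the n-th root of the product of n numbers. (Python 3.8+)
print(statistics.geometric_mean([10, 100]))
# Output: approximately 31.6228 (the square root of 1000; the last digits can vary between Python versions)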
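As an illustration of a multiplicative dataset, the sketch below averages a set of made-up yearly growth factors and cross-checks the result against the n-th root definition. It assumes Python 3.8 or later for both `geometric_mean()` and `math.prod()`.

```python
import math
import statistics

# Hypothetical yearly growth factors: +10%, +20%, then -10%.
growth_factors = [1.10, 1.20, 0.90]

average_factor = statistics.geometric_mean(growth_factors)

# Cross-check against the definition: n-th root of the product.
by_definition = math.prod(growth_factors) ** (1 / len(growth_factors))

# Applying the average factor every year gives the same overall growth.
compounded = average_factor ** len(growth_factors)

print(round(average_factor, 4))
print(math.isclose(average_factor, by_definition))            # True
print(math.isclose(compounded, math.prod(growth_factors)))    # True
```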
statistics.quantiles(data, *, n=4, method='exclusive')
Divides data into n intervals of equal probability and returns a list of the n - 1 cut points between them. The default, n=4, gives quartiles. (Python 3.8+)
data = [10, 20, 30, 40, 50, 60]
print(statistics.quantiles(data, n=4))
# Output: [17.5, 35.0, 52.5]
Methods for `quantiles`:
'exclusive' (default): Assumes the data is a sample from a population that may contain values more extreme than those in the sample.
'inclusive': Assumes the data is the whole population, or a sample that already contains the most extreme values. The minimum is treated as the 0th percentile and the maximum as the 100th.
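The effect of the `method` argument can be seen by running both options on the same data. This is a minimal sketch, assuming Python 3.8 or later.

```python
import statistics

data = [10, 20, 30, 40, 50, 60]

exclusive = statistics.quantiles(data, n=4)  # default method
inclusive = statistics.quantiles(data, n=4, method='inclusive')

print(exclusive)  # [17.5, 35.0, 52.5]
print(inclusive)  # [22.5, 35.0, 47.5]
```

The exclusive cut points sit further from the center because that method leaves room for unseen values beyond the smallest and largest data points.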
Error Handling
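Most functions in the `statistics` module raise a `StatisticsError` when the data is empty or too short, for example `mean()` on an empty list or `stdev()` with fewer than two values. `multimode()` is an exception and returns an empty list for empty data. Non-numeric values typically raise a `TypeError` instead.
Functions accept any iterable, including lists and tuples.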
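One way to handle these errors is to catch `StatisticsError` explicitly. The sketch below uses a hypothetical helper, `safe_mean`, that is not part of the module; it returns `None` when there is no data to average.

```python
import statistics

def safe_mean(values):
    """Return the mean, or None when the data is empty (hypothetical helper)."""
    try:
        return statistics.mean(values)
    except statistics.StatisticsError:
        return None

print(safe_mean([1, 2, 3]))  # 2
print(safe_mean([]))         # None

# stdev() needs at least two data points.
try:
    statistics.stdev([5])
except statistics.StatisticsError as exc:
    stdev_error = str(exc)
    print("Cannot compute stdev:", stdev_error)
```

Returning `None` is only one possible design; depending on the application, re-raising with a clearer message or supplying a default value may be a better fit.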
Summary Table of Key Functions
Function | Purpose |
---|---|
mean(data) | Arithmetic mean |
fmean(data) | Fast float mean (Python 3.8+) |
median(data) | Median value |
median_low(data) | Lower median value |
median_high(data) | Higher median value |
mode(data) | Most frequent value |
multimode(data) | All most frequent values |
variance(data) | Sample variance |
pvariance(data) | Population variance |
stdev(data) | Sample standard deviation |
pstdev(data) | Population standard deviation |
harmonic_mean(data) | Harmonic mean (for rates/ratios) |
geometric_mean(data) | Geometric mean (for multiplicative datasets) |
quantiles(data, n=4) | Cut points dividing the data into n intervals (e.g., quartiles) |
Example: Basic Statistical Summary in Python
import statistics
values = [15, 20, 25, 20, 30]
print("Mean:", statistics.mean(values))
print("Median:", statistics.median(values))
print("Mode:", statistics.mode(values))
print("Standard Deviation:", statistics.stdev(values))
# Output:
# Mean: 22
# Median: 20
# Mode: 20
# Standard Deviation: about 5.7009 (the square root of 32.5)
Conclusion
The `statistics` module is a built-in Python tool for common descriptive statistics, with no external dependencies. It is well suited to quick insights, teaching statistics, and prototyping data processing scripts.
For larger datasets and more advanced analysis, consider libraries like NumPy or Pandas. For basic to intermediate statistical needs, the `statistics` module is usually sufficient.
Interview Questions
- What is the Python `statistics` module used for?
- How do you calculate the mean and median using the `statistics` module?
- What is the difference between the `mode()` and `multimode()` functions?
- How do you compute variance and standard deviation in Python using this module?
- Explain the difference between sample variance and population variance in the context of the module's functions.
- What is the harmonic mean and when would you typically use it?
- How does `geometric_mean()` differ from the arithmetic `mean()`?
- How can you compute quantiles or quartiles using the `statistics` module?
- What kind of errors does the `statistics` module raise for invalid inputs?
- When should you consider using NumPy or Pandas instead of the `statistics` module?