Discrete Probability Distributions & SciPy for ML


Discrete Probability Distributions with SciPy

Discrete probability distributions are statistical models that describe random variables taking values in a finite or countably infinite set, typically the integers. They are fundamental tools in computer science, engineering, data analysis, and operations research for modeling counts such as experimental successes, random occurrences, or sampling outcomes.

The scipy.stats module in Python provides these distributions as ready-made objects. Each one can evaluate its Probability Mass Function (PMF) and Cumulative Distribution Function (CDF), generate random variates, and report summary statistics such as the mean and variance.
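To make the relationship between the PMF and the CDF concrete, here is a minimal sketch using a fair six-sided die. The die is an illustrative choice made here, not one of the distributions covered below; it is modeled with scipy.stats.randint, whose upper bound is exclusive.

```python
import numpy as np
from scipy.stats import randint

# A fair six-sided die: discrete uniform on {1, ..., 6}.
# randint's upper bound is exclusive, hence high = 7.
low, high = 1, 7
faces = np.arange(low, high)

pmf = randint.pmf(faces, low, high)  # P(X = k) for each face
cdf = randint.cdf(faces, low, high)  # P(X <= k) for each face

# The PMF sums to 1, and the CDF is the running total of the PMF.
print("PMF sums to 1:", np.isclose(pmf.sum(), 1.0))
print("CDF equals cumulative sum of PMF:", np.allclose(np.cumsum(pmf), cdf))
```

The same two checks should hold for any discrete distribution evaluated over its full support.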

Key Discrete Probability Distributions in SciPy

This section details several commonly used discrete probability distributions available in scipy.stats.

1. Binomial Distribution

The Binomial distribution models the number of successes in a fixed number of independent Bernoulli trials, where each trial has the same probability of success.

Use Cases:

  • Coin toss experiments
  • Quality control testing (e.g., number of defective items in a batch)

SciPy Object: scipy.stats.binom

Python Example: Binomial PMF & CDF

from scipy.stats import binom
import numpy as np
import matplotlib.pyplot as plt

# Parameters
n, p = 10, 0.5  # n = number of trials, p = probability of success

# Generate x values
x_values = np.arange(0, n + 1)

# Compute PMF and CDF
pmf_values = binom.pmf(x_values, n, p)
cdf_values = binom.cdf(x_values, n, p)

# Plotting
plt.figure(figsize=(12, 6))

# PMF Plot
plt.subplot(1, 2, 1)
plt.bar(x_values, pmf_values, alpha=0.7, color='blue')
plt.title('Binomial Distribution - PMF')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# CDF Plot
plt.subplot(1, 2, 2)
plt.step(x_values, cdf_values, color='red', where='post')  # the CDF jumps at each integer and holds until the next
plt.title('Binomial Distribution - CDF')
plt.xlabel('Number of Successes')
plt.ylabel('Cumulative Probability')
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()
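The quality control use case can be turned into a short calculation. The sketch below assumes a hypothetical batch of 20 items, each defective independently with probability 0.05; both numbers are made up for illustration. It uses pmf, cdf, and sf (the survival function, 1 - CDF).

```python
from scipy.stats import binom

# Hypothetical scenario: 20 items per batch, 5% defect probability per item.
n_items, p_defect = 20, 0.05

p_none = binom.pmf(0, n_items, p_defect)        # P(X = 0): no defective items
p_at_most_2 = binom.cdf(2, n_items, p_defect)   # P(X <= 2)
p_more_than_2 = binom.sf(2, n_items, p_defect)  # P(X > 2) = 1 - P(X <= 2)

print(f"P(no defects)        = {p_none:.4f}")
print(f"P(at most 2 defects) = {p_at_most_2:.4f}")
print(f"P(more than 2)       = {p_more_than_2:.4f}")
```

Using sf instead of computing 1 - cdf by hand tends to be more accurate when the tail probability is very small.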

2. Poisson Distribution

The Poisson distribution describes the probability of a given number of events occurring in a fixed interval of time or space, assuming events happen independently at a constant average rate.

Use Cases:

  • Queueing theory (e.g., number of customers arriving at a service point)
  • Telecommunications (e.g., number of calls received by a call center)
  • Traffic flow analysis (e.g., number of cars passing a point on a highway)

SciPy Object: scipy.stats.poisson

Python Example: Poisson PMF & CDF

from scipy.stats import poisson
import numpy as np
import matplotlib.pyplot as plt

# Parameters
mu = 3  # lambda (average rate of events)

# Generate x values
x_values = np.arange(0, 15)

# Compute PMF and CDF
pmf_values = poisson.pmf(x_values, mu)
cdf_values = poisson.cdf(x_values, mu)

# Plotting
plt.figure(figsize=(12, 6))

# PMF Plot
plt.subplot(1, 2, 1)
plt.bar(x_values, pmf_values, alpha=0.7, color='blue')
plt.title('Poisson Distribution - PMF')
plt.xlabel('Number of Events')
plt.ylabel('Probability')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# CDF Plot
plt.subplot(1, 2, 2)
plt.step(x_values, cdf_values, color='red', where='post')  # the CDF jumps at each integer and holds until the next
plt.title('Poisson Distribution - CDF')
plt.xlabel('Number of Events')
plt.ylabel('Cumulative Probability')
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()
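As a worked example of the call center use case, the sketch below assumes an average of 3 calls per minute (the same mu as above, with the per-minute interval being an assumption made here). It computes the probability of a busy minute and uses ppf, the inverse of the CDF, to find a call count that is exceeded only rarely.

```python
from scipy.stats import poisson

mu = 3  # assumed average number of calls per minute

# P(X >= 5) = P(X > 4), computed with the survival function
p_at_least_5 = poisson.sf(4, mu)

# Smallest k such that P(X <= k) >= 0.95
capacity = poisson.ppf(0.95, mu)

print(f"P(at least 5 calls in a minute) = {p_at_least_5:.4f}")
print(f"Calls per minute covered 95% of the time: {int(capacity)}")
```

Note the off-by-one in the first call: because sf(k) is P(X > k), "at least 5" requires sf(4), not sf(5).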

3. Geometric Distribution

The Geometric distribution models the number of independent Bernoulli trials required to achieve the first success.

Use Cases:

  • Reliability testing (e.g., number of tests until a component fails)
  • Survival analysis (e.g., number of trials until an event occurs)

SciPy Object: scipy.stats.geom

Python Example: Geometric PMF & CDF

from scipy.stats import geom
import numpy as np
import matplotlib.pyplot as plt

# Parameters
p = 0.3  # Probability of success on a single trial

# Generate x values (number of trials to first success, starts from 1)
x_values = np.arange(1, 11)

# Compute PMF and CDF
pmf_values = geom.pmf(x_values, p)
cdf_values = geom.cdf(x_values, p)

# Plotting
plt.figure(figsize=(12, 6))

# PMF Plot
plt.subplot(1, 2, 1)
plt.bar(x_values, pmf_values, alpha=0.7, color='blue')
plt.title('Geometric Distribution - PMF')
plt.xlabel('Number of Trials to First Success')
plt.ylabel('Probability')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# CDF Plot
plt.subplot(1, 2, 2)
plt.step(x_values, cdf_values, color='red', where='post')  # the CDF jumps at each integer and holds until the next
plt.title('Geometric Distribution - CDF')
plt.xlabel('Number of Trials to First Success')
plt.ylabel('Cumulative Probability')
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()
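The geometric distribution has simple closed forms, which makes it easy to check what scipy.stats.geom computes. The sketch below compares SciPy's values with P(X = k) = (1 - p)^(k - 1) p and P(X <= k) = 1 - (1 - p)^k, and shows how the loc argument shifts the support when you want to count failures before the first success instead of trials.

```python
import numpy as np
from scipy.stats import geom

p = 0.3
k = np.arange(1, 11)

# Closed forms for "number of trials up to and including the first success"
pmf_manual = (1 - p) ** (k - 1) * p
cdf_manual = 1 - (1 - p) ** k

print("PMF matches:", np.allclose(pmf_manual, geom.pmf(k, p)))
print("CDF matches:", np.allclose(cdf_manual, geom.cdf(k, p)))
print("Expected number of trials (1/p):", geom.mean(p))

# Shifting by loc=-1 moves the support to 0, 1, 2, ...,
# i.e. the number of failures before the first success.
p_zero_failures = geom.pmf(0, p, loc=-1)
print("P(0 failures before first success):", p_zero_failures)
```

The two conventions (support starting at 1 versus 0) are a common source of off-by-one errors when comparing results across libraries or textbooks.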

Working with Discrete Distributions in SciPy

The scipy.stats module provides a consistent interface across its probability distributions, including the discrete ones. Key methods include:

Method                 Purpose
pmf(x, params)         Computes the Probability Mass Function (the probability of exactly the value x).
cdf(x, params)         Computes the Cumulative Distribution Function (the probability of observing a value less than or equal to x).
rvs(params, size=N)    Generates N random samples from the distribution.
mean(params)           Computes the mean (expected value) of the distribution.
var(params)            Computes the variance of the distribution.
std(params)            Computes the standard deviation of the distribution.
median(params)         Computes the median of the distribution.

params refers to the distribution-specific parameters (e.g., n, p for binomial, mu for Poisson).
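Passing the parameters on every call gets repetitive. A common alternative is to "freeze" a distribution by calling it with its parameters once, after which the methods take no parameter arguments. The sketch below does this for a binomial with n = 10 and p = 0.5 and also draws reproducible samples with rvs; the seed value 42 is arbitrary.

```python
import numpy as np
from scipy.stats import binom

# Freeze the distribution: the parameters are fixed once here.
dist = binom(10, 0.5)

print("P(X = 5):  ", dist.pmf(5))
print("P(X <= 5): ", dist.cdf(5))
print("mean, var: ", dist.mean(), dist.var())
print("std, median:", dist.std(), dist.median())

# Draw samples reproducibly by passing a seeded NumPy generator.
rng = np.random.default_rng(42)
samples = dist.rvs(size=10_000, random_state=rng)

# Sample statistics should land close to the theoretical values above.
print("sample mean:", samples.mean())
print("sample var: ", samples.var())
```

With 10,000 draws the sample mean and variance will typically agree with the theoretical values to within a few percent, though the exact numbers depend on the seed.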

Example: Calculating Mean and Variance of a Poisson Distribution

from scipy.stats import poisson

# Parameters
mu = 3  # Average rate of events

# Calculate mean and variance
mean_val = poisson.mean(mu)
variance_val = poisson.var(mu)

print(f"Mean of Poisson Distribution (mu={mu}): {mean_val}")
print(f"Variance of Poisson Distribution (mu={mu}): {variance_val}")

Output:

Mean of Poisson Distribution (mu=3): 3.0
Variance of Poisson Distribution (mu=3): 3.0
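The mean and variance coincide because both equal mu for a Poisson distribution. If several moments are needed at once, the stats method returns them in a single call. The sketch below requests the mean, variance, skewness, and excess kurtosis and compares the last two with their known closed forms for the Poisson distribution, 1/sqrt(mu) and 1/mu.

```python
import math
from scipy.stats import poisson

mu = 3

# moments='mvsk' requests: m = mean, v = variance,
# s = skewness, k = excess (Fisher) kurtosis
mean_val, var_val, skew_val, kurt_val = poisson.stats(mu, moments='mvsk')

print(f"mean = {float(mean_val)}, variance = {float(var_val)}")
print(f"skewness = {float(skew_val):.4f}, 1/sqrt(mu) = {1 / math.sqrt(mu):.4f}")
print(f"excess kurtosis = {float(kurt_val):.4f}, 1/mu = {1 / mu:.4f}")
```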

Conclusion

The scipy.stats module gives discrete distributions a uniform interface: the same pmf, cdf, rvs, and summary-statistic methods work for the binomial, Poisson, and geometric distributions covered here. That consistency makes it straightforward to model the number of successes in a set of trials, the number of trials until a first success, or the number of events in an interval, and to move between exact probabilities and simulated data.