Histogram: Visualize Data Distribution with Matplotlib

Learn to visualize numerical data distribution with histograms in Python using Matplotlib. Understand bins, frequencies, and customize your plots for machine learning insights.

Histogram with Matplotlib

A histogram is a powerful graphical representation used to visualize the distribution of numerical data. It partitions the data into a series of intervals, called "bins," and then counts how many data points fall into each bin. The height of each bar in the histogram corresponds to the frequency or count of data points within that bin.
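
The binning step can be reproduced without drawing anything. The sketch below uses NumPy's np.histogram on a small made-up dataset (the values and the choice of four bins are assumptions for illustration) to show the bin edges and the count that each bar would represent.

```python
import numpy as np

# Small illustrative dataset (made up for this sketch)
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# Split the range [1, 5] into 4 equal-width bins and count the values in each
counts, edges = np.histogram(data, bins=4)

print(edges)   # [1. 2. 3. 4. 5.]
print(counts)  # [1 2 3 3]
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both. That is why the value 5 is counted together with the two 4s in the final bin.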

Matplotlib's hist() function provides a versatile way to create and customize histograms, enabling users to:

  • Visualize Data Distributions: Understand the shape, center, and spread of your data.
  • Customize Binning: Control the number and range of bins to refine the data representation.
  • Adjust Aesthetics: Modify colors, edge styles, and transparency for clarity and visual appeal.
  • Create Advanced Histograms: Generate cumulative and stacked histograms for comparative analysis.

1. Creating a Basic Histogram

The most fundamental histogram can be created by providing your dataset to the plt.hist() function.

Syntax:

plt.hist(x, bins=None, range=None, density=False, cumulative=False, color=None, edgecolor=None, ...)

Key Parameters:

  • x: The input data for the histogram. This can be a 1D array-like object, or a sequence of arrays when plotting several datasets at once.
  • bins:
    • If an integer, it defines the number of equal-width bins.
    • If a sequence, it specifies the bin edges.
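    • If a string, it names a NumPy binning strategy such as 'auto', 'sturges', or 'fd'.
    • If None (default), Matplotlib uses rcParams["hist.bins"], which is 10 unless you change it.
  • range: The lower and upper range of the bins. Values outside this range are ignored. If None (default), the range is determined from the minimum and maximum values of the input data x. It has no effect when bins is a sequence of edges.
  • density: If True, the bar heights are normalized to form a probability density: each count is divided by the total number of observations times the bin width, so the area under the histogram (the sum of bar height times bin width) equals 1.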
  • cumulative: If True, each bin's value will be the sum of the counts of all preceding bins, plus its own count.
  • color: The fill color for the histogram bars.
  • edgecolor: The color of the edges of the histogram bars.
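
As a rough illustration of how bins and range change the result, the sketch below draws the same small dataset three ways and prints the counts that hist() returns. The Agg backend line is an assumption made so the code runs without a display; remove it and add plt.show() to view the figure interactively.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

x = [1, 2, 3, 1, 2, 3, 4, 1, 3, 4, 5, 2, 3, 4, 5, 5, 1]

fig, axes = plt.subplots(1, 3, figsize=(10, 3))

# bins as an integer: 5 equal-width bins between min(x) and max(x)
n_int, edges_int, _ = axes[0].hist(x, bins=5, edgecolor='black')

# bins as a sequence: explicit edges, here centered on each integer value
n_seq, edges_seq, _ = axes[1].hist(x, bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5], edgecolor='black')

# range: only values in [1, 4] are binned, so the three 5s are dropped
n_rng, edges_rng, _ = axes[2].hist(x, bins=3, range=(1, 4), edgecolor='black')

print(n_int)  # [4. 3. 4. 3. 3.]
print(n_seq)  # [4. 3. 4. 3. 3.]
print(n_rng)  # [4. 3. 7.]
```

The first two calls give the same counts because each bin happens to contain exactly one integer value. In the third call the last bin covers [3, 4], so it collects both the 3s and the 4s.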

Example: Creating a Basic Vertical Histogram

import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 1, 2, 3, 4, 1, 3, 4, 5, 2, 3, 4, 5, 5, 1]

# Create a histogram
plt.hist(x, bins=5, edgecolor='black') # Specifying 5 bins for clarity

# Add labels and title
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Basic Vertical Histogram')

# Display the plot
plt.show()

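Output: This code will display a vertical histogram with five bars. Because the data contains only the integers 1 through 5 and the five bins span that range, each bar shows the frequency of one distinct value in the x list.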


2. Creating a Customized Histogram with Density

When density=True, the bar heights are scaled so that the total area of the bars equals 1. The histogram then approximates the probability density function (PDF) of the data, which makes it useful for comparing distributions irrespective of the number of data points.

Example: Normalized Histogram with Custom Colors

import matplotlib.pyplot as plt
import numpy as np

# Generate random data from a normal distribution
np.random.seed(42) # for reproducibility
data = np.random.randn(1000)

# Create a histogram with density, custom color, and edge color
plt.hist(data, bins=30, density=True, color='green', edgecolor='black', alpha=0.7)

# Add labels and title
plt.xlabel('Values')
plt.ylabel('Probability Density')
plt.title('Customized Histogram with Density')

# Display the plot
plt.show()

Output: This will render a histogram where the y-axis represents probability density. The bars are green with black edges and have a slight transparency (alpha=0.7).
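
To make the normalization concrete, here is a minimal NumPy-only sketch that checks the two properties described above: the bar areas sum to 1, and each density value equals the count divided by (number of points times bin width). The generator seed and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.standard_normal(1000)

# Density-normalized bar heights and the bin edges
density, edges = np.histogram(data, bins=30, density=True)
widths = np.diff(edges)

# The area of all bars (height * width) should add up to 1
area = np.sum(density * widths)
print(round(area, 6))  # 1.0

# The same heights computed by hand from the raw counts
counts, _ = np.histogram(data, bins=30)
manual = counts / (counts.sum() * widths)
print(np.allclose(density, manual))  # True
```

Note that individual bar heights can exceed 1 when bins are narrow; only the total area is constrained.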


3. Creating a Cumulative Histogram

A cumulative histogram displays the number of data points that are less than or equal to a given value. This is achieved by setting cumulative=True.

Example: Cumulative Histogram of Exam Scores

import matplotlib.pyplot as plt
import numpy as np

# Generate 150 random integer exam scores from 0 to 99 (the upper bound 100 is exclusive)
np.random.seed(42)
exam_scores = np.random.randint(0, 100, 150)

# Create a cumulative histogram
plt.hist(exam_scores, bins=10, cumulative=True, color='orange', edgecolor='black', alpha=0.7)

# Add labels and title
plt.xlabel('Exam Scores')
plt.ylabel('Cumulative Number of Students')
plt.title('Cumulative Histogram of Exam Scores')

# Display the plot
plt.show()

Output: The resulting plot will show bars where the height of each bar represents the total count of exam scores up to and including that bin's upper edge.
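
The cumulative bar heights are simply a running total of the ordinary bin counts. The sketch below checks this by comparing np.cumsum over the plain counts with the values hist() returns for cumulative=True. The Agg backend line and the seed are assumptions made so the example runs headless and reproducibly.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
exam_scores = rng.integers(0, 100, size=150)  # 150 integer scores from 0 to 99

# Ordinary per-bin counts, then their running total
counts, edges = np.histogram(exam_scores, bins=10)
running_total = np.cumsum(counts)

# Bar heights of the cumulative histogram
fig, ax = plt.subplots()
n_cum, _, _ = ax.hist(exam_scores, bins=10, cumulative=True)

print(running_total[-1])                  # 150
print(np.allclose(n_cum, running_total))  # True
```

The last bar always reaches the total number of observations, which is a quick sanity check for any cumulative histogram.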


4. Customizing Histogram Colors and Edge Colors

You can easily control the appearance of the histogram bars by specifying color for the fill and edgecolor for the bar borders.

Example: Histogram with Custom Fill and Edge Colors

import matplotlib.pyplot as plt
import numpy as np

# Generate random data
np.random.seed(42)
data = np.random.randn(1000)

# Create a histogram with custom colors
plt.hist(data, bins=25, color='purple', edgecolor='blue')

# Add labels and title
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram with Different Color and Edge Color')

# Display the plot
plt.show()

Output: This example demonstrates a histogram with purple-filled bars and blue edges, providing a distinct visual style.


5. Using Colormaps for Histogram Colors

For more advanced visualization, you can apply a colormap to the histogram bars, where the color of each bar is determined by its value or position.

Example: Applying a Colormap to Histogram Bars

import numpy as np
from matplotlib import pyplot as plt

# Generate random data
np.random.seed(42)
data = np.random.random(1000)

# Create the histogram and keep the bar heights (n), bin edges, and patches
n, bins, patches = plt.hist(data, bins=25, density=True, color='red', rwidth=0.75)

# Min-max scale the bar heights to [0, 1] so they can index the colormap
col = (n - n.min()) / (n.max() - n.min())
# Get a colormap (e.g., RdYlBu - Red to Yellow to Blue).
# plt.cm.get_cmap() was deprecated in Matplotlib 3.7 and removed in 3.9;
# plt.get_cmap() works across versions.
cm = plt.get_cmap('RdYlBu')

# Apply the colormap to each bar, replacing the initial red fill
for c, p in zip(col, patches):
    p.set_facecolor(cm(c))

# Add labels and title
plt.xlabel('Values')
plt.ylabel('Probability Density')
plt.title('Histogram with Colormap')

# Display the plot
plt.show()

Output: This code will display a histogram where the bars are colored according to a colormap, creating a gradient effect based on the density of data in each bin.
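
The scaling step in this example divides by n.max() - n.min(), which is zero when every bar has the same height. One possible guard is sketched below; scale_to_unit is a hypothetical helper name introduced here for illustration, not a Matplotlib or NumPy function.

```python
import numpy as np

def scale_to_unit(values):
    """Min-max scale values to [0, 1]; return zeros if all values are equal."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All bars have the same height: map every bar to the same color
        return np.zeros_like(values)
    return (values - values.min()) / span

print(scale_to_unit([2, 4, 6]).tolist())  # [0.0, 0.5, 1.0]
print(scale_to_unit([3, 3, 3]).tolist())  # [0.0, 0.0, 0.0]
```

The scaled values can then be passed to the colormap exactly as col is in the example above.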


6. Assigning Random Colors to Histogram Bars

You can assign a unique random color to each bar of the histogram for a more visually distinct representation.

Example: Setting Random Colors for Each Bar

import numpy as np
import matplotlib.pyplot as plt
import random

# Set figure size for better display
plt.rcParams["figure.figsize"] = [7.50, 3.50]
plt.rcParams["figure.autolayout"] = True

# Generate random data
np.random.seed(42)
random.seed(42)  # also seed the standard-library RNG so the bar colors are reproducible
data = np.random.rand(100)

# Create histogram
fig, ax = plt.subplots()
N, bins, patches = ax.hist(data, bins=10, edgecolor='black', linewidth=1)

# Function to generate a random hex color
def random_color():
    return "#" + ''.join(random.choices("0123456789ABCDEF", k=6))

# Assign random colors to bars
for patch in patches:
    patch.set_facecolor(random_color())

# Add labels and title
ax.set_xlabel('Values')
ax.set_ylabel('Frequency')
ax.set_title('Histogram with Randomly Colored Bars')

# Display the plot
plt.show()

Output: The histogram will be displayed with each bar having a randomly assigned color.


7. Creating a Stacked Histogram with Multiple Datasets

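Stacked histograms show several datasets in a single plot using one shared set of bins. Within each bin, the bar for each dataset is drawn on top of the previous one, so the total bar height is the combined count across datasets. This makes the overall distribution and each dataset's contribution to it easy to read.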

Example: Stacked Histogram Comparing Two Datasets

import matplotlib.pyplot as plt
import numpy as np

# Sample data for two datasets
np.random.seed(42)
data1 = np.random.normal(loc=5, scale=2, size=200)
data2 = np.random.normal(loc=8, scale=3, size=250)

# Create a stacked histogram; label= ties each legend entry to its dataset
plt.hist([data1, data2], bins=15, stacked=True, color=['skyblue', 'salmon'],
         edgecolor='black', label=['Dataset 1', 'Dataset 2'])

# Add labels, title, and legend
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Stacked Histogram with Multiple Datasets')
plt.legend()

# Display the plot
plt.show()

Output: This will generate a histogram where the frequencies of data1 and data2 are stacked for each bin, allowing for a visual comparison of their distributions.
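
Stacking only makes sense when both datasets are counted over the same bin edges. The NumPy-only sketch below illustrates the arithmetic behind the stacked bars under that assumption: it derives shared edges from the combined data, counts each dataset separately, and adds the counts per bin. The seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
data1 = rng.normal(loc=5, scale=2, size=200)
data2 = rng.normal(loc=8, scale=3, size=250)

# Shared bin edges spanning both datasets
edges = np.histogram_bin_edges(np.concatenate([data1, data2]), bins=15)

# Per-dataset counts over the same edges
counts1, _ = np.histogram(data1, bins=edges)
counts2, _ = np.histogram(data2, bins=edges)

# Height of each stacked bar: dataset 1 at the bottom, dataset 2 on top
stack_tops = counts1 + counts2

print(counts1.sum(), counts2.sum(), stack_tops.sum())  # 200 250 450
```

In the plot, counts1 corresponds to the lower (skyblue) segment of each bar and stack_tops to the top of the upper (salmon) segment.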


8. Use Cases for Histograms

Histograms are widely used across various fields for their ability to reveal underlying data patterns:

  • Data Distribution Analysis: Quickly understand the shape (e.g., symmetric, skewed, unimodal, bimodal), central tendency, and variability of a dataset.
  • Identifying Patterns and Trends: Spot clusters, outliers, and unusual patterns in data frequencies.
  • Comparing Multiple Datasets: Stacked or side-by-side histograms are effective for comparing distributions of different groups or conditions.
  • Probability Density Estimation: Normalized histograms (density=True) serve as an estimate of the probability density function, crucial in statistical modeling and machine learning.
  • Quality Control: Monitor process variations and identify deviations from expected distributions.

9. Further Customization Options

Beyond basic aesthetics, Matplotlib offers deeper customization:

  • Bin Size and Range: Experiment with different numbers of bins or explicitly define bin edges (bins parameter) to control the granularity and focus on specific data ranges.
  • Color and Transparency: Utilize the color and alpha (transparency) parameters to adjust the visual density and distinguish overlapping data.
  • Edge Styles: Modify edgecolor and linewidth for bar borders to improve visual separation.
  • Logarithmic Scale: When bin counts span several orders of magnitude, use a logarithmic y-axis (e.g., plt.yscale('log'), or pass log=True to hist()) so that bins with small counts remain visible next to very tall bars.
  • Orientation: While vertical is standard, you can create horizontal histograms using orientation='horizontal'.
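
Two of the options above, the logarithmic count axis and the horizontal orientation, can be sketched as follows. The skewed sample data, the bin count, and the colors are arbitrary choices for illustration, and the Agg backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Right-skewed sample data, so a few bins hold most of the points
rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

fig, (ax_log, ax_horiz) = plt.subplots(1, 2, figsize=(9, 3.5))

# Logarithmic count axis: sparse tail bins stay visible
ax_log.hist(data, bins=40, edgecolor='black', log=True)
ax_log.set_xlabel('Values')
ax_log.set_ylabel('Frequency (log scale)')
ax_log.set_title('log=True')

# Horizontal orientation: values on the y-axis, counts on the x-axis
n, edges, patches = ax_horiz.hist(data, bins=40, orientation='horizontal',
                                  color='teal', alpha=0.7)
ax_horiz.set_xlabel('Frequency')
ax_horiz.set_ylabel('Values')
ax_horiz.set_title("orientation='horizontal'")

fig.tight_layout()
```

With orientation='horizontal' the axis labels swap roles, so remember to label the x-axis as the frequency.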

Conclusion

Matplotlib's hist() function is a versatile tool for exploring and presenting numerical data distributions. By leveraging its various parameters for binning, coloring, stacking, and normalization, you can create informative and visually appealing histograms tailored to your specific analytical needs.