Hierarchical Clustering in SciPy: A Comprehensive Guide
Hierarchical Clustering is a powerful unsupervised learning technique used in data science and machine learning to build nested clusters. This method iteratively merges or splits data points based on their similarities, making it ideal for exploratory data analysis, pattern recognition, and grouping datasets without a predefined number of clusters.
The scipy.cluster.hierarchy module in SciPy provides robust and efficient functions for implementing hierarchical clustering in Python.
What is Hierarchical Clustering?
Hierarchical clustering constructs a hierarchy of clusters in one of two ways:
- Agglomerative Approach (Bottom-Up): This approach starts with each data point as its own cluster. Clusters are then progressively merged based on their similarity until a single cluster containing all data points is formed.
- Divisive Approach (Top-Down): This approach begins with all data points in a single cluster. The algorithm then recursively splits clusters into smaller ones until each data point is in its own cluster or a stopping criterion is met.
A key advantage of hierarchical clustering is that it does not require the number of clusters to be specified in advance. The results are often visualized using a dendrogram, a tree-like diagram that illustrates the sequence of merges or splits.
Types of Hierarchical Clustering in SciPy
SciPy's hierarchy module implements the agglomerative approach to hierarchical clustering. The divisive approach is not provided directly, but it can be implemented manually.
1. Agglomerative Hierarchical Clustering
This is the most common form of hierarchical clustering.
- Approach: Bottom-up.
- Process:
- Each data point starts as a separate cluster.
- Clusters are merged step-by-step based on a chosen distance metric and linkage criterion.
- SciPy Implementation: Primarily uses the linkage() function.
2. Divisive Hierarchical Clustering
This approach is less common and is not directly provided by SciPy's hierarchy module.
- Approach: Top-down.
- Process:
- All data points start in a single cluster.
- The algorithm recursively splits clusters until a desired structure is achieved.
- SciPy Implementation: Not directly available; requires manual implementation.
Agglomerative Hierarchical Clustering in SciPy
The process of agglomerative hierarchical clustering in SciPy involves three main steps: linkage computation, dendrogram visualization, and forming flat clusters.
Step 1: Linkage Computation
The linkage() function computes the hierarchical clustering. It calculates the distances between clusters and merges them based on a specified linkage method. Common linkage methods include:
- Single Linkage: Merges clusters based on the minimum distance between points in the two clusters.
- Complete Linkage: Merges clusters based on the maximum distance between points in the two clusters.
- Average Linkage: Merges clusters based on the average distance between all pairs of points in the two clusters.
- Ward's Method: Merges clusters to minimize the total within-cluster variance.
Syntax:
scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean')
- y: A condensed distance matrix (as produced by scipy.spatial.distance.pdist) or a 2-D NumPy array of observations.
- method: The linkage criterion to use (e.g., 'single', 'complete', 'average', 'ward'). Defaults to 'single'.
- metric: The distance metric to use (e.g., 'euclidean', 'cityblock', 'cosine'). Defaults to 'euclidean'.
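As noted for y above, you can pass either raw observations or a precomputed condensed distance matrix from scipy.spatial.distance.pdist; for the same metric the two forms give the same result. A minimal sketch:
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

data = np.random.rand(10, 2)

# Condensed pairwise distance vector of length n*(n-1)/2
condensed = pdist(data, metric='euclidean')

# Passing the condensed distances is equivalent to passing the raw
# observations with the same metric, since linkage computes pdist internally.
Z_from_distances = linkage(condensed, method='average')
Z_from_points = linkage(data, method='average', metric='euclidean')
print(np.allclose(Z_from_distances, Z_from_points))  # should print True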
Example:
import numpy as np
from scipy.cluster.hierarchy import linkage
# Sample data (10 samples, 2 features)
data = np.random.rand(10, 2)
# Compute linkage matrix using Ward's method
Z = linkage(data, method='ward')
print("Linkage Matrix (Z):")
print(Z)
The linkage matrix Z is a NumPy array in which each row records one merge: [cluster_i, cluster_j, distance, num_points_in_new_cluster].
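Because the third column of Z records the distance at which each merge happens, a quick way to see how the choice of linkage method affects the hierarchy is to compare the height of the final merge across methods. A minimal sketch:
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
data = rng.random((10, 2))

# Compare the merge heights produced by different linkage criteria
for method in ('single', 'complete', 'average', 'ward'):
    Z = linkage(data, method=method)
    # Z[-1, 2] is the distance at which the last two clusters were merged
    print(f"{method:>8}: final merge height = {Z[-1, 2]:.3f}")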
Step 2: Dendrogram Visualization
The dendrogram() function visualizes the hierarchical clustering results as a dendrogram. This plot shows how clusters are merged at different levels of similarity or distance.
Syntax:
scipy.cluster.hierarchy.dendrogram(Z, **kwargs)
- Z: The linkage matrix computed by linkage().
- **kwargs: Additional arguments for customization (e.g., orientation, labels).
Example:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
import numpy as np
# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [1.5, 2.5], [5.5, 6.5]])
# Compute linkage matrix
Z = linkage(data, method='ward')
# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(
    Z,
    leaf_rotation=90.,  # Rotate x-axis labels for better readability
    leaf_font_size=8.,  # Font size for x-axis labels
)
plt.title('Dendrogram of Hierarchical Clustering')
plt.xlabel('Sample Index or Cluster Size')
plt.ylabel('Distance')
plt.tight_layout() # Adjust layout to prevent labels overlapping
plt.show()
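For larger datasets, a dendrogram with one leaf per observation quickly becomes unreadable. The dendrogram() function supports truncation via truncate_mode; the values below (p=12 merges) are illustrative, not a recommendation:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
data = rng.random((200, 2))
Z = linkage(data, method='ward')

plt.figure(figsize=(10, 5))
# Show only the last 12 merged clusters; contracted branches appear as crosses
dendrogram(Z, truncate_mode='lastp', p=12, show_contracted=True)
plt.title('Truncated Dendrogram (last 12 merges)')
plt.ylabel('Distance')
plt.tight_layout()
plt.show()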
Step 3: Forming Flat Clusters
The fcluster() function extracts flat clusters from the hierarchical structure. You can specify a criterion for cutting the dendrogram, such as a maximum distance threshold (t with criterion='distance') or a desired number of clusters (t with criterion='maxclust').
Syntax:
scipy.cluster.hierarchy.fcluster(Z, t, criterion='inconsistent')
- Z: The linkage matrix.
- t: The threshold or number of clusters.
- criterion: The criterion for forming flat clusters. Defaults to 'inconsistent'. Common values:
  - 'maxclust': Form at most t clusters.
  - 'distance': Form clusters so that observations in the same flat cluster have a cophenetic distance no greater than t.
  - 'monocrit': Form clusters based on a user-supplied monotone criterion array.
Example:
from scipy.cluster.hierarchy import linkage, fcluster
import numpy as np
# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [1.5, 2.5], [5.5, 6.5]])
# Compute linkage matrix
Z = linkage(data, method='ward')
# Form flat clusters: example 1 - form 3 clusters
clusters_maxclust = fcluster(Z, t=3, criterion='maxclust')
print(f"Cluster assignments (maxclust=3): {clusters_maxclust}")
# Form flat clusters: example 2 - cut at distance 3.0
clusters_distance = fcluster(Z, t=3.0, criterion='distance')
print(f"Cluster assignments (distance=3.0): {clusters_distance}")
Example Output:
Cluster assignments (maxclust=3): [1 1 2 3 1 2]
Cluster assignments (distance=3.0): [1 1 2 3 1 2]
(Note: the actual labels and groupings can differ from the above, especially on small datasets like this one where tied merge distances affect how linkage and fcluster break ties.)
Divisive Hierarchical Clustering (Top-Down Clustering)
Divisive clustering starts with all data points in a single cluster and recursively splits them into smaller clusters. While SciPy's hierarchy module does not offer a direct function for this, it can be implemented manually. A common strategy splits a cluster in two based on some criterion, such as each point's distance from the cluster centroid (the example below splits at the median distance).
Manual Implementation Example in Python:
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
def split_cluster(cluster):
    """
    Splits a cluster into two based on distance from the centroid.
    A simple heuristic: points at or below the median distance from the
    centroid go to one cluster, the rest go to the other.
    """
    if len(cluster) < 2:
        return cluster, np.array([])  # Cannot split a single point

    # Calculate the centroid of the cluster
    centroid = np.mean(cluster, axis=0)

    # Calculate the distance of each point to the centroid
    distances = cdist(cluster, centroid.reshape(1, -1), metric='euclidean').ravel()

    # Split at the median distance from the centroid
    median_distance = np.median(distances)
    cluster1 = cluster[distances <= median_distance]
    cluster2 = cluster[distances > median_distance]

    # Fall back to a simple half split if one side ends up empty
    if len(cluster1) == 0 or len(cluster2) == 0:
        cluster1 = cluster[:len(cluster) // 2]
        cluster2 = cluster[len(cluster) // 2:]

    return cluster1, cluster2
def divisive_clustering(data, n_clusters):
    """
    Performs divisive clustering by recursively splitting the largest cluster.
    """
    clusters = [data]  # Start with all data in one cluster

    # Keep splitting until we have the desired number of clusters
    while len(clusters) < n_clusters:
        # Find the largest cluster and remove it from the list
        largest_cluster_index = np.argmax([len(c) for c in clusters])
        largest_cluster = clusters.pop(largest_cluster_index)

        # Split the largest cluster in two
        c1, c2 = split_cluster(largest_cluster)

        # Add the new clusters back if they are not empty
        if len(c1) > 0:
            clusters.append(c1)
        if len(c2) > 0:
            clusters.append(c2)

    return clusters
# --- Example Usage ---
np.random.seed(42) # for reproducibility
# Generate some sample data
data = np.random.rand(100, 2) * 10
# Perform divisive clustering to get 4 clusters
num_target_clusters = 4
resulting_clusters = divisive_clustering(data, num_target_clusters)
# Visualize the results
plt.figure(figsize=(8, 6))
# Assign each cluster a distinct color from the viridis colormap
colors = plt.cm.viridis(np.linspace(0, 1, len(resulting_clusters)))
for i, cluster in enumerate(resulting_clusters):
    plt.scatter(cluster[:, 0], cluster[:, 1], color=colors[i], label=f'Cluster {i+1}')
plt.title('Divisive Hierarchical Clustering (Manual Implementation)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Conclusion: Why Use Hierarchical Clustering in SciPy?
Hierarchical clustering in SciPy offers a flexible and powerful approach to data grouping, particularly when the number of clusters is unknown or when exploring nested relationships within data. Its key advantages include:
- No Predefined Cluster Count: You don't need to specify the number of clusters beforehand.
- Visual Insights: Dendrograms provide excellent visual aids to understand the clustering process and the relationships between clusters.
- Nested Groupings: It's ideal for scenarios where data has inherent hierarchical or nested structures.
- Customizable Metrics: Users can choose from various distance metrics and linkage methods to suit their specific data and problem (see the sketch below).
- Efficiency: SciPy's clustering routines are implemented in compiled code, making them significantly faster than equivalent pure-Python implementations.
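To tie these points together, here is a compact end-to-end sketch combining a non-default metric with average linkage; the specific choices ('cityblock', 4 clusters) are illustrative only:
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
data = rng.random((30, 2))

# Customize both the distance metric and the linkage criterion
Z = linkage(data, method='average', metric='cityblock')

# Cut the hierarchy into at most 4 flat clusters
labels = fcluster(Z, t=4, criterion='maxclust')
print(labels)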