K-Means Clustering with SciPy

K-Means clustering is a powerful algorithm used to partition a dataset into a predefined number of distinct clusters, denoted by K. In this method, each data point is assigned to the cluster whose mean (or centroid) is closest to it. The primary objective of K-Means is to minimize the within-cluster variance, aiming to create clusters that are as compact and internally homogeneous as possible.
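
Formally, if cluster $C_k$ has centroid $\mu_k$, the algorithm searches for the assignment that minimizes the within-cluster sum of squared distances (standard notation, included here for reference):

$$\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$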

The scipy.cluster.vq module in SciPy provides efficient implementations of K-Means clustering and related vector quantization techniques. Its key functions are:

  • kmeans: performs the core K-Means clustering, returning the cluster centroids along with the final distortion.
  • kmeans2: a more configurable variant that also returns cluster labels and supports several initialization strategies, including K-Means++ (used later in this article).
  • vq: assigns each data point to its nearest centroid in a given codebook, such as the one produced by kmeans.
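
As a quick orientation, here is a minimal sketch of the typical kmeans/vq workflow (the array name obs is ours, purely illustrative; SciPy's documentation recommends whitening features to unit variance before clustering):

import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

# Illustrative data; 'obs' is our own name, not part of SciPy's API
obs = np.random.default_rng(0).normal(size=(100, 2))

# Scale each feature to unit variance, as the SciPy docs recommend
obs = whiten(obs)

centroids, distortion = kmeans(obs, 3)  # find 3 centroids
labels, dists = vq(obs, centroids)      # assign each observation to one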

While SciPy's clustering tools are not as feature-rich as those in libraries like scikit-learn, they are fast and simple, making them an excellent choice for lightweight clustering tasks and for building foundational understanding.

Types of K-Means Clustering in SciPy

SciPy supports two primary approaches to K-Means clustering:

1. Standard K-Means Clustering

The standard K-Means algorithm operates through an iterative process of assigning data points to clusters and subsequently updating the cluster centroids.

How It Works: Step-by-Step

  1. Initialization:

    • Determine K: Specify the desired number of clusters, K.
    • Centroid Initialization: Randomly select K initial centroids from the data points, or utilize more sophisticated initialization methods like K-Means++ (explained below) for better performance.
  2. Assignment Step:

    • Distance Calculation: For each data point, compute its distance to every centroid. The Euclidean distance is commonly used.
    • Cluster Assignment: Assign each data point to the cluster whose centroid is the nearest.
  3. Update Step:

    • Centroid Recalculation: Recompute the position of each centroid by calculating the mean of all data points currently assigned to that cluster.
    • Iteration: Repeat the Assignment and Update steps until the centroids no longer change significantly between iterations, indicating convergence. A minimal NumPy sketch of one such iteration follows below.
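
Before turning to SciPy's built-in routines, here is a self-contained sketch of a single assignment-and-update iteration (the function and variable names are ours, purely illustrative):

import numpy as np

def kmeans_iteration(points, centroids):
    # Assignment step: Euclidean distance from every point to every centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Update step: move each centroid to the mean of its assigned points;
    # keep the old centroid if a cluster happens to be empty
    new_centroids = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(len(centroids))
    ])
    return labels, new_centroids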

Example: Standard K-Means Clustering with SciPy

import numpy as np
from scipy.cluster.vq import kmeans, vq
import matplotlib.pyplot as plt

# Generate sample data
np.random.seed(0)
data = np.vstack([
    np.random.normal(0, 0.5, (50, 2)),
    np.random.normal(3, 0.5, (50, 2)),
    np.random.normal(6, 0.5, (50, 2))
])

# Number of clusters
k = 3

# Apply K-means clustering
# kmeans returns the centroids and the distortion: the mean Euclidean
# distance between each observation and its nearest centroid
centroids, distortion = kmeans(data, k)

# Assign each data point to a cluster
# vq returns the cluster label for each point and its distance
# to the assigned centroid
labels, _ = vq(data, centroids)

# Visualize the result
plt.figure(figsize=(8, 6))
for i in range(k):
    # Plot data points belonging to each cluster
    plt.scatter(data[labels == i, 0], data[labels == i, 1], label=f'Cluster {i+1}')

# Plot the centroids
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, label='Centroids')

plt.title('Standard K-Means Clustering with SciPy')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()

print("Centroids:\n", centroids)
print("Distortion:", distortion)

Sample Output:

Centroids:
 [[0.0084 0.0514]
 [5.9534 5.9873]
 [2.9906 3.0913]]
Distortion: 0.6258

2. K-Means++ Clustering

K-Means++ is an improved initialization technique that addresses standard K-Means's sensitivity to initial centroid placement. By spreading the initial centroids apart more effectively, it generally leads to faster convergence and more accurate clustering results.

How K-Means++ Works:

  1. First Centroid Selection: Randomly select one data point as the first centroid.
  2. Subsequent Centroid Selection: For each remaining data point, calculate the distance to the nearest already-chosen centroid. Select the next centroid from the data points with a probability that is proportional to the square of this minimum distance. This "weighted" selection favors points that are farther away from existing centroids.
  3. Iteration: Repeat step 2 until K centroids have been selected.
  4. Standard K-Means Execution: Once K initial centroids are established using K-Means++, proceed with the standard K-Means algorithm (Assignment and Update steps) until convergence. A minimal sketch of the seeding procedure appears below.
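
For intuition, here is a minimal sketch of the D²-weighted seeding itself (names such as kmeanspp_init, points, and seed are ours, not part of SciPy's API):

import numpy as np

def kmeanspp_init(points, k, seed=0):
    rng = np.random.default_rng(seed)

    # Step 1: pick the first centroid uniformly at random
    centroids = [points[rng.integers(len(points))]]

    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centroids], axis=0)

        # Step 2: sample the next centroid with probability proportional to D^2
        centroids.append(points[rng.choice(len(points), p=d2 / d2.sum())])

    return np.array(centroids)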

Example: K-Means++ Using SciPy

SciPy facilitates K-Means++ initialization by using the kmeans2 function with the minit='++' parameter.

import numpy as np
from scipy.cluster.vq import kmeans2
import matplotlib.pyplot as plt

# Regenerate the same sample data as in the previous example
np.random.seed(0)
data = np.vstack([
    np.random.normal(0, 0.5, (50, 2)),
    np.random.normal(3, 0.5, (50, 2)),
    np.random.normal(6, 0.5, (50, 2))
])

# Number of clusters
k = 3

# Perform K-means++ clustering
# kmeans2 returns centroids and labels. minit='++' enables K-Means++ initialization.
centroids_plusplus, labels_plusplus = kmeans2(data, k, minit='++')

# Plot the results
plt.figure(figsize=(8, 6))
for i in range(k):
    plt.scatter(data[labels_plusplus == i, 0], data[labels_plusplus == i, 1], label=f'Cluster {i+1}')

plt.scatter(centroids_plusplus[:, 0], centroids_plusplus[:, 1], c='red', marker='X', s=200, label='Centroids')

plt.title('K-Means++ Clustering with SciPy')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()

print("Centroids (K-Means++):\n", centroids_plusplus)

Vector Quantization (VQ) with SciPy

Vector Quantization (VQ) is a technique primarily used for data compression and pattern recognition. It works by mapping high-dimensional data vectors to a smaller, representative set of vectors known as codebook vectors. This process effectively reduces the amount of data needed to represent the original information.

VQ finds applications in:

  • Data Compression: Reducing the storage or transmission size of data.
  • Pattern Recognition: Identifying and classifying patterns in data.
  • Signal Processing: Quantizing and representing signals efficiently.

How Vector Quantization Works:

  1. Initialization: Establish an initial set of codebook vectors. A common and effective method for this is to use the centroids obtained from K-Means clustering.
  2. Assignment: Each data vector in the dataset is compared against all codebook vectors. The data vector is then assigned to the codebook vector that is closest to it (typically using Euclidean distance).
  3. Update: The codebook vectors are updated by recalculating them as the mean of all data vectors that were assigned to them.
  4. Iteration: The assignment and update steps are repeated until the codebook vectors stabilize or a predefined convergence criterion is met.

Example: Vector Quantization using K-Means

Here, we use K-Means to generate the codebook vectors for Vector Quantization.

import numpy as np
from scipy.cluster.vq import kmeans, vq
import matplotlib.pyplot as plt

# Generate synthetic data for VQ example
np.random.seed(0)
data_vq = np.vstack([
    np.random.normal(0, 0.5, (100, 2)),
    np.random.normal(3, 0.5, (100, 2)),
    np.random.normal(6, 0.5, (100, 2))
])

# Number of codebook vectors (clusters)
k_vq = 3

# Train VQ using K-means to find codebook vectors (centroids)
codebook_vectors, distortion_vq = kmeans(data_vq, k_vq)

# Assign each data point to its nearest codebook vector
# This assigns each point to a cluster represented by a codebook vector
labels_vq, _ = vq(data_vq, codebook_vectors)

# Plot the quantized vectors and clusters
plt.figure(figsize=(8, 6))
for i in range(k_vq):
    plt.scatter(data_vq[labels_vq == i, 0], data_vq[labels_vq == i, 1], label=f'Cluster {i+1}')

# Highlight the codebook vectors
plt.scatter(codebook_vectors[:, 0], codebook_vectors[:, 1], c='red', marker='X', s=200, label='Codebook Vectors')

plt.title('Vector Quantization with K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()

print("Codebook Vectors:\n", codebook_vectors)
print("Distortion:", distortion_vq)

Sample Output:

Codebook Vectors:
 [[-0.0004  0.0713]
 [ 5.9438  5.9484]
 [ 2.9284  2.9435]]
Distortion: 0.6249
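
To make the compression aspect concrete, each point can be replaced by the codebook vector it was assigned to. A short follow-up, reusing data_vq, labels_vq, and codebook_vectors from the example above:

# Reconstruct the dataset from the codebook: every point is replaced
# by its assigned codebook vector (lossy compression)
reconstructed = codebook_vectors[labels_vq]

# Mean reconstruction error; this should match the distortion
# reported by kmeans above
error = np.linalg.norm(data_vq - reconstructed, axis=1).mean()
print("Mean reconstruction error:", error)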

Conclusion

K-Means clustering in SciPy offers a robust and straightforward approach for uncovering underlying patterns and grouping similar data points. Whether you are leveraging:

  • Standard K-Means for foundational clustering tasks,
  • K-Means++ for enhanced initialization and more reliable clustering outcomes, or
  • Vector Quantization for efficient data compression and representation,

SciPy provides efficient and accessible tools to embark on your unsupervised learning journey in Python.