Unsupervised learning#

Learning objectives#

  • Recap what unsupervised learning is

  • Explain how this relates to supervised learning, especially the main differences

  • Perform K-means clustering, a simple unsupervised learning algorithm, on sample data

Recap#

  • In unsupervised learning, models learn from unlabelled data (i.e. without explicit supervision).

  • The goal is to learn (often hidden) patterns or structures in the data.

  • We use unsupervised learning for clustering, dimensionality reduction, anomaly detection, and more.

K-means clustering#

  • This type of algorithm tries to assign cluster labels (i.e. generate clusters) from unlabelled data.

  • While there are many clustering algorithms, K-means is a great place to start due to its simplicity (much like how we defaulted to linear regression in previous notebooks).

How does it work?#

The algorithm is quite simple. To generate k clusters from a dataset, it iteratively minimises the variance within each cluster (a minimal sketch in code follows the steps below):

  1. Create k cluster centroids randomly

  2. Assign each data point to the nearest centroid

  3. Compute new centroids as the mean of the assigned points

  4. Repeat steps 2 and 3 until the centroids stabilise (i.e. they do not move significantly)

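To make these steps concrete, here is a minimal from-scratch sketch in NumPy. The function name simple_kmeans, the random-data-point initialisation, and the tolerance-based stopping rule are illustrative choices, and the sketch does not handle the rare case of a cluster ending up empty; in practice we use scikit-learn's implementation, as below.

import numpy as np

def simple_kmeans(X, k, n_iters=100, tol=1e-4, seed=42):
    rng = np.random.default_rng(seed)

    # Step 1: create k centroids randomly (here: k random data points)
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iters):
        # Step 2: assign each data point to the nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3: compute new centroids as the mean of the assigned points
        # (this sketch assumes no cluster ends up empty)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])

        # Step 4: stop once the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids

    return labels, centroids

Once the sample data X is generated below, calling simple_kmeans(X, k=4) would return a cluster label for each point together with the final centroids.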

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate some data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

plt.scatter(X[:, 0], X[:, 1], s=10)
plt.title("Raw data")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
def kmeans_and_plot(X, n_clusters, title):
    # Apply k-means
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(X)
    y_kmeans = kmeans.predict(X)

    # Plot data coloured by k-means
    plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=10, cmap="viridis")
    plt.scatter(
        kmeans.cluster_centers_[:, 0],
        kmeans.cluster_centers_[:, 1],
        c="red",
        marker="X",
        s=100,
        label="Centroids",
    )
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.show()
n_clusters = 4
title="K-means: Raw data coloured by predicted cluster"
kmeans_and_plot(X, n_clusters, title)

What happens if we have the wrong k?#

  • Here we could easily see 4 clusters, so we specified k = 4.

  • But what if we provide the wrong k?

n_clusters = 2
title="K-means: wrong number of clusters (too few)"
kmeans_and_plot(X, n_clusters, title)
n_clusters = 10
title="K-means: wrong number of clusters (too many)"
kmeans_and_plot(X, n_clusters, title)
  • Finally, what happens if we have a less distinct dataset?

# Generate some data
X, y = make_blobs(n_samples=500, centers=10, cluster_std=2.0, random_state=42)

plt.scatter(X[:, 0], X[:, 1], s=10)
plt.title("Raw data")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
n_clusters = 10
title="K-means: correct number of clusters, less distinct clusters"
kmeans_and_plot(X, n_clusters, title)

Determining the number of clusters#

  • We can attempt to choose the number of clusters using the elbow method: run K-means for a range of values of k, plot the within-cluster sum of squares (WCSS) against k, and look for the "elbow" where the curve flattens.

  • This is not perfect, as shown below.
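
For reference, WCSS is the quantity K-means minimises, and scikit-learn exposes it on a fitted model as the inertia_ attribute:

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

where $C_j$ is the set of points assigned to cluster $j$ and $\mu_j$ is its centroid. WCSS always decreases as k grows, so rather than minimising it we look for the k at which the decrease slows sharply.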

# Generate some data
X, y = make_blobs(n_samples=500, centers=6, cluster_std=1.0, random_state=32)

plt.scatter(X[:, 0], X[:, 1], s=10)
plt.title("Raw data")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
def elbow_method_and_plot(X, k_range, title, random_state=42):

    # Store the within-cluster sum of squares (WCSS, i.e. inertia)
    wcss = []
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=random_state)
        kmeans.fit(X)

        # Store inertia
        wcss.append(kmeans.inertia_)

    # Plot the elbow curve
    plt.figure(figsize=(8, 5))
    plt.plot(k_range, wcss, marker='o', linestyle='-')
    plt.xlabel("Number of clusters k")
    plt.ylabel("Within cluster sum of squares (WCSS)")
    plt.title(title)
    plt.xticks(k_range)
    plt.show()
k_range = range(1, 11)
title = "Elbow method for finding k"
elbow_method_and_plot(X, k_range, title, random_state=42)
n_clusters = 5
title="K-means: data coloured by elbow method"
kmeans_and_plot(X, n_clusters, title)
  • When the clusters are less distinct, the elbow method might be less helpful:

# Generate some data
X, y = make_blobs(n_samples=500, centers=6, cluster_std=2.0, random_state=32)

plt.scatter(X[:, 0], X[:, 1], s=10)
plt.title("Raw data: 6 clusters (apparently)")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
k_range = range(1, 11)
title = "Elbow method for finding k"
elbow_method_and_plot(X, k_range, title, random_state=42)
n_clusters = 4
title = "K-means: data coloured by elbow method"
kmeans_and_plot(X, n_clusters, title)

K-means Animation#

Below is an animation of K-means showing each iteration of the algorithm. Notice how the centroids get closer and closer to what a human might label as a 'blob centre': at each iteration, every centroid moves to the centre of the points assigned to it.

The iframe contains an animated scatter plot illustrating K-means over 10 iterations: points update to their nearest centroid, and the centroids (red X) move at each step.
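
If you want to produce something like this yourself, one approach is to advance scikit-learn's KMeans one iteration at a time by re-fitting with max_iter=1 from the previous centroids. Here is a minimal sketch, assuming the four-blob data from earlier; the number of frames and the initialisation are illustrative choices, and scikit-learn may warn that max_iter was reached, which is expected here.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Initialise the centroids as random data points
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), size=4, replace=False)]

for iteration in range(1, 6):
    # Run a single K-means iteration from the current centroids
    kmeans = KMeans(n_clusters=4, init=centroids, n_init=1, max_iter=1)
    labels = kmeans.fit_predict(X)
    centroids = kmeans.cluster_centers_

    # Plot one 'frame' of the animation
    plt.scatter(X[:, 0], X[:, 1], c=labels, s=10, cmap="viridis")
    plt.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="X", s=100, label="Centroids")
    plt.title(f"K-means: iteration {iteration}")
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.show()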