Unsupervised learning#

Learning objectives#

  • Recap what unsupervised learning is

  • Explain how unsupervised learning relates to, and differs from, supervised learning

  • Perform K-means clustering, a simple unsupervised learning algorithm, on sample data

Recap#

  • In unsupervised learning, models learn from unlabelled data (i.e. without explicit supervision).

  • The goal is to learn (often hidden) patterns or structures in the data.

  • We use unsupervised learning for clustering, dimensionality reduction, anomaly detection, and more.

K-means clustering#

  • This type of algorithm tries to assign cluster labels (i.e. generate clusters) from unlabelled data.

  • There are many clustering algorithms, but K-means is a great place to start.

How does it work?#

The algorithm is quite simple. To partition a dataset into k clusters, it iteratively minimises the variance within each cluster.
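Formally, given clusters $S_1, \dots, S_k$ with centroids $\mu_1, \dots, \mu_k$, it seeks the assignment that minimises the within-cluster sum of squares (the quantity scikit-learn exposes as inertia_, which we use later for the elbow method):

$$\min_{S_1, \dots, S_k} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

It does this with a simple iterative procedure: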

  1. Create k cluster centroids randomly

  2. Assign each data point to the nearest centroid

  3. Compute new centroids as the mean of the assigned points

  4. Repeat steps 2 and 3 until the centroids stabilise (i.e. they do not move significantly)

Note: k must be chosen in advance (we can estimate it with the elbow method, covered below)
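To make these steps concrete, here is a minimal from-scratch sketch in NumPy (illustrative only, using random data points as the initial centroids; the scikit-learn implementation used below adds a smarter k-means++ initialisation and other optimisations):

import numpy as np

def kmeans_sketch(X, k, n_iters=100, tol=1e-6, seed=42):
    rng = np.random.default_rng(seed)

    # Step 1: choose k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3: recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])

        # Step 4: stop once the centroids stabilise
        if np.allclose(new_centroids, centroids, atol=tol):
            break
        centroids = new_centroids

    return labels, centroids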

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate some data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

plt.scatter(X[:, 0], X[:, 1], s=10)
plt.title("Raw data")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
[Figure: Raw data]
def kmeans_and_plot(X, n_clusters, title):
    # Apply k-means
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(X)
    y_kmeans = kmeans.predict(X)

    # Plot data coloured by k-means
    plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=10, cmap="viridis")
    plt.scatter(
        kmeans.cluster_centers_[:, 0],
        kmeans.cluster_centers_[:, 1],
        c="red",
        marker="X",
        s=100,
        label="Centroids",
    )
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.show()
n_clusters = 4
title="K-means: Raw data coloured by predicted cluster"
kmeans_and_plot(X, n_clusters, title)
[Figure: K-means: Raw data coloured by predicted cluster]

What happens if we have the wrong k?#

  • Here we could easily see 4 clusters and specify k = 4.

  • But what if we provide the wrong k?

n_clusters = 2
title="K-means: wrong number of clusters (too few)"
kmeans_and_plot(X, n_clusters, title)
[Figure: K-means: wrong number of clusters (too few)]
n_clusters = 10
title="K-means: wrong number of clusters (too many)"
kmeans_and_plot(X, n_clusters, title)
[Figure: K-means: wrong number of clusters (too many)]
  • Finally, what happens if we have a less distinct data set?

# Generate some data
X, y = make_blobs(n_samples=500, centers=10, cluster_std=2.0, random_state=42)

plt.scatter(X[:, 0], X[:, 1], s=10)
plt.title("Raw data")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
[Figure: Raw data]
n_clusters = 10
title="K-means: correct number of clusters, less distinct clusters"
kmeans_and_plot(X, n_clusters, title)
[Figure: K-means: correct number of clusters, less distinct clusters]

Determining the number of clusters#

  • We can attempt to estimate the number of clusters using the elbow method: fit K-means for a range of k values, plot the within-cluster sum of squares (WCSS) against k, and look for the "elbow" where the curve stops dropping sharply.

  • This is not perfect, as shown below.

# Generate some data
X, y = make_blobs(n_samples=500, centers=6, cluster_std=1.0, random_state=32)

plt.scatter(X[:, 0], X[:, 1], s=10)
plt.title("Raw data")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
[Figure: Raw data]
def elbow_method_and_plot(X, k_range, title, random_state=42):

    # Store within cluster sum of squares (inertia)
    wcss = []
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=random_state)
        kmeans.fit(X)

        # Store inertia
        wcss.append(kmeans.inertia_)

    # Plot the elbow curve
    plt.figure(figsize=(8, 5))
    plt.plot(k_range, wcss, marker='o', linestyle='-')
    plt.xlabel("Number of clusters k")
    plt.ylabel("Within cluster sum of squares (WCSS)")
    plt.title(title)
    plt.xticks(k_range)
    plt.show()
k_range = range(1, 11)
title = "Elbow method for finding k"
elbow_method_and_plot(X, k_range, title, random_state=42)
[Figure: Elbow method for finding k]
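Here the curve suggests k = 5 (used below), even though the data was generated with 6 centres. If we want to pick the elbow programmatically, one simple heuristic is to take the k where the curve bends most sharply, i.e. where the second difference of the WCSS values is largest. A minimal sketch (pick_elbow is a hypothetical helper that recomputes the WCSS rather than reusing the values inside elbow_method_and_plot):

def pick_elbow(X, k_range, random_state=42):
    # Recompute the WCSS for each k, as in elbow_method_and_plot above
    wcss = [
        KMeans(n_clusters=k, random_state=random_state).fit(X).inertia_
        for k in k_range
    ]
    ks = list(k_range)

    # Second difference of the WCSS curve: the largest value marks the sharpest bend
    bends = [wcss[i - 1] - 2 * wcss[i] + wcss[i + 1] for i in range(1, len(ks) - 1)]
    return ks[1 + int(np.argmax(bends))]

print(pick_elbow(X, range(1, 11)))

This heuristic is crude (on noisy curves it tends to favour small k), so treat it as a sanity check alongside the plot rather than a definitive answer.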
n_clusters = 5
title="K-means: data colorued by elbow method"
kmeans_and_plot(X, n_clusters, title)
[Figure: K-means: data coloured by elbow method]
  • When the clusters are less distinct, the elbow method might be less helpful:

# Generate some data
X, y = make_blobs(n_samples=500, centers=6, cluster_std=2.0, random_state=32)

plt.scatter(X[:, 0], X[:, 1], s=10)
plt.title("Raw data: 6 clusters (apparently)")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
[Figure: Raw data: 6 clusters (apparently)]
k_range = range(1, 11)
title = "Elbow method for finding k"
elbow_method_and_plot(X, k_range, title, random_state=42)
[Figure: Elbow method for finding k]
n_clusters = 4
title = "K-means: data coloured by elbow method"
kmeans_and_plot(X, n_clusters, title)
[Figure: K-means: data coloured by elbow method]