Unsupervised learning#

Learning objectives#

  • Recap what unsupervised learning is

  • Explain how unsupervised learning relates to supervised learning, focusing on the main differences

  • Perform K-means clustering, a simple unsupervised learning algorithm, on sample data

Recap#

  • In unsupervised learning, models learn from unlabelled data (i.e. without explicit supervision).

  • The goal is to learn (often hidden) patterns or structures in the data.

  • We use unsupervised learning for clustering, dimensionality reduction, anomaly detection, and more.

K-means clustering#

  • This type of algorithm assigns cluster labels to unlabelled data points (i.e. it generates clusters).

  • While there are many clustering algorithms, K-means is a great place to start because of its simplicity (much as we defaulted to linear regression in previous notebooks).

How does it work?#

The algorithm is quite simple. To generate k clusters from a dataset, it minimises the variance within each cluster, iterating the four steps below (a minimal code sketch of these steps follows the list).
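Formally, given clusters $C_1, \dots, C_k$ with centroids $\mu_1, \dots, \mu_k$, K-means seeks the assignment that minimises the within-cluster sum of squares (the quantity scikit-learn reports as inertia_, which we will reuse for the elbow method later):

$$\min_{C_1, \dots, C_k} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2, \qquad \mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x$$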

  1. Create k cluster centroids randomly

  2. Assign each data point to the nearest centroid

  3. Compute new centroids as the mean of the assigned points

  4. Repeat steps 2 and 3 until the centroids stabilise (i.e. they do not move significantly)
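For concreteness, here is a minimal sketch of these four steps in plain NumPy (the function name kmeans_numpy, the tolerance-based stopping rule, and the seed handling are our own illustrative choices; in practice we will use scikit-learn's KMeans below):

import numpy as np

def kmeans_numpy(X, k, max_iters=100, tol=1e-4, seed=42):
    rng = np.random.default_rng(seed)

    # Step 1: create k centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3: recompute each centroid as the mean of its assigned points
        # (keeping the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])

        # Step 4: stop once the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids

    return labels, centroids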


import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate some data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

plt.scatter(X[:, 0], X[:, 1], s=10)
plt.title("Raw data")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
[Figure: Raw data]
def kmeans_and_plot(X, n_clusters, title):
    # Apply k-means
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(X)
    y_kmeans = kmeans.predict(X)

    # Plot data coloured by k-means
    plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=10, cmap="viridis")
    plt.scatter(
        kmeans.cluster_centers_[:, 0],
        kmeans.cluster_centers_[:, 1],
        c="red",
        marker="X",
        s=100,
        label="Centroids",
    )
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.show()
n_clusters = 4
title = "K-means: raw data coloured by predicted cluster"
kmeans_and_plot(X, n_clusters, title)
[Figure: K-means: raw data coloured by predicted cluster]

What happens if we have the wrong k?#

  • Here we could easily see 4 clusters, and specify k = 4.

  • But what if we provide the wrong k?

n_clusters = 2
title = "K-means: wrong number of clusters (too few)"
kmeans_and_plot(X, n_clusters, title)
[Figure: K-means: wrong number of clusters (too few)]
n_clusters = 10
title = "K-means: wrong number of clusters (too many)"
kmeans_and_plot(X, n_clusters, title)
[Figure: K-means: wrong number of clusters (too many)]
  • Finally, what happens if we have a less distinct dataset?

# Generate some data
X, y = make_blobs(n_samples=500, centers=10, cluster_std=2.0, random_state=42)

plt.scatter(X[:, 0], X[:, 1], s=10)
plt.title("Raw data")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
[Figure: Raw data]
n_clusters = 10
title = "K-means: correct number of clusters, less distinct clusters"
kmeans_and_plot(X, n_clusters, title)
[Figure: K-means: correct number of clusters, less distinct clusters]

Determining the number of clusters#

  • We can attempt to estimate a suitable number of clusters using the elbow method: run K-means for a range of k values, plot the within-cluster sum of squares (WCSS, scikit-learn's inertia_) against k, and pick the k at the ‘elbow’, where adding further clusters stops producing large improvements.

  • This is not perfect, as shown below.

# Generate some data
X, y = make_blobs(n_samples=500, centers=6, cluster_std=1.0, random_state=32)

plt.scatter(X[:, 0], X[:, 1], s=10)
plt.title("Raw data")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
[Figure: Raw data]
def elbow_method_and_plot(X, k_range, title, random_state=42):
    # Store the within-cluster sum of squares (inertia) for each k
    wcss = []
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=random_state)
        kmeans.fit(X)

        # Store inertia
        wcss.append(kmeans.inertia_)

    # Plot the elbow curve
    plt.figure(figsize=(8, 5))
    plt.plot(k_range, wcss, marker='o', linestyle='-')
    plt.xlabel("Number of clusters k")
    plt.ylabel("Within cluster sum of squares (WCSS)")
    plt.title(title)
    plt.xticks(k_range)
    plt.show()
k_range = range(1, 11)
title = "Elbow method for finding k"
elbow_method_and_plot(X, k_range, title, random_state=42)
[Figure: Elbow method for finding k]
n_clusters = 5  # the elbow plot suggests k = 5, though the data was generated with 6 centres
title = "K-means: data coloured by elbow method"
kmeans_and_plot(X, n_clusters, title)
[Figure: K-means: data coloured by elbow method]
  • Note that the elbow pointed to k = 5 here, even though the data was generated with 6 centres. When the clusters are less distinct, the elbow method can be even less helpful:

# Generate some data
X, y = make_blobs(n_samples=500, centers=6, cluster_std=2.0, random_state=32)

plt.scatter(X[:, 0], X[:, 1], s=10)
plt.title("Raw data: 6 clusters (apparently)")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
[Figure: Raw data: 6 clusters (apparently)]
k_range = range(1, 11)
title = "Elbow method for finding k"
elbow_method_and_plot(X, k_range, title, random_state=42)
[Figure: Elbow method for finding k]
n_clusters = 4  # the elbow is ambiguous here, suggesting around 4 clusters rather than the true 6
title = "K-means: data coloured by elbow method"
kmeans_and_plot(X, n_clusters, title)
[Figure: K-means: data coloured by elbow method]

K-means animation#

Below is an animation of K-means, showing each iteration of the algorithm. Notice how the centroids get closer and closer to what a human might label as a ‘blob centre’: at every iteration, each centroid moves to the mean of the points currently assigned to it.
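If you would like to recreate something similar yourself, here is a sketch (not the code used to produce the animation above): re-run scikit-learn's KMeans from the same fixed starting centroids with an increasing iteration cap, plotting the centroids after each cap. The choice of initial centroids below is our own, and scikit-learn may warn about non-convergence on the early, deliberately truncated runs.

# Sketch: visualise the first few K-means iterations as separate frames
X, y = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Fix the starting centroids so every run begins from the same state
init_centroids = X[np.random.default_rng(0).choice(len(X), size=4, replace=False)]

for i in range(1, 5):
    # n_init=1 is required when passing explicit starting centroids;
    # max_iter=i caps the run at i Lloyd iterations
    kmeans = KMeans(n_clusters=4, init=init_centroids, n_init=1, max_iter=i)
    y_kmeans = kmeans.fit_predict(X)

    plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=10, cmap="viridis")
    plt.scatter(
        kmeans.cluster_centers_[:, 0],
        kmeans.cluster_centers_[:, 1],
        c="red",
        marker="X",
        s=100,
        label="Centroids",
    )
    plt.title(f"K-means after {i} iteration(s)")
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.show()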