Model selection#

Learning objectives#

  • Select different machine learning models in scikit-learn, and train them

  • Understand that models have different characteristics (such as flexibility and parameterisation) that affect model selection and efficacy

  • Explain the model flexibility/over-fitting trade-off

  • Explain what training and test error are, and how they relate to over-fitting

Logistic regression#

  • Here is a simple binary classification task with a linear model.

  • First, we will create some binary classified data.

  • Then we will train a model that can classify new points into one of the two classes.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate binary classification dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=0)

print(f"X shape: {X.shape}, y shape: {y.shape}")
X shape: (1000, 2), y shape: (1000,)
print(f"X feature vector: {X[0]}")
X feature vector: [0.4666179  3.86571303]
print(f"y label: {y[0]}")
y label: 0
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Dataset for binary classification (coloured by y)')
plt.show()

So it's time to train a model?#

  • Not quite. Let's first split X and y into X_train, X_test, y_train, and y_test.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
plt.scatter(X_train[:, 0], X_train[:, 1], label="X_train")
plt.scatter(X_test[:, 0], X_test[:, 1], label="X_test")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Dataset split into train (80%) and test (20%)')
plt.legend()
plt.show()
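  • If the classes were imbalanced, we could also pass stratify=y so that the train and test splits keep the same class proportions. A variant of the split above (illustrative only; our two blobs are already balanced):

# Stratified variant of the split above; variables suffixed _s to leave the original split untouched
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)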
  • We will now train our model with X_train and y_train.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
LogisticRegression()
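  • Logistic regression is a linear model, so after fitting we can inspect the learned weights and intercept directly (a quick check on the fitted model above):

print(f"Coefficients: {model.coef_}, intercept: {model.intercept_}")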
  • Now that we have a trained model, we can evaluate how the model performs on the test set that we held out at the start.

  • Use the model to create predictions for the test data:

y_pred = model.predict(X_test)
  • Then assess the accuracy against the true labels:

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
Accuracy: 0.95
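  • Accuracy condenses performance into a single number; a confusion matrix shows how the errors are split between the two classes (a minimal sketch, using the same predictions):

from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred))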
  • Let's create some plots to show what is happening.

plt.scatter(X_test[:, 0], X_test[:, 1], label="X_test")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Test dataset')
plt.legend()
plt.show()
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred, edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Test dataset coloured by model prediction')
plt.show()
  • Which points were incorrectly classified? We know that roughly 5% of the 200 test points were.

misclassified = X_test[y_test != y_pred]
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred, edgecolors='k')
plt.scatter(misclassified[:, 0], misclassified[:, 1], color='red', marker='x', label='Misclassified')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Test dataset coloured by model prediction')
plt.legend()
plt.show()
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolors='k')
plt.scatter(misclassified[:, 0], misclassified[:, 1], color='red', marker='x', label='Misclassified')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Test dataset coloured by true labels')
plt.legend()
plt.show()
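  • We can also count the misclassified points directly (a quick check):

print(f"Misclassified: {(y_test != y_pred).sum()} of {len(y_test)} test points")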

Summary#

  • We have used a linear model to classify points into two classes.

  • We achieved a 95% accuracy score on the test set: given a new point drawn from the same (representative) distribution, we have roughly a 95% chance of classifying it correctly.
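  • For an individual new point, the model can also report class probabilities rather than just a hard label. A small sketch, using a made-up point chosen only for illustration:

# Hypothetical new point, for illustration only
new_point = np.array([[1.0, 4.0]])
print(model.predict_proba(new_point))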

What happens when our data is not representative?#

  • If our input data distribution changes, our model performance will suffer.

  • We can change our toy dataset fairly easily with scikit-learn.

  • This is still a binary classification task. We can assess the classification performance of the model on the new dataset.

from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
Accuracy: 0.50
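  • A chance-level baseline helps put this number in context; scikit-learn's DummyClassifier provides one (a sketch, assuming the new split above):

from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print(f'Baseline accuracy: {baseline.score(X_test, y_test):.2f}')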
  • Our model is now as good (or bad) as flipping a coin.

  • This makes sense: our data distribution has changed, and we need to re-train our model. Generalising to data from a completely different distribution was never going to work!

  • Let's create a new linear model, train it on the new dataset, and assess its accuracy on the held-out test set.

model = LogisticRegression()
model.fit(X_train, y_train)
LogisticRegression()
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
Accuracy: 0.86

Much better!

plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Moon dataset for binary classification task')
plt.show()
  • Let's plot the misclassified points, along with the decision boundary.

  • For binary logistic regression, the decision boundary is the line where the predicted probability of belonging to either class is 0.5.

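  • Because the model is linear, this boundary can be written down explicitly: it is the line where w1*x1 + w2*x2 + b = 0. A sketch of how we could recover it from the fitted model above:

# The 0.5-probability boundary satisfies w1*x1 + w2*x2 + b = 0, i.e. x2 = -(w1*x1 + b) / w2
w1, w2 = model.coef_[0]
b = model.intercept_[0]
boundary_x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
boundary_x2 = -(w1 * boundary_x1 + b) / w2

  • More generally, we can visualise any classifier's decision regions by evaluating its predictions on a grid, which is what the helper function below does.
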
def plot_boundary(X, y, model, misclassified, title):
    xx, yy = np.meshgrid(
        np.linspace(X[:, 0].min(), X[:, 0].max(), 200),
        np.linspace(X[:, 1].min(), X[:, 1].max(), 200),
    )
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
    plt.scatter(
        misclassified[:, 0],
        misclassified[:, 1],
        color="red",
        marker="x",
        label="Misclassified",
    )
    plt.title(title)
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.legend()
    plt.show()
# Find the misclassified points as before
misclassified = X_test[y_test != y_pred]

title="Decision boundary and misclassified points for moon dataset and logistic regression"
plot_boundary(X_test, y_test, model, misclassified, title)

Changing the model#

  • We can potentially get better performance if we select a different model.

  • Let's use an SVM (Support Vector Machine) with a radial basis function (RBF) kernel.

  • This can model non-linear decision boundaries.

from sklearn.svm import SVC

# Train SVM with RBF kernel
model = SVC(kernel='rbf', gamma='scale')
model.fit(X_train, y_train)
SVC()
# Predict and evaluate as before
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
Accuracy: 0.98
  • Great! We improved our prediction accuracy from 86% to 98%.

  • Let's visualise the classification on the test set again, to show the non-linear decision boundary.

# Identify misclassified points as before
misclassified = X_test[y_test != y_pred]
title="Decision boundary and misclassified points for moon dataset and SVM"
plot_boundary(X_test, y_test, model, misclassified, title)
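  • The RBF kernel's gamma parameter controls how flexible the boundary is. Comparing train and test accuracy across a few illustrative values of gamma shows the flexibility/over-fitting trade-off from the learning objectives (a sketch, reusing the moon split above):

# Illustrative gamma values; very large values tend to fit the training set too closely
for gamma in [0.1, 1, 10, 1000]:
    clf = SVC(kernel='rbf', gamma=gamma).fit(X_train, y_train)
    print(f"gamma={gamma}: train accuracy={clf.score(X_train, y_train):.2f}, test accuracy={clf.score(X_test, y_test):.2f}")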