Scikit-Learn#

Learning Objectives#

  • Learn how to import and use Scikit-Learn in Python

  • Understand the basic concepts and purpose of Scikit-Learn in machine learning tasks

  • Implement linear regression, k-means clustering, and decision tree models using Scikit-Learn

  • Understand the applications and limitations of each model

  • Learn how to clean and preprocess data using Pandas before feeding it into machine-learning models

  • Understand the importance of handling missing values and feature engineering

  • Learn how to evaluate the performance of different machine-learning models

  • Interpret the output of models to make informed decisions based on the data

Download Files#

Click Here to Download Data File to Local Space

# import the packages used before and read in the required data
import pandas as pd
import matplotlib.pyplot as plt
air_pollution_data_2023_complete_dataset = pd.read_csv("data/LEED_air_pollution_monitoring_station_2023_complete_dataset.csv", index_col=0)
air_pollution_data_2023_complete_dataset = air_pollution_data_2023_complete_dataset.dropna()
# scikit learn can be imported with the the following command
import sklearn

What is Scikit-Learn?#

Scikit-Learn is a popular Python package that provides a set of algorithms and tools for machine learning that are both easy to use and effective. The package includes support for various tasks, including classification, regression, clustering, dimensionality reduction and model selection and normalization.

In Scikit-Learn, the models that are available include a massive number of possible arguments, and so for the purpose of this course, the default arguments have been used.

air_pollution_data_2023_complete_dataset["date"] = pd.to_datetime(air_pollution_data_2023_complete_dataset["date"], format="%d/%m/%Y %H:%M")
air_pollution_data_2023_complete_dataset["Hour"] = air_pollution_data_2023_complete_dataset["date"].dt.hour
display(air_pollution_data_2023_complete_dataset.head())
date NO2 O3 NO Wind Speed Temperature site Year Hour
24664 2023-01-01 01:00:00 7.30306 76.61852 1.22702 4.9 7.2 Leeds Centre 2023.0 1
24665 2023-01-01 02:00:00 4.31351 79.67418 0.82507 7.0 7.5 Leeds Centre 2023.0 2
24666 2023-01-01 03:00:00 2.95539 81.50758 0.79333 7.3 7.3 Leeds Centre 2023.0 3
24667 2023-01-01 04:00:00 2.08340 81.45519 0.53947 7.2 7.2 Leeds Centre 2023.0 4
24669 2023-01-01 06:00:00 2.95643 82.71238 0.57120 6.8 7.1 Leeds Centre 2023.0 6

Linear Regression#

Linear regression is a simple model that is able to predict a dependent variable based on one or more independent variables, with the assumption of a linear relationship between the two.

from sklearn.linear_model import LinearRegression
import numpy as np

# Create some simple data
X = air_pollution_data_2023_complete_dataset[["Wind Speed"]]
y = air_pollution_data_2023_complete_dataset[["NO2"]]

# Train a linear regression model
model = LinearRegression()
model.fit(X, y)

# Plotting the data points
plt.scatter(X, y, color='blue', label='Observations', alpha=0.05)

# Predicting the values to draw the regression line
# We use minimum and maximum values of X to cover the whole range of data
X_new = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)  # Making it a column vector
y_predict = model.predict(X_new)

# Plotting the regression line
plt.plot(X_new, y_predict, color='red', linewidth=2, label='Regression Line')

# Adding title and labels
plt.title('Relationship Between Wind Speed and NO$_2$ Levels')
plt.xlabel('Wind Speed m/s')
plt.ylabel('NO$_2$ Concentration')
plt.legend()

# Show the plot
plt.show()


# Extracting and printing the intercept and slope
intercept = model.intercept_
slope = model.coef_

print("Intercept of the regression line:", intercept[0])  # Intercept is usually an array with a single element
print("Slope of the regression line:", slope[0][0])  # Slope is an array of arrays, each containing one element per feature
/home/runner/.cache/pypoetry/virtualenvs/coding_for_reproducible_research_courses-l5b2FSBI-py3.11/lib/python3.11/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
  warnings.warn(
../../_images/0ec0dbdc8fbc27a28bfc62dd18720826af31be1a89200760b2a82eea627d6412.png
Intercept of the regression line: 25.914968461312014
Slope of the regression line: -1.8904584011012917

K-Means Clustering#

K-means clustering is a method for partitioning data into a specified number of clusters by minimizing the variance within each cluster.

from sklearn.cluster import KMeans
import numpy as np

# Set the number of clusters
k = 3  # Example number of clusters

# Create KMeans model
kmeans = KMeans(n_clusters=k, random_state=0)

# Fit the model
clusters = kmeans.fit_predict(air_pollution_data_2023_complete_dataset[['NO2', 'Temperature']])
# Assuming 'clusters' has been added to your DataFrame
air_pollution_data_2023_complete_dataset['cluster'] = clusters

# Define a list of colors or use a colormap with discrete colors
colors = plt.cm.get_cmap('tab10', len(air_pollution_data_2023_complete_dataset['cluster'].unique()))  # 'tab10' supports up to 10 unique categories

# Plotting clusters categorically
plt.figure(figsize=(10, 6))
scatter = plt.scatter(air_pollution_data_2023_complete_dataset['NO2'], air_pollution_data_2023_complete_dataset['Wind Speed'], 
                      c=air_pollution_data_2023_complete_dataset['cluster'], cmap=colors, alpha=0.5)

plt.title('Clusters of Air Pollution Data')
plt.xlabel('NO2 Concentration')
plt.ylabel('Wind Speed')  # Adjusted to match the y-axis label with your data

# Create a legend for clusters
# Create handles and labels for the legend
handles, labels = scatter.legend_elements(prop='colors')
plt.legend(handles, labels, title="Clusters")

plt.show()
/tmp/ipykernel_3453/3890792038.py:5: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.
  colors = plt.cm.get_cmap('tab10', len(air_pollution_data_2023_complete_dataset['cluster'].unique()))  # 'tab10' supports up to 10 unique categories
../../_images/ad110ee0c1f0c8a8acd97a5d7cd0e18667dfaac80607e0043e0b005fd4c61933.png

Decision Tree Models#

Decision Trees are a model framework that splits data into branches based on the some input data - in our case wind speed. The model resembles a tree structure with decision points and leaf nodes where each decision leads to further splits, eventually ending on a final prediction based on the input features.

from sklearn.tree import DecisionTreeRegressor

# Initialize the model
decision_tree_regressor = DecisionTreeRegressor(random_state=42, max_depth=2)

# Train the model on the entire dataset
decision_tree_regressor.fit(air_pollution_data_2023_complete_dataset[["Wind Speed"]], air_pollution_data_2023_complete_dataset["NO2"])


import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Plot the decision tree
plt.figure(figsize=(50,10))
plot_tree(decision_tree_regressor, feature_names=['Wind Speed'], filled=True)
plt.show()
../../_images/265b99d1968583ec712cae31f5012e9be351739c65a5e6c03da30b3bd7a61578.png

Further Models#

This section of the course has only introduced a very small subset of models that are avaliable, alongside not diving into the different arguments that are available. If you are interested in learning more about the models themselves, and key practical aspects of deploying and ensuring the validity of models you develop, then please attend e course “Introduction to Machine Learning”.