{ "cells": [ { "cell_type": "markdown", "id": "b629b764-2854-40ce-ab2a-1192d0825b7e", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Scikit-Learn \n", "## Learning Objectives\n", "- Learn how to import and use Scikit-Learn in Python\n", "- Understand the basic concepts and purpose of Scikit-Learn in machine-learning tasks\n", "- Implement linear regression, k-means clustering, and decision tree models using Scikit-Learn\n", "- Understand the applications and limitations of each model\n", "- Learn how to clean and preprocess data using Pandas before feeding it into machine-learning models\n", "- Understand the importance of handling missing values and feature engineering\n", "- Learn how to evaluate the performance of different machine-learning models\n", "- Interpret the output of models to make informed decisions based on the data \n" ] }, { "cell_type": "code", "execution_count": 1, "id": "1fd7b159-8e6f-4894-b1a2-afd7c405ee08", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# import the packages used before and read in the required data\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "air_pollution_data_2023_complete_dataset = pd.read_csv(\"data/LEED_air_pollution_monitoring_station_2023_complete_dataset.csv\", index_col=0)\n", "air_pollution_data_2023_complete_dataset = air_pollution_data_2023_complete_dataset.dropna()" ] }, { "cell_type": "code", "execution_count": 2, "id": "da886804-40e5-445f-a61d-1a80a84a9558", "metadata": {}, "outputs": [], "source": [ "# Scikit-Learn can be imported with the following command\n", "import sklearn" ] }, { "cell_type": "markdown", "id": "601a8215-a645-4fe6-8334-1dc7434a589b", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## What is Scikit-Learn?\n", "\n", "Scikit-Learn is a popular Python package that provides a set of algorithms and tools for 
machine learning that are both easy to use and effective. The package supports a range of tasks, including classification, regression, clustering, dimensionality reduction, model selection and normalization.\n", "\n", "Scikit-Learn models accept a large number of optional arguments, so for the purposes of this course the default arguments are used. \n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "46f26ec1-c4a5-42f3-9f50-1ad40b0021f5", "metadata": {}, "outputs": [], "source": [ "# convert the date column from text (day/month/year hour:minute) to datetime, then extract the hour of day as a new column\n", "air_pollution_data_2023_complete_dataset[\"date\"] = pd.to_datetime(air_pollution_data_2023_complete_dataset[\"date\"], format=\"%d/%m/%Y %H:%M\")\n", "air_pollution_data_2023_complete_dataset[\"Hour\"] = air_pollution_data_2023_complete_dataset[\"date\"].dt.hour" ] }, { "cell_type": "code", "execution_count": 4, "id": "9950fa3b-8d1d-403d-960a-bf761f3a654a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | date | NO2 | O3 | NO | Wind Speed | Temperature | site | Year | Hour |
|---|---|---|---|---|---|---|---|---|---|
| 24664 | 2023-01-01 01:00:00 | 7.30306 | 76.61852 | 1.22702 | 4.9 | 7.2 | Leeds Centre | 2023.0 | 1 |
| 24665 | 2023-01-01 02:00:00 | 4.31351 | 79.67418 | 0.82507 | 7.0 | 7.5 | Leeds Centre | 2023.0 | 2 |
| 24666 | 2023-01-01 03:00:00 | 2.95539 | 81.50758 | 0.79333 | 7.3 | 7.3 | Leeds Centre | 2023.0 | 3 |
| 24667 | 2023-01-01 04:00:00 | 2.08340 | 81.45519 | 0.53947 | 7.2 | 7.2 | Leeds Centre | 2023.0 | 4 |
| 24669 | 2023-01-01 06:00:00 | 2.95643 | 82.71238 | 0.57120 | 6.8 | 7.1 | Leeds Centre | 2023.0 | 6 |