3D Temperature Diffusion Model: Bringing it all together#

Learning Objectives#

By the end of this section, learners will be able to:

  • Implement a naive 3D temperature diffusion model in pure Python and understand its computational limitations.

  • Optimise the naive implementation using NumPy for CPU execution and CuPy for GPU acceleration.

  • Compare runtime performance across pure Python, NumPy, and CuPy implementations using scaling plots.

  • Profile temperature diffusion code to identify computational bottlenecks and quantify performance improvements.

  • Download, summarise, and visualise real-world ocean temperature data from NetCDF files.

  • Use static and interactive visualisation techniques (2D slices and 3D cubes) to explore temperature diffusion results.

  • Evaluate compute vs. memory trade-offs when running diffusion models on CPUs versus GPUs.

  • Design and carry out extended exercises, such as domain decomposition, multi-GPU strategies, or custom CUDA kernels, to explore advanced optimisation techniques.

  • Validate model results across different implementations to ensure numerical consistency and scientific correctness.

Resource Files#

The job submission scripts specifically configured for use on the University of Exeter ISCA HPC system are available here.

General-purpose job submission scripts, which can serve as a starting point for use on other HPC systems (with minor modifications required for this course), are available here.

The Python scripts used in this course can be downloaded here.

All supplementary files required for the course are available here.

The presentation slides for this course can be accessed here.

Overview#

In this section, we start with a straightforward, “naive” implementation of a 3D temperature-diffusion model to demonstrate the core algorithm. Once you’ve verified that basic version, you’ll write two optimised variants: one using NumPy on the CPU and one using CuPy on the GPU. You’ll then generate scaling plots for each implementation and profile them to see exactly where time is spent and how performance diverges.

The goal of this section is to mirror a real-world research project: a PI has handed you code that is scientifically accurate but painfully slow, and has asked you, as the software specialist, to optimise it so that it runs faster while producing the same output.

The next section, “Data”, covers the dataset and the helper functions provided for your analysis. The problem you will tackle is described in “Bringing It All Together”, and possible directions you can take are laid out in “Exercise: Approaching The Problem You Want”.

Data#

For this task, we need a starting dataset of 3-dimensional ocean temperatures, which we can download from the Copernicus Marine Data Service.

Downloading Data#

Note

If you are working on this project on the University of Exeter ISCA HPC as part of the RSA Team GPU Training Day, then you can run the following command to copy the file into place for you:

cp /lustre/projects/Research_Project-RSATeam/data/cmems_mod_glo_phy-thetao_anfc_0.083deg_PT6H-i_thetao_13.83W-6.17E_46.83N-65.25N_0.49-5727.92m_2024-01-01-2024-01-01.nc /lustre/projects/Research_Project-RSATeam/$USER/GPU_Training/model_data/.

A utility function, bundled as a Poetry script in the course repository, will download the data for you:

poetry run download_data

will download the required dataset for this course into the ./data directory. The dataset that is downloaded is:

Description:
This dataset was downloaded from the Global Ocean Physics Analysis and Forecast service. It provides data for global ocean physics, focusing on sea water potential temperature.

  • Product Identifier: GLOBAL_ANALYSISFORECAST_PHY_001_024

  • Product Name: Global Ocean Physics Analysis and Forecast

  • Dataset Identifier: cmems_mod_glo_phy-thetao_anfc_0.083deg_PT6H-i

Variable Visualized:

  • Sea Water Potential Temperature (thetao): Measured in degrees Celsius [°C].

Geographical Area of Interest:

  • Region: Around the United Kingdom

  • Coordinates:

    • Northern Latitude: 65.312

    • Eastern Longitude: 6.1860

    • Southern Latitude: 46.829

    • Western Longitude: -13.90

Depth Range:

  • Minimum Depth: 0.49 meters

  • Maximum Depth: 5727.9 meters

File Size:

  • 267.5 MB

Visualising Data#

To make visualising the data easier, three utility functions have been created. By default, the visualisations are written to the output directory.

Visualise Slice (Static)#

Visualises a 2D temperature slice as a static image. The depth targeted is the surface, i.e. 0.49 m.

poetry run visualise_slice_static --data-file <netcdf_filename.nc>

The output produced will be a .png file, such as:

[Figure: Temperature Slice]
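
If you would rather build this kind of figure yourself, the sketch below shows one way to do it with xarray and Matplotlib. It is our own illustration, not the bundled utility; the file path is a placeholder, and the variable name thetao comes from the dataset description above.

import xarray as xr
import matplotlib.pyplot as plt

# Placeholder: substitute the NetCDF file you downloaded earlier
ds = xr.open_dataset("model_data/<netcdf_filename.nc>")

# First timestep, at the depth level closest to the surface (0.49 m)
surface = ds["thetao"].isel(time=0).sel(depth=0.49, method="nearest")

surface.plot(cmap="viridis")  # xarray labels the lat/lon axes automatically
plt.title("Sea water potential temperature (~0.49 m)")
plt.savefig("output/temperature_slice.png", dpi=150)  # assumes output/ exists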

Visualise Slice - Interactive HTML file#

Visualises a 2D temperature slice in an interactive HTML file, allowing the time series to be animated.

poetry run visualise_slice --target_depth 0 --animation_speed 100 --data-file <netcdf_filename.nc>

The command above will create an interactive HTML file in which each timestep of the animation lasts 100 milliseconds (--animation_speed), plotted at the depth level nearest to the requested --target_depth (here 0, which maps to 0.49 m). When run within your own space, the file produced will be `output/original_temperature_2d_interactive.html`.

Visualise Cube - Interactive HTML file#

Visualises the temperature data as a 3D cube in an interactive HTML file, allowing the time series to be explored.

poetry run visualise_cube --num_depths 5 --num_time_steps 3 --data-file <netcdf_filename.nc>

The command above will create an interactive HTML file visualising the first 5 depth levels over 3 time steps. When run within your own space, the file produced will be `output/original_temperature_3d_interactive.html`.

Summarising Data#

Calculates and prints summary statistics for the temperature data in a specified NetCDF file: its mean, max, min, and standard deviation, plus information about the dataset’s dimensions and coordinates.

poetry run summary

The command above will print a summary of the original data file downloaded from Copernicus:

The dimensions of the data is: (5, 50, 222, 241)
Temperature Summary Statistics:
Mean temperature: 8.56154727935791
Max temperature: 14.050389289855957
Min temperature: -2.591400146484375
Standard deviation: 3.1273183822631836

Dataset Dimensions and Coordinates:
<xarray.Dataset>
Dimensions:    (depth: 50, latitude: 222, longitude: 241, time: 5)
Coordinates:
  * depth      (depth) float32 0.494 1.541 2.646 ... 5.275e+03 5.728e+03
  * latitude   (latitude) float32 46.83 46.92 47.0 47.08 ... 65.08 65.17 65.25
  * longitude  (longitude) float32 -13.83 -13.75 -13.67 ... 6.0 6.083 6.167
  * time       (time) datetime64[ns] 2024-01-01 ... 2024-01-02
Data variables:
    thetao     (time, depth, latitude, longitude) float32 13.47 13.42 ... nan
Attributes: (12/14)
    Conventions:                   CF-1.6
    area:                          GLOBAL
    contact:                       servicedesk.cmems@mercator-ocean.eu
    credit:                        E.U. Copernicus Marine Service Information...
    institution:                   Mercator Ocean
    licence:                       http://marine.copernicus.eu/services-portf...
    ...                            ...
    product_user_manual:           http://marine.copernicus.eu/documents/PUM/...
    quality_information_document:  http://marine.copernicus.eu/documents/QUID...
    references:                    http://marine.copernicus.eu
    source:                        MERCATOR GLO12
    title:                         Instantaneous fields for product GLOBAL_AN...
    copernicusmarine_version:      1.3.4
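
The summary command wraps logic along these lines. The sketch below is our own minimal version, not the packaged implementation, using xarray’s NaN-aware reductions on the thetao variable:

import xarray as xr

ds = xr.open_dataset("model_data/<netcdf_filename.nc>")  # placeholder path
theta = ds["thetao"]

print(f"The dimensions of the data is: {theta.shape}")
print("Temperature Summary Statistics:")
# skipna=True ignores the NaN land cells when reducing
print(f"Mean temperature: {float(theta.mean(skipna=True))}")
print(f"Max temperature: {float(theta.max(skipna=True))}")
print(f"Min temperature: {float(theta.min(skipna=True))}")
print(f"Standard deviation: {float(theta.std(skipna=True))}")

print("Dataset Dimensions and Coordinates:")
print(ds)  # dimensions, coordinates, and attributes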

Bringing It All Together#

The goal of this section is to bring together what has been discussed so far in this course. You are provided with two options for what to tackle; you could:

  • Develop a NumPy and a CuPy version of the pure-Python temperature diffusion model provided.

  • Profile the provided NumPy and CuPy code with the tools discussed so far, understand where computation time is being spent, and produce associated plots and a report on the performance of the different solutions.

Pseudocode#

The pseudocode that implements the diffusion loop is:

1. For each timestep from 1 to num_timesteps:
   2. Copy the current temperature values to a temporary array (temp_copy)
   3. Initialize arrays for neighbor sums and neighbor counts with zeros
   4. For each valid cell (ignoring boundaries):
      5. Calculate the sum of neighboring cells:
         - Add the value of the front neighbor if valid
         - Add the value of the back neighbor if valid
         - Add the value of the left neighbor if valid
         - Add the value of the right neighbor if valid
         - Add the value of the top neighbor if valid
         - Add the value of the bottom neighbor if valid
      6. Count the number of valid neighbors for each direction
   7. Update the cell's temperature:
      - New temperature = current temperature + diffusion coefficient * (neighbor_sum - neighbor_count * current temperature) / neighbor_count
   8. Ensure invalid points (NaN) remain unchanged
   9. Update the main temperature array with the new values
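
For the NumPy exercise later in this section, the same update can be expressed without explicit Python loops. The function below is a sketch of one possible vectorised formulation (our own, not the course reference implementation); it performs a single timestep on a (depth, lat, lon) array in which NaN marks land:

import numpy as np

def diffusion_step_numpy(temp, diffusion_coeff=0.1):
    """One diffusion step; NaN cells are land and stay unchanged."""
    valid = ~np.isnan(temp)               # ocean cells
    filled = np.where(valid, temp, 0.0)   # land contributes nothing to sums

    nb_sum = np.zeros_like(filled)
    nb_cnt = np.zeros_like(filled)
    inner = (slice(1, -1),) * 3           # interior cells only, as in step 4

    # Accumulate the six face neighbours with shifted slices: for interior
    # indices 1..n-2 along `axis`, the neighbour at index+offset occupies
    # 1+offset..n-2+offset.
    for axis in range(3):
        n = temp.shape[axis]
        for offset in (-1, 1):
            src = [slice(1, -1)] * 3
            src[axis] = slice(1 + offset, n - 1 + offset)
            nb_sum[inner] += filled[tuple(src)]
            nb_cnt[inner] += valid[tuple(src)]

    # Step 7: delta = c * (sum - count * center) / count; cells with no
    # valid neighbours (count == 0) are left unchanged via the dummy divisor
    safe_cnt = np.where(nb_cnt > 0, nb_cnt, 1.0)
    delta = diffusion_coeff * (nb_sum - nb_cnt * filled) / safe_cnt
    return np.where(valid, temp + delta, temp)

Because CuPy mirrors the parts of the NumPy API used here (isnan, where, zeros_like, basic slicing), swapping import numpy as np for import cupy as cp turns the same sketch into a GPU version.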

The naive implementation of this problem is implemented within the function temperature_diffusion_purepython() in the file content/temperature_diffusion.py. A complete solution including reading and writing of .nc files is:

import argparse
import copy
import math
import time
from pathlib import Path

import numpy as np
import xarray as xr


# Define the root directory
ROOT_DIR = Path(__file__).resolve().parent.parent
DATA_DIR = ROOT_DIR / "model_data"
DATA_FILE = "cmems_mod_glo_phy-thetao_anfc_0.083deg_PT6H-i_thetao_13.83W-6.17E_46.83N-65.25N_0.49-5727.92m_2024-01-01-2024-01-01.nc"

OUTPUT_FILE_NUMPY = "predicted_temperatures_numpy.nc"
OUTPUT_FILE_CUPY = "predicted_temperatures_cupy.nc"
OUTPUT_FILE_PUREPYTHON = "predicted_temperatures_purepython.nc"


# Load the NetCDF data
def load_data():
    file_path = DATA_DIR / DATA_FILE
    return xr.open_dataset(file_path)


def save_to_netcdf(data, new_temperature, output_file_path, num_timesteps):
    # Adjust new_temperature to have only num_timesteps or fewer
    new_temperature = new_temperature[:num_timesteps]  # Only include the desired number of timesteps

    # Generate a new time coordinate as a sequence of numbers (1, 2, ..., num_timesteps)
    time_coord = range(1, num_timesteps + 1)

    # Create a new dataset with original depth, latitude, and longitude coordinates
    output_data = xr.Dataset(
        {'thetao': (('time', 'depth', 'latitude', 'longitude'), new_temperature)},
        coords={
            'time': time_coord,  # Sequential time coordinate
            'depth': data['depth'].values,  # Use original depth coordinates
            'latitude': data['latitude'].values,  # Use original latitude coordinates
            'longitude': data['longitude'].values  # Use original longitude coordinates
        },
    )
    output_data.to_netcdf(output_file_path, engine='netcdf4')

def temperature_diffusion_purepython(data, num_timesteps, diffusion_coeff=0.1):
    # Pull raw array and get dims
    raw = data['thetao'].values              # shape (time, depth, lat, lon)
    depth, lat, lon = raw.shape[1], raw.shape[2], raw.shape[3]

    # Initial snapshot at t=0, as Python lists
    initial = raw[0]                         # shape (depth, lat, lon)
    temperature = []
    for _ in range(num_timesteps):
        # deep copy initial for each timestep
        plane = []
        for d in range(depth):
            plane.append([ [ float(initial[d][i][j]) for j in range(lon) ]
                           for i in range(lat) ])
        temperature.append(plane)

    # Summary stats (very slow!): a full Python-loop pass over every cell;
    # note that flat_sorted is not used again below
    flat = []
    for t in range(num_timesteps):
        for d in range(depth):
            for i in range(lat):
                for j in range(lon):
                    v = temperature[t][d][i][j]
                    if not math.isnan(v):
                        flat.append(v)
    flat_sorted = sorted(flat)

    # Precompute mask of valid ocean points
    mask = [[[ [ not math.isnan(temperature[t][d][i][j])
                 for j in range(lon) ]
               for i in range(lat) ]
             for d in range(depth) ]
           for t in range(num_timesteps)]

    # Prepare output buffer
    new_temperature = copy.deepcopy(temperature)
    timestep_durations = []

    # Diffusion loop
    for t in range(num_timesteps):
        start = time.time()
        for d in range(1, depth-1):
            for i in range(1, lat-1):
                for j in range(1, lon-1):
                    if mask[t][d][i][j]:
                        center = temperature[t][d][i][j]
                        total = 0.0
                        count = 0
                        # 6 neighbors
                        for dd, ii, jj in (
                            (d-1,i,j), (d+1,i,j),
                            (d,i-1,j), (d,i+1,j),
                            (d,i,j-1), (d,i,j+1)
                        ):
                            if mask[t][dd][ii][jj]:
                                total += temperature[t][dd][ii][jj]
                                count += 1
                        # apply diffusion
                        if count > 0:
                            delta = diffusion_coeff * (total - count*center) / count
                        else:
                            delta = 0.0
                        new_temperature[t][d][i][j] = center + delta
        timestep_durations.append(time.time() - start)
        # seed the next timestep from the newly diffused field so that
        # diffusion accumulates over time
        if t + 1 < num_timesteps:
            temperature[t + 1] = copy.deepcopy(new_temperature[t])

    final = np.array(new_temperature)

    # Save result
    save_to_netcdf(data, final, DATA_DIR / OUTPUT_FILE_PUREPYTHON, num_timesteps)

    total = sum(timestep_durations)
    avg = total / num_timesteps

    print(f"PurePython model completed in {total:.4f} seconds. "
          f"Average time per timestep: {avg:.4f} seconds.")

def run_diffusion_purepython():
    parser = argparse.ArgumentParser(description="Run 3D Diffusion Model in pure Python")
    parser.add_argument("--num_timesteps", type=int, default=300, help="Number of Timesteps")
    args = parser.parse_args()
    temperature_diffusion_purepython(data=load_data(), num_timesteps=args.num_timesteps)


if __name__ == "__main__":
    run_diffusion_purepython()
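
Once you have written an optimised variant and saved its output (for example to OUTPUT_FILE_NUMPY), a quick consistency check along the lines below covers the validation objective. This is a sketch, and the tolerances are our own choice:

import numpy as np
import xarray as xr

ref = xr.open_dataset("model_data/predicted_temperatures_purepython.nc")["thetao"].values
new = xr.open_dataset("model_data/predicted_temperatures_numpy.nc")["thetao"].values

# Land (NaN) cells must line up exactly; ocean values should agree to
# within floating-point tolerance
assert np.array_equal(np.isnan(ref), np.isnan(new))
assert np.allclose(ref, new, rtol=1e-5, atol=1e-6, equal_nan=True)
print("Pure Python and NumPy outputs are numerically consistent.")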

Example Results#

Runtime vs. Number of Timesteps for Temperature Diffusion: Pure Python vs. NumPy (CPU) vs. CuPy (GPU)

| GPU Name | CPU Name | Method | Time Steps | Mean Time (s) | Std Dev (s) |
| --- | --- | --- | --- | --- | --- |
| NVIDIA H100 NVL | AMD EPYC 9V84 96-Core Processor | Pure Python | 10 | 43.698821 | 0.08423 |
| NVIDIA H100 NVL | AMD EPYC 9V84 96-Core Processor | Pure Python | 25 | 110.014916 | 0.939186 |
| NVIDIA H100 NVL | AMD EPYC 9V84 96-Core Processor | Pure Python | 50 | 223.748059 | 1.287061 |
| NVIDIA H100 NVL | AMD EPYC 9V84 96-Core Processor | Pure Python | 100 | 461.348654 | 7.234776 |
| NVIDIA H100 NVL | AMD EPYC 9V84 96-Core Processor | NumPy (CPU) | 10 | 4.980454 | 0.173302 |
| NVIDIA H100 NVL | AMD EPYC 9V84 96-Core Processor | NumPy (CPU) | 25 | 22.786639 | 0.205813 |
| NVIDIA H100 NVL | AMD EPYC 9V84 96-Core Processor | NumPy (CPU) | 50 | 86.931572 | 0.502695 |
| NVIDIA H100 NVL | AMD EPYC 9V84 96-Core Processor | NumPy (CPU) | 100 | 347.883550 | 3.980833 |
| NVIDIA H100 NVL | AMD EPYC 9V84 96-Core Processor | CuPy (GPU) | 10 | 3.854066 | 0.140653 |
| NVIDIA H100 NVL | AMD EPYC 9V84 96-Core Processor | CuPy (GPU) | 25 | 12.885241 | 0.022276 |
| NVIDIA H100 NVL | AMD EPYC 9V84 96-Core Processor | CuPy (GPU) | 50 | 46.622783 | 0.178398 |
| NVIDIA H100 NVL | AMD EPYC 9V84 96-Core Processor | CuPy (GPU) | 100 | 182.261322 | 0.193946 |

Exercise: Approaching The Problem You Want#

Building on the “naive” pure-Python diffusion above, your task is to explore and extend the code in a way that resonates with you. Some routes you could take include:

  1. Implement Optimized Variants

    • Rewrite the naive nested-loop Python implementation using NumPy vectorized operations to leverage contiguous array arithmetic.

    • Translate your NumPy version to CuPy so that the heavy array work runs on the GPU.

  2. Performance Profiling & Scaling

    • Use Python’s time module, cProfile, NVIDIA Nsight, or another tool to measure per-timestep costs for each implementation (naive, NumPy, CuPy); a minimal timing sketch is provided after this list.

    • Generate scaling plots to quantify speedup factors and identify bottlenecks.

  3. Memory Footprint & Out-of-Core Strategies

    • Experiment with different depth/latitude/longitude resolutions and observe GPU vs. CPU memory usage.

    • If you exhaust GPU memory, explore chunking or out-of-core techniques (e.g. processing one depth-slice at a time).

  4. Domain Decomposition & Halo Exchanges

    • Split the 3D grid into sub-domains (e.g. along latitude/longitude) for parallel processing.

    • Implement halo (ghost) cell exchanges between sub-domains to maintain correctness at boundaries when updating temperature.

  5. Multi-GPU Offloading

    • Use CuPy’s multi-GPU capabilities (for example, cupy.cuda.Device) to assign different sub-domains to separate GPUs.

  6. Advanced Kernel Optimizations

    • Investigate writing custom CUDA kernels (via CuPy’s RawKernel) or use Numba’s JIT compiler to fuse loops and reduce launch overhead.

    • Profile memory access patterns; consider tiling or pitched memory layouts to improve cache efficiency.

  7. Visualization & Validation

    • Create interactive Plotly or static Matplotlib visuals of both the diffusion results and your profiling data.

    • Verify that all implementations produce numerically consistent outputs (e.g., compare final temperature fields).
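
As a starting point for the profiling route (item 2 above), here is a minimal timing and profiling sketch. It is our own scaffold, not part of the course repository, and assumes each implementation is wrapped in a function that takes num_timesteps:

import cProfile
import pstats
import statistics
import time

def time_runs(fn, num_timesteps, repeats=3):
    """Run fn(num_timesteps) `repeats` times; return (mean, stdev) seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(num_timesteps)
        # For a CuPy implementation, synchronise the device before stopping
        # the clock, or you will only time the asynchronous kernel launches:
        #   cupy.cuda.Stream.null.synchronize()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)

def profile_run(fn, num_timesteps):
    """Print the ten most expensive calls in a single run."""
    with cProfile.Profile() as profiler:
        fn(num_timesteps)
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)

Collecting (mean, stdev) pairs at 10, 25, 50, and 100 timesteps for each implementation gives you the data behind a table like the one in Example Results, ready for a scaling plot.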

Feel free to pick one or more of these directions, or combine them, to deepen your understanding of high-performance computing with Python, NumPy, and CuPy. Note that we have not tackled all of the topics above in this course, so you may need to do some independent research if you choose one of the advanced topics. Good luck!