Performant Code

Performant Code#

Learning Objectives#

Understand basic strategies for writing high-performance Julia code
Learn about type stability and why it’s important for performance
Recognise the importance of using functions and avoiding global variables on performance
Appreciate the benefit of avoiding unnecessary allocations
Understand that arrays are in column-major order and implications for performance
Appreciate how to choose appropriate data structures for a task to improve performance
Measure running time and memory allocation of code and identify bottlenecks using simple tools

One of Julia’s major appeals is performance. You can often write code in Julia that is both high-level and also runs nearly as fast as lower-level languages. However, to fully unlock this performance, it’s good to be aware of a few tips and practices. Within this episode, we are going to introduce some key concepts: type stability, avoiding allocations, using efficient approaches, and basic profiling/timing.

Overview of Performance Tuning Strategies#

Performance tuning in Julia often comes down to writing code that is easy for the compiler to optimise, which includes:
Ensuring computations are type-stable: the types of variables don’t change unpredictably.
Avoiding global variables in tight loops or computations: use functions to encapsulate logic.
Reducing memory allocations when possible, for example, by modifying data in place or using views for subarrays instead of making copies.
Respecting the column-major order of arrays.
Choose the right data structure, e.g. using arrays, tuples, and dictionaries appropriately.
Measuring and profiling to find where the time is actually being spent, so you can optimise where it matters.

Measuring performance#

When we talk about ‘performance’, it can be helpful to think about this along two dimensions:

The execution time of the code
How much memory it needs to allocate

There are interactions between these: code that has to allocate a lot of memory will slower than code that performs the same computations but doesn’t allocate as much memory.

Note: because Julia performs JIT compilation, when we need to assess the execution time we usually want to ensure we don’t include the compilation time.

Memory allocations: the heap

When we talk about ‘allocations’ in Julia, we are referring to allocations of ‘heap’ memory. In general, there are two types of RAM memory available when running a Julia process: the stack and the heap. The stack is a region of memory that stores local variables and function call information. Accessing and managing the lifetime of variables whose data is stored on the stack is very quick because the stack is a simple ordered layout of data. But this requires that the size the object and its type are known at compile time (and, for technical reasons, that the size of the object is not too large). In contrast, the heap is a region of memory used for dynamic memory allocation e.g. for objects that could grow / shrink in size as the program runs, such as arrays, or whose type changes. The heap is more flexible, but it’s more expensive to allocate and manage memory on the heap. To manage the deallocation of data on the heap, Julia has a garbage collector that works in the background, which further adds to the cost of using the heap. Using the heap is not ‘bad’ per se: it’s often essential for all but the simplest programs! However, using the stack where possible will make code much more performant. Several of the tips below revolve around avoiding unnecessary heap allocations.

Julia provides some simple macros to measure execution time and memory:

@time expression runs expression once, printing both execution time (including compilation on the first run) and memory allocated. To see the “steady-state” performance (without compile-time overhead), run it a second time.
@benchmark expression (from BenchmarkTools.jl) runs expression many times, automatically “warming-up” the function. It reports statistics (min, median, mean, allocations, etc), so you avoid compile-time bias and see true variability.
@timed expression returns a Timing object containing fields like time and allocs rather than printing them, so you can programmatically inspect or log those metrics.
@allocated expression returns just the number of bytes allocated by running expression.

For these notes we’ll stick to using the built-in @time macro. Because it will include any JIT compilation, we typically need to run it twice:

function sum_squares(arr)
    return sum(map(x -> x^2, arr))
end

arr = rand(100_000)

# First time includes compiling `sum_squares` for Vector{Float64}
println("sum_squares -- first call:")
@time sum_squares(arr)

# Second time doen't include compilation because Julia uses cached
# compiled version
println("sum_squares -- second call:")
@time sum_squares(arr)

sum_squares -- first call:
  0.021500 seconds (103.41 k allocations: 6.013 MiB, 99.51% compilation time)
sum_squares -- second call:
  0.000090 seconds (4 allocations: 800.078 KiB)

33276.41922708816

Write functions and avoid global variables#

Julia compiles and optimises functions just-in-time (JIT) when you call them, but code that lives in the global scope, such as a top-level loop that uses global variables, cannot be optimised as effectively. This is because globals might change type or value at any moment.

Best Practice: encapsulate all performance-critical work inside a function, then invoke those functions, passing all required variables in as arguments. By doing so, you ensure the compiler sees only local variables with known, concrete types.

For instance, rather than writing at the top level:

# Global loop – avoids type instability but still limits optimisation
numbers = rand(1000)
total = 0.0
for x in numbers
    total += x
end
println(total)

You would instead want to define and call a function:

function sum_array(arr::Vector{FLoat64})
    total = 0.0            # local Float64
    for x in arr           # x is Float64
        total += x
    end
    return total
end

numbers = rand(1000)
total = sum_array(numbers)
println(total)

Inside sum_array, both total and x have fixed, known types, allowing the compiler to produce highly optimised code. In the global version, even though total is initialised as a Float64, its status as a global variable prevents the same level of optimisation.

You can annotate a globals type:

global_total::Float64 = 0.0

This forbids it from ever changing to another type, but this still won’t unlock all the optimisations you get inside functions. This is discussed further in the Julia documentation.

For truly constant values, always declare them with const at the global level.

const PI = 3.14

This signals to the compiler that PI will never change, enabling further speedups.

Type stability#

A function is type-stable if the (concrete) type of its return value can be determined from the types of its inputs at compile time i.e. without having to run the function in an actual program.

You can use the @code_warntype macro when calling a function to check how type-stable it is:

@code_warntype your_function(args)

Any areas in the function body where Julia cannot infer the types at compile time will be highlighted in orange/yellow or red, depending on the severity of the type instability. Let’s look at some examples.

Type-stable#

The following is an example of a type-stable function:

function squash_neg(x)
    if x < 0
        return zero(x)  # return 0 of same type same as typeof(x)
    else
        return x  # return type same as typeof(x)
    end
end

squash_neg (generic function with 1 method)

To see that this is type-stable, the question we need to ask is: “Suppose we’re given the type of x. Can we unambiguously determine the return type of the function as a concrete type, given this and the code in the function body?” In this case the answer is yes, because down each if branch, the type is uniquely determined by the type of x (indeed, in this case it happens to be the same as typeof(x)).

We can verify this using the @code_warntype macro:

@code_warntype squash_neg(1.2)

MethodInstance for squash_neg(::Float64)
  from squash_neg(x) @ Main In[2]:1
Arguments
  #self#::Core.Const(Main.squash_neg)
  x::Float64
Body::Float64
1 ─ %1 = Main.:<::Core.Const(<)
│   %2 = (%1)(x, 0)::Bool
└──      goto #3 if not %2
2 ─ %4 = Main.zero::Core.Const(zero)
│   %5 = (%4)(x)::Core.Const(0.0)
└──      return %5
3 ─ %7 = x::Float64
└──      return %7

The output from @code_warntype shows that the output will be a Float64 for the given Float64 argument; all the variable types are clearly inferred, and there are no concerning red or yellow types. Julia can compile optimised machine code with no dynamic checks, which is the ideal case for performance.

Note

Notice how we don’t need to specify the types of the arguments in the signature for a function to be type-stable. Remember that Julia infers the types of its arguments and then compiles a version of the function specific to those types: type-stability comes into play after that inference has taken place. Type annotations in signatures mainly help with dispatch, readability, and documentation; they rarely affect raw speed except in rare corner cases, as discussed within the Julia documentation.

Mildly Type-Unstable Function#

Let’s make a small tweak to squash_neg to give an example of a type-unstable function (taken from the Julia docs on performant code):

function squash_neg2(x)
    if x < 0
        return 0  # an Int
    else
        return x  # return type same as typeof(x)
    end
end

squash_neg2 (generic function with 1 method)

We’ve replaced the call to zero(x) with a literal 0. This has made the function type unstable:

Sometimes it returns an Int, sometimes typeof(x) (which could be Int, Int32, Float64,…)
The type returned depends on the value of x, not just its type.
Therefore the compiler can’t predict the output type just from the input types.
This would force Julia to insert expensive type checks at runtime for code that relies on this function.

We can see this in detail using @code_warntype:

@code_warntype squash_neg2(1.2)

MethodInstance for squash_neg2(::Float64)
  from squash_neg2(x) @ Main In[4]:1
Arguments
  #self#::Core.Const(Main.squash_neg2)
  x::Float64
Body::Union{Float64, Int64}
1 ─ %1 = (x < 0)::Bool
└──      goto #3 if not %1
2 ─      return 0
3 ─ %4 = x::Float64
└──      return %4

Above, you can see the mildly type-unstable function squash_neg2. In the @code_warntype squash_neg2(1.2) output, the return slot is highlighted in yellow as Body::Union{Float64, Int64}, indicating Julia infers it will return either a Float64 or a Int64 given the argument 1.2, but it can’t determine exactly which. At runtime, Julia must handle both possibilities. However, this situation is not as bad as it could be since there is only a choice of two types; the Julia compiler may still be able to perform some optimisations to ameliorate this.

Severely Type-Unstable Function: `sum_positive`#

Let’s now look at a function that exhibits more serious type-instability.

Note: Unfortunately the text below may not feature the red highlighting in the web version of these notes. Try running the code in your own Julia REPL to see the output colour highlighting!

"""Sum the positive elements of an array of real numbers"""
function sum_positive(arr)
    sum(squash_neg.(arr))  # using vectorised version of squash_neg
end

arr = Real[1, -2.1, π]
@code_warntype sum_positive(arr)

MethodInstance for sum_positive(::Vector{Real})
  from sum_positive(arr) @ Main In[6]:2
Arguments
  #self#::Core.Const(Main.sum_positive)
  arr::Vector{Real}
Body::Any
1 ─ %1 = Main.sum::Core.Const(sum)
│   %2 = Base.broadcasted(Main.squash_neg, arr)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(squash_neg), Tuple{Vector{Real}}}
│   %3 = Base.materialize(%2)::AbstractVector{<:Real}
│   %4 = (%1)(%3)::Any
└──      return %4

The crucial line in the output is the following:

Body::Any

Body::Any is highlighted in red, indicating Julia could only infer that the return value is of type Any. As the input vector arr is a Vector{Real}, Julia has no way to know at compile time exactly what (concrete) type each x in arr will be. As a result, dynamic type-checking is required for each x, and the returned value is boxed as Any, resulting in heap allocations and the need for garbage collection. Boxing values means they are stored in the heap with additional type information rather than on the stack.

Is it always that bad?#

What if we instead provide an array with concrete element type, such as arr = [1, -2.1, π] (a Vector{Float64})?

arr = [1, -2.1, π]
@code_warntype sum_positive(arr)

MethodInstance for sum_positive(::Vector{Float64})
  from sum_positive(arr) @ Main In[6]:2
Arguments
  #self#::Core.Const(Main.sum_positive)
  arr::Vector{Float64}
Body::Float64
1 ─ %1 = Main.sum::Core.Const(sum)
│   %2 = Base.broadcasted(Main.squash_neg, arr)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(squash_neg), Tuple{Vector{Float64}}}
│   %3 = Base.materialize(%2)::Vector{Float64}
│   %4 = (%1)(%3)::Float64
└──      return %4

We can see that now the return type has been inferred to be Float64 at compile time.

Here, sum_positive is still technically type‐unstable in the general sense, because its return type depends entirely on what you pass in. However, in practice, that doesn’t always translate to slow code. The line

Body::Float64

shows that, for a Vector{Float64}, the result must be Float64, so it emits specialised, fully‐typed code.

The takeaway is that, in this case, the element type of the actual argument matters just as much as your function. A function like sum_positive can be generally unstable, yet still perform optimally whenever you feed it collection with concrete‐typed elements (e.g. Vector{Int64} vs Vector{Real}).

Avoid containers with abstract elements#

In the example above, we supplied sum_positive with a Vector{Real}, i.e. an array where the element type can only be guaranteed to be the abstract Real type. Using abstract‐typed containers (e.g. Vector{Any}) forces Julia to fall back to run-time type checking and boxing. Keeping your collection parameterised over concrete element types is a crucial performance tip in Julia. So, for example, prefer to work with elements types Float64, Int, Float32, etc.

The following code shows how far more allocations are required to compute sum_positive on a Vector{Real} compared to Vector{Float64}, due to the boxing required to work with Real.

# Vector of Float64
arr = randn(1_000_000)

# Same values but considered as a Vector of Real
arr2 = Vector{Real}(undef, length(arr))
arr2 .= arr

# Run to compile
sum_positive(arr); sum_positive(arr2);

@time sum_positive(arr);
@time sum_positive(arr2);  # more allocations

  0.073071 seconds (4 allocations: 7.656 MiB, 98.86% gc time)
  0.010718 seconds (1.00 M allocations: 22.915 MiB)

Work with arrays in a column major way#

In Julia, arrays are stored in column-major order. This means that elements of a column (i.e. the first axis) are stored in contiguous memory locations. This has important performance implications when iterating through arrays.

When you loop through an array, accessing elements in the order they are stored in memory can significantly improve performance due to better cache utilization. In column-major order, this means making sure we iterate through the first index more frequently than the second index, (which in turn should be more frequent than the third index, etc.). This is achieved by iterating through the first index in the inner-most loop, followed by the second index in the next loop up, etc.

Let’s look at an example of summing over the elements of a matrix to illustrate this:

# Efficient: respects column-order
# inner-most loop runs over first axis
function sum_column_order(mat)
   sum = 0 
    for j in axes(mat, 2)
        for i in axes(mat, 1)
            sum += mat[i, j]
        end
    end
    return sum
end

# Inefficient: uses row-order instead
# inner-most loop runs over second axis
function sum_row_order(mat)
   sum = 0 
    for i in axes(mat, 1)
        for j in axes(mat, 2)
            sum += mat[i, j]
        end
    end
    return sum
end


# Run to compile
sum_column_order(rand(2, 2)); sum_row_order(rand(2, 2));

mat = rand(10_000, 10_000)
@time sum_column_order(mat);
@time sum_row_order(mat);  # slower!

  0.177636 seconds (1 allocation: 16 bytes)
  0.206866 seconds (1 allocation: 16 bytes)

Avoid unnecessary allocations#

Common sources of excessive allocation include type-stability issues (discussed above), which force Julia to box values on the heap, and creating many small temporary arrays, for example via repeated slicing or non-in-place broadcast, as discussed in the Julia documentation.

Here are some strategies to cut down heap usage. Recall that we can use @time or BenchmarkTools.@btime and watch the “allocations” count.

Pre-allocate and reuse arrays#

Instead of building a new array each iteration, or repeatedly extending an array, instead pre-allocate a single array of the correct size and populate it:

# Bad: extends array inside the loop
function squares(arr)
    results = []
    for x in arr
        push!(results, x^2)
    end
end

# Good: allocate once, then fill in place
function squares2(arr)
    results = similar(arr)
    for (i, x) in enumerate(arr)
        results[i] = x^2
    end
end

arr = rand(1_000_000)

# Run to compile
squares(arr); squares2(arr);

@time squares(arr)
@time squares2(arr)  # fewer allocations!

  0.007722 seconds (1.00 M allocations: 32.812 MiB)
  0.000686 seconds (2 allocations: 7.656 MiB)

Use views for subarrays#

Slicing an array creates a new copy of the data. This is often not required, so instead you can use the view function to create a lightweight window into an array without copying its data, thus avoiding a fresh allocation.

# Bad: sum applied to a slice, which involves allocation
function sum_at_even_indices(arr)
    n = length(arr)
    even_indices = 2:2:n
    sum(arr[even_indices])  # slicing copies
end

# Good: sum applied to a view -- no allocation
function sum_at_even_indices2(arr)
    n = length(arr)
    even_indices = 2:2:n
    sum(view(arr, even_indices))  # view doesn't copy
end

# Run to compile
sum_at_even_indices([1., 2.]); sum_at_even_indices2([1., 2.]);

arr = rand(10_000_000)
@time sum_at_even_indices(arr);
@time sum_at_even_indices2(arr);  # fewer allocations

  0.016086 seconds (4 allocations: 38.156 MiB, 23.95% gc time)
  0.003487 seconds (1 allocation: 16 bytes)

Favour in-place operations#

Modifying objects in-place avoids (or at least reduces) needing to allocated new memory. For example:

# Run to compile
sort!([2.0, 1.0]); sort([2.0, 1.0])

arr = rand(10_000_000)
@time sorted = sort(arr);  # returns new array
@time sort!(arr);    # rearranges in place

  0.113018 seconds (9 allocations: 152.602 MiB, 12.56% gc time)
  0.088945 seconds (6 allocations: 76.305 MiB)

Fuse vectorised (broadcast) operations#

Recall that we can vectorise (i.e. apply element-wise) any function using the ‘dot’ notation. When repeated functions need to be vectorised in a chain, then we can ‘fuse’ these together. This saves allocating temporary, intermediary arrays.

# Bad: intermediary variables involve allocations
function sin_of_cubic(x::Vector{Float64})
    cubed = x.^3
    squared = x.^2
    sin_cubed = sin.(cubed)
    sin_squared = sin.(squared)
    return sin_cubed .+ sin_squared
end

# Good: fusing multiple 'dot' calls avoids allocations
function sin_of_cubic2(x::Vector{Float64})
    return sin.(x.^3) .+ sin.(x.^2)
end

# Run to compile
sin_of_cubic([1.0]); sin_of_cubic2([1.0])

x = rand(1_000_000)
@time sin_of_cubic(x);
@time sin_of_cubic2(x);  # fused -> fewer allocations

  0.009816 seconds (15 allocations: 38.282 MiB)
  0.006306 seconds (3 allocations: 7.656 MiB)

Use appropriate data structures#

Some key considerations when determining which data structure to use include:

If you need random access to elements by index and the collection will grow/shrink, you require a Vector (Array).
If you need to look up values by keys, use a Dict instead of searching through an array each time.
If you have a fixed small set of values of heterogenous types, a Tuple can be helpful. They are immutable, and their types are part of their identity, making them very efficient for specific uses, like returning multiple values from a function.
If you need stack or queue behaviour, you can still use arrays (with push! or pop! for the stack and push! and popfirst! for the queue.
If you have binary data or bits, consider BitVector for large boolean arrays that are memory efficient.
For mathematical operations, using native numeric types (Int, Float64) is faster than arbitrary precisions or rational types, so only use BigInt, BigFloat and Rational when needed.

Clever algorithms#

Often, the most considerable speedups come not from micro-optimising memory, but from choosing better algorithms or mathematical shortcuts. For example, instead of summing the first N integers in a loop:

function sum1(N)
    total = 0
    for i in 1:N
        total += i
    end
    return total
end

You can use the closed-form formula:

sum2(N) = N * (N + 1) ÷ 2

This runs in constant time rather than being based on the size of the input, and allocates no extra memory. Whenever possible, look for algorithmic improvements or analytic formulas before resorting to low-level optimisations.

Profiling for bottlenecks#

If you have a complex program and you want to see where it spends time, you can use Julia’s standard library module Profile:

using Profile
@profile my_long_running_function()

Then use Profile.print() or a third party package to analyse the results. Profiling tells you which functions or lines are taking the most time. See https://docs.julialang.org/en/v1/manual/profile/ for more details on profiling.

Credits / further information#

The following guides were used for putting this episode together. They provide excellent reading if you want to dig further into performance optimisations and what’s going on under the hood in terms of using your computer’s hardware efficiently.

Exercise: Analysing the Performance of Code#

Given the three functions below, use what we’ve discussed so far about type stability, allocations, and performance to understand why they perform differently.

function method_1(N)
    arr = Int[]  
    for i in 1:N
        push!(arr, i)      
    end
    return sum(arr)        
end


function method_2(N)
    arr = collect(1:N)  # collect the range 1:N into a Vector{Int}
    return sum(arr)
end


function method_3(N)
    return N*(N+1)÷2
end

End of Section Quiz#

Which function is recommended for pre-allocating an array of zeros in Julia for better performance?

In Julia, why should you avoid growing arrays inside a loop without pre-allocation?

What is a general tip for writing more performant Julia code when using loops?