*We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way concern in your software project, feel free to contact us.*

When we are trying to speed up a memory-bound loop, there are several different paths. We could try decreasing the dataset size. We could try increasing available instruction level parallelism. We could try modifying the way we access data. Some of these techniques are very advanced. But sometimes we should start with the basics.

One of the ways to improve on memory boundness of a certain piece of code is the old-fashioned way: decrease the total number of memory accesses (loads or stores). Once a piece of data is in the register, using it is very cheap, to the point of being free (due to CPU’s ability to execute up to 4 instructions in a single cycle and their out-of-order nature). So all techniques that try to lower the total number of loads and stores should result in speedups.

## Techniques

*Like what you are reading? Follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available. Need help with software performance? Contact us!*

### Loop Fusion

If we have two consecutive loops that operate on the same dataset, fusing them into one loop would decrease the number of memory loads and stores and consequently improve performance. This transformation is called *loop fusion* or *loop jamming*. Consider the example of finding a minimum and maximum in an array. One of the ways to do it is using two separate loops:

double min = a[0]; for (int i = 1; i < n; i++) { if (a[i] < min) { min = a[i] }; } double max = a[0]; for (int i = 1; i < n; i++) { if (a[i] > max) { max = a[i] }; }

Both loops touch process the same dataset. We could merge the two loops into one loop that finds both minimum and maximum. This cuts the number of data loads by two.

double min = a[0]; double max = a[0]; for (int i = 1; i < n; i++) { if (a[i] < min) { min = a[i] }; if (a[i] > max) { max = a[i] }; }

We measure the effect of loop fusion in the experiments section.

A few notes about loop fusion:

*Loop fusion is also a compiler optimization technique*, so in theory the compiler can do it automatically. But this doesn’t happen often and when it does happen, this optimization has a tendency to break easily.- With regard to loop fusion,
*there are cases where loop fusion would fail to deliver speed improvements*. If one or both loops are vectorized before the fusion, but the fused loop is not vectorized, then this transformation can result in a slowdown, not a speedup. *Loop fusion is a close relative of loop sectioning*. The main difference is that loop fusion reuses the data while it is in registers, whereas loop sectioning reuses the data while it is in fast data caches. Therefore, a loop fusion is more memory efficient that loop sectioning, but fusing two loops is more complex than sectioning two loops. Also, preserving vectorization is easier with loop sectioning.

#### C++ Ranges

For folks using C++ STL algorithms, it is important to be aware of C++ ranges. The original STL library contains many algorithms, but they are not composable. Composability means that the result of one algorithm is fed into another algorithm. Consider the example^{1}:

struct user_t { int id; std::string name; int age; }; std::vectorget_ids_adult_users(std::vector & users) { std::vector adult_users; std::copy_if(std::cbegin(users), std::cend(users), std::back_inserter(adult_users), [](auto const& user) { return user.age >= 18; }); std::vector ids; std::transform(std::cbegin(adult_users), std::cend(adult_users), std::back_inserter(ids), [](auto const& user){ return user.id; }); return ids; }

The function `get_ids_adult_users`

returns the vector containing the `id`

s of all the users who are adults, i.e. whose age is 18 or more. To achieve this using STL, we use two algorithms: `std::copy_if`

which filters out the minor users to create the list of adult users and `std::transform`

to extract only ids from the vector of `user_t`

class.

This approach forces the code to iterate the two collections instead of one: the first collection is the original collection of users, and the second collection is the temporary collection holding adult users. To avoid this, C++ developers have *STL ranges* at their disposal. Here is the same example rewritten using ranges:

std::vectorget_ids_adult_users(std::vector & users) { auto ids = users | std::views::filter([](auto const& user){ return user.age >= 18; }) | std::views::transform([](auto const& user){ return user.id; }); return {ids.begin(), ids.end()} }

This code, apart from being cleaner, also performs fewer memory loads and memory stores. The filter adapter performs the filtering, and the result of filtering is directly fed into the transform adapter, element by element. This avoids running through the vector two times, and it is equivalent to loop fusion.

#### Kernel Fusion

*Kernel Fusion* is just loop fusion brought to a much higher level. Consider the following: imagine you have N image processing algorithms making an image processing pipeline. The output of algorithm X is the input of algorithm X+1. One of the ways to implement the pipeline is to have them run one after another, from zero to N-1. Each algorithm must finish before the next one starts.

With this kind of setup, processing an image will often be memory inefficient. The whole image is loaded from the slower parts of the memory subsystem to CPU registers, processed, then the output is written back to the memory subsystem. And this is repeated for each algorithm in the pipeline: load input, do some modifications, store output.

In this example, each algorithm is a *kernel*. And by kernel fusion, we mean that two algorithms are fused. An algorithm X generates a single pixel, and feeds it directly to the algorithm X + 1, then to the algorithm X + 2, etc. The benefit of such an approach is that all relevant data never leaves the CPU, which avoids unnecessary data movements.

However, there are two problems with this approach:

- Writing such a processing pipeline is not easy, and this task needs to be planned from day one. In fact, there is a special programming language,
*Halide*, that is designed for writing fast and portable image and tensor processing codes. - The types of algorithms that would benefit from this approach must be memory bound, i.e. light in computation. Algorithms that are computationally intensive would benefit little or not at all, because computational bottleneck will hide the memory latency.

If you happen to know more about this topic, please leave a comment or send me an e-mail, I would like to know more as well (and also keep this post updated).

*Like what you are reading? Follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available. Need help with software performance? Contact us!*

#### Loop Fusion inside a Loop Nest

Loop fusion inside a loop nest (also called *unroll-and-jam*) is a more advanced type of loop fusion, applicable to loop nests. The technique is simple to apply to many sorting algorithms, but not only them. Consider the selection sort algorithm. Here is the source code:

for (int i = 0; i < n; i++) { double min = a[i]; int index_min = i; for (int j = i + 1; j < n; j++) { if (a[j] < min) { min = a[j]; index_min = j; } } std::swap(a[i], a[index_min]); }

The algorithm is very simple: in the `i`

-th iteration, it scans the elements from `i`

to `n`

to find the smallest value, and puts the value in place `a[i]`

.

For loop fusion inside a loop nest, we need to unroll the outer loop two or more times to make the fusion potential explicit. Here is the selection sort algorithm, where the outer loop has been unrolled two times:

for (int i = 0; i < n; i+=2) { min = a[i]; index_min = i; for (int j = i + 1; j < n; j++) { if (a[j] < min) { min = a[j]; index_min = j; } } std::swap(a[i], a[index_min]); min = a[i + 1]; index_min = i + 1; for (int j = i + 2; j < n; j++) { if (a[j] < min) { min = a[j]; index_min = j; } } std::swap(a[i + 1], a[index_min]); }

The first and second inner loops are iterating over almost identical datasets. There are a few statements preventing a simple loop fusion, but they can be moved away. With some tricks, we fused together the two inner loops. Here is the result:

for (int i = 0; i < n; i+=2) { min_1 = a[i]; min_2 = a[i + 1]; index_min_1 = i; index_min_2 = i + 1; // min1 must be smaller than min2 // Swap two values if not true if (min2 < min1) { std::swap(min1, min2); std::swap(index_min_1, index_min_2); } // Look-up two smallest values in array. // The smaller is kept in min1, the larger // in min2. for (int j = i + 2; j < n; j++) { if (a[j] < min_2) { if (a[j] < min_1) { min_2 = min_1; index_min_2 = index_min_1; min_1 = a[j]; index_min_1 = j; } else { min_2 = a[j]; index_min_2 = j; } } } std::swap(a[i], a[index_min_1]); std::swap(a[i + 1], a[index_min_2]); }

The loop fusion in this case is not trivial, but it is possible. The inner loop is looking up the two smallest values in the array to put them into the beginning of the section currently being processed. The total number of memory accesses is about two times lower compared to the simple selection sort algorithm.

Note: This optimization closely resembles *outer loop vectorization*, where the outer loop is running several instances of the inner loop in parallel.

### Decreasing the Number of Data Passes by Doing More Work

As we have seen in previous examples, loop fusion allows the elimination of some memory accesses by fusing two neighboring loops running over the same data. But this is not the only way. Many ideas that decrease the number of data passes will result in fewer memory accesses and better performance.

Consider the simple selection sort algorithm from the previous section. The original algorithm was looking for the minimum in the remaining array. To decrease the number of total memory accesses, we could scan for both minimum and maximum. We would then put the minimum element at the beginning of the remaining array and the maximum element at its end. The algorithm looks like this:

for (int i = 0, j = n - 1; i < j; i++, j--) { double min = a[i]; int index_min = i; double max = a[j]; int index_max = j; for (int k = i; k < j; k++) { if (a[k] < min) { min = a[k]; index_min = k; } if (a[k] > max) { max = a[k]; index_max = k; } } std::swap(a[i], a[index_min]); if (a[index_min] != max) { std::swap(a[j], a[index_max]; } else { std::swap(a[j], a[index_min]); } }

This version has fewer iterations, and therefore fewer memory loads, but inside each iteration it does twice as much work. From the algorithmic point of view, it performs roughly the same number of operations as the first version. But, performance-wise it is more efficient. In the experiments section we will see how much efficient.

Another important algorithm with a similar reduction in memory accesses is a variant of Quicksort called Multi-Pivot Quicksort. Before explaining MPQ, let’s give a quick reminder about Quicksort. The Quicksort algorithm consists of two steps. The first step is array partitioning: picking a pivot and then partitioning the array into a left part that is smaller than the pivot and a right part that is larger. The second step is the recursive call: the partitioning is performed recursively on the left and right part of the input array, until the size of the array becomes 1.

With Multi-Pivot Quicksort, the partitioning step is performed by picking two pivots and partitioning the array into three parts. Then partitioning is recursively performed on the left, middle and right part of the array. If an array has N elements, with plain Quicksort we expect an average number of memory accesses for each element of the array to be O(log_{2} N). With Multi-Pivot Quicksort, the average number of memory accesses would be O(log_{3} N).

*Like what you are reading? Follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available. Need help with software performance? Contact us!*

## Experiments

All the experiments were executed on Intel Core i5-10210U CPU, 4 cores, 2 threads per core. Each core has 32kB of L1i cache, 32kB of L1d cache and 256kB of L2 cache. There is a total of 6MB of L3 cache shared by all the cores. The source code used for all experiments is available here.

### Loop Fusion

The first experiment is related to loop fusion. We measure the runtime of two separate loops and compare it with the runtime of the fused loop. The examples we use for testing are loops that calculate the min and max, first in two separate loops, then merged.

Here are the runtimes (five repetitions, average values):

Array Size | Original | Fused |
---|---|---|

32 kB | Runtime: 0.159 s Instr: 671 M CPI: 0.793 |
Runtime: 0.068 s Instr: 402 M CPI: 0.665 |

256 kB | Runtime: 0.136 s Instr: 671 M CPI: 0.800 |
Runtime: 0.068 s Instr: 402 M CPI: 0.667 |

2 MB | Runtime: 0.136 s Instr: 671 M CPI: 0.801 |
Runtime: 0.068 s Instr: 402 M CPI: 0.667 |

16 MB | Runtime: 0.171 s Instr: 671 CPI: 0.855 |
Runtime: 0.085 s Instr: 402 M CPI: 0.739 |

128 MB | Runtime: 0.175 s Instr: 671 M CPI: 0.855 |
Runtime: 0.086 s Instr: 402 M CPI: 0.742 |

The table shows, that, on average, the fused version is about two times faster than the original version. The fused version also executes less instruction and is more hardware efficient (Cycle per Instruction metric is better). Fewer instructions are executed because (1) there is only one loop instead of two, so this means fewer iterator increases, iterator comparisons, jumps and (2) removed one redundant load, since the piece of data is already in a register.

### Selection Sort

We are going to experiment with selection sort, as described in the section about decreasing the number of data passes. To measure the effect we will compare the version of the selection sort where we are finding minimum only vs the version where we are finding both minimum and maximum. The first version scans the remaining part of the array to find the minimum and put it at the beginning. The second version scans to find both the minimum and maximum, and places them at the beginning and at the end of the array respectively. We expect the second version to be faster because it needs to perform two times fewer scans.

Here are the numbers (five runs, average numbers):

Array Size | Only Min | Min and Max |
---|---|---|

8 kB 16384 repeats |
Runtime: 8.74 s Instr: 60.3 B CPI: 0.57 |
Runtime: 4.44 s Instr: 43.2 B CPI: 0.405 |

32 kB 1024 repeats |
Runtime: 8.72 s Instr: 60.1 B CPI: 0.573 |
Runtime: 4.36 s Instr: 43.0 B CPI: 0.401 |

128 kB 64 repeats |
Runtime: 8.69 s Instr: 60.1 B CPI: 0.572 |
Runtime: 4.37 s Instr: 43.0 B CPI: 0.402 |

512 kB 4 repeats |
Runtime: 8.69 s Instr: 60.1 B CPI: 0.572 |
Runtime: 4.39 s Instr: 42.9 B CPI: 0.405 |

The Min and Max version is both faster and more hardware efficient in all cases. It also executes fewer instructions, because both the inner and the outer loops are shorter, so they perform fewer memory accesses.

## Conclusion

Loop fusion is a simple and powerful technique to decrease the total number of memory accesses in the program. Although we described here the simplest version, loop fusion is possible even if datasets overlap partially.

In general, any idea that would result in a decrease of memory accesses has the potential to speed up your code. If you have any ideas that are not mentioned in this post, feel free to leave a comment so we can update this post.

In the next post we will talk about another way to decrease the total number of memory accesses. These memory accesses are “unwanted”, in the sense that the compiler has created them without your intention: memory accesses related to pointer aliasing and memory accesses related to register spilling. Until soon!

Need help with software performance? Contact us!