Section 5: Processor performance optimization
Goals
- Processor performance
- Loop unrolling
- Vectorization
- Caching
Introduction
This unit of the course, Storage, has performance optimization as a major concern. In this section, we'll investigate several performance optimization strategies that concern processor storage, including loop unrolling, vectorization, and cache management.
The memset function is extraordinarily simple, but it highlights two common sources of overhead in code: memory accesses and branch prediction. For memory accesses, we want to optimize for more cache accesses (which are faster) and fewer direct RAM accesses (which are slower). For branch prediction, we want to minimize the number of conditional branches, which can be relatively expensive.
The memsets.c file contains four valid memset implementations; which do you think will be the fastest, and under what conditions? (Note that some of them have stronger preconditions than the library memset—for example, memset_unrolled requires that n, the number of bytes, is a multiple of 8.)
Take a look at memset_basic and make sure you understand its implementation. We will take a look at the others in the following topics.
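In case you don't have the file open, here is a minimal sketch of what a byte-by-byte memset_basic might look like. (This is our reconstruction using the library memset's signature; the actual code lives in memsets.c.)

#include <stddef.h>

// Sketch of a byte-by-byte memset: write the byte c into each of the n bytes
// starting at s, one byte per loop iteration.
void* memset_basic(void* s, int c, size_t n) {
    unsigned char* p = (unsigned char*) s;
    unsigned char* endp = p + n;
    for (; p < endp; ++p) {
        *p = c;    // one 1-byte store (and one conditional branch) per byte
    }
    return s;
}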
Loop unrolling
Recall that a loop is implemented by evaluating a conditional on each iteration, so every iteration incurs conditional branching overhead; as we asserted above, conditional branches are relatively expensive.
The memset_unrolled variant uses an optimization technique called loop unrolling to address this overhead by increasing the workload per iteration, which reduces the total number of iterations (and thus the number of conditionals evaluated).
Consider the for-loops in memset_basic and memset_unrolled:
// memset_basic:
for (; p < endp; ++p) {
    *p = c;
}

// memset_unrolled:
for (; p < endp; p += 8) {
    p[0] = c;
    p[1] = c;
    p[2] = c;
    p[3] = c;
    p[4] = c;
    p[5] = c;
    p[6] = c;
    p[7] = c;
}
Can you spot the difference? In memset_basic, we loop once per element, while in memset_unrolled, we loop once every 8 elements. This means that memset_unrolled executes roughly 1/8 the number of conditional branches that memset_basic does!
To formalize a bit more: loop unrolling refers to this method of reducing conditional branches in a loop by manually repeating an operation in the loop body. While loop unrolling alone will often yield some performance benefit, it is much more powerful when combined with vectorization.
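Loop unrolling isn't specific to memset. As a generic, hypothetical illustration, here is an array-summing loop unrolled by a factor of 4; like memset_unrolled, the sketch assumes the length is a multiple of the unroll factor.

#include <stddef.h>

// Hypothetical example: sum an array with the loop unrolled by 4.
// Assumes n is a multiple of 4; otherwise a cleanup loop would be needed.
long sum_unrolled(const long* a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i += 4) {   // one conditional branch per 4 elements
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }
    return total;
}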
Vectorization
The memset_wordwise
function uses an optimization called
vectorization. This can be thought of as a form of loop unrolling that
uses a single register to do multiple loops’ worth of work in one
instruction. Doing more work per instruction lets us reduce the number
of memory and/or computation instructions executed, while loop
unrolling reduces just the number of conditional branch instructions.
Specifically, the memset_basic and memset_unrolled functions execute 1 memory write instruction per byte. (Use objdump to confirm this.) But we don't just have movb (to move a byte from register to memory); we also have movq (to move 8 bytes from register to memory). What if, instead of executing 8 movbs, we execute 1 movq?
Consider the for-loops in the memset_basic and memset_wordwise functions:
// memset_basic:
for (; p < endp; ++p) {
    *p = c;
}

// memset_wordwise:
unsigned long c8 = 0x0101010101010101UL * ((unsigned char) c);
for (; p < endp; p += 8) {
    *((unsigned long*) p) = c8;
}
In both cases, each iteration of the loop does only one assignment. The difference lies in how many bytes are written in each case. In memset_basic, each assignment affects only 1 byte, while in memset_wordwise, each assignment affects 8 bytes. This means that memset_wordwise makes 1/8 the number of assignments that memset_basic does!
This vectorization is only possible because of this line:
unsigned long c8 = 0x0101010101010101UL * (unsigned char) c;
This line loads into c8 a value consisting of c repeated 8 times. For example, if c = 0x61, then c8 = 0x6161616161616161. Storing c8 to memory has the same effect as assigning 0x61 to 8 bytes at once!
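The fourth variant, memset_wordwise_unrolled, combines the two techniques. A rough sketch of the idea might look like the following (our guess, not necessarily the exact code in memsets.c; here we unroll by 4 words, so this sketch assumes n is a multiple of 32 and that p is suitably aligned):

unsigned long c8 = 0x0101010101010101UL * ((unsigned char) c);
for (; p < endp; p += 32) {
    ((unsigned long*) p)[0] = c8;   // bytes 0-7
    ((unsigned long*) p)[1] = c8;   // bytes 8-15
    ((unsigned long*) p)[2] = c8;   // bytes 16-23
    ((unsigned long*) p)[3] = c8;   // bytes 24-31
}

Each iteration now costs one conditional branch but writes 32 bytes.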
Caching
As we have alluded to, loop unrolling and vectorization are really only effective for sequential accesses of contiguous memory, such as arrays. What happens if our memory accesses aren't sequential, e.g., in a dynamically allocated linked list?
Recall that memory is loaded in groups of contiguous bytes. This means that frequently accessing nonsequential memory may result in more of these groups being loaded. More groups loaded means more memory accesses, and as we asserted earlier, more memory accesses mean more time.
traversebench.c contains two implementations of linked lists to demonstrate our problem. make_array creates a linked list with nodes in contiguous blocks of memory (similar to Java's ArrayList class). make_list creates a linked list through the more "traditional" method, with nodes being allocated by malloc (potentially in discontiguous blocks of memory).
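To make the difference concrete, here is a rough sketch of the two construction strategies. The struct layout and function bodies below are illustrative only; the real definitions are in traversebench.c.

#include <stdlib.h>

typedef struct node {
    struct node* next;
    long value;
} node;

// Array-style list (compare make_array): every node lives in one contiguous
// allocation, and each link simply points to the next array slot.
// Assumes n >= 1 and that malloc succeeds.
node* make_array_sketch(size_t n) {
    node* nodes = malloc(n * sizeof(node));
    for (size_t i = 0; i + 1 < n; ++i)
        nodes[i].next = &nodes[i + 1];
    nodes[n - 1].next = NULL;
    return nodes;
}

// "Traditional" list (compare make_list): each node gets its own malloc call,
// so consecutive nodes may land far apart in memory.
node* make_list_sketch(size_t n) {
    node* head = NULL;
    for (size_t i = 0; i < n; ++i) {
        node* nn = malloc(sizeof(node));
        nn->next = head;
        head = nn;
    }
    return head;
}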
This problem is related to the concept of caching. Caching is the technique of moving some memory closer to the processor so that accessing that memory is faster. The diagram below shows the relative capacities, speeds, and costs of different storage units.
Due to caching, sequential memory accesses make better use of caches than nonsequential memory accesses do.
Assignment
As you progress through the assignment, please fill out the form for this section at https://goo.gl/BLJyu5.
Case study: memset
The memsetbench program benchmarks the memset implementations found in memsets.c. Passing in the -t0 option will run the built-in memset function; -t1 calls memset_basic, -t2 calls memset_unrolled, -t3 calls memset_wordwise, and -t4 calls memset_wordwise_unrolled. The number of bytes to memset can be specified with the -n option.
Now look at memsetbench.c and understand it. Once you're comfortable with how the benchmark works and what it does, answer the following questions:
1. Find a -n parameter for which -t2 reliably outperforms -t1. What is the parameter? What's the performance difference (as a ratio)?
2. Find a parameter for which -t2 reliably performs similarly to -t1. What is the parameter? What's different between this parameter and the one you found in #1?
3. Develop a hypothesis to explain this difference.
4. Find a parameter for which -t3 reliably outperforms -t2. What is the parameter? What's the performance difference?
5. -t4 and -t2 are the unrolled variants of -t3 and -t1, respectively. Using your hypothesis from #3 and your observations in #4, can you find a parameter where -t4 outperforms -t3, but -t2 does not outperform -t1?
6. Loop unrolling and vectorized copying are good optimizations for memset: they improve the performance of memset across a wide range of parameters. Unfortunately, our implementations came with additional preconditions (see memsets.c) compared to the original memset (see man memset in your terminal). For instance, the unrolled version requires that n % 8 == 0; the wordwise version requires that n % 8 == 0 && ((uintptr_t) s % 8) == 0. Fortunately, it is possible to remove these preconditions. Try modifying one of the optimized memset implementations so that it has the same preconditions as the built-in memset. If you don't write code, at least develop a plan (one possible approach is sketched just after this list). (NB: We think this exercise will help prepare you for some aspects of the stdio problem set!)
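For question 6, here is one possible plan, sketched under our own assumptions rather than as the intended solution: handle the unaligned head and any leftover tail byte-by-byte, and use word-sized stores only on the aligned middle.

#include <stddef.h>
#include <stdint.h>

// Sketch: a wordwise memset with the same preconditions as the library memset.
// Byte stores handle the unaligned prefix and the n % 8 leftover; word stores
// handle the aligned middle.
void* memset_wordwise_general(void* s, int c, size_t n) {
    unsigned char* p = (unsigned char*) s;
    unsigned char* endp = p + n;

    // 1. Byte stores until p is 8-byte aligned (or we run out of bytes).
    while (p < endp && ((uintptr_t) p % 8) != 0)
        *p++ = (unsigned char) c;

    // 2. Word stores over the aligned middle.
    unsigned long c8 = 0x0101010101010101UL * (unsigned char) c;
    while (endp - p >= 8) {
        *((unsigned long*) p) = c8;
        p += 8;
    }

    // 3. Byte stores for whatever is left.
    while (p < endp)
        *p++ = (unsigned char) c;

    return s;
}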
Remember to answer #6 in the section form!
The memset implementations all operate on nicely contiguous memory in a nicely sequential access pattern. We will now shift our focus to nonsequential access patterns.
Case study: traverse
The traversebench program benchmarks two linked list implementations, as discussed in the caching section. Passing in the -t0 option will run the array implementation of the linked list, while -t1 will run the dynamically-allocated-per-node implementation. Test parameters can be specified with the -n and -p options.
Now look at traversebench.c and understand it. Once you're comfortable with how the benchmark works and what it does, answer the following questions:
7. Find parameters for which -t0 reliably outperforms -t1. What are these parameters? What's the performance difference?
8. Find parameters for which -t0 reliably performs similarly to -t1. What are these parameters? What's the difference between these parameters and the ones you found in #7?
9. Based on your observations in #7 and #8, what might be the reasons for using the array implementation? What might be the reasons for using the list implementation?
Remember to answer #9 in the section form!
That's it! Please remember to submit the form. If you're interested in a little more investigation into prefetching, feel free to look at the section below.
Advanced: Prefetching
The rationale for why nonsequential memory is "cache-inefficient" might be intuitive: nonsequential accesses mean that more memory in a contiguous cache line is skipped, so the probability of jumping from one cache line to another is higher. But what is a cache line?
A cache line is the contiguous block of memory that is moved from a slower storage unit (usually also bigger and less costly) to a faster storage unit (usually also smaller and more costly). When you try to access memory, your computer will try to load the appropriate cache line(s) from one of the faster storage units. If the data isn't there, your machine will simultaneously retrieve the data from a slower storage unit to give to the processor and place that data into a faster storage unit. The details of how this happens are beyond the scope of this week, but as you might guess, loading a cache line for the first time is very expensive.
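To make "cache line" concrete: on many x86-64 machines a cache line is 64 bytes, so two addresses share a line exactly when they fall into the same 64-byte-aligned block. The helper below is a hypothetical illustration (64 is an assumption; on Linux with glibc you can query the real size with sysconf(_SC_LEVEL1_DCACHE_LINESIZE)).

#include <stdint.h>

#define CACHE_LINE_SIZE 64   // assumption; varies by machine

// Which cache line (numbering lines from address 0) does an address fall in?
uintptr_t cache_line_of(const void* p) {
    return (uintptr_t) p / CACHE_LINE_SIZE;
}

// For a char array a, cache_line_of(&a[0]) == cache_line_of(&a[1]) is usually
// true, while cache_line_of(&a[0]) == cache_line_of(&a[64]) is never true.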
Enter prefetching. Prefetching is the technique of loading something into a faster storage unit in anticipation of it being used. Since loading bits occurs in parallel with the processor's computing pipeline, if we can pre-emptively fetch the cache lines we need, then the processor will have that information more readily on hand, meaning that it can chug through the computations faster, without having to wait for data to load.
There are actually two kinds of prefetching in modern architectures:
- Data prefetching is mostly what we have been discussing so far. This is when data from larger memory stores are moved into smaller ones in anticipation of the processor using that data.
- Instruction prefetching is another kind of prefetching. This allows the processor to "look ahead" a little and execute instructions before the program counter (instruction pointer) actually gets there. This is the reason why conditional branches can be so expensive!
We'll mostly worry about data prefetching for the rest of this discussion, but remember that instruction prefetching exists as well.
Recall the original problem we were trying to solve: being able to access nonsequential memory more quickly. How can prefetching help us here? As it turns out, normally the compiler tries to guess at what you might want to prefetch. Sometimes, though, as might be the case with linked lists, it needs more of a hint from you, the programmer.
You'll note that we have actually already done this for you! In traversebench.c, look for the line that looks like this:
__asm__ volatile ("prefetch %0" : : "m" (*ns->next));
This strongly suggests (in a Bostonian way) that the compiler emit the assembly instruction for prefetching at that point in the code. In our case, you can activate this by calling ./traversebench with the -p1 option.
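If you want to experiment in your own code, GCC and Clang expose the same hint as the __builtin_prefetch intrinsic. The loop below is a hypothetical traversal (it reuses the illustrative node struct from the earlier sketch, not anything from traversebench.c) that starts fetching the next node while working on the current one:

// Hypothetical example: prefetch the next node while processing the current one.
long traverse_sum(const node* cur) {
    long total = 0;
    while (cur != NULL) {
        __builtin_prefetch(cur->next);   // hint: we will want this node soon
        total += cur->value;             // ...meanwhile, work on the current node
        cur = cur->next;
    }
    return total;
}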
Prefetching assignment
Run traversebench with the -p1 option and consider these questions:
- How does prefetching affect the list implementation?
- How does prefetching affect the array implementation?
- Can you think of a case when prefetching might not work well, or even be more expensive than not prefetching? Hint: think of other bottlenecks that we discussed earlier.