Section 5: Processor performance optimization
Goals
- Processor performance
- Loop unrolling
- Vectorization
- Caching
Introduction
This unit of the course, Storage, has performance optimization as a major concern. In this section, we'll investigate several performance optimization strategies that concern processor storage, including loop unrolling, vectorization, and cache management.
The memset function is extraordinarily simple, but it highlights two common sources of overhead in code: memory accesses and branch prediction. For memory accesses, we want to optimize for more cache accesses (which are faster) and fewer direct RAM accesses (which are slower). For branch prediction, we want to minimize the number of conditional branches, which can be relatively expensive.
The memsets.c file contains four valid memset implementations; which do you think will be the fastest, and under what conditions? (Note that some of them have stronger preconditions than the library memset—for example, memset_unrolled requires that n, the number of bytes, is a multiple of 8.)
Take a look at memset_basic and make sure you understand its implementation. We will take a look at the others in the following topics.
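In case you don't have the file open, here is a minimal sketch of what a byte-by-byte memset_basic might look like. (This is our reconstruction using the library memset's signature; the actual code lives in memsets.c.)

#include <stddef.h>

// Sketch of a byte-by-byte memset: write the byte c into each of the n bytes
// starting at s, one byte per loop iteration.
void* memset_basic(void* s, int c, size_t n) {
    unsigned char* p = (unsigned char*) s;
    unsigned char* endp = p + n;
    for (; p < endp; ++p) {
        *p = c;    // one 1-byte store (and one conditional branch) per byte
    }
    return s;
}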
Loop unrolling
Recall that a loop is implemented by evaluating a conditional on each iteration, so every iteration incurs conditional branching overhead; as we asserted above, conditional branches are relatively expensive.
The memset_unrolled variant uses an optimization technique called loop unrolling to address this overhead by increasing the workload per iteration, which reduces the total number of iterations (and thus the number of conditionals evaluated).
Consider the for-loops in memset_basic and memset_unrolled:
// memset_basic:
for (; p < endp; ++p) {
    *p = c;
}

// memset_unrolled:
for (; p < endp; p += 8) {
    p[0] = c;
    p[1] = c;
    p[2] = c;
    p[3] = c;
    p[4] = c;
    p[5] = c;
    p[6] = c;
    p[7] = c;
}
Can you spot the difference? In memset_basic, we loop once per element, while in memset_unrolled, we loop once every 8 elements. This means that memset_unrolled executes roughly 1/8 the number of conditional branches that memset_basic does!
To formalize a bit more: loop unrolling refers to this method of reducing conditional branches in a loop by manually repeating an operation in the loop body. While loop unrolling alone will often yield some performance benefit, it is much more powerful when combined with vectorization.
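Loop unrolling isn't specific to memset. As a generic, hypothetical illustration, here is an array-summing loop unrolled by a factor of 4; like memset_unrolled, the sketch assumes the length is a multiple of the unroll factor.

#include <stddef.h>

// Hypothetical example: sum an array with the loop unrolled by 4.
// Assumes n is a multiple of 4; otherwise a cleanup loop would be needed.
long sum_unrolled(const long* a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i += 4) {   // one conditional branch per 4 elements
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }
    return total;
}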
Vectorization
The memset_wordwise
function uses an optimization called
vectorization. This can be thought of as a form of loop unrolling that
uses a single register to do multiple loops’ worth of work in one
instruction. Doing more work per instruction lets us reduce the number
of memory and/or computation instructions executed, while loop
unrolling reduces just the number of conditional branch instructions.
Specifically, the memset_basic and memset_unrolled functions execute 1 memory write instruction per byte. (Use objdump to confirm this.) But we don't just have movb (to move a byte from register to memory); we also have movq (to move 8 bytes from register to memory). What if, instead of executing 8 movbs, we execute 1 movq?
Consider the for-loops in the memset_basic and memset_wordwise functions:
// memset_basic:
for (; p < endp; ++p) {
    *p = c;
}

// memset_wordwise:
unsigned long c8 = 0x0101010101010101UL * ((unsigned char) c);
for (; p < endp; p += 8) {
    *((unsigned long*) p) = c8;
}
In both cases, each iteration of the loop does only one assignment. The difference lies in how many bytes are written in each case. In memset_basic, each assignment affects only 1 byte, while in memset_wordwise, each assignment affects 8 bytes. This means that memset_wordwise makes 1/8 the number of assignments that memset_basic does!
This vectorization is only possible because of this line:
unsigned long c8 = 0x0101010101010101UL * (unsigned char) c;
This line loads into c8 a value consisting of c repeated 8 times. For example, if c = 0x61, then c8 = 0x6161616161616161. Storing c8 to memory has the same effect as assigning 0x61 to 8 bytes at once!
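The fourth variant, memset_wordwise_unrolled, combines the two techniques. A rough sketch of the idea might look like the following (our guess, not necessarily the exact code in memsets.c; here we unroll by 4 words, so this sketch assumes n is a multiple of 32 and that p is suitably aligned):

unsigned long c8 = 0x0101010101010101UL * ((unsigned char) c);
for (; p < endp; p += 32) {
    ((unsigned long*) p)[0] = c8;   // bytes 0-7
    ((unsigned long*) p)[1] = c8;   // bytes 8-15
    ((unsigned long*) p)[2] = c8;   // bytes 16-23
    ((unsigned long*) p)[3] = c8;   // bytes 24-31
}

Each iteration now costs one conditional branch but writes 32 bytes.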
Caching
As we have alluded to, loop unrolling and vectorization are really only effective for sequential accesses of contiguous memory, such as arrays. What happens if our memory accesses aren't sequential, e.g., in a dynamically allocated linked list?
Recall that memory is loaded in groups of contiguous bytes. This means that frequently accessing nonsequential memory may result in more of these groups being loaded. More groups loaded means more memory accesses, and as we asserted earlier, more memory accesses mean more time.
traversebench.c contains two implementations of linked lists to demonstrate our problem. make_array creates a linked list with nodes in contiguous blocks of memory (similar to Java's ArrayList class). make_list creates a linked list through the more "traditional" method, with nodes being allocated by malloc (potentially in discontiguous blocks of memory).
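To make the difference concrete, here is a rough sketch of the two construction strategies. The struct layout and function bodies below are illustrative only; the real definitions are in traversebench.c.

#include <stdlib.h>

typedef struct node {
    struct node* next;
    long value;
} node;

// Array-style list (compare make_array): every node lives in one contiguous
// allocation, and each link simply points to the next array slot.
// Assumes n >= 1 and that malloc succeeds.
node* make_array_sketch(size_t n) {
    node* nodes = malloc(n * sizeof(node));
    for (size_t i = 0; i + 1 < n; ++i)
        nodes[i].next = &nodes[i + 1];
    nodes[n - 1].next = NULL;
    return nodes;
}

// "Traditional" list (compare make_list): each node gets its own malloc call,
// so consecutive nodes may land far apart in memory.
node* make_list_sketch(size_t n) {
    node* head = NULL;
    for (size_t i = 0; i < n; ++i) {
        node* nn = malloc(sizeof(node));
        nn->next = head;
        head = nn;
    }
    return head;
}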
This problem is related to the concept of caching. Caching is the technique of moving some memory closer to the processor so that accessing that memory is faster. The diagram below shows the relative capacities, speeds, and costs of different storage units.
Due to caching, sequential memory accesses make better use of caches than nonsequential memory accesses do.
Assignment
As you progress through the assignment, please fill out the form for this section at https://goo.gl/BLJyu5.
Case study: memset
The memsetbench program benchmarks the memset implementations found in memsets.c. Passing in the -t0 option will run the built-in memset function; -t1 calls memset_basic, -t2 calls memset_unrolled, -t3 calls memset_wordwise, and -t4 calls memset_wordwise_unrolled. The number of bytes to memset can be specified with the -n option.
Now look at memsetbench.c and understand it. Once you're comfortable with how the benchmark works and what it does, answer the following questions:
1. Find a -n parameter for which -t2 reliably outperforms -t1. What is the parameter? What's the performance difference (as a ratio)?
2. Find a parameter for which -t2 reliably performs similarly to -t1. What is the parameter? What's different between this parameter and the one you found in #1?
3. Develop a hypothesis to explain this difference.
4. Find a parameter for which -t3 reliably outperforms -t2. What is the parameter? What's the performance difference?
5. -t4 and -t2 are the unrolled variants of -t3 and -t1, respectively. Using your hypothesis from #3 and your observations in #4, can you find a parameter where -t4 outperforms -t3, but -t2 does not outperform -t1?
6. Loop unrolling and vectorized copying are good optimizations for memset: they improve the performance of memset across a wide range of parameters. Unfortunately, our implementations came with additional preconditions (see memsets.c) compared to the original memset (see man memset in your terminal). For instance, the unrolled version requires that n % 8 == 0; the wordwise version requires that n % 8 == 0 && ((uintptr_t) s % 8) == 0. Fortunately, it is possible to remove these preconditions. Try modifying one of the optimized memset implementations so that it has the same preconditions as the built-in memset. If you don't write code, at least develop a plan (one possible approach is sketched just after this list). (NB: We think this exercise will help prepare you for some aspects of the stdio problem set!)
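For question 6, here is one possible plan, sketched under our own assumptions rather than as the intended solution: handle the unaligned head and any leftover tail byte-by-byte, and use word-sized stores only on the aligned middle.

#include <stddef.h>
#include <stdint.h>

// Sketch: a wordwise memset with the same preconditions as the library memset.
// Byte stores handle the unaligned prefix and the n % 8 leftover; word stores
// handle the aligned middle.
void* memset_wordwise_general(void* s, int c, size_t n) {
    unsigned char* p = (unsigned char*) s;
    unsigned char* endp = p + n;

    // 1. Byte stores until p is 8-byte aligned (or we run out of bytes).
    while (p < endp && ((uintptr_t) p % 8) != 0)
        *p++ = (unsigned char) c;

    // 2. Word stores over the aligned middle.
    unsigned long c8 = 0x0101010101010101UL * (unsigned char) c;
    while (endp - p >= 8) {
        *((unsigned long*) p) = c8;
        p += 8;
    }

    // 3. Byte stores for whatever is left.
    while (p < endp)
        *p++ = (unsigned char) c;

    return s;
}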
Remember to answer #6 in the section form!
The memset implementations all operate on nicely contiguous memory in a nicely sequential access pattern. We will now shift our focus to nonsequential access patterns.
Case study: traverse
The traversebench program benchmarks two linked list implementations, as discussed in the caching section. Passing in the -t0 option will run the array implementation of the linked list, while -t1 will run the dynamically-allocated-per-node implementation. Test parameters can be specified with the -n and -p options.
Now look at traversebench.c and understand it. Once you're comfortable with how the benchmark works and what it does, answer the following questions:
7. Find parameters for which -t0 reliably outperforms -t1. What are these parameters? What's the performance difference?
8. Find parameters for which -t0 reliably performs similarly to -t1. What are these parameters? What's the difference between these parameters and the ones you found in #7?
9. Based on your observations in #7 and #8, what might be the reasons for using the array implementation? What might be the reasons for using the list implementation?
Remember to answer #9 in the section form!
That's it! Please remember to submit the form. If you're interested in a little more investigation into prefetching, feel free to look at the section below.
Advanced: Prefetching
The rationale for why nonsequential memory is "cache-inefficient" might be intuitive: nonsequential accesses mean that more memory in a contiguous cache line is skipped, so the probability of jumping from one cache line to another is higher. But what is a cache line?
A cache line is the contiguous block of memory that is moved from a slower storage unit (usually also bigger and less costly) to a faster storage unit (usually also smaller and more costly). When you try to access memory, your computer will try to load the appropriate cache line(s) from one of the faster storage units. If the data isn't there, your machine will simultaneously retrieve the data from a slower storage unit to give to the processor and place that data into a faster storage unit. The details of how this happens are beyond the scope of this week, but as you might guess, loading a cache line for the first time is very expensive.
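To make "cache line" concrete: on many x86-64 machines a cache line is 64 bytes, so two addresses share a line exactly when they fall into the same 64-byte-aligned block. The helper below is a hypothetical illustration (64 is an assumption; on Linux with glibc you can query the real size with sysconf(_SC_LEVEL1_DCACHE_LINESIZE)).

#include <stdint.h>

#define CACHE_LINE_SIZE 64   // assumption; varies by machine

// Which cache line (numbering lines from address 0) does an address fall in?
uintptr_t cache_line_of(const void* p) {
    return (uintptr_t) p / CACHE_LINE_SIZE;
}

// For a char array a, cache_line_of(&a[0]) == cache_line_of(&a[1]) is usually
// true, while cache_line_of(&a[0]) == cache_line_of(&a[64]) is never true.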
Enter prefetching. Prefetching is the technique of loading something into a faster storage unit in anticipation of it being used. Since loading bits occurs in parallel with the processor's computing pipeline, if we can pre-emptively fetch the cache lines we need, then the processor will have that information more readily on hand, meaning that it can chug through the computations faster, without having to wait for data to load.
There are actually two kinds of prefetching in modern architectures:
- Data prefetching is mostly what we have been discussing so far. This is when data from larger memory stores are moved into smaller ones in anticipation of the processor using that data.
- Instruction prefetching is another kind of prefetching. This allows the processor to "look ahead" a little and execute instructions before the program counter (instruction pointer) actually gets there. This is the reason why conditional branches can be so expensive!
We'll mostly worry about data prefetching for the rest of this discussion, but remember that instruction prefetching exists as well.
Recall the original problem we were trying to solve: being able to access nonsequential memory more quickly. How can prefetching help us here? As it turns out, normally the compiler tries to guess at what you might want to prefetch. Sometimes, though, as might be the case with linked lists, it needs more of a hint from you, the programmer.
You'll note that we have actually already done this for you! In traversebench.c, look for the line that looks like this:
__asm__ volatile ("prefetch %0" : : "m" (*ns->next));
This strongly suggests (in a Bostonian way) that the compiler emit the assembly instruction for prefetching at that point in the code. In our case, you can activate this by calling ./traversebench with the -p1 option.
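If you want to experiment in your own code, GCC and Clang expose the same hint as the __builtin_prefetch intrinsic. The loop below is a hypothetical traversal (it reuses the illustrative node struct from the earlier sketch, not anything from traversebench.c) that starts fetching the next node while working on the current one:

// Hypothetical example: prefetch the next node while processing the current one.
long traverse_sum(const node* cur) {
    long total = 0;
    while (cur != NULL) {
        __builtin_prefetch(cur->next);   // hint: we will want this node soon
        total += cur->value;             // ...meanwhile, work on the current node
        cur = cur->next;
    }
    return total;
}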
Prefetching assignment
Run traversebench with the -p1 option and consider these questions:
- How does prefetching affect the list implementation?
- How does prefetching affect the array implementation?
- Can you think of a case when prefetching might not work well, or even be more expensive than not prefetching? Hint: think of other bottlenecks that we discussed earlier.