We continue our exploration with the memnode allocation benchmark introduced in the last lecture. File cs61-lectures/datarep6/mb-malloc.cc contains a version of the benchmark using the system new and delete operators.
unsigned long memnode_benchmark(unsigned noperations, unsigned step) {
    assert(step % 2 == 1); // `step` must be odd
    long counter = 0;
    // Allocate 4096 memnodes.
    const unsigned nnodes = 4096;
    memnode* m[nnodes];
    for (unsigned i = 0; i != nnodes; ++i) {
        m[i] = new memnode;
        m[i]->file = "datarep/mb-filename.cc";
        m[i]->line = counter;
        ++counter;
    }
    // Replace one memnode `noperations` times.
    for (unsigned i = 0; i != noperations; ++i) {
        unsigned pos = (i * step) % nnodes;
        delete m[pos];
        m[pos] = new memnode;
        m[pos]->file = "datarep/mb-filename.cc";
        m[pos]->line = counter;
        ++counter;
    }
    // Compute a statistic and free them.
    unsigned long result = 0;
    for (unsigned i = 0; i != nnodes; ++i) {
        result += m[i]->line;
        delete m[i];
    }
    return result;
}
In this function we allocate an array of 4096 pointers to memnodes, which occupies 2^3 × 2^12 = 2^15 bytes on the stack (each pointer is 2^3 = 8 bytes, and there are 2^12 = 4096 of them). We then allocate 4096 memnodes. Our memnode is defined like this:
struct memnode {
    std::string file;
    unsigned line;
};
Each memnode contains a std::string object and an unsigned integer. Each std::string object internally contains a pointer to a character array in the heap. Therefore, every time we create a new memnode, we need 2 allocations: one to allocate the memnode itself, and another one performed internally by the std::string object when we initialize/assign a string value to it.
Every time we deallocate a memnode by calling delete, we also destroy the std::string object, and the string object knows that it should deallocate the heap character array it internally maintains. So there are also 2 deallocations occurring each time we free a memnode.
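We can observe these paired allocations directly. The following is a minimal sketch, not part of the lecture code, that counts heap allocations by overriding the global operator new; the expected count of 2 assumes a std::string implementation whose small-string buffer is too short to hold our 22-character file name, so the assignment allocates on the heap.

#include <cstdio>
#include <cstdlib>
#include <new>
#include <string>

static unsigned long nallocs = 0;

void* operator new(std::size_t sz) {
    ++nallocs;                          // count every heap allocation
    if (void* p = std::malloc(sz)) {
        return p;
    }
    throw std::bad_alloc();
}

void operator delete(void* p) noexcept {
    std::free(p);
}

struct memnode {
    std::string file;
    unsigned line;
};

int main() {
    unsigned long before = nallocs;
    memnode* m = new memnode;               // allocation 1: the memnode itself
    m->file = "datarep/mb-filename.cc";     // allocation 2: the string's heap buffer
                                            //   (assuming the name exceeds the
                                            //   small-string buffer)
    std::printf("allocations: %lu\n", nallocs - before);   // expect 2
    delete m;
}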
We make the benchmark return a seemingly meaningless result to prevent an aggressive compiler from optimizing everything away. We also use this result to check that our subsequent optimizations to the allocator are correct: every version must produce the same result.
This version of the benchmark, using the system allocator, finishes in 0.335 seconds. Not bad at all.
Spoiler alert: We can do 15x better than this.
1st optimization: std::string
We only deal with one file name, namely "datarep/mb-filename.cc", which is constant throughout the program for all memnodes. It's also a string literal, which means it is a constant string with static lifetime. Why not simply use a const char* in place of the std::string and let the pointer point to the static constant string? This saves us the internal allocation/deallocation performed by std::string every time we initialize/delete a string.
The fix is easy: we simply change the memnode definition:
struct memnode {
    const char* file;
    unsigned line;
};
This version of the benchmark now finishes in 0.143 seconds, a 2x improvement over the original benchmark. This 2x improvement is consistent with the 2x reduction in the number of allocations/deallocations mentioned earlier.
You may ask why people still use std::string if it involves an additional allocation and is slower than const char*, as shown in this benchmark. std::string is much more flexible in that it also deals with data that doesn't have static lifetime, such as input from a user or data the program receives over the network. In short, when the program deals with strings that are not constant, heap data is likely to be very useful, and std::string provides facilities to conveniently handle on-heap data.
2nd optimization: the system allocator
We still use the system allocator to allocate/deallocate memnodes. The system allocator is a general-purpose allocator, which means it must handle allocation requests of all sizes. Such general-purpose designs usually come at a cost in performance. Since we are only allocating memnodes, which are fairly small objects (and all have the same size), we can build a special-purpose allocator just for them.
File cs61-lectures/datarep6/mb-arena-01.cc
contains a version of the
benchmark using an arena allocator.
unsigned long memnode_benchmark(unsigned noperations, unsigned step) {
    assert(step % 2 == 1); // `step` must be odd
    long counter = 0;
    memnode_arena arena;
    // Allocate 4096 memnodes.
    const unsigned nnodes = 4096;
    memnode* m[nnodes];
    for (unsigned i = 0; i != nnodes; ++i) {
        m[i] = arena.allocate();
        m[i]->file = "datarep/mb-filename.cc";
        m[i]->line = counter;
        ++counter;
    }
    // Replace one memnode `noperations` times.
    for (unsigned i = 0; i != noperations; ++i) {
        unsigned pos = (i * step) % nnodes;
        arena.deallocate(m[pos]);
        m[pos] = arena.allocate();
        m[pos]->file = "datarep/mb-filename.cc";
        m[pos]->line = counter;
        ++counter;
    }
    // Compute a statistic. (Note: the memnodes are never freed!)
    unsigned long result = 0;
    for (unsigned i = 0; i != nnodes; ++i) {
        result += m[i]->line;
    }
    return result;
}
Compared to the previous version of the benchmark, in this version, instead of calling new and delete, we use arena.allocate() and arena.deallocate() to allocate and free memnodes. Our arena object (of type memnode_arena) is our special-purpose allocator for memnodes.
This is how we implement the memnode_arena allocator:
struct memnode_arena {
    std::vector<memnode*> free_list;

    memnode* allocate() {
        memnode* n;
        if (free_list.empty()) {
            n = new memnode;
        } else {
            n = free_list.back();
            free_list.pop_back();
        }
        return n;
    }

    void deallocate(memnode* n) {
        free_list.push_back(n);
    }
};
This allocator maintains a free list (a C++ vector) of freed memnodes. allocate() simply pops a memnode off the free list if there is one, and deallocate() simply puts the memnode on the free list. This free list serves as a buffer between the system allocator and the benchmark function, so that the system allocator is invoked less frequently. In fact, in the benchmark, the system allocator is invoked only 4096 times, when the pointer array is first populated. That's a huge reduction, because none of the 10 million "recycle" operations in the middle involves the system allocator.
With this special-purpose allocator we can finish the benchmark in 0.057 seconds, another 2.5x improvement.
However, this allocator now leaks memory: it never actually calls delete! Let's fix this by letting it also keep track of all allocated memnodes. The modified definition of memnode_arena now looks like this:
struct memnode_arena {
    std::vector<memnode*> allocated;
    std::vector<memnode*> free_list;

    ~memnode_arena() {
        destroy_all();
    }

    void destroy_all() {
        for (auto a : allocated) {
            delete a;
        }
    }

    memnode* allocate() {
        memnode* n;
        if (free_list.empty()) {
            n = new memnode;
            allocated.push_back(n);
        } else {
            n = free_list.back();
            free_list.pop_back();
        }
        return n;
    }

    void deallocate(memnode* n) {
        free_list.push_back(n);
    }
};
With the updated allocator we simply need to invoke arena.destroy_all() at the end of the function to fix the memory leak. And we don't even need to invoke this method manually! We can use the C++ destructor for the memnode_arena struct, defined as ~memnode_arena() in the code above, which is automatically called when our arena object goes out of scope. We simply make the destructor invoke the destroy_all() method, and we are all set.
Fixing the leak doesn't appear to affect performance at all. This is because the overhead added by tracking the allocated list and calling delete only affects the initial allocation of the 4096 memnodes, plus the cleanup at the very end. These 8192 additional operations are a relatively small number compared to the 10 million recycle operations (less than 0.1% of them), so the added overhead is hardly noticeable.
Spoiler alert: We can improve this by another factor of 2.
3rd optimization: std::vector
In our special-purpose allocator memnode_arena, we maintain an allocated list and a free list, both using C++ std::vectors. std::vectors are dynamic arrays; like std::string they involve an additional level of indirection and store the actual array in the heap. We don't access the allocated list during the "recycling" part of the benchmark (which takes the bulk of the benchmark time, as we showed earlier), so the allocated list is probably not our bottleneck. We do, however, add and remove elements from the free list on each recycle operation, and the indirection introduced by the std::vector here may actually be our bottleneck. Let's find out.
Instead of using a std::vector, we could use a linked list of all free memnodes as the actual free list. We will need some extra metadata in the memnode to store the pointers for this linked list. However, unlike in the debugging allocator pset, in a free list we don't need to store this metadata in addition to the actual memnode data: the memnode is free, and not in use, so we can reuse its memory, using a union:
union freeable_memnode {
    memnode n;
    freeable_memnode* next_free;
};
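Because the union's members overlap in memory, the free-list pointer costs no extra space. A quick sanity check of our own (not part of the lecture code):

static_assert(sizeof(freeable_memnode) == sizeof(memnode),
              "next_free reuses the freed memnode's own storage");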
We then maintain the free list like this:
struct memnode_arena {
    std::vector<freeable_memnode*> allocated_groups;
    freeable_memnode* free_list;
    ...

    memnode* allocate() {
        if (!free_list) {
            refresh_free_list();
        }
        freeable_memnode* fn = free_list;
        free_list = fn->next_free;
        return &fn->n;
    }

    void deallocate(memnode* n) {
        freeable_memnode* fn = (freeable_memnode*) n;
        fn->next_free = free_list;
        free_list = fn;
    }

    ...
};
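The elided parts include the code that replenishes the free list when it runs empty. One possible sketch of refresh_free_list() and the destructor follows; this is our own guess at the elided code, and the group size of 4096 and the assumption that free_list starts out as nullptr are ours, not necessarily the lecture's. The idea is one system allocation per whole group of memnodes, with each node then threaded onto the free list:

static constexpr unsigned groupsize = 4096;   // assumed group size

void refresh_free_list() {
    // One system allocation grabs a whole group of nodes...
    freeable_memnode* group = new freeable_memnode[groupsize];
    allocated_groups.push_back(group);
    // ...then every node in the group is linked onto the free list.
    for (unsigned i = 0; i != groupsize; ++i) {
        group[i].next_free = free_list;
        free_list = &group[i];
    }
}

~memnode_arena() {
    // Free whole groups at once; individual memnodes are never deleted.
    for (auto g : allocated_groups) {
        delete[] g;
    }
}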
Compared to the std::vector free list, this free list always points directly to an available memnode when it is not empty (free_list != nullptr), without going through any indirection. With the std::vector free list, one would first have to go into the heap to access the actual array containing pointers to free memnodes, and only then access the memnode itself.
With this change we can now finish the benchmark in under 0.03 seconds! Another 2x improvement over the previous one!
Compared to the benchmark with the system allocator (which finished in 0.335 seconds), we managed to achieve a speedup of nearly 15x with arena allocation.