Review Questions:
- DATAREP-5. Hello program
- DATAREP-20. Bit manipulation
- DATAREP-21. Computer arithmetic
- DATAREP-26. Sizes and alignments
- ASM-5. Calling conventions: 6186
- ASM-6. Data structure assembly
- IO-2. Caches and reference strings
This is a bank of questions from prior midterms. The real midterm will not have this many questions.
Some of these questions might not be appropriate this year. Most topics and questions that seem less appropriate this year, or which cover topics that we haven’t yet covered in class, are marked with ⚠️.
Previous exams used the following rules; this year’s rules will be similar.
The exam is open book, open note, open computer. You may access the book, and your own notes in paper form. You may also use a computer or equivalent to access your own class materials and public class materials. However, you may not access other materials except as explicitly allowed below. Specifically:
- You may access a browser and a PDF reader.
- You may access your own notes and problem set code electronically.
- You may access an Internet site on which your own notes and problem set code are stored.
- You may access the course site.
- You may access pages directly linked from the course site, including our lectures, exercises, and section notes, and our preparation materials for the midterm (including solutions).
- You may run a C compiler, including an assembler and linker, or a calculator.
- You may use a Python interpreter.
- You may access manual pages.
But:
- You may not access Google or Wikipedia or anything else except as directly linked from the course site.
- You may not access Piazza.
- You may not access course videos.
- You may not access an on-line disassembler, compiler explorer, or similar applications.
- You absolutely may not contact other humans via IM or anything like it.
- You may not access solutions from any previous exam, by paper or computer, except for those on the course site.
Any violations of this policy, or the spirit of this policy, are breaches of academic honesty and will be treated accordingly. Please appreciate our flexibility and behave honestly and honorably.
DATAREP-1. Sizes and alignments
QUESTION DATAREP-1A. True or false: For any non-array type X, the size of X (sizeof(X)) is greater than or equal to the alignment of type X (alignof(X)).

True.

This is also mostly true for arrays. The exception is zero-length arrays: sizeof(X[0]) == 0, but alignof(X[0]) == alignof(X).
QUESTION DATAREP-1B. True or false: For any type X, the size of struct Y { X a; char newc; } is greater than the size of X.

True.
QUESTION DATAREP-1C. True or false: For any types A1…An (with n ≥ 1), the size of struct Y is greater than the size of struct X, given:

struct X { A1 a1; ... An an; };
struct Y { A1 a1; ... An an; char newc; };

False (example: A1 = int, A2 = char)
QUESTION DATAREP-1D. True or false: For any types A1…An (with n ≥ 1), the size of struct Y is greater than the size of union X, given:

union X { A1 a1; ... An an; };
struct Y { A1 a1; ... An an; };

False (if n = 1)
QUESTION DATAREP-1E. Assume that structure struct Y { ... } contains K char members and M int members, with K≤M, and nothing else. Write an expression defining the maximum sizeof(struct Y).

4M + 4K
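To see where the 4M + 4K worst case comes from, here is a minimal compile-time sketch (my own example, assuming ordinary x86-64 sizes and alignments): each char that is immediately followed by an int drags along 3 bytes of padding, so the worst layout alternates the members.

struct y1 { char c; int i; };                      // 1 + 3 (padding) + 4 == 8 == 4*1 + 4*1
static_assert(sizeof(y1) == 8, "K = 1, M = 1");

struct y2 { char c1; int i1; char c2; int i2; };   // two such pairs: 16 == 4*2 + 4*2
static_assert(sizeof(y2) == 16, "K = 2, M = 2");

int main() {
}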
QUESTION DATAREP-1F. You are given a structure struct Z { T1 a; T2 b; T3 c; } that contains no padding. What does (sizeof(T1) + sizeof(T2) + sizeof(T3)) % alignof(struct Z) equal?

0
QUESTION DATAREP-1G. Arrange the following types in increasing order by size. Sample answer: “1 < 2 = 4 < 3” (choose this if #1 has smaller size than #2, which has equal size to #4, which has smaller size than #3).

1. char
2. struct minipoint { uint8_t x; uint8_t y; uint8_t z; }
3. int
4. unsigned short[1]
5. char**
6. double[0]
#6 < #1 < #4 < #2 < #3 = #5 (this assumes 32-bit pointers, so that char** has the same size as int)
DATAREP-2. Expressions
QUESTION DATAREP-2A. Here are eight expressions. Group the expressions into four pairs so that the two expressions in each pair have the same value, and each pair has a different value from every other pair. There is one unique answer that meets these constraints. m has the same type and value everywhere it appears (there’s one unique value for m that meets the problem’s constraints). Assume an x86-32 machine: a 32-bit architecture in which pointers are 32 bits long.

1. sizeof(&m)
2. -1
3. m & -m
4. m + ~m + 1
5. 16 >> 2
6. m & ~m
7. m
8. 1
1—5; 2—7; 3—8; 4—6
1—5 is easy. m + ~m + 1 == m + (-m) == 0, and m & ~m == 0, giving us 4—6. Now what about the others? m & -m (#3) is either 0 or a power of 2, so it cannot be -1 (#2). The remaining possibilities are m and 1. If (m & -m) == m, then the remaining pair would be 1 and -1, which clearly doesn’t work. Thus m & -m matches with 1, and m == -1.
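As a quick check of the bit identities for m == -1 (my own sketch, not part of the original solution; it sidesteps the pointer-size pair, which assumed x86-32):

#include <assert.h>

int main() {
    int m = -1;
    assert(m + ~m + 1 == 0);    // pairs with m & ~m
    assert((m & ~m) == 0);
    assert((m & -m) == 1);      // pairs with the constant 1
    assert((16 >> 2) == 4);     // pairs with sizeof(&m) on x86-32, where pointers are 4 bytes
    return 0;
}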
DATAREP-3. Hello binary
This problem locates 8-bit numbers horizontally and vertically in the following 16x16 image. Black pixels represent 1 bits and white pixels represent 0 bits. For horizontal arrangements, the most significant bit is on the left as usual. For vertical arrangements, the most significant bit is on top.
Examples: The 8-bit number 15 (hexadecimal 0x0F, binary 0b00001111) is located horizontally at 3,4, which means X=3, Y=4.
- The pixel at 3,4 is white, which has bit value 0.
- 4,4 is white, also 0.
- 5,4 is white, also 0.
- 6,4 is white, also 0.
- 7,4 is black, which has bit value 1.
- 8,4, 9,4, and 10,4 are black, giving three more 1s.
- Reading them all off, this is 0b00001111, or 15.
15 is also located horizontally at 7,6.
The 8-bit number 0 is located vertically at 0,0. It is also located horizontally at 0,0 and 1,0.
The 8-bit number 134 (hexadecimal 0x86, binary 0b10000110) is located vertically at 8,4.
QUESTION DATAREP-3A. Where is 3 located vertically? (All questions refer to 8-bit numbers.)
9,6
QUESTION DATAREP-3B. Where is 12 located horizontally?
5,5
QUESTION DATAREP-3C. Where is 255 located vertically?
14,3
DATAREP-4. Hello memory
Shintaro Tsuji wants to represent the image of Question DATAREP-3 in computer memory. He stores it in an array of 16-bit unsigned integers:
unsigned short cute[16];
Row Y of the image is stored in integer cute[Y].

QUESTION DATAREP-4A. What is sizeof(cute): 2, 16, 32, or 64?
32
QUESTION DATAREP-4B. printf("%d\n", cute[0]); prints 16384. Is Shintaro’s machine big-endian or little-endian?
Little-endian
DATAREP-5. Hello program
Now that Shintaro has represented the image in memory as an array of unsigned short objects, he can manipulate the image using C. For example, here’s a function.
void swap(void) {
for (int i = 0; i < 16; ++i) {
cute[i] = (cute[i] << 8) | (cute[i] >> 8);
}
}
Running swap produces the following image (not shown here).
Shintaro has written several other functions. Here are some images (A is the original):
(Images A–J are not shown here.)
For each function, what image does that function create?
QUESTION DATAREP-5A.
void f0() {
for (int i = 0; i < 16; ++i) {
cute[i] = ~cute[i];
}
}
H. The code flips all bits in the input.
QUESTION DATAREP-5B.
void f1() {
for (int i = 0; i < 16; ++i) {
cute[i] = ((cute[i] >> 1) & 0x5555) | ((cute[i] << 1) & 0xAAAA);
cute[i] = ((cute[i] >> 2) & 0x3333) | ((cute[i] << 2) & 0xCCCC);
cute[i] = ((cute[i] >> 4) & 0x0F0F) | ((cute[i] << 4) & 0xF0F0);
cute[i] = (cute[i] >> 8) | (cute[i] << 8);
}
}
D. The code reverses the order of the bits within each 16-bit row (it swaps adjacent bits, then bit pairs, then nibbles, then bytes).
QUESTION DATAREP-5C.
void f2() {
char* x = (char*) cute;
for (int i = 0; i < 16; ++i) {
x[2*i] = i;
}
}
J. On this little-endian machine, x[2*i] is the low-order byte of cute[i], so each row’s low byte becomes the row index i while its high byte is unchanged.
For “fun”
The following programs generated the other images. Can you match them with their images?
void f3() {
for (int i = 0; i < 16; ++i) {
cute[i] &= ~(7 << i);
}
}
void f4() {
swap();
for (int i = 0; i < 16; ++i) {
cute[i] <<= i/4;
}
swap();
}
void f5() {
for (int i = 0; i < 16; ++i) {
cute[i] = -1 * !!(cute[i] & 64);
}
}
void f6() {
for (int i = 0; i < 8; ++i) {
int tmp = cute[15-i];
cute[15-i] = cute[i];
cute[i] = tmp;
}
}
void f7() {
for (int i = 0; i < 16; ++i) {
cute[i] = cute[i] & -cute[i];
}
}
void f8() {
for (int i = 0; i < 16; ++i) {
cute[i] ^= cute[i] ^ cute[i];
}
}
void f9() {
for (int i = 0; i < 16; ++i) {
cute[i] = cute[i] ^ 4080;
}
}
f3—I; f4—B; f5—C; f6—F; f7—G; f8—A; f9—E
DATAREP-6. Memory regions
Consider the following program:
struct ptrs {
int** x;
int* y;
};
struct ptrs global;
void setup(struct ptrs* p) {
int* a = malloc(sizeof(int));
int* b = malloc(sizeof(int));
int* c = malloc(sizeof(int));
int* d = malloc(sizeof(int));
int* e = malloc(sizeof(int) * 2);
int** f = malloc(4 * sizeof(int*));
int** g = malloc(sizeof(int*));
*a = 0;
*b = 0;
*c = (int) a;
*d = *b;
e[0] = 29;
e[1] = (int) &d[100000];
f[0] = b;
f[1] = c;
f[2] = 0;
f[3] = 0;
*g = c;
global.x = f;
global.y = e;
p->x = g;
p->y = &e[1];
}
int main(int argc, char** argv) {
stack_bottom = (char*) &argc;
struct ptrs p;
setup(&p);
m61_collect();
do_stuff(&p);
}
This program allocates objects a through g on the heap and then stores those pointers in some stack and global variables. (It then calls our conservative garbage collector from class⚠️, but that won’t matter until the next problem.) We recommend you draw a picture of the state setup creates.
QUESTION DATAREP-6A. Assume that (uintptr_t) a == 0x8300000, and that malloc returns increasing addresses. Match each address to the most likely expression with that address value. The expressions are evaluated within the context of main. You will not reuse an expression.
| | Value | | Expression |
|---|---|---|---|
| 1. | 0x8300040 | A. | &p |
| 2. | 0x8049894 | B. | (int*) *p.x[0] |
| 3. | 0x8361AF0 | C. | &global.y |
| 4. | 0x8300000 | D. | global.y |
| 5. | 0xBFAE0CD8 | E. | (int*) *p.y |
1—D; 2—C; 3—E; 4—B; 5—A
Since p has automatic storage duration, it is located on the stack, giving us 5—A. The global variable has static storage duration, and so does its component global.y; so the pointer &global.y has an address that is below all heap-allocated pointers. This gives us 2—C. The remaining expressions go like this:

- global.y == e;
- p.y == &e[1], so *p.y == e[1] == (int) &d[100000], and (int*) *p.y == &d[100000];
- p.x == g, so p.x[0] == g[0] == *g == c, and *p.x[0] == *c == (int) a.

Address #4 has value 0x8300000, which by assumption is a’s address; so 4—B. Address #3 is much larger than the other heap addresses, so 3—E. This leaves 1—D.
DATAREP-7. Garbage collection
Here is the top-level function for the conservative garbage collector we wrote in class. (⚠️ 2018 note: We haven’t done this. ⚠️)
void m61_collect(void) {
char* stack_top = (char*) &stack_top;
// The entire contents of the heap start out unmarked
for (size_t i = 0; i != nmr; ++i) {
mr[i].marked = 0;
}
// Mark all reachable objects, starting with the roots (the stack)
m61_markaccessible(stack_top, stack_bottom - stack_top);
// Free everything that wasn't marked
for (size_t i = 0; i != nmr; ++i) {
if (mr[i].marked == 0) {
m61_free(mr[i].ptr);
--i; // m61_free moved different data into this
// slot, so we must recheck the slot
}
}
}
This garbage collector is not correct because it doesn’t capture all memory roots.
Consider the program from the previous section, and assume that an
object is reachable if do_stuff
can access an address within the
object via variable references and memory dereferences without casts or
pointer arithmetic. Then:
QUESTION DATAREP-7A. Which reachable objects will m61_collect() free? Circle all that apply.

a, b, c, d, e, f, g, None of these
b, f.

The collector searches the stack for roots. This yields just the values in struct ptrs p (the only pointer-containing variable with automatic storage duration at the time m61_collect is called). The objects directly pointed to by p are g and e. The collector then recursively marks objects pointed to by these objects. From g, it finds c. From e, it finds nothing. Then it checks one more time. From c, it finds the value of a! Now, a is actually not a pointer here (the type of *c is int), so by the definition above, a is not actually reachable. But the collector doesn’t know this.

Putting it together, the collector marks a, c, e, and g. It won’t free these objects; it will free the others (b, d, and f). But b and f are reachable from global.
QUESTION DATAREP-7B. Which unreachable objects will m61_collect() not free? Circle all that apply.

a, b, c, d, e, f, g, None of these
a
QUESTION DATAREP-7C. Conservative garbage collection in C is often slower than precise garbage collection in languages such as Java. Why? Circle all that apply.
- C is generally slower than other languages.
- Conservative garbage collectors must search all reachable memory for pointers. Precise garbage collectors can ignore values that do not contain pointers, such as large character buffers.
- C programs generally use the heap more than programs in other languages.
- None of the above.
#2
DATAREP-8. Memory errors
The following function constructs and returns a lower-triangular matrix of size N. The elements are random 2-dimensional points in the unit square. The matrix is represented as an array of pointers to arrays.
struct point2 {
double d[2];
};
typedef point2* point2_vector;
point2_vector* make_random_lt_matrix(size_t N) {
point2_vector* m = (point2_vector*) malloc(sizeof(point2_vector) * N);
for (size_t i = 0; i < N; ++i) {
m[i] = (point2*) malloc(sizeof(point2) * (i + 1)); /* LINE A */
for (size_t j = 0; j <= i; ++j) {
for (int d = 0; d < 2; ++d) {
m[i][j].d[d] = drand48(); /* LINE B */
}
}
}
return m;
}
This code is running on an x86-32 machine (size_t
is 32 bits,
not 64). You may assume that the machine has enough free physical memory
and the process has enough available virtual address space to satisfy
any memory allocation request.
QUESTION DATAREP-8A. Give a value of N so that, while make_random_lt_matrix(N) is running, no allocation fails, but a memory error (such as a null pointer dereference or an out-of-bounds dereference) happens on Line A. The memory error should happen specifically when i == 1.
(This problem is probably easier when you write your answer in hexadecimal.)
We are asked to produce a value of N so that no memory error happens on Line A when i == 0, but a memory error does happen when i == 1. So reason that through. What memory errors could happen on Line A if malloc() returns non-nullptr? There’s only one memory operation, namely the dereference m[i]. Perhaps this dereference is out of bounds.

If no memory error happens when i == 0, then a m[0] dereference must not cause a memory error. So the m object must contain at least 4 bytes. But a memory error does happen on Line A when i == 1. So the m object must contain less than 8 bytes. How many bytes were allocated for m? sizeof(point2_vector) * N == sizeof(point2*) * N == 4 * N. So we have:

- 4 * N ≥ 4
- 4 * N < 8

It seems like the only possible answer is N == 1. But no, this doesn’t cause a memory error, because the loop body would never be executed with i == 1!

The key insight is that the multiplications above use 32-bit unsigned computer arithmetic. Let’s write N as X + 1. Then these inequalities become:

- 4 ≤ 4 * (X + 1) = 4 * X + 4 < 8
- 0 ≤ 4 * X < 4

(Multiplication distributes over addition in computer arithmetic.) What values of X satisfy this inequality? It might be easier to see if we remember that multiplication by powers of two is equivalent to shifting:

- 0 ≤ (X << 2) < 4

The key insight is that this shift eliminates the top two bits of X. There are exactly four values for X that work: 0, 0x40000000, 0x80000000, and 0xC0000000. For any of these, 4 * X == 0 in 32-bit computer arithmetic, because 4×X = 0 (mod 2^32) in normal arithmetic.

Plugging X back in to N, we see that N ∈ {0x40000001, 0x80000001, 0xC0000001}. These are the only values that work.

Partial credit was awarded for values that acknowledged the possibility of overflow.
QUESTION DATAREP-8B. Give a value of N so that no allocation fails, and no memory error happens on Line A, but a memory error does happen on Line B.
If no memory error happens on Line A, then N < 2^30 (otherwise overflow would happen as seen above). But a memory error does happen on Line B. Line B dereferences m[i][j], for 0 ≤ j ≤ i; so how big is m[i]? It was allocated on Line A with size sizeof(point2) * (i + 1) == 2 * sizeof(double) * (i + 1) == 16 * (i + 1). If i + 1 ≥ 2^32 / 16 = 2^28, this multiplication will overflow. Since i < N, we can finally reason that any N greater than or equal to 2^28 = 0x10000000 and less than 2^30 = 0x40000000 will cause the required memory error.
DATAREP-9. Data representation
Assume a 64-bit x86-64 architecture unless explicitly told otherwise.
Write your assumptions if a problem seems unclear, and write down your reasoning for partial credit.
QUESTION DATAREP-9A. Arrange the following values in increasing
numeric order. Assume that x
is an int
with value 8192.
a. EOF
b. x & ~x
c. (signed char) 0x47F
d. x | ~x
e. 1000
f. (signed char) 65535
g. The size of the stdio cache
h. -0x80000000
A possible answer might be “a < b < c = d < e < f < g < h.”
h < a = d = f < b < c < e < g
For each of the remaining questions, write one or more arguments that, when passed to the provided function, will cause it to return the integer 61 (which is 0x3d hexadecimal). Write the expected number of arguments of the expected types.
QUESTION DATAREP-9B.
int f1(int n) {
return 0x11 ^ n;
}
0x2c == 44
QUESTION DATAREP-9C.
int f2(const char* s) {
return strtol(s, nullptr, 0);
}
"61"
QUESTION DATAREP-9D. Your answer should be different from the previous answer.
int f3(const char* s) {
return strtol(s, nullptr, 0);
}
" 0x3d"
," 61 "
, etc.
QUESTION DATAREP-9E. For this problem, you will also need to define a global variable. Give its type and value.
f4:
andl $5, %edi
leal (%rsi,%rdi,2), %eax
movzbl y(%rip), %ecx
subl %ecx, %eax
retq
This code was compiled from:
int f4(int a, int b) { extern unsigned char y; return (a & 5) * 2 + b - y; }
A valid solution is a = 0, b = 61, unsigned char y = 0.
DATAREP-10. Sizes and alignments
Assume a 64-bit x86-64 architecture unless explicitly told otherwise.
Write your assumptions if a problem seems unclear, and write down your reasoning for partial credit.
QUESTION DATAREP-10A. Use the following members to create a struct
of size 16, using each member exactly once, and putting char a
first;
or say “impossible” if this is impossible.
- char a; (we’ve written this for you)
- unsigned char b;
- short c;
- int d;
struct size_16 {
char a;
};
Impossible
QUESTION DATAREP-10B. Repeat Part A, but create a struct with size 12.
struct size_12 {
char a;
};
abdc, acbd, acdb, adbc, adcb, …
QUESTION DATAREP-10C. Repeat Part A, but create a struct with size 8.
struct size_8 {
char a;
};
abcd
QUESTION DATAREP-10D. Consider the following structs:
struct x {
T x1;
U x2;
};
struct y {
struct x y1;
V y2;
};
Give definitions for T, U, and V so that there is one byte of padding in struct x after x2, and two bytes of padding in struct y after y1.

Example: T = short[2], U = char, V = int
DATAREP-11. Dynamic memory allocation
QUESTION DATAREP-11A. True or false?

1. free(nullptr) is an error.
2. malloc(0) can never return nullptr.

False, False
QUESTION DATAREP-11B. Give values for sz
and nmemb
so that
calloc(sz, nmemb)
will always return nullptr
(on a 32-bit x86 machine),
but malloc(sz * nmemb)
might or might not return null.
(size_t) -1, (size_t) -1 (anything that causes the multiplication to overflow works)
Consider the following 8 statements. (p and q have type char*.)

a. free(p);
b. free(q);
c. p = q;
d. q = nullptr;
e. p = (char*) malloc(12);
f. q = (char*) malloc(8);
g. p[8] = 0;
h. q[4] = 0;
QUESTION DATAREP-11C. Put the statements in an order that would execute without error or evoking undefined behavior. Memory leaks count as errors. Use each statement exactly once. Sample answer: “abcdefgh.”
cdefghab (and others). Expect “OK”
QUESTION DATAREP-11D. Put the statements in an order that would cause one double-free error, and no other error or undefined behavior (except possibly one memory leak). Use each statement exactly once.
efghbcad (and others). Expect “double-free + memory leak”
QUESTION DATAREP-11E. Put the statements in an order that would cause one memory leak (one allocated piece of memory is not freed), and no other error or undefined behavior. Use each statement exactly once.
efghadbc (and others). Expect “memory leak”
QUESTION DATAREP-11F. Put the statements in an order that would cause one boundary write error, and no other error or undefined behavior. Use each statement exactly once.
eafhcgbd (and others). Expect “out of bounds write”
DATAREP-12. Pointers and debugging allocators
You are debugging some students’ m61
code from Problem Set 1. The
codes use the following metadata:
struct meta { ...
meta* next;
meta* prev;
};
meta* mhead; // head of active allocations list
Their linked-list manipulations in m61_malloc
are similar.
void* m61_malloc(size_t sz, const char* file, int line) {
...
meta* m = (meta*) ptr;
m->next = mhead;
m->prev = nullptr;
if (mhead) {
mhead->prev = m;
}
mhead = m;
...
}
But their linked-list manipulations in m61_free
differ.
Alice’s code:
void m61_free(void* ptr, ...) {
    ...
    meta* m = (meta*) ptr - 1;
    if (m->next != nullptr) {
        m->next->prev = m->prev;
    }
    if (m->prev == nullptr) {
        mhead = nullptr;
    } else {
        m->prev->next = m->next;
    }
    ...
}
Bob’s code:
void m61_free(void* ptr, ...) {
    ...
    meta* m = (meta*) ptr - 1;
    if (m->next) {
        m->next->prev = m->prev;
    }
    if (m->prev) {
        m->prev->next = m->next;
    }
    ...
}
Chris’s code:
void m61_free(void* ptr, ...) {
    ...
    meta* m = (meta*) ptr - 1;
    m->next->prev = m->prev;
    m->prev->next = m->next;
    ...
}
Donna’s code:
void m61_free(void* ptr, ...) {
    ...
    meta* m = (meta*) ptr - 1;
    if (m->next) {
        m->next->prev = m->prev;
    }
    if (m->prev) {
        m->prev->next = m->next;
    } else {
        mhead = m->next;
    }
    ...
}
You may assume that all code not shown is correct.
QUESTION DATAREP-12A. Whose code will segmentation fault on this input? List all students that apply.
int main() {
void* ptr = malloc(1);
free(ptr);
}
Chris
QUESTION DATAREP-12B. Whose code might report something like “invalid free of pointer [ptr1], not allocated” on this input? (Because a list traversal starting from mhead fails to find ptr1.) List all students that apply. Don’t include students whose code would segfault before the report.
int main() {
void* ptr1 = malloc(1);
void* ptr2 = malloc(1);
free(ptr2);
free(ptr1); // <- message printed here
}
Alice
QUESTION DATAREP-12C. Whose code would improperly report something like “LEAK CHECK: allocated object [ptr1] with size 1” on this input? (Because the mhead list appears not empty, although it should be.) List all students that apply. Don’t include students whose code would segfault before the report.
int main() {
void* ptr1 = malloc(1);
free(ptr1);
m61_printleakreport();
}
Bob
QUESTION DATAREP-12D. Whose linked-list code is correct for all inputs? List all that apply.
Donna
DATAREP-13. Arena allocation
Chimamanda Ngozi Adichie is writing a program that needs to allocate and free a lot of nodes, where a node is defined as follows:
struct node {
int key;
void* value;
node* left;
node* right; // also used in free list
};
She uses an arena allocator variant. Here’s her code.
struct arena_group {
arena_group* next_group;
node nodes[1024];
};
struct arena {
node* frees;
arena_group* groups;
};
node* node_alloc(arena* a) {
if (!a->frees) {
arena_group* g = new arena_group;
// ... link `g` to `a->groups` ...
for (size_t i = 0; i != 1023; ++i) {
g->nodes[i].right = &g->nodes[i + 1];
}
g->nodes[1023].right = nullptr;
a->frees = &g->nodes[0];
}
node* n = a->frees;
a->frees = n->right;
return n;
}
void node_free(arena* a, node* n) {
n->right = a->frees;
a->frees = n;
}
QUESTION DATAREP-13A. True or false?
- This allocator never has external fragmentation.
- This allocator never has internal fragmentation.
True, True
QUESTION DATAREP-13B. Chimamanda’s frenemy Paul Auster notices that if many nodes are allocated right in a row, every 1024th allocation seems much more expensive than the others. The reason is that every 1024th allocation initializes a new group, which in turn adds 1024 nodes to the free list. Chimamanda decides instead to allow a single element of the free list to represent many contiguous free nodes. The average allocation might get a tiny bit slower, but no allocation will be much slower than average. Here’s the start of her idea:
node* node_alloc(arena* a) {
if (!a->frees) {
arena_group* g = new arena_group;
// ... link `g` to `a->groups` ...
g->nodes[0].key = 1024; // g->nodes[0] is the 1st of 1024 contiguous free nodes
g->nodes[0].right = nullptr;
a->frees = &g->nodes[0];
}
node* n = a->frees;
// ???
return n;
}
Complete this function by writing code to replace // ???
.
if (n->key == 1) {
    a->frees = n->right;
} else {
    a->frees = n + 1;
    a->frees->key = n->key - 1;
    a->frees->right = n->right;
}

Another solution:

if (n->right) {
    a->frees = n->right;
} else if (n->key == 1) {
    a->frees = NULL;
} else {
    a->frees = n + 1;
    a->frees->key = n->key - 1;
}
QUESTION DATAREP-13C. Write a node_free
function that works with
the node_alloc
function from the previous question.
void node_free(arena* a, node* n) {
}
void node_free(arena* a, node* n) {
    n->right = a->frees;
    n->key = 1;
    a->frees = n;
}
Or, if you use the solution above:
void node_free(arena* a, node* n) {
    n->right = a->frees;
    a->frees = n;
}
QUESTION DATAREP-13D. Complete the following new function.
// Return the arena_group containing node `n`. `n` must be a node returned by
// a previous call to `node_alloc(a)`.
arena_group* node_find_group(arena* a, node* n) {
for (arena_group* g = a->groups; g; g = g->next_group) {
}
return nullptr;
}
arena_group* node_find_group(arena* a, node* n) {
    for (arena_group* g = a->groups; g; g = g->next_group) {
        if ((uintptr_t) &g->nodes[0] <= (uintptr_t) n
            && (uintptr_t) n <= (uintptr_t) &g->nodes[1023]) {
            return g;
        }
    }
    return nullptr;
}
QUESTION DATAREP-13E. Chimamanda doesn’t like that the
node_find_group
function from part D takes O(G) time,
where G is the number of allocated arena_groups. She remembers a
library function that might help, posix_memalign
:
int posix_memalign(void** memptr, size_t alignment, size_t size);
The function posix_memalign() allocates size bytes and places the address of the allocated memory in *memptr. The address of the allocated memory will be a multiple of alignment, which must be a power of two and a multiple of sizeof(void*). ...
“Cool,” she says, “I can use this to speed up node_find_group!” She now allocates a new group with the following code:
arena_group* g;
int r = posix_memalign(&g, 32768, sizeof(arena_group));
assert(r == 0); // posix_memalign succeeded
Given this allocation strategy, write a version of node_find_group
that takes O(1) time.
arena_group* node_find_group(arena* a, node* n) {
}
arena_group* node_find_group(arena* a, node* n) {
    uintptr_t n_addr = (uintptr_t) n;
    return (arena_group*) (n_addr - n_addr % 32768);
}

This works because posix_memalign places every arena_group at an address that is a multiple of 32768, and every node begins within the first 32768 bytes of its group, so rounding a node’s address down to a multiple of 32768 recovers the address of its group.
DATAREP-14. Data representation
Sort the following expressions in ascending order by value, using the operators <, =, >. For example, if we gave you:
int A = 6;
int B = 0x6;
int C = 3;
you would write C < A = B
.
unsigned char a = 0x191;
char b = 0x293;
unsigned long c = 0xFFFFFFFF;
int d = 0xFFFFFFFF;
int e = d + 3;
f = 4 GB
size_t g = sizeof(*s) (given short* s)
long h = 256;
i = 0b100000000000000000000000000000000000 (binary)
unsigned long j = 0xACE - 0x101;
b < d < e = g < a < h < j < c < f < i
DATAREP-15. Memory
For the following questions, select the part(s) of memory from the list below that best describes where you will find the object.
- heap
- stack
- between the heap and the stack
- in a read-only data segment
- in a text segment starting at address 0x08048000
- in a read/write data segment
- in a register
Assume the following code, compiled without optimization.
#include <stdio.h>
#include <stdlib.h>
const long maxitems = 1000;
struct info {
char name[20];
unsigned int age;
short height;
} s = { "sushi", 1, 9 };
int main(int argc, char* argv[]) {
static long L = 0xbadf00d;
unsigned long u = 0x8badf00d;
int i, num = maxitems + 1;
struct info *sp;
printf("What did you do? %lx?\n", u);
while (num > maxitems || num < 10) {
printf("How much of it did you eat? ");
scanf(" %d", &num);
}
sp = (struct info *)malloc(num * sizeof(*sp));
for (i = 0; i < num; i++) {
sp[i] = s;
}
return 0xdeadbeef;
}
QUESTION DATAREP-15A. The value 0xdeadbeef, when we are returning from main.
7, in a register
QUESTION DATAREP-15B. The variable maxitems
4, in a read-only data segment
QUESTION DATAREP-15C. The structure s
6, in a read/write data segment
QUESTION DATAREP-15D. The structure at sp[9]
1, heap
QUESTION DATAREP-15E. The variable u
2, stack, or 7, in a register
QUESTION DATAREP-15F. main
5, in a text segment starting at address 0x08048000
QUESTION DATAREP-15G. printf
3, between the heap and the stack
QUESTION DATAREP-15H. argc
2, stack, or 7, in a register
QUESTION DATAREP-15I. The number the user enters
2, stack
QUESTION DATAREP-15J. The variable L
6, in a read/write data segment
DATAREP-16. Memory and pointers
⚠️ This question may benefit from Unit 4, kernel programming. ⚠️
If multiple processes are sharing data via mmap
, they may have the file
mapped at different virtual addresses. In this case, pointers to the
same object will have different values in the different processes. One
way to store pointers in mmapped memory so that multiple processes can
access them consistently is using relative pointers. Rather than storing
a regular pointer, you store the offset from the beginning of the
mmapped region and add that to the address of the mapping to obtain a
real pointer. An alternative representation is called self-relative
pointers. In this case, you store the difference in address between the
current location (i.e., the location containing the pointer) and the
location to which you want to point. Neither representation addresses
pointers between the mmapped region and the rest of the address space;
you may assume such pointers do not exist.
QUESTION DATAREP-16A. State one advantage that relative pointers have over self-relative pointers.
The key thing to understand is that both of these approaches use relative pointers and both can be used to solve the problem of sharing a mapped region among processes that might have the region mapped at different addresses.
Possible advantages:
- Within a region, you can safely use memcpy, since moving relative pointers around inside the region does not change their value. If you copy a self-relative pointer to a new location, its value has to change. That is, imagine that you have a self-relative pointer at offset 4 from the region and it points to the object at offset 64 from the region. The value of the self-relative pointer is 60. If I copy that pointer to offset 100 from the region, I have to change it to be -36.
- If you save the region as a uintptr_t or a char*, then you can simply add the offset to the region; self-relative pointers will always be adding/subtracting from the address of the location storing the pointer, which may have a type other than char*, so you'd need to cast it before performing the addition/subtraction.
- You can use a larger region: if we assume that we have only N bits to store the pointer, then in the base+offset model, the offset could be an unsigned value, which will be larger than the maximum offset possible with a signed pointer, which you need for the self-relative case. That is, although the number of values that can be represented by signed and unsigned numbers differs by one, the implementation must allow for a pointer from the beginning of the region to reference an item at the very last location of the region; thus, your region size is limited by the largest positive number you can represent.
QUESTION DATAREP-16B. State one advantage that self-relative pointers have over relative pointers.
You don't have to know the address at which the region is mapped to use them. That is, given a location containing a self-relative pointer, you can find the target of that pointer.
For the following questions, assume the following setup:
char* region; /* Address of the beginning of the region. */
// The following are sample structures you might find in
// a linked list that you are storing in an mmaped region.
struct ll1 {
unsigned value;
TYPE1 r_next; /* Relative Pointer. */
};
struct ll2 {
unsigned value;
TYPE2 sr_next; /* Self-Relative Pointer. */
};
ll1 node1;
ll2 node2;
QUESTION DATAREP-16C. Propose a type for TYPE1 and give 1 sentence why you chose that type.
A good choice is ptrdiff_t, which represents differences between pointers. Other reasonable choices include uintptr_t and unsigned long.
QUESTION DATAREP-16D. Write a C expression to generate a (properly typed) pointer to the element referenced by the r_next field of ll1.
(ll1*) (region + node1.r_next)
QUESTION DATAREP-16E. Propose a type for TYPE2 and give 1 sentence why you chose that type.
The same choices work; again ptrdiff_t is best.
QUESTION DATAREP-16F. Write a C expression to generate a (properly typed) pointer to the element referenced by the sr_next field of ll2.
(ll2*) ((char*) &node2.sr_next + node2.sr_next)
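To tie parts C–F together, here is a small sketch (my own, with ptrdiff_t fields and a plain global array standing in for the mmapped region) that stores and then decodes one relative pointer and one self-relative pointer:

#include <assert.h>
#include <stddef.h>

struct ll1 { unsigned value; ptrdiff_t r_next; };   // relative: offset from the region start
struct ll2 { unsigned value; ptrdiff_t sr_next; };  // self-relative: offset from the field itself

alignas(8) char region[4096];                       // stands in for the mapped file

int main() {
    ll1* a = (ll1*) &region[0];
    ll1* b = (ll1*) &region[256];
    a->r_next = (char*) b - region;                          // encode
    assert((ll1*) (region + a->r_next) == b);                // decode, as in part D

    ll2* c = (ll2*) &region[512];
    ll2* d = (ll2*) &region[1024];
    c->sr_next = (char*) d - (char*) &c->sr_next;            // encode
    assert((ll2*) ((char*) &c->sr_next + c->sr_next) == d);  // decode, as in part F
    return 0;
}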
DATAREP-17. Data representation: Allocation sizes
union my_union {
int f1[4];
long f2[2];
};
int main() {
void* p = malloc(sizeof(char*));
my_union u;
my_union* up = &u;
....
}
How much user-accessible space is allocated on the stack and/or the heap by each of the following statements? Assume x86-64.
QUESTION DATAREP-17A. union my_union { ... };
0; this declares the type, not any object
QUESTION DATAREP-17B. void* p = malloc(sizeof(char*));
16: 8 on the heap plus 8 on the stack
QUESTION DATAREP-17C. my_union u;
16 (on the stack)
QUESTION DATAREP-17D. my_union* up = &u;
8 (on the stack)
DATAREP-18. Data representation: ENIAC
Professor Kohler has been developing Eddie’s NIfty Awesome Computer (ENIAC). When he built the C compiler for ENIAC, he assigned the following sizes and alignments to C’s fundamental data types. (Assume that every other fundamental type has the same size and alignment as one of these.)
| Type | sizeof | alignof | |
|---|---|---|---|
| char | 1 | 1 | |
| char* | 16 | 16 | Same for any pointer |
| short | 4 | 4 | |
| int | 8 | 8 | |
| long | 16 | 16 | |
| long long | 32 | 32 | |
| float | 16 | 16 | |
| double | 32 | 32 | |
QUESTION DATAREP-18A. This set of sizes is valid: it obeys all the requirements set by C’s abstract machine. Give one different size assignment that would make the set as a whole invalid.
Some examples: sizeof(char) = 0; sizeof(char) = 2; sizeof(short) = 16 (i.e., longer than int); sizeof(int) = 2 (though not discussed in class, it turns out that C++ requires that int be at least 2 bytes big); etc.
QUESTION DATAREP-18B. What alignment must the ENIAC malloc guarantee?
32
For the following two questions, assume the following struct on the ENIAC:
struct s {
char f1[7];
char *f2;
short f3;
int f4;
};
QUESTION DATAREP-18C. What is sizeof(struct s)?

- f1 is 7 bytes.
- f2 is 16 bytes with 16-byte alignment, so add 9 bytes of padding.
- f3 is 4 bytes (and is already aligned).
- f4 is 8 bytes with 8-byte alignment, so add 4 bytes of padding.
is 8 bytes with 8-byte alignment, so add 4B padding.That adds up to 7 + 9 + 16 + 4 + 4 + 8 = 16 + 16 + 16 = 48 bytes.
That’s a multiple of the structure’s alignment, which is 16, so no need for any end padding.
QUESTION DATAREP-18D. What is alignof(struct s)?
16
The remaining questions refer to this structure definition:
// This include file defines a struct inner, but you do not know anything
// about that structure, just that it exists.
#include "inner.hh"
struct outer {
char f1[3];
inner f2;
short f3;
int f4;
};
Indicate for each statement whether the statement is always true, possibly true, or never true on the ENIAC.
QUESTION DATAREP-18E: sizeof(outer) > sizeof(inner)
(Always / Possibly / Never)
Always
QUESTION DATAREP-18F: sizeof(outer)
is a multiple of
sizeof(inner)
(Always / Possibly / Never)
Possibly
QUESTION DATAREP-18G: alignof(outer) > alignof(struct inner) (Always / Possibly / Never)
Possibly
QUESTION DATAREP-18H: sizeof(outer) - sizeof(inner) < 4
(Always / Possibly / Never)
Never
QUESTION DATAREP-18I: sizeof(outer) - sizeof(inner) > 32
(Always / Possibly / Never)
Possibly
QUESTION DATAREP-18J: alignof(inner) == 2
(Always / Possibly / Never)
Never
DATAREP-19. Undefined behavior
Which of the following expressions, instruction sequences, and code behaviors cause undefined behavior? For each question, write Defined or Undefined. (Note that the INT_MAX and UINT_MAX constants have types int and unsigned, respectively.)
QUESTION DATAREP-19A. INT_MAX + 1
(Defined / Undefined)
Undefined
QUESTION DATAREP-19B. UINT_MAX + 1
(Defined / Undefined)
Defined
QUESTION DATAREP-19C.
movq $0x7FFFFFFFFFFFFFFF, %rax
addl $1, %rax
(Defined / Undefined)
Defined (only C++ programs can have undefined behavior; the behavior of x86-64 instructions is always defined)
QUESTION DATAREP-19D. Failed memory allocation, i.e., malloc
returns nullptr
(Defined / Undefined)
Defined
QUESTION DATAREP-19E. Use-after-free (Defined / Undefined)
Undefined
QUESTION DATAREP-19F. Here are two functions and a global variable:
const char string[128] = ".......";
int read_nth_char(int n) {
return string[n];
}
int f(int i) {
if (i & 0x40) {
return read_nth_char(i * 2);
} else {
return i * 2;
}
}
C’s undefined behavior rules would allow an aggressive optimizing compiler to simplify the code generated for f. Fill in the following function with the simplest C code you can, under the constraint that an aggressive optimizing compiler might generate the same object code for f and f_simplified.
int f_simplified(int i) {
}
return i * 2;

(If i & 0x40 is nonzero, then i * 2 cannot be a valid index into the 128-element string array: it is at least 128, negative, or the multiplication overflows. That branch therefore always has undefined behavior, so the compiler may generate anything for it, including the same code as the other branch.)
DATAREP-20. Bit manipulation
It’s common in systems code to need to switch data between big-endian and little-endian representations. This is because networks represent multi-byte integers using big-endian representation, whereas x86-family processors store multi-byte integers using little-endian representation.
QUESTION DATAREP-20A. Complete this function, which translates an integer from big-endian representation to little-endian representation by swapping bytes. For instance, big_to_little(0x01020304) should return 0x04030201. Your return statement must refer to the u.c array, and must not refer to x. This function is compiled on x86-64 Linux (as every function is unless we say otherwise).
unsigned big_to_little(unsigned x) {
union {
unsigned intval;
unsigned char c[4];
} u;
u.intval = x;
return ______________________________________;
}
return (u.c[0] << 24) | (u.c[1] << 16) | (u.c[2] << 8) | u.c[3];
QUESTION DATAREP-20B. Complete the function again, but this time
write a single expression that refers to x
(you may refer to x
multiple times, of course).
unsigned big_to_little(unsigned x) {
return ______________________________________;
}
return ((x & 0xFF) << 24) | ((x & 0xFF00) << 8) | ((x & 0xFF0000) >> 8) | (x >> 24);
QUESTION DATAREP-20C. Now write the function little_to_big, which will translate a little-endian integer into big-endian representation. You may introduce helper variables or even call big_to_little if that’s helpful.
unsigned little_to_big(unsigned x) {
}
return big_to_little(x);
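A quick test of the byte-swapping answers (my own sketch, not part of the original solution); since swapping bytes is its own inverse, little_to_big really can just call big_to_little:

#include <assert.h>
#include <stdio.h>

unsigned big_to_little(unsigned x) {
    return ((x & 0xFF) << 24) | ((x & 0xFF00) << 8)
         | ((x & 0xFF0000) >> 8) | (x >> 24);
}

int main() {
    assert(big_to_little(0x01020304) == 0x04030201);
    assert(big_to_little(big_to_little(0xDEADBEEF)) == 0xDEADBEEF);  // swapping twice gives back the input
    printf("OK\n");
    return 0;
}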
DATAREP-21. Computer arithmetic
Bitwise operators and computer arithmetic can represent vectors of bits, which in turn are useful for representing sets. For example, say we have a function bit that maps elements to distinct bits; thus, bit(X) == (1 << i) for some i. Then a set {X0, X1, X2, …, Xn} can be represented as bit(X0) | bit(X1) | bit(X2) | … | bit(Xn). Element Xi is in the set with integer representation z if and only if (bit(Xi) & z) != 0.
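For concreteness, here is a tiny sketch of this set representation (my own example; the elements A, B, C and their bit assignments are made up):

#include <assert.h>

enum { BIT_A = 1 << 0, BIT_B = 1 << 1, BIT_C = 1 << 2 };  // bit(A), bit(B), bit(C)

int main() {
    unsigned s = BIT_A | BIT_C;     // the set {A, C}
    assert((s & BIT_A) != 0);       // A is a member
    assert((s & BIT_B) == 0);       // B is not
    s ^= BIT_B;                     // toggle B's membership: now {A, B, C}
    s &= ~BIT_A;                    // remove A: now {B, C}
    assert(s == (BIT_B | BIT_C));
    return 0;
}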
QUESTION DATAREP-21A. What is the maximum number of set elements
that can be represented in a single unsigned
variable on an x86
machine?
32
QUESTION DATAREP-21B. Match each set operation with the C operator(s) that could implement that operation. (Complement is a unary operation.)
Set operations: intersection, equality, complement, union, toggle membership (flip whether an element is in the set).

C operators to match: ==, ~, &, ^, |
intersection: &
equality: ==
complement: ~
union: |
toggle membership: ^
QUESTION DATAREP-21C. Complete this function, which should return the set difference between the sets with representations a and b. This is the set containing exactly those elements of set a that are not in set b.
unsigned set_difference(unsigned a, unsigned b) {
}
Any of these work:

return a & ~b;
return a - (a & b);
return a & ~(a & b);
QUESTION DATAREP-21D. Below we’ve given a number of C++ expressions, some of their values, and some of their set representations for a set of elements. For example, the first row says that the integer value of expression 0 is just 0, which corresponds to an empty set. Fill in the blanks. This will require figuring out which bits correspond to the set elements A, B, C, and D, and the values for the 32-bit int variables a, x, and s. No arithmetic operation overflows; abs(x) returns the absolute value of x (that is, x < 0 ? -x : x).
| Expression e | Integer value | Represented set |
|---|---|---|
| 0 | 0 | {} |
| a == a | ______________ | {A} |
| (unsigned) ~a < (unsigned) a | ______________ | {A} |
| a < 0 | ______________ | ______________ |
| (1 << (s/2)) - 1 | ______________ | {A,B,C,D} |
| a * a | ______________ | {C} |
| abs(a) | ______________ | ______________ |
| x & (x - 1) | ______________ | {} |
| x - 1 | ______________ | {A,D} |
| x | ______________ | ______________ |
| s | ______________ | ______________ |
| Expression e | Integer value | Represented set |
|---|---|---|
| 0 | 0 | {} |
| a == a | 1 | {A} |
| (unsigned) ~a < (unsigned) a | 1 | {A} |
| a < 0 | 1 | {A} |
| (1 << (s/2)) - 1 | 15 | {A,B,C,D} |
| a * a | 4 | {C} |
| abs(a) | 2 | {D} |
| x & (x - 1) | 0 | {} |
| x - 1 | 3 | {A,D} |
| x | 4 | {C} |
| s | 8 | {B} |
DATAREP-22. Bit Tac Toe
Brenda Bitdiddle is implementing tic-tac-toe using bitwise arithmetic. (If you’re unfamiliar with tic-tac-toe, see below.) Her implementation starts like this:
struct tictactoe {
unsigned moves[2];
};
#define XS 0
#define OS 1
void tictactoe_init(tictactoe* b) {
b->moves[XS] = b->moves[OS] = 0;
}
static const unsigned ttt_values[3][3] = {
{ 0x001, 0x002, 0x004 },
{ 0x010, 0x020, 0x040 },
{ 0x100, 0x200, 0x400 }
};
// Mark a move by player `p` at row `row` and column `col`.
// Return 0 on success; return –1 if position `row,col` has already been used.
int tictactoe_move(tictactoe* b, int p, int row, int col) {
1. assert(row >= 0 && row < 3 && col >= 0 && col < 3);
2. assert(p == XS || p == OS);
3. /* TODO: check for position reuse */
4. b->moves[p] |= ttt_values[row][col];
5. return 0;
}
Each position on the board is assigned a distinct bit.
Tic-tac-toe, also known as noughts and crosses, is a simple paper-and-pencil game for two players, X and O. The board is a 3x3 grid. The players take turns writing their symbol (X or O) in an empty square on the grid. The game is won when one player gets their symbol in all three squares in one of the rows, one of the columns, or one of the two diagonals. X goes first; played perfectly, the game always ends in a draw.
You may access the Wikipedia page for tic-tac-toe.
QUESTION DATAREP-22A. Brenda’s current code doesn’t check whether a move reuses a position. Write a snippet of C code that returns –1 if an attempted move is reusing a position. This snippet will replace line 3.
Lots of people misinterpreted this to mean the player reused their own position and ignored the other player. That mistake was allowed with no points off. The code below checks whether any position was reused by either player.
if ((b->moves[XS] | b->moves[OS]) & ttt_values[row][col]) {
    return -1;
}

or

if ((b->moves[XS] | b->moves[OS] | ttt_values[row][col]) == (b->moves[XS] | b->moves[OS])) {
    return -1;
}

or

if ((b->moves[XS] + b->moves[OS]) & ttt_values[row][col]) {
    return -1;
}

or

if ((b->moves[p] ^ ttt_values[row][col]) < b->moves[p]) {
    return -1;
}

etc.
QUESTION DATAREP-22B. Complete the following function. You may use the following helper function:
int popcount(unsigned n)

Return the number of 1 bits in n. (Stands for “population count”; it is implemented on recent x86 processors by a single instruction, popcnt.)
For full credit, your code should consist of a single “return” statement with a simple expression, but for substantial partial credit write any correct solution.
// Return the number of moves that have happened so far.
int tictactoe_nmoves(const tictactoe* b) {
}
return popcount(b->moves[XS] | b->moves[OS]);
QUESTION DATAREP-22C. Write a simple expression that, if nonzero, indicates that player XS has a win on board b across the main diagonal (has marks in positions 0,0, 1,1, and 2,2).
(b->moves[XS] & 0x421) == 0x421
Lydia Davis notices Brenda’s code and has a brainstorm. “If you use different values,” she suggests, “it becomes easy to detect any win.” She suggests:
static const unsigned ttt_values[3][3] = {
{ 0x01001001, 0x00010002, 0x10100004 },
{ 0x00002010, 0x22020020, 0x00200040 },
{ 0x40004100, 0x00040200, 0x04400400 }
};
QUESTION DATAREP-22D. Repeat part A for Lydia’s values: Write a snippet of C code that returns –1 if an attempted move is reusing a position. This snippet will replace line 3 in Brenda’s code.
The same answers as for part A work.
QUESTION DATAREP-22E. Repeat part B for Lydia’s values: Use popcount to complete tictactoe_nmoves.
int tictactoe_nmoves(const tictactoe* b) {
}
Either of:

return popcount((b->moves[0] | b->moves[1]) & 0x777);
return popcount((b->moves[0] | b->moves[1]) & 0x777000);
QUESTION DATAREP-22F. Complete the following function for Lydia’s values. For full credit, your code should consist of a single “return” statement containing exactly two constants, but for substantial partial credit write any correct solution.
// Return nonzero if player `p` has won, 0 if `p` has not won.
int tictactoe_check_win(const tictactoe* b, int p) {
assert(p == XS || p == OS);
}
return (b->moves[p] + 0x11111111) & 0x88888888;

(With Lydia’s values, each of the eight winning lines owns one hex nibble, and the three cells of that line contribute the values 1, 2, and 4 to that nibble. Adding 0x11111111 adds 1 to every nibble; a nibble reaches 8, without carrying into its neighbor, exactly when all three of its line’s cells were marked, so masking with 0x88888888 is nonzero exactly when some line is complete.)

Another amazing possibility (Allen Chen and others):

return b->moves[p] & (b->moves[p] << 1) & (b->moves[p] << 2);
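Here is a small check of the nibble-carry trick (my own sketch; check_win is a hypothetical stand-alone copy of the returned expression):

#include <assert.h>

// Lydia's values: each of the eight winning lines owns one hex nibble.
static const unsigned lydia_values[3][3] = {
    { 0x01001001, 0x00010002, 0x10100004 },
    { 0x00002010, 0x22020020, 0x00200040 },
    { 0x40004100, 0x00040200, 0x04400400 }
};

static unsigned check_win(unsigned moves) {
    return (moves + 0x11111111) & 0x88888888;
}

int main() {
    // A completed main diagonal wins...
    unsigned diag = lydia_values[0][0] | lydia_values[1][1] | lydia_values[2][2];
    assert(check_win(diag) != 0);
    // ...but two marks on that diagonal do not.
    unsigned partial = lydia_values[0][0] | lydia_values[1][1];
    assert(check_win(partial) == 0);
    return 0;
}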
DATAREP-23. Memory and Pointers
Two processes are mapping a file into their address space. The mapped file contains an unsorted linked list of integers. As the processes cannot ensure that the file will be mapped at the same virtual address, they use relative pointers to link elements in the list. A relative pointer holds not an address, but an offset that user code can use to calculate a true address. Our processes define the offset as relative to the start of the file.
Thus, each element in the linked list is represented by the following structure:
struct ll_node {
int value;
size_t offset;
};
offset == (size_t) -1
indicates the end of the list. Other
offset
values represent the position of the next item in the list,
calculated relative to the start of the file.
QUESTION DATAREP-23A. Write a function to find an item in the list. The function's prototype is:
ll_node* find_element(void* mapped_file, ll_node* list, int value);
The mapped_file
parameter is the address of the mapped file data;
the list
parameter is a pointer to the first node in the list; and
the value
parameter is the value for which we are searching. The
function should return a pointer to the linked list element if the value
appears in the list or nullptr if the value is not in the list.
ll_node* find_element(void* mapped_file, ll_node* list, int value) {
    while (1) {
        if (list->value == value) {
            return list;
        }
        if (list->offset == (size_t) -1) {
            return NULL;
        }
        list = (ll_node*) ((char*) mapped_file + list->offset);
    }
}
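To see the relative-pointer walk in action, here is a self-contained sketch (my own; the buffer, offsets, and values are made up) that builds a three-node list inside an array standing in for the mapped file and searches it with a copy of the answer’s find_element:

#include <assert.h>
#include <stddef.h>
#include <string.h>

struct ll_node {
    int value;
    size_t offset;
};

ll_node* find_element(void* mapped_file, ll_node* list, int value) {
    while (1) {
        if (list->value == value) {
            return list;
        }
        if (list->offset == (size_t) -1) {
            return NULL;
        }
        list = (ll_node*) ((char*) mapped_file + list->offset);
    }
}

int main() {
    // The list order is 10 (offset 0) -> 30 (offset 64) -> 20 (offset 16, end of list).
    alignas(ll_node) char file[128];
    ll_node n0 = { 10, 64 };
    ll_node n1 = { 20, (size_t) -1 };
    ll_node n2 = { 30, 16 };
    memcpy(file + 0, &n0, sizeof(n0));
    memcpy(file + 16, &n1, sizeof(n1));
    memcpy(file + 64, &n2, sizeof(n2));

    ll_node* list = (ll_node*) file;
    assert(find_element(file, list, 20)->value == 20);
    assert(find_element(file, list, 99) == NULL);
    return 0;
}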
DATAREP-24. Integer representation
Write the value of the variable or expression in each problem, using signed decimal representation.
For example, if we gave you:
int i = 0xA;
int j = 0xFFFFFFFF;
you would write A) 10 B) -1.
QUESTION DATAREP-24A. int i = 0xFFFF;
(You may write this either
in decimal or as an expression using a power of 2)
2^16 - 1 or 65535
QUESTION DATAREP-24B. short s = 0xFFFF;
(You may write this
either in decimal or as an expression using a power of 2)
-1
QUESTION DATAREP-24C. unsigned u = 1 << 10;
1024 or 2^10
QUESTION DATAREP-24D. ⚠️ From WeensyOS: unsigned long l = PTE_P | PTE_U;
5
QUESTION DATAREP-24E. int j = ~0;
-1
QUESTION DATAREP-24F. ⚠️ From WeensyOS: sizeof(x86_64_pagetable);
4096 or 2^12
QUESTION DATAREP-24G. Given this structure:
struct s {
char c;
short s;
long l;
};
s* ps;
This expression: sizeof(ps);
TRICK QUESTION! 8
QUESTION DATAREP-24H. Using the structure above: sizeof(*ps);
16
QUESTION DATAREP-24I. unsigned char u = 0xABC;
0xBC == 11*16 + 12 == 160 + 16 + 12 == 188
QUESTION DATAREP-24J. signed char c = 0xABC;
0xBC has its most-significant bit on, so the value as a signed char is less than zero. We seek x so that 0xBC + x == 0x100. The answer is 0x44: 0xBC + 4 == 0xC0, and 0xC0 + 0x40 == 0x100. So the value is −0x44 == −4*16 − 4 == −68.
DATAREP-25. Data representation
In gdb, you observe the following values for a set of memory locations.
0x100001020: 0xa0 0xb1 0xc2 0xd3 0xe4 0xf5 0x06 0x17
0x100001028: 0x28 0x39 0x4a 0x5b 0x6c 0x7d 0x8e 0x9f
0x100001030: 0x89 0x7a 0x6b 0x5c 0x4d 0x3e 0x2f 0x10
0x100001038: 0x01 0xf2 0xe3 0xd4 0xc5 0xb6 0xa7 0x96
For each C expression below, write its value in hexadecimal. For example, if we gave you:
char* cp = (char*) 0x100001020; cp[0] =

the answer would be 0xa0.
Assume the following structure and union declarations and variable definitions.
struct _s1 {
int i;
long l;
short s;
};
struct _s2 {
char c[4];
int i;
struct _s1 s;
};
union _u {
char c[8];
int i;
long l;
short s;
};
char* cp = (char*) 0x100001020;
struct _s1* s1 = (struct _s1*) 0x100001020;
struct _s2* s2 = (struct _s2*) 0x100001020;
union _u* u = (union _u*) 0x100001020;
QUESTION DATAREP-25A. cp[4] =
0xE4 (-28)
QUESTION DATAREP-25B. cp + 7 =
0x100001027
QUESTION DATAREP-25C. s1 + 1 =
0x100001038
QUESTION DATAREP-25D. s1->i =
0xd3c2b1a0 (-742215264)
QUESTION DATAREP-25E. sizeof(s1) =
8
QUESTION DATAREP-25F. &s2->s =
0x100001028
QUESTION DATAREP-25G. &u->s =
0x100001020
QUESTION DATAREP-25H. s1->l =
0x9f8e7d6c5b4a3928 (-6949479270644565720)
QUESTION DATAREP-25I. s2->s.s =
0xf201 (-3583)
QUESTION DATAREP-25J. u->l =
0x1706f5e4d3c2b1a0 (1659283875886707104)
DATAREP-26. Sizes and alignments
Here’s a test struct with n members. Assume an x86-64 machine, where each Ti either is a basic x86-64 type (e.g., int, char, double) or is a type derived from such types (e.g., arrays, structs, pointers, unions, possibly recursively), and assume that ai≤8 for all i.
struct test {
T1 m1; // sizeof(T1) == s1, alignof(T1) == a1
T2 m2; // sizeof(T2) == s2, alignof(T2) == a2
...
Tn mn; // sizeof(Tn) == sn, alignof(Tn) == an
};
In these questions, you will compare this struct with other structs that have the same members, but in other orders.
QUESTION DATAREP-26A. True or false: The size of struct test is minimized when its members are sorted by size. In other words, if s1≤s2≤…≤sn, then sizeof(struct test) is less than or equal to the struct size for any other member order.

If true, briefly explain your answer; if false, give a counterexample (i.e., concrete types for T1, …, Tn that do not minimize sizeof(struct test)).

False. T1 = char, T2 = int, T3 = char[5]
QUESTION DATAREP-26B. True or false: The size of struct test
is minimized
when its members are sorted by alignment. In other words, if
a1≤a2≤…≤an, then sizeof(struct test)
is less than or
equal to the struct size for any other member order.
If true, briefly explain your answer; if false, give a counterexample.
True. Padding only occurs between objects with different alignments, and is limited by the second alignment; sorting by alignment therefore minimizes padding.
QUESTION DATAREP-26C. True or false: The alignment of struct test
is
minimized when its members are sorted in increasing order by alignment.
In other words, if a1≤a2≤…≤an, then
alignof(struct test)
is less than or equal to the struct
alignment for any other member order.
If true, briefly explain your answer; if false, give a counterexample.
True. It’s all the same; alignment is max alignment of every component, and is independent of order.
QUESTION DATAREP-26D. What is the maximum number of bytes of padding that
struct test
could contain for a given n? The answer will be a pretty
simple formula involving n. (Remember that ai≤8 for all i.)
Alternating char and long gives the most padding, which is 7*(n/2) when n is even and 7*(n+1)/2 otherwise.
QUESTION DATAREP-26E. What is the minimum number of bytes of padding that
struct test
could contain for a given n?
0
DATAREP-27. Undefined behavior
QUESTION DATAREP-27A. Sometimes a conforming C compiler can assume that a + 1 > a, and sometimes it can’t. For each type below, consider this expression:

a + (int) 1 > a

and say whether the compiler:

- Must reject the expression as a type error.
- May assume that the expression is true (that a + (int) 1 > a for all a).
- Must not assume that the expression is true.

1. int a
2. unsigned a
3. char* a
4. unsigned char a
5. struct {int m;} a
1—May assume; 2—Must not assume; 3—May assume; 4—May assume (in fact due to integer promotion, this statement really is always true, even in mathematical terms); 5—Must reject.
QUESTION DATAREP-27B. The following code checks its arguments for sanity, but not well: each check can cause undefined behavior.
void sanity_check(int* array, size_t array_size, int* ptr_into_array) {
if (array + array_size < array) {
fprintf(stderr, "`array` is so big that it wraps around!\n");
abort();
}
if (ptr_into_array < array || ptr_into_array > array + array_size) {
fprintf(stderr, "`ptr_into_array` doesn’t point into the array!\n");
abort();
}
...
Rewrite these checks to avoid all undefined behavior. You will likely add one or more casts to uintptr_t. For full credit, write each check as a single comparison (no && or ||, even though the current ptr_into_array check uses ||).
array_size check:

ptr_into_array check:

array_size check: (uintptr_t) array + 4 * array_size < (uintptr_t) array

ptr_into_array check: (uintptr_t) ptr_into_array - (uintptr_t) array > 4 * array_size
QUESTION DATAREP-27C. In lecture, we discussed several ways to tell
if a signed integer x
is negative. One of them was the following:
int isnegative = (x & (1UL << (sizeof(x) * CHAR_BIT))) != 0;
But this is incorrect: it has undefined behavior. Correct it by adding two characters.
(x & (1UL << (sizeof(x) * CHAR_BIT - 1))) != 0
DATAREP-28. Memory errors and garbage collection
⚠️ We didn’t discuss garbage collectors in class this year. ⚠️
Recall that a conservative garbage collector is a program that can automatically free dynamically-allocated memory by detecting when that memory is no longer referenced. Such a GC works by scanning memory for currently-referenced pointers, starting from stack and global memory, and recursing over each referenced object until all referenced memory has been scanned. We built a conservative garbage collector in lecture datarep6.
QUESTION DATAREP-28A. An application program that uses conservative GC, and
does not call free
directly, will avoid certain errors and undefined
behaviors. Which of the following errors are avoided? List all that
apply.
- Use-after-free
- Double free
- Signed integer overflow
- Boundary write error
- Unaligned access
1, 2
QUESTION DATAREP-28B. Write a C program that leaks unbounded memory without GC, but does not do so with GC. You should need less than 5 lines. (Leaking “unbounded” memory means the program will exhaust the memory capacity of any machine on which it runs.)
while (1) { (void) malloc(1); }
QUESTION DATAREP-28C. Not every valid C program works with a conservative GC, because the C abstract machine allows a program to manipulate pointers in strange ways. Which of the following pointer manipulations might cause the conservative GC from class to inappropriately free a memory allocation? List all that apply.
1. Storing the pointer in a uintptr_t variable

2. Writing the pointer to a disk file and reading it back later

3. Using the least-significant bit of the pointer to store a flag:

int* set_ptrflag(int* x, int flagval) {
    return (int*) ((uintptr_t) x | (flagval ? 1 : 0));
}
int get_ptrflag(int* x) {
    return (uintptr_t) x & 1;
}
int deref_ptrflag(int* x) {
    return *((int*) ((uintptr_t) x & ~1UL));
}

4. Storing the pointer in textual form:

void save_ptr(char buf[100], void* p) {
    sprintf(buf, "%p", p);
}
void* restore_ptr(const char buf[100]) {
    void* p;
    sscanf(buf, "%p", &p);
    return p;
}

5. Splitting the pointer into two parts and storing the parts in an array:

typedef union {
    unsigned long ival;
    unsigned arr[2];
} value;

value save_ptr(void* p) {
    value v;
    v.arr[0] = (uintptr_t) p & 0xFFFFFFFFUL;
    v.arr[1] = ((uintptr_t) p / 4294967296UL) & 0xFFFFFFFFUL;
    return v;
}
2, 4
DATAREP-29. Bitwise
QUESTION DATAREP-29A. Consider this C fragment:
uintptr_t x = ...;
uintptr_t r = 0;
if (a < b) {
r = x;
}
Or, shorter:
uintptr_t r = a < b ? x : 0;
Write a single expression that evaluates to the same value, but that does not use the conditional ?: operator. You will use the fact that a < b always equals 0 or 1. For full credit, do not use expensive operators (multiply, divide, modulus).

Examples: (a < b) * x, or (-(uintptr_t) (a < b)) & x
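A quick sketch (not from the exam) checking that the mask-based form matches the conditional, using some hypothetical values for a, b, and x:

#include <assert.h>
#include <stdint.h>

int main(void) {
    uintptr_t a = 3, b = 7, x = 0xDEADBEEF;    // hypothetical test values
    uintptr_t r1 = a < b ? x : 0;
    uintptr_t r2 = (-(uintptr_t) (a < b)) & x; // all-ones mask when a < b, else 0
    assert(r1 == r2);
    a = 9;                                     // now a >= b, so the mask is 0
    assert(((-(uintptr_t) (a < b)) & x) == 0);
}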
QUESTION DATAREP-29B. This function returns one more than the index of the least-significant 1 bit in its argument, or 0 if its argument is zero.
int ffs(unsigned long x) {
for (int i = 0; i < sizeof(x) * CHAR_BIT; ++i) {
if (x & (1UL << i)) {
return i + 1;
}
}
return 0;
}
This function runs in O(B) time, where B is the number of bits in an unsigned long. Write a version of ffs that runs instead in O(log B) time.
int ffs(unsigned long x) {
    if (!x) {
        return 0;
    }
    int ans = 1;
    if (!(x & 0x00000000FFFFFFFFUL)) {
        ans += 32;
        x >>= 32;
    }
    if (!(x & 0x0000FFFF)) {
        ans += 16;
        x >>= 16;
    }
    if (!(x & 0x00FF)) {
        ans += 8;
        x >>= 8;
    }
    if (!(x & 0x0F)) {
        ans += 4;
        x >>= 4;
    }
    if (!(x & 0x3)) {
        ans += 2;
        x >>= 2;
    }
    return ans + (x & 0x1 ? 0 : 1);
}
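One way to gain confidence in the O(log B) version is to compare it against the original loop on a few values. A sketch (it assumes the O(log B) ffs above is defined in the same file; ffs_slow is just the question’s O(B) loop under a different name):

#include <assert.h>
#include <limits.h>

// ffs_slow: the original O(B) loop, renamed so both versions can coexist.
int ffs_slow(unsigned long x) {
    for (unsigned i = 0; i < sizeof(x) * CHAR_BIT; ++i) {
        if (x & (1UL << i)) {
            return i + 1;
        }
    }
    return 0;
}

int main(void) {
    unsigned long tests[] = {0, 1, 2, 6, 0x80, 1UL << 31, 1UL << 63, -1UL};
    for (int i = 0; i < 8; ++i) {
        assert(ffs(tests[i]) == ffs_slow(tests[i]));
    }
}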
DATAREP-30. Data representation
QUESTION DATAREP-30A. Write a type whose size is 19,404,329 times larger than its alignment.
char[19404329]
QUESTION DATAREP-30B. Consider a structure type T with N members, all of which have nonzero size. Assume that sizeof(T) == alignof(T). What is N?
1
QUESTION DATAREP-30C. What is a C type that obeys (T) -1 == (T) 255 on x86-64?

char (or unsigned char or signed char)
Parts D–G use this code. The architecture might or might not be x86-64.
unsigned char a[] = {
0x7A, 0xEC, 0x0D, 0xBE, 0x99, 0x0A, 0xD8, 0x0E
};
unsigned* s1 = (unsigned*) &a[0];
unsigned* s2 = s1 + 1;
Assume that (uintptr_t) s2 - (uintptr_t) s1 == 4 and *s1 > *s2.
QUESTION DATAREP-30D. What is sizeof(a)
?
8
QUESTION DATAREP-30E. What is sizeof(unsigned)
on this architecture?
4
QUESTION DATAREP-30F. Is this architecture big-endian or little-endian?
Little-endian
QUESTION DATAREP-30G. Might the architecture be x86-64?
Yes
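To see where the answers to parts D–G come from, it may help to spell out what the two reads produce. This is a sketch reusing the question’s code and assuming a little-endian machine with 4-byte unsigned:

#include <assert.h>
#include <stdio.h>

int main(void) {
    unsigned char a[] = {
        0x7A, 0xEC, 0x0D, 0xBE, 0x99, 0x0A, 0xD8, 0x0E
    };
    unsigned* s1 = (unsigned*) &a[0];
    unsigned* s2 = s1 + 1;
    // With 4-byte unsigned and little-endian byte order, the low-order byte
    // of each unsigned comes first in memory, so
    //   *s1 == 0xBE0DEC7A and *s2 == 0x0ED80A99,
    // which is consistent with the assumption *s1 > *s2. (Big-endian order
    // would give 0x7AEC0DBE < 0x990AD80E instead.)
    printf("%x %x\n", *s1, *s2);
    assert(*s1 > *s2);
}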
DATAREP-31. Memory errors
Mavis Gallant is starting on her debugging memory allocator. She’s written code that aims to detect invalid frees, where a pointer passed to m61_free was not returned by an earlier m61_malloc.
D1. typedef struct m61_metadata {
D2. size_t magic;
D3. size_t padding;
D4. } m61_metadata;
M1. void* m61_malloc(size_t sz) {
M2. m61_metadata* meta = base_malloc(sz + sizeof(m61_metadata));
M3. meta->magic = 0x84157893401;
M4. return (void*) (meta + 1);
M5. }
F1. void m61_free(void* ptr) {
F2. m61_metadata* meta = (m61_metadata*) ptr - 1;
F3. if (meta->magic != 0x84157893401) {
F4. fprintf(stderr, "invalid free of %p\n", ptr);
F5. abort();
F6. }
F7. base_free(ptr);
F8. }
C1. void* m61_calloc(size_t count, size_t sz) {
C2. void* p = m61_malloc(sz * count);
C3. memset(p, 0, sz * count);
C4. return p;
C1. }
Help her track down bugs.
QUESTION DATAREP-31A. What is sizeof(struct m61_metadata)
?
16
QUESTION DATAREP-31B. Give an m61_ function call (function name and arguments) that would cause both unsigned integer overflow and invalid memory accesses.

m61_malloc((size_t) -15). The size computation sz + sizeof(m61_metadata) wraps around to 1, so this turns into base_malloc(1), and the write to meta->magic then touches memory past the end of that tiny allocation.
QUESTION DATAREP-31C. Give an m61_ function call (function name and arguments) that would cause integer overflow, but no invalid memory access within the m61_ functions. (The application might or might not make an invalid memory access later.)

m61_malloc((size_t) -1)
QUESTION DATAREP-31D. These functions have some potential null pointer dereferences. Fix one such problem, including the line number(s) where your code should go.
C3:
if (p) { memset(p, 0, sz * count); }
QUESTION DATAREP-31E. Put a single line of C code in the blank. The resulting program should (1) be well-defined with no memory leaks when using default malloc/free/calloc, but (2) always cause undefined behavior when using Mavis’s debugging malloc/free/calloc.
... #includes ...
int main(void) {
________________________
}
free(nullptr);
QUESTION DATAREP-31F. A double free should print a different message than an invalid free. Write code so Mavis’s implementation does this; include the line numbers where the code should go.
F4: fprintf(stderr, meta->magic == 0xB0B0B0B0 ? "double free of %p\n" : "invalid free of %p\n", ptr);
after F6:
meta->magic = 0xB0B0B0B0;
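Putting both edits together, one sketch of what Mavis’s m61_free might look like (0xB0B0B0B0 is the arbitrary “freed” marker chosen in the answer above; everything else is the code from the question):

void m61_free(void* ptr) {
    m61_metadata* meta = (m61_metadata*) ptr - 1;
    if (meta->magic != 0x84157893401) {
        // Blocks freed earlier were re-marked with 0xB0B0B0B0 below, which
        // lets us tell a double free apart from other invalid frees.
        fprintf(stderr, meta->magic == 0xB0B0B0B0
                    ? "double free of %p\n" : "invalid free of %p\n", ptr);
        abort();
    }
    meta->magic = 0xB0B0B0B0;   // mark as freed (this is the line added after F6)
    base_free(ptr);
}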
ASM-1. Disassemble
Here’s some assembly produced by compiling a C program.
.globl f
.align 16, 0x90
.type f,@function
f:
movl $1, %r8d
jmp .LBB0_1
.LBB0_6:
incl %r8d
.LBB0_1:
movl %r8d, %ecx
imull %ecx, %ecx
movl $1, %edx
.LBB0_2:
movl %edx, %edi
imull %edi, %edi
movl $1, %esi
.align 16, 0x90
.LBB0_3:
movl %esi, %eax
imull %eax, %eax
addl %edi, %eax
cmpl %ecx, %eax
je .LBB0_7
cmpl %edx, %esi
leal 1(%rsi), %eax
movl %eax, %esi
jl .LBB0_3
cmpl %r8d, %edx
leal 1(%rdx), %eax
movl %eax, %edx
jl .LBB0_2
jmp .LBB0_6
.LBB0_7:
pushq %rax
.Ltmp0:
movl $.L.str, %edi
xorl %eax, %eax
callq printf
movl $1, %eax
popq %rcx
retq
.type .L.str,@object
.L.str:
.asciz "%d %d\n"
.size .L.str, 7
QUESTION ASM-1A. How many arguments might this function have? Circle all that apply.
- 0
- 1
- 2
- 3 or more
All of them (0, 1, 2, and “3 or more”). The function has no arguments that it uses, but it might have arguments it doesn’t use.
QUESTION ASM-1B. What might this function return? Circle all that apply.
- 0
- 1
- −1
- Its first argument, whatever that argument is
- A square number other than 0 or 1
- None of the above
It can only return 1.
QUESTION ASM-1C. Which callee-saved registers does this function save and restore? Circle all that apply.
- %rax
- %rbx
- %rcx
- %rdx
- %rbp
- %rsi
- %rdi
- None of the above
The callee-saved registers are %rbx, %rbp, %rsp, and %r12-%r15. The code does not modify any of these registers, so it doesn’t “save and restore” them either.
QUESTION ASM-1D. This function handles signed integers. If we changed the C source to use unsigned integers instead, which instructions would change? Circle all that apply.
movl
imull
addl
cmpl
je
jl
popq
- None of the above
jl

We accepted circled imull or not! Although x86 imull is signed, as used in C it behaves equivalently to the nominally-unsigned mull, and some compilers use imull for both kinds of integer. From the Intel manuals: “[These] forms [of imul] may also be used with unsigned operands because the lower half of the product is the same regardless if the operands are signed or unsigned. The CF and OF flags, however, cannot be used to determine if the upper half of the result is non-zero.”
QUESTION ASM-1E. What might this function print? Circle all that apply.
0 0
1 1
3 4
4 5
6 8
- None of the above
Choice #3 (3 4) only. The function searches for a solution to x² + y² == z², under the constraint that x ≤ y. When it finds one, it prints x and y and then returns. It always starts from 1 1 and increments x and y one at a time, so it can only print 3 4.
ASM-2. Assembly
Here is some x86 assembly code.
f:
movl a, %eax
movl b, %edx
andl $255, %edx
subl %edx, %eax
movl %eax, a
retq
QUESTION ASM-2A. Write valid C code that could have compiled into
this assembly (i.e., write a C definition of function f
), given the
global variable declarations “extern unsigned a, b;
.” Your C code
should compile without warnings. REMINDER: You are not permitted to
run a C compiler, except for the C compiler that is your brain.
void f() { a -= b & 255; }
Or see below for more possibilities.
QUESTION ASM-2B. Write different valid, warning-free C code that could have compiled into that assembly. This version should contain different operators than your first version. (For extra credit, use only one operator.)
void f() { a += -(b % 256); }
QUESTION ASM-2C. Again, write different valid, warning-free C code
that could have compiled into that assembly. In this version, f
should
have a different type than in your first version.
unsigned f() { a = a - b % 0x100; return a; }
unsigned f() { a -= (unsigned char) b; return a; }
char* f(int x, int y, int z[1000]) { a -= (unsigned char) b; return (char*) a; }
ASM-3. Assembly and Data Structures
For each code sample below, indicate the most likely type of the data being accessed. (If multiple types are equally likely, just pick one.)
QUESTION ASM-3A. movzbl %al, %eax
unsigned char
QUESTION ASM-3B. movl -28(%rbp), %edx
int or unsigned
QUESTION ASM-3C. movsbl -32(%rbp), %eax
[signed] char
QUESTION ASM-3D. movzwl -30(%rbp), %eax
unsigned short
For each code sample below, indicate the most likely data structure
being accessed (assume that g_var
is a global variable). Be as
specific as possible.
QUESTION ASM-3E. movzwl 6(%rdx,%rax,8), %eax

unsigned short in an array of 8-byte structures
QUESTION ASM-3F. movl (%rdx,%rax,4), %eax
Array of ints or unsigned ints
QUESTION ASM-3G.
movzbl 4(%rax), %eax
movsbl %al, %eax
char field from a structure (loaded from offset 4); or the character at index 4 in a string
For the remaining questions, indicate for what values of the register contents will the jump be taken.
QUESTION ASM-3H.
xorl %eax, %eax
jge LABEL
Always
QUESTION ASM-3I.
testb $1, %eax
jne LABEL
Any odd value (the fact that we're only looking at the lowest byte is pretty irrelevant)
QUESTION ASM-3J.
cmpl %edx, %eax
jl LABEL
When
%eax
is less than%edx
, considered as signed integers
ASM-4. Assembly language
The next four questions pertain to the following four code samples.
f1
f1:
subq $8, %rsp
call callfunc
movl %eax, %edx
leal 1(%rax,%rax,2), %eax
testb $1, %dl
jne .L3
movl %edx, %eax
shrl $31, %eax
addl %edx, %eax
sarl %eax
.L3:
addq $8, %rsp
ret
f2
f2:
pushq %rbx
xorl %ebx, %ebx
.L3:
movl %ebx, %edi
addl $1, %ebx
call callfunc
cmpl $10, %ebx
jne .L3
popq %rbx
ret
f3
f3:
subq $8, %rsp
call callfunc
subl $97, %eax
cmpb $4, %al
ja .L2
movzbl %al, %eax
jmp *.L4(,%rax,8)
.L4:
.quad .L3
.quad .L9
.quad .L6
.quad .L7
.quad .L8
.L3:
movl $42, %edx
jmp .L5
.L6:
movl $4096, %edx
jmp .L5
.L7:
movl $52, %edx
jmp .L5
.L8:
movl $6440, %edx
jmp .L5
.L2:
movl $0, %edx
jmp .L5
.L9:
movl $61, %edx
.L5:
movl $.LC0, %esi
movl $1, %edi
movl $0, %eax
call __printf_chk
addq $8, %rsp
ret
.LC0:
.string "Sum = %d\n"
f4
f4:
subq $40, %rsp
movl $1, (%rsp)
movl $0, 16(%rsp)
.L2:
leaq 16(%rsp), %rsi
movq %rsp, %rdi
call callfunc
movl 16(%rsp), %eax
cmpl %eax, (%rsp)
jne .L2
addq $40, %rsp
ret
Now answer the following questions. Pick the most likely sample; you will use each sample exactly once.
QUESTION ASM-4A. Which sample contains a for loop?
f2
QUESTION ASM-4B. Which sample contains a switch statement?
f3
QUESTION ASM-4C. Which sample contains only an if/else construct?
f1
QUESTION ASM-4D. Which sample contains a while loop?
f4
ASM-5. Calling conventions: 6186
University Professor Helen Vendler is designing a poetic new processor, the 6186. Can you reverse-engineer some aspects of the 6186’s calling convention from its assembly?
Here’s a function:
int f(int* a, unsigned b) {
extern int g(int x);
int index = g(a[2*b + 1]);
return a[index + b];
}
And here’s that function compiled into 6186 instructions.
f:
sub $24, %rsp
movq %ra, (%rsp)
mov %rb, %rx
shl $1, %rx
add $1, %rx
movl (%ra, %rx, 4), %ra
call g
add %rb, %rr
movq (%rsp), %ra
movl (%ra, %rr, 4), %ra
mov %ra, %rr
add $24, %rsp
ret
6186 assembly syntax is based on x86-64 assembly, and like the x86-64, 6186 registers are 64 bits wide. However, the 6186 has a different set of registers. There are just five general-purpose registers, %ra, %rb, %rr, %rx, and %ry. (“[W]hen she tries to be deadly serious she is speaking under…constraint”.) The example also features the stack pointer register, %rsp.
Give brief explanations if unsure.
QUESTION ASM-5A. Which register holds function return values?
%rr
QUESTION ASM-5B. What is sizeof(int)
on the 6186?
4
QUESTION ASM-5C. Which general-purpose register(s) must be callee-saved?
%rb
QUESTION ASM-5D. Which general-purpose register(s) must be caller-saved?
%rr
,%ra
,%rx
QUESTION ASM-5E. Which general-purpose register(s) might be callee-saved or caller-saved (you can’t tell which)?
%ry
QUESTION ASM-5F. Assuming the compiler makes function stack frames as small as possible given the calling convention, what is the alignment of stack frames?
32
QUESTION ASM-5G. Assuming that the 6186 supports the same addressing
modes as the x86-64, write a single instruction that has the same
effect on %ra
as these three instructions:
shl $1, %rx
add $1, %rx
movl (%ra,%rx,4), %ra
movl 4(%ra,%rx,8), %ra
ASM-6. Data structure assembly
Here are four assembly functions, f1
through f4
.
f1:
pushq %rbp
movq %rsp, %rbp
testl %esi, %esi
jle LBB0_3
incl %esi
LBB0_2:
movq 8(%rdi), %rdi
decl %esi
cmpl $1, %esi
jg LBB0_2
LBB0_3:
movl (%rdi), %eax
popq %rbp
retq

f2:
pushq %rbp
movq %rsp, %rbp
movslq %esi, %rax
movq (%rdi,%rax,8), %rcx
movl (%rcx,%rax,4), %eax
popq %rbp
retq

f3:
testl %esi, %esi
jle LBB2_3
incl %esi
LBB2_2:
movl %edx, %eax
andl $1, %eax
movq 8(%rdi,%rax,8), %rdi
sarl %edx
decl %esi
cmpl $1, %esi
jg LBB2_2
LBB2_3:
movl (%rdi), %eax
retq

f4:
movslq %esi, %rax
movl (%rdi,%rax,4), %eax
retq
QUESTION ASM-6A. Each function returns a value loaded from some data structure. Which function uses which data structure?
- Array
- Array of pointers to arrays
- Linked list
- Binary tree
Array—f4; Array of pointers to arrays—f2; Linked list—f1; Binary tree—f3
QUESTION ASM-6B. The array data structure is an array of type T. Considering the code for the function that manipulates the array, which of the following types are likely possibilities for T? Circle all that apply.
char
int
unsigned long
unsigned long long
char*
- None of the above
int
ASM-7. Where’s Waldo?
In the following questions, we give you C code and a portion of the
assembly generated by some compiler for that code. (Sometimes we blank
out a part of the assembly.) The C code contains a variable, constant,
or function called waldo
, and a point in the assembly is marked with
asterisks ***
. Your job is to find Waldo: write an assembly
expression or constant that holds the value of waldo
at the marked
point. We’ve done the first one for you.
NON-QUESTION: Where’s Waldo?
int identity(int waldo) {
return waldo;
}
00000000004007f6 <identity>:
4007f6: 55 push %rbp
4007f7: 48 89 e5 mov %rsp,%rbp
4007fa: 89 7d fc mov %edi,-0x4(%rbp)
4007fd: 8b 45 fc mov -0x4(%rbp),%eax
***
400800: 5d pop %rbp
400801: c3 retq
ANSWER: %edi, -0x4(%rbp), %eax, and %rax all hold the value of waldo at the marked point, so any of them is a valid answer. If the asterisks came before the first instruction, only %edi would work.
QUESTION ASM-7A: Where’s Waldo?
int f1(int a, int b, int waldo, int d) {
if (a > b) {
return waldo;
} else {
return d;
}
}
0000000000400802 <f1>:
***
400802: 55 push %rbp
400803: 48 89 e5 mov %rsp,%rbp
400806: 89 7d fc mov %edi,-0x4(%rbp)
400809: 89 75 f8 mov %esi,-0x8(%rbp)
40080c: 89 55 f4 mov %edx,-0xc(%rbp)
40080f: 89 4d f0 mov %ecx,-0x10(%rbp)
400812: 8b 45 fc mov -0x4(%rbp),%eax
400815: 3b 45 f8 cmp -0x8(%rbp),%eax
400818: 7e 05 jle 40081f <f1+0x1d>
40081a: 8b 45 f4 mov -0xc(%rbp),%eax
40081d: eb 03 jmp 400822 <f1+0x20>
40081f: 8b 45 f0 mov -0x10(%rbp),%eax
400822: 5d pop %rbp
400823: c3 retq
%edx
QUESTION ASM-7B: Where’s Waldo?
int int_array_get(int* a, int waldo) {
int x = a[waldo];
return x;
}
00000000004007d9 <int_array_get>:
INSTRUCTIONS OMITTED
***
4007dc: 8b 04 b7 mov (%rdi,%rsi,4),%eax
4007df: c3 retq
%rsi
QUESTION ASM-7C: Where’s Waldo?
int matrix_get(int** matrix, int row, int col) {
int* waldo = matrix[row];
return waldo[col];
}
00000000004007e0 <matrix_get>:
4007e0: 48 63 f6 movslq %esi,%rsi
4007e3: 48 63 d2 movslq %edx,%rdx
***
4007e6: ?? ?? ?? ?? mov ??,%rax
4007ea: 8b 04 90 mov (%rax,%rdx,4),%eax
4007ed: c3 retq
(%rdi,%rsi,8)
QUESTION ASM-7D: Where’s Waldo?
int f5(int x) {
extern int waldo(int);
return waldo(x * 45);
}
0000000000400be0 <f5>:
***
400be0: 6b ff 2d imul $0x2d,%edi,%edi
400be3: eb eb jmp 400bd0
0x400bd0
QUESTION ASM-7E: Where’s Waldo?
int factorial(int waldo) {
if (waldo < 2) {
return 1;
} else {
return waldo * factorial(waldo - 1);
}
}
0000000000400910 <factorial>:
400910: 83 ff 01 cmp $0x1,%edi
400913: b8 01 00 00 00 mov $0x1,%eax
400918: 7e 13 jle .L2 <factorial+0x1b>
40091a: [6 bytes of padding (a no-op instruction)]
.L1: ***
400920: 0f af c7 imul %edi,%eax
400923: 83 ef 01 sub $0x1,%edi
400926: 83 ff 01 cmp $0x1,%edi
400929: 75 f5 jne .L1 <factorial+0x10>
.L2: 40092b: f3 c3 repz retq
%edi
QUESTION ASM-7F: Where’s Waldo?
⚠️ This question currently uses 32-bit assembly.
int binary_search(const char* needle, const char** haystack, unsigned sz) {
unsigned waldo = 0, r = sz;
while (waldo < r) {
unsigned m = waldo + ((r - waldo) >> 1);
if (strcmp(needle, haystack[m]) < 0) {
r = m;
} else if (strcmp(needle, haystack[m]) == 0) {
waldo = r = m;
} else {
waldo = m + 1;
}
}
return waldo;
}
80484ab <binary_search>:
INSTRUCTIONS OMITTED
.L1: 80484c3: 89 fe mov %edi,%esi
80484c5: 29 de sub %ebx,%esi
80484c7: d1 ee shr %esi
80484c9: 01 de add %ebx,%esi
80484cb: 8b 44 b5 00 mov 0x0(%ebp,%esi,4),%eax
80484cf: 89 44 24 04 mov %eax,0x4(%esp)
80484d3: 8b 44 24 30 mov 0x30(%esp),%eax
80484d7: 89 04 24 mov %eax,(%esp)
80484da: e8 11 fe ff ff call 80482f0 <strcmp@plt>
80484df: 85 c0 test %eax,%eax
80484e1: 78 09 js .L2 <binary_search+0x41>
80484e3: 85 c0 test %eax,%eax
80484e5: 74 13 je 80484fa <binary_search+0x4f>
***
80484e7: 8d 5e 01 lea 0x1(%esi),%ebx
80484ea: eb 02 jmp .L3 <binary_search+0x43>
.L2: 80484ec: 89 f7 mov %esi,%edi
.L3: 80484ee: 39 df cmp %ebx,%edi
80484f0: 77 d1 ja .L1 <binary_search+0x18>
INSTRUCTIONS OMITTED
%ebx
In the remaining questions, you are given assembly compiled from one of the above functions by a different compiler, or at a different optimization level. Your goal is to figure out what C code corresponds to the given assembly.
QUESTION ASM-7G:
⚠️ This question currently uses 32-bit assembly.
804851d <waldo>:
804851d: 55 push %ebp
804851e: 89 e5 mov %esp,%ebp
8048520: 83 ec 18 sub $0x18,%esp
8048523: 83 7d 08 01 cmpl $0x1,0x8(%ebp)
8048527: 7f 07 jg 8048530
8048529: b8 01 00 00 00 mov $0x1,%eax
804852e: eb 10 jmp 8048540
8048530: 8b 45 08 mov 0x8(%ebp),%eax
8048533: 48 dec %eax
8048534: 89 04 24 mov %eax,(%esp)
8048537: e8 e1 ff ff ff call 804851d
804853c: 0f af 45 08 imul 0x8(%ebp),%eax
8048540: c9 leave
8048541: c3 ret
What’s Waldo? Circle one.
f1
f5
matrix_get
permutation_compare
factorial
binary_search
5,
factorial
QUESTION ASM-7H:
⚠️ This question currently uses 32-bit assembly.
8048425 <waldo>:
8048425: 55 push %ebp
8048426: 89 e5 mov %esp,%ebp
8048428: 8b 45 08 mov 0x8(%ebp),%eax
804842b: 3b 45 0c cmp 0xc(%ebp),%eax
804842e: 7e 05 jle 8048435 <waldo+0x10>
8048430: 8b 45 10 mov 0x10(%ebp),%eax
8048433: eb 03 jmp 8048438 <waldo+0x13>
8048435: 8b 45 14 mov 0x14(%ebp),%eax
8048438: 5d pop %ebp
8048439: c3 ret
What’s Waldo? Circle one.
f1
f5
matrix_get
permutation_compare
factorial
binary_search
1,
f1
QUESTION ASM-7I:
00000000004008b4 <waldo>:
4008b4: 55 push %rbp
4008b5: 48 89 e5 mov %rsp,%rbp
4008b8: 48 83 ec 10 sub $0x10,%rsp
4008bc: 89 7d fc mov %edi,-0x4(%rbp)
4008bf: 8b 45 fc mov -0x4(%rbp),%eax
4008c2: 6b c0 2d imul $0x2d,%eax,%eax
4008c5: 89 c7 mov %eax,%edi
4008c7: e8 9e 05 00 00 callq 400e6a
4008cc: c9 leaveq
4008cd: c3 retq
What’s Waldo? Circle one.
f1
f5
matrix_get
permutation_compare
factorial
binary_search
2,
f5
ASM-8. (removed because redundant)
ASM-9. Disassembly II
The questions in this section concern a function called ensmallen
,
which has the following assembly.
ensmallen:
1. movzbl (%rsi), %edx
2. testb %dl, %dl
3. movb %dl, (%rdi)
4. jne .L22
5. jmp .L23
6. .L18:
7. addq $1, %rsi
8. .L22:
9. movzbl (%rsi), %eax
10. cmpb %dl, %al
11. je .L18
12. addq $1, %rdi
13. testb %al, %al
14. movb %al, (%rdi)
15. je .L23
16. movl %eax, %edx
17. jmp .L22
18. .L23:
19. retq
QUESTION ASM-9A. How many arguments is this function likely to take? Give line numbers that helped you determine an answer.
2, because of lines 1 & 3
QUESTION ASM-9B. Are the argument(s) pointers? Give line numbers that helped you determine an answer.
Yes, because of lines 1, 3, 9, 14
QUESTION ASM-9C. What type(s) are the argument(s) likely to have? Give line numbers that helped you determine an answer.
unsigned char*. Lines 1, 3, 9, and 14 are byte-moving instructions. The z in movzbl (Lines 1 and 9) indicates zero-extension, i.e., unsigned char. But char* is possible too; the characters are only compared for equality with each other (Line 10) or zero (Lines 2/4 and 13/15), so we can’t really distinguish signed from unsigned.
QUESTION ASM-9D. Write a likely signature for the function. Use
return type void
.
void ensmallen(unsigned char* a, unsigned char* b);
QUESTION ASM-9E. Write an alternate likely signature for the
function, different from your last answer. Again, use return type
void
.
void ensmallen(unsigned char* a, const unsigned char* b);
void ensmallen(char* a, char* b);
void ensmallen(void* dst, const void* src);
etc., etc.
QUESTION ASM-9F. Which callee-saved registers does this function use? Give line numbers that helped you determine an answer.
None except possibly %rsp (no callee-saved registers are referenced in the code).
QUESTION ASM-9G. The function has an “input” and an “output”. Give
an “input” that would cause the CPU to jump from line 5 to label .L23
,
and describe what is placed in the “output” for that “input”.
The input is an empty string (""), and the function puts an empty string in the output.

You might think the function’s output was the value its %eax register held when it returned. But remember that functions without return values can also use %eax, and we told you above that this function’s return type is void! ensmallen’s “output” is most likely the string pointed to by its first parameter. In that sense ensmallen is sort of like strcpy or memcpy.
QUESTION ASM-9H. Give an “input” for which the corresponding “output” is not a copy of the “input”. Your answer must differ from the previous answer.
"aaaa"
(output is"a"
); any string that has adjacent characters that are the same
QUESTION ASM-9I. Write C code corresponding to this function. Make it as compact as you can.
void ensmallen(char* dst, const char* src) {
    while ((*dst = *src)) {
        while (*dst == *src) ++src;
        ++dst;
    }
}
Or, a little less compactly:
void ensmallen(char* dst, const char* src) {
    while (*src) {
        *dst = *src;
        while (*src == *dst) ++src;
        ++dst;
    }
    *dst = 0;
}
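A short usage sketch (hypothetical buffer and inputs) showing the run-collapsing behavior described above:

#include <stdio.h>

int main(void) {
    char out[64];                 // hypothetical output buffer
    ensmallen(out, "aaabbbccd");
    printf("%s\n", out);          // prints "abcd": each run collapses to one character
    ensmallen(out, "");
    printf("[%s]\n", out);        // prints "[]": empty input gives empty output
}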
ASM-10. Machine programming
Intel really messed up this time. They’ve released a processor, the Fartium Core Trio, where every instruction is broken except the ones on this list.
1. cmpq %rdi, %rsi
2. decq %rsi
3. incq %rax
4. je L1
5. jl L2
6. jmp L3
7. movl (%rdi,%rax,4), %edi
8. retq
9. xchgq %rax, %rcx
10. xorq %rax, %rax
(In case you forgot, xchgq
swaps two values—here, the values in two
registers—without modifying condition codes.)
“So what if it’s buggy,” says Andy Grove; “it can still run programs.” For instance, he argues convincingly that this function:
void do_nothing() {
}
is implemented correctly by this Fartium instruction sequence:
retq
Your job is to implement more complex functions using only Fartium
instructions. Your implementations must have the same semantics as the C
functions, but may perform much worse than one might expect. You may
leave off arguments and write instruction numbers (#1–10) or
instruction names. Indicate where labels L1–L3
point (if you need
them). Assume that the Fartium Core Trio uses the normal x86-64 calling
convention.
QUESTION ASM-10A.
int return_zero() {
return 0;
}
xorq %rax, %rax; retq. %rax has unknown value when a function begins, so we need to clear it.
QUESTION ASM-10B.
int identity(int a) {
return a;
}
xchgq %rdi, %rax; retq
.
QUESTION ASM-10C.
void infinite_loop() {
while (1) {
/* do nothing */
}
}
L3: jmp L3
.
QUESTION ASM-10D.
struct point {
int x;
int y;
int z;
};
int extract_z(point* p) {
return p->z;
}
xorq %rax, %rax
incq %rax
incq %rax
movl (%rdi,%rax,4), %edi
xchgq %rax, %rdi
ret
So much for the easy ones. Now complete one out of the following parts, or more than one for extra credit.
QUESTION ASM-10E.
long add(long a, long b) {
return a + b;
}
xorq %rax, %rax      # %rax := 0
xchgq %rax, %rdi     # now %rax == a and %rdi == 0
L3: cmpq %rdi, %rsi  # compare %rsi with %rdi (which is 0)
je L1                # if %rsi == 0 goto L1
incq %rax            # ++%rax
decq %rsi            # --%rsi
jmp L3
L1: retq
The loop at L3 executes b times, incrementing %rax each time. Here’s morally equivalent C++:

long add(long a, long b) {
    while (b != 0) {
        ++a;
        --b;
    }
    return a;
}
QUESTION ASM-10F.
int array_dereference(int* a, long i) {
return a[i];
}
xorq %rax, %rax               # %rax := 0
L3: xchgq %rax, %rdi          # put the counter in %rdi for the comparison
cmpq %rdi, %rsi
xchgq %rax, %rdi              # swap back (xchgq leaves the flags alone)
je L1                         # if %rax == i goto L1
incq %rax                     # ++%rax
jmp L3
L1: movl (%rdi,%rax,4), %edi  # %edi := a[i]
xchgq %rax, %rdi
ret
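Here’s morally equivalent C, in the spirit of the previous part (a sketch; like the assembly, it assumes i is non-negative, otherwise the loop effectively never terminates):

int array_dereference(int* a, long i) {
    long counter = 0;
    while (counter != i) {
        ++counter;
    }
    return a[counter];   // counter == i here
}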
ASM-11. Program Layout
For the following questions, select the part(s) of memory from the list below that best describes where you will find the object.
- heap
- stack
- between the heap and the stack
- in a read-only data segment
- in a text segment starting at address 0x08048000
- in a read/write data segment
- in a register
Assume the following code, compiled without optimization.
#include <errno.h>
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
// The following is copied from stdio.h for your reference
#define EOF (-1)
1. unsigned long fib(unsigned long n) {
2. if (n < 2) {
3. return n;
4. }
5. return fib(n - 1) + fib(n - 2);
6. }
7.
8. int main(int argc, char *argv[]) {
9. extern int optind;
10. char ch;
11. unsigned long f, n;
12.
13. // Command line processing
14. while ((ch = getopt(argc, argv, "h")) != EOF) {
15. switch (ch) {
16. case 'h':
17. case '?':
18. default:
19. return (usage());
20. }
21. }
22.
23. argc -= optind;
24. argv += optind;
25.
26. if (argc != 1) {
27. return usage();
28. }
29.
30. n = strtoul(strdup(argv[0]), nullptr, 10);
31. if (n == 0 && errno == EINVAL) {
32. return usage();
33. }
34.
35. /* Now call one of the fib routines. */
36. f = fib(n);
37. printf("fib(%lu) = %lu\n", n, f);
38.
39. return 0;
40. }
QUESTION ASM-11A. The string "fib(%lu) = %lu\n"
(line 37).
Read-only data segment, aka text segment
QUESTION ASM-11B. optind
(line 23).
Read/write data segment
QUESTION ASM-11C. When executing at line 5, where you will find the
address to which fib
returns.
Stack
QUESTION ASM-11D. Where will you find the value of EOF that is
compared to the return value of getopt
in line 14.
Register—although this register is likely to be hidden inside the processor, not one of the ones that have programmable names. Alternately, text segment, since the −1 will be encoded into some instruction.
QUESTION ASM-11E. getopt
(line 14)
Text segment; alternately: Between the heap and the stack (because that’s where shared libraries tend to be loaded)
QUESTION ASM-11F. fib
(lines 1-6)
Text segment
QUESTION ASM-11G. the variable f
(line 36)
Register or stack
QUESTION ASM-11H. the string being passed to strtoul
(line 30)
Heap
QUESTION ASM-11I. strdup
(line 30)
Text segment or between heap & stack (same as
getopt
)
QUESTION ASM-11J. The value of the fib
function when we return
from fib
(line 5).
Register (%rax)
ASM-12. Assembly and Data Structures
Consider the following assembly function.
func:
xorl %eax, %eax
cmpb $0, (%rdi)
je .L27
.L26:
addq $1, %rdi
addl $1, %eax
cmpb $0, (%rdi)
jne .L26
.L27:
retq
QUESTION ASM-12A. How many parameters does this function appear to have?
1
QUESTION ASM-12B. What do you suppose the type of that parameter is?
const char*
(orconst unsigned char*
,char*
, etc.)
QUESTION ASM-12C. Write C++ code that corresponds to it.
It’s strlen!

int strlen(const char* x) {
    int n = 0;
    for (; *x; ++x) {
        ++n;
    }
    return n;
}
ASM-13. Assembly language
Consider the following four assembly functions.
# Code Sample 1
movq %rdi, %rax
testq %rdx, %rdx
je .L2
addq %rdi, %rdx
movq %rdi, %rcx
.L3:
addq $1, %rcx
movb %sil, -1(%rcx)
cmpq %rdx, %rcx
jne .L3
.L2:
rep ret

# Code Sample 2
movq %rdi, %rax
testq %rdx, %rdx
je .L2
addq %rdi, %rdx
movq %rdi, %rcx
.L3:
addq $1, %rcx
addq $1, %rsi
movzbl -1(%rsi), %r8d
movb %r8b, -1(%rcx)
cmpq %rdx, %rcx
jne .L3
.L2:
rep ret

# Code Sample 3
movb (%rsi), %al
testb %al, %al
je .L3
incq %rsi
.L2:
movb %al, (%rdi)
incq %rdi
movb (%rsi), %al
incq %rsi
testb %al, %al
jne .L2
.L3:
movq %rdi, %rax
ret

# Code Sample 4
testq %rdx, %rdx
je .L3
movq %rdi, %rax
.L2:
movb %sil, (%rax)
incq %rax
decq %rdx
jne .L2
.L3:
movq %rdi, %rax
ret
(Note: The %sil register is the lowest-order byte of register %rsi, just as %al is the lowest-order byte of %rax and %r8b is the lowest-order byte of %r8.)
QUESTION ASM-13A. Which two of the assembly functions perform the exact same task?
1 and 4
QUESTION ASM-13B. What is that task? You can describe it briefly, or give the name of the corresponding C library function.
memset
QUESTION ASM-13C. Explain how the other two functions differ from each other.
One is memcpy and the other is strcpy, so the difference is that #2 terminates after copying a number of bytes indicated by the parameter, while the other terminates when it encounters a NUL value in the source string.
ASM-14. Golden Baron
A very rich person has designed a new x86-64-based computer, the Golden Baron Supercomputer 9000, and is paying you handsomely to write a C compiler for it. There’s just one problem. This person, like many very rich people, is dumb, and on their computer, odd-numbered memory addresses don’t work for data. When data is loaded into a general-purpose register from an odd-numbered address, the value read is zero. For example, consider the following instructions:
movl $0x01020304, a(%rip)
movl a(%rip), %eax
(where the address of a is even). Executed on true x86-64, %rax will hold the value 0x01020304; on Golden Baron, %rax will hold 0x00020004.
But it is still possible to write a correct C compiler for this ungodly
hardware—you just have to work around the bad memory with code. You plan
to use two bytes of Golden Baron memory for every one byte of normal
x86-64 memory. For instance, an array int a[2] = {1, 0x0a0b0c0d};
would be stored in 16 bytes of memory, like so:
01 00 00 00 00 00 00 00 0d 00 0c 00 0b 00 0a 00
Pointer arithmetic, and moving multi-byte values to and from registers, must account for the zero bytes that alternate with meaningful bytes. So to read the correct value for a[1], the compiler must arrange to read the bytes at addresses A+8, A+10, A+12, and A+14, where A is the address of the first byte in a.
QUESTION ASM-14A. What should printf("%zu\n", sizeof(char))
print on
Golden Baron?
1. This is required by the C++ abstract machine:
sizeof(char) == 1
.
QUESTION ASM-14B. This function
int f(signed char* c, size_t i) {
return c[i];
}
can compile to two instructions on x86-64, including retq
. It can
also compile to two instructions on Golden Baron. (We’re assuming that
memory used for Golden Baron instructions works normally.) What are
those instructions?
movsbl (%rdi,%rsi,2), %eax
retq
QUESTION ASM-14C. This function
int g(int* a, size_t i) {
return a[i];
}
can compile to two instructions on x86-64, but Golden Baron requires more work. Write the Golden Baron translation of this function in x86-64 assembly language. For partial credit, write C code that, executed on x86-64, would return the correct value from a Golden Baron-formatted array.
movzbl (%rdi,%rsi,8), %eax
movzbl 2(%rdi,%rsi,8), %ecx
shll $8, %ecx
orl %ecx, %eax
movzbl 4(%rdi,%rsi,8), %ecx
shll $16, %ecx
orl %ecx, %eax
movzbl 6(%rdi,%rsi,8), %ecx
shll $24, %ecx
orl %ecx, %eax
retq
Or:
movq (%rdi,%rsi,8), %rcx
movq %rcx, %rax
andq $255, %rax
shrq $8, %rcx
movq %rcx, %rdx
andq $0xff00, %rdx
orl %edx, %eax
shrq $8, %rcx
movq %rcx, %rdx
andq $0xff0000, %rdx
orl %edx, %eax
shrq $8, %rcx
movq %rcx, %rdx
andq $0xff000000, %rdx
orl %edx, %eax
retq
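The question also offers partial credit for C code that, run on ordinary x86-64, reads the right value out of a Golden Baron-formatted array. A sketch of what that might look like (golden_baron_get is a hypothetical name):

#include <stddef.h>
#include <stdint.h>

// Element i of a Golden Baron int array has its four meaningful bytes at
// offsets 8*i, 8*i+2, 8*i+4, and 8*i+6; the odd offsets hold zero bytes.
int golden_baron_get(const unsigned char* storage, size_t i) {
    uint32_t v = 0;
    for (int k = 0; k < 4; ++k) {
        v |= (uint32_t) storage[8 * i + 2 * k] << (8 * k);
    }
    return (int) v;
}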
QUESTION ASM-14D. The Golden Baron’s x86-64 processor actually supports a
secret instruction, swizzleq SRC, REG
, which rearranges the nybbles
(the hexadecimal digits—the aligned 4-bit slices) of the destination
register REG
based on the source argument SRC
. Here’s some
examples. Assuming that %rax
holds the value 0x0123456789ABCDEF
,
the following swizzleq
instructions leave the indicated results in
%rax
:
swizzleq $0, %rax
: %rax gets0xFFFF'FFFF'FFFF'FFFF
.The contents of nybble 0 [bits 0-3, inclusive], are repeated into every nybble.
swizzleq $0xFEDCBA9876543210, %rax
: %rax gets0x0123'4567'89AB'CDEF
.Each nybble is mapped to its current value: nybble 0 [bits 0-3] is placed in nybble 0 [bits 0-3], nybble 1 in nybble 1, and so forth.
swizzleq $0x0123456701234567, %rax
: %rax gets0xFEDC'BA98'FEDC'BA98
.Nybble 0 [bits 0-3] is placed in nybbles 7 and 15 [bits 28-31 and 60-63]; nybble 1 [bits 4-7] is placed in nybbles 6 and 14 [bits 24-27 and 56-59]; etc.
swizzleq $0xEFCDAB8967452301, %rax
: %rax gets0x1032'5476'98BA'DCFE
.The nybbles of each byte are exchanged.
Use swizzleq
to shorten your answer for Part C.
movq (%rdi,%rsi,8), %rax
swizzleq $0x2222'2222'dc98'5410, %rax
retq
ASM-15. Instruction behavior
QUESTION ASM-15A. Name three different x86-64 instructions that always
modify the stack pointer, no matter their arguments (instruction names
only; suffixes don’t count, so movl
and movq
are the same
instruction name).
push, pop, call, ret
QUESTION ASM-15B. Name three different x86-64 instructions that sometimes modify the stack pointer, depending on their arguments.
mov, add, sub, or, and, …
QUESTION ASM-15C. Name three different x86-64 instructions that never modify the stack pointer, no matter their arguments.
jmp, jne, je, jWHATEVER, cmp, test, nop, many others
QUESTION ASM-15D. List three different instructions, including
arguments, that if placed immediately before a retq
instruction
that ends a function, will never change the function’s behavior. The
instructions should have different names. No funny business: assume the
function was compiled from valid C, that relative jumps are fixed up,
and that, for example, it doesn’t access its own code.
Many examples:
- retq :)
- jmp [next instruction]
- test (any register), (any register)
- cmp (any register), (any register)
- nop
- movs or arithmetic instructions that involve caller-saved registers other than %rax
ASM-16. Calling convention
The following questions concern valid C++ functions compiled using the normal x86-64 calling convention. True or false?
QUESTION ASM-16A. If the function’s instructions do not save and restore any registers, then the C++ function did not call any other function.
False for two reasons: (1) If this function doesn’t use any callee-saved registers, it doesn't need to explicitly save & restore anything. (2) Tail call elimination.
QUESTION ASM-16B. If the function’s instructions do not change the stack
pointer, then the function’s instructions do not contain a call
instruction.
True because of stack alignment.
QUESTION ASM-16C. If the function’s instructions do not change the stack pointer, then the C++ function did not call any other function. Explain your answer briefly.
False because of tail call elimination.
QUESTION ASM-16D. If the function’s instructions do not modify the %rax
register, then the C++ function must return void
.
False; the function could return the result of calling another function.
QUESTION ASM-16E. If the function’s instructions store a local variable on the stack, then that variable’s address will be less than the function’s initial stack pointer.
True
ASM-17. Assembly
Here are three x86-64 assembly functions that were compiled from C.
f1:
xorl %eax, %eax
L2:
movsbq (%rdi), %rdx
subq $48, %rdx
cmpq $9, %rdx
ja L5
imulq $10, %rax, %rax
incq %rdi
addq %rdx, %rax
jmp L2
L5:
ret

f2:
movq %rdi, %rax
L7:
cmpb $0, (%rax)
je L9
incq %rax
jmp L7
L9:
cmpq %rax, %rdi
jnb L11
decq %rax
movb (%rdi), %cl
incq %rdi
movb (%rax), %dl
movb %cl, (%rax)
movb %dl, -1(%rdi)
jmp L9
L11:
ret

f3:
xorl %eax, %eax
L13:
cmpq %rax, %rdx
je L15
movb (%rdi,%rax), %cl
movb (%rsi,%rax), %r8b
movb %r8b, (%rdi,%rax)
movb %cl, (%rsi,%rax)
incq %rax
jmp L13
L15:
ret
(Note: imulq $10, %rax, %rax
means %rax *= 10
.)
QUESTION ASM-17A. How many arguments does each function most likely take?
f1—1, f2—1, f3—3
QUESTION ASM-17B. Which functions modify at least one caller-saved register? List all that apply or write “none”.
All of them
QUESTION ASM-17C. Which functions never modify memory? List all that apply or write “none”.
f1
QUESTION ASM-17D. Write a signature for each function, giving a likely
type for each argument and a likely return type. (You may give a
void
return type if you think the function doesn’t return a useful
value.)
_______ f1(___________________________________)
_______ f2(___________________________________)
_______ f3(___________________________________)
long f1(const char*);
void f2(char*);
void f3(char*, char*, size_t);
QUESTION ASM-17E. One of these functions swaps the contents of two memory regions. Which one?
f3
QUESTION ASM-17F. What is the value of %rax
in f2
the first time
L9
is reached? Write a C expression in terms of f2
’s argument or
arguments; you may use standard library functions.
%rdi + strlen(%rdi)
QUESTION ASM-17G. Give arguments for each function that would result in the function returning without writing to memory or causing a fault.
f1(_____________________)
f2(_____________________)
f3(_____________________)
f1("")
,f2("")
,f3("", "", 0)
QUESTION ASM-17H. Complete this function so that it returns the number 6161. For full credit, use only calls to f1, f2, and f3. For partial credit, do something simpler.
int magic() {
char s1[] = "Shaka kaSenzangakhona became King of the Zulu Kingdom in 1816";
char s2[] = "Dingane kaSenzangakhona succeeded Shaka in 1828";
char s3[] = "1661 in the Gregorian calendar is 3994 in the Korean calendar";
}
int magic() { … f2(s1); f3(s1, s3, 2); return f1(s3); }
IO-1. I/O caching
Mary Ruefle, a poet who lives in Vermont, is working on her caching I/O library for CS 61. She wants to implement a cache with N slots. Since searching those slots might slow down her library, she writes a function that maps addresses to slots. Here’s some of her code.
#define SLOTSIZ 4096
struct io61_slot {
char buf[SLOTSIZ];
off_t pos; // = (off_t) -1 for empty slots
ssize_t sz;
};
#define NSLOTS 64
struct io61_file {
int fd;
off_t pos; // current file position
io61_slot slots[NSLOTS];
};
static inline int find_slot(off_t off) {
return off % NSLOTS;
}
int io61_readc(io61_file* f) {
int slotindex = find_slot(f->pos);
io61_slot* s = &f->slots[slotindex];
if (f->pos < s->pos || f->pos >= s->pos + s->sz) {
// slot contains wrong data, need to refill it
off_t new_pos = lseek(f->fd, f->pos, SEEK_SET);
assert(new_pos != (off_t) -1); // only handle seekable files for now
ssize_t r = read(f->fd, s->buf, SLOTSIZ);
if (r == -1 || r == 0) {
return EOF;
}
s->pos = f->pos;
s->sz = r;
}
int ch = (unsigned char) s->buf[f->pos - s->pos];
++f->pos;
return ch;
}
Before she can run and debug this code, Mary is led “to an emergency of feeling that … results in a poem.” She’ll return to CS61 and fix her implementation soon, but in the meantime, let’s answer some questions about it.
QUESTION IO-1A. True or false: Mary’s cache is a direct-mapped cache.
True
QUESTION IO-1B. What changes to Mary’s code could change your answer to Part A? Circle all that apply.
1. New code for find_slot (keeping io61_readc the same)
2. New code in io61_readc (keeping find_slot the same)
3. New code in io61_readc and new code for find_slot
4. None of the above
#2 and #3
QUESTION IO-1C. Which problems would occur when Mary’s code was used to sequentially read a seekable file of size 2MiB (2×220 = 2097152 bytes) one character at a time? Circle all that apply.
- Excessive CPU usage (>10x stdio)
- Many system calls to read data (>10x stdio)
- Incorrect data (byte x read at a position where the file has byte y≠x)
- Read too much data (more bytes read than file contains)
- Read too little data (fewer bytes read than file contains)
- Crash/undefined behavior
- None of the above
#2 only
QUESTION IO-1D. Which of these new implementations for find_slot
would fix at least one of these problems with reading sequential files?
Circle all that apply.
1. return (off * 2654435761) % NSLOTS; /* integer hash function from Stack Overflow */
2. return (off / SLOTSIZ) % NSLOTS;
3. return off & (NSLOTS - 1);
4. return 0;
5. return (off >> 12) & 0x3F;
6. None of the above
#2, #4, #5
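To see why the original find_slot thrashes on sequential reads while option #2 does not, here is a small sketch (using the SLOTSIZ and NSLOTS values above) that prints the slot chosen for a few consecutive file offsets:

#include <stdio.h>

#define SLOTSIZ 4096
#define NSLOTS 64

static int find_slot_orig(long off) { return off % NSLOTS; }
static int find_slot_fixed(long off) { return (off / SLOTSIZ) % NSLOTS; }

int main(void) {
    // With the original function, consecutive offsets map to different slots,
    // so nearly every io61_readc call refills a slot with a fresh read system
    // call; with the fixed function, 4096 consecutive offsets share one slot.
    for (long off = 0; off < 4; ++off) {
        printf("off %ld: original slot %d, fixed slot %d\n",
               off, find_slot_orig(off), find_slot_fixed(off));
    }
}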
IO-2. Caches and reference strings
QUESTION IO-2A. True or false: A direct-mapped cache with N or more slots can handle any reference string containing ≤N distinct addresses with no misses except for cold misses.
False: direct-mapped caches can have conflict misses
QUESTION IO-2B. True or false: A fully-associative cache with N or more slots can handle any reference string containing ≤N distinct addresses with no misses except for cold misses.
True
Consider the following 5 reference strings.
Name | String |
---|---|
α | 1 |
β | 1, 2 |
γ | 1, 2, 3, 4, 5 |
δ | 2, 4 |
ε | 5, 2, 4, 2 |
QUESTION IO-2C. Which of the strings might indicate a sequential access pattern? Circle all that apply.
- α
- β
- γ
- δ
- ε
- None of these
(α), β, γ
QUESTION IO-2D. Which of the strings might indicate a strided access pattern with stride >1? Circle all that apply.
- α
- β
- γ
- δ
- ε
- None of these
(α), δ
One very clever person pointed out that β and γ could also represent large strides: for example, consider a file with 10 bytes accessed with stride 11!
The remaining questions concern concatenated permutations of these five strings. For example, the permutation αβγδε refers to this reference string:
1, 1, 2, 1, 2, 3, 4, 5, 2, 4, 5, 2, 4, 2.
We pass such permutations through an initially-empty, fully-associative cache with 3 slots, and observe the numbers of hits.
QUESTION IO-2E. How many cold misses might a permutation observe? Circle all that apply.
- 0
- 1
- 2
- 3
- 4
- 5
- Some other number
5. The first time a reference string address is encountered, it must cause a cold miss.
Under LRU eviction, the permutation αβεγδ observes 5 hits as follows. (We annotate each access with “h” for hit or “m” for miss.)
1m; 1h, 2m; 5m, 2h, 4m, 2h; 1m, 2h, 3m, 4m, 5m; 2m, 4h.
QUESTION IO-2F. How many hits does this permutation observe under FIFO eviction?
4 hits
QUESTION IO-2G. Give a permutation that will observe 8 hits under LRU eviction, which is the maximum for any permutation. There are several possible answers. (Write your answer as a permutation of αβγδε. For partial credit, find a permutation that has 7 hits, etc.)
The following four permutations observe 8 hits under LRU: αβγδε, αβγεδ, βαγδε, βαγεδ. 28 permutations observe 7 hits; 25 observe 6 hits; and 38 observe 5 hits.
QUESTION IO-2H. Give a permutation that will observe 2 hits under LRU eviction, which is the minimum for any permutation. There is one unique answer. (Write your answer as a permutation of αβγδε. For partial credit, find a permutation that has 3 hits, etc.)
δαεγβ. 4 permutations observe 3 hits and 20 observe 4 hits.
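If you want to check counts like these yourself, here is a small LRU simulator sketch (not from the exam) for a fully-associative cache; it assumes at most 16 slots and keeps slots in most-recently-used-first order:

#include <stdio.h>
#include <string.h>

// Count hits for a reference string in a fully-associative LRU cache
// with `nslots` slots (nslots <= 16).
static int lru_hits(const int* refs, int n, int nslots) {
    int slots[16];
    int used = 0, hits = 0;
    for (int i = 0; i < n; ++i) {
        int j = 0;
        while (j < used && slots[j] != refs[i]) {
            ++j;
        }
        if (j < used) {
            ++hits;           // hit: the block moves to the front below
        } else if (used < nslots) {
            j = used++;       // cold miss: take an empty slot
        } else {
            j = nslots - 1;   // miss: evict the least recently used slot
        }
        memmove(&slots[1], &slots[0], j * sizeof(int));
        slots[0] = refs[i];   // newly touched block becomes most recent
    }
    return hits;
}

int main(void) {
    // The permutation αβεγδ from the text: 1; 1,2; 5,2,4,2; 1,2,3,4,5; 2,4
    int refs[] = {1, 1, 2, 5, 2, 4, 2, 1, 2, 3, 4, 5, 2, 4};
    printf("%d hits\n", lru_hits(refs, 14, 3));   // prints "5 hits"
}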
IO-3. Processor cache
The git version control system is based on commit hashes, which are 160-bit (20-byte) hash values used to identify commits. In this problem you’ll consider the processor cache behavior of several versions of a “grading server” that maps commits to grades. Here’s the first version:
struct commit_info {
char hash[20];
int grade[11];
};
commit_info* commits;
size_t N;
int get_grade1(const char* hash, int pset) {
for (size_t i = 0; i != N; ++i) {
if (memcmp(commits[i].hash, hash, 20) == 0) {
return commits[i].grade[pset];
}
}
return -1;
}
We will ask questions about the average number of cache lines accessed by variants of get_grade(hash, pset). You should make the following assumptions:

- The hash argument is uniformly drawn from the set of known commits. That is, the probability that hash equals the ith commit’s hash is 1/N.
- Only count cache lines accessible via commits. Don’t worry about cache lines used for local variables, for parameters, for global variables, or for instructions. For instance, do not count the hash argument or the global-data cache line that stores the commits variable itself.
- Every object is 64-byte aligned, and no two objects share the same cache line.
- Commit hashes are mathematically indistinguishable from random numbers. Thus, the probability that two different hashes have the same initial k bits equals 1/2^k.
- Fully correct answers would involve ceiling functions, but you don’t need to include them.
QUESTION IO-3A. What is the expected number of cache lines accessed
by get_grade1
, in terms of N
?
Each commit_info object is on its own cache line, and we will examine 1/2 of the objects on average, so the answer is ⌈N/2⌉.
The second version:
struct commit_info {
char hash[20];
int grade[11];
};
commit_info** commits;
size_t N;
int get_grade2(const char hash[20], int pset) {
for (size_t i = 0; i != N; ++i) {
if (memcmp(commits[i]->hash, hash, 20) == 0) {
return commits[i]->grade[pset];
}
}
return -1;
}
QUESTION IO-3B. What is the expected number of cache lines accessed
by get_grade2
, in terms of N
?
This still examines N/2 commit_info objects. But in addition it examines cache lines to evaluate the POINTERS to those objects. There are 8 such pointers per cache line (8×8=64), and we examine N/2 pointers, for N/8/2 = N/16 additional cache lines. Thus ⌈N/2⌉+⌈N/16⌉ ≅ 9N/16.
The third version:
struct commit_info {
char hash[20];
int grade[11];
};
struct commit_hint {
char hint[8];
commit_info* commit;
};
commit_hint* commits;
size_t N;
int get_grade3(const char* hash, int pset) {
for (size_t i = 0; i != N; ++i) {
if (memcmp(commits[i].hint, hash, 8) == 0
&& memcmp(commits[i].commit->hash, hash, 20) == 0) {
return commits[i].commit->grade[pset];
}
}
return -1;
}
QUESTION IO-3C. What is the expected number of cache lines accessed
by get_grade3
, in terms of N
? (You may assume that N
≤2000.)
The assumption that N≤2000 means we’re exceedingly unlikely to encounter a hint collision (i.e. a commit with the same hint, but different commit value). That means that we will examine N/2 commit_hint objects but ONLY ONE commit_info object. commit_hint objects are 16B big, so 4 hints/cache line: we examine N/4/2 = N/8 cache lines for hint objects, plus one for the info. ⌈N/8⌉ + 1.
The fourth version is a hash table.
struct commit_info {
char hash[20];
int grade[11];
};
commit_info** commits;
size_t commits_hashsize;
int get_grade4(const char* hash, int pset) {
// choose initial bucket
size_t bucket;
memcpy(&bucket, hash, sizeof(bucket));
bucket = bucket % commits_hashsize;
// search for the commit starting from that bucket
while (commits[bucket] != nullptr) {
if (memcmp(commits[bucket]->hash, hash, 20) == 0) {
return commits[bucket]->grade[pset];
}
bucket = (bucket + 1) % commits_hashsize;
}
return -1;
}
QUESTION IO-3D. Assume that a call to get_grade4
encounters C
expected hash collisions (i.e., examines C
buckets before finding the
bucket that actually contains hash
). What is the expected number of
cache lines accessed by get_grade4
, in terms of N
and C
?
For commit_info objects, the lookup will access C cache lines, for the collisions, plus 1, for the successful lookup. But we must also consider the commits[bucket] pointers. We will examine at least 1 cache line for the successful bucket. The C collisions that happen before that will access C buckets. But those buckets might be divided among multiple cache lines; for instance, if C=1, 2 buckets are accessed, but if the first bucket=15, those buckets will be divided among 2 cache lines. The correct formula for buckets, including the final lookup, is 1 + C/8. Thus the total lookup will examine 2 + C + C/8 cache lines on average.
IO-4. IO caching and strace
Elif Batuman is investigating several program executables left behind by
her ex-roommate Fyodor. She runs each executable under strace
in the
following way:
strace -o strace.txt ./EXECUTABLE files/text1meg.txt > files/out.txt
Help her figure out properties of these programs based on their system call traces.
QUESTION IO-4A. Program ./mysterya
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x8193000
brk(0x81b5000) = 0x81b5000
read(3, "A", 1) = 1
write(1, "A", 1) = 1
read(3, "\n", 1) = 1
write(1, "\n", 1) = 1
read(3, "A", 1) = 1
write(1, "A", 1) = 1
read(3, "'", 1) = 1
write(1, "'", 1) = 1
read(3, "s", 1) = 1
write(1, "s", 1) = 1
...
Circle at least one option in each column.
1, Sequential IO; a, No read cache; i, No write cache; D, Other
QUESTION IO-4B. Program ./mysteryb
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x96c5000
brk(0x96e6000) = 0x96e6000
read(3, "A\nA's\nAA's\nAB's\nABM's\nAC's\nACTH'"..., 2048) = 2048
write(1, "A\nA's\nAA's\nAB's\nABM's\nAC's\nACTH'"..., 2048) = 2048
read(3, "kad\nAkron\nAkron's\nAl\nAl's\nAla\nAl"..., 2048) = 2048
write(1, "kad\nAkron\nAkron's\nAl\nAl's\nAla\nAl"..., 2048) = 2048
...
Circle at least one option in each column.
1, Sequential IO; b/c, (Un)aligned read cache; ii, Write cache; B, Cache size 2048
QUESTION IO-4C. Program ./mysteryc
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x9064000
brk(0x9085000) = 0x9085000
fstat64(3, {st_mode=S_IFREG|0664, st_size=1048576, ...}) = 0
lseek(3, 1046528, SEEK_SET) = 1046528
read(3, "ingau\nRheingau's\nRhenish\nRhianno"..., 2048) = 2048
write(1, "oR\ntlevesooR\ns'yenooR\nyenooR\ns't"..., 2048) = 2048
lseek(3, 1044480, SEEK_SET) = 1044480
read(3, "Quinton\nQuinton's\nQuirinal\nQuisl"..., 2048) = 2048
write(1, "ehR\neehR\naehR\ns'hR\nhR\nsdlonyeR\ns"..., 2048) = 2048
lseek(3, 1042432, SEEK_SET) = 1042432
read(3, "emyslid's\nPrensa\nPrensa's\nPrenti"..., 2048) = 2048
write(1, "\ns'nailitniuQ\nnailitniuQ\nnniuQ\ns"..., 2048) = 2048
lseek(3, 1040384, SEEK_SET) = 1040384
read(3, "Pindar's\nPinkerton\nPinocchio\nPin"..., 2048) = 2048
write(1, "rP\ndilsymerP\ns'regnimerP\nregnime"..., 2048) = 2048
...
Circle at least one option in each column.
2, Reverse sequential IO; c, Aligned read cache; ii, Write cache; B, Cache size 2048
QUESTION IO-4D. Program ./mysteryd
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x9a0e000
brk(0x9a2f000) = 0x9a2f000
fstat64(3, {st_mode=S_IFREG|0664, st_size=1048576, ...}) = 0
lseek(3, 1048575, SEEK_SET) = 1048575
read(3, "o", 2048) = 1
lseek(3, 1048574, SEEK_SET) = 1048574
read(3, "Ro", 2048) = 2
lseek(3, 1048573, SEEK_SET) = 1048573
read(3, "\nRo", 2048) = 3
...
lseek(3, 1046528, SEEK_SET) = 1046528
read(3, "ingau\nRheingau's\nRhenish\nRhianno"..., 2048) = 2048
write(1, "oR\ntlevesooR\ns'yenooR\nyenooR\ns't"..., 2048) = 2048
lseek(3, 1046527, SEEK_SET) = 1046527
read(3, "eingau\nRheingau's\nRhenish\nRhiann"..., 2048) = 2048
lseek(3, 1046526, SEEK_SET) = 1046526
read(3, "heingau\nRheingau's\nRhenish\nRhian"..., 2048) = 2048
...
Circle at least one option in each column.
2, Reverse sequential IO; b, Unaligned read cache; ii, Write cache; B, Cache size 2048
QUESTION IO-4E. Program ./mysterye
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x93e5000
brk(0x9407000) = 0x9407000
read(3, "A", 1) = 1
read(3, "\n", 1) = 1
read(3, "A", 1) = 1
...
read(3, "A", 1) = 1
read(3, "l", 1) = 1
write(1, "A\nA's\nAA's\nAB's\nABM's\nAC's\nACTH'"..., 1024) = 1024
read(3, "t", 1) = 1
read(3, "o", 1) = 1
read(3, "n", 1) = 1
...
Circle at least one option in each column.
1, Sequential IO; a, No read cache; ii, Write cache; C, (write) cache size 1024
QUESTION IO-4F. Program ./mysteryf
:
open("files/text1meg.txt", O_RDONLY) = 3
brk(0) = 0x9281000
brk(0x92a3000) = 0x92a3000
read(3, "A\nA's\nAA's\nAB's\nABM's\nAC's\nACTH'"..., 4096) = 4096
write(1, "A", 1) = 1
write(1, "\n", 1) = 1
write(1, "A", 1) = 1
...
write(1, "A", 1) = 1
write(1, "l", 1) = 1
read(3, "ton's\nAludra\nAludra's\nAlva\nAlvar"..., 4096) = 4096
write(1, "t", 1) = 1
write(1, "o", 1) = 1
write(1, "n", 1) = 1
...
Circle at least one option in each column.
1, Sequential IO; b/c, (un)aligned read cache; i, No write cache; A, Cache size 4096
IO-5. Processor cache
The following questions use the following C definition for an N×M matrix (the matrix has N rows and M columns).
struct matrix {
unsigned N;
unsigned M;
double elt[0];
};
matrix* matrix_create(unsigned N, unsigned M) {
matrix* m = (matrix*) malloc(sizeof(matrix) + N * M * sizeof(double));
m->N = N;
m->M = M;
for (size_t i = 0; i < N * M; ++i) {
m->elt[i] = 0.0;
}
return m;
}
Typically, matrix data is stored in row-major order: element m(i,j) (at row i and column j) is stored in m->elt[i*m->M + j]. We might write this in C using an inline function:
inline double* melt1(matrix* m, unsigned i, unsigned j) {
return &m->elt[i * m->M + j];
}
But that’s not the only possible method to store matrix data. Here are several more.
inline double* melt2(matrix* m, unsigned i, unsigned j) {
return &m->elt[i + j * m->N];
}
inline double* melt3(matrix* m, unsigned i, unsigned j) {
return &m->elt[i + ((m->N - i + j) % m->M) * m->N];
}
inline double* melt4(matrix* m, unsigned i, unsigned j) {
return &m->elt[i + ((i + j) % m->M) * m->N];
}
inline double* melt5(matrix* m, unsigned i, unsigned j) {
assert(m->M % 8 == 0);
unsigned k = (i/8) * (m->M/8) + (j/8);
return &m->elt[k*64 + (i % 8) * 8 + j % 8];
}
QUESTION IO-5A. Which method (of melt1
–melt5
) will have the best
processor cache behavior if most matrix accesses use loops like this?
for (unsigned j = 0; j < 100; ++j) {
for (unsigned i = 0; i < 100; ++i) {
f(*melt(m, i, j));
}
}
melt2
QUESTION IO-5B. Which method will have the best processor cache behavior if most matrix accesses use loops like this?
for (unsigned i = 0; i < 100; ++i) {
f(*melt(m, i, i));
}
melt3
QUESTION IO-5C. Which method will have the best processor cache behavior if most matrix accesses use loops like this?
for (unsigned i = 0; i < 100; ++i) {
for (unsigned j = 0; j < 100; ++j) {
f(*melt(m, i, j));
}
}
melt1
(butmelt5
is almost as good)
QUESTION IO-5D. Which method will have the best processor cache behavior if most matrix accesses use loops like this?
for (int di = -3; di <= 3; ++di) {
for (int dj = -3; dj <= 3; ++dj) {
f(*melt(m, I + di, J + dj));
}
}
melt5
QUESTION IO-5E. Here is a matrix-multiply function in ikj order.
matrix* matrix_multiply(matrix* a, matrix* b) {
assert(a->M == b->N);
matrix* c = matrix_create(a->N, b->M);
for (unsigned i = 0; i != a->N; ++i) {
for (unsigned k = 0; k != a->M; ++k) {
for (unsigned j = 0; j != b->M; ++j) {
*melt(c, i, j) += *melt(a, i, k) * *melt(b, k, j);
}
}
}
    return c;
}
This loop order is cache-optimal when data is stored in melt1
order.
What loop order is cache-optimal for melt2
?
jki is best; kji is a close second.
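For concreteness, here is a sketch of the multiply in jki order using melt2 (it assumes the struct matrix, matrix_create, and melt2 definitions above are available, along with <assert.h>):

matrix* matrix_multiply_jki(matrix* a, matrix* b) {
    assert(a->M == b->N);
    matrix* c = matrix_create(a->N, b->M);
    for (unsigned j = 0; j != b->M; ++j) {
        for (unsigned k = 0; k != a->M; ++k) {
            for (unsigned i = 0; i != a->N; ++i) {
                // With melt2 (column-major layout), elements that differ only
                // in i are adjacent in memory, so the inner loop streams
                // sequentially through c's column j and a's column k.
                *melt2(c, i, j) += *melt2(a, i, k) * *melt2(b, k, j);
            }
        }
    }
    return c;
}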
QUESTION IO-5F. You notice that accessing a matrix element using
melt1
is very slow. After some debugging, it seems like the processor
on which you are running code has a very slow multiply instruction.
Briefly describe a change to struct matrix
that would let you write a
version of melt1
with no multiply instruction. You may add members,
change sizes, or anything you like.
Example answers:
- Add a double** rows member that points to each row so you don’t need to multiply
- Round M up to a power of 2 and use shifts
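A sketch of the first suggestion (the rows member is a hypothetical addition; matrix_create would be extended to allocate and fill it):

struct matrix {
    unsigned N;
    unsigned M;
    double** rows;    // rows[i] points at the first element of row i
    double elt[0];
};

inline double* melt1(matrix* m, unsigned i, unsigned j) {
    // Two indexed loads; the x8 scaling comes from the addressing mode,
    // so no multiply instruction is needed.
    return &m->rows[i][j];
}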
IO-6. Caching
Assume that we have a cache that holds four slots. Assume that each letter below indicates an access to a block. Answer the following questions as they pertain to the following sequence of accesses.
E D C B A E D A A A B C D E
QUESTION IO-6A. What is the hit rate assuming an LRU replacement policy?
E D C B A E D A A A B C D E
Slot 1: Ⓔ, Ⓐ, A, A, A, Ⓔ
Slot 2: Ⓓ, Ⓔ, Ⓒ
Slot 3: Ⓒ, Ⓓ, D
Slot 4: Ⓑ, B
(Circled letters are misses that fill the slot; plain letters are hits.) The hit rate is 5/14.
QUESTION IO-6B. What pages will you have in the cache at the end of the run?
B C D E
QUESTION IO-6C. What is the best possible hit rate attainable if you could see into the future?
With Bélády’s algorithm we get:

E D C B A E D A A A B C D E
Slot 1: Ⓔ, E, E
Slot 2: Ⓓ, D, D
Slot 3: Ⓒ, Ⓐ, A, A, A, Ⓒ
Slot 4: Ⓑ, B
(Circled letters are misses that fill the slot; plain letters are hits.) So 8/14 (4/7).
IO-7. Caching
Intel and CrossPoint have announced a new persistent memory technology with performance approaching that of DRAM. Your job is to calculate some performance metrics to help system architectects decide how to best incorporate this new technology into their platform.
Let's say that it takes 64ns to access one (32-bit) word of main memory (DRAM) and 256ns to access one (32-bit) word of this new persistent memory, which we'll call NVM (non-volatile memory). The block size of the NVM is 256 bytes. The NVM designers are quite smart and although it takes a long time to access the first byte, when you are accessing NVM sequentially, the devices perform read ahead and stream data efficiently -- at 32 GB/second, which is identical to the bandwidth of DRAM.
QUESTION IO-7A. Let's say that we are performing random accesses of 32 bits (on a 32-bit processor). What fraction of the accesses must be to main memory (as opposed to NVM) to achieve performance within 10% of DRAM?
Let X be the fraction of accesses to DRAM: access time = 64X + 256(1 - X). We want that to be at most 1.1 × 64 = 70.4 ns (within 10% of DRAM). Solving 64X + 256(1 - X) = 70.4 gives 256 - 192X = 70.4, so 192X = 185.6 and X = 185.6/192 ≈ 0.97.

So we need a hit rate in main memory of about 97%.
QUESTION IO-7B. Let's say that they write every byte of a 256 block in units of 32 bits. How much faster will write-back cache perform relative to a write-through cache? (An approximate order of magnitude will be sufficient; showing work can earn partial credit.)
Write-through: each 4-byte write goes to NVM at 256 ns, and there are 256/4 = 64 writes per block, so 64 × 256 ns = 2^6 × 2^8 = 2^14 ns ≈ 16 K ns, roughly 16 microseconds.

Write-back: the 64 writes go to DRAM at 64 ns each, so 64 × 64 ns = 2^12 ns ≈ 4 microseconds; then the block streams from DRAM into NVM at 32 GB/s, which is about 32 KB per microsecond. The block is 256 bytes, or 1/4 KB, so streaming takes (1/4)/32 = 1/128 of a microsecond, about 8 ns. So it's really, really fast to stream the data; once you see that, you realize the real difference is just the relative cost of writing to DRAM versus writing to NVM. The write-back cache is therefore almost 4 times faster than the write-through cache.

You can get full credit by saying something like: the time to stream the data out of DRAM into NVM at the sequential speed is tiny relative to the time to write even a single word to DRAM, so the ultimate difference is the 4:1 ratio of NVM to DRAM write cost. The write-back cache is about 4 times faster (it runs at almost full DRAM speed).
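The same arithmetic as a rough back-of-the-envelope check (same assumptions as the answer above; not course code):

```c
#include <stdio.h>

int main() {
    double block = 256, word = 4;                 // bytes
    double writes = block / word;                 // 64 word-sized writes per block
    double wthrough = writes * 256;               // every write pays the NVM latency (ns)
    double stream = block / (32.0 * 1e9) * 1e9;   // 256 B at 32 GB/s, in ns (~8 ns)
    double wback = writes * 64 + stream;          // write DRAM, then stream the block once
    printf("write-through %.0f ns, write-back %.0f ns, ratio %.1fx\n",
           wthrough, wback, wthrough / wback);    // ~16384 ns vs ~4104 ns, ~4.0x
}
```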
QUESTION IO-7C. Why might you not want to use a write-back cache?
A write-back cache has very different persistence guarantees from a write-through cache. If you need every 4-byte write to be persistent, then you have no choice but to implement a write-through cache.
IO-8. Reference strings
The following questions concern the FIFO (First In First Out), LRU (Least Recently Used), and LFU (Least Frequently Used) cache eviction policies.
Your answers should refer to seven-item reference strings made up of digits in the range 0–9. An example answer might be “1231231”. In each case, the reference string is processed by a 3-slot cache that’s initially empty.
QUESTION IO-8A. Give a reference string that has a 1/7 hit rate in all three policies.
1123456
QUESTION IO-8B. Give a reference string that has a 6/7 hit rate in all three policies.
1111111
QUESTION IO-8C. Give a reference string that has different hit rates under LRU and LFU policies, and compute the hit rates.
String:
LRU hit rate:
LFU hit rate:
String: 1234111
LRU hit rate: 2/7
LFU hit rate: 3/7
QUESTION IO-8D. Give a reference string that has different hit rates under FIFO and LRU policies, and compute the hit rates.
String:
FIFO hit rate:
LRU hit rate:
String: 1231411
FIFO hit rate: 2/7
LRU hit rate: 3/7
QUESTION IO-8E. Now let's assume that you know a reference string in advance. Given a 3-slot cache and the following reference string, what caching algorithm discussed in class and/or exercises would produce the best hit rate, and what would that hit rate be?
“12341425321521”
Bélády’s optimal algorithm (ACCENTS REQUIRED FOR FULL CREDIT!)
1m 2m 3m 4m [124] 1h 4h 2h 5m [125] 3m [123] 2h 1h 5m [125] 2h 1h (m = miss, h = hit; brackets show the cache contents after each eviction)
7/14 = 1/2
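For checking answers like this one, here is a small standalone sketch of Bélády’s policy (not course code; it assumes single-character block names):

```c
#include <stdio.h>
#include <string.h>

// Bélády's optimal policy: on a miss with a full cache, evict the block whose
// next use lies furthest in the future (or that is never used again).
int belady_hits(const char* refs, int nslots) {
    char slot[16];
    int used = 0, hits = 0;
    long n = strlen(refs);
    for (long t = 0; t < n; ++t) {
        int found = -1;
        for (int i = 0; i < used; ++i) {
            if (slot[i] == refs[t]) found = i;
        }
        if (found >= 0) {
            ++hits;
        } else if (used < nslots) {
            slot[used++] = refs[t];
        } else {
            int victim = 0;
            long victim_next = -1;
            for (int i = 0; i < used; ++i) {
                long next = n;               // `n` means "never used again"
                for (long u = t + 1; u < n; ++u) {
                    if (refs[u] == slot[i]) { next = u; break; }
                }
                if (next > victim_next) { victim = i; victim_next = next; }
            }
            slot[victim] = refs[t];
        }
    }
    return hits;
}

int main() {
    printf("%d/14 hits\n", belady_hits("12341425321521", 3));   // prints 7/14 hits
}
```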
IO-9. Caching: Access times and hit rates
Recall that x86-64 instructions can access memory in units of 1, 2, 4, or 8 bytes at a time. Assume we are running on an x86-64-like machine with 1024-byte cache lines. Our machine takes 32ns to access a unit if the cache hits, regardless of unit size. If the cache misses, an additional 8160ns are required to load the cache, for a total of 8192ns.
QUESTION IO-9A. What is the average time per access when accessing all the data in a cache line as an array of 256 integers, starting from an empty cache?
(8192ns * 1 + 32ns * 255)/256 (= 63.875)
QUESTION IO-9B. What unit size (1, 2, 4, or 8) minimizes the access time to access all data in a cache line, starting from an empty cache?
8
QUESTION IO-9C. What unit size (1, 2, 4, or 8) maximizes the hit rate to access all data in a cache line, starting from an empty cache?
1
IO-10. Single-slot cache code
Donald Duck is working on a single-slot cache for reading. He’s using the `pos_tag`/`end_tag` representation, which is:
struct io61_file {
int fd;
unsigned char cbuf[BUFSIZ];
off_t tag; // file offset of first character in cache (same as before)
off_t end_tag; // file offset one past last valid char in cache; end_tag - tag == old `csz`
off_t pos_tag; // file offset of next char to read in cache; pos_tag - tag == old `cpos`
};
Here’s our solution code; in case you want to scribble, the code is copied in the appendix.
1. ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
2. size_t pos = 0;
3. while (pos != sz) {
4. if (f->pos_tag < f->end_tag) {
5. ssize_t n = sz - pos;
6. if (n > f->end_tag - f->pos_tag)
7. n = f->end_tag - f->pos_tag;
8. memcpy(&buf[pos], &f->cbuf[f->pos_tag - f->tag], n);
9. f->pos_tag += n;
10. pos += n;
11. } else {
12. f->tag = f->end_tag;
13. ssize_t n = read(f->fd, f->cbuf, BUFSIZ);
14. if (n > 0)
15. f->end_tag += n;
16. else
17. return pos ? pos : n;
18. }
19. }
20. return pos;
21. }
Donald has ideas for “simplifying” this code. Specifically, he wants to try each of the following independently:
- Replacing line 4 with “`if (f->pos_tag <= f->end_tag) {`”.
- Removing lines 6–7.
- Removing line 9.
- Removing lines 16–17.
QUESTION IO-10A. Which simplifications could lead to undefined behavior? List all that apply or say “none.”
B (removing 6–7): you read beyond the cache buffer.
QUESTION IO-10B. Which simplifications could cause `io61_read` to loop forever without causing undefined behavior? List all that apply or say “none.”
A (replacing 4): you spin forever after exhausting the cache
D (removing 16–17): you spin forever if the file runs out of data or has a persistent error
QUESTION IO-10C. Which simplifications could lead to `io61_read` returning incorrect data in `buf`, meaning that the data read by a series of `io61_read` calls won’t equal the data in the file? List all that apply or say “none.”
B (removing 6–7): you read garbage beyond the cache buffer
C (removing 9): you read the same data over & over again.
QUESTION IO-10D. Chastened, Donald decides to optimize the code for a specific situation, namely when `io61_read` is called with a `sz` that is larger than `BUFSIZ`. He wants to add code after line 11, like so, so that fewer `read` system calls will happen for large `sz`:
11. } else if (sz - pos > BUFSIZ) {
// DONALD’S CODE HERE
11A. } else {
12. f->tag = f->end_tag;
....
Finish Donald’s code. Your code should maintain the relevant invariants between `tag`, `pos_tag`, `end_tag`, and the file position, but you need not keep `tag` aligned.
ssize_t n = read(f->fd, &buf[pos], sz - pos);
if (n > 0) {
    f->tag = f->pos_tag = f->end_tag = f->end_tag + n;
    pos += n;
} else {
    return pos ? pos : n;
}
IO-11. Caching
QUESTION IO-11A. If it takes 200ns to access main memory, which of the following two caches will produce a lower average access time?
- A cache with a 10ns access time that produces a 90% hit rate
- A cache with a 20ns access time that produces a 98% hit rate
Let's compute average access time for each case:
0.9 × 10 + 0.1 × 200 = 9 + 20 = 29 ns
0.98 × 20 + 0.02 × 200 = 19.6 + 4 = 23.6 ns
The 20ns cache produces a lower average access time.
QUESTION IO-11B. Let’s say that you have a direct-mapped cache with four slots. A page with page number `N` must reside in the slot numbered `N % 4`. What is the best hit rate this could achieve given the following sequence of page accesses?
3 6 7 5 3 2 1 1 1 8
Since it's direct mapped, each item can go in only one slot, so if we list the slots for each access, we get:
3 2 3 1 3 2 1 1 1 0
The only hits are the 2 1's, so your hit rate is 2/10 or 20% or .2.
QUESTION IO-11C. What is the best hit rate a fully-associative four-slot cache could achieve for that sequence of page accesses? (A fully-associative cache may put any page in any slot. You may assume you know the full reference stream in advance.)
Now we can get hits for 3, 1, and 1, so our hit rate goes to 3/10 or 30%.
QUESTION IO-11D. What hit rate would the fully-associative four-slot cache achieve if it used the LRU eviction policy?
Still 30% (3/10, .3)
IO-12. I/O traces
QUESTION IO-12A. Which of the following programs cannot be distinguished by the output of the `strace` utility, not considering `open` calls? List all that apply; if multiple indistinguishable groups exist (e.g., A, B, & C can’t be distinguished, and D & E can’t be distinguished, but the groups can be distinguished from each other), list them all.
- Sequential byte writes using stdio
- Sequential byte writes using system calls
- Sequential byte writes using system calls and O_SYNC
- Sequential block writes using stdio and block size 2
- Sequential block writes using system calls and block size 2
- Sequential block writes using system calls and O_SYNC and block size 2
- Sequential block writes using stdio and block size 4096
- Sequential block writes using system calls and block size 4096
- Sequential block writes using system calls and O_SYNC and block size 4096
1, 4, 7, 8, 9 can’t be distinguished; 2 & 3 can’t be distinguished; 5 & 6 can’t be distinguished.
If you do consider `open`, then the O_SYNC cases become distinguishable from the others (the flag appears in the `open` call). Assuming that, only 1, 4, 7, and 8 can’t be distinguished.
QUESTION IO-12B. Which of the programs in Part A cannot be distinguished using `blktrace` output? List all that apply.
1, 2, 4, 5, 7, 8, and possibly 9 can’t be distinguished.
QUESTION IO-12C. The buffer cache is coherent. Which of the following operating system changes could make the buffer cache incoherent? List all that apply.
- Application programs can obtain direct read access to the buffer cache
- Application programs can obtain direct write access to the disk, bypassing the buffer cache
- Other computers can communicate with the disk independently
- The computer has an uninterruptible power supply (UPS), ensuring that the operating system can write the contents of the buffer cache to disk if main power is lost
#2, #3
QUESTION IO-12D. The stdio cache is incoherent. Which of the operating system changes from Part C could make the stdio cache coherent? List all that apply.
#1
IO-13. Reference strings and eviction
QUESTION IO-13A. When demonstrating cache eviction in class, we modeled a completely reactive cache, meaning that the cache performed at most one load from slow storage per access. Name a class of reference string that will have a 0% hit rate on any cold reactive cache. For partial credit, give several examples of such reference strings.
Sequential access: any reference string with no repeated blocks, e.g. “1234567” or “2468013”.
QUESTION IO-13B. What cache optimization can be used to improve the hit rate for the class of reference string in Part A? One word is enough; put the best choice.
Prefetching. Batching is an OK answer, but mostly because it involves prefetching when done for reads.
QUESTION IO-13C. Give a single reference string with the following properties:
- There exists a cache size and eviction policy that gives a 70% hit rate for the string.
- There exists a cache size and eviction policy that gives a 0% hit rate for the string.
Example: 1231231231. 70% on any 3-slot cache; 0% on a 1-slot cache.
QUESTION IO-13D. Put the following eviction algorithms in order of how much space they require for per-slot metadata, starting with the least space and ending with the most space. (Assume the slot order is fixed, so once a block is loaded into slot i, it stays in slot i until it is evicted.) For partial credit say what you think the metadata would be.
- FIFO
- LRU
- Random
Random, then FIFO, then LRU. Random needs no additional metadata; FIFO can get by with a single integer for the whole cache, pointing to the next slot to evict; LRU needs a per-slot least-recently-used timestamp.
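As a concrete illustration of the metadata each policy needs (a sketch with made-up type and member names, assuming fixed slot order):

```c
#include <stdint.h>

#define NSLOTS 64

// Random: no extra metadata; pick a victim with rand() % NSLOTS.
struct cache_random {
    int block[NSLOTS];          // which block each slot holds
};

// FIFO with fixed slot order: one integer for the whole cache suffices,
// pointing at the next slot to evict (advance it round-robin on each miss).
struct cache_fifo {
    int block[NSLOTS];
    unsigned next_victim;
};

// LRU: true per-slot metadata -- the time of each slot's most recent use;
// the victim is the slot with the smallest last_use.
struct cache_lru {
    int block[NSLOTS];
    uint64_t last_use[NSLOTS];
};
```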
IO-14. Cache code
Several famous musicians have just started working on a CS61 problem set. They share the following code for their read-only, sequential, single-slot cache:
struct io61_file {
int fd;
unsigned char buf[4096];
size_t pos; // position of next character to read in `buf`
size_t sz; // number of valid characters in `buf`
};
int io61_readc(io61_file* f) {
if (f->pos >= f->sz) {
f->pos = f->sz = 0;
ssize_t nr = read(f->fd, f->buf, sizeof(f->buf));
if (nr <= 0) {
f->sz = 0;
return -1;
} else {
f->sz = nr;
}
}
int ch = f->buf[f->pos];
++f->pos;
return ch;
}
But they have different `io61_read` implementations. Donald (Lambert)’s is:
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
return read(f->fd, buf, sz);
}
Solange (Knowles)’s is:
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
for (size_t pos = 0; pos < sz; ++pos, ++buf) {
*buf = io61_readc(f);
}
return sz;
}
Caroline (Shaw)’s is:
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
if (f->pos >= f->sz) {
return read(f->fd, buf, sz);
} else {
int ch = io61_readc(f);
if (ch < 0) {
return 0;
}
*buf = ch;
return io61_read(f, buf + 1, sz - 1) + 1;
}
}
You are testing each of these musicians’ codes by executing a sequence of `io61_readc` and/or `io61_read` calls on an input file and printing the resulting characters to standard output. There are no seeks, and your test programs print until end of file, so your tests’ output should equal the input file’s contents.
You should assume for these questions that no `read` system call ever returns -1.
QUESTION IO-14A. Describe an access pattern—that is, a sequence of `io61_readc` and/or `io61_read` calls (with lengths)—for which Donald’s code can return incorrect data.
`io61_readc`, `io61_read(1)`, …: any alternation between readc and read is a disaster.
QUESTION IO-14B. Which of these musicians’ codes can generate an output file with incorrect length?
For the remaining parts, assume the problem in Part B has been corrected, so that all musicians’ codes generate output files with correct lengths.
Solange’s code never returns end-of-file; she mistakes EOF for a valid return value. If a program using Donald’s code calls `io61_readc` once and then switches to `io61_read`, then it will read too few bytes (bytes in the readc buffer won’t be returned).
QUESTION IO-14C. Give an access pattern for which Solange’s code will return correct data and outperform Donald’s, or vice versa, and say whose code will win.
Solange’s code will outperform Donald’s when `io61_read` is called with small sizes. Donald’s code will outperform Solange’s when `io61_read` is called with large sizes.
QUESTION IO-14D. Suggest a small change (≤10 characters) to Caroline’s code that would, most likely, make it perform at least as well as both Solange’s and Donald’s codes on all access patterns. Explain briefly.
I would change Caroline’s test from `if (f->pos >= f->sz)` to `if (f->pos >= f->sz && sz >= 1024)` (or 4096, or something similar). This uses the cache when read is called with small sizes, but avoids the extra copy when read is called with large sizes.
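For concreteness, here is Caroline’s function with that change applied (the 1024-byte threshold is just one reasonable cutoff):

```c
ssize_t io61_read(io61_file* f, char* buf, size_t sz) {
    if (f->pos >= f->sz && sz >= 1024) {
        // Cache empty and the request is large: bypass the cache entirely.
        return read(f->fd, buf, sz);
    } else {
        // Otherwise serve the request a character at a time through the cache.
        int ch = io61_readc(f);
        if (ch < 0) {
            return 0;
        }
        *buf = ch;
        return io61_read(f, buf + 1, sz - 1) + 1;
    }
}
```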
IO-15. Caches
Parts A–C concern different implementations of Pset 3’s stdio cache. Assume a program that reads a 32768-byte file a character at a time, like this:
while (io61_readc(inf) != EOF) {
}
This program will call `io61_readc` 32769 times. (32769 = 2^15 + 1 = 8×2^12 + 1; the +1 accounts for the EOF return.) But the cache implementation might make many fewer system calls.
QUESTION IO-15A. How many `read` system calls are required assuming a single-slot, 4096-byte io61 cache?
9
QUESTION IO-15B. How many `read` system calls are required assuming an eight-slot, 4096-byte io61 cache?
9
QUESTION IO-15C. How many `mmap` system calls are required assuming an `mmap`-based io61 cache?
1
Parts D–F concern cache implementations and styles. We discussed many caches in class, including:
- The buffer cache
- The processor cache
- Single-slot aligned stdio caches
- Single-slot unaligned stdio caches
- Circular bounded buffers
QUESTION IO-15D. Which of those caches are implemented entirely in hardware? List all that apply.
B
QUESTION IO-15E. Which of those software caches could help speed up reverse sequential access to a disk file? List all that apply.
A, C
QUESTION IO-15F. Which of those software caches could help speed up access to a pipe or network socket? List all that apply.
C, D, E
MISC-1. Git
Edward Snowden is working on a CS61 problem set and he has some git questions.
QUESTION MISC-1A. The CS61 staff has released some new code. Which commands will help Edward get the code from code.seas.harvard.edu into his repository? Circle all that apply.
1. `git commit`
2. `git add`
3. `git push`
4. `git pull`
#4
QUESTION MISC-1B. Edward has made some changes to his code. He hasn’t run git since making the changes. He wants to upload his latest version to code.seas.harvard.edu. Put the following git commands in an order that will accomplish this goal. You won’t necessarily use every command. You may add flags to a command (but you don’t have to). If you add flags, tell us what they are.
1. `git commit`
2. `git add`
3. `git push`
4. `git pull`
#2, #1, #3; or “#1 with `-a`”, #3
Edward Snowden’s partner, Edward Norton, has been working on the problem set also. They’ve been working independently.
At midnight on October 10, here’s how things stood. The `git log` for the partners’ shared code.seas.harvard.edu repository looked like this. The committer is listed in (parentheses).
52d44ee Pset release. (kohler)
The `git log` for Snowden’s local repository:
3246d07 Save Greenwald's phone number (snowden)
8633fbd Start work on a direct-mapped cache (snowden)
52d44ee Pset release. (kohler)
The `git log` for Norton’s local repository:
81f952e try mmap (norton)
52d44ee Pset release. (kohler)
At noon on October 11, their shared GitHub repository has this log:
d446e60 Increase cache size (snowden)
b677e85 use mmap on mmappable files (norton)
b46cfda Merge branch 'master' of code.seas.harvard.edu:~TheTrueHOOHA/cs61/TheTrueHOOHAs-cs61-psets.git
(norton)
81f952e try mmap (norton)
3246d07 Save Greenwald's phone number (snowden)
8633fbd Start work on a direct-mapped cache (snowden)
52d44ee Pset release. (kohler)
QUESTION MISC-1C. Give an order for these commands that could have produced that log starting from the midnight October 10 state. You might not use every command, and you might use some commands more than once. Sample (incorrect) answer: “1 4 4 5 2.”
1. snowden: `git commit -a`
2. snowden: `git push`
3. snowden: `git pull`
4. norton: `git commit -a`
5. norton: `git push`
6. norton: `git pull`
- #2 (snowden push)
- [#5 (norton push—OPTIONAL; this push would fail)]
- #6 (norton pull) (We know that Snowden pushed first, and Norton pulled before pushing, because Norton committed the merge) [CIRCLE FOR 1D]
- [#4 (norton commit—OPTIONAL for the merge commit; the merge commit will happen automatically if there are no conflicts)] [ALLOW CIRCLE FOR 1D]
- #4 (norton commit for b677e85)
- #5 (norton push)
- #3 (snowden pull—snowden pulls before committing because there is no merge)
- #1 (snowden commit for d446e60)
- #2 (snowden push)
QUESTION MISC-1D. In your answer to Part C, circle the step(s) where there might have been a merge conflict.
(see above)
MISC-2. Debugging
QUESTION MISC-2A. Match each tool or technique with a debugging situation for which it is well suited. Produce the best overall match that uses each situation exactly once.
| Tool or technique | Debugging situation |
|---|---|
| 1. strace | A. Investigating segmentation faults |
| 2. gdb | B. Finding memory leaks |
| 3. valgrind --tool=memcheck | C. Checking your assumptions and verifying invariants |
| 4. printf statements | D. Discovering I/O patterns |
| 5. assert | E. Displaying program state |
1—D, 2—A, 3—B, 4—E, 5—C
MISC-3. Pot Pourri
QUESTION MISC-3A. What does the following instruction place in %eax?
sarl $31, %eax
It fills eax with the sign bit of eax (i.e., all 0's or all 1's)
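In C terms, the same effect is `x >> 31` on a 32-bit `int` (a small illustration, not the exam’s code; right-shifting a negative signed value is implementation-defined in C, but x86-64 compilers implement it as an arithmetic shift, i.e. `sarl`):

```c
#include <stdio.h>

int main() {
    int a = 42, b = -42;
    // Arithmetic right shift by 31 copies the sign bit into every bit position:
    // 0 for non-negative values, 0xffffffff (that is, -1) for negative ones.
    printf("%d >> 31 = %d\n", a, a >> 31);   // prints 0
    printf("%d >> 31 = %d\n", b, b >> 31);   // prints -1
}
```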
QUESTION MISC-3B. True/False: A direct-mapped cache with N slots can handle any reference string with < N distinct addresses with no misses except for compulsory misses.
False (distinct addresses can map to the same slot, causing conflict misses).
QUESTION MISC-3C. What is 1 (binary) TB in hexadecimal?
1 TB = 2^40. In binary that’s a 1 followed by 40 zeros; 40 binary zeros become 10 hex zeros, so the answer is 0x10000000000.
QUESTION MISC-3D. Write the answer to the following in hexadecimal:
0xabcd + 12
12 = 0xC; 0xD + 0xC = 25 = 0x19, so write 9 and carry 1: the answer is 0xABD9.
QUESTION MISC-3E. True/False: The garbage collector we discussed is conservative, because it only runs when we tell it to.
False (conservative because it never reclaims something it shouldn't, but might not reclaim things it could).
QUESTION MISC-3F. True/False: Given the definition `int array[10]`, the following two expressions mean the same thing: `&array[4]` and `array + 4`.
True
QUESTION MISC-3G. Using the matrix multiply from lecture 12, in what order should you iterate over the indices i, j, and k to achieve the best performance?
ikj
QUESTION MISC-3H. True/False: fopen, fread, fwrite, and fclose are system calls.
False; they are calls to the standard IO library.
QUESTION MISC-3I. Which do you expect to be faster on a modern Linux OS, insertion sorting into a linked list of 1000 elements or into an array of 1000 elements?
The array (contiguous storage gives better locality and avoids pointer chasing).
QUESTION MISC-3J. What does the hardware do differently when adding signed versus unsigned numbers?
Nothing (two’s complement addition is the same bit-level operation for signed and unsigned operands; only the interpretation of the result and flags differs).