Data representation 5: Integers and undefined behavior

Overview

We explore integer representations and introduce integer undefined behavior.

Integer types

Computer integers come in signed and unsigned varieties
- We say “computer integer” to highlight that this is computer arithmetic, which involves size-limited values stored in memory, rather than mathematical arithmetic
Signed types have values in the range [−2^B−1, 2^B−1−1], where B = sizeof(T)*8
Unsigned types have values in the range [0, 2^B−1]
- Positive signed numbers have the same memory representation as the corresponding unsigned numbers
- signed T and unsigned T always have the same size
- Special case: char might be either signed or unsigned (on x86-64 it is signed); if you definitely want a signed version, say signed char
Types can be qualified as const or volatile
- Same data representation

Multi-byte integers

Like Arabic numerals (and many other notations going back to Babylonian cuneiform), computers use place-value notation to represent numbers larger than 255
Numbers are written in base 256 (because bytes have 8 bits)
Consider 258 = 256 + 2 = 1×256¹ + 2×256⁰

int x = 258;
hexdump_object(x);

// => 7ffeea26fac8  02 01 00 00                                       |....|

Unlike Arabic numerals, x86-64 represents numbers with the ones place “on the left” (at lower addresses)
- This is called little endian
Other computers make different choices; for example, the layout used in Internet protocols puts the highest-value place at lower addresses
- This is called big endian

Overflow in computer arithmetic

Overflow happens when the result of a computation is too large in magnitude to be represented using the chosen data type

unsigned char n = 255;   // range representable in `n` is [0,255]
n = n + 1;               // `n` should be 256, but can’t be represented

unsigned u = 1'000'000;  // range representable in `u` is [0,4'294'967'295]
u = u * 10'000'000;      // `n` should be 10'000'000'000'000,
                         // but can’t be represented

Unsigned integer overflow

When a computation using unsigned computer integers overflows, we throw away the carry bits, keeping the least-significant bits

datarep5/unsignedoverflow.cc

unsigned char n = 255;
n = n + 1;

unsigned u = 1'000'000;
u = u * 10'000'000;

printf("n=%u, u=%u\n", (unsigned) n, u);
// => n=0, u=1316134912

  255 =  0b1111'1111
+   1 =  0b0000'0001
       -------------
       0b1'0000'0000
    =>   0b0000'0000 = 0

     1'000'000 =              0b1111'0100'0010'0100'0000
*   10'000'000 =         0b1001'1000'1001'0110'1000'0000
--------------------------------------------------------
0b1001'0001'1000'0100'1110'0111'0010'1010'0000'0000'0000
=>             0b0100'1110'0111'0010'1010'0000'0000'0000 = 1316134912

Unsigned arithmetic overflow, mathematically

Mathematically what does this mean?
- The result is taken modulo 2^B, where B is the number of bits in the type
- So if c is a computer 32-bit integer whose mathematical representation is c, then c{} = c \bmod 2^{32}

The mystery of negative numbers

Say that x is an unsigned computer integer
Then -x is also an unsigned computer integer
- In C and C++, arithmetic on unsigned integers gives unsigned results
What’s the value of -x?

Negatives and overflow

In mathematics, -x is the additive inverse
- The number so that x + (-x) = 0
In computer arithmetic it’s the same!
- -x is the computer integer giving x + (-x) == 0
For unsigned values, use overflow!
- In unsigned char, 1 + 255 == 0, so (unsigned char) 255 == (unsigned char) -1

Two’s complement representation

Principle: Addition and subtraction of signed integers shall use the same representation as addition and subtraction of unsigned integers
Result: -x is represented in the same way as the unsigned number 2^B - x, where B is the number of bits in the integer

Signed overflow: here be dragons

What happens when a signed arithmetic computation overflows?

The upside down

Undefined behavior and abstract machine

A specification imposes requirements on an implementation and on the user of that implementation
C and C++ have a specification—the language standard
The user’s job: Write a program that obeys the specification’s rules
- The program is mostly for the abstract machine defined by the standard
- Example rule: You can’t name a function !#$^&*!^$*&$^#!
  - Try cd cs61-lectures/datarep5; make nope.o
- Example rule: You can’t dereference a null pointer
The compiler’s job: Translate any program that obeys the specification’s rules into an executable that performs the corresponding computation on the actual hardware
- The compiler must understand the abstract machine and the real machine
- The compiler may assume the user did their job

Broad-brush model of programmer psychology

People want their code to:

Priority 1. Work relatively correctly on most inputs

Priority 2. Run fast

Priority 3. Work exactly correctly on all inputs

Why?

If code is wrong on most inputs, its speed doesn’t matter
If code is too slow or expensive to use, its correctness doesn’t matter
Sometimes exact correctness is difficult to define
Of course you should aim for correctness and well-defined specifications though!

Compiler quality

People want their code to go fast (Priority 2)
Compiler writers are people and they get excited about producing faster code
Compilers take advantage of many opportunities to speed up code by removing redundancy

The as-if rule

“conforming implementations are required to emulate (only) the observable behavior of the abstract machine”
- For instance, data written to files
If the compiler can figure out that a function call is redundant, it can eliminate it!
If the compiler can figure out that a computation is redundant, it can eliminate it!
- Including dynamic memory allocation!
If the compiler can figure out a cheaper way to perform a computation, it can implement it!
Examples: datarep5/opt1.cc, datarep5/opt2.cc

Undefined behavior

The C and C++ standards define certain errors as undefined behavior
If a program executes undefined behavior:

Anything at all can happen; the Standard imposes no requirements. The program may fail to compile, or it may execute incorrectly (either crashing or silently generating incorrect results), or it may fortuitously do exactly what the programmer intended. (ref)
Undefined behavior is not the same as implementation-defined behavior
- Implementation-defined behavior: The value of sizeof(int); the address value of a pointer
- Undefined behavior: Dereferencing a null pointer

How undefined behavior and optimizing compilers collide

Compilers may assume that their input programs never execute undefined behavior
- And they may optimize based on this assumption

Signed integer overflow is undefined behavior

Therefore, signed computer integer arithmetic must not overflow
- In unsigned arithmetic, 4294967295U + 1U == 0U
- In signed arithmetic, 2147483647 + 1 == 💩🚽

Example 1: Signed integer overflow

datarep5/signedoverflow.cc

int x_incr = x + 1;
assert(x_incr > x);
printf("assertion passed, so %d > %d\n", x_incr, x);

./signedoverflow 1 prints?
- assertion passed, so 2 > 1
Compiled without optimization by clang++, ./signedoverflow 2147483647 prints?
- signedoverflow.cc:17: int main(int, char **): Assertion `x_incr > x' failed.
- Because 2147483647 + 1 == -2147483648 in signed computer integer arithmetic!
Compiled with optimization, ./signedoverflow 2147483647 prints?
- assertion passed, so -2147483648 > 2147483647

Sanitizer to the rescue

Compilers with sanitizers have a different goal—making the code run as fast as possible, while detecting undefined behavior

$ make SAN=1 signedoverflow
...
$ ./signedoverflow 2147483647
[1msignedoverflow.cc:16:9: [31mruntime error:[30m signed integer overflow: 2147483647 + 1 cannot be represented in type 'int'[m

Always check your code with sanitizers!

Example 2: Null pointer dereference

static void __devexit agnx_pci_remove (struct pci_dev *pdev)
{
  struct ieee80211_hw *dev = pci_get_drvdata(pdev);
  struct agnx_priv *priv = dev->priv; 

  if (!dev) return;

This caused a serious Linux bug
The compiler removed the if (!dev) return; check!
- The earlier initialization priv = dev->priv; is only defined if dev points to a live object
- So the compiler assumed that dev points to a live object
- So dev != nullptr
- So testing dev is redundant, and the check can be removed
Fix: set priv = dev->priv; only after checking dev for null

Example 3: Signed integer overflow in a loop

datarep5/signedloop.cc

long sum = 0;
for (int i = n1; i <= n2; ++i) {
    sum += f(i);
}
return sum;

Consider ./signedloop 2147483647 2147483647; what should this print?
It depends!
- As we saw in Example 1, when i == 2147483647, then perhaps i + 1 == -2147483648
- That would cause the loop to execute forever
- On the other hand, computing i + 1 is undefined behavior, so maybe it does something else?
Compiled with optimization, it prints 2147483647
Compiled without optimization (make O=0 signedloop), it runs forever
Compiled with sanitization (make SAN=1 signedloop), it reports a sanitizer error

Example 4: Out-of-bounds access

datarep5/tablelookup.cc

int awesome[4] = {0, 1, 2, 3};

bool is_awesome(int v) {
    for (int i = 0; i <= 4; ++i) {
        if (awesome[i] == v)
            return true;
    }
    return false;
}

What’s is_awesome(5) return?
- It returns true!
- ./tablelookup 5 prints 5 is awesome
There’s an off-by-one error in the for condition
- Programmer meant i < 4
Table only has 4 elements: checking awesome[4] is an out-of-bounds read, and undefined behavior!

First compiler transforms code to something like this:

bool is_awesome(int v) {
    if (awesome[0] == v) {
        return true;
    }
    if (awesome[1] == v) {
        return true;
    }
    if (awesome[2] == v) {
        return true;
    }
    if (awesome[3] == v) {
        return true;
    }
    if (awesome[4] == v) { // UNDEFINED BEHAVIOR
        return true;
    }
    return false;
}

Compiler assumes that undefined behavior cannot happen
So compiler concludes that function must return before line 14
Which means that it could only have returned true
So final compiled code is simply return true

More examples of undefined behaviors

Dividing an integer by 0
Using an uninitialized variable
Violating the live object rule
Modifying a const object
Comparing pointers to unrelated objects using < <= > >=
Accessing an invalid iterator
Accessing an object using a pointer of incompatible type
Executing an infinite loop that doesn’t have any observable behavior (!)