This is not the current version of the class.

# Data representation 5: Integers and undefined behavior

## Overview

We explore integer representations and introduce integer undefined behavior.

Readings on undefined behavior

## Integer types

• Computer integers come in signed and unsigned varieties
• We say “computer integer” to highlight that this is computer arithmetic, which involves size-limited values stored in memory, rather than mathematical arithmetic
• Signed types have values in the range [−2B−1, 2B−1−1], where B = sizeof(T)*8
• Unsigned types have values in the range [0, 2B−1]
• Positive signed numbers have the same memory representation as the corresponding unsigned numbers
• signed T and unsigned T always have the same size
• Special case: char might be either signed or unsigned (on x86-64 it is signed); if you definitely want a signed version, say signed char
• Types can be qualified as const or volatile
• Same data representation

## Multi-byte integers

• Like Arabic numerals (and many other notations going back to Babylonian cuneiform), computers use place-value notation to represent numbers larger than 255
• Numbers are written in base 256 (because bytes have 8 bits)
• Consider 258 = 256 + 2 = 1×2561 + 2×2560
int x = 258;
hexdump_object(x);

// => 7ffeea26fac8  02 01 00 00                                       |....|

• Unlike Arabic numerals, x86-64 represents numbers with the ones place “on the left” (at lower addresses)
• Other computers make different choices; for example, the layout used in Internet protocols puts the highest-value place at lower addresses
• This is called big endian

## Overflow in computer arithmetic

• Overflow happens when the result of a computation is too large in magnitude to be represented using the chosen data type
unsigned char n = 255;   // range representable in n is [0,255]
n = n + 1;               // n should be 256, but can’t be represented

unsigned u = 1'000'000;  // range representable in u is [0,4'294'967'295]
u = u * 10'000'000;      // n should be 10'000'000'000'000,
// but can’t be represented


## Unsigned integer overflow

• When a computation using unsigned computer integers overflows, we throw away the carry bits, keeping the least-significant bits
unsigned char n = 255;
n = n + 1;

unsigned u = 1'000'000;
u = u * 10'000'000;

printf("n=%u, u=%u\n", (unsigned) n, u);
// => n=0, u=1316134912

  255 =  0b1111'1111
+   1 =  0b0000'0001
-------------
0b1'0000'0000
=>   0b0000'0000 = 0

     1'000'000 =              0b1111'0100'0010'0100'0000
*   10'000'000 =         0b1001'1000'1001'0110'1000'0000
--------------------------------------------------------
0b1001'0001'1000'0100'1110'0111'0010'1010'0000'0000'0000
=>             0b0100'1110'0111'0010'1010'0000'0000'0000 = 1316134912


## Unsigned arithmetic overflow, mathematically

• Mathematically what does this mean?
• The result is taken modulo 2B, where B is the number of bits in the type
• So if c is a computer 32-bit integer whose mathematical representation is c, then c{} = c \bmod 2^{32}

## The mystery of negative numbers

• Say that x is an unsigned computer integer
• Then -x is also an unsigned computer integer
• In C and C++, arithmetic on unsigned integers gives unsigned results
• What’s the value of -x?

## Negatives and overflow

• In mathematics, -x is the additive inverse
• The number so that x + (-x) = 0
• In computer arithmetic it’s the same!
• -x is the computer integer giving x + (-x) == 0
• For unsigned values, use overflow!
• In unsigned char, 1 + 255 == 0, so (unsigned char) 255 == (unsigned char) -1

## Two’s complement representation

• Principle: Addition and subtraction of signed integers shall use the same representation as addition and subtraction of unsigned integers
• Result: -x is represented in the same way as the unsigned number 2B - x, where B is the number of bits in the integer

## Signed overflow: here be dragons

• What happens when a signed arithmetic computation overflows?

## Undefined behavior and abstract machine

• A specification imposes requirements on an implementation and on the user of that implementation
• C and C++ have a specification—the language standard
• The user’s job: Write a program that obeys the specification’s rules
• The program is mostly for the abstract machine defined by the standard
• Example rule: You can’t name a function !#$^&*!^$*&$^#! • Try cd cs61-lectures/datarep5; make nope.o • Example rule: You can’t dereference a null pointer • The compiler’s job: Translate any program that obeys the specification’s rules into an executable that performs the corresponding computation on the actual hardware • The compiler must understand the abstract machine and the real machine • The compiler may assume the user did their job ## Broad-brush model of programmer psychology People want their code to: Priority 1. Work relatively correctly on most inputs Priority 2. Run fast Priority 3. Work exactly correctly on all inputs ## Why? • If code is wrong on most inputs, its speed doesn’t matter • If code is too slow or expensive to use, its correctness doesn’t matter • Sometimes exact correctness is difficult to define • Of course you should aim for correctness and well-defined specifications though! ## Compiler quality • People want their code to go fast (Priority 2) • Compiler writers are people and they get excited about producing faster code • Compilers take advantage of many opportunities to speed up code by removing redundancy ## The as-if rule • “conforming implementations are required to emulate (only) the observable behavior of the abstract machine” • For instance, data written to files • If the compiler can figure out that a function call is redundant, it can eliminate it! • If the compiler can figure out that a computation is redundant, it can eliminate it! • Including dynamic memory allocation! • If the compiler can figure out a cheaper way to perform a computation, it can implement it! • Examples: datarep5/opt1.cc, datarep5/opt2.cc ## Undefined behavior • The C and C++ standards define certain errors as undefined behavior • If a program executes undefined behavior: Anything at all can happen; the Standard imposes no requirements. The program may fail to compile, or it may execute incorrectly (either crashing or silently generating incorrect results), or it may fortuitously do exactly what the programmer intended. (ref) • Undefined behavior is not the same as implementation-defined behavior • Implementation-defined behavior: The value of sizeof(int); the address value of a pointer • Undefined behavior: Dereferencing a null pointer ## How undefined behavior and optimizing compilers collide • Compilers may assume that their input programs never execute undefined behavior • And they may optimize based on this assumption ## Signed integer overflow is undefined behavior • Therefore, signed computer integer arithmetic must not overflow • In unsigned arithmetic, 4294967295U + 1U == 0U • In signed arithmetic, 2147483647 + 1 == 💩🚽 ## Example 1: Signed integer overflow int x_incr = x + 1; assert(x_incr > x); printf("assertion passed, so %d > %d\n", x_incr, x);  • ./signedoverflow 1 prints? • assertion passed, so 2 > 1 • Compiled without optimization by clang++, ./signedoverflow 2147483647 prints? • signedoverflow.cc:17: int main(int, char **): Assertion x_incr > x' failed. • Because 2147483647 + 1 == -2147483648 in signed computer integer arithmetic! • Compiled with optimization, ./signedoverflow 2147483647 prints? • assertion passed, so -2147483648 > 2147483647 ## Sanitizer to the rescue • Compilers with sanitizers have a different goal—making the code run as fast as possible, while detecting undefined behavior $ make SAN=1 signedoverflow
...
\$ ./signedoverflow 2147483647
[1msignedoverflow.cc:16:9: [31mruntime error:[30m signed integer overflow: 2147483647 + 1 cannot be represented in type 'int'[m

• Always check your code with sanitizers!

## Example 2: Null pointer dereference

static void __devexit agnx_pci_remove (struct pci_dev *pdev)
{
struct ieee80211_hw *dev = pci_get_drvdata(pdev);
struct agnx_priv *priv = dev->priv;

if (!dev) return;

• This caused a serious Linux bug
• The compiler removed the if (!dev) return; check!
• The earlier initialization priv = dev->priv; is only defined if dev points to a live object
• So the compiler assumed that dev points to a live object
• So dev != nullptr
• So testing dev is redundant, and the check can be removed
• Fix: set priv = dev->priv; only after checking dev for null

## Example 3: Signed integer overflow in a loop

long sum = 0;
for (int i = n1; i <= n2; ++i) {
sum += f(i);
}
return sum;

• Consider ./signedloop 2147483647 2147483647; what should this print?
• It depends!
• As we saw in Example 1, when i == 2147483647, then perhaps i + 1 == -2147483648
• That would cause the loop to execute forever
• On the other hand, computing i + 1 is undefined behavior, so maybe it does something else?
• Compiled with optimization, it prints 2147483647
• Compiled without optimization (make O=0 signedloop), it runs forever
• Compiled with sanitization (make SAN=1 signedloop), it reports a sanitizer error

## Example 4: Out-of-bounds access

int awesome[4] = {0, 1, 2, 3};

bool is_awesome(int v) {
for (int i = 0; i <= 4; ++i) {
if (awesome[i] == v)
return true;
}
return false;
}

• What’s is_awesome(5) return?
• It returns true!
• ./tablelookup 5 prints 5 is awesome
• There’s an off-by-one error in the for condition
• Programmer meant i < 4
• Table only has 4 elements: checking awesome[4] is an out-of-bounds read, and undefined behavior!
• First compiler transforms code to something like this:
bool is_awesome(int v) {
if (awesome[0] == v) {
return true;
}
if (awesome[1] == v) {
return true;
}
if (awesome[2] == v) {
return true;
}
if (awesome[3] == v) {
return true;
}
if (awesome[4] == v) { // UNDEFINED BEHAVIOR
return true;
}
return false;
}

• Compiler assumes that undefined behavior cannot happen
• So compiler concludes that function must return before line 14
• Which means that it could only have returned true
• So final compiled code is simply return true

## More examples of undefined behaviors

• Dividing an integer by 0
• Using an uninitialized variable
• Violating the live object rule
• Modifying a const object
• Comparing pointers to unrelated objects using < <= > >=`
• Accessing an invalid iterator
• Accessing an object using a pointer of incompatible type
• Executing an infinite loop that doesn’t have any observable behavior (!)