Overview
We explore integer representations and introduce integer undefined behavior.
Readings on undefined behavior
Integer types
- Computer integers come in signed and unsigned varieties
- We say “computer integer” to highlight that this is computer arithmetic, which involves size-limited values stored in memory, rather than mathematical arithmetic
- Signed types have values in the range [−2B−1, 2B−1−1], where B =
sizeof(T)
*8 - Unsigned types have values in the range [0, 2B−1]
- Positive signed numbers have the same memory representation as the corresponding unsigned numbers
signed T
andunsigned T
always have the same size- Special case:
char
might be either signed or unsigned (on x86-64 it is signed); if you definitely want a signed version, saysigned char
- Types can be qualified as
const
orvolatile
- Same data representation
Multi-byte integers
- Like Arabic numerals (and many other notations going back to Babylonian cuneiform), computers use place-value notation to represent numbers larger than 255
- Numbers are written in base 256 (because bytes have 8 bits)
- Consider 258 = 256 + 2 = 1×2561 + 2×2560
int x = 258;
hexdump_object(x);
// => 7ffeea26fac8 02 01 00 00 |....|
- Unlike Arabic numerals, x86-64 represents numbers with the ones place “on the left” (at lower addresses)
- This is called little endian
- Other computers make different choices; for example, the layout used in Internet protocols puts the highest-value place at lower addresses
- This is called big endian
Overflow in computer arithmetic
- Overflow happens when the result of a computation is too large in magnitude to be represented using the chosen data type
unsigned char n = 255; // range representable in `n` is [0,255]
n = n + 1; // `n` should be 256, but can’t be represented
unsigned u = 1'000'000; // range representable in `u` is [0,4'294'967'295]
u = u * 10'000'000; // `n` should be 10'000'000'000'000,
// but can’t be represented
Unsigned integer overflow
- When a computation using unsigned computer integers overflows, we throw away the carry bits, keeping the least-significant bits
unsigned char n = 255;
n = n + 1;
unsigned u = 1'000'000;
u = u * 10'000'000;
printf("n=%u, u=%u\n", (unsigned) n, u);
// => n=0, u=1316134912
255 = 0b1111'1111
+ 1 = 0b0000'0001
-------------
0b1'0000'0000
=> 0b0000'0000 = 0
1'000'000 = 0b1111'0100'0010'0100'0000
* 10'000'000 = 0b1001'1000'1001'0110'1000'0000
--------------------------------------------------------
0b1001'0001'1000'0100'1110'0111'0010'1010'0000'0000'0000
=> 0b0100'1110'0111'0010'1010'0000'0000'0000 = 1316134912
Unsigned arithmetic overflow, mathematically
- Mathematically what does this mean?
- The result is taken modulo 2B, where B is the number of bits in the type
- So if
c
is a computer 32-bit integer whose mathematical representation is c, thenc
{} = c \bmod 2^{32}
The mystery of negative numbers
- Say that
x
is an unsigned computer integer - Then
-x
is also an unsigned computer integer- In C and C++, arithmetic on unsigned integers gives unsigned results
- What’s the value of
-x
?
Negatives and overflow
- In mathematics, -x is the additive inverse
- The number so that x + (-x) = 0
- In computer arithmetic it’s the same!
-x
is the computer integer givingx + (-x) == 0
- For unsigned values, use overflow!
- In
unsigned char
,1 + 255 == 0
, so(unsigned char) 255 == (unsigned char) -1
- In
Two’s complement representation
- Principle: Addition and subtraction of signed integers shall use the same representation as addition and subtraction of unsigned integers
- Result:
-x
is represented in the same way as the unsigned number2B - x
, whereB
is the number of bits in the integer
Signed overflow: here be dragons
- What happens when a signed arithmetic computation overflows?
Undefined behavior and abstract machine
- A specification imposes requirements on an implementation and on the user of that implementation
- C and C++ have a specification—the language standard
- The user’s job: Write a program that obeys the specification’s rules
- The program is mostly for the abstract machine defined by the standard
- Example rule: You can’t name a function
!#$^&*!^$*&$^#!
- Try
cd cs61-lectures/datarep5; make nope.o
- Try
- Example rule: You can’t dereference a null pointer
- The compiler’s job: Translate any program that obeys the specification’s
rules into an executable that performs the corresponding computation on the
actual hardware
- The compiler must understand the abstract machine and the real machine
- The compiler may assume the user did their job
Broad-brush model of programmer psychology
People want their code to:
Priority 1. Work relatively correctly on most inputs
Priority 2. Run fast
Priority 3. Work exactly correctly on all inputs
Why?
- If code is wrong on most inputs, its speed doesn’t matter
- If code is too slow or expensive to use, its correctness doesn’t matter
- Sometimes exact correctness is difficult to define
- Of course you should aim for correctness and well-defined specifications though!
Compiler quality
- People want their code to go fast (Priority 2)
- Compiler writers are people and they get excited about producing faster code
- Compilers take advantage of many opportunities to speed up code by removing redundancy
The as-if rule
- “conforming implementations are required to emulate (only) the observable
behavior of the abstract machine”
- For instance, data written to files
- If the compiler can figure out that a function call is redundant, it can eliminate it!
- If the compiler can figure out that a computation is redundant, it can
eliminate it!
- Including dynamic memory allocation!
- If the compiler can figure out a cheaper way to perform a computation, it can implement it!
- Examples:
datarep5/opt1.cc
,datarep5/opt2.cc
Undefined behavior
- The C and C++ standards define certain errors as undefined behavior
- If a program executes undefined behavior:
Anything at all can happen; the Standard imposes no requirements. The program may fail to compile, or it may execute incorrectly (either crashing or silently generating incorrect results), or it may fortuitously do exactly what the programmer intended. (ref)
- Undefined behavior is not the same as implementation-defined behavior
- Implementation-defined behavior: The value of
sizeof(int)
; the address value of a pointer - Undefined behavior: Dereferencing a null pointer
- Implementation-defined behavior: The value of
How undefined behavior and optimizing compilers collide
- Compilers may assume that their input programs never execute undefined behavior
- And they may optimize based on this assumption
Signed integer overflow is undefined behavior
- Therefore, signed computer integer arithmetic must not overflow
- In unsigned arithmetic,
4294967295U + 1U == 0U
- In signed arithmetic,
2147483647 + 1 == 💩🚽
- In unsigned arithmetic,
Example 1: Signed integer overflow
int x_incr = x + 1;
assert(x_incr > x);
printf("assertion passed, so %d > %d\n", x_incr, x);
./signedoverflow 1
prints?assertion passed, so 2 > 1
- Compiled without optimization by clang++,
./signedoverflow 2147483647
prints?signedoverflow.cc:17: int main(int, char **): Assertion `x_incr > x' failed.
- Because
2147483647 + 1 == -2147483648
in signed computer integer arithmetic!
- Compiled with optimization,
./signedoverflow 2147483647
prints?assertion passed, so -2147483648 > 2147483647
Sanitizer to the rescue
- Compilers with sanitizers have a different goal—making the code run as fast as possible, while detecting undefined behavior
$ make SAN=1 signedoverflow
...
$ ./signedoverflow 2147483647
[1msignedoverflow.cc:16:9: [31mruntime error:[30m signed integer overflow: 2147483647 + 1 cannot be represented in type 'int'[m
- Always check your code with sanitizers!
Example 2: Null pointer dereference
static void __devexit agnx_pci_remove (struct pci_dev *pdev)
{
struct ieee80211_hw *dev = pci_get_drvdata(pdev);
struct agnx_priv *priv = dev->priv;
if (!dev) return;
- This caused a serious Linux bug
- The compiler removed the
if (!dev) return;
check!- The earlier initialization
priv = dev->priv;
is only defined ifdev
points to a live object - So the compiler assumed that
dev
points to a live object - So
dev != nullptr
- So testing
dev
is redundant, and the check can be removed
- The earlier initialization
- Fix: set
priv = dev->priv;
only after checkingdev
for null
Example 3: Signed integer overflow in a loop
long sum = 0;
for (int i = n1; i <= n2; ++i) {
sum += f(i);
}
return sum;
- Consider
./signedloop 2147483647 2147483647
; what should this print? - It depends!
- As we saw in Example 1, when
i == 2147483647
, then perhapsi + 1 == -2147483648
- That would cause the loop to execute forever
- On the other hand, computing
i + 1
is undefined behavior, so maybe it does something else?
- As we saw in Example 1, when
- Compiled with optimization, it prints
2147483647
- Compiled without optimization (
make O=0 signedloop
), it runs forever - Compiled with sanitization (
make SAN=1 signedloop
), it reports a sanitizer error
Example 4: Out-of-bounds access
int awesome[4] = {0, 1, 2, 3};
bool is_awesome(int v) {
for (int i = 0; i <= 4; ++i) {
if (awesome[i] == v)
return true;
}
return false;
}
- What’s
is_awesome(5)
return?- It returns
true
! ./tablelookup 5
prints5 is awesome
- It returns
- There’s an off-by-one error in the
for
condition- Programmer meant
i < 4
- Programmer meant
- Table only has 4 elements: checking
awesome[4]
is an out-of-bounds read, and undefined behavior! - First compiler transforms code to something like this:
bool is_awesome(int v) { if (awesome[0] == v) { return true; } if (awesome[1] == v) { return true; } if (awesome[2] == v) { return true; } if (awesome[3] == v) { return true; } if (awesome[4] == v) { // UNDEFINED BEHAVIOR return true; } return false; }
- Compiler assumes that undefined behavior cannot happen
- So compiler concludes that function must return before line 14
- Which means that it could only have returned
true
- So final compiled code is simply
return true
More examples of undefined behaviors
- Dividing an integer by 0
- Using an uninitialized variable
- Violating the live object rule
- Modifying a
const
object - Comparing pointers to unrelated objects using
< <= > >=
- Accessing an invalid iterator
- Accessing an object using a pointer of incompatible type
- Executing an infinite loop that doesn’t have any observable behavior (!)