Arrays and pointers
One of C’s more interesting choices is that it explicitly relates pointers and arrays. Although arrays are laid out in memory in a specific way, they generally behave like pointers when they are used. This property probably arose from C’s desire to explicitly model memory as an array of bytes, and it has beautiful and confounding effects.
We’ve already seen one of these effects. The hexdump function has this signature (arguments and return type):
void hexdump(const void* ptr, size_t size);
But we can just pass an array as an argument to hexdump:
char c[10];
hexdump(c, sizeof(c));
When used in an expression like this—here, as an argument—the array magically changes into a pointer to its first element. The above call has the same meaning as this:
hexdump(&c[0], 10 * sizeof(c[0]));
C programmers transition between arrays and pointers very naturally.
A confounding effect is that, unlike all other types, arrays are passed to functions by reference rather than by value. (In fact, a C function cannot return an array type at all.) C is a call-by-value language except for arrays: all other function arguments and return values are copied, so that parameter modifications inside a function do not affect the objects passed by the caller. Arrays are the exception. For instance:
#include <stdio.h>

void f(int a[2]) {
    a[0] = 1;
}

int main() {
    int x[2] = {100, 101};
    f(x);
    printf("%d\n", x[0]);    // prints 1!
}
If you don’t like this behavior, you can get around it by using a struct or a C++ std::array:

#include <array>
#include <stdio.h>

struct array1 {
    int a[2];
};

void f1(array1 arg) {
    arg.a[0] = 1;
}

void f2(std::array<int, 2> a) {
    a[0] = 1;
}

int main() {
    array1 x = {{100, 101}};
    f1(x);
    printf("%d\n", x.a[0]);    // prints 100

    std::array<int, 2> x2 = {100, 101};
    f2(x2);
    printf("%d\n", x2[0]);     // prints 100
}
Pointer arithmetic
C++ extends the logic of this array–pointer correspondence to support arithmetic on pointers as well.
Pointer arithmetic rule. In the C abstract machine, arithmetic on pointers produces the same result as arithmetic on the corresponding array indexes.
Specifically, consider an array T a[n] and pointers T* p1 = &a[i] and T* p2 = &a[j]. Then:

- Equality: p1 == p2 if and only if (iff) p1 and p2 point to the same address, which happens iff i == j.
- Inequality: Similarly, p1 != p2 iff i != j.
- Less-than: p1 < p2 iff i < j. Also, p1 <= p2 iff i <= j; and p1 > p2 iff i > j; and p1 >= p2 iff i >= j.
- Pointer difference: What should p1 - p2 mean? Using array indexes as the basis, p1 - p2 == i - j. (But the type of the difference is always ptrdiff_t, which on x86-64 is long, the signed version of size_t.)
- Addition: p1 + k (where k is an integer type) equals the pointer &a[i + k]. (k + p1 returns the same thing.)
- Subtraction: p1 - k equals &a[i - k].
- Increment and decrement: ++p1 means p1 = p1 + 1, which means p1 = &a[i + 1]. Similarly, --p1 means p1 = &a[i - 1]. (There are also postfix versions, p1++ and p1--, but C++ style prefers the prefix versions.)
No other arithmetic operations on pointers are allowed. You can’t multiply pointers, for example. (You can multiply addresses by casting the pointers to the address type, uintptr_t—so (uintptr_t) p1 * (uintptr_t) p2—but why would you?)
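Here is a small sketch of the allowed operations (a hypothetical array, introduced here just to illustrate the rules above):

#include <cstdio>

int main() {
    int a[6] = {10, 11, 12, 13, 14, 15};
    int* p1 = &a[1];
    int* p2 = &a[4];

    printf("%td\n", p2 - p1);    // pointer difference: prints 3 (== 4 - 1)
    printf("%d\n", *(p1 + 2));   // addition: p1 + 2 == &a[3], prints 13
    ++p1;                        // increment: p1 now equals &a[2]
    printf("%d\n", *p1);         // prints 12
    printf("%d\n", p1 < p2);     // less-than: index 2 < index 4, prints 1
}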
From pointers to iterators
Let’s write a function that can sum all the integers in an array.
int sum(int a[], int size) {
int sum = 0;
for (int i = 0; i != size; ++i) {
sum += a[i];
}
return sum;
}
This function can compute the sum of the elements of any int array. But because of the pointer–array relationship, its a argument is really a pointer. That allows us to call it with subarrays as well as with whole arrays. For instance:
int a[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
int s1 = sum(a, 10); // 45
int s2 = sum(&a[0], 10); // same as s1
int s3 = sum(&a[1], 5);      // sums a[1]...a[5], computing 15
int s4 = sum(a + 1, 5); // same as s3
This way of thinking about arrays naturally leads to a style that avoids sizes entirely, using instead a sentinel or boundary argument that defines the end of the interesting part of the array.
int sum(int* first, int* last) {
int sum = 0;
while (first != last) {
sum += *first;
++first;
}
return sum;
}
These expressions compute the same sums as the above:
int s1 = sum(a, a + 10);
int s2 = sum(&a[0], &a[0] + 10);
int s3 = sum(&a[1], &a[1] + 5);
int s4 = sum(a + 1, a + 6);
Note that the data from first to last forms a half-open range. In mathematical notation, we care about elements in the range [first, last): the element pointed to by first is included (if it exists), but the element pointed to by last is not. Half-open ranges give us a simple and clear way to describe empty ranges, such as zero-element arrays: if first == last, then the range is empty.
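For instance, an empty range sums to zero (a quick sketch reusing the array a from above; s5 is a hypothetical variable):

int s5 = sum(a, a);          // first == last: empty range, sum is 0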
Note that given a ten-element array a, the pointer a + 10 can be formed and compared, but must not be dereferenced—the element a[10] does not exist. The C/C++ abstract machines allow users to form pointers to the “one-past-the-end” boundary elements of arrays, but users must not dereference such pointers.
So in C, two pointers naturally express a range of an array. The C++ standard template library, or STL, brilliantly abstracts this pointer notion to allow two iterators, which are pointer-like objects, to express a range of any standard data structure—an array, a vector, a hash table, a balanced tree, whatever. This version of sum works for any container of ints; notice how little it changed:
template <typename It>
int sum(It first, It last) {
int sum = 0;
while (first != last) {
sum += *first;
++first;
}
return sum;
}
Some example uses:
std::set<int> set_of_ints;
int s1 = sum(set_of_ints.begin(), set_of_ints.end());
std::list<int> linked_list_of_ints;
int s2 = sum(linked_list_of_ints.begin(), linked_list_of_ints.end());
Addresses vs. pointers
What’s the difference between these expressions? (Again, a is an array of type T, and p1 == &a[i] and p2 == &a[j].)
ptrdiff_t d1 = p1 - p2;
ptrdiff_t d2 = (uintptr_t) p1 - (uintptr_t) p2;
The first expression is defined analogously to index arithmetic, so d1 == i - j. But the second expression performs the arithmetic on the addresses corresponding to those pointers, so we expect d2 to equal sizeof(T) * d1. Always be aware of which kind of arithmetic you’re using. Generally, arithmetic on pointers should not involve sizeof, since the sizeof is included automatically according to the abstract machine; but arithmetic on addresses almost always should involve sizeof.
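A short sketch of the difference (hypothetical pointers into an int array; the comments assume a typical machine where sizeof(int) == 4):

#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    int a[10];
    int* p1 = &a[7];
    int* p2 = &a[2];

    ptrdiff_t d1 = p1 - p2;                          // pointer arithmetic: 7 - 2 == 5
    ptrdiff_t d2 = (uintptr_t) p1 - (uintptr_t) p2;  // address arithmetic: 5 * sizeof(int) == 20
    printf("d1 = %td, d2 = %td\n", d1, d2);          // prints "d1 = 5, d2 = 20"
}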
Undefined behavior
Although C++ is a low-level language, the abstract machine is surprisingly strict about which pointers may be formed and how they can be used. Violate the rules and you’re in hell because you have invoked the dreaded undefined behavior.
Given an array a[N] of N elements of type T:

- Forming a pointer &a[i] (or a + i) with 0 ≤ i ≤ N is safe.
- Forming a pointer &a[i] with i < 0 or i > N causes undefined behavior.
- Dereferencing a pointer &a[i] with 0 ≤ i < N is safe.
- Dereferencing a pointer &a[i] with i < 0 or i ≥ N causes undefined behavior.
(For the purposes of these rules, objects that are not arrays count as single-element arrays. So given T x, we can safely form &x and &x + 1 and dereference &x.)
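To make the rules concrete, here’s a sketch; the commented-out lines are the erroneous ones:

int main() {
    int a[4] = {0, 1, 2, 3};

    int* end = a + 4;        // safe: the one-past-the-end pointer may be formed...
    (void) (a != end);       // ...and compared
    // int x = *end;         // undefined behavior: a[4] does not exist
    // int* bad = a + 5;     // undefined behavior: even forming this pointer is illegal
    return a[0];
}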
What “undefined behavior” means is horrible. A program that executes undefined behavior is erroneous. But the compiler need not catch the error. In fact, the abstract machine says anything goes: undefined behavior is “behavior … for which this International Standard imposes no requirements.” “Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).” Other possible behaviors include allowing hackers from the moon to steal all of a program’s data, take it over, and force it to delete the hard drive on which it is running. Once undefined behavior executes, a program may do anything, including making demons fly out of the programmer’s nose.
Pointer arithmetic, and even pointer comparisons, are also affected by undefined behavior. It’s undefined to go beyond an array’s bounds using pointer arithmetic. And while pointers may be compared for equality or inequality even if they point to different arrays or objects, if you try to compare different arrays via less-than, like this:
int a[10];
int b[10];
if (a < b + 10) ...
that causes undefined behavior.
If you really want to compare pointers that might be to different arrays—for instance, you’re writing a hash function for arbitrary pointers—cast them to uintptr_t first.
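For example, here is a sketch of a comparator that is safe for arbitrary pointers (ptr_less is a hypothetical helper, not part of the code above):

#include <cstdint>

// Comparing the uintptr_t address values is defined even when the
// pointers point into different arrays or objects.
bool ptr_less(const void* p, const void* q) {
    return (uintptr_t) p < (uintptr_t) q;
}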
Undefined behavior and optimization
A program that causes undefined behavior is not a C++ program. The abstract machine says that a C++ program, by definition, is a program whose behavior is always defined. The C++ compiler is allowed to assume that its input is a C++ program. (Obviously!) So the compiler can assume that its input program will never cause undefined behavior. Thus, since undefined behavior is “impossible,” if the compiler can prove that a condition would cause undefined behavior later, it can assume that condition will never occur.
Consider this program:
char* x = /* some value */;
assert(x + 1 > x);
printf("x = %p, x + 1 = %p\n", x, x + 1);
If we supply a value equal to (char*) -1, we’re likely to see output like this:
x = 0xffffffffffffffff, x + 1 = 0
with no assertion failure! But that’s an apparently impossible result. The printout can only happen if x + 1 > x (otherwise, the assertion will fail and stop the printout). But x + 1, which equals 0, is less than x, which is the largest 8-byte value!
The impossible happens because of undefined behavior reasoning. When the compiler sees an expression like x + 1 > x (with x a pointer), it can reason this way:

“Ah, x + 1. This must be a pointer into the same array as x (or it might be a boundary pointer just past that array, or just past the non-array object x). This must be so because forming any other pointer would cause undefined behavior.

“The pointer comparison is the same as an index comparison. x + 1 > x means the same thing as &x[1] > &x[0]. But that holds iff 1 > 0.

“In my infinite wisdom, I know that 1 > 0. Thus x + 1 > x always holds, and the assertion will never fail.

“My job is to make this code run fast. The fastest code is code that’s not there. This assertion will never fail—might as well remove it!”
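In effect, the program behaves as if the compiler had rewritten it like this (a sketch of the reasoning’s outcome, not literal compiler output):

char* x = /* some value */;
// assert(x + 1 > x);   // removed: the compiler "proved" this always holds
printf("x = %p, x + 1 = %p\n", x, x + 1);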
Integer undefined behavior
Arithmetic on signed integers also has important undefined behaviors. Signed integer arithmetic must never overflow. That is, the compiler may assume that the mathematical result of any signed arithmetic operation, such as x + y (with x and y both int), can be represented inside the relevant type. It causes undefined behavior, therefore, to add 1 to the maximum positive integer. (The ubexplore.cc program demonstrates how this can produce impossible results, as with pointers.)
Arithmetic on unsigned integers is much safer with respect to undefined behavior. Unsigned integers are defined to perform arithmetic modulo their size. This means that if you add 1 to the maximum unsigned integer, the result will always be zero.
Dividing an integer by zero causes undefined behavior whether or not the integer is signed.
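A sketch of the contrast (the commented-out lines are the undefined ones):

#include <climits>
#include <cstdio>

int main() {
    unsigned u = UINT_MAX;
    printf("%u\n", u + 1);   // defined: wraps modulo the type's size, prints 0

    int i = INT_MAX;
    // int bad = i + 1;      // undefined behavior: signed overflow
    // int worse = 1 / 0;    // undefined behavior: division by zero, signed or unsigned
    (void) i;
}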
Sanitizers
Sanitizers, which in our makefiles are turned on by supplying SAN=1, can catch many undefined behaviors as soon as they happen. Sanitizers are built into the compiler itself; a sanitizer involves cooperation between the compiler and the language runtime. This has the major performance advantage that the compiler introduces exactly the required checks, and the optimizer can then use its normal analyses to remove redundant checks.
That said, undefined behavior checking can still be slow. Undefined behavior allows compilers to make assumptions about input values, and those assumptions can directly translate to faster code. Turning on undefined behavior checking can make some benchmark programs run 30% slower [link].