## Arrays and pointers

One of C’s more interesting choices is that it explicitly relates pointers and arrays. Although arrays are laid out in memory in a specific way, they generally behave like pointers when they are used. This property probably arose from C’s desire to explicitly model memory as an array of bytes, and it has beautiful and confounding effects.

We’ve already seen one of these effects. The `hexdump`

function has this
*signature* (arguments and return type):

`void hexdump(const void* ptr, size_t size);`

But we can just pass an array as argument to `hexdump`

:

```
char c[10];
hexdump(c, sizeof(c));
```

When used in an expression like this—here, as an argument—the array magically changes into a pointer to its first element. The above call has the same meaning as this:

`hexdump(&c[0], 10 * sizeof(c[0]));`

C programmers transition between arrays and pointers very naturally.

A confounding effect is that unlike all other types, in C arrays are passed to and returned from functions

by referencerather than by value. C is a call-by-value language except for arrays. This means that all function arguments and return values are copied, so that parameter modifications inside a function do not affect the objects passed by the caller—except for arrays. For instance:`void f(int a[2]) { a[0] = 1; } int main() { int x[2] = {100, 101}; f(x); printf("%d\n", x[0]); // prints 1! }`

If you don’t like this behavior, you can get around it by using a struct or a C++

`std::array`

.`#include <array> struct array1 { int a[2]; }; void f1(array1 arg) { arg.a[0] = 1; } void f2(std::array<int, 2> a) { a[0] = 1; } int main() { array1 x = {{100, 101}}; f1(x); printf("%d\n", x.a[0]); // prints 100 std::array<int, 2> x2 = {100, 101}; f2(x2); printf("%d\n", x2[0]); // prints 100 }`

## Pointer arithmetic

C++ extends the logic of this array–pointer correspondence to support
*arithmetic* on pointers as well.

**Pointer arithmetic rule.** In the C abstract machine, arithmetic on
pointers produces the same result as arithmetic on the corresponding
array indexes.

Specifically, consider an array `T a[n]`

and pointers ```
T* p1 =
&a[i]
```

and `T* p2 = &a[j]`

. Then:

**Equality**:`p1 == p2`

if and only if (iff)`p1`

and`p2`

point to the same address, which happens iff`i == j`

.**Inequality**: Similarly,`p1 != p2`

iff`i != j`

.**Less-than**:`p1 < p2`

iff`i < j`

.Also,

`p1 <= p2`

iff`i <= j`

; and`p1 > p2`

iff`i > j`

; and`p1 >= p2`

iff`i >= j`

.**Pointer difference**: What should`p1 - p2`

mean? Using array indexes as the basis,`p1 - p2 == i - j`

. (But the type of the difference is always`ptrdiff_t`

, which on x86-64 is`long`

, the signed version of`size_t`

.)**Addition**:`p1 + k`

(where`k`

is an integer type) equals the pointer`&a[i + k]`

. (`k + p1`

returns the same thing.)**Subtraction**:`p1 - k`

equals`&a[i - k]`

.**Increment and decrement**:`++p1`

means`p1 = p1 + 1`

, which means`p1 = &a[i + 1]`

. Similarly,`--p1`

means`p1 = &a[i - 1]`

. (There are also postfix versions,`p1++`

and`p1--`

, but C++ style prefers the prefix versions.)

No other arithmetic operations on pointers are allowed. You can’t multiply
pointers, for example. (You can multiply *addresses* by casting the pointers
to the address type, `uintptr_t`

—so `(uintptr_t) p1 * (uintptr_t) p2`

—but why
would you?)

## From pointers to iterators

Let’s write a function that can sum all the integers in an array.

```
int sum(int a[], int size) {
int sum = 0;
for (int i = 0; i != size; ++i) {
sum += a[i];
}
return sum;
}
```

This function can compute the sum of the elements of any `int`

array. But
because of the pointer–array relationship, its `a`

argument is really a
*pointer*. That allows us to call it with *subarrays* as well as with whole
arrays. For instance:

```
int a[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
int s1 = sum(a, 10); // 45
int s2 = sum(&a[0], 10); // same as s1
int s3 = sum(&a[1], 5); // sums s[1]...s[5], computing 15
int s4 = sum(a + 1, 5); // same as s3
```

This way of thinking about arrays naturally leads to a style that avoids sizes
entirely, using instead a *sentinel* or *boundary* argument that defines the
*end* of the interesting part of the array.

```
int sum(int* first, int* last) {
int sum = 0;
while (first != last) {
sum += *first;
++first;
}
return sum;
}
```

These expressions compute the same sums as the above:

```
int s1 = sum(a, a + 10);
int s2 = sum(&a[0], &a[0] + 10);
int s3 = sum(&a[1], &a[1] + 5);
int s4 = sum(a + 1, a + 6);
```

Note that the data from `first`

to `last`

forms a *half-open range*. iIn
mathematical notation, we care about elements in the range `[first, last)`

:
the element pointed to by `first`

is included (if it exists), but the element
pointed to by `last`

is not. Half-open ranges give us a simple and clear way
to describe *empty* ranges, such as zero-element arrays: if `first == last`

,
then the range is empty.

Note that given a ten-element array `a`

, the pointer `a + 10`

can be formed
and compared, but *must not* be dereferenced—the element `a[10]`

does not
exist. The C/C++ abstract machines allow users to form pointers to the
“one-past-the-end” boundary elements of arrays, but users must not dereference
such pointers.

So in C, two pointers naturally express a range of an array. The C++ standard
template library, or STL, brilliantly *abstracts* this pointer notion to allow
two *iterators*, which are pointer-like objects, to express a range of *any*
standard data structure—an array, a vector, a hash table, a balanced tree,
whatever. This version of `sum`

works for any container of `int`

s; notice how
little it changed:

```
template <typename It>
int sum(It first, It last) {
int sum = 0;
while (first != last) {
sum += *first;
++first;
}
return sum;
}
```

Some example uses:

```
std::set<int> set_of_ints;
int s1 = sum(set_of_ints.begin(), set_of_ints.end());
std::list<int> linked_list_of_ints;
int s2 = sum(linked_list_of_ints.begin(), linked_list_of_ints.end());
```

## Addresses vs. pointers

What’s the difference between these expressions? (Again, `a`

is an array of
type `T`

, and `p1 == &a[i]`

and `p2 == &a[j]`

.)

```
ptrdiff_t d1 = p1 - p2;
ptrdiff_t d2 = (uintptr_t) p1 - (uintptr_t) p2;
```

The first expression is defined analogously to index arithmetic, so ```
d1 == i -
j
```

. But the second expression performs the arithmetic on *the addresses*
corresponding to those pointers. We will expect `d2`

to equal ```
sizeof(T) *
d1
```

. Always be aware of which kind of arithmetic you’re using. Generally
arithmetic on *pointers* should *not* involve `sizeof`

, since the `sizeof`

is
included automatically according to the abstract machine; but arithmetic on
*addresses* almost always *should* involve `sizeof`

.

## Undefined behavior

Although C++ is a low-level language, the abstract machine is surprisingly
strict about which pointers may be formed and how they can be used. Violate the rules and you’re in hell because you have invoked the dreaded **undefined behavior**.

Given an array `a[N]`

of `N`

elements of type `T`

:

Forming a pointer

`&a[i]`

(or`a + i`

) with`0 ≤ i ≤ N`

is safe.Forming a pointer

`&a[i]`

with`i < 0`

or`i > N`

causes undefined behavior.Dereferencing a pointer

`&a[i]`

with`0 ≤ i < N`

is safe.Dereferencing a pointer

`&a[i]`

with`i < 0`

or`i ≥ N`

causes undefined behavior.

(For the purposes of these rules, objects that are not arrays count as
single-element arrays. So given `T x`

, we can safely form `&x`

and `&x + 1`

and dereference `&x`

.)

What “undefined behavior” means is horrible. A program that executes undefined
behavior is erroneous. But the compiler need not catch the error. In fact, the
abstract machine says **anything goes**: undefined behavior is “behavior … for
which this International Standard imposes no requirements.” “Possible
undefined behavior ranges from ignoring the situation completely with
unpredictable results, to behaving during translation or program execution in
a documented manner characteristic of the environment (with or without the
issuance of a diagnostic message), to terminating a translation or execution
(with the issuance of a diagnostic message).” Other possible behaviors include
allowing hackers from the moon to steal all of a program’s data, take it over,
and force it to delete the hard drive on which it is running. Once undefined
behavior executes, a program may do anything, including making demons fly out
of the programmer’s nose.

Pointer arithmetic, and even pointer comparisons, are also affected by undefined behavior. It’s undefined to go beyond and array’s bounds using pointer arithmetic. And pointers may be compared for equality or inequality even if they point to different arrays or objects, but if you try to compare different arrays via less-than, like this:

```
int a[10];
int b[10];
if (a < b + 10) ...
```

that causes undefined behavior.

If you really want to compare pointers that might be to different arrays—for instance, you’re writing a hash function for arbitrary pointers—cast them to

`uintptr_t`

first.

## Undefined behavior and optimization

A program that causes undefined behavior **is not a C++ program**. The
abstract machine says that a C++ program, by definition, is a program whose
behavior is always defined. The C++ compiler is allowed to assume that its
input is a C++ program. (Obviously!) So the compiler can assume that its input
program will never cause undefined behavior. Thus, since undefined behavior is
“impossible,” if the compiler can prove that a condition would cause undefined
behavior later, it can assume that condition will never occur.

Consider this program:

```
char* x = /* some value */;
assert(x + 1 > x);
printf("x = %p, x + 1 = %p\n", x, x + 1);
```

If we supply a value equal to `(char*) -1`

, we’re likely to see output like
this:

```
x = 0xffffffffffffffff, x + 1 = 0
```

with no assertion failure! But that’s an apparently impossible result.
The printout can only happen if `x + 1 > x`

(otherwise, the assertion will
fail and stop the printout). But `x + 1`

, which equals `0`

, is **less than**
`x`

, which is the largest 8-byte value!

The impossible happens because of undefined behavior reasoning. When the
compiler sees an expression like `x + 1 > x`

(with `x`

a pointer), it can
reason this way:

“Ah,

`x + 1`

. This must be a pointer into the same array as`x`

(or it might be a boundary pointer just past that array, or just past the non-array object`x`

). This must be so because forming any other pointer would cause undefined behavior.“The pointer comparison is the same as an index comparison.

`x + 1 > x`

means the same thing as`&x[1] > &x[0]`

. But that holds iff`1 > 0`

.“In my infinite wisdom, I know that

`1 > 0`

. Thus`x + 1 > x`

always holds, and the assertion will never fail.“My job is to make this code run fast. The fastest code is code that’s not there. This assertion will never fail—might as well remove it!”

## Integer undefined behavior

Arithmetic on signed integers also has important undefined behaviors. Signed
integer arithmetic must never overflow. That is, the compiler may assume that
the *mathematical* result of any signed arithmetic operation, such as `x + y`

(with `x`

and `y`

both `int`

), can be represented inside the relevant type. It
causes undefined behavior, therefore, to add 1 to the maximum positive
integer. (The `ubexplore.cc`

program demonstrates how this can produce
impossible results, as with pointers.)

Arithmetic on *unsigned* integers is much safer with respect to undefined
behavior. Unsigned integers are defined to perform arithmetic modulo their
size. This means that if you add 1 to the maximum positive *unsigned* integer,
the result will always be zero.

Dividing an integer by zero causes undefined behavior whether or not the integer is signed.

## Sanitizers

Sanitizers, which in our makefiles are turned on by supplying `SAN=1`

, can
catch many undefined behaviors as soon as they happen. Sanitizers are built in
to the compiler itself; a sanitizer involves cooperation between the compiler
and the language runtime. This has the major performance advantage that the
compiler introduces exactly the required checks, and the optimizer can then
use its normal analyses to remove redundant checks.

That said, undefined behavior checking can still be slow. Undefined behavior allows compilers to make assumptions about input values, and those assumptions can directly translate to faster code. Turning on undefined behavior checking can make some benchmark programs run 30% slower [link].