Data representation 1: Introduction

Overview

This course investigates how systems software works: what makes programs work fast or slow, and how properties of the machines we program impact the programs we write. We discuss both general ideas and specific tools, and take an experimental approach.

Textbook readings

Outline

Data representation
- How do computers represent different kinds of information?
- How do data representation choices impact performance and correctness?
Assembly & machine programming
- What kind of language is understood by computer processors?
- How is code you write translated to code a processor runs?
Kernel programming
- How do hardware and software defend against bugs and attacks?
- How are operating systems interfaces implemented?
Storage & caching
- What kinds of computer data storage are available, and how do they perform?
- How can we improve the performance of a system that stores data?
Process management
- How can programs on the same computer cooperate and interact?
- What kinds of operating systems interfaces are useful?
Concurrency
- How can a single program safely use multiple processors?
- How can multiple computers safely interact over a network?

Your work

Six problem sets
Midterm and final
Section
- Starting mid-next week
- Attendance checked for simultaneously-enrolled students

Your grade

Rough breakdown: >50% assignments, <35% tests, 15% participation
Course grading: A means mastery

Collaboration

Discussion, collaboration, and the exchange of ideas are essential to doing academic work, and to engineering. You are encouraged to consult with your classmates as you work on problem sets. You are welcome to discuss general strategies for solutions as well as specific bugs and code structure questions, and to use Internet resources for general information.

However, the work you turn in must be your own—the result of your own efforts. You should understand your code well enough that you could replicate your solution from scratch, without collaboration.

In addition, you must cite any books, articles, online resources, and so forth that helped you with your work, using appropriate citation practices; and you must list the names of students with whom you have collaborated on problem sets and briefly describe how you collaborated. (You do not need to list course staff.)

On our programming language

We use the C++ programming language in this class.

C++ is a boring, old, and unsafe programming language, but boring languages are underrated. C++ offers several important advantages for this class, including ubiquitous availability, good tooling, the ability to demonstrate impactful kinds of errors that you should understand, and a good standard library of data structures.

Pset 0 links to several C++ tutorials and references, and to a textbook.

Today

Objects
Pointers
Addresses
Segments
Memory

Objects

Each program runs in a private data storage space. This is called its memory. The memory “remembers” the data it stores.

Programs work by manipulating values. Different programming languages have different conceptions of value; in C++, the primitive values are integers, like 12 or -100; floating-point numbers, like 1.02; and pointers, which are references to other objects.

An object is a region of memory that contains a value. (The C++ standard specifically says “a region of data storage in the execution environment, the contents of which can represent values”.)

Objects, values, and variables

datarep1/objects.cc

#include <cstdio>
#include "hexdump.hh"

int i1 = 61;
const int i2 = 62;

int main() {
    int i3 = 63;

    printf("i1: %d\n", i1);
    printf("i2: %d\n", i2);
    printf("i3: %d\n", i3);
}

Which are the objects? Which are the values?

What does the program print?

i1: 61
i2: 62
i3: 63

Pointers

C and C++ pointer types allow programs to access objects indirectly. A pointer value is the address of another object. For instance, in this program, the variable i4 holds a pointer to the object named by i3:

#include <cstdio>
#include "hexdump.hh"

int i1 = 61;
const int i2 = 62;

int main() {
    int i3 = 63;
    int* i4 = &i3;

    printf("i1: %d\n", i1);
    printf("i2: %d\n", i2);
    printf("i3: %d\n", i3);
    printf("value pointed to by i4: %d\n", *i4);
}

Which are the objects? Which are the values?

What does this program print?

i1: 61
i2: 62
i3: 63
value pointed to by i4: 63

Here, the expressions i3 and *i4 refer to exactly the same object. Any modification to i3 can be observed through *i4 and vice versa. We say that i3 and *i4 are aliases: different names for the same object.

Addresses

We now use hexdump_object, a helper function declared in our hexdump.hh helper file, to examine both the contents and the addresses of these objects.

#include <cstdio>
#include "hexdump.hh"

int i1 = 61;
const int i2 = 62;

int main() {
    int i3 = 63;
    int* i4 = &i3;

    printf("i1: %d\n", i1);
    printf("i2: %d\n", i2);
    printf("i3: %d\n", i3);
    printf("i4: %p\n", i4); // note use of `%p` to print a pointer value
    printf("value pointed to by i4: %d\n", *i4);

    hexdump_object(i1);
    hexdump_object(i2);
    hexdump_object(i3);
    hexdump_object(i4);
}

Exactly what is printed will vary between operating systems and compilers. In Docker in class, on my Apple-silicon MacBook, we saw:

4000004010  3d 00 00 00                                       |=...|
4000002024  3e 00 00 00                                       |>...|
40018055ec  3f 00 00 00                                       |?...|
40018055f0  ec 55 80 01 40 00 00 00                           |.U..@...|

But on an Intel-based Amazon EC2 native Linux machine:

0060102c  3d 00 00 00                                       |=...|
00400b0c  3e 00 00 00                                       |>...|
7ffc388f5494  3f 00 00 00                                       |?...|
7ffc388f5498  94 54 8f 38 fc 7f 00 00                           |.T.8....|

The data bytes look similar—identical for i1 through i3—but the addresses vary.

On Intel Mac OS X:

103c63020  3d 00 00 00                                       |=...|
103c5ef60  3e 00 00 00                                       |>...|
7ffeebfa4abc  3f 00 00 00                                       |?...|
7ffeebfa4ab0  bc 4a fa eb fe 7f 00 00                           |.J......|

And on Docker on an Intel Mac:

56499f239010  3d 00 00 00                                       |=...|
56499f23701c  3e 00 00 00                                       |>...|
7fffebf8b19c  3f 00 00 00                                       |?...|
7fffebf8b1a0  9c b1 f8 eb ff 7f 00 00                           |........|

A hexdump printout shows the following information on each line.

An address, like 4000004010. This is a hexadecimal (base-16) number indicating the value of the address of the object. A line contains one to sixteen bytes of memory starting at this address.
The contents of memory starting at the given address, such as 3d 00 00 00. Memory is printed as a sequence of bytes, which are 8-bit numbers between 0 and 255. All modern computers organize their memory in units of 8-bit bytes.
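A textual representation of the memory contents, such as |=...|. Bytes that correspond to printable characters are shown as those characters, and all other bytes are shown as periods. This is useful when examining memory that contains textual data; for other kinds of data it looks like random garbage.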

Dynamic allocation

Must every data object be given a name? No! In C++, the new operator allocates a brand-new object with no variable name. (In C, the malloc function does the same thing.) The C++ expression new T returns a pointer to a brand-new, never-before-seen object of type T. For instance:

#include <cstdio>
#include "hexdump.hh"

int i1 = 61;
const int i2 = 62;

int main() {
    int i3 = 63;
    int* i4 = new int{64};

    printf("i1: %d\n", i1);
    printf("i2: %d\n", i2);
    printf("i3: %d\n", i3);
    printf("i4: %p\n", i4);
    printf("value pointed to by i4: %d\n", *i4);

    hexdump_object(i1);
    hexdump_object(i2);
    hexdump_object(i3);
    hexdump_object(i4);
    hexdump_object(*i4);
}

This prints something like

i1: 61
i2: 62
i3: 63
i4: 0x4000016eb0
value pointed to by i4: 64
4000004010  3d 00 00 00                                       |=...|
4000002040  3e 00 00 00                                       |>...|
40018055ec  3f 00 00 00                                       |?...|
40018055f0  b0 6e 01 00 40 00 00 00                           |.n..@...|
4000016eb0  40 00 00 00                                       |@...|

The new int{64} expression allocates a fresh object with no name of its own, though it can be located by following the i4 pointer.

Segments

What do you notice about the addresses of these different objects?

i3 and i4, which are objects corresponding to variables declared local to main, are located very close to one another. In fact they are just 4 bytes apart: i3 directly abuts i4. Their addresses are quite high. On native Linux, their addresses are close to 2⁴⁷!
i1 and i2 are at much lower addresses, and they do not abut. i2’s location is below i1, and about 0x2000 bytes away.
The anonymous storage allocated by new int is located between i1/i2 and i3/i4.

Although the values may differ on other operating systems, you’ll see qualitatively similar results wherever you run ./objects.

What’s happening is that the operating system and compiler have located different kinds of object in different broad regions of memory. These regions are called segments, and they are important because objects’ different storage characteristics benefit from different treatment.

i2, the const int global object, has the smallest address. It is in the code or text segment, which is also used for read-only global data. The operating system and hardware ensure that data in this segment is not changed during the lifetime of the program. Any attempt to modify data in the code segment will cause a crash.
i1, the int global object, has the next highest address. It is in the data segment, which holds modifiable global data. This segment keeps the same size as the program runs.
After a jump, the anonymous new int object pointed to by i4 has the next highest address. This is the heap segment, which holds dynamically allocated data. This segment can grow as the program runs; it typically grows towards higher addresses.
After a larger jump, the i3 and i4 objects have the highest addresses. They are in the stack segment, which holds local variables. This segment can also grow as the program runs, especially as functions call other functions; on most processors it grows down, from higher addresses to lower addresses.

Experimenting with the stack

How can we tell that the stack grows down? Do all functions share a single stack? This program uses a recursive function to test. Try running it; what do you see?

#include <cstdio>
#include <cstdlib>
#include "hexdump.hh"

int i1 = 61;
const int i2 = 62;

int owen(int owens_argument) {
    int owens_local = owens_argument + 100;
    hexdump_object(owens_local); // shows where this call's local variable lives
    if (owens_argument > 0) {
        owens_local += owen(owens_argument - 1);
    }
    return owens_local + rand(); // `rand` is declared in <cstdlib>
}

int main() {
    int i3 = 63;
    int* i4 = new int{64};

    printf("i1: %d\n", i1);
    printf("i2: %d\n", i2);
    printf("i3: %d\n", i3);
    printf("i4: %p\n", i4);
    printf("owen(10): %d\n", owen(10));

    hexdump_object(i1);
    hexdump_object(i2);
    hexdump_object(i3);
    hexdump_object(i4);
    hexdump_object(*i4);
}