Data representation 1: Introduction

Overview

This course investigates how systems software works: what makes programs work fast or slow, and how properties of the machines we program impact the programs we write. We discuss both general ideas and specific tools, and take an experimental approach.

Textbook readings

Outline

  1. Data representation
    • How do computers represent different kinds of information?
    • How do data representation choices impact performance and correctness?
  2. Assembly & machine programming
    • What kind of language is understood by computer processors?
    • How is code you write translated to code a processor runs?
  3. Kernel programming
    • How do hardware and software defend against bugs and attacks?
    • How are operating systems interfaces implemented?
  4. Storage & caching
    • What kinds of computer data storage are available, and how do they perform?
    • How can we improve the performance of a system that stores data?
  5. Process management
    • How can programs on the same computer cooperate and interact?
    • What kinds of operating systems interfaces are useful?
  6. Concurrency
    • How can a single program safely use multiple processors?
    • How can multiple computers safely interact over a network?

Your work

Your grade

Collaboration

Discussion, collaboration, and the exchange of ideas are essential to doing academic work, and to engineering. You are encouraged to consult with your classmates as you work on problem sets. You are welcome to discuss general strategies for solutions as well as specific bugs and code structure questions, and to use Internet resources for general information.

However, the work you turn in must be your own—the result of your own efforts. You should understand your code well enough that you could replicate your solution from scratch, without collaboration.

In addition, you must cite any books, articles, online resources, and so forth that helped you with your work, using appropriate citation practices; and you must list the names of students with whom you have collaborated on problem sets and briefly describe how you collaborated. (You do not need to list course staff.)

On our programming language

We use the C++ programming language in this class.

C++ is a boring, old, and unsafe programming language, but boring languages are underrated. C++ offers several important advantages for this class, including ubiquitous availability, good tooling, the ability to demonstrate impactful kinds of errors that you should understand, and a good standard library of data structures.

Pset 0 links to several C++ tutorials and references, and to a textbook.

Today

Objects

Each program runs in a private data storage space. This is called its memory. The memory “remembers” the data it stores.

Programs work by manipulating values. Different programming languages have different conceptions of value; in C++, the primitive values are integers, like 12 or -100; floating-point numbers, like 1.02; and pointers, which are references to other objects.

An object is a region of memory that contains a value. (The C++ standard specifically says “a region of data storage in the execution environment, the contents of which can represent values”.)

Objects, values, and variables

#include <cstdio>
#include "hexdump.hh"

int i1 = 61;
const int i2 = 62;

int main() {
    int i3 = 63;

    printf("i1: %d\n", i1);
    printf("i2: %d\n", i2);
    printf("i3: %d\n", i3);
}

Which are the objects? Which are the values?

What does the program print?

Pointers

C and C++ pointer types allow programs to access objects indirectly. A pointer value is the address of another object. For instance, in this program, the variable i4 holds a pointer to the object named by i3:

#include <cstdio>
#include "hexdump.hh"

int i1 = 61;
const int i2 = 62;

int main() {
    int i3 = 63;
    int* i4 = &i3;

    printf("i1: %d\n", i1);
    printf("i2: %d\n", i2);
    printf("i3: %d\n", i3);
    printf("value pointed to by i4: %d\n", *i4);
}

Which are the objects? Which are the values?

What does this program print?

Here, the expressions i3 and *i4 refer to exactly the same object. Any modification to i3 can be observed through *i4 and vice versa. We say that i3 and *i4 are aliases: different names for the same object.

Addresses

We now use hexdump_object, a helper function declared in our hexdump.hh helper file, to examine both the contents and the addresses of these objects.

#include <cstdio>
#include "hexdump.hh"

int i1 = 61;
const int i2 = 62;

int main() {
    int i3 = 63;
    int* i4 = &i3;

    printf("i1: %d\n", i1);
    printf("i2: %d\n", i2);
    printf("i3: %d\n", i3);
    printf("i4: %p\n", i4); // note use of `%p` to print a pointer value
    printf("value pointed to by i4: %d\n", *i4);

    hexdump_object(i1);
    hexdump_object(i2);
    hexdump_object(i3);
    hexdump_object(i4);
}

Exactly what is printed will vary between operating systems and compilers. In Docker in class, on my Apple-silicon Macbook, we saw:

4000004010  3d 00 00 00                                       |=...|
4000002024  3e 00 00 00                                       |>...|
40018055ec  3f 00 00 00                                       |?...|
40018055f0  ec 55 80 01 40 00 00 00                           |.U..@...|

But on an Intel-based Amazon EC2 native Linux machine:

0060102c  3d 00 00 00                                       |=...|
00400b0c  3e 00 00 00                                       |>...|
7ffc388f5494  3f 00 00 00                                       |?...|
7ffc388f5498  94 54 8f 38 fc 7f 00 00                           |.T.8....|

The data bytes look similar—identical for i1 through i3—but the addresses vary.

A hexdump printout shows the following information on each line.

Dynamic allocation

Must every data object be given a name? No! In C++, the new operator allocates a brand-new object with no variable name. (In C, the malloc function does the same thing.) The C++ expression new T returns a pointer to a brand-new, never-before-seen object of type T. For instance:

#include <cstdio>
#include "hexdump.hh"

int i1 = 61;
const int i2 = 62;

int main() {
    int i3 = 63;
    int* i4 = new int{64};

    printf("i1: %d\n", i1);
    printf("i2: %d\n", i2);
    printf("i3: %d\n", i3);
    printf("i4: %p\n", i4);
    printf("value pointed to by i4: %d\n", *i4);

    hexdump_object(i1);
    hexdump_object(i2);
    hexdump_object(i3);
    hexdump_object(i4);
    hexdump_object(*i4);
}

This prints something like

i1: 61
i2: 62
i3: 63
i4: 0x4000016eb0
value pointed to by i4: 64
4000004010  3d 00 00 00                                       |=...|
4000002040  3e 00 00 00                                       |>...|
40018055ec  3f 00 00 00                                       |?...|
40018055f0  b0 6e 01 00 40 00 00 00                           |.n..@...|
4000016eb0  40 00 00 00                                       |@...|

The new int{64} expression allocates a fresh object with no name of its own, though it can be located by following the i4 pointer.

Segments

What do you notice about the addresses of these different objects?

Although the values may differ on other operating systems, you’ll see qualitatively similar results wherever you run ./objects.

What’s happening is that the operating system and compiler have located different kinds of object in different broad regions of memory. These regions are called segments, and they are important because objects’ different storage characteristics benefit from different treatment.

Experimenting with the stack

How can we tell that the stack grows down? Do all functions share a single stack? This program uses a recursive function to test. Try running it; what do you see?

#include <cstdio>                                                               
#include "hexdump.hh"                                                           
                                                                                
int i1 = 61;                                                                    
const int i2 = 62;                                                              
                                                                                
int owen(int owens_argument) {                                                  
    int owens_local = owens_argument + 100;                                     
    hexdump_object(owens_local);                                                
    if (owens_argument > 0) {                                                   
        owens_local += owen(owens_argument - 1);                                
    }                                                                           
    return owens_local + rand();                                                
}                                                                               
                                                                                
int main() {                                                                    
    int i3 = 63;                                                                
    int* i4 = new int{64};                                                      
                                                                                
    printf("i1: %d\n", i1);                                                     
    printf("i2: %d\n", i2);                                                     
    printf("i3: %d\n", i3);
    printf("i4: %p\n", i4);
    printf("owen(10): %d\n", owen(10));

    hexdump_object(i1);
    hexdump_object(i2);
    hexdump_object(i3);
    hexdump_object(i4);
    hexdump_object(*i4);
}