Data representation 1: Introduction

Overview

This course is about systems software programming, and how and why systems software works.

What is a system?

Computer systems software is the software that acts as a foundation for other computer applications. Systems software includes low-level operating systems code, support libraries for programming language runtimes (like Python), databases, and network servers.

Systems software runs in challenging circumstances. It often has stringent performance requirements: a modern Web server should be able to serve hundreds of thousands of Web pages a second. It often operates in hostile environments: an operating system must resist attack from malicious people or bots. And, unlike pure algorithms, it exists in the practical world, and the needs of the practical world are always changing. It runs on real hardware, which has interesting performance characteristics that change over time and that affect which algorithms work best. And users and developers always have new needs or provide new workloads.

These challenges make systems software an excellent context to learn computer programming. We think systems software programming is a critical skill for computer scientists. If you understand systems programming, you will be able to analyze and solve more software problems—you will have the tools to tame some of the most confusing bugs there are. Few computer scientists are full-time systems programmers, but every important program I’ve ever worked on has portions that demand a systems approach. And systems programming is really fun: it’s fun to figure out how software really works. You get there by building systems yourself.

Your work

(This slide and the next few summarize several aspects of the course policies.)

Six problem sets
Two midterms and a final in person
Section
- Starting next week
- Attendance checked, especially for simultaneously-enrolled students

Grading

Rough breakdown: 50% assignments, 35% tests, 15% participation
Course grading: A means mastery
Grading with extra credit
- Each problem set has extra credit opportunities
- Final course grades are assigned by computing two scores, one without extra credit and one with extra credit
- You get the maximum of the two grades
No conversion to pass/fail after 5th Monday

Problem sets and collaboration

Collaboration on problem sets is encouraged
- Discuss general strategies, code structure, specific bugs with your classmates
But you must turn in your own code, and you must understand your code
- You should be able to replicate your solution, from scratch, without collaboration or AI help
Cite all help (except staff)
New this year
- Three distinct commits: Each problem set must show evidence of having been worked on over time, in the form of three different commits in the history that pass different numbers of tests
- We may ask students to answer oral questions about their code

AI

The goal of this course is to teach you a valuable way of thinking
You may code with an AI assistant, but:
- You must turn in your own code, and you must understand your code
- Cite any AI assistants you use
No collaboration with humans or AIs on tests and exams

Our programming language

We use the C++ programming language in this class.

C++ is a boring, old, and unsafe programming language, but boring languages are underrated. C++ offers several important advantages for this class, including ubiquitous availability, the ability to demonstrate impactful errors, and a good standard library of data structures.

Pset 0 links to several C++ tutorials and references, and to a textbook.

Class outline

Data representation
- How do computers represent different kinds of information?
- How does data representation impact performance and correctness?
Assembly & machine programming
- What language is understood by computer processors?
- How is code you write translated to code a processor runs?
Kernel programming
- How do hardware and software defend against bugs and attacks?
- How are operating systems interfaces implemented?
Storage & caching
- What kinds of computer data storage are available, and how do they perform?
- How can we improve the performance of a system that stores data?
Process management
- How can programs running on the same computer cooperate and interact?
- What kinds of operating systems interfaces are useful?
Concurrency
- How can a single program safely use multiple processors?
- How can multiple computers safely interact over a network?

Add

Let’s investigate the representation of integers and code.

datarep-add/add.cc

#include <cassert>
#include <cstdio>
#include <string>

int add(int a, int b) {
    return a + b;
}


int main(int argc, char* argv[]) {
    // we must have exactly 3 arguments (including the program name)
    assert(argc == 3);

    // convert the argument strings to integers
    int a = std::stoi(argv[1]);
    int b = std::stoi(argv[2]);

    // print their sum
    printf("%d + %d = %d\n", a, b, add(a, b));
}

Questions

What are a and b?
Where are a and b?
What is even happening?

Primitive types, values, and objects

A type defines a set of related values in a programming language.
A primitive type is irreducible, meaning its values aren’t composed of smaller values.
- Different programming languages have different primitive types.
- In C++, the primitive types include integers, like 0 and 1; floating point numbers, like 0.5 and INFINITY; booleans true and false; and pointers.
An object is a value stored in memory.
- The C standard defines an object as “a region of data storage in the execution environment, the contents of which can represent values”.
- Objects, unlike values, can change.
- Because, unlike values, they are present somewhere in the real physical world!

Where are a and b?

M1 Mac board

From https://www.ifixit.com/News/46884/m1-macbook-teardowns-something-old-something-new

Computer addresses

Computer software in execution deals with zeroes and ones: bits
Bits are collected into groups of eight called bytes
- A byte is a kind of integer with a value between 0 and 255
- 0b00000000 (0x00) is the byte (with decimal value) 0
- 0b00000001 (0x01) is the byte 1
- 0b00001101 (0x0d) is the byte 13
And the bytes are stored in memory
- Memory is made up of billions and billions of transistors—“MOSFETs” (metal–oxide–semiconductor field-effect transistors: don’t ask)
We need a way to refer to specific bytes of memory
- So we can refer to an object—or change it
- Which MOSFETs contain the bytes that make up a and b?
Each byte of memory has an address
- Which is stored as another integer!

Stored-program architecture

Modern computers use a stored-program architecture
- Instructions and data are both stored as bytes in the same underlying memory
Instructions work on primitive machine values
- Including one-byte, two-byte, four-byte, and eight-byte integers, and four-byte and eight-byte floating point numbers, as well as others
Mapping between programming language types and machine values is the work of the compiler
- The compiler takes a program and changes it into equivalent instructions
Can different programs compile to the same instructions?