Data representation—Hardware and abstract machines

Three questions

We saw several interesting ways to complete the simple task of adding two integers by writing C/C++ code. Let's make sure we don't think that there's any magic taking place in our computers (besides the magic of abstraction!) when we do this. In particular, let's answer the following three questions:

How does our machine represent numbers, for instance the numbers 2 and 3?
How does it actually add numbers in these representations and get the right answer? And
How do we know what a program's behavior will be?

While a program that simply adds two numbers might seem trivial, digging into it can teach you quite a bit. In particular, you will learn what has to go right to produce even something as simple as the sum of two integers. With a deep understanding of what has to go right to achieve the behavior you desire, you'll be in a position to make a good guess about what might have gone wrong when a program doesn't produce the behavior you'd like. Even more important, by understanding what has to go right to achieve the behavior you desire, you'll be in a position to create applications that are resilient to adversarial attacks, where someone tries to get our machines to do something we don't think is possible. Like adding two numbers by jumping into a GIF.

What are bits, physically?

The vast majority of modern computing systems work on binary values. Every datum in our digital worlds is comprised of nothing more than groups of physical things that we read and write as some combination of ones or zeros. On or off. High or low. Left or right.

Most of the digital devices you have use one or more of the following four basic techniques:

Charge on a capacitor, which is like asking if a glass of water is full (representing a one) or empty (zero). The DDR4 SDRAM main memory in your computer uses tiny capacitors to store electrons like the glass stores water. This SDRAM is volatile storage, meaning that it stores values only as long as the system is powered, but the SSD (solid-state drive) storage in that same computer is flash memory, which injects charge into floating gates to create (mostly) nonvolatile storage for the bits in your data files.
Current flowing or not through an electrical circuit, which is like asking if a water faucet is on or off. A light switch that's on has current flowing through it; when it's off, no current is flowing. ECL circuitry is built on this idea, which was an early chip technology and produced very fast circuits, but with very high power consumption.
Polarity of magnetic material, which is how we store data on magnetic tapes. And on floppy (and not so floppy) disks. If your hard drive contains rigid spinning disks, it stores data magnetically too. While the main memory of your computer is a random-access device (given an address you can immediately lookup the value stored at that address), a computer tape is not. You must read from the beginning of the tape until you find the values you want.
Presence or absence of something. This is a very simple idea. A hole in a paper tape represents a one; no hole is a zero. A burnt spot on a cdrom is a zero; no burnt spot is a one. A blown connection in an eerom is a zero; still connected it's a one. Punch cards are an interesting version of this idea that did not use a byte-sized encoding to store letter and numbers. Some of the most popular punch card devices used what's called a Hollerith Code, where it took 12 bits to encode a letter or number, and where no bit pattern ever had more than three 1s (i.e., three holes max in any column). This particular encoding helped maintain the physical integrity of the punch card no matter what was stored on it!

Each of these technologies for storing bits is hidden behind an abstraction boundary. We use these different approaches every day and ever only think about reading/writing 1s and 0s.

With the ability to represent zero and one, we can build any number by stringing together multiple digits. In a base-2 number, each digit is either zero or one, just as a base-10 number uses digits that are anything between 0-9.

The number 2 in base-10 becomes 10 in base-2. 3 in base-10 becomes 11 in base-2. 5 in base-10 becomes 101 in base-2. And so forth. See the beginning of Section 2.1 in CS:APP3e for more details..

Computer memories

Computer memories are basically an array of storage locations where we can store binary valuess. With the Lite Brite toy, we choose a location and place a peg there. In computer memories, we tell the computer which storage location we want by giving it the index of the location in this array. Using this index, we can read or write that particular memory location with a value. This index is called a storage location's address. This address is also made up of binary digits.

With computer memories, you have two key design decisions to make:

How wide is each addressible memory location? And
How many locations are in your memory array? This is actually two questions:

What is the maximum number of locations that you can have in your memory array?
And how many locations do you have in your current memory array?

Considering the questions in reverse order, you can't store more items in your memory array than you have physical locations in which to store them. This is what people mean when they ask how much physical memory do you want or do you have. We'll see this in detail in pset 4 when you build your memory translation in your own tiny OS.

The other sub-question limits the number of bits in the address we use to index into memory. Why do machines limit the number of bits in a memory address? Well, let's put this question aside for a moment and look at the other key design decision we make about memory arrays.

For maximum flexibility, you could make each bit in memory directly addressable, but we will see that most of the time we work with groupings of multiple bits. You can't even express the number 3 (base-10) without using two binary bits grouped together. On the other hand, if you make the number of bits in each addressable memory location too big, you waste a lot of hardware when you store a small value, like 3 (base-10). If each addressable memory location was, say, 64-bits wide, then 3 (base-10) would be stored as 63 0's followed by two 1's. That sounds expensive and wasteful. We will experiment with a similar tradeoff when we build buffer caches in pset 3.

There's no right answer here, just a tradeoff that eventually becomes a standard. Most computer systems today associate an address with each 8 bits (a byte) of data.

Why is each addressible memory location exactly 8-bits wide? This is the same question as, why do we set a maximum memory address? The simple answer is: It's really hard to write a program to run on a piece of hardware if you don't know the maximum width of things like memory locations or memory addresses. You could build something where you ask about the size, in bits, of a number's representation before you read those bits to determine the value of that number. And you could do this for every number you read, write, and opeate on, but it would be slow and tedious work.

In writing and then executing a program, we are making our abstract ideas concrete and real. A computer can't add 2 and 3 without making them into concrete representations.

Everybody can add

Speaking of which, how do we perform the magic of addition using these representations and get the right answer? Take a course like cs141 to learn in detail, but the short answer combines memory--circuits that store 1s and 0s--with a network of simple logic gates--ANDs, ORs, and NOTs.

Picture in your mind a memory block feeding a blob of logic, whose result feeds back to the memory. You have a mental model in your mind by now for memories, and you should think of the blob as just a network of inverters (~), and (&), or (|), and xor (^) gates. The parentheticals indicate the C bitwise operators for each logic gate. Left and right shifts (<< and >>) are just wiring in hardware.

In software, we can emulate the operation of a hardware adder using C variables (i.e., our memory) and C's bitwise operators (i.e., our logic blob). If we put these C statements in a procedure called add, we have built ourselves an adder without the magic of the C +-operator.

Program behavior

The datarep2 directory contains several programs that supposedly accomplish our simple add task. Do they all have the same behavior when run with lots of different inputs?

No, addc operates on 8-bit integers, and it fails in ways that add, which operates on 32-bit intergers, doesn't. You can successfully add 514 and 3 if the variables /(i.e., memory locations/) you use are 32 bits in width. You won't be successful if the variables are only 8-bits wide. By the way, all the 8-bit-wide adders, whether using bitwise operators or the C +-operator, fail in the same ways.

Unfortunately, 32-bit memory locations don't allow us to successfully add any integer to another. Adding 5000000000 and 1 fails to produce a correct answer in all our designs.

Let's put aside the problem that we didn't work hard enough to produce a program that can add any two arbitrary integers. In some sense, that's a task specification problem.

Instead, let's focus on the disturbing issue that we didn't see anything during these executions that might indicate that our simple addition failed on certain inputs. A programmer can add protections against such failures, as we did in addc-safe, but when do we know we need to include such protections?

To answer this question, we need to know what is expected behavior.

An abstract machine

Let's step back for a second and think about what we want. We want a programmer to be able to write a program (possibly several programs), which is compiled and linked with some libraries (possibly by different compilers and with different libraries and different levels of optimization), to run on some particular machine (possibly several different machines), and when run, we observe our program consistently producing a single observed behavior /(no matter which combination of libraries, compilers, optimizers, and hardwares we used/). And, by the way, the behavior we observed was the behavior we thought our program was going to produce when we wrote it.

In thinking about this, let's put aside the issue of writing a program with syntax errors. An example of a syntax error is labeling your main routine as Main instead of main. The compiler catches and alerts the programmer to syntax errors. On the other hand, you can write a syntactically correct C program that has a meaning and observed behavior different that what you wanted. Even worse, you can write a syntactically correct C program that has no meaning.

What good program language designers do for us as programmers is design programming languages that make it easy for us to understand the behavior of the programs we write. They define an abstract machine that all the other system components (i.e., the compilers, optimizers, and hardware chips) work together to implement. The goal is that the observable behavior of the real system matches the defined behavior of this abstract machine. In other words, what the programmer writes in C is given meaning by the operation of this abstract machine.

It's important to realize that standards for programming languages, like C and C++, don't just tell you how to write syntactically correct programs, but they also define what type of behavior will result from the code you write. The C/C++ Standards specify requirements for the implementations of the C/C++ programming languages. And it is up to the compiler designer, library writers, OS gurus, and hardware architects to make sure what they do while handling a program remains true to the abstract machine's defined operation.

Sounds simple enough.

Looking at the C++ Standard

Great, let's look at some parts of the C++ standard.

On page 22 (of the PDF), the Standard defines a memory model, in which the fundamental storage unit is a byte, which is defined to be at least 8 contiguous bits in size. Convenient given that our physical memories (for most machines) are 8 bits in size. Furthermore, each byte is defined to have a unique address.

On page 23, the C/C++ Standard also defines something called an object, which C/C++ programs create, destroy, refer to, access, and manipulate. These objects occupy a region of storage and contain values. We'll talk about these objects in detail in the upcoming lectures.

As we know, the root reason that adding 514 and 3 with 8-bit storage locations fails is because we cannot represent the 514 in 8 bits. 0b11111111 is 255; to store 514, we need at least 10 bits.

Ok, but at some point in the execution some piece of our computer must have known that 514 didn't fit in the program-specified storage. Why didn't entity alert the user of the program that things were going to fail? Why the silence?

What in this standard says that silence is the right response here?

Clause 4.6 (page 25) of the C++ Standard says, "The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine ...."

So far, so good. The compilers, libraries, optimizers and real machine have a lot of freedom to optimize for different goals and make different tradeoffs.

BUT the C++ Standard also says (pAGE 20) that a "well-formed program" is "C++ program constructed according to the syntax rules, diagnosable semantic rules, and the one-definition rule."

No syntax errors in our programs, check. One definition for each object? Without getting into details here, we never encounter a point in this program or its execution when we have an ambiguous definition of an object.

What's this about "diagnosable semantic rules"? The Standard says (page 21), "The set of diagnosable rules consists of all syntactic and semantic rules in this International Standard except for those rules containing an explicit notation that 'no diagnostic is required' or which are described as resulting in 'undefined behavior.'"

Oh no.

The Standard also says (page 25), "A conforming implementation executing a well-formed program shall produce the same observable behavior as one of the possible executions of the corresponding instance of the abstract machine with the same program and the same input. However, if any such execution contains an undefined operation, this International Standard places no requirement on the implementation executing that program with that input (not even with regard to operations preceding the first undefined operation)."

Examples of code with undefined behavior

So, what kind of C operations are specified as possibly resulting in undefined behavior?

Access outside the bounds of an array: A good coding practice is to check if an index value is within the bounds of an array before you do the array access using that index. In C/C++, it is not just good practice, but it is necessary if you want to avoid undefined behavior. C/C++ declare an out-of-bounds array access as undefined behavior. Since the abstraction machine says nothing about the behavior, the actual machine can do anything it wants.

Weird things happen under undefined behavior. If you make an array access and then check whether the index is within the bounds of the array, an optimizing compiler might determine that this bounds check is redundant code, since it can reason that the index must have been in bounds when you did the array access. Why? Because the program would not be well-formed if the index were out of bounds. This is convoluted logic, but that's what happens when you start wade into waters along the shores of undefined behaviors.

See the comments in the routines access1 and access2 in ubexplore.cc for more details.

Integer overflow:. This knowledge of undefined behavior and silence about it helps explain our add problem.

Clause 8 of the Standard says (page 107), "If during the evaluation of an expression, the result is not mathematically defined or not in the range of representable values for its type, the behavior is undefined."

Other modern languages (e.g., Java) rigidly prescribe behaviors that C/C++ leaves unspecified. While life is more predictable in these other languages, they also sometimes allow conforming implementations to choose among several behaviors. Your programming world is thus a more sane space, but still not perfectly safe. C/C++ is a good language in which to learn to be a careful and thoughtful programmer.

It is important to note that the C++ Standard makes a distinction between signed and unsigned integers. Operations on unsigned integers are completely defined (page 94): "Unsigned integers shall obey the laws of arithmetic modulo 2^n where n is the number of bits in the value representation of that particular size of integer." This means that overflow can never occur with unsigned arithmetic.

Once you know the defined behavior for unsigned arithmetic, perhaps the behavior you observe feels less jarring?

Let's think about your computing devices for a moment. The hardware adder in your laptop computer, even your phone, knows when addition of two binary values overflows, but the C/C++ standard says this overflow signal doesn't mean anything for unsigned addition, and the compiled program is going to ignore it for signed addition, even if the hardware signals, it because what the program does once a signed addition occurs is undefined.

The Standard basically says, programs shouldn't perform signed additions that will result in an overflow. And if an overflow could occur, it's up to the program to avoid the situation before it happens.

The routine safe_signed_inc in ubexplore.cc shows one way to fix a signed overflow issue that is crazy hard to catch, especially when compiler optimizers make seemingly sensible optimizations based on abstract facts like i + 1 is always greater than i. Except when you have fixed-length representations for i!

Personally, the instructors don't like to remember all the situations in which undefined behavior can ruin our programming day. Luckily, we don't have to and neither do you. You simply have to remember that undefined behavior can lead to misery and that some pretty smart programmers have built tools to help us avoid having undefined behavior lurking in our programs. Some of these tools are called sanitizers and CS61 makefiles make it easy for you to use them. Just type make SAN=1 your_executable_target.