Datarep Section 1: C++ data structures

This section covers concepts from the C++ standard library, especially container data structures and iterators, using an example that might remind you of part of Problem Set 1.

Prework: Container data structures and std::map

How section works

We release section material via the CS 61 course site (where you’re reading this now) and the cs61-sections repository (obtain a local copy with a command like git clone git@github.com:cs61/cs61-sections). TFs will work through the section material at their own pace, stopping at exercises, questions, and discussions so students can work through the material together. It’s best if you bring your laptop. Solutions for most section problems are available here, but you will learn more if you try to solve the problems yourself first.

Section works well if students ask questions, and sometimes TFs will take the section in an unexpected direction based on student requests. Please speak up! If you have questions from lecture, now’s the time to ask.

Students will be expected to have digested all material from section handouts.

Container data structures (prework and reference)

C++’s standard containers library (ref1, ref2, ref3) offers several powerful container data structures, including:

std::map<K, V>, a balanced tree structure that can find items by key and can enumerate all items in sorted order by key;
std::vector<T>, a growable contiguous array of objects;
std::list<T>, a linked list of objects;
std::unordered_map<K, V>, a hash table that can quickly find items by key and can enumerate all items, but not in any particular order;
std::multimap<K, V>, a version of std::map that can store multiple values with the same key;
and several more.

Container data structures have similar APIs, and the algorithms library (ref1, ref2) offers generic algorithms that can operate on many containers.

C++ container APIs resemble APIs in other languages, but do have some unique features, especially in the way they use iterators (ref) to represent positions in a data structure.

In this section, we’ll concentrate on std::map.

`std::map`

std::map<K, V> (ref1, ref2, ref3) is a container that maps keys of type K to values of type V. The key and value types can be almost any C++ types, but K objects must support comparison with the less-than operator <. (Integers, pointers, and C++ std::strings all support <; so do many other types.)

This code creates an empty map, then prints every element of the map (there are none, so it prints nothing).

std::map<uintptr_t, const char*> m;

for (auto it = m.begin(); it != m.end(); ++it) {
    fprintf(stderr, "key %zu, value %s\n", it->first, it->second);
}

The it object is an iterator that represents a position in, or cursor into, the map. A map iterator “points at” a key–value pair; the key is in it->first, the value in it->second. (The iterator is actually an object, not a pointer, but it supports pointer notation anyway.)

Several map functions return iterators. The m.begin() iterator points at the first key–value pair in the map. The special m.end() iterator represents the absence of an item. Its position is “one past” of all the valid positions in the map; it’s illegal to refer to m.end()->first or m.end()->second. There’s also m.find(K), which returns an iterator pointing at the element with a given key, and m.lower_bound(K) and a.upper_bound(K), which return iterators pointing at elements nearest a given key (in a sense defined more precisely below).

Iterators can also be moved and compared. Move an iterator to the next position with ++it (as you see above), and to the previous position with --it. Compare map iterators with == and !=; two iterators are equal if and only if they point to the same element.

The type of m.begin() is std::map<uintptr_t, const char*>::iterator. Ugh! Rather than type that whole thing, in most situations you can just say auto and the compiler will figure it out.

Insert elements with the insert member function, which takes a key–value pair argument:

datareps1/map1.cc

#include <map> // gain access to `std::map`
#include <cstdio>
#include <cstdint> // gain access to `uintptr_t`

int main() {
    std::map<uintptr_t, const char*> m;
    m.insert({0, "Hello"});
    m.insert({2, "!"});
    m.insert({1, "world"});

    for (auto it = m.begin(); it != m.end(); ++it) {
        fprintf(stderr, "key %zu, value %s\n", it->first, it->second);
    }
}

// ==> key 0, value Hello
//     key 1, value world
//     key 2, value !

insert({K, V}) does nothing if the element with key K already exists. If you want to modify any existing element, use m.insert_or_assign(K, V) or, shorter, m[K] = V.

Question. m.find(K) returns an iterator pointing at the element with key K. What does it return if K isn’t contained in the map? It’s got to return something!

It returns the iterator representing the absence of an element, m.end().

Question. How can you tell whether an element with key K is in a map?

Using what we’ve discussed so far, you can check whether m.find(K) != m.end(). There are also shorthands that might be a little faster or easier to read: m.contains(K) == true when K is in the map, and m.count(K) != 0 when K is in the map.

Question. Write a test program to help figure out the exact sense in which m.lower_bound(K) returns an iterator pointing at the element “nearest” a given key.

I’d want to know if lower_bound returns the nearest iterator by going to the nearest lower key, or the nearest higher key, or the nearest key by distance. I’m also curious of when lower_bound return m.end(), if it ever does. Here’s a program testing those questions:

datareps1/map2.cc

#include <map>
#include <cstdio>
#include <cstdint>
std::map<uintptr_t, const char*> m;

void print_lower_bound(uintptr_t key) {
    auto it = m.lower_bound(key);
    if (it == m.end()) {
        fprintf(stderr, "lower_bound for %zu is at end\n", key);
    } else {
        fprintf(stderr, "lower_bound for %zu is at key %zu\n", key, it->first);
    }
}

int test() {
    m.insert({1, ""});
    m.insert({3, ""});
    m.insert({11, ""});
    print_lower_bound(0);
    print_lower_bound(1);
    print_lower_bound(2);
    print_lower_bound(4);
    print_lower_bound(10);
    print_lower_bound(11);
    print_lower_bound(12);
}

// ==> lower_bound for 0 is at key 1
//     lower_bound for 1 is at key 1
//     lower_bound for 2 is at key 3
//     lower_bound for 4 is at key 11
//     lower_bound for 10 is at key 11
//     lower_bound for 11 is at key 11
//     lower_bound for 12 is at end

Question. What does m.lower_bound(K) return?

It appears that m.lower_bound(K) returns the iterator pointing at the first element whose key is greater than or equal to K, or m.end() if K is greater than all keys in the map.

Question. What does m.upper_bound(K) return?

m.upper_bound(K) returns the iterator pointing at the first element whose key is greater than K, or m.end() if K is greater than or equal to all keys in the map.

Question. Say that m.lower_bound(k) == m.upper_bound(k) for some key k. What can you infer about the contents of map m?

That k is not included in the map (so m.find(k) == m.end()).

Erase elements with erase, which takes either a key or one or two iterators. m.erase(K) erases the element with key K, if there is one. m.erase(FIRST, LAST) erases all elements between iterator FIRST and iterator LAST; m.erase(m.begin(), m.end()) clears the container. m.erase(IT) erases the single element at iterator IT.

Cloud Flap Sensor 5000

Flappy birds flapping

Source image from Shutterstock

In this section, you’ll help build the cloud storage back end for the Flappy Bird Company’s new Cloud Flap Sensor 5000. This high-tech sensor straps onto a bird’s back and counts the number of times it flaps its wings. When possible, the sensor sends the number of recent flaps back to your server over the cell phone network.

A sample from the sensor has three components:

The start time, an integer measured in seconds since January 1, 1970.
The duration, an integer number of seconds.
The flap count, which is the integer number of times the bird flapped its wings between time start and time start + duration.

(A real sensor message would have more components, like a sensor ID, but we’ll assume for now there’s only one interesting bird.)

The sensor typically sends messages with increasing start time and no gaps: a message with start == S and duration == D would be followed by a message with start == S+D. However, birds go all over the place, so sometimes messages are lost and there will be gaps. A valid sensor will never generate samples that overlap in time or samples with zero duration.

These are the important functions:

void add_sample(uintptr_t start, size_t duration, size_t flapcount);
bool has_sample(uintptr_t t);

(uintptr_t and size_t are types of unsigned integer, where “unsigned” means they cannot represent negative numbers. On x86-64 machines, they are both synonyms for unsigned long, and they are large enough to represent numbers from 0 to 18,446,744,073,709,551,615. More information.)

add_sample records a sample from the sensor.
has_sample returns true if a sample has been recorded that covers time t (that is, a sample with start <= t && t < start + duration), and false if not.

You’ll implement add_sample and has_sample, using a global data structure you’ll define.

Question. Think about how would you implement these functions. What data structures might you use? What factors might affect that choice?

Cloud Flap Sensor Map

In the rest of this section, we’ll implement Cloud Flap Sensor 5000 using std::map. Here’s our proposed type for the global data structure (see in datareps1/flapmap.hh):

datareps1/flapmap.hh

struct sample {
    uintptr_t start;
    size_t duration;
    size_t flapcount;
};

std::map<uintptr_t, sample> flapmap;
using flapmap_iter = std::map<uintptr_t, sample>::iterator; // less typing later

The flapmap holds every sample received from the sensor, indexed by start time.

Exercise. Implement add_sample(uintptr_t start, size_t duration, size_t flapcount) in flapmap1.cc. Test your work.

datareps1/solutions/flapmaps1.cc

void add_sample(uintptr_t start, size_t duration, size_t flapcount) {
    flapmap.insert({start, {start, duration, flapcount}});
}

Or, more verbosely:

void add_sample(uintptr_t start, size_t duration, size_t flapcount) {
    sample s = { .start = start, .duration = duration, .flapcount = flapcount };
    flapmap.insert({start, s});
}

Question. What does your add_sample do with a duplicate sample (meaning a sample with the same start time as a previously-recorded sample)?

Ours throws the new sample away! If you used insert_or_assign, yours would replace the old sample with the new one.

(Offline) Discussion. If the Flap Sensor 5000 were to break, it might start reporting invalid samples. It’s important to think about robustness while you program. What kinds of arguments to add_sample might indicate a broken sensor? What should add_sample do when passed an invalid sample?

We’d be suspicious of an add_sample call with several kinds of weird arguments:

duration is zero.

start represents a time unreasonably far in the past.

start + duration represents a time unreasonably far into the future.

flapcount represents an unrealistically large count.

The start—start + duration interval overlaps a previously recorded sample.

There’s an art to determining what’s “unreasonably far” or “unrealistically large”; picking good thresholds can require domain knowledge or real-world experience. For example, the threshold for an unrealistically large flap count depends on the kind of bird—for a seagull, which usually flaps 2.5/sec, 20/sec is very unrealistic, but hummingbirds can go up to 80/sec. You might think that any sample in the future is invalid, but it’s super common for clocks in the real world to get out of sync, so it’d be wise to accept future samples with some error tolerance.

As for what to do with an invalid sample, there’s no right answer! Perhaps it would be best to log an error message, or to simply ignore the sample? What do you think? The answer might depend on the number of invalid samples received, on the way in which they’re invalid, and on the consequences of recording them. You’ll see later, for example, that some of our code depends on the flapmap containing no overlapping samples, so it is important not to store those (or to clip them to avoid overlaps). For the rest of this section, though, we’ll assume all received samples are valid.

Question. How can you tell if a sample with start time start is in the map?

Test whether flapmap.find(start) != flapmap.end().

Question. has_sample is answering a slightly different question than whether a sample with start time start is in the map. How is it different?

has_sample(t) is supposed to return true if t is contained in a sample. It will be true if flapmap.find(start) != flapmap.end(), but it will also be true in other situations.

Exercise. Implement has_sample(t). Test your work.
One good first question: What’s the simplest, most obviously correct way to implement has_sample(t)? A natural answer is by iterating through all the samples!

datareps1/solutions/flapmaps2.cc
bool has_sample_iter(uintptr_t t) {
    for (auto it = flapmap.begin(); it != flapmap.end(); ++it) {
        if (it->first <= t && t < it->first + it->second.duration) {
            return true;
        }
    }
    return false;
}
We can check the correctness of this “obviously correct” implementation by writing some tests. Try to select a range of tests that catch edge cases, as well as obvious cases. For this problem, you might start with:

Test an empty flapmap (any time should return false).

Test a single-sample flapmap with:

A time before the sample (expect false);

A time matching the start time (expect true);

A time in the middle of the sample (expect true);

A time matching the end time (expect false);

A time after the end time (expect false).

Test a multi-sample flapmap. In addition to the above, perhaps:

Test a time in a gap between samples (expect false).

Test the border time between adjacent samples (expect true).

A good way implement tests is with the assert macro. The argument to assert is an expression that you, the programmer, believe is definitely true. The running program checks whether the expression is actually true, and crashes if it’s not.

datareps1/solutions/flapmaps2.cc
int main() {
    // empty flapmap tests
    assert(!has_sample(0));
    assert(!has_sample(-19348));
    assert(!has_sample(192132471));
    // single-sample tests
    add_sample(10, 10, 1);
    assert(!has_sample(0));
    assert(has_sample(10));
    assert(has_sample(15));
    assert(!has_sample(20));
    assert(!has_sample(25));
    // multi-sample tests
    add_sample(20, 10, 1);
    add_sample(40, 10, 1);
    assert(!has_sample(0));
    assert(has_sample(10));
    assert(has_sample(15));
    assert(has_sample(20));
    assert(has_sample(25));
    assert(!has_sample(30));
    assert(!has_sample(35));
    assert(has_sample(40));
    assert(has_sample(45));
    assert(!has_sample(50));
}
A successful run of this test program prints nothing! But an unsuccessful run prints an error and an indication of a crash.

A full testing strategy would add some more tests—such as random tests—but the code is so simple that a small set of tests can give us good confidence.

It’s really useful to have a known-good implementation, even a slow one, because we can test that implementation for consistency with an optimized version. Nevertheless, the iterator version is awfully slow: it tests every sample. We are guaranteed that samples in the flapmap never overlap; can we make use of that to speed up has_sample?

We spent a long time on lower_bound and upper_bound so we could use them here. Let’s start with upper_bound.
bool has_sample(uintptr_t t) {
    auto it = flapmap.upper_bound(t);
Now it points at the first sample whose start time is greater than t (or to flapmap.end() if all samples have start time ≤t). If that sample exists, it will never contain t; but the previous sample might! So let’s check the previous sample.

datareps1/solutions/flapmaps3.cc
bool has_sample(uintptr_t t) {
    auto it = flapmap.upper_bound(t);
    --it;
    return it->first <= t && t < it->first + it->second.duration;
}
Is checking just the one previous sample enough? Well, let’s run the tests to see how they do. You should get a crash! Do you see the bug in this code?

Our function might try to move it before flapmap.begin(). This is super illegal: there’s nothing there. If it == flapmap.begin(), then the first sample’s start time is greater than t, so the function should return false.

datareps1/solutions/flapmaps4.cc
bool has_sample(uintptr_t t) {
    auto it = flapmap.upper_bound(t);
    if (it == flapmap.begin()) {
        return false;
    } else {
        --it;
        return it->first <= t && t < it->first + it->second.duration;
    }
}
This is correct: the tests pass, and we can reason through its correctness.

Any samples at or after it cannot contain t, because they have start time >t.

If it == flapmap.begin(), then all samples have start time >t, so no sample can contain t.

Otherwise, let prev be the sample immediately before it. This is the last sample with start time ≤ t.

We explicitly check whether prev contains t.

No sample before prev can contain t, because samples don’t overlap. Any sample preceding prev has end time ≤ prev.start ≤ t, and so cannot contain t.

You can also use lower_bound. The code is a little uglier, but still fine.
bool has_sample_lower_bound(uintptr_t t) {
    auto it = flapmap.lower_bound(t);
    if (it->first == t) {
        return true;
    }
    // Otherwise, might be contained in the previous sample
    if (it == flapmap.begin() /* no previous sample */) {
        return false;
    } else {
        --it;
        return it->first <= t && t < it->first + it->second.duration;
    }
}

Offline Exercise. Implement the function sample_overlaps(start, duration), which tests whether a sample with start time start and duration duration would overlap any sample in flapmap. Implement some tests of your function. Start by drawing some pictures: How many different kinds of overlapping samples are there? Then implement the corresponding tests.
bool sample_overlaps(uintptr_t start, size_t duration) {
    return has_sample(start)
        || flapmap.lower_bound(start) != flapmap.lower_bound(start + duration);
}
We can count four distinct classes of overlapping, plus some interesting cases where there is no overlapping. In the diagrams below, ==== represents existing samples and <---> represents the queried range.
1.   ============
           <------------->               overlapping at beginning



2.                   =============
           <------------->               overlapping at end



3.   =============================
           <------------->               contained within


4.              ====
           <------------->               containing


5.   ====                    =====
           <------------->               between (should not overlap)


6.   ======               ========
           <------------->               touching (should not overlap)
These tests cover those classes:
add_sample(0, 5, 1);
add_sample(10, 5, 1);

assert(sample_overlaps(4, 4));
assert(sample_overlaps(8, 4));
assert(sample_overlaps(1, 2));
assert(sample_overlaps(8, 10));
assert(!sample_overlaps(6, 3));
assert(!sample_overlaps(5, 5));

Too many samples

As you test Flap Sensor 5000 in the wild, you encounter a “success disaster”: the sensor reports so many samples that your program runs out of memory!¹

Analysis shows that the problem is that the sensor is reporting more information than you need: the sensor reports a flap count every second, but your users only need average flap counts per hour. That said, your users still want to make sure that all sample gaps are preserved (you should not average over a gap in the sample stream).

Discussion. How would you propose changing the data collection process to reduce memory usage under these constraints?

Sample coalescing

You decide to implement a sample coalescing strategy: add_sample will combine adjacent samples by adding their flap counts together.

For simplicity’s sake, the rest of this section will not limit how many samples can be combined, but it’s not too difficult to limit coalescing to at most one hour at a time. And remember, you can assume that samples never overlap.

Exercise. Implement can_coalesce_up(flapmap_iter it), which tests whether the sample at position it can be combined with the next sample in time. If it has no next sample, can_coalesce_up(it) should return false.
bool can_coalesce_up(flapmap_iter it) {
    assert(it != flapmap.end());

    // Check if next sample exists
    auto next = it;
    ++next;
    if (next == flapmap.end()) {
        return false;
    }

    return it->first + it->second.duration == next->first;
}

Exercise. Implement coalesce_up(flapmap_iter it), which coalesces the sample at position it with the next sample in time. You can assume that can_coalesce_up(it).
void coalesce_up(flapmap_iter it) {
    assert(can_coalesce_up(it));
    auto next = it;
    ++next;
    it->second.duration += next->second.duration;
    it->second.flapcount += next->second.flapcount;
    flapmap.erase(next);
}

Note. It’s important when modifying a container to keep track of a property called iterator validity, which means whether an existing iterator into the container remains valid. Different containers have different validity properties. Maps are relatively generous: after an insert operation, all previous iterators into the container remain valid (they still point at the same element). Iterators are also valid after map::erase, except that any iterators pointing at erased elements become invalid and illegal to access.

Question. Say that flapmap is completely coalesced, meaning that there are no adjacent samples in flapmap (so can_coalesce_up(it) would return false for every iterator it in the map). A new sample arrives for times [start, start+duration). How many coalesce_up operations would be required to ensure that the new flapmap was completely coalesced?
Two, if samples can arrive out of order! For example, consider these samples:
add_sample(0, 10, 1);
add_sample(20, 10, 1);
// map now contains samples [0, 10) and [20, 30)
add_sample(10, 10, 1);
// map should contain just one sample, for [0, 30): this requires two coalescings

Exercise. Implement can_coalesce_down(flapmap_iter it), which tests whether the sample at position it can be combined with the previous sample in time. If it has no previous sample, can_coalesce_down(it) should return false.
bool can_coalesce_down(flapmap_iter it) {
    assert(it != flapmap.end());

    // Check if previous sample exists
    if (it == flapmap.begin()) {
        return false;
    }

    auto prev = it;
    --prev;
    return prev->first + prev->second.duration == it->first;
}

Exercise. Implement a version of add_sample that maximally coalesces the new sample with adjacent samples.
This version is probably easiest to read, which is a really important property.

datareps1/solutions/flapmaps5.cc
void add_sample(uintptr_t start, size_t duration, size_t flapcount) {
    // Strategy: first insert, then coalesce
    flapmap.insert({start, {start, duration, flapcount}});
    auto it = flapmap.find(start);
    while (can_coalesce_down(it)) {
        --it;
    }
    while (can_coalesce_up(it)) {
        coalesce_up(it);
    }
}

Discussion. Can you think of any ways to optimize this coalescing add_sample? Would you recommend implementing them?

The version above isn’t the absolute fastest: it creates a new sample object even if the new sample can be coalesced immediately. The simplest optimization might be to check if the new sample can be coalesced with the most recent sample in the map. That case seems likely to happen often, since samples will generally arrive in increasing order.

void add_sample_faster_common_case(uintptr_t start, size_t duration, size_t flapcount) {
    if (!flapmap.empty()) {
        auto it = flapmap.end();
        --it;
        if (it->first + it->second.duration == start) {
            it->second.duration += duration;
            it->second.flapcount += flapcount;
            return;
        }
    }
    add_sample_original(start, duration, flapcount);
}

With a small tweak, this can coalesce samples with previous samples even if samples are inserted out of order.

void add_sample_faster(uintptr_t start, size_t duration, size_t flapcount) {
    auto it = flapmap.lower_bound(start);
    if (it != flapmap.begin()) {
        --it;
        if (it->first + it->second.duration == start) {
            it->second.duration += duration;
            it->second.flapcount += flapcount;
            return;
        }
    }
    add_sample_original(start, duration, flapcount);
}

But that version doesn’t coalesce maximally! A sample that fills a gap will be coalesced down, but not up.

We can do better by combining the original add_sample code with our faster version, reasoning carefully about where iterators point. Here’s the code:

datareps1/solutions/flapmaps6.cc

void add_sample_integrated(uintptr_t start, size_t duration, size_t flapcount) {
    auto it = flapmap.lower_bound(start);
    // `it` points at the first sample with start time >= start.
    // Change `it` to point at the previous sample: if our sample can be coalesced
    // down, that’ll be the one we want.
    if (it != flapmap.begin()) {
        --it;
    }
    // Now `it` points at the *last* sample with start time *less than* start,
    // if there is such a sample.
    // Try coalescing down.
    if (it != flapmap.end()
        && it->first + it->second.duration == start) {
        it->second.duration += duration;
        it->second.flapcount += flapcount;
        // Now `it` points at the sample containing the new sample.
    } else {
        // Can’t coalesce down, so insert.
        flapmap.insert({start, {start, duration, flapcount}});
        it = flapmap.find(start);
        // Now `it` points at the sample containing the new sample.

        // (The following line is equivalent to `insert + find`, but even faster,
        // because insert_or_assign returns the relevant iterator:)
        // it = flapmap.insert_or_assign(start, {start, duration, flapcount});
    }
    // `it` always points at the new sample.
    // Coalesce up.
    while (can_coalesce_up(it)) {
        coalesce_up(it);
    }
}

Is this worth it? Well, that’s a difficult question. I’d say no, unless you experience performance problems with the simpler solution. Sometimes it’s emotionally hard to avoid optimizing, but before doing something complicated, you should always develop a good testing strategy.

Discussion. How does the Cloud Flap Sensor 5000 problem relate to Problem Set 1?

This is not realistic unless the sensor is broken. ↩︎