This section covers concepts from the C++ standard library, especially container data structures and iterators, using an example that might remind you of part of Problem Set 1.
Prework: Container data structures and std::map
How section works
We release section material via the CS 61 course site (where you’re reading
this now) and the cs61-sections repository (obtain a local copy with a
command like git clone git@github.com:cs61/cs61-sections
). TFs will work
through the section material at their own pace, stopping at exercises,
questions, and discussions so students can work through the material together.
It’s best if you bring your laptop. Solutions for most section problems are
available here, but you will learn more if you try to solve the problems
yourself first.
Section works well if students ask questions, and sometimes TFs will take the section in an unexpected direction based on student requests. Please speak up! If you have questions from lecture, now’s the time to ask.
Students will be expected to have digested all material from section handouts.
Container data structures (prework and reference)
C++’s standard containers library (ref1, ref2, ref3) offers several powerful container data structures, including:
std::map<K, V>
, a balanced tree structure that can find items by key and can enumerate all items in sorted order by key;std::vector<T>
, a growable contiguous array of objects;std::list<T>
, a linked list of objects;std::unordered_map<K, V>
, a hash table that can quickly find items by key and can enumerate all items, but not in any particular order;std::multimap<K, V>
, a version ofstd::map
that can store multiple values with the same key;- and several more.
Container data structures have similar APIs, and the algorithms library (ref1, ref2) offers generic algorithms that can operate on many containers.
C++ container APIs resemble APIs in other languages, but do have some unique features, especially in the way they use iterators (ref) to represent positions in a data structure.
In this section, we’ll concentrate on std::map
.
std::map
std::map<K, V>
(ref1,
ref2,
ref3) is a
container that maps keys of type K
to values of type V
. The key
and value types can be almost any C++ types, but K
objects must support
comparison with the less-than operator <
. (Integers, pointers, and C++
std::string
s all support <
; so do many other types.)
This code creates an empty map, then prints every element of the map (there are none, so it prints nothing).
std::map<uintptr_t, const char*> m;
for (auto it = m.begin(); it != m.end(); ++it) {
fprintf(stderr, "key %zu, value %s\n", it->first, it->second);
}
The it
object is an iterator that represents a position in, or cursor into,
the map. A map iterator “points at” a key–value pair; the key is in
it->first
, the value in it->second
. (The iterator is actually an object,
not a pointer, but it supports pointer notation anyway.)
Several map functions return iterators. The m.begin()
iterator points at the
first key–value pair in the map. The special m.end()
iterator represents the
absence of an item. Its position is “one past” of all the valid positions in
the map; it’s illegal to refer to m.end()->first
or m.end()->second
.
There’s also m.find(K)
, which returns an iterator pointing at the element
with a given key, and m.lower_bound(K)
and a.upper_bound(K)
, which return
iterators pointing at elements nearest a given key (in a sense defined
more precisely below).
Iterators can also be moved and compared. Move an iterator to the next
position with ++it
(as you see above), and to the previous position with
--it
. Compare map iterators with ==
and !=
; two iterators are equal if
and only if they point to the same element.
The type of m.begin()
is std::map<uintptr_t, const char*>::iterator
. Ugh!
Rather than type that whole thing, in most situations you can just say auto
and the compiler will figure it out.
Insert elements with the insert
member function, which takes a key–value
pair argument:
#include <map> // gain access to `std::map`
#include <cstdio>
#include <cstdint> // gain access to `uintptr_t`
int main() {
std::map<uintptr_t, const char*> m;
m.insert({0, "Hello"});
m.insert({2, "!"});
m.insert({1, "world"});
for (auto it = m.begin(); it != m.end(); ++it) {
fprintf(stderr, "key %zu, value %s\n", it->first, it->second);
}
}
// ==> key 0, value Hello
// key 1, value world
// key 2, value !
insert({K, V})
does nothing if the element with key K
already exists. If
you want to modify any existing element, use m.insert_or_assign(K, V)
or,
shorter, m[K] = V
.
Question.
m.find(K)
returns an iterator pointing at the element with keyK
. What does it return ifK
isn’t contained in the map? It’s got to return something!It returns the iterator representing the absence of an element,
m.end()
.
Question. How can you tell whether an element with key
K
is in a map?Using what we’ve discussed so far, you can check whether
m.find(K) != m.end()
. There are also shorthands that might be a little faster or easier to read:m.contains(K) == true
whenK
is in the map, andm.count(K) != 0
whenK
is in the map.
Question. Write a test program to help figure out the exact sense in which
m.lower_bound(K)
returns an iterator pointing at the element “nearest” a given key.I’d want to know if
lower_bound
returns the nearest iterator by going to the nearest lower key, or the nearest higher key, or the nearest key by distance. I’m also curious of whenlower_bound
returnm.end()
, if it ever does. Here’s a program testing those questions:#include <map> #include <cstdio> #include <cstdint> std::map<uintptr_t, const char*> m; void print_lower_bound(uintptr_t key) { auto it = m.lower_bound(key); if (it == m.end()) { fprintf(stderr, "lower_bound for %zu is at end\n", key); } else { fprintf(stderr, "lower_bound for %zu is at key %zu\n", key, it->first); } } int test() { m.insert({1, ""}); m.insert({3, ""}); m.insert({11, ""}); print_lower_bound(0); print_lower_bound(1); print_lower_bound(2); print_lower_bound(4); print_lower_bound(10); print_lower_bound(11); print_lower_bound(12); } // ==> lower_bound for 0 is at key 1 // lower_bound for 1 is at key 1 // lower_bound for 2 is at key 3 // lower_bound for 4 is at key 11 // lower_bound for 10 is at key 11 // lower_bound for 11 is at key 11 // lower_bound for 12 is at end
Question. What does
m.lower_bound(K)
return?It appears that
m.lower_bound(K)
returns the iterator pointing at the first element whose key is greater than or equal toK
, orm.end()
ifK
is greater than all keys in the map.
Question. What does
m.upper_bound(K)
return?
m.upper_bound(K)
returns the iterator pointing at the first element whose key is greater thanK
, orm.end()
ifK
is greater than or equal to all keys in the map.
Question. Say that
m.lower_bound(k) == m.upper_bound(k)
for some keyk
. What can you infer about the contents of mapm
?That
k
is not included in the map (som.find(k) == m.end()
).
Erase elements with erase
, which takes either a key or one or two iterators.
m.erase(K)
erases the element with key K
, if there is one. m.erase(FIRST, LAST)
erases all elements between iterator FIRST
and iterator LAST
;
m.erase(m.begin(), m.end())
clears the container. m.erase(IT)
erases the
single element at iterator IT
.
Cloud Flap Sensor 5000
Source image from Shutterstock
In this section, you’ll help build the cloud storage back end for the Flappy Bird Company’s new Cloud Flap Sensor 5000. This high-tech sensor straps onto a bird’s back and counts the number of times it flaps its wings. When possible, the sensor sends the number of recent flaps back to your server over the cell phone network.
A sample from the sensor has three components:
- The start time, an integer measured in seconds since January 1, 1970.
- The duration, an integer number of seconds.
- The flap count, which is the integer number of times the bird flapped
its wings between time
start
and timestart + duration
.
(A real sensor message would have more components, like a sensor ID, but we’ll assume for now there’s only one interesting bird.)
The sensor typically sends messages with increasing start
time and no gaps:
a message with start == S
and duration == D
would be followed by a message
with start == S+D
. However, birds go all over the place, so sometimes
messages are lost and there will be gaps. A valid sensor will never generate
samples that overlap in time or samples with zero duration
.
These are the important functions:
void add_sample(uintptr_t start, size_t duration, size_t flapcount);
bool has_sample(uintptr_t t);
(uintptr_t
and size_t
are types of unsigned integer, where “unsigned”
means they cannot represent negative numbers. On x86-64 machines, they are
both synonyms for unsigned long
, and they are large enough to represent
numbers from 0 to 18,446,744,073,709,551,615. More
information.)
add_sample
records a sample from the sensor.has_sample
returnstrue
if a sample has been recorded that covers timet
(that is, a sample withstart <= t && t < start + duration
), andfalse
if not.
You’ll implement add_sample
and has_sample
, using a global data structure
you’ll define.
Question. Think about how would you implement these functions. What data structures might you use? What factors might affect that choice?
Cloud Flap Sensor Map
In the rest of this section, we’ll implement Cloud Flap Sensor 5000 using
std::map
. Here’s our proposed type for the global data structure (see in
datareps1/flapmap.hh
):
struct sample {
uintptr_t start;
size_t duration;
size_t flapcount;
};
std::map<uintptr_t, sample> flapmap;
using flapmap_iter = std::map<uintptr_t, sample>::iterator; // less typing later
The flapmap
holds every sample
received from the sensor, indexed by
start
time.
Exercise. Implement
add_sample(uintptr_t start, size_t duration, size_t flapcount)
inflapmap1.cc
. Test your work.void add_sample(uintptr_t start, size_t duration, size_t flapcount) { flapmap.insert({start, {start, duration, flapcount}}); }
Or, more verbosely:
void add_sample(uintptr_t start, size_t duration, size_t flapcount) { sample s = { .start = start, .duration = duration, .flapcount = flapcount }; flapmap.insert({start, s}); }
Question. What does your
add_sample
do with a duplicate sample (meaning a sample with the samestart
time as a previously-recorded sample)?Ours throws the new sample away! If you used
insert_or_assign
, yours would replace the old sample with the new one.
(Offline) Discussion. If the Flap Sensor 5000 were to break, it might start reporting invalid samples. It’s important to think about robustness while you program. What kinds of arguments to
add_sample
might indicate a broken sensor? What shouldadd_sample
do when passed an invalid sample?We’d be suspicious of an
add_sample
call with several kinds of weird arguments:
duration
is zero.start
represents a time unreasonably far in the past.start + duration
represents a time unreasonably far into the future.flapcount
represents an unrealistically large count.- The
start
—start + duration
interval overlaps a previously recorded sample.There’s an art to determining what’s “unreasonably far” or “unrealistically large”; picking good thresholds can require domain knowledge or real-world experience. For example, the threshold for an unrealistically large flap count depends on the kind of bird—for a seagull, which usually flaps 2.5/sec, 20/sec is very unrealistic, but hummingbirds can go up to 80/sec. You might think that any sample in the future is invalid, but it’s super common for clocks in the real world to get out of sync, so it’d be wise to accept future samples with some error tolerance.
As for what to do with an invalid sample, there’s no right answer! Perhaps it would be best to log an error message, or to simply ignore the sample? What do you think? The answer might depend on the number of invalid samples received, on the way in which they’re invalid, and on the consequences of recording them. You’ll see later, for example, that some of our code depends on the
flapmap
containing no overlapping samples, so it is important not to store those (or to clip them to avoid overlaps). For the rest of this section, though, we’ll assume all received samples are valid.
Question. How can you tell if a sample with start time
start
is in the map?Test whether
flapmap.find(start) != flapmap.end()
.
Question.
has_sample
is answering a slightly different question than whether a sample with start timestart
is in the map. How is it different?
has_sample(t)
is supposed to returntrue
ift
is contained in a sample. It will be true ifflapmap.find(start) != flapmap.end()
, but it will also be true in other situations.
Exercise. Implement
has_sample(t)
. Test your work.One good first question: What’s the simplest, most obviously correct way to implement
has_sample(t)
? A natural answer is by iterating through all the samples!bool has_sample_iter(uintptr_t t) { for (auto it = flapmap.begin(); it != flapmap.end(); ++it) { if (it->first <= t && t < it->first + it->second.duration) { return true; } } return false; }
We can check the correctness of this “obviously correct” implementation by writing some tests. Try to select a range of tests that catch edge cases, as well as obvious cases. For this problem, you might start with:
- Test an empty
flapmap
(any time should return false).- Test a single-sample
flapmap
with:
- A time before the sample (expect false);
- A time matching the start time (expect true);
- A time in the middle of the sample (expect true);
- A time matching the end time (expect false);
- A time after the end time (expect false).
- Test a multi-sample
flapmap
. In addition to the above, perhaps:
- Test a time in a gap between samples (expect false).
- Test the border time between adjacent samples (expect true).
A good way implement tests is with the
assert
macro. The argument toassert
is an expression that you, the programmer, believe is definitely true. The running program checks whether the expression is actually true, and crashes if it’s not.int main() { // empty flapmap tests assert(!has_sample(0)); assert(!has_sample(-19348)); assert(!has_sample(192132471)); // single-sample tests add_sample(10, 10, 1); assert(!has_sample(0)); assert(has_sample(10)); assert(has_sample(15)); assert(!has_sample(20)); assert(!has_sample(25)); // multi-sample tests add_sample(20, 10, 1); add_sample(40, 10, 1); assert(!has_sample(0)); assert(has_sample(10)); assert(has_sample(15)); assert(has_sample(20)); assert(has_sample(25)); assert(!has_sample(30)); assert(!has_sample(35)); assert(has_sample(40)); assert(has_sample(45)); assert(!has_sample(50)); }
A successful run of this test program prints nothing! But an unsuccessful run prints an error and an indication of a crash.
A full testing strategy would add some more tests—such as random tests—but the code is so simple that a small set of tests can give us good confidence.
It’s really useful to have a known-good implementation, even a slow one, because we can test that implementation for consistency with an optimized version. Nevertheless, the iterator version is awfully slow: it tests every sample. We are guaranteed that samples in the
flapmap
never overlap; can we make use of that to speed uphas_sample
?
We spent a long time on
lower_bound
andupper_bound
so we could use them here. Let’s start withupper_bound
.bool has_sample(uintptr_t t) { auto it = flapmap.upper_bound(t);
Now
it
points at the first sample whose start time is greater thant
(or toflapmap.end()
if all samples have start time ≤t
). If that sample exists, it will never containt
; but the previous sample might! So let’s check the previous sample.bool has_sample(uintptr_t t) { auto it = flapmap.upper_bound(t); --it; return it->first <= t && t < it->first + it->second.duration; }
Is checking just the one previous sample enough? Well, let’s run the tests to see how they do. You should get a crash! Do you see the bug in this code?
Our function might try to move
it
beforeflapmap.begin()
. This is super illegal: there’s nothing there. Ifit == flapmap.begin()
, then the first sample’s start time is greater thant
, so the function should return false.bool has_sample(uintptr_t t) { auto it = flapmap.upper_bound(t); if (it == flapmap.begin()) { return false; } else { --it; return it->first <= t && t < it->first + it->second.duration; } }
This is correct: the tests pass, and we can reason through its correctness.
- Any samples at or after
it
cannot containt
, because they have start time >t
.- If
it == flapmap.begin()
, then all samples have start time >t
, so no sample can containt
.- Otherwise, let
prev
be the sample immediately beforeit
. This is the last sample with start time ≤t
.
- We explicitly check whether
prev
containst
.- No sample before
prev
can containt
, because samples don’t overlap. Any sample precedingprev
has end time ≤prev.start
≤t
, and so cannot containt
.You can also use
lower_bound
. The code is a little uglier, but still fine.bool has_sample_lower_bound(uintptr_t t) { auto it = flapmap.lower_bound(t); if (it->first == t) { return true; } // Otherwise, might be contained in the previous sample if (it == flapmap.begin() /* no previous sample */) { return false; } else { --it; return it->first <= t && t < it->first + it->second.duration; } }
Offline Exercise. Implement the function
sample_overlaps(start, duration)
, which tests whether a sample with start timestart
and durationduration
would overlap any sample inflapmap
. Implement some tests of your function. Start by drawing some pictures: How many different kinds of overlapping samples are there? Then implement the corresponding tests.bool sample_overlaps(uintptr_t start, size_t duration) { return has_sample(start) || flapmap.lower_bound(start) != flapmap.lower_bound(start + duration); }
We can count four distinct classes of overlapping, plus some interesting cases where there is no overlapping. In the diagrams below,
====
represents existing samples and<--->
represents the queried range.1. ============ <-------------> overlapping at beginning 2. ============= <-------------> overlapping at end 3. ============================= <-------------> contained within 4. ==== <-------------> containing 5. ==== ===== <-------------> between (should not overlap) 6. ====== ======== <-------------> touching (should not overlap)
These tests cover those classes:
add_sample(0, 5, 1); add_sample(10, 5, 1); assert(sample_overlaps(4, 4)); assert(sample_overlaps(8, 4)); assert(sample_overlaps(1, 2)); assert(sample_overlaps(8, 10)); assert(!sample_overlaps(6, 3)); assert(!sample_overlaps(5, 5));
Too many samples
As you test Flap Sensor 5000 in the wild, you encounter a “success disaster”: the sensor reports so many samples that your program runs out of memory!1
Analysis shows that the problem is that the sensor is reporting more information than you need: the sensor reports a flap count every second, but your users only need average flap counts per hour. That said, your users still want to make sure that all sample gaps are preserved (you should not average over a gap in the sample stream).
Discussion. How would you propose changing the data collection process to reduce memory usage under these constraints?
Sample coalescing
You decide to implement a sample coalescing strategy: add_sample
will
combine adjacent samples by adding their flap counts together.
For simplicity’s sake, the rest of this section will not limit how many samples can be combined, but it’s not too difficult to limit coalescing to at most one hour at a time. And remember, you can assume that samples never overlap.
Exercise. Implement
can_coalesce_up(flapmap_iter it)
, which tests whether the sample at positionit
can be combined with the next sample in time. Ifit
has no next sample,can_coalesce_up(it)
should returnfalse
.bool can_coalesce_up(flapmap_iter it) { assert(it != flapmap.end()); // Check if next sample exists auto next = it; ++next; if (next == flapmap.end()) { return false; } return it->first + it->second.duration == next->first; }
Exercise. Implement
coalesce_up(flapmap_iter it)
, which coalesces the sample at positionit
with the next sample in time. You can assume thatcan_coalesce_up(it)
.void coalesce_up(flapmap_iter it) { assert(can_coalesce_up(it)); auto next = it; ++next; it->second.duration += next->second.duration; it->second.flapcount += next->second.flapcount; flapmap.erase(next); }
Note. It’s important when modifying a container to keep track of a property called iterator validity, which means whether an existing iterator into the container remains valid. Different containers have different validity properties. Maps are relatively generous: after an
insert
operation, all previous iterators into the container remain valid (they still point at the same element). Iterators are also valid aftermap::erase
, except that any iterators pointing at erased elements become invalid and illegal to access.
Question. Say that
flapmap
is completely coalesced, meaning that there are no adjacent samples inflapmap
(socan_coalesce_up(it)
would returnfalse
for every iteratorit
in the map). A new sample arrives for times [start
,start+duration
). How manycoalesce_up
operations would be required to ensure that the newflapmap
was completely coalesced?Two, if samples can arrive out of order! For example, consider these samples:
add_sample(0, 10, 1); add_sample(20, 10, 1); // map now contains samples [0, 10) and [20, 30) add_sample(10, 10, 1); // map should contain just one sample, for [0, 30): this requires two coalescings
Exercise. Implement
can_coalesce_down(flapmap_iter it)
, which tests whether the sample at positionit
can be combined with the previous sample in time. Ifit
has no previous sample,can_coalesce_down(it)
should returnfalse
.bool can_coalesce_down(flapmap_iter it) { assert(it != flapmap.end()); // Check if previous sample exists if (it == flapmap.begin()) { return false; } auto prev = it; --prev; return prev->first + prev->second.duration == it->first; }
Exercise. Implement a version of
add_sample
that maximally coalesces the new sample with adjacent samples.This version is probably easiest to read, which is a really important property.
void add_sample(uintptr_t start, size_t duration, size_t flapcount) { // Strategy: first insert, then coalesce flapmap.insert({start, {start, duration, flapcount}}); auto it = flapmap.find(start); while (can_coalesce_down(it)) { --it; } while (can_coalesce_up(it)) { coalesce_up(it); } }
Discussion. Can you think of any ways to optimize this coalescing
add_sample
? Would you recommend implementing them?The version above isn’t the absolute fastest: it creates a new sample object even if the new sample can be coalesced immediately. The simplest optimization might be to check if the new sample can be coalesced with the most recent sample in the map. That case seems likely to happen often, since samples will generally arrive in increasing order.
void add_sample_faster_common_case(uintptr_t start, size_t duration, size_t flapcount) { if (!flapmap.empty()) { auto it = flapmap.end(); --it; if (it->first + it->second.duration == start) { it->second.duration += duration; it->second.flapcount += flapcount; return; } } add_sample_original(start, duration, flapcount); }
With a small tweak, this can coalesce samples with previous samples even if samples are inserted out of order.
void add_sample_faster(uintptr_t start, size_t duration, size_t flapcount) { auto it = flapmap.lower_bound(start); if (it != flapmap.begin()) { --it; if (it->first + it->second.duration == start) { it->second.duration += duration; it->second.flapcount += flapcount; return; } } add_sample_original(start, duration, flapcount); }
But that version doesn’t coalesce maximally! A sample that fills a gap will be coalesced down, but not up.
We can do better by combining the original
add_sample
code with our faster version, reasoning carefully about where iterators point. Here’s the code:void add_sample_integrated(uintptr_t start, size_t duration, size_t flapcount) { auto it = flapmap.lower_bound(start); // `it` points at the first sample with start time >= start. // Change `it` to point at the previous sample: if our sample can be coalesced // down, that’ll be the one we want. if (it != flapmap.begin()) { --it; } // Now `it` points at the *last* sample with start time *less than* start, // if there is such a sample. // Try coalescing down. if (it != flapmap.end() && it->first + it->second.duration == start) { it->second.duration += duration; it->second.flapcount += flapcount; // Now `it` points at the sample containing the new sample. } else { // Can’t coalesce down, so insert. flapmap.insert({start, {start, duration, flapcount}}); it = flapmap.find(start); // Now `it` points at the sample containing the new sample. // (The following line is equivalent to `insert + find`, but even faster, // because insert_or_assign returns the relevant iterator:) // it = flapmap.insert_or_assign(start, {start, duration, flapcount}); } // `it` always points at the new sample. // Coalesce up. while (can_coalesce_up(it)) { coalesce_up(it); } }
Is this worth it? Well, that’s a difficult question. I’d say no, unless you experience performance problems with the simpler solution. Sometimes it’s emotionally hard to avoid optimizing, but before doing something complicated, you should always develop a good testing strategy.
Discussion. How does the Cloud Flap Sensor 5000 problem relate to Problem Set 1?
-
This is not realistic unless the sensor is broken. ↩︎