Section 3: StackSmash

Skills

Learn how to use tools like gdb (GNU Debugger)
Extract instructions from memory
Learn about simple cyber attacks

Memory layout

We have already gone over memory layout in lecture. This diagram may help you to better understand it.

File:Screen_Shot_2017-09-14_at_4.10.56_PM.png

Remember that the stack (on most architectures) grows down, meaning the addresses farther into the stack are smaller than than those towards the base. The opposite is true for the heap, which grows up towards the stack. Also remember that local variables—automatic-lifetime data—go in the stack, and dynamically allocated memory goes in the heap.

Return addresses

The stack stores automatic-lifetime data, such as a running function’s local variables and scratch space. The part of the stack that belongs to a function execution is called that function’s stack frame.

Part of the compiler’s job is to lay out a function’s stack frame. It analyzes the function and decides where local variables are stored.

But a stack frame stores some control information about the function’s execution, as well as the function’s data. Specifically, each stack frame contains a return address: the address of the instruction that should run after the function completes its work.

For instance, consider this function:

int main() {
    int x = func() + 3 * func();
    ... printf("%d\n", x);
}

The compiler implements main using a sequence of simple instructions. For instance, it might do something like this:

1: call func()
2: store result of func() in temporary variable t1
3: call func() again
4: store result of func() in temporary variable t2
5: set t2 := 3 * t2
6: set t1 := t1 + t2
7: store t1 in x
8: ...

Each time func() returns, main should pick up where it left off. Thus, the first time func() is called, it should return to instruction #2, but the second time, it should return to instruction #4. But func() itself is the same in both cases! We can see that func() must return to an instruction defined at runtime, rather than a single instruction whose address is known in advance.

On x86-64 (and many modern processors), the return address is stored on the stack. The x86-64 callq instruction (which implements steps #1 and #3) pushes the return address onto the stack as part of its operation. Note that the return address for a function’s stack frame points to an instruction in its caller, so here func’s return addresses are in main, not in func itself.

Since the return address is on the stack, we can use gdb or cheat (by accessing out-of-bounds portions of the stack) to examine or even modify the return address. Hee hee.

Tools

Useful gdb commands

gdb is a useful tool that allows the programmers to step through the execution of the program and examine the memory at each point. There are many options and commands within gdb that allow one to accomplish specific tasks and fulfills different needs. Here are some commands to get you started. You can use either the full name of the command (e.g., run) or the shortcut (e.g., r) in parentheses.

run (r): Executes file passed as command line argument to gdb
backtrace (bt): Prints the call stack
frame <NUMBER>: Examines the stack frame at <NUMBER> given by backtrace
up <NUMBER>: Moves up the call stack <NUMBER> frames
down <NUMBER>: Moves down the call stack <NUMBER> frames
- frame is absolute while up and down move relative to the current frame
break (b): Pauses execution when a particular point in the code is reached
- break <FILENAME>:<LINE NUMBER>
- break <FUNCTION NAME>
- break <FILENAME>:<FUNCTION NAME>
step (s): Steps into a line of code (enters function calls)
next (n): Steps to next line of code (does not enter function calls)
continue (c): Resumes execution of the program but stops at the breakpoints; useful for passing over irrelevant code or finishing loops
advance <NUMBER> (adv): Steps to a specific line of code (does not enter function calls); useful to break out of loops
finish: Runs until the current function returns then break and print out the return value
examine (x): Examines memory
- Syntax: x/\<# of Elements\><Type of Elements><Additional Size Info>
- Simple Example: x/d 0x123: Prints memory at address 0x123 as an integer
- Complicated Example: x/20xg 0x345: Prints memory at address 0x345 as 20 hexadecimal "giant" words (8 bytes)
print <something> (p): Prints content of a variable/memory location/register
display <something> (disp) - Just like print, but gives the information on each step through the program; useful for keeping track of the value of something throughout execution.
set var <VARIABLE_NAME>=<VALUE>: Sets a value in the executing program; useful for keeping track of values
- Example: set var addr=0x123
thread <THREAD_NUMBER> : Switches between threads in a given program (we will cover threads later in the semester)
q - Quits gdb
disassemble <FUNCTION_NAME> or disassemble <START_ADDR>,<END_ADDR> or disassemble <START_ADDR>,+<LENGTH> (disas): Prints assembly instructions within a function name, or within specific addresses
Ctrl + X 2: Switch into the TUI, which allows you to view assembly and source code alongside the debug console
- Ctrl + X 1: Similar, but gives you only one view
- Ctrl + X O: Switch the focus between views

gdb cheatsheet: http://darkdust.net/files/GDB%20Cheat%20Sheet.pdf

Valgrind

gdb runs your program as usual, but you can set breakpoints which will give gdb control so you can examine the state of memory, and gdb will take control if the program segmentation faults. valgrind does not run your program as usual, instead it emulates your program. Your program will now run much slower, but

valgrind` will catch some memory errors much sooner than `gdb

would, and it will even report some nasal demons that have no visible effect on most executions. You should fix all nasal demons because most of them are "time bombs" that an attacker could trigger later using a specially-crafted input. We will see examples later in this section.

valgrind provides a set of tools that perform debugging, profiling, or other similar tasks that help you improve your programs. We have seen memcheck tool in class, which is a memory debugging tool used primary for Linux and Mac. It simulates an executable and tracks memory errors (e.g., uninitialized variables, memory leaks, and buffer overflows). valgrind does not detect static (stack) memory errors, but only the ones that occur dynamically during runtime.

Usage

valgrind --tool=memcheck --leak-check=full

Example Output

Uninitialized Variables:

==9388== Syscall param exit_group(status) contains uninitialised byte(s) 
==9388== at 0x455F7364: _Exit (in /lib/libc-2.14.90.so) 
==9388== Use --track-origins=yes to see where uninitialised values come from

Memory leaks:

==9892== LEAK SUMMARY:
==9892== definitely lost: 64 bytes in 4 blocks

Buffer Overflows/Invalid Writes:

==9736== Invalid write of size 4 
==9736== at 0x8048435: main 
==9736== total heap usage: 1 allocs, 1 frees, 16 bytes 
==9736== All heap blocks were freed -- no leaks are possible 
==9736== ERROR SUMMARY: 1 errors from 1 contexts

Check out http://valgrind.org/docs/manual/valgrind_manual.pdf for other useful tools in valgrind!

objdump

objdump displays information about object file (usually files with extension .o). If used as a disassembler, it works like disas in gdb.

Usage:objdump -d <PROGRAM_NAME>(d means disassemble)

Assignment

Pull the latest version of the sections repository by entering in the terminal

git pull

The cs61-section/s03 directory contains a few programs that we will use to investigate representations of objects in memory and assembly language instructions. We will also learn about one of the most common types of security vulnerabilities: the buffer overflow, and what programs that are vulnerable to it might look like.

A buffer overflow is a type of attack in which someone takes advantage of C's memory layout in order to write past the end of a buffer, allowing them to make the computer execute malicious code. At the beginning of each function call, a return address is pushed onto the stack. Once the machine hits a perceived return address, it will begin to execute the next instructions it sees. An attacker must "simply" write past the end of a buffer (many functions in C, like gets(), have no boundary checking) and overwrite the return address to the address in memory where the malicious code is stored. Usually, the attacker has put this code in the buffer itself, but sometimes they will try to use existing code to construct an attack. The malicious code is frequently a system call to open a shell, which when executed will allow the attacker to potentially steal information or cause more trouble.

This section material requires Linux (although you can do the factorial portion on any x86-64 machine). Before running later parts, run sudo apt install gifsicle.

factorial.c (Part 1)

In this exercise, open factorial.c file and change the factorial(int n) function to print out not only the content of the buffer buf but also the return address of this function.

To do this exercise, you need to:

Have a clear picture of the memory layout in the stack
Understand the concept of return addresses
Know how to use gdb

factorial.c (Part 2)

Now that you know the place where the factorial(int n) function stores its return address, your job is to modify the return address. Your goal is to add code to the end of factorial—without changing anything else!—so that running ./factorial prints HIJACKED! 24 instead of 5! == 120.

To do this exercise, you need to:

Know the exact place where factorial(int n) stores its return address
Know pointer arithmetic
Know how to use gdb and objdump

StackSmash (Part 3)

Stacksmash reads from the standard input. If you just run it, it will sit there, waiting for you to type something. So type something and see what it does! As an example:

echo Hi | ./stacksmash

Now try the following commands:

yes | tr -d '\n' | head -c 120 > y120.txt  //see section 'Helpful Unix Utility Programs'
./stacksmash < y120.txt
./stacksmash.safe < y120.txt

What is the difference between the output of ./stacksmash and ./stacksmash.safe?

Now, let's try different input file lengths to both programs. Try, for example, file lengths 99 and 119 (make yxxx.txt files as before).

Are the outputs different among various input file lengths? Why?

AttackMe (Part 4)

Now we proceed to this grand challenge.

First, examine the code in attackme.c and understand what it does. Then, run ./attackme program. This program has a secret. If you pass it a carefully-crafted input you can trigger it.

Can you figure out the trigger input using gdb, nm (see below section), objdump, printf, and whatever else?

Second, how would you change ./attackme so it became unattackable?

Helpful Unix Utility Programs

Print an endless stream of "y" lines.

yes

Character-level (really byte-level) translator.

tr -d '\n'

tr -d <STRING> deletes (-d) from the input all characters in <STRING>. trunderstands backslash escapes. Thusyes | tr -d '\n' removes all newlines ('\n'characters) from the output ofyes. The result is an endless stream of y` characters.

Print the first 10 lines of the input.

head
head -n 1000 //Print the first 1000 lines of the input.
head -c 1000 //Print the first 1000 characters of the input.

So, yes \| tr -d '\n' \| head -c 124 produces a file comprising 124 'y' characters with no terminating newline.

A command-line version of the familiar printf() function

printf `<FORMAT>` <ARGS...>

It can be convenient for printing special characters. For instance, try printf '\x36\x31\n'.

Print out a list of all the names defined by <PROGRAM>:

nm

This is called the program's symbol table. It includes functions and global variables defined by <PROGRAM>, as well as functions and global variables defined by system libraries and only referenced by <PROGRAM>. Here's some of its output:

U __libc_start_main@@GLIBC_2.0
08048567 T main
U printf@@GLIBC_2.0
U puts@@GLIBC_2.0
08048540 T read_line

The first column is the link address: where that symbol is located at run time. The second column defines the symbol type:

T means a text symbol (read-only global data, such as a function definition)
D means a data symbol (read/write global data)
U means undefined.
There are many more.

You can also use nm on individual object files.

Comprehension Questions

Notice from the makefile that the .unsafe extension is the same source file as the .c file, except that it has been compiled without boundary checking. What difference does this make during a buffer overflow? What additional information do you get from the "safe" version?
How might one use gdb's disassemble function or objdump to make a buffer overflow attack more strategic?
What sort of strategies have we learned so far for checking when memory has been overwritten? Which do you think might be useful for checking if a buffer overflow has taken place?