Overview
We introduce data movement and arithmetic instructions on x86-64.
Full lecture notes on assembly — Textbook readings
Machine code and assembly
- A computer processor reads instructions from memory
- The instructions tell the processor what to do
- Instructions have a byte representation (machine code)
- And a textual representation (assembly language)
How machine code is executed: simple model
- The processor (CPU—Central Processing Unit) reads instructions from memory
- It decodes each instruction and performs the corresponding operation before going to the next
- It executes instructions sequentially unless redirected explicitly by an instruction (a branch instruction, like “goto”)
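The simple execution model above can be sketched as a fetch, decode, execute loop. This is a toy illustration only: the three-instruction set, the `Insn` struct, and the `run` function are invented for this sketch and are not x86-64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy instruction set, invented for this sketch (not x86-64).
enum class Op { AddImm, JumpIfNonzero, Halt };

struct Insn {
    Op op;
    std::int64_t arg;   // AddImm: value to add; JumpIfNonzero: target index
};

// Runs `program` on a single accumulator register and returns its final value.
std::int64_t run(const std::vector<Insn>& program, std::int64_t acc) {
    std::size_t pc = 0;                      // index of the next instruction
    while (pc < program.size()) {
        const Insn& insn = program[pc];      // fetch
        ++pc;                                // default: continue sequentially
        switch (insn.op) {                   // decode and execute
        case Op::AddImm:
            acc += insn.arg;
            break;
        case Op::JumpIfNonzero:
            assert(insn.arg >= 0 &&
                   static_cast<std::size_t>(insn.arg) < program.size());
            if (acc != 0) {
                pc = static_cast<std::size_t>(insn.arg);   // explicit redirect
            }
            break;
        case Op::Halt:
            return acc;
        }
    }
    return acc;
}
```

With the program `{AddImm -1, JumpIfNonzero 0, Halt}`, the jump keeps sending control back to index 0 until the accumulator reaches zero, which is the only place execution departs from sequential order.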
How machine code is generated: simple model
How machine code is generated: assembler model
How machine code is generated: linking
Assembly example
0000000000401210 <add>:
401210: 8d 04 3e leal (%rsi,%rdi), %eax
401213: c3 retq
401214: 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax)
40121e: 66 90 nop
- Left: Address or offset at which code appears
- Middle: Machine code representation
- Right: Assembly language
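One source function that could plausibly produce this listing is sketched below. The name `add` comes from the disassembly; the `int` parameter types and the `extern "C"` linkage (suggested by the unmangled symbol name) are assumptions. Under the usual x86-64 Linux calling convention the first two integer arguments arrive in %rdi and %rsi and the result is returned in %eax, which matches the single leal instruction.

```cpp
#include <cassert>

// Assumed source for the <add> symbol in the listing above.
// extern "C" keeps the symbol name unmangled; this is a guess.
extern "C" int add(int a, int b) {
    return a + b;   // the sum that leal (%rsi,%rdi), %eax leaves in %eax
}
```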
Assembly flavors
- Compiler generated
  - make FILE.s, gcc -S
  - Includes symbolic names
  - Includes labels and directives (e.g., ## %bb.0:, .LFB0:)
  - Does not include machine code or offsets
- Read from object file
  - objdump -d file.o
  - Offsets can be weird because linker hasn’t set them yet
  - (For example, library function calls may show up as callq 31 <add+0x31> rather than callq 401090 <open@plt>)
- Read from executable
  - objdump -d exefile, gdb
  - Has final offsets, has fewer symbolic names
  - Has garbage at the end of functions (any idea why?)
Reading assembly
- Dive in and make assumptions! Assembly makes some sense
- Confused by an instruction? Look it up in our notes or more broadly
- Or even in the Intel x86-64 manual
Simple functions
.file "f00.cc"
.text
.globl _Z1fv
.type _Z1fv, @function
_Z1fv:
ret
.size _Z1fv, .-_Z1fv
.ident "GCC: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0"
.section .note.GNU-stack,"",@progbits
Directives, labels, instructions
- This is compiler-generated assembly
- Comprises directives, labels, and instructions
- Directive: an instruction to the assembler; controls aspects of the output that aren’t machine code
  - .file: What the source file was
  - .text: Which segment should store the generated instructions
  - .globl, .type: Information for the linker about the function
- Label: marks the next instruction, making it referenceable by other instructions and files
  - _Z1fv:
- Instruction: assembly language
  - ret
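The label `_Z1fv` is the mangled C++ name of a function `f` that takes no arguments, and the only instruction is `ret`. A source file consistent with that is sketched here; the `void` return type is an assumption, since the mangled name of an ordinary function does not record its return type.

```cpp
#include <cassert>

// Assumed contents of f00.cc: a function named f with no parameters.
// An empty body needs no instructions other than the return.
void f() {
}
```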
f00.s, f01.s
- In the body of this lecture, we look at assembly files generated by the compiler and try to reason through what the source files might be!
ret
- Three classes of instruction
- Arithmetic: perform computations on values
- Data movement: move data to and from primary memory
- Control flow: change the instruction sequence
- ret returns from the current function
  - It’s a control flow instruction
f02.s, f03.s
mov
- The mov instruction is a data movement instruction
- Format: mov SRC, DST
  - movl $100, %eax
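As a rough guide to where an instruction like `movl $100, %eax` comes from: integer return values are passed back in %eax, so a function that returns a constant is a likely source. The function name below is hypothetical, and the exact instructions a compiler emits depend on the compiler and optimization level.

```cpp
#include <cassert>

// Hypothetical function; an optimizing compiler would typically emit
// something like: movl $100, %eax ; ret
int return_hundred() {
    return 100;
}
```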
Registers
- Registers comprise the fastest kind of memory available to the CPU
- Machines have tons of memory but few registers
- x86-64 has just 16 general-purpose registers (including the stack pointer, %rsp)!
- Each 64 bits wide
- Registers have no addresses
  - They have names like %rax
  - They cannot be dereferenced using a numeric address or pointer
  - Their format and layout are not prescribed by the C++ memory model
Register slices
- Although registers are 64 bits wide, the data we handle is often smaller
- Names are provided for slices of each register
  - %rax: the entire register (bits 0–63)
  - %eax: the lowest 32 bits (bits 0–31)
  - %ax: the lowest 16 bits (bits 0–15)
  - %al: the lowest 8 bits (bits 0–7)
  - %ah: bits 8–15
- Instructions must match sizes
- In compiler-generated assembly, an instruction suffix indicates size
  - movl $100, %eax: move the 32-bit number 100 into the 32-bit register %eax
  - (This sets bits 32–63 of %rax to zero)
  - movq $100, %rax: move the 64-bit number 100 into the 64-bit register %rax
  - movl $100, %rax: syntax error (the 32-bit l suffix does not match the 64-bit register %rax)
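The slice names can be modeled in C++ with shifts and truncating casts. This is only a sketch of the bit ranges listed above: the `Slices` struct and both function names are invented for illustration, and `write_eax` models the rule that a 32-bit write clears bits 32–63.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view of one 64-bit register through its slice names.
struct Slices {
    std::uint32_t eax;  // bits 0–31
    std::uint16_t ax;   // bits 0–15
    std::uint8_t al;    // bits 0–7
    std::uint8_t ah;    // bits 8–15
};

Slices slices_of(std::uint64_t rax) {
    return Slices{
        static_cast<std::uint32_t>(rax),       // keep the low 32 bits
        static_cast<std::uint16_t>(rax),       // keep the low 16 bits
        static_cast<std::uint8_t>(rax),        // keep the low 8 bits
        static_cast<std::uint8_t>(rax >> 8),   // shift bits 8–15 down first
    };
}

// Models a 32-bit write such as movl $100, %eax: the old upper half is
// discarded, so the new 64-bit value is just the zero-extended operand.
std::uint64_t write_eax(std::uint64_t old_rax, std::uint32_t value) {
    (void) old_rax;   // the previous contents do not survive a 32-bit write
    return value;
}
```

For example, `slices_of(0x1122334455667788)` yields `eax == 0x55667788`, `ax == 0x7788`, `al == 0x88`, and `ah == 0x77`.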
f04.s, f05.s, f06.s, f07.s
Data operands and address modes
- $X: an immediate value (a constant)
- %X: a register value
- a(%rip): a global symbol
- (%X): an indirect reference (dereferencing a “pointer”)
- 8(%X): an offset indirect reference (dereferencing a structure or array)
- N(R) means dereference memory at address R + N
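To connect the operand forms above to source code, the sketch below pairs each form with a small C++ function that would typically need it. All names (`Point`, `global_counter`, the `load_*` functions) are hypothetical, and the instructions in the comments are what a compiler would plausibly emit on x86-64 Linux, not guaranteed output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical structure: y sits 8 bytes after the start of the object.
struct Point {
    std::int64_t x;
    std::int64_t y;
};
static_assert(offsetof(Point, y) == 8, "y is expected at offset 8");

std::int64_t global_counter = 7;   // hypothetical global

std::int64_t load_immediate() { return 61; }                   // $61
std::int64_t load_register(std::int64_t a) { return a; }       // %rdi
std::int64_t load_global() { return global_counter; }          // global_counter(%rip)
std::int64_t load_indirect(const std::int64_t* p) { return *p; }  // (%rdi)
std::int64_t load_offset(const Point* p) { return p->y; }      // 8(%rdi)
```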
f08.s
Arithmetic (computation) instructions
- OP SRC, DST means DST := DST OP SRC
- xorl %eax, %eax means %eax := %eax ^ %eax
- Which means…
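The two-operand convention can be modeled with functions that read a source and update a destination in place. The function names are invented for this sketch; `src` is passed by value and `dst` by reference to mirror SRC and DST. Calling `xor_into(r, r)` with the same variable in both positions is a quick way to experiment with what `xorl %eax, %eax` does.

```cpp
#include <cassert>
#include <cstdint>

// addl SRC, DST   modeled as   DST := DST + SRC
void add_into(std::uint32_t src, std::uint32_t& dst) { dst += src; }

// subl SRC, DST   modeled as   DST := DST - SRC (note the operand order)
void sub_into(std::uint32_t src, std::uint32_t& dst) { dst -= src; }

// xorl SRC, DST   modeled as   DST := DST ^ SRC
void xor_into(std::uint32_t src, std::uint32_t& dst) { dst ^= src; }
```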
f09.s, f10.s
f11.s
Moving into register slices
- mov[SIGN][SRCSIZE][DSTSIZE]
- SIGN is z (extend with zeros) or s (extend with sign bit)
- SRCSIZE/DSTSIZE is b (byte), w (short), l (int), or q (long)
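In C++ the choice between zero extension and sign extension follows from the signedness of the source type when a small value is converted to a wider one. The functions below are a sketch with invented names; the instruction in each comment is the one a compiler would typically use for that conversion, stated as an expectation rather than a guarantee.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned byte to 32 bits: upper bits filled with zeros (like movzbl).
std::uint32_t zero_extend_byte(std::uint8_t b) { return b; }

// Signed byte to 32 bits: upper bits copy the sign bit (like movsbl).
std::int32_t sign_extend_byte(std::int8_t b) { return b; }

// Signed int to 64 bits: upper bits copy the sign bit (like movslq).
std::int64_t sign_extend_int(std::int32_t v) { return v; }
```

The byte with bit pattern 0xFF becomes 255 when zero-extended and -1 when sign-extended.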
f12.s
f13.s, f14.s
f15.s, f16.s, f17.s
More data formats
- (%X,%Y,Z): an array indirect reference
  - Dereference memory at %X + %Y * Z
- Full format: offset(base,index,scale)
  - Dereference memory at offset + base + index * scale
  - offset must be a constant
  - scale must be 1, 2, 4, or 8
  - Default offset, base, and index are 0; default scale is 1
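The address computation for the full operand format can be written out as ordinary arithmetic. The function name and parameter types are assumptions made for this sketch; the defaults mirror the ones listed above, and the assertion enforces the restriction on scale.

```cpp
#include <cassert>
#include <cstdint>

// Computes the address named by offset(base,index,scale).
// Only the address is computed; nothing is dereferenced here.
std::uint64_t effective_address(std::int64_t offset = 0,
                                std::uint64_t base = 0,
                                std::uint64_t index = 0,
                                std::uint64_t scale = 1) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return static_cast<std::uint64_t>(offset) + base + index * scale;
}
```

For instance, `effective_address(16, 0x1000, 3, 8)` is `0x1000 + 16 + 3 * 8`, or `0x1028`. For an `int` array, element `i` lives at `effective_address(0, array_address, i, 4)`, which is why scaled indexing suits array accesses.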
f18.s
The lea instruction
- lea stands for Load Effective Address
- It performs an address computation, but does not dereference
- Often used by compiler as a parsimonious alternative to array arithmetic
- leal (%rdi,%rsi,8), %eax computes the same value as movl %esi, %eax; shll $3, %eax; addl %edi, %eax
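The two instruction sequences above can be compared by modeling each in C++ with 32-bit unsigned arithmetic, which wraps around the same way the 32-bit register result does. The function names are invented for this sketch, and the parameters are named after the registers they stand in for.

```cpp
#include <cassert>
#include <cstdint>

// Models leal (%rdi,%rsi,8), %eax: one address-style computation.
std::uint32_t lea_form(std::uint32_t edi, std::uint32_t esi) {
    return edi + esi * 8;
}

// Models the three-instruction alternative, one statement per instruction.
std::uint32_t three_step_form(std::uint32_t edi, std::uint32_t esi) {
    std::uint32_t eax = esi;   // movl %esi, %eax
    eax <<= 3;                 // shll $3, %eax
    eax += edi;                // addl %edi, %eax
    return eax;
}
```

Both functions return `edi + 8 * esi` modulo 2^32, so the single lea does the work of three instructions. The model covers only the value left in %eax; it does not capture effects on the processor's condition flags.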