g | x | w | all
Bytes Lang Time Link
nan170717T220148Zბიმო
nan180329T181850Zqwr
nan210920T132614Zecm
nanUseful onebyte instructions240909T151426Zqwr
nan240908T012122Zqwr
nanUse 32bit x86 instead of 64bit x8664240907T055727Zqwr
nan230518T214846Zl4m2
nan190312T184356Z640KB
nan210218T035601ZEasyasPi
nan221221T102102Zl4m2
nan"Free bypass" If you already have an instruction with an immediate on a register that you only care about the low part of220322T075050Zm90
nanUse specialcase shortform encodings for AL/AX/EAX180329T204407ZPeter Co
nan220123T082031Zm90
nanEntry point doesn't necessarily have to be first byte of submission210127T212048Z640KB
nan210616T224202ZEasyasPi
nan210326T181723ZNoah
nan180518T050305ZPeter Co
nan210201T042639ZEasyasPi
nan210128T213521ZEasyasPi
nan210127T231808ZEasyasPi
nan201207T151942Zanatolyg
nan170718T100833Zanatolyg
nan200520T120134ZKamila S
nanTry XLAT for byte memory access200212T153212Z640KB
nanTry AAM or AAD for byte division operations200212T152834Z640KB
003To copy a 64bit register190822T161356ZPeter Co
nanSave on jmp bytes by arranging into if/then rather than if/then/else190517T170122ZDaniel S
nan180406T195922Zqwr
128Subtract 128 instead of add180518T054321Zl4m2
nan180414T161507Zqwr
nan171111T024026Zpeter fe
nanThe loop and string instructions are smaller than alternative instruction sequences. Most useful is loop 170718T174647Zuser2301
nan180329T174303Zqwr
nanTo add or subtract 1170718T174137Zuser2301
nan180329T180504Zqwr
nan180329T175749Zqwr
nanYou can fetch sequential objects from the stack by setting esi to esp171114T002925Zpeter fe
001In a lot of cases170717T214734ZGovind P

mov-immediate is expensive for constants

This might be obvious, but I'll still put it here. In general it pays off to think about the bit-level representation of a number when you need to initialize a value.

Initializing eax with 0:

b8 00 00 00 00          mov    $0x0,%eax

should be shortened (for performance as well as code-size) to

31 c0                   xor    %eax,%eax

Initializing eax with -1:

b8 ff ff ff ff          mov    $-1,%eax

can be shortened to (32-bit mode)

31 c0                   xor    %eax,%eax
48                      dec    %eax      # 2 bytes in 64-bit mode

or (any mode)

83 c8 ff                or     $-1,%eax

Or more generally, any 8-bit sign-extended value can be created in 3 bytes with push -12 (2 bytes) / pop %eax (1 byte). This even works for 64-bit registers with no extra REX prefix; push/pop default operand-size = 64.

6a f3                   pushq  $0xfffffffffffffff3
5d                      pop    %rbp

Or given a known constant in a register, you can create another nearby constant using lea 123(%eax), %ecx (3 bytes). This is handy if you need a zeroed register and a constant; xor-zero (2 bytes) + lea-disp8 (3 bytes).

31 c0                   xor    %eax,%eax
8d 48 0c                lea    0xc(%eax),%ecx

See also Set all bits in CPU register to 1 efficiently

lea for multiplications by particular small constants

A well-known trick is that lea can be used to do multiplication by 2, 3, 5, or 9 (and store in a new register) in 3 bytes, optionally adding a displacement register for 1 additional byte. All examples assume 32-bit mode.

For example, to calculate ebx = 9 * eax in 3 bytes:

8d 1c c0                lea    ebx, [eax, 8*eax]

ebx = 9*eax + 3 in 4 bytes:

8d 5c c0 03             lea    ebx, [eax + 8*eax + 3]

For multiplication by 2, it saves 1 byte compared to shl/mov:

8d 1c 00                lea    ebx, [eax + eax]

89 c3                   mov    ebx, eax
d1 e3                   shl    ebx, 1

Of course, if you shift in-place, you only need shl.

Interestingly, lea multiplication by 4 or 8 isn't size efficient at all because it ends up using 0x00000000 as an absolute displacement. Another difference is that lea doesn't affect flags at all, which may or may not be a good thing depending on the flags use-case.

8d 1c 85 00 00 00 00    lea     ebx, [4*eax]

89 c3                   mov     ebx, eax
c1 e3 02                shl     ebx, 2

(GCC -Oz showed me if you are returning or starting in eax and are ok with clobbering the original register, you can save a byte over mov with xchg. That has nothing to do with shl.)

If the result ends in eax, you can use imul on any constant from 2 to 127 for 3 bytes.

More info on x86's addressing modes: https://blog.yossarian.net/2020/06/13/How-x86_64-addresses-memory

Skipping instructions

Skipping instructions are opcode fragments that combine with one or more subsequent opcodes. The subsequent opcodes can be used with a different entrypoint than the prepended skipping instruction. Using a skipping instruction instead of an unconditional short jump can save code space, be faster, and set up incidental state such as NC (No Carry).

My examples are all for 16-bit Real/Virtual 86 Mode, but a lot of these techniques can be used similarly in 16-bit Protected Mode, or 32- or 64-bit modes.

Quoting from my ACEGALS guide:

11: Skipping instructions

The constants __TEST_IMM8, __TEST_IMM16, and __TEST_OFS16_IMM8 are defined to the respective byte strings for these instructions. They can be used to skip subsequent instructions that fit into the following 1, 2, or 3 bytes. However, note that they modify the flags register, including always setting NC. The 16-bit offset plus 16-bit immediate test instruction is not included for these purposes because it might access a word at offset 0FFFFh in a segment. Also, the __TEST_OFS16_IMM8 as provided should only be used in 86M, to avoid accessing data beyond a segment limit. After the db instruction using one of these constants, a parenthetical remark should list which instructions are skipped.

The 86 Mode defines in lmacros1.mac 323cc150061e (2021-08-29 21:45:54 +0200):

%define __TEST_IMM8 0A8h                        ; changes flags, NC
%define __TEST_IMM16 0A9h                       ; changes flags, NC
                                        ; Longer NOPs require two bytes, like a short jump does.
                                        ; However they execute faster than unconditional jumps.
                                        ; This one reads random data in the stack segment.
                                        ;  (Search for better ones.)
%define __TEST_OFS16_IMM8 0F6h,86h              ; changes flags, NC

The 0F6h,86h opcode in 16-bit modes is a test byte [bp + disp16], imm8 instruction. I believe I am not using this one anywhere actually. (A stack memory access might actually be slower than an unconditional short jump, in fact.)

0A8h is the opcode for test al, imm8 in any mode. The 0A9h opcode changes to an instruction of the form test eax, imm32 in 32- and 64-bit modes.

Two use cases in ldosboot boot32.asm 07f4ba0ef8cd (2021-09-10 22:45:32 +0200):

First, chain two different entrypoints for a common function which both need to initialise a byte-sized register. The mov al, X instructions take 2 bytes each, so __TEST_IMM16 can be used to skip one such instruction. (This pattern can be repeated if there are more than two entrypoints.)

error_fsiboot:
        mov al,'I'

        db __TEST_IMM16 ; (skip mov)
read_sector.err:
        mov al, 'R'     ; Disk 'R'ead error

error:

Second, a certain entrypoint that needs two bytes worth of additional teardown but can otherwise be shared with the fallthrough case of a later code part.

                mov bx, [VAR(para_per_sector)]
                sub word [VAR(paras_left)], bx
                jbe @F          ; read enough -->

                loop @BB
                pop bx
                pop cx

                call clust_next
                jnc next_load_cluster
                inc ax
                inc ax
                test al, 8      ; set in 0FFF_FFF8h--0FFF_FFFFh,
                                ;  clear in 0, 1, and 0FFF_FFF7h
                jz fsiboot_error_badchain
                db __TEST_IMM16
@@:
                pop bx
                pop cx
                call check_enough
                jmp near word [VAR(fsiboot_table.success)]

Here's a use case in inicomp lz4.asm 4d568330924c (2021-09-03 16:59:42 +0200) where we depend on the test al, X instruction clearing the Carry Flag:

.success:
        db __TEST_IMM8                  ; (NC)
.error:
        stc
        retn

Further, here's a very similar use of a skipping instruction in DOSLFN Version 0.41c (11/2012). Instead of test ax, imm16 they're using mov cx, imm16 which has no effect on the status flags but clobbers the cx register instead. (Opcode 0B9h is mov ecx, imm32 in non-16-bit modes, and writes to the full ecx or rcx register.)

;THROW-Geschichten... [english: THROW stories...]
SetErr18:
        mov     al,18
        db      0B9h            ;mov cx,nnnn
SetErr5:
        mov     al,5
        db      0B9h            ;mov cx,nnnn
SetErr3:
        mov     al,3
        db      0B9h            ;mov cx,nnnn
SetErr2:
        mov     al,2
SetError:

Finally, the FAT12 boot loader released on 2002-11-26 as fatboot.zip/fat12.asm by Chris Giese (which I based my FAT12, FAT16, and FAT32 loaders on) uses cmp ax, imm16 as a skipping instruction in its error handler. This is similar to my lDOS boot error handlers but cmp leaves an indeterminate Carry Flag state rather than always setting up No Carry. Also note the comment referring to "Microsoft's Color Computer BASIC":

    mov al,'F'  ; file not found; display blinking 'F'

; 'hide' the next 2-byte instruction by converting it to CMP AX,NNNN
; I learned this trick from Microsoft's Color Computer BASIC :)
    db 3Dh
disk_error:
    mov al,'R'  ; disk read error; display blinking 'R'
error:

Useful one-byte instructions

WIP. Most are listed here. This answer will assume x86 (32-bit), with some notes for x86-64 (usually more bytes). I checked these with Compiler Explorer (godbolt.org) running NASM.

Stack Operations

Register / Integer Operations

String Operations

Word (16-bit) versions of the instructions take 2 bytes. In 64-bit mode, defaults to 64-bit operands, and qword versions also take 2 bytes.

Opcode Prefixes (technically one byte)

LOOP according to ECX

This is the famously slow instruction that decrements CX/ECX/RCX as a counter, then short jumps if not zero, all in 2 bytes! LOOPcc variants also check ZF. Compare this to regular short branching, which takes 3 bytes:

E2 cb   loop    label   ; cb is rel8
4A      dec     ecx     ; sets ZF
75 cb   jnz     label   

JECXZ according to ECX

JCXZ/JECXZ/JRCXZ are weird instructions in the Jcc family that jump short based on if CX/ECX/RCX is 0, in 2 bytes. If you had an instruction before that sets ZF such as dec, you could use a standard JZ.

Use 32-bit x86 instead of 64-bit x86-64, if you can

This is a bit silly, but many code golf challenges only require 32-bit inputs, and 32-bit programs execute just fine on x86-64 processors (almost all instructions work, you can't use the lower 8 bits SIL, DIL, SPL, BPL). Assemble with nasm -f elf32 and link with ld -m elf_i386. You avoid extra bytes from REX prefixes without even thinking about it. MUL can multiply two 32-bit numbers and puts the full result in EDX:EAX, and DIV can divide a 64-bit dividend by r/m32.

...or 16-bit DOS, if you can

NASM can assemble raw COM files with -f bin. You'll need an emulator like emu2 or DOSBox to run these, so it won't run natively on a modern system, which is half the fun of x86.

AND to clear memory

0:  c7 01 00 00 00 00       mov    DWORD PTR [ecx],0x0
6:  31 c0 89 01             xor    eax,eax / mov DWORD PTR [ecx],eax
a:  83 21 00                and    DWORD PTR [ecx],0x0 

CPU registers and flags are in known startup states

For a full/standalone program, we can assume that the CPU is in a known and documented default state based on platform and OS.

For example:

DOS http://www.fysnet.net/yourhelp.htm

Linux x86 ELF http://asm.sourceforge.net/articles/startup.html - in _start in a static executable, most registers are zero other than the stack pointer, to avoid leaking info into a fresh process. pop will load argc which is a small non-negative integer, 1 if run normally from a shell with no args.

Same applies for x86-64 processes on Linux.

Use a good assembler

There are dozens of x86 assemblers out there, and they are not created equal.

Not only can a bad assembler be painful to use, but they might not always output the most optimal code.

Most x86 instructions have multiple valid encodings, some shorter than others.

For example, I saw one user with a 16-bit assembler that emitted different code depending on the order of xchg's operands. It is a commutative operation, it shouldn't make a difference.

87 D8   xchg    ax, bx
93      xchg    bx, ax

Life is too short for bad assemblers, and it should not be the thing getting in the way of golfing.

The three assemblers I would suggest are:

I haven't used enough of the other assemblers to give a good opinion, as I am more than satisfied with those three.

FASM is also well-regarded, using very nearly the same syntax as NASM, and is supported on https://TIO.run/ ; It's able to make 32-bit executables on TIO, unlike with other assemblers, using the directive format ELF executable 3 to emit a 32-bit ELF executable, not a .o object file that would need linking.

EuroAssembler is also open-source; its maintainer is active on Stack Overflow in the [assembly] and [x86] tags.

pop ax and throw a value from stack

pop eax

push word 0; push word 8

push dword 8

push dword 9999

push word 0
push word 9999

"Free bypass": If you already have an instruction with an immediate on a register that you only care about the low part of, making the immediate longer than necessary could allow you to insert into its high part other instructions that can be jumped to (but don't execute when coming from before). This works because of little-endianness. Example; another example.

Use special-case short-form encodings for AL/AX/EAX, and other short forms and single-byte instructions

Examples assume 32 / 64-bit mode, where the default operand size is 32 bits. An operand-size prefix changes the instruction to AX instead of EAX (or the reverse in 16-bit mode).

8-bit inc bl is 2 bytes, using the inc r/m8 opcode + ModR/M operand encoding. So use inc ebx to increment bl, if it's safe. (e.g. if you don't need the ZF result in cases where the upper bytes might be non-zero).

xlatb exists, but is rarely useful. A large lookup table is something to avoid. I've also never found a use for AAA / DAA or other packed-BCD or 2-ASCII-digit instructions, except for a hacky use of DAS as part of converting a 4-bit integer to an ASCII hex digit, thanks to Peter Ferrie.

1-byte lahf / sahf are rarely useful. You could lahf / and ah, 1 as an alternative to setc ah, but it's typically not useful.

And for CF specifically, there's sbb eax,eax to get a 0/-1, or even un-documented but universally supported 1-byte salc (set AL from Carry) which effectively does sbb al,al without affecting flags. (Removed in x86-64). I used SALC in User Appreciation Challenge #1: Dennis ♦.

1-byte cmc / clc / stc (flip ("complement"), clear, or set CF) are rarely useful, although I did find a use for cmc in extended-precision addition with base 10^9 chunks. To unconditionally set/clear CF, usually arrange for that to happen as part of another instruction, e.g. xor eax,eax clears CF as well as EAX. There are no equivalent instructions for other condition flags, just DF (string direction) and IF (interrupts). The carry flag is special for a lot of instructions; shifts set it, adc al, 0 can add it to AL in 2 byte, and I mentioned earlier the undocumented SALC.

std / cld rarely seem worth it. Especially in 32-bit code, it's better to just use dec on a pointer and a mov or memory source operand to an ALU instruction instead of setting DF so lodsb / stosb go downward instead of up. Usually if you need downward at all, you still have another pointer going up, so you'd need more than one std and cld in the whole function to use lods / stos for both. Instead, just use the string instructions for the upward direction. (The standard calling conventions guarantee DF=0 on function entry, so you can assume that for free without using cld.)


8086 history: why these encodings exist

In original 8086, AX was very special: instructions like lodsb / stosb, cbw, mul / div and others use it implicitly. That's still the case of course; current x86 hasn't dropped any of 8086's opcodes (at least not any of the officially documented ones, except in 64-bit mode). But later CPUs added new instructions that gave better / more efficient ways to do things without copying or swapping them to AX first. (Or to EAX in 32-bit mode.)

e.g. 8086 lacked later additions like movsx / movzx to load or move + sign-extend, or 2 and 3-operand imul cx, bx, 1234 that don't produce a high-half result and don't have any implicit operands.

Also, 8086's main bottleneck was instruction-fetch, so optimizing for code-size was important for performance back then. 8086's ISA designer (Stephen Morse) spent a lot of opcode coding space on special cases for AX / AL, including special (E)AX/AL-destination opcodes for all the basic immediate-src ALU- instructions, just opcode + immediate with no ModR/M byte. 2-byte add/sub/and/or/xor/cmp/test/... AL,imm8 or AX,imm16 or (in 32-bit mode) EAX,imm32.

But there's no special case for EAX,imm8, so the regular ModR/M encoding of add eax,4 is shorter.

The assumption is that if you're going to work on some data, you'll want it in AX / AL, so swapping a register with AX was something you might want to do, maybe even more often than copying a register to AX with mov.

Everything about 8086 instruction encoding supports this paradigm, from instructions like lodsb/w to all the special-case encodings for immediates with EAX to its implicit use even for multiply/divide.


Don't get carried away; it's not automatically a win to swap everything to EAX, especially if you need to use immediates with 32-bit registers instead of 8-bit. Or if you need to interleave operations on multiple variables in registers at once. Or if you're using instructions with 2 registers, not immediates at all.

But always keep in mind: am I doing anything that would be shorter in EAX/AL? Can I rearrange so I have this in AL, or am I currently taking better advantage of AL with what I'm already using it for.

Mix 8-bit and 32-bit operations freely to take advantage whenever it's safe to do so (you don't need carry-out into the full register or whatever).

Combinations with CDQ for certain piecewise-linear functions

CDQ sign-extends EAX into EDX, making EDX 0 if EAX is nonnegative and -1 (all 1s) if EAX is negative. This can be combined with several other instructions to apply certain piecewise-linear functions to a value in EAX in 3 bytes:

CDQ + AND\$ \min(x, 0) \$ (in either EAX or EDX). (I have used this here.)

CDQ + OR\$ \max(x, -1) \$.

CDQ + XOR\$ \max(x, -x-1) \$.

CDQ + MUL EDX\$ \max(-x, 0) \$ in EAX and \$ \left\{ \begin{array}{ll} 0 & : x \ge 0 \\ x - 1 & : x < 0 \end{array} \right.\$ in EDX.

Entry point doesn't necessarily have to be first byte of submission

I came across this answer, and didn't understand it at first until I realized that the intention is:

; ---- example calling code starts here -------------

                MOV     ECX, 1
                CALL    entry
                RET

; ---- code golf answer code starts here (5 bytes) --

       41       INC     ECX
entry: E3 FD    JECXZ   SHORT $-1
       91       XCHG    EAX,ECX
       C3       RETN

; ---- code golf answer code ends here -------------

Does not seem to conflict with any of the conditions of "Choose your calling convention" and is otherwise valid assembly language.

Use interrupts and syscalls wisely

In general, unlike the C calling conventions, most syscalls and interrupts will preserve your registers and flags unless noted otherwise, except for a return value usually in AL/EAX/RAX depending on the OS. (e.g. the x86-64 syscall instruction itself destroys RCX and R11)

Linux specific:

DOS/BIOS specific:

        ; AH: 0x25 (set interrupt)
        ; AL: 0x03 (INT3)
        ; Use 0x2504 for INTO, or 0x2501 for INT1
        mov     ax, 0x2503
        ; DX: address to function (make sure to use IRET instead of RET)
        mov     dx, my_interrupt_func
        int     0x21

        ; Now instead of this 3 byte opcode...
        call    my_func
        ; ...just do this 1 byte opcode.
        int3 ; or int1, into

Align Registers to by Power of 2 Value with or/inc

Say you want to aligned a pointer for a load. I.e saying aligning for xmm load:

This is a pretty common idiom:

addq $16, %rdi   // 4b
andq $-16, %rdi  // 4b

A cheaper way:

orq $15, %rdi    // 4b
incq %rdi        // 3b

If your pointer is in RAX, the special 2-byte op al, imm8 no-modrm encoding is useful. Writing a byte or word register leaves the upper bytes untouched. On older Intel CPUs (before Sandybridge) this can cause a partial-register stall, but even for performance it's safe on modern CPUs.

or  $0xf, %al     // 2b   (leave upper bytes untouched)
inc %rax          // 3b   (carry into the upper bytes is possible)

or  $0xf, %dl     // 3b
inc %rdx          // 3b

or  $0xf, %dil    // 4b  REX + opcode + modrm + imm8
inc %rdi          // 3b  no savings

Also works with and $0xf0, %al to round down to the previous alignment boundary.


Also worth noting the or alone will round to value minus 1 so it can be used to keep address offsets in imm8 short encoding range i.e:

andq $-16, %rdi
// Stuff on (%rdi)[0, 79]
vmovdqa 80(%rdi), %xmm0

vs.

orq $15, %rdi
// Stuff on -15(%rdi)[0, 79]
vmovdqa 65(%rdi), %xmm0
# vmovdqa 80 - 15(%rdi), %xmm0   # same thing, but with the -15 in the asm source

Choose your calling convention to put args where you want them.

The language of your answer is asm (actually machine code), so treat it as part of a program written in asm, not C-compiled-for-x86. Your function doesn't have to be easily callable from C with any standard calling convention. That's a nice bonus if it doesn't cost you any extra bytes, though.

In a pure asm program, it's normal for some helper functions to use a calling convention that's convenient for them and for their caller. Such functions document their calling convention (inputs/outputs/clobbers) with comments.

In real life, even asm programs do (I think) tend to use consistent calling conventions for most functions (especially across different source files), but any given important function could do something special. In code-golf, you're optimizing the crap out of one single function, so obviously it's important/special.


To test your function from a C program, you can write a wrapper that puts args in the right places, saves/restores any extra registers you clobber, and puts the return value into e/rax if it wasn't there already.


The limits of what's reasonable: anything that doesn't impose an unreasonable burden on the caller:

Note 1: or have some well-defined sensible rule for how RSP is modified: e.g. callee-pops stack args like with ret 8. (Although stack args and a larger ret imm16 encoding are usually not what you want for code-golf). Or even returning an array by value on the stack is an unconventional but usable calling-convention. e.g. pop the return address into a register, then push in a loop, then jmp reg to return. Probably only justifiable with a size in a register, else the caller would have to save the original RSP somewhere. RBP or some other reg would have to be call-preserved so a caller could use it as a frame pointer to easily clean up the result. Also probably not smaller than using stos to write an output pointer passed in RDI.

Borderline cases: write a function that produces a sequence in an array, given the first 2 elements as function args. I chose to have the caller store the start of the sequence into the array and just pass a pointer to the array. This is definitely bending the question's requirements. I considered taking the args packed into xmm0 for movlps [rdi], xmm0, which would also be a weird calling convention and ever harder to justify / more of a stretch.


Return a boolean in FLAGS (condition codes)

OS X system calls do this (CF=0 means no error): Is it considered bad practice to use the flags register as a boolean return value?.

Any condition that can be checked with one jcc is perfectly reasonable, especially if you can pick one that has any semantic relevance to the problem. (e.g. a compare function might set flags so jne will be taken if they weren't equal).


Require narrow args (like a char) to be sign or zero extended to 32 or 64 bits.

This is not unreasonable; using movzx or movsx to avoid partial-register slowdowns is normal in modern x86 asm. In fact clang/LLVM already makes code that depends on an undocumented extension to the x86-64 System V calling convention: args narrower than 32 bits are sign or zero extended to 32 bits by the caller.

You can document/describe extension to 64 bits by writing uint64_t or int64_t in your prototype if you want, e.g. so you can use a loop instruction, which uses the whole 64 bits of rcx unless you use an address-size prefix to override the size down to 32 bit ecx (yes really, address-size not operand-size).

Note that long is only a 32-bit type in the Windows 64-bit ABI, and the Linux x32 ABI; uint64_t is unambiguous and shorter to type than unsigned long long.


Existing calling conventions:

Use shr for first occurrence lookup tables

Often times you need to do something like this to check whether a byte has been encountered:

bool found[256]={false};
for (uint8_t x : data) {
    // Check for first occurrence
    if (!found[x]) {
        // mark it as encountered
        found[x] = true;
        // do something
        func(x);
    }
}

There is a very easy way to do this in x86, and that is via shr.

Setup

Setting it up is easy thanks to push -1 which can create a block of 1 bits. I use x86_64 here, but it is the same on x86.

You can switch it to rep stos if you have that set up, but it isn't worth it to set up just for that loop.

        push    TABLE_SIZE/8
        pop     rcx
.Lpush_loop:
        push    -1
        loop    .Lpush_loop

Usage

shr will both mark the occurrence and indicate that it was unique by setting OF.

.Lloop:
        lodsb # example

        # 3 bytes (2 bytes if you use RDI/RBX/RBP/RSI or a register src)
        shr     byte ptr [rsp + rax]
        # OF == first occurrence
        jno     .Lfound

.Lfirst_occurrence:
        ...
.Lfound:
        loop    .Lloop # example

Magic.

See my answer for Type uniqchars! where I thoroughly explain it.

Avoid registers which need prefixes

Quite a simple tip I haven't seen mentioned before.

Avoid r8-r15 (as well as dil, sil, bpl, and spl) on x86_64 like the plague. Even just thinking about these registers requires an extra REX prefix. The only exception is if you are using them exclusively for 64-bit arithmetic (which also needs REX prefixes). Even still, you are usually better off using a low register since some operations can be done using the implicit zero extension.

Note that this tip also applies to ARM Thumb-2.

Additionally, be careful when using 16-bit registers in 32/64-bit mode (as well as 32-bit registers in 16-bit mode, but this is rare), as these need a prefix byte as well.

However, unlike the extra x86_64 registers, 16-bit instructions can be useful: Many instructions which would otherwise need a full 32-bit immediate argument will only use a 16-bit argument. So, if you were to bitwise and eax by 0xfffff00f, and ax, 0xf00f would be smaller.

Take advantage of the x86_64 code model

Linux's default code model will put all of your code and globals in the low 31 bits of memory, so 32-bit pointer arithmetic here is perfectly safe. The stack, libraries, and any dynamically allocated pointers are not, though. Try it online!

Make sure to still use the entire 64 bits in memory operands (including lea), because using [eax] requires a 67 prefix byte.

Use multiplication for hashing

IMUL, multiplication by an immediate signed number, is a powerful instruction which can be used for hashing.

The regular multiplication instruction hard-codes one of the input operands and the output operand to be in eax (or ax or al). This is inconvenient; it requires instructions for setup and sometimes also to save and restore eax and edx. But if one of the operands is a constant, the instruction becomes much more versatile:

I used this many times (I hope I can be excused for these shameless plugs: 1 2 3 ...)

Use fastcall conventions

x86 platform has many calling conventions. You should use those that pass parameters in registers. On x86_64, the first few parameters are passed in registers anyway, so no problem there. On 32-bit platforms, the default calling convention (cdecl) passes parameters in stack, which is no good for golfing - accessing parameters on stack requires long instructions.

When using fastcall on 32-bit platforms, 2 first parameters are usually passed in ecx and edx. If your function has 3 parameters, you might consider implementing it on a 64-bit platform.

C function prototypes for fastcall convention (taken from this example answer):

extern int __fastcall SwapParity(int value);                 // MSVC
extern int __attribute__((fastcall)) SwapParity(int value);  // GNU   

Note: you can also use other calling conventions, including custom ones. I never use custom calling conventions; for any ideas related to these, see here.

(way too many) ways of zeroing a register

I remember being taught these by a certain person (I "invented" some of these myself); I don't remember who did I get them from, anyways these are the most interesting; possible use cases include restricted source code challenges or other bizzare stuff.

=> Zero mov:

mov reg, 0
; mov eax, 0: B800000000

=> push+pop:

push [something equal to zero]
pop reg
; push 0 / pop eax: 6A0058

; note: if you have a register equal to zero, it will be
; shorter but also equal to a mov.

=> sub from itself:

sub reg, reg
; sub eax, eax: 29C0

=> mul by zero:

imul reg, 0
; imul eax, 0: 6BC000

=> and by zero:

and reg, 0
; and eax, 0: 83E000

=> xor by itself:

xor reg, reg
; xor eax, eax: 31C0

; possibly the best way to zero an arbitrary register,
; I remembered this opcode (among other).

=> or and inc / not:

or reg, -1
inc reg     ; or not reg

; or eax, -1 / inc eax: 83C8FF40

=> reset ECX:

loop $

; loop $: E2FE

=> flush EDX:

shr eax, 1
cdq

; D1E899

=> zero AL (AH = AL, AL = 0)

aam 1
; D401

=> reset AH:

aad 0
; D500

=> Read 0 from the port

mov dx, 81h
in al, dx
; 66BA8100EC

=> Reset AL

stc
setnc al
; F90F93C0

=> Use the zero descriptor from gdt:

sgdt [esp-6]
mov reg, [esp-4]
mov reg, [reg]
; with eax: 0F014424FA8B4424FC8B00

=> Read zero from the fs segment (PE exe only)

mov reg, fs:[10h]
; with eax: 64A110000000

=> The brainfuck way

inc reg
jnz $-1
; with eax: 4075FD

=> Utilize the coprocessor

fldz
fistp dword ptr [esp-4]
mov eax, [esp-4]
; D9EEDB5C24FC8B4424FC

Another possible options:

There are way cooler and potentially useful ways to execute this operation; although I didn't come up with them, therefore I'm not posting.

Try XLAT for byte memory access

XLAT is a one byte instruction that is equivalent to AL = [BX+AL]. Yes, that's right, it lets you use AL as an index register for memory access.

Try AAM or AAD for byte division operations

If you are working with only 8 bit values, using the AAM instruction can sometimes save several bytes over DIV reg8 since it will take an imm8 and returns remainder and quotient in opposite AH/AL registers as DIV.

D4 0A    AAM        ; AH = AL / 10, AL = AL % 10

It can also accept any byte value as the divisor as well by altering the second byte.

D4 XX    AAM  XX    ; AH = AL / XX, AL = AL % XX

And AAD is the inverse of this, which is two operations in one.

D5 XX    AAD  XX    ; AL = AH * XX + AL

To copy a 64-bit register, use push rcx ; pop rdx instead of a 3-byte mov.
The default operand-size of push/pop is 64-bit without needing a REX prefix.

  51                      push   rcx
  5a                      pop    rdx
                vs.
  48 89 ca                mov    rdx,rcx

(An operand-size prefix can override the push/pop size to 16-bit, but 32-bit push/pop operand-size is not encodeable in 64-bit mode even with REX.W=0.)

If either or both registers are r8..r15, use mov because push and/or pop will need a REX prefix. Worst case this actually loses if both need REX prefixes. Obviously you should usually avoid r8..r15 anyway in code golf.


You can keep your source more readable while developing with this NASM macro. Just remember that it steps on the 8 bytes below RSP. (In the red-zone in x86-64 System V). But under normal conditions it's a drop-in replacement for 64-bit mov r64,r64 or mov r64, -128..127

    ; mov  %1, %2       ; use this macro to copy 64-bit registers in 2 bytes (no REX prefix)
%macro MOVE 2
    push  %2
    pop   %1
%endmacro

Examples:

   MOVE  rax, rsi            ; 2 bytes  (push + pop)
   MOVE  rbp, rdx            ; 2 bytes  (push + pop)
   mov   ecx, edi            ; 2 bytes.  32-bit operand size doesn't need REX prefixes

   MOVE  r8, r10             ; 4 bytes, don't use
   mov   r8, r10             ; 3 bytes, REX prefix has W=1 and the bits for reg and r/m being high

   xchg  eax, edi            ; 1 byte  (special xchg-with-accumulator opcodes)
   xchg  rax, rdi            ; 2 bytes (REX.W + that)

   xchg  ecx, edx            ; 2 bytes (normal xchg + modrm)
   xchg  rcx, rdx            ; 3 bytes (normal REX + xchg + modrm)

The xchg part of the example is because sometimes you need to get a value into EAX or RAX and don't care about preserving the old copy. push/pop doesn't help you actually exchange, though.

Save on jmp bytes by arranging into if/then rather than if/then/else

This is certainly very basic, just thought I would post this as something to think about when golfing. As an example, consider the following straightforward code to decode a hexadecimal digit character:

    cmp $'A', %al
    jae .Lletter
    sub $'0', %al
    jmp .Lprocess
.Lletter:
    sub $('A'-10), %al
.Lprocess:
    movzbl %al, %eax
    ...

This can be shortened by two bytes by letting a "then" case fall into an "else" case:

    cmp $'A', %al
    jb .digit
    sub $('A'-'0'-10), %eax
.digit:
    sub $'0', %eax
    movzbl %al, %eax
    ...

Use whatever calling conventions are convenient

System V x86 uses the stack and System V x86-64 uses rdi, rsi, rdx, rcx, etc. for input parameters, and rax as the return value, but it is perfectly reasonable to use your own calling convention. __fastcall uses ecx and edx as input parameters, and other compilers/OSes use their own conventions. Use the stack and whatever registers as input/output when convenient.

Example: The repetitive byte counter, using a clever calling convention for a 1 byte solution.

Meta: Writing input to registers, Writing output to registers

Other resources: Agner Fog's notes on calling conventions

Subtract -128 instead of add 128

0100 81C38000      ADD     BX,0080
0104 83EB80        SUB     BX,-80

Samely, add -128 instead of subtract 128

Use do-while loops instead of while loops

This is not x86 specific but is a widely applicable beginner assembly tip. If you know a while loop will run at least once, rewriting the loop as a do-while loop, with loop condition checking at the end, often saves a 2 byte jump instruction. In a special case you might even be able to use loop.

Create 3 zeroes with mul (then inc/dec to get +1 / -1 as well as zero)

You can zero eax and edx by multiplying by zero in a third register.

xor   ebx, ebx      ; 2B  ebx = 0
mul   ebx           ; 2B  eax=edx = 0

inc   ebx           ; 1B  ebx=1

will result in EAX, EDX, and EBX all being zero in just four bytes. You can zero EAX and EDX in three bytes:

xor eax, eax
cdq

But from that starting point you can't get a 3rd zeroed register in one more byte, or a +1 or -1 register in another 2 bytes. Instead, use the mul technique.

Example use-case: concatenating the Fibonacci numbers in binary.

Note that after a LOOP loop finishes, ECX will be zero and can be used to zero EDX and EAX; you don't always have to create the first zero with xor.

The loop and string instructions are smaller than alternative instruction sequences. Most useful is loop <label> which is smaller than the two instruction sequence dec ECX and jnz <label>, and lodsb is smaller than mov al,[esi] and inc si.

The FLAGS are set after many instructions

After many arithmetic instructions, the Carry Flag (unsigned) and Overflow Flag (signed) are set automatically (more info). The Sign Flag and Zero Flag are set after many arithmetic and logical operations. This can be used for conditional branching.

Example:

d1 f8                   sar    %eax

ZF is set by this instruction, so we can use it for condtional branching.

To add or subtract 1, use the one byte inc or dec instructions which are smaller than the multibyte add and sub instructions.

Use conditional moves CMOVcc and sets SETcc

This is more a reminder to myself, but conditional set instructions exist and conditional move instructions exist on processors P6 (Pentium Pro) or newer. There are many instructions that are based on one or more of the flags set in EFLAGS.

mov small immediates into lower registers when applicable

If you already know the upper bits of a register are 0, you can use a shorter instruction to move an immediate into the lower registers.

b8 0a 00 00 00          mov    $0xa,%eax

versus

b0 0a                   mov    $0xa,%al

Use push/pop for imm8 to zero upper bits

Credit to Peter Cordes. xor/mov is 4 bytes, but push/pop is only 3!

6a 0a                   push   $0xa
58                      pop    %eax

You can fetch sequential objects from the stack by setting esi to esp, and performing a sequence of lodsd/xchg reg, eax.

In a lot of cases, accumulator-based instructions (i.e. those that take (R|E)AX as the destination operand) are 1 byte shorter than general-case instructions; see this question on StackOverflow.