X86 Tips And Tricks
Laura Fairhead (April 2000)
I find it strange that an awful lot of ASM programmers who write
real-mode code seem to think that they (i) have to be 8086 compatible
and (ii) can only use 16-bit functions.
The first point is just a disease but the 2nd I think is due to
In real-mode you have access to ALMOST all of the functionality of
the processor, in fact you typically have MORE access than the average
p-mode program (because the O/S will block you).
Worse still, it has been common knowledge for a long time now that
you can even access the WHOLE of memory (that is ALL 4GB) without
any help from special drivers. (Mind you, you do have to toggle that
bit in CR0, albeit momentarily....)
Any 386-aware assembler will assemble code that uses 32-bit operand-size
or address-size by adding the prefixes 066h and 067h for you. Of course
this is the hit: you add one byte each time you access a 32-bit register,
but the same goes if you were accessing a 16-bit register in 32-bit code.
Apart from using 32-bit registers/operands you also have access
to a 32-bit address size. This holds so many benefits; for example,
if you want to access some data on the stack you DON'T have to do that
MOV BP,SP rubbish, just:-
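A minimal sketch of the idea, assuming MASM/TASM-style syntax with .386 enabled in a 16-bit segment. The one catch, and it is an assumption about your setup, is that the upper half of ESP must be zero; otherwise the offset runs past the 64K segment limit and the processor faults. The registers pushed are arbitrary:-

```asm
        MOVZX   ESP,SP          ; once at startup: zero the upper 16 bits of ESP

        PUSH    BX              ; two words pushed, so the stack now holds
        PUSH    CX              ;   [ESP] = CX, [ESP+2] = BX
        MOV     AX,[ESP+2]      ; AX = the pushed BX, and BP is never touched
        ADD     SP,4            ; throw both words away again
```

An ESP-based address defaults to SS just as a BP-based one does, so no segment override is needed.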
One of the notable deficiencies of the x86 architecture is that a
MOV does not affect the flags. Having it set them would be the most natural
state of affairs, since we would have an instant test for zero/sign.
One of the ways around:-
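A minimal sketch of one such way, with AX standing in for regX. The variable count and the label is_zero are made up for illustration:-

```asm
        XOR     AX,AX           ; regX = 0
        ADD     AX,[count]      ; AX = [count], and ZF/SF now describe it
        JZ      is_zero         ; no CMP or TEST needed
```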
Where regX was=0, adding the data is equivalent to a MOV that sets
the flags for us.
Do you like writing "cryptic" code ? :-
So what's the point other than to make your code "unreadable" ?
One of the keys to understanding assembly language is the flags.
Efficient use of the flags is also the way to write very compact code.
I remember the major block I had for grasping assembly was the conditional
branch, in particular:-
(it was actually BEQ, the equivalent 6502 instruction, but the principle
is the same.)
Here I remember asking the question, yes but jump if equal WHAT??!
The fact is that most people seem to view CMP and JE/JNE as inseparable
partners and they are not. This is machine code; it is NOT a language,
all we have is the machine state.
There is a synonym in x86 for JE called JZ, also its partner JNE can
be called JNZ. I always use these synonyms for the basic reason that
the name more accurately describes what the instruction actually does
(as opposed to what it is generally used for). JZ => Jump if Zero, the
JZ instruction branches iff ZF=1, there is no EQUAL !!
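A small sketch of a JZ with no CMP anywhere near it. The label name is the one used in the text; the value loaded by the MOV is an arbitrary assumption:-

```asm
        DEC     AX              ; ZF = 1 only if AX was 1 beforehand
        MOV     AX,1234h        ; MOV leaves ZF exactly as DEC set it
        JZ      gohereifaxwas1  ; so this still tests the DEC
```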
will branch to gohereifaxwas1 if AX was equal 1 at the start, since
the DEC instruction will set the zero flag if it results in a 0, and
the MOV doesn't affect the flags (many of the intel instructions don't).
When you have arrays where the elements can only take two possible
values it is best to use bits. The x86 from 386+ has a very nice bit test
instruction which allows you to address individual bits without messing
about. God knows why but a lot of ASM programmers are still programming
8086 compatible code; didn't someone tell them the 286 is dead?
        BT [mem],AX     Tests the AX'th bit of a bit array that starts at mem.
You have to use a 16/32-bit register, but otherwise this instruction is
extremely useful; it copies the state of the bit into the carry flag.
The others, BTC, BTS and BTR, also complement, set, and reset the respective
bit (AFTER getting it into the CF!).
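As a rough illustration, here is a test-and-set on a bit array in one instruction. The array used, its size, the bit number and the label was_taken are all hypothetical:-

```asm
used    DW      16 DUP (0)      ; data: a 256-entry bit array, all clear

        MOV     AX,37           ; code: the entry number we want
        BTS     [used],AX       ; CF = old state of bit 37, then bit 37 = 1
        JC      was_taken       ; CF set means somebody got there first
```

The processor works out which word of the array holds the bit, so there is no shifting or masking to write by hand.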
Do you know how many C instructions are needed to perform this
Although documented as being useful for endian conversion, BSWAP
also allows the programmer to make the hi-word of a 32-bit
register more accessible.
before 1 2 3 4
after 4 3 2 1
Of course you get all your words al-reverso, but if you were dealing
with 4 bytes of data you now have the other 2 in AL and AH.
For example if you are copying DS:SI->ES:DI but each byte in the target
you want to be 0FFh if the source was non-zero and 000h otherwise, it is
best to do it in dwords:-
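One possible sketch of such a loop, assuming CX holds the number of dwords, the direction flag is clear, and the CPU is a 486 or later (BSWAP does not exist on the 386). The NEG/SBB pair is just my choice of byte test and the label is made up:-

```asm
next:   LODSD                   ; EAX = 4 source bytes from DS:SI
        NEG     AL              ; CF = 1 if AL was non-zero
        SBB     AL,AL           ; AL = 0FFh if CF, else 000h
        NEG     AH
        SBB     AH,AH           ; same again for the second byte
        BSWAP   EAX             ; bytes 3 and 2 are now in AL and AH
        NEG     AL
        SBB     AL,AL
        NEG     AH
        SBB     AH,AH
        BSWAP   EAX             ; put the four bytes back in order
        STOSD                   ; store the 4 result bytes to ES:DI
        LOOP    next
```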
Not the MOST efficient way, granted, but you can see my point. For
further examples of BSWAP in action take a look at the graphic primitives
in some decent demo source.
SIGN EXTENSION ENCODING
The instruction family CMP/SUB/OR/XOR/AND/ADD/SBB/ADC have an encoding
(opcode 83h) where the immediate 8-bit source operand is sign extended to
the destination size. This can be used to set a word/dword memory location
to 0 or -1 using fewer bytes than the equivalent MOV (xx stands for the
ModR/M and address bytes; in 16-bit code the DWORD forms also need the 066h
prefix) :-
sets memory to -1
        83 xx FF                OR  WORD PTR [mem],-1
        83 xx FF                OR  DWORD PTR [mem],-1
        C7 xx FF FF             MOV WORD PTR [mem],-1
        C7 xx FF FF FF FF       MOV DWORD PTR [mem],-1
sets memory to 0
        83 xx 00                AND WORD PTR [mem],0
        83 xx 00                AND DWORD PTR [mem],0
        C7 xx 00 00             MOV WORD PTR [mem],0
        C7 xx 00 00 00 00       MOV DWORD PTR [mem],0
LAHF/SAHF are under-used instructions which offer a wonderful service.
LAHF copies the least significant byte of EFLAGS to the AH register,
likewise SAHF copies the AH register to the least significant byte of
EFLAGS.
The least significant byte of EFLAGS is:-
        b7  b6  b5  b4  b3  b2  b1  b0
        SF  ZF  x   AF  x   PF  x   CF
So you can use this to save all of the general-purpose flags minus OF.
This can be used instead of PUSHF/POPF in a lot of cases; you don't save
any bytes but your code will be quicker as you are not accessing memory.
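A short sketch of the idea, with arbitrary registers and a made-up label. It assumes AH is free and that nothing between the LAHF and the SAHF touches it:-

```asm
        CMP     CX,DX           ; the result we want to keep
        LAHF                    ; AH = SF ZF x AF x PF x CF
        ADD     BX,SI           ; anything that trashes the flags but not AH
        SAHF                    ; SF, ZF, AF, PF and CF are back
        JB      cx_below_dx     ; acts on the CMP, not on the ADD
```

Since OF is not restored, stick to branches that do not look at it; the signed ones (JL, JG and friends) would be testing the OF left by the ADD.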
There is a special encoding for INC/DEC that only takes 1 byte. This
is INC/DEC reg16 (or INC/DEC reg32 in 32-bit mode). There are many places
where you can use this. Note that, as far as AL is concerned, INC AL
is the same as INC AX (*).
So if we don't care about AH, we can substitute INC AX for INC AL, and
save a byte.
(*) (well, okay not the flags but you get my point)
The same thing works for DEC: as far as AL is concerned, DEC AL
is the same as DEC AX.
So if we don't care about AH, we can substitute DEC AX for DEC AL, and
save another byte.
If you know that AL will never = 0FFh, then INC AX is doing exactly the
same operation as INC AL, AH included.
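The encodings make the saving visible. A short sketch for 16-bit code, with the standard opcode bytes in the comments:-

```asm
        INC     AL              ; FE C0    2 bytes
        INC     AX              ; 40       1 byte, carries into AH when AL=0FFh
        DEC     AL              ; FE C8    2 bytes
        DEC     AX              ; 48       1 byte, borrows from AH when AL=0
```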
Running out of registers? You are probably just not using them
So we have EAX,EBX,ECX,EDX,ESI,EDI,EBP
Yes, 32-bit. I only rarely even see people use these in 16-bit code;
don't know why, because they are still there. Okay, so you waste a byte with
that operand prefix, true. 7*32 bits, that's 224 bits, almost enough to
write a tiny program!! Don't believe me? Think of a bit as a matchbox
which can have a pebble in it or not; just imagine how much state information
you have with 224 of these. Could you fit them all on the kitchen table?
It's really amazing what you can do with just 3 registers, let alone,
how many?? XCHG comes in very handy, you know.
Oh! Do watch out!
I didn't warn you; when an interrupt occurs in real mode the processor
itself only saves FLAGS, CS and IP to the stack, and a handler written as
16-bit code will typically preserve only the 16-bit register set. So if
your interrupt routine uses any 32-bit registers it must save/restore them
itself. Or you could catch a bug like Windoze did.
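A bare-bones sketch of a handler that looks after its own 32-bit registers. The label my_int and the choice of EAX/EDX are assumptions for illustration:-

```asm
my_int: PUSH    EAX             ; the interrupted code may be using all 32 bits
        PUSH    EDX
        ; ... handler body that uses EAX and EDX ...
        POP     EDX             ; restore in reverse order
        POP     EAX
        IRET
```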
JUST PLAIN BAD ?
I am a horror really; do you know that I once used SP as an extra
register? Naughty, naughty... Of course when you get an interrupt, SPLAT!,
all that stack space used by the routine is actually written over your
most important data structures.
Using SP/ESP is really fine though if and only if you don't use the
stack. So NO interrupts. Even the hottest hackers start to cringe if
they have to get this dirty, it's just PLAIN INELEGANT, however it is
important to be aware of possibilities even if you are sure you are
not going to use them.
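For the record, a sketch of how it might be done, assuming the code segment is writable (real mode) and using a made-up save slot called saved_sp. CLI only holds off maskable interrupts; an NMI or an exception will still use whatever SP happens to contain:-

```asm
        CLI                     ; no maskable interrupts from here
        MOV     CS:[saved_sp],SP
        ; ... inner loop uses SP as a general register:
        ;     no PUSH, POP, CALL, RET or INT in here ...
        MOV     SP,CS:[saved_sp]
        STI                     ; the stack is usable again

saved_sp DW     ?               ; data: save slot kept in the code segment
```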
SO ARE THERE REALLY ANY EXTRA REGISTERS ?
Oh yes, right in front of your eyes. DS, ES, FS, GS. 4 lovely 16-bit
chunks of memory right in the processor core. Still, I'm not absolutely
sure anymore whether it's worth it to use these. The way the processors
are being built these days it would probably have been (a lot) quicker
to use the stack frame....
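A tiny sketch, assuming real mode (where any 16-bit value may be loaded into a segment register) and assuming nothing else in your program relies on FS:-

```asm
        MOV     FS,AX           ; park a 16-bit value in FS
        ; ... AX is free for something else ...
        MOV     AX,FS           ; and fetch it back
```

In protected mode the same MOV FS,AX would be treated as a selector load and would fault on an arbitrary value, so this trick stays in real mode.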
Okay so you've run out of registers in that inner loop. You're trying
to blit your super-rotating-warping sprites at the speed of light and
it's no good if we get all those L1 cache accesses (which do take TIME).
Other places? Well, how many of these registers you need are going to be
constant for the period of the loop? I bet you there are at least a few.
Here is where we get off:-
say this is part of your loop, but you know that CX is never going to
change. What a waste !!
        MOV     WORD PTR CS:[k1],konstant
        JMP     SHORT $+2
k1      EQU     $+2
You just saved a register. The JMP SHORT is a just-in-case: if the
code at offset k1 in the loop is already in the prefetch queue when it is
modified, you could end up with the processor modifying the memory but not
the prefetch queue, so the 1st time around the loop executes improperly.
This doesn't happen from the Pentium onwards since the Pentium flushes the prefetch
queue if you write to it (oh, all those debugger traps that are now
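Put together, the whole trick might look something like the sketch below. The ADD BX instruction, the loop counter in DX and the label inner are my assumptions; only k1 and konstant come from the listing above:-

```asm
        MOV     WORD PTR CS:[k1],konstant ; patch the operand before the loop
        JMP     SHORT $+2               ; empty the prefetch queue
inner:
k1      EQU     $+2                     ; the imm16 sits 2 bytes into the ADD
        ADD     BX,1234h                ; 81 C3 34 12, the 1234h gets overwritten
        ; ... rest of the loop body ...
        DEC     DX
        JNZ     inner
```

The placeholder operand must not fit in a signed byte. With something like 0 or -1 the assembler is free to pick the short sign-extended encoding, and then the patch lands on the wrong bytes.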
LEA is useful for so many things, I find it quaint now that on first
encountering it I thought all it was was a MOV instruction.
LEA allows you to add up no less than 3 terms in a single instruction:
a base register, an index register and a displacement. With 32-bit
addressing the index can be scaled by 1,2,4,8, allowing multiplication.
There are two address formats with x86 (16-bit and 32-bit); you can use
BOTH regardless of the code size.
If the address is 32-bit and the operand size is only 16-bit, the
effective address gets truncated into the destination. If the address
is 16-bit and the operand size is 32-bit then the effective address gets
zero-extended into the destination. This latter point allows you to get
a MOVZX instruction on immediate values, a la:-
which is equivalent to:-
only the LEA takes a byte less, as the data is represented as a
16-bit displacement instead of a 32-bit immediate.
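A sketch of the pair with their encodings in 16-bit code. The value 1234h is arbitrary, and the exact operand syntax accepted for the LEA varies between assemblers:-

```asm
        LEA     EAX,[1234h]     ; 66 8D 06 34 12       5 bytes
        MOV     EAX,1234h       ; 66 B8 34 12 00 00    6 bytes
```

Both leave EAX = 00001234h.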
LEA can give you various multiples of a register:-
        LEA     EAX,[EAX*2]             ; *2
        LEA     EAX,[EAX*2+EAX]         ; *3
        LEA     EAX,[EAX*4]             ; *4
        LEA     EAX,[EAX*4+EAX]         ; *5
        LEA     EAX,[EAX*8]             ; *8
        LEA     EAX,[EAX*8+EAX]         ; *9
Of course the source and destination don't have to be the same so
you can get an extra MOV out of it. In addition you have the displacement
factor, so instead of:-
You can do:
Pretty impressive !
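As an illustration of the copy, multiply and add rolled into one, here is EBX = EAX*5 + 10 done both ways. The registers and the constant are arbitrary choices of mine:-

```asm
; the long way
        MOV     EBX,EAX
        SHL     EBX,2           ; EBX = EAX*4
        ADD     EBX,EAX         ; EBX = EAX*5
        ADD     EBX,10          ; EBX = EAX*5 + 10

; the LEA way
        LEA     EBX,[EAX*4+EAX+10]
```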
LEA doesn't affect the flags, so if you need to add without affecting
flags here is your instruction.
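A classic place for this is a multi-word add, where the carry has to survive the pointer arithmetic. The sketch below assumes CX holds the number of words, DS:SI points at the source and DS:DI at the destination; the label is made up:-

```asm
        CLC                     ; no carry into the first word
next:   MOV     AX,[SI]
        ADC     [DI],AX         ; CF carries on into the next word
        LEA     SI,[SI+2]       ; ADD SI,2 would destroy CF here
        LEA     DI,[DI+2]
        LOOP    next            ; LOOP leaves the flags alone too
```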
As an aside, it is at this point where MASM becomes rather embarrassed.
It will not assemble the following instruction:-
Instead it requires you to put:-
        LEA EAX,DWORD PTR DS:[01234h]
Which makes your code look right up the garden path (what the hell
do the DWORD and DS: have to do with this instruction??).
But then again MASM specializes in bulk and gimmicks rather than
precision in functionality.