Pentium Code Optimization using U-pipe V-pipe

A cross reference by instructions

(by quantasm)

A reference for writing faster assembly code on the Pentium architecture. It lists instruction replacements that take advantage of parallel execution (pairing) in the U-pipe and V-pipe to save cycles.
  Pentium
  Optimization Cross-Reference by Instruction
                    

The following is a list of optimizations that may come in handy.
Each one is listed alphabetically (more or less) by the instruction
in the first column.

The second column lists the CPU or CPUs that the optimization
applies to; alternatively it may be noted as applicable to 16-bit
code or 32-bit code.

The third column contains one or more replacement code sequences
that are either faster or smaller (sometimes both) than the
instruction in the first column. For some obscure optimizations,
the action of the first-column instruction is explained instead.

The fourth column contains a description and/or examples.

                           Replacement
Instruction     CPUs       or Action            Description / Notes
-------------------------------------------------------------------------------

aad (imm8)      all        AL = AL+(AH*imm8)    If imm8 is blank uses 10.
                           AH = 0               AAD is almost always slower,
                                                but only 2 bytes long

aam (imm8)      all        AH = AL/imm8         Same as AAD
                           AL = AL MOD imm8

add             16-bit     lea reg, [reg+reg+disp]

                                               Use LEA to add
                                               base + index + displacement
                                               Also preserves flags;
                                               for example:

                                                 add bx, 4

                                               can be replaced by:

                                                 lea  bx, [bx+4]

                                               when the flags must not
                                               be changed

add             32-bit     lea reg, [reg+reg*scale+disp]

                                               Use LEA to add
                                               base + scaled index + disp
                                               Also preserves flags.
                                               (See previous example)
                                               The 32-bit form of LEA
                                               is much more powerful
                                               than the 16-bit version
                                               because of the scaling
                                               and the fact that almost
                                               all of the 8 General purpose
                                               registers can be used
                                               as base and index registers

and reg, reg    Pent       test reg, reg       Use TEST instead of AND
                                               on the Pentium because
                                               fewer register conflicts
                                               will result in better pairing

bswap           Pent       ror eax, 16         Pairs in U pipe, BSWAP
                                               doesn't pair.
                                               disadvantage: modifies flags
                                               (Not a direct replacement:
                                               ROR EAX, 16 swaps the two
                                               16-bit halves, it does not
                                               reverse all four bytes)

call dest1      286+       push offset dest2   When CALL is followed by
jmp  dest2                 jmp  dest1          a JMP, change the return
                                               address to the JMP destination

call dest1      all        jmp  dest1          When a CALL is followed by a
ret                                            RET, the CALL can be replaced
                                               by a JMP

cbw             386+       mov ah, 0           When you know AL < 128
                                               use MOV AH, 0 for speed.
                                               But use CBW for smaller
                                               code size

cdq             486+       xor edx, edx        When you know EAX is positive
                                               Faster, better pairing;
                                               disadvantage: modifies flags

                Pent       mov edx, eax        When EAX value could be
                           sar edx, 31         positive or negative
                                               because of better pairing

cmp mem, reg    286        cmp reg, mem        reg, mem is 1 cycle faster

cmp reg, mem    386        cmp mem, reg        mem, reg is 1 cycle faster

dec reg16                  lea reg16, [reg16 - 1]  Use to preserve flags
                                                   for BX, BP, DI, SI

dec reg32                  lea reg32, [reg32 - 1]  Use to preserve flags
                                                   for EAX, EBX, ECX, EDX
                                                   EDI, ESI, EBP

div <op>        8088       shr accum, 1        When <op> resolves to 2, use
                                               shift for division
                                               (use CL for 4, 8, etc)

div <op>        186+       shr accum, n        When <op> resolves to a power
                                               of 2 use shifts for division

enter imm16, 0  286+       push bp             ENTER is always slower
                           mov  bp, sp         and 4 bytes in length
                           sub  sp, imm16      if imm16 = 0 then push/mov
                                               is smaller

                386+       push ebp
                32-bit     mov  ebp, esp
                           sub  esp, imm16

inc reg16                  lea reg16, [reg16 + 1]  Use to preserve flags
                                                   for BX, BP, DI, SI

inc reg32                  lea reg32, [reg32 + 1]  Use to preserve flags
                                                   for EAX, EBX, ECX, EDX
                                                   EDI, ESI, EBP

jcxz <dest>:    486+       test cx, cx           JCXZ is faster and
                           je   <dest>:          smaller on 8088-286.
                                                 On the 386 it is
                                                 about the same speed

                486+       test ecx, ecx         Never use JCXZ on 486
                           je   <dest>:          or Pentium except for
                                                 compactness

lea reg, mem   8088-286    mov reg, OFFSET mem   MOV reg, imm is faster
                                                 on 8088-286. On the 386+
                                                 they are the same
Note: There are many uses for LEA, see: add, inc, dec, mov, mul

leave           486+      mov sp, bp           LEAVE is only 1 byte
                          pop bp               long and is faster
                                               on the 186-386. The
                          mov esp, ebp         MOV/POP is much faster
                          pop ebp              on 486 and Pentium

lodsb           486+      mov al, [si]         LODS is only 1 byte long
                          inc si               and is faster on 8088-386,
                                               much slower on the 486.
                                               On the Pentium the MOV/INC
                                               or MOV/ADD instructions
                                               pair, taking only 1 cycle

lodsw           486+      mov ax, [si]         see lodsb
                          add si, 2

lodsd           486+      mov eax, [esi]       see lodsb
                          add esi, 4

loop <dest>:    386+      dec cx               LOOP is faster and
                          jnz <dest>:          smaller on 8088-286.
                                               on 386+ DEC/JNZ is
loopd <dest>:             dec ecx              much faster. On the Pentium
                          jnz <dest>:          the DEC/JNZ instructions
                                               pair taking only 1 cycle

loopXX <dest>:  486+      je  $+5              The 3 replacement instructions
                          dec cx               (XX = e, ne, z or nz) are much
                          jnz <dest>:          faster on the 486+.
                                               LOOPxx is smaller and
                                               faster on 8088-286.
                                               JE as shown suits LOOPNE/LOOPNZ;
                                               use JNE for LOOPE/LOOPZ

loopdXX <dest>: 486+      je  $+5              The speed is about the
                          dec ecx              same on the 386
                          jnz <dest>:

mov reg2, reg1  286+      lea reg2, [reg1+n]   LEA is faster, smaller and
                                               preserves flags. This is a
inc/dec/add/sub reg2                           way to do a MOV and ADD/SUB
                                               of a constant

mov acc, reg    all        xchg acc, reg       Use XCHG for smaller code
                                               when the final value of one
                                               of the registers can be ignored.
                                               Note that acc = AL, AX or EAX.

mov mem, 1      Pent      lea bx, mem          Displacement/immediate does
                          mov [bx], 1          not pair. LEA/MOV can be used
                                               if other code can be placed
                                               in between to prevent AGI's.
                          mov ax, 1            MOV/MOV may be easier to pair
                          mov mem, ax

mov [bx+2], 1   Pent      mov ax, 1            Better pairing because
                          mov [bx+2], ax       displacement/immediate
                                               instructions do not pair

                          lea bx, [bx+2]
                          mov [bx], 1

movsb           486+      mov al, [si]         MOVS is faster and
                          inc si               smaller to move a single
                          mov [di], al         byte, word or dword
                          inc di               on the 8088-386.
                                               On the 486+ the MOV/INC
                                               method is faster
                                               NOTE: REP MOVS is always
                                               faster to move a large block

movsw           486+      mov ax, [si]         see MOVSB
                          add si, 2
                          mov [di], ax
                          add di, 2

movsd           486+      mov eax, [esi]       see MOVSB
                          add esi, 4
                          mov [edi], eax
                          add edi, 4

movzx r16, rm8  486+      xor bx, bx           MOVZX is faster and
                          mov bl, al           smaller on the 386.
                                               On the 486+ XOR/MOV
movzx r32, rm8  486+      xor ebx, ebx         is faster. Possible
                          mov bl, al           pairing on the Pentium.
                                               (source can be reg or mem)
movzx r32, rm16 486+      xor ebx, ebx         disadvantage: modifies flags
                          mov bx, ax

mul n           8088+     shl ax, cl           Use shifts or ADDs instead of
                                               multiply when n is a power of 2

mul n           Pent      add ax, ax           ADD is better than single
                                               shift because it pairs better

mul             32-bit    lea                  Use LEA to multiply by
                                               2, 3, 4, 5, 8, 9
                                               (a single LEA cannot do 7)
                          lea eax, [eax+eax*4] (ex: multiply EAX * 5)
                                               LEA is better than SHL on the
                                               Pentium because it pairs in
                                               both pipes, SHL pairs only in
                                               the U pipe

or reg, reg     Pent      test reg, reg        Better pairing because
                                               OR writes to register.
                                               (This is for src = dest)

pop mem         486+      pop reg              Faster on 486+
                          mov mem, reg         Better pairing on Pentium

push mem        486+      mov  reg, mem        Faster on 486
                          push reg             Better pairing on Pentium

pushf           486+      rcr reg, 1           To save only the carry flag
                                               use a rotate (RCR or RCL)
                             or                into a register. RCR and RCL
                                               are pairable (U pipe only)
                          rcl reg, 1           and take 1 cycle. PUSHF is
                                               slow and not pairable.

popf            486+       rcl reg, 1          To restore only the carry flag.
                                               See PUSHF.
                             or

                          rcr reg, 1

rep scasb       Pent       loop1:              REP SCAS is faster and
                           mov al, [di]        smaller on 8088-486.
                           inc di              Expanded code is faster
                           cmp al, reg2        on Pentium due to pairing
                           je  exit
                           dec cx
                           jnz loop1
                           exit:

shl reg, 1      Pent       add reg, reg        ADD pairs better. SHL
                                               only pairs in the U pipe

stosb           486+       mov [di], al        STOS is faster and smaller
                           inc di              on the 8088-286, and the same
                                               speed on the 386. On the 486+
stosw           486+       mov [di], ax        the MOV/INC is slightly
                           add di, 2           faster

stosd           486+       mov [edi], eax      REP STOS is faster on 8088-386.
                           add edi, 4          MOV/INC or MOV/ADD is faster
                                               on the 486+
                                               Note: use LEA SI, [SI+n]
                                               to advance the pointer without
                                               changing the flags.

xchg            all                            Use xchg acc, reg to do a
                                               1 byte MOV when one register
                                               can be ignored.

xchg reg1, reg2 Pent       push reg1            pushes and pops are 1 cycle
                           push reg2            faster on Pentium due to
                           pop  reg1            pairing;
                           pop  reg2            disadvantage: uses stack

                Pent       mov  reg3, reg1      Faster and better pairing
                           mov  reg1, reg2      if reg3 is available.
                           mov  reg2, reg3

xlatb           486+       mov bh, 0            XLAT is faster and smaller
                           mov bl, al           on 8088-386. MOV's are faster
                           mov al, [bx]         on 486+. Best to rearrange
                                                instructions to prevent AGI's
xlatb           486+       xor ebx, ebx         and get pairing on Pentium
                           mov bl, al           Force high part of BX/EBX
                           mov al, [ebx]        to zero outside of loop;
                                                disadvantage: modifies flags
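
Several rows of the table replace a multiply, shift or divide with a cheaper instruction that is meant to compute the same value. As a rough cross-check, the following C sketch models the arithmetic of a few of those replacements. It says nothing about cycle counts or pairing. C is used only as a convenient model: the function names are hypothetical and do not come from the table, and the wraparound of uint32_t is assumed to stand in for a 32-bit register.

```c
#include <assert.h>
#include <stdint.h>

/* Models "lea eax, [eax+eax*2]": multiply by 3. */
uint32_t lea_mul3(uint32_t x) { return x + (x << 1); }

/* Models "lea eax, [eax+eax*4]": multiply by 5 (the table's example). */
uint32_t lea_mul5(uint32_t x) { return x + (x << 2); }

/* Models "lea eax, [eax+eax*8]": multiply by 9. */
uint32_t lea_mul9(uint32_t x) { return x + (x << 3); }

/* Models "add reg, reg", the Pentium replacement for "shl reg, 1". */
uint32_t add_self(uint32_t x) { return x + x; }

/* Models "shr accum, n" replacing an unsigned DIV by 2^n (n < 32).
   Only the quotient is modelled; DIV also produces a remainder. */
uint32_t shr_div(uint32_t x, unsigned n) { return x >> n; }

/* Hypothetical helper: asserts that each model matches the plain
   multiply, shift or divide for one input value. */
void check_arith(uint32_t x)
{
    assert(lea_mul3(x) == (uint32_t)(x * 3u));
    assert(lea_mul5(x) == (uint32_t)(x * 5u));
    assert(lea_mul9(x) == (uint32_t)(x * 9u));
    assert(add_self(x) == (uint32_t)(x << 1));
    assert(shr_div(x, 3) == x / 8u);
}
```

Calling check_arith with a handful of values, including 0xFFFFFFFF to exercise the wraparound, is enough to confirm the identities. For instance, lea_mul5(7) returns 35 and shr_div(1000, 3) returns 125.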
  
Copyright Quantasm.
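
The CDQ, MOVZX and BSWAP rows rely on bit-level identities that are easy to get subtly wrong. The C sketch below models them so they can be checked by hand or by a compiler. It is an illustration under stated assumptions, not part of the original table: the function names are hypothetical, uint32_t stands in for a 32-bit register, and bswap_via_ror models a commonly used three-instruction sequence (xchg al, ah / ror eax, 16 / xchg al, ah) that the table itself does not list.

```c
#include <assert.h>
#include <stdint.h>

/* Models "mov edx, eax / sar edx, 31": EDX ends up as 0 or 0xFFFFFFFF,
   the same high dword that CDQ derives from the sign of EAX. */
uint32_t cdq_high(uint32_t eax)
{
    return (eax & 0x80000000u) ? 0xFFFFFFFFu : 0u;
}

/* Models "xor ebx, ebx / mov bl, al": the MOVZX r32, rm8 replacement. */
uint32_t movzx_byte(uint32_t eax) { return eax & 0xFFu; }

/* Models "ror eax, 16": exchanges the two 16-bit halves. */
uint32_t ror16(uint32_t x) { return (x >> 16) | (x << 16); }

/* Models BSWAP: reverses the order of all four bytes. */
uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Models "xchg al, ah": swaps the two low bytes, keeps the high half. */
uint32_t xchg_al_ah(uint32_t x)
{
    return (x & 0xFFFF0000u) | ((x & 0xFFu) << 8) | ((x >> 8) & 0xFFu);
}

/* Assumed sequence (not from the table): xchg al, ah / ror eax, 16 /
   xchg al, ah, which gives a full byte reversal. */
uint32_t bswap_via_ror(uint32_t x)
{
    return xchg_al_ah(ror16(xchg_al_ah(x)));
}
```

With the input 0x11223344, ror16 gives 0x33441122 while bswap32 and bswap_via_ror both give 0x44332211, which shows why a lone ROR EAX, 16 is not a drop-in substitute for BSWAP.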



