assembly programming

210nt22.pdf

Home >Computer Science homework help >assembly programming

Chapter 1

Syllabus

Catalog Description: Computer structure, machine representation of data, addressing and indexing, computation and control instructions, assembly language and assemblers; procedures (subroutines) and data segments, linkages and subroutine calling conventions, loaders; practical use of an assembly language for computer implementation of illustrative examples.

Course Goals

0 Knowledge of the basic structure of microcomputers - registers, mem- ory, addressing I/O devices, etc.

1 Knowledge of most non-privileged hardware instructions for the Ar- chitecture being studied.

2 Ability to write small programs in assembly language

3 Knowledge of computer representations of data, and how to do simple arithmetic in binary & hexadecimal, including conversions

4 Being able to implementing a moderately complicated algorithm in assembler, with emphasis on efficiency.

5 Knowledge of procedure calling conventions and interfacing with high- level languages.

Optional Text: Kip Irvine, Assembly Language for the IBM PC, Prentice Hall, 4th or 5th edition

Additional References: Intel and DOS API documentation as presented in Intel publications and online at www.x86.org; lecture notes (to be sup- plied as we go).

Prerequisites by Topic. Working knowledge of some programming lan- guage (102/103: C/C++); Minimal programming experience

Major Topics Covered in the Course:

1 Low-level and high-level languages; why learn assembler?

2 How does one study a new computer: the CPU, memory, addressing modes, operation modes.

3 History of the Intel family of microprocessors.

4-5 Registers; simple arithmetic instructions; byte order; Arithmetic and logical operations.

6 Implementing longer integer type support; carry and overflow.

7 Shifts, multiplication and division.

8 Memory layout.

9 Direct video memory access; discussion of the first project.

10 Assembler syntax; how to use the tools.

11-13 Conditional & unconditional jumps; loops; emulating high-level lan- guage constructions; Stack; call and return; procedures

14-15 String instructions: effcient memory-to-memory operations.

16 Interrupts overview: interrupt table; how do interrupts work; classif- cation.

17 Summary of the most important interrupts.

18-20 DOS interrupt; File I/O functions; file-copy program; discussion of the second project

21 Interrupt handlers; keyboard drivers; timer-driven processes; viruses and virus-protection software.

22 Debug interrupts; how do debuggers and profilers work.

23-24 (Optional).interfacing with high level languages; Protected mode fun- damentals

Grading The grading is based on two projects, midterm project is 49% and the final is 51%. Please note that the projects are individual, submitting projects that are similar to submissions of others and/or are essentially downloads from the Web would result in a fail.

Office Hours My hours this term for CSc 210 will be 3:45 ¶Ł 4:45 on Mondays.

Zoom links:

11am https://ccny.zoom.us/j/85378437821

2pm https://ccny.zoom.us/j/87625527827

https://ccny.zoom.us/j/87625527827

https://ccny.zoom.us/j/85378437821

Chapter 2

Preliminary material

: Why assembler? • Why take this class?

• Why program assembler?

• Why know assembler?

: NOTE: think Binary! Why binary? Binary numbers (WIKI) (brief answer: because this is easy to implement) Why hex? Hexadecimal numbers (WIKI) (brief answer: because it is much easier to work with shorter strings) What about DNA?

https://en.wikipedia.org/wiki/Hexadecimal

https://en.wikipedia.org/wiki/Binary_number

2.1 Introduction #1: looking at new hard- ware

• CPU, general purpose (arithmetic) registers

– How large?

– How many?

– Are they all the same?

– Modes?

• Memory Model

– Is all memory the same?

– Flat?

– Segmented?

– Paged?

• Other hardware (peripherals)

• OS

• Special features

2.2 Introduction #2: History Intel Processors Over the Years

The History Of Intel CPUs

-1971 before Intel

1971 4004

• Intention

• Name

• Usage

• What can you do with 4 bits?

1972 8008

- Doubling – what can you do with 8 bits

1974 8080

1975 8085

1975 Z80

1974 CP/M – Digital Research, Gary Kildall

1978

- 8086 – X86 architecture. 8 8bit registers, 8(+6) 16 bit registers. 1mb limit. 1mb mystery?

1979 8088 – cost cutting

1981 iAPX 432 – an attempted 32 bit processor

1982 80186 – minor improvements/corrections

1981 IBM PC

https://en.wikipedia.org/wiki/CP/M

https://www.tomshardware.com/picturestory/710-history-of-intel-cpus.html

https://www.businessnewsdaily.com/10817-slideshow-intel-processors-over-the-years.html

2.3 Introduction #3: Fundamentals Data types

1 bit

4 nibble

8 byte

16 word

32 dword, doubleword

64 qword, quadword

80 tenbyte

2.4 x86 CPU

Registers Overlap! Problem: Let AH = 2,AL = 3. What is AX? Solution:

00000010 00000011

AH=00000010b=02h

AL=00000011b=03h

AX=0000001000000011b = 0203h = 515d

Note: suffices b, h and d are part of the Assembly language syntax; d(ecimal) is the default. Assignment syntax, however, is different, it is only used for an illustration here.

Fast solution:

AX = 2*256+3

AX = (2<<8)+3

Problem: Let AX = 2020. What are AL and AH ?

Registers BX,CX, DX are divided similarly.

General purpose aka Arithmetic registers: Sequence A,B,C,D is an illusion, these letters stand for Accumulator,

Base, Count, Data. 8 8bit registers: AH,AL,BH,BL,CH,CL,DH,DL 8 16bit registers: AX,BX,CX,DX

and SI,DI,BP(?),SP(??) SP generally cannot be used for calculations, BP usually cannot be used

either. (32 bit to be described later)

IP – Instruction pointer points to the first byte of the current instruction. Code: B B B B B B B B B B B B B B B B B B

Code is essentially a one dimensional array of bytes (in C/C++ – un- signed char type).

IP initially is 0, after one instruction is executed it should be 2, then 5, then 6, ....

Simplified logic (one instruction)

byte code[MAXCODE];

byte opcode;

opcode=code[IP++];

switch (opcode) {

case 0x00: ...

case 0x01: ...

...

case 0xFF: ...

}

each subcase will read additional bytes if needed to complete reading of the instruction.

Simplified logic (full execution)

byte code[MAXCODE];

byte opcode;

while(true) {

opcode=code[IP++];

switch(opcode) {

case 0x00: ...

....

case 0xFF: ...

}

Why is this simplified? • CS is also used.

• how do we terminate?

• how do we change the executed sequence?

What do we do with this? Is switch efficient? Question: what would IP=k do (if such instruction exists).

FLAGS register Should be seen not as a single 16-bit register but as a collection of 16

1-bit registers. More important ones: ZF, SF, CF, DF Neither FLAGS nor the names above are keywords.

Segment registers : CS, DS, SS, ES – specify where segments (“parts”) of the program are located.

• CS Code Segment

• DS Data Segment

• SS Stack Segment

• ES Extra Segment

2.5 8086 registers – full list • AX Accumulator eXtended

• AL Accumulator Low

• AH Accumulator High

• BX Base eXtended

• BL Base Low

• BH Base High

• CX Count eXtended

• CL Count Low

• CH Count High

• DX Data eXtended

• DL Data Low

• DH Data High

• SI Source Index

• DI Destination Index

• BP Base Pointer

• SP Stack Pointer

• CS Code Segment

• DS Data Segment

• SS Stack Segment

• ES Extra Segment

• IP Instruction Pointer (not a keyword)

• Flags Flags (not a keyword)

2.6 General addressing scheme Three distinct ways to address memory:

• Absolute address : mem[offset] (flat model–generally cannot be done)

• Segmented address : mem[f(seg,offset)] (done by hardware). Usual notation: ssss:oooo (hex digits)

• Expressing segmented address in assembly syntax – to be covered later

The f(seg,offset) function is mode-dependent. In real mode, f(seg,offset)=seg*16+offset. This allows to build 20 bit numbers out of 16 bit quantities. Examples

0000:0000 =⇒ 00000

1234:5678 =⇒ 179B8

+ 12340

05678

--------

179B8

The mapping is not one-to-one! Different (seg,offset) pairs may point to the same address.

0000:0100 =⇒ 00100

0010:0000 =⇒ 00100 Puzzle

FFFF:FFFF =⇒ ????? (ref: A10 address line) ==

Code segment is effectively mem[f(CS,i)], Data segment is effectively mem[f(DS,i)]

Protected memory addressing function uses Segment Descriptor Table lookup. Fields include Base, Limit, Access Rights.

Implication: instructions Segment<-value are very costly in protected mode.

2.7 Back to History: Original IBM PC (1981) Distorted:

Timeline

IBM’s brand recognition, along with a massive marketing campaign, ignites the fast growth of the personal computer mar- ket with the announcement of its own personal computer (PC). The first IBM PC, formally known as the IBM Model 5150, was based on a 4.77 MHz Intel 8088 microprocessor and used Mi- crosofts MS-DOS operating system. The IBM PC revolutionized business computing by becoming the first PC to gain widespread adoption by industry. The IBM PC was widely copied (“cloned”) and led to the creation of a vast “ecosystem” of software, pe- ripherals, and other commodities for use with the platform.

Better: WIKIPEDIA article Additional link (on reaction): Orson Scott Card’s novel

https://en.wikipedia.org/wiki/Lost_Boys_(novel)

https://en.wikipedia.org/wiki/IBM_Personal_Computer

https://www.computerhistory.org/timeline/1981/

No OS ! Three options:

• CP/M-86 (Control program for Microcomputers), see also DR page

• UCSD p-System

• PC DOS/MS DOS, see also 86-DOS

See also: PL/M Introduction #2: History (cont)

1982 80186, 80188

1982-1991 80286

1985-2007 80386

80186 : almost not used in PC’s, many improvements in instructions (kept).

80286 : 16mb protected mode–promise not fullfilled. Real mode −→−→ Prot mode

XENIX

https://en.wikipedia.org/wiki/Xenix

https://en.wikipedia.org/wiki/Intel_80386

https://en.wikipedia.org/wiki/Intel_80286

https://en.wikipedia.org/wiki/Intel_80186

https://en.wikipedia.org/wiki/PL/M

https://en.wikipedia.org/wiki/86-DOS

https://en.wikipedia.org/wiki/IBM_PC_DOS

https://en.wikipedia.org/wiki/UCSD_Pascal

http://www.digitalresearch.biz/CPM.HTM

https://en.wikipedia.org/wiki/CP/M

80386 • 32 bit

• 2 additional modes

• misc enhancements (debugging)

Doubling of registers again EAX = xxxxxxxxxxxxxxxx ahahahah alalalal

Flags register becomes EFLAGS :

Additionally:

• Control Registers CR0..CR7 (CR0=MSW(Machine Status Word) on 80286)

• Test Registers TR0..TR7

• Debug Registers DR0..DR7

64 bit mode adds RAX,...

On paging Virtual memory allows to execute programs larger than physical mem-

ory. Generally cannot be controlled by the programmer, paging algorithms

are implemented by the OS Page replacement algorithms Application algorithms can be tailored for paging environment. Example:

#define N 1024

int x[N][N],y[N][N],z[N][N];

int i,j;

for (int i=0; i<N; i++)

for (int j=0; j<N; j++)

z[i][j]=x[i][j]+y[i][j];

#define N 1024

int x[N][N],y[N][N],z[N][N];

int i,j;

for (int i=0; i<N; i++)

for (int j=0; j<N; j++)

z[j][i]=x[j][i]+y[j][i];

Will the two programs run equally fast ?

https://www.geeksforgeeks.org/page-replacement-algorithms-in-operating-systems/

Assume 3 pages are available. (1 page is exactly a row of a matrix above.)

Two dimensional arrays are stored row by row. First program : 1024 swaps. Second program : 10242 swaps. Technical info

https://wiki.osdev.org/Paging

2.8 Back to History: 32 bit OS?)

OS/2 1987-2001

https://en.wikipedia.org/wiki/OS/2

80486 : 1989 8087 : 1980 80187 (for 80186), 80287 (for 80286), 80387(for 80386). Other coprocessors existed.

Stack design, 8 80-bit registers ST(0), ST(1),.. ST(7).

80486 = 80386 + 80387

Datatypes:

32bit single (float in C/C++)

64bit double

80bit extended (internal format)

Pentium : 1993- 1993 Pentium (P5), why not 80586? (80486.00+100.00=???)

https://en.wikipedia.org/wiki/Pentium

https://en.wikipedia.org/wiki/Intel_8087

https://en.wikipedia.org/wiki/Intel_80486

1995 Pentium Pro (P6), MMX addition

1997 Pentium II

1999 Pentium III

2000 Pentium 4

MMX:

• MultiMedia eXtension

• Multiple Math eXtension

• Matrix Math eXtension

Intel Core (from 2006)

https://en.wikipedia.org/wiki/Intel_Core

https://en.wikipedia.org/wiki/Pentium_4

https://en.wikipedia.org/wiki/Pentium_III

https://en.wikipedia.org/wiki/Pentium_II

https://en.wikipedia.org/wiki/Pentium_Pro

Chapter 3

Instructions

3.1 Overall structure of asm program

• Header – TBD

• Sequence of instructions

• Trailer – TBD

Instructions generally are written one per line (minor exceptions later) Instructions generally follow the following format:

[<label>:] <opcode> [<operands>] [;comment]

[<label>:] [;comment]

where

<label> – optional label (any identifier that is not a keyword or defined oth- erwise).

<opcode> – name of the instruction (keyword)

<operands> – comma-separated operands, if any; their number (0-3) depends on the opcode

;comment – any text, ignored up to the EOL.

Trivial example:

lab: ; this line does not do anything

Symbolic representation of instructions corresponds to particular se- quence of bytes which are actually executed.

3.2 The NOP instruction

NOP (do nothing)

Binary representation: one byte, hex value 90h. Execution: Before: bb bb bb bb bb bb bb bb bb

↑IP

90 bb bb bb bb bb bb

After: bb bb bb bb bb bb bb bb bb 90

↑IP

bb bb bb bb bb bb

IP is incremented by 1; no other register is changed

WHY have it? • delay?

• padding for sloppy compilers

• patching (code deletion)

• reserving space for patching(code addition)

3.3 The MOV instruction

MOV dst,src (copy src to dst)

Example:

MOV AL,BL

;

; before : AL=3 BL=7

; after : AL=7 BL=7

Example:

MOV DL,CH

MOV DL,DL

MOV AX,CX

MOV AX,SP

MOV SP,CX ; very dangerous

MOV EDI,EDI

MOV EDI,ESP

MOV AL,CX ; illegal

MOV EDI,CX ; illegal

MOV IP,AX ; illegal

MOV AX,CS ; ok, special case (see below)

MOV DS,AX ; ok, special case (see below)

MOV CS,DX ; special case, illegal

MOV DS,EDI ; illegal

MOV CR0,EAX ; priveleged

MOV DR0,EAX ; ok, special case (see below)

RULE #1: size of src and dst must match

Most instructions support only gp regis- ters

Argument types:

• (r)egister

• (m)emory

• (i)mmediate

• (s)pecial register

Argument size:

• (b)yte

• (w)ord

• (d)oubleword

• ...

MOV DL,CH ; brr instruction

General template for 2-arg instructions: r m i

r . . . m . . . i . . .

Move-specific template: r m i s

r . . . . m . . . . i . . . . s . . . .

Right now:

r m i r X . . m . . . i . . .

Examples:

MOV AL,[100] ; brm

MOV BX,[200] ; wrm

MOV EDI,[400] ; drm

MOV [100],AL ; bmr

MOV [200],BX ; wmr

MOV [400],EDI ; dmr

Thus

r m i r X X . m X . . i . . .

What does [#] really mean? Answer: bytes beginning with byte #. in

MOV AX,[100]

which byte goes where?

Examples:

MOV AL,1 ; bri

MOV DX,2 ; wri

MOV EDI,4 ; dri

r m i r X X X m X . . i . . .

Examples:

MOV AL,97 ;

MOV AL,61h ; all four lines are equivalent

MOV AL,01100001b

MOV AL,’a’ ;

...

MOV AL,1000 ; ???

No storing into immediates, this would be like

1=x;

in C. Thus:

r m i r X X X m X . . i × × ×

Important: MOV with immediate is a fundamentally different operation from the rr,rm, mr forms.

RULE #2: no memory-to-memory (2 exceptions later)

Thus:

r m i r X X X m X × ? i × × ×

MOV [100],1 ; should not compile

RULE #3: size must be known

Correct syntax:

MOV byte ptr [100],1

MOV word ptr [100],1

MOV dword ptr [100],1

MOV qword ptr [100],1 ; 64 bit only

MOV tbyte ptr [100],1 ; ???

What about

MOV [100],AL

MOV byte ptr [100],AL ; unneeded

MOV word ptr [100],AL ; will not compile

Final result: r m i

r X X X m X × X i × × ×

Full table (MOV only): r m i s

r X X X X m X × X X i × × × × s X X × ×

3.3.1 Examples

Here is how C/C++ assignments may be compiled:

char c1,c2; c1=c2;

-------------------

MOV AL,c2

MOV c1,AL

short s1,s2; s1=s2;

-------------------

MOV AX,s2

MOV s1,AX

int x,y; x=y;

-------------------

MOV EAX,y;

MOV x,EAX;

int x,y,z; x=y=z;

-------------------

MOV EAX,z;

MOV x,EAX;

MOV y,EAX;

int x; x=0;

-------------------

MOV x,0;

int x,y,z; x=y=z=0;

-------------------

MOV x,0

MOV y,0

MOV z,0

perhaps, a better implementation?

MOV EAX,0 ; could be even better

MOV x,EAX

MOV y,EAX

MOV z,EAX

Exercise: Exchange bytes in [100] and [101]

MOV AL,[100]

MOV AH,[101]

MOV [100],AH

MOV [101],AL

can this be done in fewer lines of code?

MOV AX,[100]

MOV [100],AH

MOV [101],AL

Note: Byte order matters.

3.3.2 Byte order

Consider:

MOV [100],AX

Does

LE,reversed AL go into [100] and AH into [101] or, instead:

BE,normal AH go into [100] and AL into [101]

More than you want to know on Endianness

LE,reversed : Intel, Dec

BE,normal : IBM mainframe, Motorola, Sun

Practical implications:

• it is important to know the endiness of the hardware and the data.

• it is important to be able to swap.

• it is important to be able determine the endiness. How?

Specific example of byte order importance:

short s=1;

FILE *f=fopen("try.dat","wb");

if (!f) { ... error handling ... }

fwrite(&s,1,sizeof(s),f);

fclose(f);

Should create a 2-byte file try.dat. Now,

https://en.wikipedia.org/wiki/Endianness

short s;

FILE *f=fopen("try.dat","rb");

if (!f) { ... error handling ... }

fread(&s,1,sizeof(s),f);

fclose(f);

cout << s;

should print the value of s – indeed 1. But: what will happen if we run the Writing program on an Intel comp,

move the data file to a Sun, and run the reading program there?

Exercise: Can a high-level program be written that determines the order of bytes?

3.4 The XCHG instruction

XCHG dst,src (exchange src with dst)

XCHG r m i r X X × m X × × i × × ×

Segment and other non-gp registers are not supported. The syntax and examples from MOV apply, except for non-use of non-gp

registers and immediates. Examples (which of the following are valid?)

XCHG AL,AH

XCHG AX,SP

XCHG EAX,EDI

XCHG AL,[400]

XCHG [400],AL ;same as above

XCHG AL,DI

XCHG DI,DS

XCHG EAX,7

XCHG [100],[101]

XCHG AX,AX ; nop?

XCHG DI,DI ; nop?

XCHG CL,CL ; nop?

Can a better version of byte swap program be now written?

Better:

MOV AX,[100]

XCHG AL,AH

MOV [100],AX

Yet better:

XCHG AX,[100]

XCHG AL,AH

XCHG [100],AX

Q: can a shorter program be written (perhaps with another instruc- tion)?

3.4.1 Binary encoding of XCHG

We only consider accumulator exchanges now. Instructions

XCHG AX,reg

are extra optimized in the intel architecture.

90h XCHG AX,AX

91h XCHG AX,CX

92h XCHG AX,DX

93h XCHG AX,BX

94h XCHG AX,SP

95h XCHG AX,BP

96h XCHG AX,SI

97h XCHG AX,DI

Q: Why the # of registers is a power of 2 ? A: Because this allows to represent registers as in a fixed number of bits.

16-bit register representation:

000b AX

001b CX

010b DX

011b BX

100b SP

101b BP

110b SI

111b DI

An emulator may use code like

unsigned short regs[8];

#define AX regs[0]

#define CX regs[1]

#define DX regs[2]

#define BX regs[3]

#define SP regs[4]

#define BP regs[5]

#define SI regs[6]

#define DI regs[7]

Notes:

• this is just an example!

• 8 bit registers have their own 3-bit keys

• 32 bit registers parallel 16 bit registers

• 64 bit registers use 4-bit keys

• The above code should define 8-bit regs properly (f.e. setting AX should set AL,AH too!

• The above code should be modified to support 32 bit registers

\texttt{XCHG AX,AX}

is NOP. General encoding scheme of XCHG (with accumulator): 1 0 0 1 0 r e g

This idea is used in other instructions. XCHG without accumulator uses a l8engthier encoding, with first byte 86h/87h.

XCHG encoding

https://c9x.me/x86/html/file_module_x86_id_328.html

NOTE: MOV has several different forms, including optimized forms for the accumulator.

Similar scheme is used for the segment registers:

00b ES

01b CS

10b SS

11b DS

3.5 The ADD instruction

ADD dst,src (dst += src)

(proper name should be increment by.) General 2-operand instruction layout applies:

ADD r m i r X X X m X × X i × × ×

Given that syntax of ADD is largely similar to MOV, the examples are sim- ilar:

ADD AX,BX

ADD EAX,ESP

ADD DL,CL

ADD AX,[100]

ADD [150],EAX

ADD AX,DS ; illegal

ADD AX,DL ; illegal

ADD [10],5 ; syntax error

ADD word ptr [10],5 ; fine

C example:

int x,y,z; x=y+z;

-----

MOV EAX,y

ADD EAX,z

MOV x,EAX

int x,y,z; x=x+y;

-----

MOV EAX,y

ADD x,EAX

int x,y,z; x=x+25;

-----

ADD x,25

(NOTE: size specification is not required if x is declared to be a double word)

Consider

ADD AL,AL

Generally, multiplication by 2 should not be done as multiplication (generally about 3x slower than addition). Writing

int x; x=2*x;

is wrong! One should use either addition or a shift (if available). (What is better depends on the situation and hardware).

Q: Should we replace multiplication by addition in :

int f(int);

int x; x=2*f(x);

More simple examples: Consider

ADD AL,0 ; nop ?

ADD AL,1 ; increment ?

ADD AL,-1 ; decrement ?

ADD AL,AL ; double

MOV AL,1 ; AL=1

ADD AL,AL ; AL=2

ADD AL,AL ; AL=?

MOV AL,1 ; AL=1 binary |00000001

ADD AL,AL ; AL=2 binary |00000010

ADD AL,AL ; AL=4 binary |00000100

ADD AL,AL ; AL=8 binary |00001000

ADD AL,AL ; AL=16 binary |00010000

ADD AL,AL ; AL=32 binary |00100000

ADD AL,AL ; AL=64 binary |01000000

ADD AL,AL ; AL=128 binary |10000000

ADD AL,AL ; AL=0 binary 1|00000000 << overflow

ADD AL,AL ; AL=0 binary 00000000

Is this an assembler problem ?

unsigned char c;

c=1;

printf("%d",c); c=c+c;

....

Note: if you like C++ and cout<<, make sure to cast!

Q: what would be the output if we use char rather than unsigned char?

Is this a size problem ? Try

MOV AX,1

ADD AX,AX

...

MOV EAX,1

ADD EAX,EAX

...

OR C/C++ versions.

Unlike MOV and XCHG, ADD is an arithmetic instruction: it sets flags. Warning: the discussion of the flags is slightly simplified, I’m not con-

sidering the OF. Thus there are slight differences between the behavior described and the actual behavior of the processor. This makes no differ- ence for most programs, but there are rare instances where this matters. In particular, I will consider JS and JL as equivalent, in reality they are not exactly the same.

ZF Zero Flag

SF Sign Flag

CF Carry Flag

OF Overflow Flag

; ZF SF CF

MOV AL,1 ; AL=1 binary |00000001 ? ? ?

ADD AL,AL ; AL=2 binary |00000010 0 0 0

ADD AL,AL ; AL=4 binary |00000100 0 0 0

ADD AL,AL ; AL=8 binary |00001000 0 0 0

ADD AL,AL ; AL=16 binary |00010000 0 0 0

ADD AL,AL ; AL=32 binary |00100000 0 0 0

ADD AL,AL ; AL=64 binary |01000000 0 0 0

ADD AL,AL ; AL=128 binary |10000000 0 1 0

ADD AL,AL ; AL=0 binary 1|00000000 1 0 1 << overflow

ADD AL,AL ; AL=0 binary 00000000 1 0 0

WARNING: This is slightly simplified (there is also OF) Flags can be used to

• implement conditionals (IF, WHILE,...)

• implement “long” arithmetic

• check for overflow

3.5.1 Overflow detection

unsigned int x,y,z;

....

x=y+z; // concern about overflow

unsigned int x,y,z;

....

y=0x90000000;

z=0x90000000;

x=y+z; // overflow will occur here, result will be incorrect.

can we check for it like this?

unsigned int x,y,z;

....

if (y+z>0xFFFFFFFF)

error("overflow");

x=y+z;

Correct way:

unsigned int x,y,z;

....

if (y>0xFFFFFFFF-z)

error("overflow");

x=y+z;

Exercise: what about signed types?

A: you will need to check both for “positive” overflow (adding two large positive number) and for the “negative;; overflow (adding two large nega- tive numbers).

In assembler, flags report overflow condition – no need for extra check- ing!

3.6 The SUB instruction

SUB dst,src (dst -= src)

(proper name should be decrement by.) General 2-operand instruction layout applies:

SUB r m i r X X X m X × X i × × ×

Given that syntax of SUB is identical to ADD, syntax examples are similar and omitted.

ADD AX,100

SUB AX,-100 ; same as above

;

ADD AX,-100

SUB AX,100 ; same as above

What do these instructions do?

ADD AX,0

SUB AX,0

What does this instruction do?

SUB EAX,EAX

Answer: most efficient way to zero up a register.

What is the difference between the two instructions below?

SUB EAX,EAX

MOV EAX,0

Answer: the former is more efficient; the latter is rarely used, only in the situations when flags must be preserved. (an example, involving an if, will be given later.)

Revising example we saw above, more efficient code:

int x,y,z; x=y=z=0;

-------------------

SUB EAX,EAX

MOV x,EAX

MOV y,EAX

MOV z,EAX

with SUB, Carry flag indicates borrowing.

3.7 The INC instruction

INC dst (dst++)

Do we write

ADD AX,1

ADD byte ptr [10],1

A: yes, we can, but usually we would use the optimized form General 1-operand instruction layout applies:

INC r m i X X ×

(Same format applies to three more instructions, explained later). Register form is optimized to one-byte encoding:

40h INC AX

41h INC CX

42h INC DX

43h INC BX

44h INC SP

45h INC BP

46h INC SI

47h INC DI

Other forms of INC are encoded in lengthier way beginning with 0FFh and 0FEh.

Warning: this encoding applies to BOTH 16 and 32 registers!

What is better?

INC AX

;or

ADD AX,2

A: former. But do not do this with memory arguments.

3.8 The DEC instruction

DEC dst (dst- -)

DEC r m i X X ×

Comments on INC above are applicable. Optimized form:

48h DEC AX

49h DEC CX

4Ah DEC DX

4Bh DEC BX

4Ch DEC SP

4Dh DEC BP

4Eh DEC SI

4Fh DEC DI

Other forms of DEC are encoded in lengthier way beginning with 0FFh and 0FEh.

3.9 The NEG instruction

NEG dst (dst=-dst)

NEG r m i X X ×

How to negate without using NEG?

NEG EAX

;is the same as

SUB EBX,EBX

SUB EBX,EAX

MOV EAX,EBX

Solve equation x = −x ?

MOV AL,x

NEG AL

if AL did not change, does this mean x is 0?

No, it is either 0 or 128! For short, the solutions are 0 and 215=8000h, for 32-bit they are ....

#include <stdlib.h>

#include <stdio.h>

int main() {

short s,s1;

s=0; s1=-s; if (s!=s1) printf("for s=%d, changes;\n",s);

else printf("for s=%d, does not change;\n",s);

s=1000;s1=-s; if (s!=s1) printf("for s=%d, changes;\n",s);

else printf("for s=%d, does not change;\n",s);

s=0x8000;s1=-s; if (s!=s1) printf("for s=%d, changes;\n",s);

else printf("for s=%d, does not change;\n",s);

return 0;

}

results in

for s=0, does not change;

for s=1000, changes;

for s=-32768, does not change;

3.10 The CMP instruction

CMP dst,src (dst-src)

General 2-operand instruction layout applies: CMP r m i r X X X m X × X i × × ×

Conditionals (if,while) are implemented in two stages:

• compute flags

• do (or not) the operation depending on a flag(or flags) – this is later.

Examples:

; if (x==0) ...

;

; compute ZF from value of x

; if (x!=0) ...

;

; compute ZF from value of x

; if (x<0) ...

;

; compute SF from value of x

; if (x<=0) ...

;

; compute SF and ZF from value of x

What about

; if (x==y) ...

;

to set the flags, we compute the difference

; if (x==y) ...

;

MOV EAX,x

SUB EAX,y

; this sets ZF for our use

; if (x<y) ...

;

MOV EAX,x

SUB EAX,y

; this sets SF for our use

; if (x<y) ...

;

MOV EAX,x

SUB EAX,y

; this sets SF and ZF for our use

Problem: this computation destroys value of x which is often needed. So, instead:

; if (x<y) ...

;

MOV EAX,x

CMP EAX,y

; this sets SF and ZF for our use

Flags are set exactly as for SUB, but EAX retains the value of x.

; if (x==5) ...

;

MOV EAX,x

SUB EAX,5 ; inefficient

;

SUB x,5 ; worse yet: destroys variable

;

CMP x,5 ; fine

3.11 Logical or Bitwise

What kind of AND operation should be implemented in hardware? (same question of course can be asked of OR etc but we will concentrate on AND)

• logical AND (in C, &&) ?

• bitwise AND (in C, &) ?

• both ?

Logical AND works with True and False concepts:

1 && 1 = 1

1 && 0 = 0

0 && 1 = 0

0 && 0 = 0

Q: what about 5 && 6 ?

(5 is first converted to 1 (true), 6 is converted to 1 (true), we compute 1&&1 and get result 1).

This is a multi-step operation! Bitwise AND: apply the AND operation to every bit (every column):

0101 == 5

& 0110 == 6

----

0100 == 4

Thus 5 & 6 = 4. Notice that on single bits (or multibit 0 and 1), the results of & and &&

are identical.

1 & 1 = 1

1 & 0 = 0

0 & 1 = 0

0 & 0 = 0

We summarize:

• On appropriate logical values, logical and bitwise AND are the same.

• Logical AND is a multistep operation – hard to implement in hard- ware.

• Bitwise AND can be done easily and efficiently (array of AND-gates)

• on non-standard input values, logical AND offers minimal advantages (how often do we care about 5&&6?)

• on non-standard input values, bitwise AND offers huge advantages:

– masks

– sets

– images

3.11.1 Bits and Masks

Most x86 2-operand instructions are encoded as follows:

o p c o d e d w m d r e g r - m .....

for example, for ADD, opcode is 000000, MOV, opcode is 100010, etc (Encoding for forms with accumulator and immediate are similar). The w field states if the operation is word or byte. Register reg=000

means AX if w==1 and AL if w==0. The d field states if the direction is from register (0) or to register(1). So decoding should retrieve these bytes.

currentbyte=code[ip++];

....

wordop=currentbyte & 1;

from_reg=(currentbyte & 2)!=0;

....

alternately one can use division and mod (yuck).

3.11.2 Checking for odd/even

Q: how do we check for a number being odd or even?

if (x is odd) { ... }

if (x % 2 == 1) { ... }

better:

if (x % 2) { ... }

much better: better:

if (x & 1) { ... }

Q: how do we check for a number is divisible by 4?

if ((x & 3)==0) { ... }

if (!(x & 3) { ... }

Q: how do we check for a number is divisible by 2k?

if ((x & 2k−1)==0) { ... }

3.11.3 Sets

How do we represent a set?

Consider {2,3,5,7,11,13} Idea #1: Representation as a link list:

2 ⇒ 3 ⇒ 5 ⇒ 7 ⇒ 11 ⇒ 13 ⇒ #

Idea #2: Representation as an array:

2 3 5 7 11 13

Idea #3: Representation as a bit array (set)

0 0 1 1 0 1 0 1 0 0 0 1 0 1 0 0

Total space : 2 bytes. Search time:

• list Linear

• array Linear or log (binary search)

• set Constant!

Exercise: write the formula

If A,B are similar sets, one can use & to compute the intersection:

unsigned char A[N];

unsigned char B[N];

unsigned char C[N];

for (int i=0;i<N;i++) C[i]=A[i]&B[i];

Set representation is not always the right one, consider {1,10000000}!.

3.11.4 Images

Consider images. Actual representation (b/w image): one dimensional bit array. What would this produce?

Conclusion: Bitwise AND is a much better choice for assembler implen- tation.

3.12 The AND instruction

AND dst,src (dst &= src)

AND r m i r X X X m X × X i × × ×

Syntax is the same as ADD,SUB,... Examples:

MOV AX,5

AND AX,6 ; AX is 4

AND AX,1 ; check evenness

AND AX,7 ; what does this do?

AND AX,0 ; what does this do?

AND AX,AX ; what does this do?

AND AX,CH ; what does this do?

3.13 The TEST instruction

TEST dst,src (dst & src)

TEST is non-descructive AND, cf. CMP/SUB. TEST r m i r X X X m X × X i × × ×

Syntax is the same as AND, ADD, SUB,... Encoding is less efficient than for other operations – TEST is less com-

mon. TEST is helpful when we want to extract different bits from the same

number.

3.14 The OR instruction

OR dst,src (dst |= src)

OR r m i r X X X m X × X i × × ×

Syntax is the same as ADD,SUB, AND... In C, bitwise or is coded as |. Hardware implements the bitwise form. Examples:

MOV AX,5

OR AX,6 ; AX is 7

0101 == 5

| 0110 == 6

----

0111 == 7

OR AX,0

(set flags – not as efficient as RR form)

OR AX,AX

(set flags – just as efficient as RR form of AND)

OR AX,1

(set last bit to 1; numerically round up to an odd number.)

OR AX,2

(set next to the last last bit to 1.)

OR AX,3

(set last two bits to 1)

For sets (and graphics) OR implements union.

3.15 The XOR instruction

XOR dst,src (dst ˆ= src)

OR r m i r X X X m X × X i × × ×

Syntax is the same as ADD,SUB, OR... Truth table for XOR:

1 XOR 1 = 0

1 XOR 0 = 1

0 XOR 1 = 1

0 XOR 0 = 0

XOR AX,AX

(clear register – just as efficient as RR form of SUB)

XOR AX,0

(only sets flags)

XOR AX,1

(toggle last bit)

XOR AX,7

(toggle last three bits)

XOR AX,0xFFFF

(toggle all bits) Involution property of XOR: ((d XOR k) XOR k) == (d XOR (k XOR k)) == (d XOR 0) == d Application #1: temporary lines:

(same idea can be used for the mouse pointer or inverting selected text).

Application #2: cypher: ((data XOR key) XOR key) == data

char data[N];

for (int i=0; i<N; i++)

data[i]=data[i] ^ key;

In pure form, variation of a substitution cipher:

char data[N];

char subst[256]; // contains permutation of ASCII chars

for (int i=0; i<N; i++)

data[i]=subst[data[i]];

Not secure, but good for puzzles! Cryptogram puzzles. With key chaining:

char data[N];

for (int i=0; i<N; i++) {

char c;

c=data[i];

data[i]=c ^ key;

key=f(key,c);

}

https://api.razzlepuzzles.com/cryptogram

nearly unbreakable; cf. the Type 1 font encryption episode. The unbreakable Adobe encryption was actually implemented with this

code:

unsigned short int r; // this is the key

unsigned short int c1 = 52845;

unsigned short int c2 = 22719;

unsigned char Encrypt(plain) unsigned char plain;

{unsigned char cipher;

cipher = (plain ^ (r>>8));

r = (cipher + r) * c1 + c2;

return cipher;

}

(Naturally, there are myriads of small changes that can be made to the code, beginning with change of the constants... and they would lead to equally strong security!)

Question: why is C not providing ^^ ?

3.16 The NOT instruction

NOT dst (dst= d̃st)

NOT r m i X X ×

Follows the format of INC,DEC, NEG. Example:

XOR AX,0xFFFFh

NOT AX ; same result as above ?

Warning: NOT may produce drastically different results depending on the size of the number!

#include <stdlib.h>

#include <stdio.h>

int main() {

unsigned short s; unsigned char c; unsigned int i;

c=1; c=~c; printf("\n negated byte %u ",c);

s=1; s=~s; printf("\n negated short %u ",s);

i=1; i=~i; printf("\n negated int %u ",i);

}

results in

negated byte 254

negated short 65534

negated int 4294967294

3.17 Shift and Rotate operations

This group of instructions allows to shift bits in a register in different ways. The format of all instructions in the group is the same; we will start by

considering the

3.17.1 SHL

SHL r/m,amount (SHift Left)

instruction. The first argument of the instruction must be a register or a memory location; the second specified the number of positions the bits of the first argument will be shifted. There are three ways the shift amount can be specified

1. as number 1

2. as register CL (no other register is allowed)

3. as integer number greater than one. We separate this case from the first one since the encoding is different and this format did not exist in the original 8086 (added with 80186).

Here is a simple example of the SHL execution:

MOV AL,11

SHL AL,1

to see what is the resulting value in target, let’s write AL in binary:

0 0 0 0 1 0 1 1

Shifting the bits left by one position would result in

0 0 0 1 0 1 1 0

where the leftmost bit (0) is moved out and a zero is written to the last position. Numerically the value is 22.

It is not an accident that 22 = 11 × 2; the effect of appending a zero to a binary number is multiplication by 2 (just like appending a zero to a decimal number is multiplication by 10). Unlike the “generic” multiplica- tion described in a later section, this is an efficient and cheap operation – moving bits is naturally cheaper than invoking the multiplier.

If the first bit of the number were 1, pushing it out of the register will result in “an overflow” and an incorrect result. For example, shifting 128 in an 8-bit register will produce 0. (Notice that the correct value of 256 is impossible on an 8-bit register.)

The first bit, incidentally is not lost – it is moved to the Carry flag; checking the carry flag therefore offers a way to detect an overflow and possibly correct the calculation.

CF= 0 0 0 0 1 0 1 1 0

The other two forms of SHL arguments allow to shift the register by more than one position. Consider

MOV AL,11

SHL AL,3

MOV AL,11

MOV CL,3

SHL AL,CL

Both snippets will do the same : shift AL by 3 positions to the left. Three leftmost bits are pushed out, three zeros are appended at the end, changing

CF= ? 0 0 0 0 1 0 1 1

CF= 0 0 1 0 1 1 0 0 0

It is an efficient way to multiply by 8 = 23,and in general left shift by n positions is equivalent to multiplication by 2n.

The bits pushed out cannot be all placed into the carry; two of them are irreversibly lost and only the last bit to be pushed out ends up in the carry flag. In the previous example the bit ending up in the carry is shown in red.

Unless an overflow occurs, SHL multiplication often works correctly even with negative numbers. For example, -1 (binary 11111111 becomes 11111110, which is indeed -2).

This instruction does not change either zero or sign flags. NOTE: shift instructions like SHL existed on many processors even be-

fore Intel and got incorporated into the C language ( << ). Thus, one can write in C either x=y+y; or x=y<<1; instead of slower x=y*2. Which of the two options is better? In most cases, about the same. The right choice also depends on the particular formula. For instance

y=f(x)*2; // function call

should not be even changed to y=f(x)+f(x); since this would result in a double call of a function! – definitely slower and possibly a bug, if the function has side effects. Using the shift here is certainly fine. On the other hand, some languages do not even offer shift operator making the addition is the only alternative to (slower) multiplication

3.17.2 SHR

SHR r/m,amount (SHift Right)

instruction has the same syntax as SHL but shifts the bits to the right. We again look at an example:

MOV AL,11

SHR AL,1

In binary, 11 is

CF= ? 0 0 0 0 1 0 1 1

shifting this pattern right results in the last bit (1) expelled from the register into the carry flag, the remaining bits moving right by one position and a zero written into the leftmost position, obtaining

CF= 1 0 0 0 0 0 1 0 1

(decimal 5). It is not an accident that 5 = 11/2, for unsigned (or nonnega- tive) numbers shifting right by 1 is equivalent to division by 2, and shifting right by n bits is division by 2n. (Analogy: with decimal numbers deleting the last digit is the same as division by 10). when shifting by more than one it is the last bit that shifted-out from the register that ends up in the carry.

One more example to consider :

MOV AL,255

SHR AL,1

the result of shifting

CF= ? 1 1 1 1 1 1 1 1

CF= 1 0 1 1 1 1 1 1 1

in terms of division we see 127 = 255/2, correct. However, in

MOV AL,-1

SHR AL,1

we see the same exact values (255 is -1), but obviously incorrect result: 127 = −1/2. In fact the results will be always incorrect when shifting a negative number: after the shift the sign bit (1) is replaced by 0!

NOTE: in C/C++ bitshift operators << and >> use SHL and SHR for un- signed (or signed but positive) numbers. Applying them to negative num- bers is supported by syntax but unpredictable, the result may or may not be correct, depending on the compiler (it is correct under BCC). C/C++ syn- tax allows negative values of shift, but the results are again unpredictable (under BCC, the result would be always 0).

Division of negative numbers via a shift is, however, possible in assem- bler with the

3.17.3 SAR

SAR r/m,amount (Shift Arithmetic Right)

instruction. SAR is nearly identical to SHR; the only difference is that SHR inserts zeros at the left, SAR duplicates the first bit. Thus number 255 (11111111)

CF= ? 1 1 1 1 1 1 1 1

would become

CF= 1 0 1 1 1 1 1 1 1

after a SAR (as seen above), but after SHR it is

CF= 1 1 1 1 1 1 1 1 1

where the first 1 is copied to the right but also left where it was. SAR produces correct results for division of signed/negative numbers: the calculation in the example above is division of -1 by 2, resulting in -1 – correct result assuming rounding to −∞.

NOTE: Java implements both SAR and SHR as >> and >>>.

3.17.4 SAL

For symmetry, Intel also has

SAL r/m,amount (Shift Arithmetic Left)

while SAL is named and even encoded differently, it has the same function- ality as SHL.

3.17.5 ROL

ROL r/m,amount (ROtate Left)

rotates bits leftward, the high bit(s) pushed out appear on low end.

MOV AL,79h ; AL is 01111001

ROL AL,1 ; AL is 11110010

ROL AL,1 ; AL is 11100101

MOV CL,2

ROL AL,CL ; AL is 10010111

ROL never loses any bits and does not use the Carry flag. Notice that

ROL AL,8

does not do anything, yet another equivalent of NOP. Likewise

ROL AX,16

does not change any registers.

Exercise: What does ROL AX,8 do?

3.17.6 ROR

ROR r/m,amount (ROtate Right)

is similar, except that the bits are rotated right. Notice that the two instruc- tions below are equivalent.

ROL AL,3

ROR AL,5

ROL and ROR are rarely used, albeit this example should be of interest:

ROL word ptr [100],8

3.17.7 RCL

RCL r/m,amount (Rotate Carry Left)

rotates the bits in register and the carry flag left. High bits being rotated out are moved to the carry while the carry flag is entered on the low end of the register.

MOV AL,79h ; AL is 01111001 CY=c (undefined)

RCL AL,1 ; AL is 1111001c CY=0

RCL AL,1 ; AL is 111001c0 CY=1

RCL AL,1 ; AL is 11001c01 CY=1

rotation by more than one can be seen as repeated rotations by 1, except done as a single instruction.

3.17.8 RCR

RCR r/m,amount (Rotate Carry Right)

is similar to RCL except for the direction of the rotation.

!!! The code below uses some assembler syntax that has not been cov- ered yet !!!

While RCL and RCR may appear strange but they have a number of us- ages. One of them is the ability to shift (that is multiply or divide by a power of two) numbers that are larger than the register size. Assume that x is a 100-byte (800-bit) number stored in memory locations [500] through [599]. We would like to multiply it by 2. Remember that the bytes are stored in reverse order.

SHL byte ptr DS:[500],1 ;high bit is moved to the carry

RCL byte ptr DS:[501],1 ;carry to 2nd byte, high bit to carry.

RCL byte ptr DS:[502],1 ;carry to 3rd byte, high bit to carry.

...

RCL byte ptr DS:[599],1

or, as a loop:

CLC

MOV BX,500

MOV CX,100

L: RCL byte ptr DS:[BX],1

INC BX

LOOP L

Exercise: How would you divide this 800-bit number by 2?

Exercise: Would signed and unsigned numbers use the same code?

to multiply by 2k, you need to shift by k positions. This must be done as 1-position shifts repeated k times since the carry can hold only one bit, therefore a double loop is required.

another application for these instructions is shifting a monochrome image by one or several pixels. The code is similar to the one used for shifting long numbers above.

first two forms of all eight shift/rotate instructions are encoded to- gether, using bytecodes D0 through D3. The “register” field in the 2nd byte of the encoding actually denotes the operation. The third form uses C0 and C1, in a similar fashion.

80386 and all newer processors have double-register-shift operations SHLD and SHRD that allows to write more efficient code; we do not describe them here.

3.18 Data conversion

The arithmetic instructions studied in the previous sections are sufficient to implement many formulas, but not all. One particular problem is mixing of variables of different sizes, as seen in the following C-language example

int x;

short y;

char z;

x = y+z;

If all three variables were declared the same way (for instance, all three int’s), we could have implemented the computation with already known instructions:

MOV EAX,y

ADD EAX,z

MOV x,EAX

but the size must match rule prevents us from mixing operands of different size. Thus, we are in need for data conversion instructions – ones that change the size of a variable without changing the value.

Before showing the available instructions, let’s understand the exact problem.

In the example above we would like to add byte (char) quantity z with word (short) quanity y. To do this, we need to convert z to also a word quantity that has the value equal to the original value of z, this word quan- tity can be added with y. (We would later need to convert the sum – a word quantity – to a doubleword, to be able to store it into x).

We actually have a way to do this conversion in some cases. For exam- ple, if we assume z to be equal to 1, the following assembler code would work:

MOV AL,z

SUB AH,AH

Bingo: by zeroing-up the high bits in the AX register, we extended the value into a word, and can add it with the value of y:

MOV AL,z

SUB AH,AH

ADD AX,y

Assuming y is equal to 3, the result would be 4, stored in AX. To save it into x we can append two more zero bytes:

MOV AL,z

SUB AH,AH

ADD AX,y

MOV word ptr x,AX

MOV word ptr x[2],0

ZERO EXTENSION, used in the example, is in fact a correct solution, just not for the our formula; with diffently chosen numbers it will fail. Consider

int x;

short y=1;

char z=-1;

x = y+z;

ZERO extending z as we did above will result in the value 255 in the AX; x therefore will be computed as 256.

Why is this happening? ZERO EXTENSION always produces non-negative numbers, even if the original value had the sign (high) bit set, the extended value will always have zero in the high position. Thus, it is suitable for un- signed data types, the example

unsigned int x;

unsigned short y;

unsigned char z;

x = y+z;

will work fine with any values of x and y. For signed data types we need a different type of extension. For simplicity, let’s look only at conversion AL ⇒ AX. The value in AL is considered as signed. If it happens to be nonnegative, we can ZERO-EXTEND, as we did above

SUB AH,AH

if it happens to negative, then we should instead fill AH with 1s, not 0’s:

MOV AH,0FFh

In general, the extension byte(or bytes) will be filled with the sign of the number, the procedure itself is therefore called SIGN EXTENSION.

One practical difference between the two types of extensions: while the unsigned (ZERO) extension can be implemented efficiently with the instructions we already have, an implementation of the signed extension would require coding the logic like

if (AL>=0) AH=0; else AH=255;

can this pseudocode be written in assembly? Yes, of course. Can it be written efficiently? No.

Exercise: Implement the pseudocode above in assembler.

Therefore, the instruction set provides sign conversion instructions which we will now introduce

CBW (Convert Byte to Word)

CBW, just like other instructions in this group, does not have arguments, they all work on the accumulor. CBW specifically converts AL to AX, using sign extension.

CWD (Convert Word to Doubleword)

CWD converts 2-byte value stored in AX into 4-byte value stored on DX and AX, DX has the low bits.

Why not convert to a single doubleword register? Two reasons: firstly, CWD was added before the processor had 32-bit registers, but see CWDE below.

Secondly, in some cases this is actually more convenient; see the examples related to division below.

Our example program, written in assembler, would use both of these instructions:

MOV AL,z

CBW

ADD AX,y

CWD

MOV word ptr x,AX

MOV word ptr x[2],DX

The following two instructions did not exist in the original 8086 and were added only in 80386, when 32-bit registers were first introduced:

CWDE (Convert Word to Doubleword Extended)

sign-extends AX into EAX.

CDQ (Convert Doubleword to Quadword)

sign-extends EAX into EDX:EAX (to store a quadword on a single register one would need a 64-register!)

The four forms given above are the most efficient to use and the code should be written in such a way as to have the data to be sign-converted on the accumulator; unsigned data conversion, on the other hand, can be done on other registers:

MOV BL,z

SUB BH,BH ; now BX has zero-extended value of z.

In addition to these instructions, Intel – beginning with 80386 – offers more general forms

MOVSX target,source (MOV with Sign eXtension)

MOVZX target,source (MOV with Zero eXtension)

where target and source are either memory or registers, at least one of the operands must be a register ( memory-to-memory operations are not allowed ), and – unlike the usual “size must match” rule, here the target should be larger than the source.

For example:

MOVSX AX,AL

MOVZX EBX,DL

MOVSX word ptr DS:[10],BL

MOVZX EAX,byte ptr DS:[BX]

Notice that the first instruction in the example is equivalent to CBW – in terms of what it does! – but its encoding is 4 bytes vs only 1 and execution is slower. Thus, the accumulator forms are still the most efficient.

Encoding: CBW is encoded as a single byte 0x98, CWD as a single byte 0x99.

Data conversion instructions are not considered to be arithmetic, there- fore no flags are altered.

WARNING: X86 emulator does not support 32-bit instructions, so CWDE, CDQ, MOVSX, MOVZX would be rejected by it. Maybe one day :P

Now, what about conversion from a larger data size to a smaller one? Such situations surely happen, for example:

short a, int b;

a=b;

Well, in general this cannot be done: the data range of int is larger than that of short, so not every int value can be correctly represented as a short. Most compilers would issue a warning and proceed by saving only the low word:

MOV EAX,b

MOV a,AX

half of the EAX register is not saved at all! This may or may not work correctly, depending on the value of b, and assembler cannot solve what is impossible to solve. Assembler, however, offers a simple test that checks if the operation will work correctly:

MOV AX,word ptr b % lo half of b

MOV BX,word ptr b+2 % hi half of b

CWD

CMP DX,BX

JNZ error

MOV a,AX

In other words, if the true value of b can be recovered from the lo half, the number fits in the short range. For unsigned types, the check is even simpler

CMP word ptr b+2,0 % hi half of b

JNZ error

MOV AX,word ptr b % lo half of b

MOV a,AX

3.19 Multiplication and Division

These two operations were left to the end, because of the differences from other operations.

One of the differences is the higher cost: multiplication is slower (processer- dependent, but 3 times slower is generally a correct estimate) from other operations; division is yet slower. (compare this with performing the op- eration on paper: addition and subtraction are easy, multiplication is more

difficult, and if forgot how unpleasant division is, review your elementary school notes!).

Because of this one should always try to use these operations sparingly, or – if possible – not use them at all. For example, multiplication by a power of 2 should never be done as multiplication, same goes for division by a power of 2. This is far from the only possible savings.

Another problem – specific to multiplication – is the growth of the length of the result. With binary operations studied so far the lenght of the result is the same as the length of the arguments (AND, OR, . . . ) or longer by only one bit which can be saved in carry (ADD, SUB, . . . ). Multi- plication may double the lentgh – this requires a different type of syntax, and different understanding of the “size must match” rule. (With division, we will see yet another form of syntax and a direct violation of the “size must match” rule.)

Further, not every register is capable of doing these operations. The most common/usable forms require the accumulator (or work best on the accumulator).

Finally, multiplication and division provide different operation for signed and unsigned arithmetic.

With this in mind, let’s look at a sample instruction:

IMUL r/m (Integer (signed) MULtiplication)

We notice that while multiplication operation requires two arguments, the format specifies only one. The register or memory operand is multiplied by the accumulator of the same size as the operand. For example:

IMUL CL ; multiply CL by AL (8 bit)

IMUL word ptr DS:[BX]; multiply memory word by AX (16 bit)

IMUL ESI ; multiply ESI by EAX (32 bit)

the result is saved in the extended accumulator, it is AX for 8 bit operands, DX:AX for 16 bit arguments and EDX:EAX for 32 bit arguments. In all cases the length of the extended accumulator is twice the length of the arguments. This assures that the result is computed correctly, but does not

suggest what to do with the result if it is too large to work with (64 bit in a 32-bit program, or 128 bit after a 64-bit multiplication that also exists.)

Example:

MOV AL,7

MOV CL,13

IMUL CL ; results in AX being 7*13=91

MOV AL,255

MOV CL,255

IMUL CL ; results in AX being 1.

in the second multiplication the arguments are interpreted as signed num- bers, so we are actually squaring -1!. (Intel opcode names are not always consistent. IMUL uses the word Integer to mean Signed! Unsigned multiplication that we look at next is also an integer multiplication, all the x86 registers are integer!

The unsigned counterpart is

MUL r/m ((unsigned) MULtiplication)

and follows the same format as IMUL. The results, however, may be differ- ent:

MOV AL,7

MOV CL,13

MUL CL ; results in AX being 7*13=91

MOV AL,255

MOV CL,255

MUL CL ; results in AX being 0xFE01 = 65025

We now turn our attention to division. In addition to the already men- tioned, it has one extra feature: division is not always possible. But let us begin with the format:

IDIV r/m (Integer (signed) DIVision)

Once again, some of the arguments are implicit, the explicit argument is the Divisor in the operation. The quotent will appear in the accumulator, of the same size as the divisor; the divident is taken from the extended accumulator. For example:

MOV AX,100

MOV BL,7

IDIV BL ; result (14) in AL

...

IDIV word ptr ds:[10]

; DX:AX is divided by memory word, result in AX

...

IDIV ESI ; EDX:EAX is divided by ESI, result in EAX

Division operations compute the remainder, which is stored in the 2nd half of the extended accumular (AH, DX, EDX ), this makes separate MOD operation unnecessary.

DIV r/m ( (unsigned) DIVision)

it the unsigned counterpart of IDIV

Exercise: find inputs for which IDIV and DIV produce different results.

In some cases division cannot be performed and results in an exception (often meaning a program crash or even an OS crash:

SUB CX,CX

IDIV CX

100

division by zero is the primary example of it, the above example will fail regardless of the value of the divident.

In case of a division overflow (division by zero is one example of it, but not the only one), the hardware will call INT 0 (see explanation of interrupts and handlers which is TBW). What happens next fully depends on the installed interrupt handler; the outcome may be an error message and program termination, crash of the operating system, or a normal exit from the program with a error message like your program has performed an illegal operation.

Division overflow may occur with a non-zero divisor too:

MOV AX,1000

MOV CL,2

IDIV CL

While we are dividing by 2, this division cannot be done: 1000/2 is 500, too large for a byte register (AL). The outcome is the same as with division by zero, and in fact some exception handlers may report this as a division by zero – it is not.

In the next example, division result (same 500) is to be placed into 2 byte accumulator.

MOV AX,1000

MOV CX,2

IDIV CX

this appears possible, but an overflow still may occur.

Exercise: Understand why

In general to avoid overflows one should use longer registers, a byte division is very overflow prone.

IMUL, unlike the other instructions in this group, has additional formats, including multiplication by an immediate.

TODO : discuss MULDIV TODO : discuss DIVMOD

101

3.20 LEA : Load effective address

LEA s,t (Load Effective Address)

s=&t;

LEA is not an arithmeric operations sensu stricta, but it is closely related to address computation used in arithmetic instruction.

We begin with a closer look at what arithmetic instructions actually do, for example

ADD AX,[BX+SI+10]

Two parts of the computation are:

1. compute the memory address : BX+SI+10

2. retrieve the data from the computed address and add it to the AX register.

We notice that our sample instruction actually does three additions, not one!

The first part of the computation is exactly what LEA does, and using LEA we can break the addition into two steps:

LEA DI,[BX+SI+10]

ADD AX,[DI]

This is an equivalent, and obviously slower computation – we bring it for illustration purposes only, not a suggestion for coding.

LEA is more restrictive than other 2-operand instruction in terms of the allowed operands: the second operand must be a memory (otherwise the concept of an address does not apply; the first argument is therefore always a register.

LEA, not being an arithmetic instructions, does not affect flags. The size must match rule does not apply to LEA – it actually does not

move data. LEA allows the arguments to be of different sizes, albeit such forms of the instruction are rarely useful. The following are all valid exam- ples with different sizes of the arguments (16 and 32):

102

LEA DI,[BX+SI+3] ; 16 and 16

LEA EDI,[BX+SI+3] ; 32 and 16, 0-extension used

LEA DI,[EAX+EBX+3] ; 16 and 32, truncation

LEA EDI,[EAX+EBX+3] ; 32 and 32

With static addresses, LEA does not accomplish anything more than MOV can do :

msg DB ’Hello, World!’

LEA SI,msg

MOV SI,offset msg

(but notice the difference in the syntax). In this case, MOV is slightly more efficient (3-byte encoding rather than 4) and some assemblers (TASM!) will actually compile LEA as MOV!

With addresses that include registers, LEA can be emulated as two or three instructions:

nums DD 1000 dup(?)

......

LEA AX,nums[BX]

......

MOV AX,offset nums

ADD AX,BX

One important but rarely known detail is that the ability of LEA to per- form “free” additions translates to its ability to do free multiplications, lead- ing to a new fast way to multiply by small numbers.

Recall1 that in 32-bit mode the addresses are in the form

[base*scale+index+offset]

with base being any 32-bit register but ESP, index being any 32-bit register, scale is 1,2,4 or 8.

Now, consider the following example:

1actually this section has not been written!—maybe it will be yet

103

LEA EAX,[EAX+EAX*2]

This is a valid 32-bit addressing instruction, we can use the same regis- ter as the “index” and the “base”. The effect of the instruction is multipli- cation by 3, not surprisingly faster than IMUL. Using

LEA EAX,[EAX+EAX*2]

LEA EAX,[EAX+EAX*4]

to multiply by 15 is also faster than IMUL, despite being two instructions.

LEA EAX,[EAX+EAX*2]

SHL EAX,4

is a faster way to multiply by 12, and so on. For many small numbers there are similar tricks.

3.21 Long integers: ADC and SBB

Integer arithmetic can be performed on numbers that consists of more bits than the register size. This was done already on 8-bit processors in the Dark Ages of CP/M; exactly the same approach works on modern 32- and 64- processors to work with numbers that are yet longer. While the technique is the same, the usability of it is less now since for most computations even 32-bit integers are sufficient.

We will first work out the technique mathematically using 8-bit registers only, while trying to add 16-bit numbers. (The techique is fully scalable, the choice of small register sie is to make the examples easier to understand.)

Consider this calculation:

short x,y,z;

z=x+y;

Using 16-bit registers we can write it in assembler as

; program 1

MOV AX,x

ADD AX,y

MOV z,AX

104

We now assume that the 16-bit registers are not available and try to perform it using 8-bit registers only. How about this code?

; program 2

MOV AL,byte ptr x

ADD AL,byte ptr y

MOV byte ptr z,AL

MOV AH,byte ptr x[1]

ADD AH,byte ptr y[1]

MOV byte ptr z[1],AH

for some values it will indeed work. For example, for x=1234h and y=5678h, the result would be 68ACh, same in both programs:

1234h 12h 34h

+ 5678h + 56h 78h

----- -------

68ACh 68h ACh

For others, the result would different. If x=80h (128 decimal) and y=80h, the 16-bit program will produce 100h (256 decimal) correctly, whereas the 8-bit program above will produce 0: adding 80h with itself on an 8-bit register gives 0!.

0080h 00h 80h

+ 0080h + 00h 80h

----- -------

0100h 00h 00h

The source of the problem is the carry produced by the first 8-bit addi- tion: we are not using it. 80h+80h indeed results in 8 zero bits, but also a carry, indicating an extra one to be added to the 9th bit – but we are ignoring it.

It is actually possible to correct the above program without introducing new instructions:

; program 3

MOV AL,byte ptr x

ADD AL,byte ptr y

105

MOV byte ptr z,AL

MOV AH,byte ptr x[1]

JNC skip

INC AH

skip:

ADD AH,byte ptr y[1]

MOV byte ptr z[1],AH

(the INC instruction adds the previously ignored carry), but the Intel in- struction set has a shortcut that would result in a faster code:

; program 4

MOV AL,byte ptr x

ADD AL,byte ptr y

MOV byte ptr z,AL

MOV AH,byte ptr x[1]

ADC AH,byte ptr y[1]

MOV byte ptr z[1],AH

The new ADC instruction adds the 2nd argument and the carry to the 1st argument – this is exactly the computation needed.

ADC t,s; t=t+s+carry (ADd with Carry)

Except for the addition of the carry, ADC follows exactly the same rules as ADD, including the syntax and setting of the flags.

We now scale the example: how about adding 128-bit numbers? This cannot be done with a single addition on any of the current processors – you would need 128-bit registers for that!

In this example we will assume that x occupies 16 bytes (128 = 16 × 8) of memory beginning with address 100, y and z (same size both) are located at 200 and 300. We further assume that the bytes in out 128-bit integers are written in the Intel reversed order. The objective is, as before, to compute z=x+y.

How about this?

106

; program 5

MOV EAX,dword ptr x

ADD EAX,dword ptr y

MOV dword ptr z,EAX

MOV EAX,dword ptr x[4]

ADC EAX,dword ptr y[4]

MOV dword ptr z[4],EAX

MOV EAX,dword ptr x[8]

ADC EAX,dword ptr y[8]

MOV dword ptr z[8],EAX

MOV EAX,dword ptr x[12]

ADC EAX,dword ptr y[12]

MOV dword ptr z[12],EAX

notice that this computation could have been done using only two 64-bit additions, or using 8 16-bit additions, or using 16 8-bit additions. The code would look similar in all cases; naturally the code using longer registers will have fewer instructions and would run faster.

The code shown in the previous program can be extended to handle interegs of any size. It may be preferrable, however, to write this code as a loop. We notice that all the editions are done using the ADC instruction, except for the very first one, done by ADD – this is because there is no carry on the very first addition. We can use ADC for all additions if we are certain that the initial value of the carry flag is 0. The loop form of the previous program thus becomes:

; program final??

CLC ; clear carry

MOV CX,4

SUB BX,BX

L: MOV EAX,dword ptr x[BX]

ADC EAX,dword ptr y[BX]

MOV dword ptr z[BX],EAX

ADD BX,4

LOOP L

NOTE: the program above should be seen just an idea; to make it work one needs to ensure that the carry flag is not corrupted by other instruc- tions! – this is left to the reader.

The new instruction

107

CLC (CLear Carry)

sets carry to 0, thus allowing us to use ADC for the first loop iteration. We may as well introduce two other instructions that also alter the carry

flag.

STC (SeT Carry)

set carry to 1

CMC (CompleMent Carr)

toggle the carry bit Yet longer numbers can be handled with the code above by simply in-

creasing the number of times the loop is executed. The execution time will be proportional to O(n/r) where n is the bitlength of the numbers, and r is the bitlength of the register.

A similar problem for subtraction will not be examined in details. We shall only say that the analog of ADC for subtraction is the SBB instruction:

SBB t,s; t=t-s-carry (SuBtract with Borrow)

and all of the above examples can be adapted to subtraction by changing all ADD’s to SUB’s and all ADC’s to SBB’s.

108

3.21.1 24-bit case

As a special example, consider 24-bit (3 byte) numbers.

Exercise: Why would one even want to look at such?

Addition of such numbers can be done as three byte size additions or as one byte size and one word size.

; program 6

x db 3 dup(?)

y db 3 dup(?)

....

MOV AL,byte ptr x[0]

ADD AL,byte ptr y[0]

MOV byte ptr z[0],AL

MOV AL,byte ptr x[1]

ADC AL,byte ptr y[1]

MOV byte ptr z[1],AL

MOV AL,byte ptr x[2]

ADC AL,byte ptr y[2]

MOV byte ptr z[2],AL

Notice that even in column addition of decimal numbers, chunks do not need to be of the same size:

923

+ 278

---

we can add 23+78, obtaining 01 and carry of 1! In programs below we break the numbers into byte+word and word+byte

chunks.

109

; program 6a

x db 3 dup(?)

y db 3 dup(?)

....

MOV AL,byte ptr x[0]

ADD AL,byte ptr y[0]

MOV byte ptr z[0],AL

MOV AX,word ptr x[1]

ADC AX,word ptr y[1]

MOV word ptr z[1],AL

; program 6b

x db 3 dup(?)

y db 3 dup(?)

....

MOV AX,word ptr x[0]

ADD AX,word ptr y[0]

MOV word ptr z[0],AX

MOV AL,byte ptr x[2]

ADC AL,byte ptr y[2]

MOV byte ptr z[2],AL

Exercise: Are 6a and 6b equivalent?

3.21.2 other operations

• Bitwise AND, OR, XOR, NOT simply repeat the same operation on each chunk (loops for long data).

• Shifts propagate carry.

• NEG is left as a (nice!) exercise.

• (I)MUL is more complicated

• (I)DIV is even more complicated

110

The idea of long MUL: Assume x,y,z is twice the size of the register that is capable of multipli-

cation (N-bit long). Then x=xh ×R + xl, y=yh ×R + yl, where xl,yl are the low halves of

the values, xh and yh are the high halves, R is 2N.

x∗y = (xh∗R+xl)∗(yh∗R+yl) = xhyh∗R2 + (xhyl +xlyh)∗R+xlyl)

The formula contains 4 multiplications (multiplications by R are simply data movements)

Assume now that x and y are 64 bit, z is 128 bit.

x dd ?,?

y dd ?,?

z dd ?,?,?,?

SUB EAX,EAX

MOV z[8],EAX

MOV z[12],EAX

MOV EAX,x[0]

MUL y[0]

MOV z[0],EAX

MOV z[4],EDX

MOV EAX,x[0]

MUL y[4]

ADD z[4],EAX

ADC z[8],EDX

; ADC z[12],0

MOV EAX,x[4]

MUL y[0]

ADD z[4],EAX

ADC z[8],EDX

ADC z[12],0

111

MOV EAX,x[4]

MUL EDX

ADD z[8],EAX

ADC z[12],EDX

This seems to required 4 multiplications (and generally m*N numbers would require m2 multiplications); in reality only 3 are needed. (Toom’s algorith – google for more)

Idea:

x∗y = (xh∗R+xl)∗(yh∗R+yl) = xhyh∗R2 + (xhyl +xlyh)∗R+xlyl)

requires 4 multiplications, but

x∗y = (xh∗R+xl)∗(yh∗R+yl) = xhyh∗R2+[(xh+xl)∗(yh+yl)−xhyh−xlyl]∗R+xlyl)

requires only 3:

• xh ∗yh

• xl ∗yl

• (xh + xl) ∗ (yh + yl)

Division algorithms

112

https://en.wikipedia.org/wiki/Division_algorithm

Chapter 4

Jumps

4.1 Unconditional Jumps

JMP label (unconditionally JuMP (goto) to label)

where label is any identifier, not a keyword, unique. Some scope rules apply.

JMP lab

lab:

JMP lab

(Jumps can go both forward and backward, multiple JMP instructions may target the same label).

l: JMP l

(fine, but infinite loop)

113

NOTE: label is a keyword, so this id cannot be used. In fact, the colon : is an abbreviation for label near.

Other forms of JMP syntax exist, including

JMP r/m (jump(goto) to address stored in r/m)

Only a trivial example for now:

MOV AX,offset lab

JMP AX

....

lab:

(offset is a keyword, cf. the “take address” & operator in C).

114

4.2 Conditional Jumps

J<cc> label (Jump based on cond code)

General idea: conditional jumps implement an equivalent of

if (<condition>) goto lab;

where <condition> is usually expressed in flags. NOTE: of course, this is exactly the programming style not recommended

for high-level languages!

115

4.2.1 Using Zero Flag

if (ZeroFlag) goto lab;

corresponds to an assembler instruction, JZ.

JZ label (Jump if Zero flag)

Example:

; int x;

;

; if (x!=0)

; { <BODY> }

;

CMP x,0

JZ skip

; <BODY>

skip:

if x is already on a register, CMP may not be needed.

; int x,y,z;

;

; if (x+y+z!=0)

; { <BODY> }

;

MOV EAX,x

ADD EAX,y

ADD EAX,z

JZ skip

; <BODY>

skip:

116

JE label (Jump if equal)

is a synomym of JZ

; int x,y;

;

; if (x!=y)

; { <BODY> }

;

MOV EAX,x

SUB EAX,y

JE skip

; <BODY>

skip:

JNZ label (Jump if not Zero flag)

JNE label (Jump if not Equal)

; int x;

;

; if (x==0)

; { <BODY> }

;

MOV EAX,x

CMP EAX,0

JNZ skip

; <BODY>

skip:

117

; int x,y;

;

; if (x==y)

; { <BODY> }

;

MOV EAX,x

SUB EAX,y

JNE skip

; <BODY>

skip:

NOTE: The four (actually, two) instructions JZ, JE, JNZ, JNE are based on the Zero Flag only.

NOTE: The N “prefix” in the instruction name is used with other instruc- tions.

NOTE: Conditions in conditional jumps are always reversed: we specify skipping, not doing!

118

4.2.2 Using Sign Flag

JS label (Jump if Sign)

JNS label (Jump if Not Sign)

; int x,y;

;

; if (x<0)

; { <BODY> }

;

CMP x,0

JNS skip

; <BODY>

skip:

119

JL label (Jump if Less)

JLE label (Jump if Less or Equal)

JG label (Jump if Greater)

JGE label (Jump if Greater or Equal)

JNL label (Jump if Not Less)

JNLE label (Jump if Not Less or Equal)

JNG label (Jump if Not Greater)

JNGE label (Jump if Not Greater or Equal)

(There are only 4 different instructions above!)

120

; int x,y;

;

; if (x<=y)

; { <BODY> }

;

MOV EAX,x

SUB EAX,y

JNLE skip ; reverse Less-or-Equal cond

; <BODY>

skip:

121

WARNING: the explanation below is simplified, actually JS and JL are different, and the Overflow flag plays a role. For nearly all practical code the results are the same. For more details (and true picture) check this.

Why this extra mess? Because CMP may overflow!

How does this work? Primary instructions: JL (==JNGE), JLE (==JNG), JNL (==JGE), JNLE

(==JG). We assume JL ≈ JS. Then

JL jumps if sign flag is set

JLE jumps if sign or zero flag is set

JNL jumps if sign flag is not set

JNLE jumps if neither sign or zero flag is set

122

https://stackoverflow.com/questions/25031860/difference-between-js-and-jl-x86-instructions/25055804

4.2.3 A few examples

Example: implement ABS function

abs(x) = {−x if x < 0

x if x ≥ 0 Buggy!

; EAX=abs(EAX)

OR EAX,EAX

JNZ done

NEG EAX

done:

Fix the error above! The correct code is

; EAX=abs(EAX)

OR EAX,EAX

JNS done

NEG EAX

done:

Example: implement SIGN function (Sign definition:

sign(x) =

{−1 if x < 0 0 if x = 0 1 if x > 0

Actually

∀x,x = abs(x) × sign(x)

; EAX=sign(EAX)

OR EAX,EAX

JZ done

123

MOV EAX,1

JG done

NEG EAX

done:

Example: What does this do?

MOV CX,10

SUB AX,AX

l: ADD AX,CX

DEC CX

JNZ l

124

These instructions are not suitable for unsigned arithmetic! Example

MOV AL,-1

CMP AL,1

JL lab

...

lab:

This works correctly, JMP is taken

MOV AL,255

CMP AL,1

JL lab

...

lab:

Counter-intuitively, the jump is taken, this is because 255 = −1 and is interpreted as such!

Correct way to program this is

MOV AL,255

CMP AL,1

JB lab

...

lab:

4.2.4 Using Carry Flag

JC label (Jump if Carry)

JNC label (Jump if Not Carry)

125

MOV EAX,x

ADD EAX,y

JC error ; overflow !

...

error:

126

JB label (Jump if Below)

JBE label (Jump if Below or Equal)

JA label (Jump if Above)

JAE label (Jump if Above or Equal)

JNB label (Jump if Not Below)

JNBE label (Jump if Not Below or Equal)

JNA label (Jump if Not Above)

JNAE label (Jump if Not Above or Equal)

Only 4 different instructions here, further, JB==JC, JNB==JNC.

127

Exercise: Work out how the flags are used

4.2.5 Mixing signed and unsigned

Consider two examples:

//example S

int x,y;

if (x<y) ....

//example U

unsigned int x,y;

if (x<y) ....

In Example S, compiler will generate a CMP instruction, followed by an inverted signed jump, specifically here JNL.

In Example U, compiler will generate a CMP instruction, followed by an inverted unsigned jump, specifically here JNB.

This makes it easy for the user — he does not need to think which instruction to use. But there is a problem:

//example M

int x; unsigned y;

if (x<y) ....

Using either signed or unsigned instruction will work incorrectly in some cases! Compilers would typically issue a warning (f.e. “signed un- signed mismatch”, often ignored by the programmer.

How to do this correctly? Let’s consider a smaller case first.

128

//example MB

unsigned char x, /*signed*/ char y;

if (x<y) ....

8-bit comparison will not work. But conversion into a range that in- cludes both char and unsigned char will:

//example MB1

unsigned char x, /*signed*/ char y;

short sx,sy;

sx=x; sy=y;

if (sx<sy) ....

or, simply:

//example MB1

unsigned char x, /*signed*/ char y;

if ((short)x<(short)y) ....

Since short is a signed datatype, generated instruction will be signed. When comparing signed and unsigned shorts, we can convert both to

int (in fact, we could have used int above too). But what do we do when comparing signed and unsigned int’s?

In assembler, the above is of course also doable, but so is the 32-bit comparison!

(The code cannot be shown at this time : we bypassed conversion in- structions).

In C a proper comparison function can be written too.

129

4.3 Jumps encoding

opc del

where opc is the opcode (0x74 for JE, 0x78 for JS,...) and del is a signed byte.

IP is first incremented to point to the next instruction, then, if the jump is taken, IP=IP+del;

The short form of JMP is encoded similarly, using the opcode byte 0xEB

0xEB del

The near form of JMP uses opcode 0xE9 and 2-byte (16-bit segs) or 4- byte (32-bit segs) delta.

16 bit seg : 0xEB d1 d2

32 bit seg : 0xE9 d1 d2 d3 d4

If conditional jump cannot reach its target, one can overjump:

JZ lab

; more than 128 bytes

lab:

may not compile, but

JNZ temp

JMP lab

temp:

; more than 128 bytes

lab:

130

will.

80386+ has alternative longer forms of conditional jumps instructions, with 16-bit delta, they begin with bytes 0Fh 8?h ..

4.4 LOOPS

Revisiting the previously seen example:

MOV CX,10

SUB AX,AX

l: ADD AX,CX

DEC CX

JNZ l

We notice that adding numbers 10 down to 1 is more efficient than doing 1 to 10.

131

MOV CX,1

SUB AX,AX

l: ADD AX,CX

INC CX

CMP CX,11 ; EXTRA

JNZ l

Generally, running loops toward 0 is more efficient!. We next notice that we actually implemented a do while loop.

; cx=1; ax=0; do { ax=ax+cx; cx++; } while (cx<=10);

What about a while ?

; cx=1; ax=0; while (cx<=10) {ax=ax+cx; cx++; }

MOV CX,1

SUB AX,AX

l: CMP CX,10 ; EXTRA

JA done

ADD AX,CX

INC CX

JMP l ; EXTRA

done:

Each execution of while in n iteration loop requires 2n jumps vs n for do while.

Generally, do-while is more efficient than while!.

We will not consider for – generally equivalent to while. Returning to

MOV CX,10

SUB AX,AX

l: ADD AX,CX

DEC CX

JNZ l

This is a very common form of a loop, and Intel allows to optimize this further.

132

LOOP label (CX–; if (CX>0) JMP label)

Thus:

MOV CX,10

SUB AX,AX

l: ADD AX,CX

LOOP l

for one instruction fewer!

We can generalize this into

MOV CX,n

SUB AX,AX

l: ADD AX,CX

LOOP l

But what will happen if supplied n is 0 ?

One solution:

MOV CX,n

SUB AX,AX

OR CX,CX

JZ done

l: ADD AX,CX

LOOP l

done:

Again, Intel provides a shortcut,

133

JCXZ label (if (CX==0) JMP label)

Thus:

MOV CX,n

SUB AX,AX

JCXZ done

l: ADD AX,CX

LOOP l

done:

LOOP and JCXZ are encoded similarly to conditional jumps. LOOP’s op- code is 0E2h, JCXZ’s is 0xE3.

Mentioning only: there are also less commonly needed LOOPZ and LOOPNZ; All four instructions can work off ECX.

Double LOOPs should save CX!

MOV CX,n

L1:

....

**SAVE** CX

MOV CX,m

L2:

....

LOOP L2

**RESTORE** CX

LOOP L1

SAVE/RESTORE can be done using another reg (for example, XCHG CX,DX), memory, or stack (usually the best).

134

One more “real” example:

MOV CX,100

MOV AX,1

MOV BX,1

L: ; print AX

ADD BX,AX

XCHG BX,AX

LOOP L

4.5 Control statements templates

; goto lab;

;

JMP <lab>

lab:

; if <COND> <BODY>

;

; convert condition into a flag

JN<cond> skip

<BODY>

skip:

135

; if <COND> <BODY1> ;else <BODY2>

;

; convert condition into a flag

JN<cond> lelse

<BODY1>

JMP done

lelse:

<BODY2>

done:

; while (true) <BODY>

; for (;;) <BODY>

lagain:

<BODY>

JMP lagain

ldone: ; may be used for break in <BODY>

; while <COND> <BODY>

;

lagain:

; convert condition into a flag

JN<cond> ldone:

<BODY>

136

JMP lagain

ldone:

; do <BODY> while <COND>;

;

lagain:

<BODY>

; convert condition into a flag

J<cond> lagain

; while <COND> { <BODY1> break; <BODY2>}

;

lagain:

; convert condition into a flag

JN<cond> ldone:

<BODY1>

JMP ldone

<BODY2>

JMP lagain

ldone:

137

; while <COND> { <BODY1> continue; <BODY2>}

;

lagain:

; convert condition into a flag

JN<cond> ldone:

<BODY1>

JMP lcont

<BODY2>

lcont:

JMP lagain

ldone:

NOT covered: for, switch, and how to deal with more complex expres- sions (see “complete and shortcut” section).

138

Chapter 5

Variables

5.1 Declaring variables

Assembler language includes operators for declaring variables. The primitive data declarations operators correspond to the primitive

types in the assembler language:

139

[name] DB value[s] (Declare Byte)

[name] DW value[s] (Declare Word)

[name] DD value[s] (Declare Doubleword)

[name] DQ value[s] (Declare Qword)

[name] DT value[s] (Declare Tenbyte)

The [name] field of the declaration should be a unique identifier; this field is usually present but not required. The [name] field is used to refer to the variable and if absent, there is no way to refer to it.

The value field refers to the initial value of the variable, this field is always required.

Examples:

year dw 2020

a1 db ’a’ ; as char

a2 db 97 ; as decimal

a3 db 61h ; as hex

a4 db 01100001b ; as binary

dwrd dd 12345678

notice that a1,a2,a3 and a4 all provide the same initial value, in different

140

formats. ’a’ is 97 since the character ’a’ is in the 97th position in the ASCII table.

Multiple initialization values are allowed, in such cases the declared variable is actually an array:

primes DW 2,3,5,7,11,13

str1 DB ’H’,’e’,’l’,’l’,’o’

str2 DB ’Hello’

str3 DB "H",’e’,"L","L",’o’

str4 DB "HELLO"

str5 DB ’HE’,"LLO"

All five strings above contain identical data: both single of double quotes are allowed – but single quotes must match single quotes and double quotes must match double. Further, strings can be enteres either as strings or as sequences of characters.

s DB ’Hello, World’,0

shows a zero-terminated string (C language string format). One would use single quotes to enter a string that includes double

quotes and vice versa:

m DB ’"Hello", he said’

and will have to break the string into parts if it includes both single and double quotes

m DB ’"Don’,"’",’t", he said’

Long sequence of initialization values can be broken into multiple dec- larations:

msg DB "ERROR: you must not use"

DB "this operator the way you do"

DB "please delete the program and"

DB ’read the textbook’,0

141

in the example above we do not name the lines after the first – the program will not refer to them.

Often, one needs to calculate the length of a string like msg above. This can be done during the run time, by counting the characters up to the terminator, or during compilation:

msg DB "ERROR: you must not use"

DB "this operator the way you do"

DB "please delete the program and"

DB ’read the textbook’

msgend DB 0

msglen DW offset msgend-offset msg

....

MOV CX,offset msgend-offset msg

The offset operator is similar to the address operator & in C/C++. offset msgend-offset msg is a constant, computed during the compile time, it is equal to the number of the characters in the message, excluding the terminating null character. With this value known we actually do not need the null terminator at all and can instead use

msg DB "ERROR: you must not use"

DB "this operator the way you do"

DB "please delete the program and"

DB ’read the textbook’

msgend LABEL BYTE

msglen DW offset msgend-offset msg

....

MOV CX,offset msgend-offset msg

where LABEL declares a byte type variable with NO initial values. LABEL is used instead of DB since the latter will always allocate at least one byte. Other types that may follow LABEL include WORD, DWORD, QWORD, TENBYTE.

To specify an uninitialized variable, use “?”:

x DD ?

y DD ?

142

Declared variable can be referred to in code:

x DW ?

y DW ?

z DW ?

.....

MOV AX,x

ADD AX,y

MOV z,AX

Notice that declared variables have size and the usual size rules apply:

MOV x,10 ; OK, size known from x

MOV AL,x ; error, size mismatch

MOV AL,byte ptr x ; OK, size casted to byte

The dup operator can be used to specify large arrays without having to list all the initial values:

ar1 DW 1,1,1,1,1

ar2 DW 5 dup(1)

two arrays above contain exactly the same data. More practically useful example would be

ar3 DW 1000 dup(?)

– a 1000-element unitialized array.

Initialized and unitialized entries can be used within the same declara- tion:

aa DW 1,2,?,?,5,6

fib DW 1,1,998 dup(?)

the fib array will be used to compute the Fibonacci numbers below. First two elements are initialized statically, the subsequent entries will be com- puted.

143

Data declarations are not executable instructions and in general one should separate them from code so they do not get executed. This is usually accomplished by placing them into a separate segment. If there is only one segment (.com files), then the typical approach is to structure the program as follows:

start: ; entry point of the program

JMP real_start

.....

data and procedure declarations

.....

real_start:

; code begins here

because of overjumping the data, it will never get executed. An exception to the above is the case when we know the exactly which

instructions the data corresponds to. For example,

MOV AX,BX

DB 90h

ADD SI,DI

is totally safe, 90h is the NOP operation! While there is no good reason to code NOP as data, similar approach can be used to enter an instruction that is not supported by the assembler compiler, or a form of the instruction assembler will not produce.

There are two different ways to encode ADD BX,CX as machine code. One of the is normally produced by the assembler compiler, the other can only be entered using db’s.

Exercise: Do it.

5.2 Using arrays

This section contains several examples of common snippets of code. Example 1: initialize a 256-element byte array to contain all possible

256 characters in ascending order.

144

ba DB 256 dup(?)

.....

SUB BX,BX

MOV CX,256

l: MOV ba[BX],BL

INC BX

LOOP l

notice that we use the same register as both index and the current value to be stored; this is possible only because of the byte size of the data.

Example 2: initialize a 1000-element word array to contain numbers 0 through 999 in ascending order.

nums DW 1000 dup(?)

.....

SUB AX,AX

SUB BX,BX

MOV CX,1000

l: MOV nums[BX],AX

INC AX

ADD BX,2

LOOP l

Exercise: Explain what would happen if BX is incremented only by 1.

Generally, the index in loops like above needs to be incremented by the byte size of the data – the index counts bytes, and not elements. One very important implication is that in high language loops of this type there is a hidden multiplication by the element size, depending on the compiler it may or may not be implemented as a multiplication.

Consider two ways to implement the above code in C :

short s[1000]; int i;

for (i=0; i<1000; i++)

s[i]=i;

145

this code has a hidden multiplication by 2, index i needs to be translated into an offset to compute the address of s[i]. The other way to code the example

short s[1000]; int i; short *p=s;

for (i=0; i<1000; i++)

*p++=i;

does not have a hidden multiplication and often would result in better com- piled code.

Example 3: for collection, let us also initialize a 1000-element double- word (int) array to contain numbers 0 through 999 in ascending order.

nums DD 1000 dup(?)

.....

SUB EAX,EAX

SUB BX,BX

MOV CX,1000

l: MOV nums[BX],EAX

INC EAX

ADD BX,4

LOOP l

Example 4: Compute sum and average of the elements in a 1000- element word array (in this and subsequent examples we assume nums to be initialized with some values first.)

nums DW 1000 dup(?)

sum DW ?

ave DW ?

.....

SUB AX,AX

SUB BX,BX

MOV CX,1000

l: ADD AX,nums[BX]

ADD BX,2

LOOP l

MOV sum,AX

CWD

146

MOV BX,1000

IDIV BX

MOV ave,AX

(note that the answer will be rounded down to an integer) Example 5: Find the largest element in an array of signed numbers

nums DW 1000 dup(?)

max DW ?

.....

MOV AX,nums[0]

MOV BX,2

MOV CX,999

l: CMP AX,nums[BX]

JGE skip ; use JAE for unsigned

MOV AX,nums[BX]

skip:ADD BX,2

LOOP l

MOV max,AX

Use JLE and JBE if searching for the smallest entry. Example 6: Compute Fibonacci numbers

fibs DW 1,1,998 dup(?)

.....

MOV BX,4

MOV CX,998

l: MOV AX,fibs[BX-4]

ADD AX,fibs[BX-2]

MOV fibs[BX],AX

ADD BX,2

LOOP l

147

Exercise: Would all the values be computed correctly?

Example 7: Find value t in an array of numbers

nums DW 1000 dup(?)

max DW ?

......

MOV AX,t

SUB BX,BX

MOV CX,1000

l: CMP AX,nums[BX]

JE found

ADD BX,2

LOOP l

not_found:

.....

found:

.....

A more interesting program would be a binary search... it not all that difficult in assembler. We assume nums contains signed numbers in ascend- ing order.

nums DW 1000 dup(?)

......

MOV AX,t

MOV SI,0 ; left interval bound

MOV DI,999*2 ;right interval bound

l: CMP SI,DI

JA not_found

148

MOV BX,SI

ADD BX,DI

SHR BX,2

SHL BX,1 ; middle point

CMP t,nums[BX]

JE found

JG right

left:

MOV DI,BX

SUB DI,2

JMP l

right:

MOV SI,BX

ADD SI,2

JMP l

....

found:

....

not_found:

....

Exercise: Reverse the elements in an word array.

Exercise: Implement bubble sort.

149

5.3 Memory addressing syntax

16 bit addressing offers only the following nine schemes of entering the offset

• const

• BX+const

• BP+const (***)

• SI+const

• DI+const

• BX+SI+const

• BX+DI+const

• BP+SI+const (***)

• BP+DI+const (***)

Notes: How to remember: Base, or Index, or Both, or Neither

Double register addressing allows to efficiently access two dimensional array, but more often is used to access dynamic one-dimensional arrays (base points to the beginning of an array, index “indexes” it.)

This limitation allows for a compact encoding of the addressing scheme. Despite nine choices offered, only three bits are needed.

Options marked with (***) imply SS: segment use (others default to DS:).

150

5.4 Video memory in text mode.

For the default text video mode, the video memory begins in the middle of the B band, corresponding to the segment value B800. (the reasons are historical).

Let us begin with an example showing the idea, we will specify the exact rules later.

MOV AX,0B800h

MOV ES,AX ; ES=>video!

MOV byte ptr ES:[0],’A’

INT 20h ; terminate the program

If compiled (you would need to add extra lines to the file per assembler language syntax requirements) and run this program will deposit letter A into the upper left corner of the screen!

The general layout of video memory is the following: The screen has 25 rows and 80 columns of characters; each cell cor-

responds to two bytes in the video memory. The even bytes (0, 2, 4, ...) are the ASCII characters, the odd bytes define the attributes (colors) of the character in the preceding even address. Let us provide an example of this:

MOV AX,0B800h

MOV ES,AX ; ES=>video!

MOV byte ptr ES:[0],’A’

MOV byte ptr ES:[1],71h

INT 20h ; terminate the program

151

71h defines the attributes of the letter A on screen. The first nibble (7) is the background color, the second (1) is the foreground color: A would appear as blue on white, following these definitions:

• 0 – black (0x000000)

• 1 – blue (0x0000AA)

• 2 – green (0x00AA00)

• 3 – cyan (0x00AAAA)

• 4 – red (0xAA0000)

• 5 – magenta (0xAA00AA)

• 6 – brown (0xAA5500)

• 7 – white/light gray (0xAAAAAA)

• 8 – (dark) gray (0x555555)

• 9 – bright blue (0x5555FF)

• 10 – bright green (0x55FF55)

• 11 – bright cyan (0x55FFFF)

• 12 – bright red (0xFF5555)

• 13 – bright magenta (0xFF55FF)

• 14 – yellow (0xFFFF55)

• 15 – bright white (0xFFFFFF)

(In some modes the high bit of the background color indicates blinking rather than color, other variations of interpretation exist).

While MOV is the most common instruction to change video memory, any other instruction can be used. Consider:

152

MOV AX,0B800h

MOV ES,AX ; ES=>video!

MOV byte ptr ES:[0],’A’

MOV CX,256

L: INC byte ptr ES:[0]

LOOP L

INT 20h ; terminate the program

The program above will show A, then change it to B,.... eventually coming back to an A.

Will you be actually able to see the letters changing onscreen? Well.. no! This will happen too fast to notice.

Exercise: Fix it

Writing to video memory is the best way of doing full-screen output, most suitable for tables and games; other methods are used to produce scrollable console output (see INT 21h).

Assuming that the rows are numbered up-to-down 0 through 24, and the columns are numbered left-to-right 0 through 79, to write at position (X,Y) we should store information at offset 160 ∗ Y + 2 ∗ X in the video segment.

153

Chapter 6

Project #1

Due Date: 04/19/2021 Goal: implement Game of 1024. This a new simple board game(rather a puzzle), you can play the online

version Here. Project specifics:

1. Use direct write to the video memory (seg 0xB800).

2. Output will be explained(today), input can be done using int 16h (look up).

3. Recommendation to use color and line draw characters (2nd half of the ASCII OEM set)

4. Projects are individual and should not be copies of source code found elsewhere; such submissions will not be accepted.

5. You do not need to follow the layout exactly, shortcuts that simplify programming but essentially keep the game the same are fine.

154

https://1024game.org/

Basic grading guidelines:

• C something works (but perhaps not really playable)

• B playable with glitches/deficiencies.

• A enjoyable to play (well... to the degree the entire idea of such puzzle is)

6.0.1 Program outline

Hopefully, of some help – this is an overall structure of the program one can use.

0 Initialize the program, display.

1 Initialize the configuration

2 Display the board

3 Wait for a key (int 16h)

4 If the key indicates a move of a piece, update the configuration, go to step 2.

5 .If the key indicates new game, go to step 1.

6 If the key indicates quit, quit.

7 Ignore the key, go to step 3.

6.0.2 Submission

Submit the source (.asm) and the executable (.com or .exe), ideally by email.

Please not that some email providers kill attachments of certain types. Google is likely to reject .com or .exe, CCNY email does not like .asm.

To avoid problems : rename the files to give them “safe” extension, then zip(or rar) files together. Please do not name the archive “project.zip”, rather use your name to name the archive.

(Example: rename project.com to project.com.txt)

155

When submitting, CC: yourself – this way you would see that the project does come through.

Submission can be made either my CCNY email or to mv@panix.com (the latter address does not kill any attachments!).

Submission is your responsibility.

156

6.1 How to compile and run a program.

Assembler distribution (Download, unpack, make sure it works. The distribution contains DOS-

BOX, TASM, TLINK). At this time we only need to know how to create a binary from an

assembler program, a sample HELLO.ASM is provided.

.model tiny

.code

org 100h

start:

; BEGIN BODY

mov dx, offset hello

mov ah, 9

int 21h

mov ah, 4ch

int 21h

hello db ’Hello, world.’,13,10,’$’

; END BODY

end start

The part between BEGIN BODY and END BODY is replaced by your own code.

To compile: (assuming you are in the directory where hello.asm is unpacked; under DOSBOX or 32-bit OS)

C> BIN\TASM hello

This will create file HELLO.OBJ. To link

C>BIN\TLINK hello /t

157

http://tinyurl.com/cjdyruy

This will create file HELLO.COM. To run

C>Hello

This will say (guess what?) This output method (console write) will not work right for full-screen

apps, write into video directly, instead:

.model tiny

.code

org 100h

start: jmp real_start

; place your variables and/or procedures here.

real_start:

; BEGIN BODY

MOV AX,0B800h

MOV ES,AX ; ES=>video!

MOV byte ptr ES:[0],’A’

MOV CX,256

L: INC byte ptr ES:[0]

LOOP L

INT 20h ; terminate the program

end start

158

Chapter 7

Stack and procedures

Hardware stack discussed in the chapter is not a necessary feature of a hardware, many earlier designs did not provide it. However, having it helps in many ways, most notably in the ability to implement procedures efficiently.

7.1 Using stack

We will begin working with the stack by using it. For this section it is not important just how it works psysically, we only state that there is – somewhere in memory – a data structure, implemented in hardware, that parallels the stack software data structure. Namely, we can push things on the stack, we can pop them back, and the operations work in the LIFO fashion: Last In, First Out.

Let us begin with the syntax:

PUSH w/d m/r (PUSH word or doubleword)

POP w/d m/r (POP word or doubleword)

Notice that byte operations are not supported. Further, notice that push-

159

ing or poping memory actually results in a memory-to-memory operation! These stack operations are one of the exceptions to the general no memory- to-memory rule.

PUSH AX

....

POP AX

The example above shows saving a register on the stack and later restor- ing it. This would be helpful if the register – and we always are short on them – is needed for some other purpose in the code between PUSH and POP. This situation is very common and use of PUSH+POP tends to be shorter and more efficient than any alternative solution (saving a temporary variable, recomputing AX to restore the old value,...)

The code above will work if the omitted part does not have any stack operations, or if they are balanced. For example,

PUSH AX

POP AX

is fine, while

PUSH AX

PUSH CX

POP AX

PUSH AX

POP CX

POP AX

will not work right and – as will see later – will quite probably lead to a program crash.

A more sophisticated example is a typical double loop. We would like to execute a piece of code N times and then the entire loop M times. Irrespec- tively of exactly what we want to do, the optimal implementation would use LOOP instruction and thus look like this:

160

MOV CX,M

L1:

...

MOV CX,N

L2:

...

LOOP L2

...

LOOP L1

As written this code will not work correctly since the inner loop resets the value of the counter for the other loop! We therefore change it to

MOV CX,M

L1:

...

**save** CX

MOV CX,N

L2:

...

LOOP L2

**restore** CX

...

LOOP L1

**save** and **restore** are not assembler operations, just indica- tions of action to undertake. They can be implemented in multiple ways, PUSH and POP are generally the easiest implementation:

MOV CX,M

L1:

...

PUSH CX

MOV CX,N

L2:

...

LOOP L2

POP CX

...

LOOP L1

161

but something different can be done, for example save in another register:

MOV CX,M

L1:

...

MOV DX,CX

MOV CX,N

L2:

...

LOOP L2

MOV CX,DX

...

LOOP L1

this is actually a bit faster than using the stack, but one may not have an available register.

When saving multiple registers (or memory variables), one retrieves them in the reverse order:

PUSH AX

PUSH BX

PUSH CX

...

POP CX

POP BX

POP AX

Consider an example where we don’t follow the LIFO ordering:

PUSH BX

PUSH AX

POP BX

POP AX

This code would restore BX with the value previously in AX and vice versa, in effect this example is equivalent to

XCHG AX,BX

162

and in this specific case using single XCHG is obviously much better. How- ever, this would not be the case with segment registers, also supported by PUSH and POP:

PUSH DS

PUSH ES

POP DS

POP ES

there is no XCHG alternative, since XCHG does not allow segment registers as arguments!

One can also push and pop the FLAGS register, albeit the syntax is a bit different since FLAGS is not a keyword.

PUSHF (PUSH Flags register)

POPF (POP Flags register)

These instruction can be used to examine or change the contents of the FLAGS register. Consider:

PUSHF

POP AX ; now AX has all the bits of the FLAGS

OR AL,1 ; change the last bit of flags

PUSH AX

POPF ; now FLAGS are changed -- its last bit is set to 1.

What did we do exactly? To get the answer we need to know just what the changed bit of FLAGS is. The last bit happens to be the Carry Flag, and our code set it to one.

Using five instructions for just this is an overkill, STC would have done the same, but we do not have instructions like STC for other flags! – and to work with them we would have to be use code similar to the above.

(Related instructions SAHF and LAHF are not discussed here.)

163

PUSH immed16 (PUSH immediate)

This form was added in 80186. There is no POP counterpart. 80386 and newer processors allow the immediate argument to be a

doubleword, adding new syntax:

push 123h ; push word

push dword 123456h ; push dword

while pushing doubleword can be accomplished with a single instruc- tion (prefixed by 066h), some assemblers compile it as two word-size PUSH instructions.

Noticing that many programs include many sequences of PUSH register or POP register instructions like

PUSH AX

PUSH BX

PUSH CX

....

in 80286 Intel added shortcut instructions

PUSHA (PUSH All)

POPA (POP All)

PUSHA pushes AX,BX,CX,DX,SI and DI; POPA pops them all back. PUSHA and POPA are one-byte instructions, so use of them is always more efficient space-wise. Time-wise, they are roughtly equivalent to three sin- gle register PUSHes or POPs, thus replacing a sequence of three or more PUSH(POP) instructions by a single PUSHA(POPA) is warranted, but only two is not. in 80386 Intel added 32-bit support to PUSHes and POPs:

164

1. arguments of PUSH and POP can be doubleword (32bit) registers and memory locations

2. new instructions PUSHFD, POPFD push and pop the EFLAGS register.

3. new instructions PUSHAD, POPAD push and pop 32-bit regs: EAX, EBX, ECX, EDX, ESI, EDI.

7.1.1 Encoding of PUSH/POP instructions

One byte opcodes are also used for segment register commands:

PUSH ES ; 06h

POP ES ; 07h

PUSH CS ; 0Eh

;

PUSH SS ; 16h

POP SS ; 17h

PUSH DS ; 1Eh

POP DS ; 1Fh

For the flags register

PUSHF ; 09Ch

POPF ; 09Dh

notice the absense of POP CS : this instruction actually does not exist, and putting it into the source code would cause a compilation error. The reason is that resetting the code segment register without simulteniously chang- ing the instruction pointer will almost always lead to a non-working code. The byte opcode OFh where this instruction would have been if it existed was undefined in 8086 and later used as a prefix for a large set of new instructions.

The forms of PUSH and POP for memory arguments are less used and therefore are implemented less efficiently: they use multibyte sequences with opcodes 8Fh for PUSH and 0FFh for POP.

PUSH immediate encoding uses opcode 68h (followed by a word con- stant) or 6Ah (followed by a signed byte constant, extended to a word).

165

7.2 Stack implementation

We now turn our attention to the actual stack implementation: what hap- pens when we push and pop data?

The stack is controlled by two registers: SS indicates the segment where the stack is; SP points to the end of the stock. (We avoid using the usual word “top” for the reasons seen below). The stack segment may be a sepa- rate segment or be the same as other program segments; in a .com program all segment registers are the same and therefore stack is physically the same segment as code and data.

When an item is being pushed on the stack, SP is decremented by the item’s size (2 – for 16-bit arguments or 4 – for 32-bit) and the data is copied to the memory location SS:[SP]. When an item is popped from the stack, data from the memory location SS:[SP] is copied to the argument and SP is incremented by the argument size.

Notice that unlike the usual stack data structure the stack grows down, not up. This does not affect the functionality and makes it more convenient to check for stack overflow. There is, however, a problem with terminology: if in a normal stack the last element resides on the “top” of the stock, here it would reside in the “bottom”; to avoid confusion we shall use the term end of stack instead of “top”.

7.2.1 Stack Balancing

We now look more carefully at the meaning of “stack balancing” men- tioned in the previous section. While in simple examples balancing may mean merely matching push’es and pop’s, its true definition is preserving the value of SP. an decrease in SP corresponds to bytes being pushed; an increase in SP corresponds to bytes being popped; both amounts should be the same and thus the value of SP should not change. This does not necesserily mean equal number of PUSHes and POPs.

Some examples:

PUSH AX

PUSH BX

POP CX

POP DX

166

the stack is balanced; there is nothing wrong in popping information to different registers if this is what the logic of the program dictates.

PUSH AX

POP EAX

the stack is not balanced: 2 bytes were pushed and then 4 bytes are popped off.

PUSH ESI

POP AX

POP DX

the stack is balanced (despite having one PUSH and two POPs). This snip- pet actually breaks the 32-bit integer in ESI into its low and high 16-bit parts.

PUSHA

POPA

the stack is balanced.

PUSHA

POP DI

POP SI

POP DX

POP CX

POP BX

POP AX

the stack is balanced.

167

7.2.2 Working with Stack Pointer

We shall start with an example, as usual

PUSH AX

....

POP AX

In the snippet above, assume that we do not really need the restored value of AX, it will never be used. We cannot neglect removing it from the stack, this would unbalance the stack, but we can remove it merely by adjusting the stack pointer:

PUSH AX

....

ADD SP,2

the stack is balanced, despite having a PUSH and no POP’s at all. More than a single PUSH can be balanced by an adjustment of SP:

PUSH AX

PUSH EDX

PUSH ECX

....

ADD SP,10

and more complicated scenarios are possible :

MOV BP,SP ; save SP to BP

...

MOV CX,n

L: PUSH CX

LOOP L ; n pushes

...

MOV SP,BP ; undo them all at once, restoring SP

Notice that while usually one would Save/Restore on the stack, it is not going to work in the last example!

168

The flip side of SP controlling the stack is that while SP supports the same arithmetic instructions as other general purpose registers, performing them on SP will almost always lead to a disasterous unbalancing of the stack. For example, XOR SP,0123h is a legal instruction, but it would not be trivial to find a situation where executing this instruction would not result in a crash. The three most common operations with SP, and still not universally save are

ADD SP,n ; remove n bytes from the stack

MOV SP,prev ; set SP to previous safe value

SUB SP,n ; allocate n bytes on the stack

The last operation is the most interesting. Subtrackting 2 bytes from SP is equivalent to pushing 2 undefined bytes on the stack, this effectively allocates a 2-byte variable. Subtracking n likewise allocates |n| bytes. This is a very cheap way to get dynamic memory, albeit limited by the stack size and thus not suitable for allocating large blocks – that would require a heap manager and/or using OS calls for allocation.

A typical allocation/deallocation sequence is

SUB SP,n ; allocate n bytes on the stack

MOV reg,SP ; use reg (often BP) to access the dynamic blocks

.....

; any balanced code

.....

ADD SP,n ; free

7.3 Procedures

The most important application of stack is the implementation of proce- dures1. Before looking at Intel specifics, we shall look at generic procedure implementation on any hardware.

We start by looking a simple procedure : no arguments, no return value2, no local variables and scopes. All of these extra features can be

1cf.function in C/C++ 2cf. void in C/C++

169

added later, for now we only care about implementing a call. Calling a procedure is nothing but transferring control to it (think goto

or JUMP) with one extra feature: the ability to return to the place in the code from where the call was made. Thus,a procedure implementation can be accomplished by two instructions:

CALL p == **save** IP ; JMP p

RET == **restore** IP

Just like before, specifics of implemenations of **save** and **restore** would depend on the hardware. If it has has stack, it would be most reason- able to use it, and in fact this is how instructions CALL and |RET| work on Intel processors: **save** is PUSH and **restore** is POP. If the hardware does not provide a stack, it becomes the responsibility of the programmer to implement a software stack and software equivalents of PUSH and POP; without having some kind of stack, nested procedure calls simply cannot be implemented. Naturally, the software approach actually used on some computers (IBM mainframe) would always be less efficient than a hardware stack as well as complicate programming.

On Intel processor the two key instructions are exactly as described:

CALL p (== PUSH IP, JMP p)

RET (== POP IP)

notice that the value of IP pushed by CALL corresponds to the address of the next instruction after CALL, otherwise an infinite loop of calls may result.

Simple arguments and return values can be passed on registers, this is actually the most efficient way in assembler.

With this in mind, let us implement the abs function as an assembler procedure. We will use the accumulator AX for both the input and output value:

170

abs: ; compute ABS value of AX, return in AX

OR AX,AX

JNS done

NEG AX

done:

RET

....

MOV AX,-3

CALL abs

; return here, with AX being +3

MOV AX,5

CALL abs

; return here, with AX being 5

No additional syntax is required but we will consider a later alternative syntax which makes the procedure more “structured-looking”.

One more simple example:

min: ; compute min of numbers in AX and BX

CMP AX,BX

JL L

XCHG AX,BX

L: RET ; return in AX

notice that this function considers numbers as signed, an unsigned version would use JB rather than JL. Further, this function also computes the max: this is the value returned in BX. Multiple return values are fine in assembler procedures, but such functions cannot be called from high-level language : there is no syntax in, for example, C/C++ to express this!

Two important rules apply to procedures: (1) one must never walk into a procedure and (2) procedure should balance the stack. (Exceptions to these rules are possible!)

Let us illustrate:

MOV DX,SI

ADD DX,DI

171

abs: ; compute ABS value of AX, return in AX

OR AX,AX

JNS done

NEG AX

done:

RET

In the example above we get to execute the procedure by walking into it, not by calling. It would execute (it does not matter that AX is not defined) but where would it return? We never pushed the return address on the stack! RET would still work, taking an undefined value from the stack and sending us into a random location in memory where before long we will encounter illegal instructions of some kind. A crash.

(As we will see later, walking into procedures may be a valid idea in some cases – specifically when we know what is on the stack. In most cases, we don’t.)

The unbalanced stack would lead to the same type of troubles. Con- sider:

PUSH CX

PUSH BX

POP BX

RET

....

CALL p

When RET is executed, the return address is not at the end of the stack, instead the value of CX is. This value would define where we will end up after the return, and most likely it is not a place where code execution would succeed.

Alternatively, consider the case when we POP more than PUSH.

PUSH CX

POP CX

172

POP BX

RET

....

CALL p

Here we removed the return address from the stack. RET would take what- ever value is at the end of the stack, and again we are not likely to survive this.

To summarize: it is your responsibility to ensure that at the time of RET being executed, it pops the return address and not something else.

7.3.1 Structured syntax

Most assembler programs use alternative syntax, based on the same CALL/RET ideas. In our previous examples we wrote:

.....

RET

CALL p

This indicates where the procedure begins (label p:), but not where it physically ends, there may be more code after the RET and there may be multiple RET statements inside a body of the procedure. We can alterna- tively write

p proc

.....

RET

p endp

CALL p

173

where proc and endp are assembler keywords which must match: every proc should be followed eventually by a endp that has the same label. Both types of syntax produce the same code. In addition to endp clearly showing where the code ends, there are other advantages like declarations of local labels and variables within proc–endp blocks. while in most cases one would prefer to use proc–endp syntax for more structured code, in some cases this does not work – namely, when proce- dures overlap. This scenario is possible in assembler, an example will be supplied shortly (see ¶).

7.3.2 Printing a number.

We will provide more interesting examples of procedures now. In high level languages, printing a number is usually done by a call to

supplied library. In assembler this task, like many others, is left to the programmer (who can, of course, include it in a library he uses). The procedures shown below rely on a procedure printchar which can be im- plemented easily in different ways, depending on what kind of printing we want! The choices include

• print to the video memory, via segment 0B800h at a specified screen location

• print at cursor position via int 10h

• print to console using int 21h

• print to a file

• print to a string

Out procedure for printing a number can do all of these, depending on the choice of the printchar.

For the option of writing to the console, printchar can be coded as

printchar: ; argument AL=char

PUSH DX

MOV AH,6

MOV DL,AL

INT 21h

174

POP DX

RET

For the option of writing to the video memory, printchar can be coded as

printchar: ; argument AL=char, DI=current screen position

; ES assumed to be 0B800h

MOV ES:[DI],AL

ADD DI,2

RET

A number of other alternatives can be written, including “write-to-string” and “write-to-file”.

Our printint procedure would use argument in AX, treated as an un- signed short integer.

We will split the number into digits and print them one by one. To split the number we will use repeated division by 10; on each step the remainder would be a digit to be printed, and the quotent would be split further. We will stop when the quotent becomes zero. Notice that we do not know how many digits the number contains, and also that we cannot simply print the digits as we obtain them: this would lead to the number written backward. Therefore, we will store the obtained digits and once we have them all, we will print them in the reversed order. In a high level language store would be most likely store in an array; in assembler we will use more convenient stack.

printnum: ; argument: AX

SUB CX,CX ; counter of digits

MOV BX,10 ; print in base 10

l1: SUB DX,DX ; extend # to DX:AX

IDIV BX ; quotent=AX, remainder=DX

PUSH DX ; save on stack

INC CX

OR AX,AX ; are we done?

JNZ l1 ; if remained!=0 continue

l2: POP AX ; restore a digit

175

ADD AX,’0’; convert to ascii char

CALL printchar

LOOP l2

RET

This is the entire procedure. Make sure you understand this code fully before continuing!

If we want to print a signed number, we will have to slightly modify the above code. The idea is to check if the number is negative; if it is, we will print a minus and negate the number. We then proceed with printing the number as unsigned!

printsignedint: ;argument: AX

OR AX,AX

JNS l3

NEG AX

XCHG AX,BX

MOV AL,’-’

CALL printchar

XCHG AX,BX

l3: CALL printnum

RET

Another possible improvement would be give printnum an extra param- eter to specify how to print the number : decimal, binary, hex, octal... in fact, what about simply making BX (see the code of printnum) a parameter indicating the base? Then BX==10 will print decimal, BX==2 will print bi- nary, BX==3 will print in base 3 (not that it is easy to find a reason why you would want this!), etc.

Exercise: which bases are supported?

Exercise: what needs to be fixed for hex to work?

176

7.3.3 Printing a number in hex

We will provide a different program for printing a hexidecimal number. The printint routine in the previous section always prints a number using as many digits as needed, so 000Ah will be printed as one digit, whereas 1234h will be printed as four. For hex dumps and other application we would prefer to print a byte value using two digits (padding with zeroes if needed), word value using four digits (padding with zeroes if needed), and a doubleword value using eight digits (padding with zeroes if needed) – this way the output would align nicer.

While it is possible (how?) to add this option to printnum, we will show a totally different code to demonstrate a couple of assembler tricks.

We will use an array of hex digits

hexd DB ’0123456789ABCDEF’

and first implement a procedure for printing a single hex digit

print_nibble: ; input AL -- must be 0..15

MOV BL,AL

SUB BH,BH

MOV AL,hexd[BX]

CALL printchar

RET

(This code can be written better using the XLAT instruction – you may want to take a look).

Out next task is to print a byte, to do this we will split it into two digits.

print_byte: ; input AL

PUSH AX

SHR AL,4 ; AL is the low nibble now

CALL print_nibble

POP AX

AND AL,15 ; AL is the high nibble now

CALL print_nibble

RET

We can optimize the code by overlapping (!) both procedures as follows:

177

print_byte: ; input AL

PUSH AX

SHR AL,4 ; AL is the low nibble now

CALL print_nibble

POP AX

AND AL,15 ; AL is the high nibble now

print_nibble: ; input AL -- must be 0..15

MOV BL,AL

SUB BH,BH

MOV AL,hexd[BX]

CALL printchar

RET

Notice that this is a rare case when we can walk into a procedure, with- out a call!–and this is of course faster. Further, this is one case where using proc and endp directives would not be a good idea.

The same trick will be used now to upgrade the code to printing a word:

print_word: ; input AX

PUSH AX

MOV AL,AH

CALL print_byte

POP AX

print_byte: ; input AL

PUSH AX

SHR AL,4 ; AL is the low nibble now

CALL print_nibble

POP AX

AND AL,15 ; AL is the high nibble now

print_nibble: ; input AL -- must be 0..15

MOV BL,AL

SUB BH,BH

MOV AL,hexd[BX]

CALL printchar

RET

We now have tree procedures overlapping each other, each having its own entry point.

178

Exercise: add printdword to the code

NOTE: in high level languages it is easy to have a procedure with mul- tiple exit points: for example, in C/C++ we can have several return state- ments in one procedure. But is it possible to have multiple entry points like we see in the above code? For most languages the answer is no, but there are exceptions that provide syntax for this (PL/I).

179

Chapter 8

Interrupts

Assembler instructions do not define a complete language suitable for pro- gramming: some tasks require interraction with the operating system. These include console I/O (with the exception of the output via screen memory), work with files, memory allocation, process management. Surely, one can implement these tasks in assembler, but then one would have to write huge amounts of code which will use data structures internal to the OS... effec- tively reinventing the wheel. To make it worse, any change to the hardware or the OS will require reworking this code, and under protected mode oper- ating systems user application will not have sufficient privelege to execute this code.

The solution is to call routines provided by the OS... but now there is a question to resolve: where are they located? How would we find them? Surely they are not going to be in fixed addresses, changes to the OS or the system configuration would shift their position in memory.

The Intel solution to this is to have the OS store the addresses of system routines in a fixed place: the interrupt table. This allows an application to call provided routines by number, the address is computed by a simple lookup into the interrupt table.

Specifically, the interrupt table is an array of 256 addresses (each con- taining the offset and the segment parts of the address of the system rou- tine, usually called interrupt handler); it is located at the very begin- ning of the system memory (0000:0000) and occupies exactly one kilobyte (256 ∗ 4 = 1024).

180

INT n (≈ CALL inttable[n])

where n is an unsigned byte. INT is similar to CALL but pushes on the stack three values: offset (IP), segment (CS) and flags. (pushing segment is needed since the system routine is likely to be in a different segment from the caller, flags are saved to prevent interrupts from irreversibly modifying them, see below). Because of additional values pushed on the stack, returning from an inter- rupt can not be done with RET; instead one uses the

IRET (Interrupt RETurn)

instruction to come back, it pops all three values off the stack. Some of the important interrupt numbers are INT 20h (terminate), INT 10h (video), INT 16h (keyboard), and INT 21h. In this chapter we will go sys- tematically through many interrupts numbers, but before doing this, let us systematize the interrupts.

*More detailed descriptions of the interrupts mentioned in the next sec- tion will be given later).

8.1 Interrupt types

Far the most common usage of interrupts is via the INT instruction: this is indeed a call to the operating system to perform a particular service. We specify the interrupt number and the required parameters, given in registers – each interrupt has its own syntax. In the mainframe interrupt classification (there is no PC classification!) such an interrupt would be called an SVC, for SuperVisor Call.

Other types of interrupts occur without an explicit INT instruction. The hardware interrupt (perhaps better called “hardware fault”) oc-

curs if the hardware detects a malfunction – an example of this would be INT 2. No instruction explicit instruction INT 2 is needed in the code, the

181

routine corresponding to this interrupt would be called regardless of what is being executed, the moment a problem is detected.

The software interrupt (perhaps better called “software fault”) occurs if the program is trying to execute an instruction that cannot be executed. Examples include trying to execute an opcode that does not correspond to a valid instruction, or division by zero/overflow on division (INT 0).

The external interrupt is triggered by an event that is external to the processor that must be processed immediately. One example of such an event is pressing a key on the keyboard. Unless the key is processed or saved right away, it may be forgotten! Therefore an INT 9 is issued, and the called routine saves the key, together with the modifiers (status of the Shift, Control and Alt keys). One other example is the timer interrupt (INT 8).

The debug interrupt is intended for use in system software like debug- gers and profilers. Debug interrupts may occur because of explicit instruc- tions (INT 1, INT 3) or without them.

The mainframe classification also includes CPU-to-CPU interrurps: sig- nals from one processor to another. This is very similar to external inter- rupts.

Here is an example how an external interrupt may occur.

CMP AX,BX

; <<<=====

JL LAB

....

LAB:

Assume that a key on the keyboard is pressed when the CMP instruction was executed. Then the control will be switched to the INT 9 routine, we will execute JL only upon return. Naturally if the INT 9 routine alters flags our code will not work right (and this is why flags are saved automatically).

The same type of problem may occur with other non-SVC interrupts. With SVC, on the other hand, this is never an issue – and in fact some

of the SVC interrupts use the carry flag to provide the return.

182

Different interrupts were added by different vendors at different times. We distinguish the following ranges:

• 0..7: these interrupt numbers were assigned by Intel; these interrupts interplay with other instructions and hardware.

• 5..1Fh,40h: these interrupt numbers were assigned by IBM at the time original IBM PC was designed1.

• 20h..33h: these interrupt numbers were assigned by Microsoft, the routines point to the OS (DOS and Windows).

Other interrupt numbers may be uses by software apps and viruses, such interrupts may work only on some systems and not others.

Notice that this classification is mostly “chronological”: first Intel de- signed the CPU (and reserved the first few numbers), then IBM built the computer (and reserved some more numbers), and finally Microsoft added the OS (and again reserved some more numbers). Numbers beyond 40h were never reserved.

The special case of 40h is because this interrupt number corresponds to low-level workings of the to hard disk controller, and there were no hard disks in the very first PC. Similarly 33h corresponds to the mouse – again a late arrival.

We now look at the interrupts in a systematic way.

8.2 INT 0

INT 0 is a software interrupt, triggered by a division overflow. What hap- pens next depends on the interrupt handler installed, either in the OS or in the program library:

• Under original DOS, any program that does not install its own han- dler, the control is given to the BIOS handler. It switches the monitor to 40 column mode, displays DIVIDE BY ZERO and hangs the com- puter. High level language programs typically would install a new handler automatically, assembler programs requite the user to supply the handler.

1notice that Int 5 occurs both as an Intel and as IBM interrupt – this will be explained below

183

• Under Windows, any program that does not install its own handler, Windows display Blue Screen of Death, usually allowing the applica- tion to be terminated, sometimes resulting in a system crash.

• Typical runtime library handler in a high level language would display a message like Runtime error ... and terminate.

While INT 0 is not particularly important (do not divide by zero and you don’t need to know it), we use it to illustrate some general points.

Here is how an assembler program can install a handler:

Int0Handler:

; say something

MOV AX,4CFFh

INT 21h ; (terminate with code 255)

IRET ; unneeded in this particular handler

....

SUB AX,AX

MOV ES,AX

MOV word ptr ES:[0*4+0],offset Int0Handler

MOV word ptr ES:[0*4+2],CS

.....

The code above is simplified, in most cases the following corrections are needed:

1. It is best to save the old interrupt handler pointer and restore it upon program exit.

2. Fully replacing interrupt handler is possible only if one does not need functionality offered by the original handler. With INT 0 it is indeed fine, but with INT 21h (DOS) it is not: the moment the vector is replaced the entire file system and much else becomes unavailable to the application. Same applies to most other interrupts that do something generally important. In cases like this one should ensure that the new interrupt handler calls the original one, you can only add code before or after the call to provide more functionality (See example in the section on the antivirus software – TBW)

184

3. Finally, under protected mode operating systems it is not possible to change the interrupt table by simply storing data into it (as we do above). Instead one uses functions 25h and 35h of INT 21h to accomplist this change in a controlled fashion. (going through these methods is safer since then the OS knows about alterations to the interrupt table and can undo them when the program is exited or task switches to another program.)

INT 0 would not normally appear in a program source – we expect this interrupt to be triggered by IDIV or DIV – what do you think would happen if one writes

INT 0

? – well, the processor would think you actually divided by 0! – and you would get exactly the same response as if you had.

8.3 INT 1

INT 1 is a debug interrupt that implements singlestep: ability to execute exactly one machine instruction. (We will not describe it in this book)

8.4 INT 2

INT 2 is a hardware interrupt that occurs if the processor detects and in- valid change of memory – indicating a memory failure. (We will not de- scribe it in this book)

8.5 INT 3

INT 3 is a debug interrupt that implements breakpoint: ability to stop ex- ecution when a specific condition (for example IP reaching the address of a set breakpoint, or value of a variable changing) occurs. (We will not describe it in this book)

185

8.6 INT 4

INT 4 is a software interrupt used for detection of overflows (alternative name : INTO.

8.7 INT 5

INT 5 serves two entirely different purposes; Intel implemented it as a soft- ware interrupt used for detection of out-of-range conditions, whereas IBM used the same number to implement an external interrupt.

(This will be expanded)

8.8 INT 8

8.9 INT 9

8.10 INT 10h

8.11 INT 13h

8.12 INT 16h

INT 16h provides a series of functions for reading the keybord. The most useful is function 0:

• AH= 00h ; wait for a key

• =>AL ASCII value of the key pressed

• =>AH AH scan code of the key pressed

Some of the keyboard keys do not have a corresponding ASCII character value, for them the returned ASCII value is 0. Such keys include cursor control keys and Function keys, as well as “normal” keys in combination with Control or Alt keys. To distingush among them one looks at the scan codes (low bytes in the table below):

186

Key Normal Shifted w/Ctrl w/Alt

A 1E61 1E41 1E01 1E00

B 3062 3042 3002 3000

C 2E63 2E42 2E03 2E00

D 2064 2044 2004 2000

E 1265 1245 1205 1200

F 2166 2146 2106 2100

G 2267 2247 2207 2200

H 2368 2348 2308 2300

I 1769 1749 1709 1700

J 246A 244A 240A 2400

K 256B 254B 250B 2500

L 266C 264C 260C 2600

M 326D 324D 320D 3200

N 316E 314E 310E 3100

O 186F 184F 180F 1800

P 1970 1950 1910 1900

Q 1071 1051 1011 1000

R 1372 1352 1312 1300

S 1F73 1F53 1F13 1F00

T 1474 1454 1414 1400

U 1675 1655 1615 1600

V 2F76 2F56 2F16 2F00

W 1177 1157 1117 1100

X 2D78 2D58 2D18 2D00

Y 1579 1559 1519 1500

Z 2C7A 2C5A 2C1A 2C00

Key Normal Shifted w/Ctrl w/Alt

1 0231 0221 7800

2 0332 0340 0300 7900

3 0433 0423 7A00

4 0534 0524 7B00

5 0635 0625 7C00

6 0736 075E 071E 7D00

7 0837 0826 7E00

187

8 0938 092A 7F00

9 0A39 0A28 8000

0 0B30 0B29 8100

Key Normal Shifted w/Ctrl w/Alt

- 0C2D 0C5F 0C1F 8200

= 0D3D 0D2B 8300

[ 1A5B 1A7B 1A1B 1A00

] 1B5D 1B7D 1B1D 1B00

; 273B 273A 2700

’ 2827 2822

‘ 2960 297E

\ 2B5C 2B7C 2B1C 2600 (== Alt-L)

, 332C 333C

. 342E 343E

/ 352F 353F

Key Normal Shifted w/Ctrl w/Alt

F1 3B00 5400 5E00 6800

F2 3C00 5500 5F00 6900

F3 3D00 5600 6000 6A00

F4 3E00 5700 6100 6B00

F5 3F00 5800 6200 6C00

F6 4000 5900 6300 6D00

F7 4100 5A00 6400 6E00

F8 4200 5B00 6500 6F00

F9 4300 5C00 6600 7000

F10 4400 5D00 6700 7100

F11 8500 8700 8900 8B00

F12 8600 8800 8A00 8C00

Key Normal Shifted w/Ctrl w/Alt

BackSpace 0E08 0E08 0E7F 0E00

Del 5300 532E 9300 A300

Down Arrow 5000 5032 9100 A000

188

End 4F00 4F31 7500 9F00

Enter 1C0D 1C0D 1C0A A600

Esc 011B 011B 011B 0100

Home 4700 4737 7700 9700

Ins 5200 5230 9200 A200

Keypad 5 4C35 8F00

Keypad * 372A 9600 3700

Keypad - 4A2D 4A2D 8E00 4A00

Keypad + 4E2B 4E2B 4E00

Keypad / 352F 352F 9500 A400

Left Arrow 4B00 4B34 7300 9B00

PgDn 5100 5133 7600 A100

PgUp 4900 4939 8400 9900

PrtSc.... 7200

Right Arrow 4D00 4D36 7400 9D00

SpaceBar 3920 3920 3920 3920

Tab 0F09 0F00 9400 A500

Up Arrow 4800 4838 8D00 9800

8.13 INT 20h

Terminates the program. There are no parameters for this function call. Alternative ways of terminating the program are via INT 21h, function

calls 0 (no other parameters) and 4Ch (AL is the return code).

8.14 INT 21h

In terms of the number and variety of functions, INT 21h is the definite leader. Nearly the entire OS support is done via INT 21h. Rather than listing the entire set, we will concentrate only on one group of functions, especially important for applications: File operations.

Before getting into the specifics, let us outline the general idea of how file systems are implemented in programming languages.

There are two different entities called files:

• Physical file stored usually on a secondary devide (drive)

189

• Logical file variable defined in the programming language.

Logical file variable is defined in different data types, depending on the language and/or system:

• Integer number : FORTRAN, BASIC, Unix IO in C, DOS

• Structure : CP/M

• Pointer to a structure: ANSI C

• Stream: C++

Since programming code rarely examines file variables, only passes them between function calls, the actual format is not overly important.

We note that Microsoft uses the term handle to refer to file variables. Handle is actually just an integer, a different word is intended to underscore that one should not look at the value too closely – the number may mean nothing to the programmer, only the OS can make sense of it. (In reality, in many cases the number is informative.)

Work with file begins by establishing a correspondence between these two entities. Physical file is identified by name (string), and tied to a vari- able. We typically call this Opening a file. This makes the subsequent calls to file operations more efficient: whereas opening a file requires directory (disk) search and is therefore costly, subsequent calls use already available information about the physical file. Furthermore, more than a single file variable can correspond to one physical file, allowing two different types of work with a file to be done simulteniously.

Generally this is the only time when the file name appears in the syntax, all the subsequent calls refer to the file variable.

Some systems separate opening a file (which requires to the file to exist and does not create a new physical file) from Creating a file (which always creates new physical file) leading to having two functions, some lump both operations together.

During the operations with the file (read, write, seek) position file ma- chinery maintains the current offset in the file (read/write pointer). Usually it is initialized to the beginning of the file and is automatically incremented by the number of files read or written. When the application reads or writes

190

entire file from the beginning to the end (sequential access), automatic up- dating of the read/write pointer makes it unnecessary to track it. For appli- cations that want to read or write particular parts of the file (direct access) it becomes necessary to be able able to reposition the read/write pointer (direct access files). This is also a needed feature to determine the size of the file before reading it – even if reading itself is to be done sequentially.

A minimal set of functions to support sequential working with files would include opening (and/or creating) a file, reading, writing and clos- ing. To support direct access files one would also need a seek function.

(This is the minimal list; typical file systems, including the ones in INT 21h, have many other functions, some of them will be mentioned after we master the important ones)

In INT 21h there are two different sets of functions supporting these features: the original set that uses file variables in the form of structures was inhereted from CP/M; the newer set, with file variables being integer number, was modeled after UNIX.

We will describe the second set: the usage of the original CP/M function calls was shortlived and UNIX-style calls are much easier to use.

This set consists of the following functions, listed together with their ANSI C (header stdio.h) and UNIX C (header unix.h) analogs:

3Ch Create file (ansi: fopen, unix: creat).

3Dh Open file (ansi: fopen, unix: open).

3Eh Close file (ansi: fclose, unix: close).

3Fh Read file (ansi: fread, unix: read).

40h Write file (ansi: fwrite, unix: write).

42h Seek file (ansi: fseek and ftell, unix: lseek and ltell).

(Notice that fopen does both opening and creating, while open and creat are different functions; INT 21h follows the UNIX way.)

While every function of INT 21h has its own syntax, let us begin by outlining the general scheme of register usage. For input parameters:

AH : function number.

191

BX : file handle (if applicable).

CX : count (if applicable).

DX : pointer to data (if applicable).

the output is given by the carry flag and the AX register. If carry is set, the call failed, and the AX contains the error code; if no carry, then AX contains the result. Some of the common error codes are

• 0 invalid function number (ah value)

• 1 invalid handle (bx value)

• 2 file not found (on open, for example)

• 3 path not found (on open or create, for example)

• 5 access denied (attempt to open a read only file for writing, for ex- ample)

• 8 out of memory (rarely with file operations)

8.14.1 Copying a file

First five of the file functions will be demonstrated in a single example, a complete program that copies a file. To simplify the problem we will provide the filenames hardwired in the program, in a real application they should be obtained from the command line – this will be demonstrated later.

Here are the variables used by our program:

inname db "input.dat",0 ;name of the input file

outname db "output.dat",0 ;name of the output file

inh dw ? ;input file handle

outh dw ? ;output file handle

b dw N dup (?) ;data buffer

We begin by opening the input file:

192

MOV AX,3D00h ;open for read

MOV DX,offset inname

INT 21h

JC error

MOV inh,AX ;save the handle

The error handler should print an error message and terminate the pro- gram – there is no point in continuing if the input file cannot be opened. A more sophisticated program would examine the error code (in AX) and issue correct diagnostics, in our application this is not necessary and we will use the same error handler for all possible errors; the handler can be simply this:

errmsg DB "ERROR!",13,10,"$"

......

error:

MOV AH,09

MOV DX,offset errmsg

INT 21h

MOV AX,4CFFh

INT 21h ;terminate

We next attempt to create the output file:

MOV AH,3Ch ;create file

SUB CX,CX ;normal file

MOV DX,offset outname

INT 21h

JC error

MOV outh,AX ;save the handle

Before getting to the actual work of reading and writing the data, let us show the end of the program:

MOV AH,3Eh ;close input

MOV BX,inh

INT 21h

JC error

MOV AH,3Eh ;close output

193

MOV BX,outh

INT 21h

JC error

Is this part really necessary? For the input file, only to return the re- source to the OS – this is more about good style of programming. For the output file : absolutely. If an output file is not closed, part of the data we wrote to it may not be actually saved – write’s are delayed by the OS to optimize the performance and the final syncronization only occurs during closing.

The errors in these calls should not be happening – the handles we supply came from the OS and should be valid. But in a larger program variables inh or outh may be accidentally overwritten, we may as well have diagnostics for this kind of problem.

If nothing else added to our program, no data would be copied... but running the program should still result in a file output.dat being created in the current directory, this file should have size zero.

We now proceed with with read and write operations. Firstly, a simpli- fied version:

MOV AH,3Fh ;read

MOV BX,inh

MOV CX,1

MOV DX,offset b

INT 21h

JC error

MOV AH,40h ;write

MOV BX,outh

MOV CX,1

MOV DX,offset b

INT 21h

JC error

Adding this code would copy exactly one byte from the input to output. Since our aim is to copy the entire file, we either need to know the number of bytes in the input or be able to detect the end of file.

To determine the size of the file, one can use the seek function (ah=42h). An example will follow, but in the copy program it is the other approach

194

that is easier2. Recall that the read function, when successful, would re- turn the number of bytes read. Since we only ask for 1 byte, it can possibly return only the values 1 (byte was read) or 0 (it could not read one byte == end-of-file condition).

This allows us to construct a loop:

again: MOV AH,3Fh ;read

MOV BX,inh

MOV CX,1

MOV DX,offset b

INT 21h

JC error

OR AX,AX

JZ done

MOV AH,40h ;write

MOV BX,outh

MOV CX,1

MOV DX,offset b

INT 21h

JC error

JMP again

done:

We now have a working copy program: it will correctly copy a file of any size. But running this program will show abysmal performance – it will be visibly much slower than the OS’ command copy.

The source of the problem is reading/writing one byte at a time: ac- cessing external device is slow and it is best to make fewer accessing to it, having each access do more work. In our case this simply means: copy more than one byte at a time. Here is an updated version:

again: MOV AH,3Fh ;read

MOV BX,inh

MOV CX,N

MOV DX,offset b

INT 21h

JC error

2and avoids a nasty situation when the file size cannot be expressed as a 32-bit integer!

195

OR AX,AX

JZ done

MOV CX,AX

MOV AH,40h ;write

MOV BX,outh

MOV DX,offset b

INT 21h

JC error

JMP again

done:

Here N signifies the number of bytes to read on each iteration (==the buffer size); notice that we always write the number of bytes we have read.

Assuming N is 100 and the input file had 323 bytes, the first time we will read 100 bytes and write 100 bytes, same will happen on the second and third iterations, but the fourth time we will read only 23 bytes and write them out. The fifth iteration will result in no bytes read and exit from the loop.

What kind of value should we choose for N for the optimal perfor- mance?

Two considerations here: firstly, powers of 2 are most efficient. sec- ondly, while generally larger values of N will be more efficient than smaller values, increasing N beyond about 4k tends not to improve the performance noticeably while consumes more memory for the buffer. The exact value of optimal N would depend on the hardware, but 4K (4096) is generally a good choice.

8.14.2 INT 21h,function 09h Write string to console

• AH= 09h

• DS:DX= string to write, terminated with the $ character.

8.14.3 INT 21h,function 3Ch Create File

• AH= 3Ch

• CX= file attributes (0 for normal file)

196

• DS:DX= points to the file name (0-terminated string, it may include drive and path)

• => AX handle

Common error codes: 3,5

8.14.4 INT 21h,function 3Dh Open File

• AH= 3Dh

• AL= (0 for read, 1 for write, 2 for both)

• DS:DX= points to the file name (0-terminated string, it may include drive and path)

• => AX handle

This function sets read/write pointer to the beginning of the file. Common error codes: 2,3,5

8.14.5 INT 21h,function 3Eh Close File

• AH= 3Eh

• BX= handle

This function sets read/write pointer to the beginning of the file being created.

Common error codes: 1

8.14.6 INT 21h,function 3Fh Read File

• AH= 3Fh

• BX= handle

• CX= number of bytes to read

• DS:DX= offset of buffer to receive file data

197

• =>AX number of bytes actually read

Read operation advances read/write pointer by the number of bytes read.

Common error codes: 1

8.14.7 INT 21h,function 40h Write File

• AH= 40h

• BX= handle

• CX= number of bytes to write

• DS:DX= offset of buffer to write

• =>AX number of bytes actually written

Write operation advances read/write pointer by the number of bytes written.

Common error codes: 1

8.14.8 INT 21h,function 42h Seek file

• AH= 42h

• AL= Seek method: 0==from beginning, 1==from current, 2==from end

• BX= handle

• CX:DX= offset to move to (relative to the method!)

• =>DX:AX new offset in the file (absolute, that is relative to the begin- ning of the file)

Common error codes: 1 Examples: Move the r/w pointer to location (n*0x10000+m)

198

MOV AX,4200h

MOV BX,handle

MOV CX,n

MOV DX,m

INT 21h

Rewind:

MOV AX,4200h

MOV BX,handle

MOV CX,0

MOV DX,0

INT 21h

Skip m bytes

MOV AX,4201h

MOV BX,handle

MOV CX,0

MOV DX,m

INT 21h

“Unget” one byte

MOV AX,4201h

MOV BX,handle

MOV CX,-1

MOV DX,-1

INT 21h

Get current position (in DX:AX).

MOV AX,4201h

MOV BX,handle

MOV CX,0

MOV DX,0

INT 21h

Move to the end of the file–this can be used to append or find out the file size.

199

MOV AX,4202h

MOV BX,handle

MOV CX,0

MOV DX,0

INT 21h

8.15 INT 33h

200

Chapter 9

Project #2

Due Date: 05/17/2021, 8pm. Goal: implement Text-to-binary and Binary-to-text conversions for the

purpose of sending email attachments, independent from the MIME stan- dard. (Test that this actually works!)

Basic idea: the protocol used by email is suitable only for text, transmission of bi-

nary files via email is only possible by converting binary files to text and embedding them within email body.

Different conversion algorithms are possible. Implement Base64-type algorithm (for a “B” grade), or that and Ascii-85 type algorithm (for an “A” grade).

Example. The smallest program that can be made in assembler is built out of a

single instruction:

int 20h

Compiled into a .com file (say, try.com) this will be a two-byte binary file, consisting of bytes 0CDh and 020h.

We will convert it to text. Notice: in all cases the text file is going to be larger than the original,

the overhead is smaller if the used alphabet is larger.

Decimal 205,32. Alphabet size=11. File size is about 3.5x the size of the original.

201

Hex CD20. Alphabet size=16. File size is about 2x the size of the original.

Base64 [:((. Alphabet size=64 File size is about 1.33x the size of the origi- nal. See the algorithm below

Ascii85 ???. Alphabet size=85. File size is about 1.25x the size of the origi- nal.

The idea of the Base64-type algorithm: Every 3 bytes of the input (any bytes) are converted into 4 bytes of the output (text chars only). If we do not have 3 bytes, we pad. In our example:

1 1 0 0 1 1 0 1 0 0 1 0 0 0 0 0 x x x x x x x x

We rearrange these 3 8-bit bytes into 4 6-bit “bytes”

1 1 0 0 1 1 0 1 0 0 1 0 0 0 0 0 x x x x x x x x

Since bytes are really 8 bit, we add padding:

- - 1 1 0 0 1 1 - - 0 1 0 0 1 0 - - 0 0 0 0 x x - - x x x x x x

The new bytes contain values in the 0..63 range, adding 40 to them moves them to the ASCII range ( This is not the only formula possible, of course; f.e. one may prefer converting character set that includes all letters and digits.) In our example the four obtained values are 51, 18, 0, 0; after shifting by 40 they become 91,58,40,40 or [:((.

How does the email may look like? We need to separate the ascii (text) and converted-to ascii parts. We

need header and footer for the data. For example:

Here is some normal message text. Trying encodings:

$!ATTB file="try.com" size="2" encoding=10

202

205,32.

$!ATTE

$!ATTB file="try.com" size="2" encoding=16

CD20

$!ATTE

$!ATTB file="try.com" size="2" encoding=64

[:((

$!ATTE

(The author is not suggesting any particular syntax).

Refs: Base64. Base85.

Technical note: email body is subject to line length limits; newlines should be inserted during conversion to text to break the lines, during con- version back to binary they should be ignored.

The projects must be individual and not downloads of copy one finds on web.

203

https://en.wikipedia.org/wiki/Ascii85

https://en.wikipedia.org/wiki/Base64

Chapter 10

Sample code

10.1 Reading in an entire file

In this example we assume that the we want to read an entire file (its name in the inname variable) into buffer buff). The example is simplified: it cannot read more than the buffer size (10k). More general code would dynamically allocate buff using for example INT 21h function 48h.

inname DB "C:\work\test.dat",0

buff DB 10000 dup (?)

handle DW ?

; open the file

MOV DX,offset inname

MOV AX,3D00h ;open for read

INT 21h

JC error ; file not found

MOV handle,AX

SUB CX,CX

SUB DX,DX

MOV BX,handle

MOV AX,4202h

INT 21h ; position to the end

JC error ; this should not happen

204

OR DX,DX

JNZ error ; the file is too large

CMP AX,10000

JA error ; the file is too large

PUSH AX ; file size

SUB CX,CX

SUB DX,DX

MOV BX,handle

MOV AX,4200h

INT 21h ;rewind file

MOV AX,3Fh

MOV BX,handle

MOV DX,offset buff

POP CX

INT 21h ; read the entire file

MOV AX,3Eh

MOV BX,handle

INT 21h ; close file

10.2 Reading command line

Let us begin with an example: imagine an application was invoked with the command

C> mycopy a.dat b.dat

The text typed on the command line has two parts :

• mycopy is used by the operating system to locate an application (mycopy.exe, mycopy.com, . . . ) and invoke it1.

1Generally the application does not need to know how it was invoked, but this infor- mation can be obtained elsewhere.

205

• a.dat b.dat is used by the application, the operating system cannot know the format of the arguments and passes them to the application. We will refer to this part as the “command line” below. Notice that it begins right after the application name, with a space.

(The argument part) of the command line is passed to the application as one string in a special segment called PSP (Program Segment Prefix). The length of the command line is passed in the byte at the offset 80h of the PSP, the command line itself follows from the next byte (offset 81h) and is terminated with the carriage return character (0Dh).

In a .com application there is only one segment, therefore PSP is the same segment as the code, data, or stack. In an .exe PSP is a separate segment; initially DS points to it.

Since the command line arguments are given as a single string we need to parse the string. The following code illustrates how this can be done for the mycopy example; our goal is to extract the two filenames from the command line string. We assume that our application is a .com file, thus we would not have to worry about the segments.

inname db 128 dup (0)

outname db 128 dup (0)

......

MOV SI,81h

L1: MOV AL,[SI] ; loop to skip spaces

INC SI

CMP AL,0Dh

JZ error ; no arguments are given!

CMP AL,’ ’

JBE L1

MOV DI,offset inname

L2: MOV [DI],AL ; loop to extract inname

INC DI

MOV AL,[SI]

INC SI

CMP AL,’ ’

JAE L2

206

MOV byte ptr [DI],0

INC DI

The code above should be followed by two more loops : to skip spaces between the filenames and then to extract outname. These loops are nearly identical to the loops shown above and is left to the reader

Notice that the code above can be shortened with the string instruc- tions.

207

Chapter 11

Branch prediction

One the most important issues in assembler program performance is that of Branch prediction: the code that does not contain conditional branches executes considerably faster than the one that has. The earliest models of the x86 architecture did not employ branch prediction and this was not an issue, as the branch prediction got introduced and became more sophisti- cated, the importance of the issue grew.

The goal of cutting down on conditional jumps was addressed by both

• the programmers (by finding new ways to code – see the abs() func- tion example

• the manufacturer – Intel introduced new instructions that make cod- ing without conditional jumps possible in some selected special cases, SET<cc> and CMOV<cc> are notable cases.

11.1 The abs() function

Of all functions one can think of, abs() is about the simplest, it is therefore odd that one can find something interesting in its implementation

The standard code for the abs() function is shown here:

abs: ; assume EAX is the argument

OR EAX,EAX

JNS l

NEG EAX

l: RET

208

https://en.wikipedia.org/wiki/Branch_predictor

There are two inefficiencies here (and a bug – do you see what the bug is?). One inefficiency comes from the code being a function – an inline implementation would be better. The other – from the conditional jump (JNS). Is it possible to implement abs() without it?

Surprisingly, yes:

abs: ; assume EAX is the argument

CDQ

XOR EAX,EDX

SUB EAX,EDX

RET

Naturally, the code would be much faster, especially on modern proces- sors.1.

It is common to want to compute not abs(x) but rather abs(x-y). Can we use the same trick? For example:

MOV EAX,x

SUB EAX,y

CDQ

XOR EAX,EDX

SUB EAX,EDX

RET

The answer – oddly! – is a NO. We will leave it to the reader to realize the problem and find a counter-example.

1This trick was discovered in very early days and even used by some compilers – for example good ol’ Turbo/Borland pascal implemented abs() as an inline, via the 16-bit version of exactly this code

209

Chapter 12

String Instructions

The instuctions explained in this chapter are non-essential in terms of abil- ity to program: everything they do can be accomplished by other means. However, for they offer superior performance and programming conve- nience for some of the more common and important tasks.

The word string in the name of this group should not be taken too se- riously : yes, we will see some examples related to strings, but in general these are instructions that work with any blocks of data in memory, strings are just one special case.

12.1 MOVS: copying memory data

We begin by considering a common problem: copy a chunk of memory from one place to another. In C/C++ the problem would look like this:

char a[N],b[N]; //N is a constant

// copy b to a

for (int i=0; i<N; i++)

a[i]=b[i];

this is actually a bad way to code, we will come back to the C example later, but for now the task is to write an assembler equivalent.

The problem resstated in assembler is

210

a DB N dup(?)

b DB N dup(?)

; copy b to a

and we can solve it by the following loop:

P1:

MOV CX,N

SUB BX,BX

L1: MOV AL,b[BX]

MOV a[BX],AL

INC BX

LOOP L1

or, alternatively:

P2:

MOV CX,N

MOV SI,offset b

MOV DI,offset a

L1: MOV AL,[SI]

MOV [DI],AL

INC SI

INC DI

LOOP L1

Either code works, but there is a much faster program that will use new instruction

MOVSB (MOV String Byte)

While MOVSB does not have any arguments, it uses several registers im- plicitly. The logic of the instruction is

byte ES:[DI] = byte DS:[SI]

SI = SI +- 1

DI = DI +- 1

211

Whether SI and DI are incremented or decremented depends on the Direction Flag; increment is done if the flag is clear (DF==0), decrement if it is set (DF==1). We will assume for now that both index registers are incremented – this is the most common case.

NOTE: MOVSB is an exception to the general no-memory-to-memory rule (as are some other string instructions).

We can use this instruction to copy a single byte of memory (case N==1), instead of the usual

MOV AL,b

MOV a,AL

we can write

MOV SI,offset b

MOV DI,offset a

MOVSB

This is clearly not an improvement: we replaced a 2-instruction program with a 3-instruction one – and possibly even longer, if the DS or ES segments or the direction flag also need to be set. But for copying more than one byte the balance is different:

P3: MOV CX,N

MOV SI,offset b

MOV DI,offset a

L: MOVSB

LOOP L

In P3 the body of the loop is only one line, comparing to two in P1 or three in P3! this code therefore will run faster, about twice. The savings are coming from the index registers being advanced automatically – we do not need extra instructions to update them. While P3 is a faster program indeed, we will be able to speed it up – about twice! – two more times.

For the next version we will need another new instruction

212

REP (REPeat next instruction)

REP is a special form of loop: the body is exactly one instruction, the next one, to be executed CX times. Since out loop’s body is exactly one instruction long, we can use REP instead of LOOP:

P4: MOV CX,N

MOV SI,offset b

MOV DI,offset a

REP

MOVSB

This would run about twice faster (specifics depend on the processor), but why? LOOP is a conditional jump, and all conditional jumps break in- struction prefetching or parallel execution: the processor cannot predict whether a conditional jump going to be taken or not, thus it cannot opti- mize execution beyond the jump. But REP does not suffer from this prob- lem: the processor knows in advance to execute the next instruction a given number of times.

NOTE: Assembler syntax allows to write REP on the same line as the body of the loop instruction1. This, we can change the above example to its shorter equivalent:

P5: MOV CX,N

MOV SI,offset b

MOV DI,offset a

REP MOVSB

NOTE: While REP can be used with any assembler instruction (for ex- ample REP NOP or REP CALL ... is fine), it is only with string instructions that REP leads to useful code.

NOTE: The string instructions are mostly used as part of REP or similar loops, but we will see exceptions.

The next version will be again much faster than the previous, and again, we will use yet another instruction:

1yes, two opcodes in one line!—only here

213

MOVSW (MOVe String Word)

word ES:[DI] = word DS:[SI]

SI = SI +- 2

DI = DI +- 2

MOVSW moves two bytes at a time (and updates the index registers by 2, to point to the next item, just as we need). This allows us to make the loop twice shorter, assuming that N is even:

P6: MOV CX,N/2

MOV SI,offset b

MOV DI,offset a

REP MOVSW

Since a single MOVSB takes as much time as MOVSW, this is exactly twice faster. And, if desired, we can go further:

MOVSD (MOVe String Dword)

dword ES:[DI] = dword DS:[SI]

SI = SI +- 4

DI = DI +- 4

MOVSD moves four bytes at a time, here we must assume that N is divisible by 4:

P7: MOV CX,N/4

MOV SI,offset b

MOV DI,offset a

REP MOVSD

Can we use word or doubleword MOVS if N is odd (or not divisible by 4)? Yes, but we will only work out the program for MOVSW. For odd N we can code:

214

P8: MOV CX,N/2

MOV SI,offset b

MOV DI,offset a

REP MOVSW

MOVSB

(We use a standalone MOVSB to copy one remaining byte). The more interesting situation is when we do not know if N is even or

odd at the compile time. One way to solve the problem would be to text N and use the code from the even and odd examples, something like this:

P9:

MOV SI,offset b

MOV DI,offset a

MOV CX,N

TEST CX,1

JNZ odd

even:SHR CX,1

REP MOVSW

JMP done

odd: SHR CX,1

REP MOVSW

MOVSB

done:

but shorter,prettier and faster code is possible :

P10:

MOV SI,offset b

MOV DI,offset a

MOV CX,N

SHR CX,1

REP MOVSW

RCL CX,1

REP MOVSB

The code is faster because of the absence of condtional jumps.

215

Exercise: Understand how this works!

One can optimize the code further using MOVSD, we will not do this but leave this as an exercise.

Let us now return to the original problem: how to copy an array in C? The correct answer is

char a[N],b[N]; //N is a constant

// copy b to a

memcpy(a,b,N);

where memcpy is a function that is typically written in assembler and is equivalent to either P10 or its MOVSD-based analog. memcpy (this stands for memory copy) is a part of standard C/C++ and is found in the string.h header; use of memcpy allows one to replace a slow C loop by a fast assem- bler one.

We will see other useful functions in string.h later.

12.1.1 Deleting from a character array

In this and the following section we will consider a typical problem that comes up when working with strings: deleting and inserting characters. A sample deletion problem we will look at is to remove one character (say, D) from the array buff.

buff DB "ABCDEFGHI...XYZ"

This is a perfect problem for string move:

MOV SI,offset buff+4 ;-->E

MOV DI,offset buff+3 ;-->D

MOV CX,len ;number of bytes after D

REP MOVSB

Notice that we do not try to optimize this with MOVSW etc; notice also that in any application we probably keep the length of the string stored somewhere and we would need to update (decrement it) as well.

Similar code can be used to delete more than one character.

216

12.1.2 Inserting into a character array

The reverse problem would be to insert a character or a string into another string, for example we may want to insert "789" after C. This task has to be broken into two: (1) we should make space by shifting DEF.... to the right by three positions; then (2) we will copy "789" into the provided space.

Consider the following code:

MOV SI,offset buff+3

MOV DI,offset buff+6

MOV CX,len ; number of bytes after C

REP MOVSB

While this would indeed create three spaces in the array, it would also destroy the data. While REP MOVSB is a super-optimized form of a loop, it is still a loop and copying the data to the right would override and destroy the contents of buff. Very first copy, for example, would copy D over G and G will be irreversibly gone.

217

Syllabus
Preliminary material

Introduction #1: looking at new hardware
Introduction #2: History
Introduction #3: Fundamentals
x86 CPU
8086 registers -- full list
General addressing scheme
Back to History: Original IBM PC (1981)
Back to History: 32 bit OS?)

Instructions

Overall structure of asm program
The NOP instruction
The MOV instruction

Examples
Byte order

The XCHG instruction

Binary encoding of XCHG

The ADD instruction

Overflow detection

The SUB instruction
The INC instruction
The DEC instruction
The NEG instruction
The CMP instruction
Logical or Bitwise

Bits and Masks
Checking for odd/even
Sets
Images

The AND instruction
The TEST instruction
The OR instruction
The XOR instruction
The NOT instruction
Shift and Rotate operations

Data conversion
Multiplication and Division
LEA : Load effective address
Long integers: ADC and SBB

24-bit case
other operations

Jumps

Unconditional Jumps
Conditional Jumps

Using Zero Flag
Using Sign Flag
A few examples
Using Carry Flag
Mixing signed and unsigned

Jumps encoding
LOOPS
Control statements templates

Variables

Declaring variables
Using arrays
Memory addressing syntax
Video memory in text mode.

Project #1

Program outline

Submission

How to compile and run a program.

Stack and procedures

Using stack

Encoding of PUSH/POP instructions

Stack implementation

Stack Balancing
Working with Stack Pointer

Procedures

Structured syntax
Printing a number.
Printing a number in hex

Interrupts

Interrupt types
INT 0
INT 1
INT 2
INT 3
INT 4
INT 5
INT 8
INT 9
INT 10h
INT 13h
INT 16h
INT 20h
INT 21h

Copying a file
INT 21h,function 09h Write string to console
INT 21h,function 3Ch Create File
INT 21h,function 3Dh Open File
INT 21h,function 3Eh Close File
INT 21h,function 3Fh Read File
INT 21h,function 40h Write File
INT 21h,function 42h Seek file

INT 33h

Project #2
Sample code

Reading in an entire file
Reading command line

Branch prediction

The abs() function

String Instructions

MOVS: copying memory data

Deleting from a character array
Inserting into a character array