assembly programming
Chapter 1
Syllabus
Catalog Description: Computer structure, machine representation of data, addressing and indexing, computation and control instructions, assembly language and assemblers; procedures (subroutines) and data segments, linkages and subroutine calling conventions, loaders; practical use of an assembly language for computer implementation of illustrative examples.
Course Goals
0 Knowledge of the basic structure of microcomputers - registers, mem- ory, addressing I/O devices, etc.
1 Knowledge of most non-privileged hardware instructions for the Ar- chitecture being studied.
2 Ability to write small programs in assembly language
3 Knowledge of computer representations of data, and how to do simple arithmetic in binary & hexadecimal, including conversions
4 Being able to implementing a moderately complicated algorithm in assembler, with emphasis on efficiency.
5 Knowledge of procedure calling conventions and interfacing with high- level languages.
Optional Text: Kip Irvine, Assembly Language for the IBM PC, Prentice Hall, 4th or 5th edition
1
Additional References: Intel and DOS API documentation as presented in Intel publications and online at www.x86.org; lecture notes (to be sup- plied as we go).
Prerequisites by Topic. Working knowledge of some programming lan- guage (102/103: C/C++); Minimal programming experience
Major Topics Covered in the Course:
1 Low-level and high-level languages; why learn assembler?
2 How does one study a new computer: the CPU, memory, addressing modes, operation modes.
3 History of the Intel family of microprocessors.
4-5 Registers; simple arithmetic instructions; byte order; Arithmetic and logical operations.
6 Implementing longer integer type support; carry and overflow.
7 Shifts, multiplication and division.
8 Memory layout.
9 Direct video memory access; discussion of the first project.
10 Assembler syntax; how to use the tools.
11-13 Conditional & unconditional jumps; loops; emulating high-level lan- guage constructions; Stack; call and return; procedures
14-15 String instructions: effcient memory-to-memory operations.
16 Interrupts overview: interrupt table; how do interrupts work; classif- cation.
17 Summary of the most important interrupts.
18-20 DOS interrupt; File I/O functions; file-copy program; discussion of the second project
21 Interrupt handlers; keyboard drivers; timer-driven processes; viruses and virus-protection software.
2
22 Debug interrupts; how do debuggers and profilers work.
23-24 (Optional).interfacing with high level languages; Protected mode fun- damentals
Grading The grading is based on two projects, midterm project is 49% and the final is 51%. Please note that the projects are individual, submitting projects that are similar to submissions of others and/or are essentially downloads from the Web would result in a fail.
Office Hours My hours this term for CSc 210 will be 3:45 ¶Ł 4:45 on Mondays.
Zoom links:
11am https://ccny.zoom.us/j/85378437821
2pm https://ccny.zoom.us/j/87625527827
3
Chapter 2
Preliminary material
4
: Why assembler? • Why take this class?
• Why program assembler?
• Why know assembler?
5
: NOTE: think Binary! Why binary? Binary numbers (WIKI) (brief answer: because this is easy to implement) Why hex? Hexadecimal numbers (WIKI) (brief answer: because it is much easier to work with shorter strings) What about DNA?
6
2.1 Introduction #1: looking at new hard- ware
• CPU, general purpose (arithmetic) registers
– How large?
– How many?
– Are they all the same?
– Modes?
• Memory Model
– Is all memory the same?
– Flat?
– Segmented?
– Paged?
• Other hardware (peripherals)
• OS
• Special features
7
2.2 Introduction #2: History Intel Processors Over the Years
The History Of Intel CPUs
-1971 before Intel
1971 4004
• Intention
• Name
• Usage
• What can you do with 4 bits?
1972 8008
- Doubling – what can you do with 8 bits
1974 8080
1975 8085
1975 Z80
1974 CP/M – Digital Research, Gary Kildall
1978
- 8086 – X86 architecture. 8 8bit registers, 8(+6) 16 bit registers. 1mb limit. 1mb mystery?
1979 8088 – cost cutting
1981 iAPX 432 – an attempted 32 bit processor
1982 80186 – minor improvements/corrections
1981 IBM PC
8
2.3 Introduction #3: Fundamentals Data types
1 bit
4 nibble
8 byte
16 word
32 dword, doubleword
64 qword, quadword
80 tenbyte
9
2.4 x86 CPU
10
Registers Overlap! Problem: Let AH = 2,AL = 3. What is AX? Solution:
00000010 00000011
AH=00000010b=02h
AL=00000011b=03h
AX=0000001000000011b = 0203h = 515d
Note: suffices b, h and d are part of the Assembly language syntax; d(ecimal) is the default. Assignment syntax, however, is different, it is only used for an illustration here.
Fast solution:
AX = 2*256+3
AX = (2<<8)+3
Problem: Let AX = 2020. What are AL and AH ?
Registers BX,CX, DX are divided similarly.
General purpose aka Arithmetic registers: Sequence A,B,C,D is an illusion, these letters stand for Accumulator,
Base, Count, Data. 8 8bit registers: AH,AL,BH,BL,CH,CL,DH,DL 8 16bit registers: AX,BX,CX,DX
and SI,DI,BP(?),SP(??) SP generally cannot be used for calculations, BP usually cannot be used
either. (32 bit to be described later)
11
IP – Instruction pointer points to the first byte of the current instruction. Code: B B B B B B B B B B B B B B B B B B
Code is essentially a one dimensional array of bytes (in C/C++ – un- signed char type).
IP initially is 0, after one instruction is executed it should be 2, then 5, then 6, ....
Simplified logic (one instruction)
byte code[MAXCODE];
byte opcode;
opcode=code[IP++];
switch (opcode) {
case 0x00: ...
case 0x01: ...
...
case 0xFF: ...
}
each subcase will read additional bytes if needed to complete reading of the instruction.
12
Simplified logic (full execution)
byte code[MAXCODE];
byte opcode;
while(true) {
opcode=code[IP++];
switch(opcode) {
case 0x00: ...
....
case 0xFF: ...
}
}
Why is this simplified? • CS is also used.
• how do we terminate?
• how do we change the executed sequence?
What do we do with this? Is switch efficient? Question: what would IP=k do (if such instruction exists).
13
FLAGS register Should be seen not as a single 16-bit register but as a collection of 16
1-bit registers. More important ones: ZF, SF, CF, DF Neither FLAGS nor the names above are keywords.
14
Segment registers : CS, DS, SS, ES – specify where segments (“parts”) of the program are located.
• CS Code Segment
• DS Data Segment
• SS Stack Segment
• ES Extra Segment
15
2.5 8086 registers – full list • AX Accumulator eXtended
• AL Accumulator Low
• AH Accumulator High
• BX Base eXtended
• BL Base Low
• BH Base High
• CX Count eXtended
• CL Count Low
• CH Count High
• DX Data eXtended
• DL Data Low
• DH Data High
• SI Source Index
• DI Destination Index
• BP Base Pointer
• SP Stack Pointer
• CS Code Segment
• DS Data Segment
• SS Stack Segment
• ES Extra Segment
• IP Instruction Pointer (not a keyword)
• Flags Flags (not a keyword)
16
2.6 General addressing scheme Three distinct ways to address memory:
• Absolute address : mem[offset] (flat model–generally cannot be done)
• Segmented address : mem[f(seg,offset)] (done by hardware). Usual notation: ssss:oooo (hex digits)
• Expressing segmented address in assembly syntax – to be covered later
The f(seg,offset) function is mode-dependent. In real mode, f(seg,offset)=seg*16+offset. This allows to build 20 bit numbers out of 16 bit quantities. Examples
0000:0000 =⇒ 00000
1234:5678 =⇒ 179B8
+ 12340
05678
--------
179B8
The mapping is not one-to-one! Different (seg,offset) pairs may point to the same address.
0000:0100 =⇒ 00100
0010:0000 =⇒ 00100 Puzzle
FFFF:FFFF =⇒ ????? (ref: A10 address line) ==
17
Code segment is effectively mem[f(CS,i)], Data segment is effectively mem[f(DS,i)]
Protected memory addressing function uses Segment Descriptor Table lookup. Fields include Base, Limit, Access Rights.
Implication: instructions Segment<-value are very costly in protected mode.
18
2.7 Back to History: Original IBM PC (1981) Distorted:
Timeline
IBM’s brand recognition, along with a massive marketing campaign, ignites the fast growth of the personal computer mar- ket with the announcement of its own personal computer (PC). The first IBM PC, formally known as the IBM Model 5150, was based on a 4.77 MHz Intel 8088 microprocessor and used Mi- crosofts MS-DOS operating system. The IBM PC revolutionized business computing by becoming the first PC to gain widespread adoption by industry. The IBM PC was widely copied (“cloned”) and led to the creation of a vast “ecosystem” of software, pe- ripherals, and other commodities for use with the platform.
Better: WIKIPEDIA article Additional link (on reaction): Orson Scott Card’s novel
19
No OS ! Three options:
• CP/M-86 (Control program for Microcomputers), see also DR page
• UCSD p-System
• PC DOS/MS DOS, see also 86-DOS
See also: PL/M Introduction #2: History (cont)
1982 80186, 80188
1982-1991 80286
1985-2007 80386
80186 : almost not used in PC’s, many improvements in instructions (kept).
80286 : 16mb protected mode–promise not fullfilled. Real mode −→−→ Prot mode
XENIX
20
80386 • 32 bit
• 2 additional modes
• misc enhancements (debugging)
21
Doubling of registers again EAX = xxxxxxxxxxxxxxxx ahahahah alalalal
22
Flags register becomes EFLAGS :
Additionally:
• Control Registers CR0..CR7 (CR0=MSW(Machine Status Word) on 80286)
• Test Registers TR0..TR7
• Debug Registers DR0..DR7
64 bit mode adds RAX,...
23
24
On paging Virtual memory allows to execute programs larger than physical mem-
ory. Generally cannot be controlled by the programmer, paging algorithms
are implemented by the OS Page replacement algorithms Application algorithms can be tailored for paging environment. Example:
#define N 1024
int x[N][N],y[N][N],z[N][N];
int i,j;
for (int i=0; i<N; i++)
for (int j=0; j<N; j++)
z[i][j]=x[i][j]+y[i][j];
vs
#define N 1024
int x[N][N],y[N][N],z[N][N];
int i,j;
for (int i=0; i<N; i++)
for (int j=0; j<N; j++)
z[j][i]=x[j][i]+y[j][i];
Will the two programs run equally fast ?
25
Assume 3 pages are available. (1 page is exactly a row of a matrix above.)
Two dimensional arrays are stored row by row. First program : 1024 swaps. Second program : 10242 swaps. Technical info
26
80486 : 1989 8087 : 1980 80187 (for 80186), 80287 (for 80286), 80387(for 80386). Other coprocessors existed.
Stack design, 8 80-bit registers ST(0), ST(1),.. ST(7).
80486 = 80386 + 80387
Datatypes:
32bit single (float in C/C++)
64bit double
80bit extended (internal format)
Pentium : 1993- 1993 Pentium (P5), why not 80586? (80486.00+100.00=???)
28
1995 Pentium Pro (P6), MMX addition
1997 Pentium II
1999 Pentium III
2000 Pentium 4
MMX:
• MultiMedia eXtension
• Multiple Math eXtension
• Matrix Math eXtension
Intel Core (from 2006)
29
Chapter 3
Instructions
3.1 Overall structure of asm program
• Header – TBD
• Sequence of instructions
• Trailer – TBD
Instructions generally are written one per line (minor exceptions later) Instructions generally follow the following format:
[<label>:] <opcode> [<operands>] [;comment]
[<label>:] [;comment]
where
<label> – optional label (any identifier that is not a keyword or defined oth- erwise).
<opcode> – name of the instruction (keyword)
<operands> – comma-separated operands, if any; their number (0-3) depends on the opcode
;comment – any text, ignored up to the EOL.
Trivial example:
30
lab: ; this line does not do anything
Symbolic representation of instructions corresponds to particular se- quence of bytes which are actually executed.
3.2 The NOP instruction
NOP (do nothing)
Binary representation: one byte, hex value 90h. Execution: Before: bb bb bb bb bb bb bb bb bb
↑IP
90 bb bb bb bb bb bb
After: bb bb bb bb bb bb bb bb bb 90
↑IP
bb bb bb bb bb bb
IP is incremented by 1; no other register is changed
31
WHY have it? • delay?
• padding for sloppy compilers
• patching (code deletion)
• reserving space for patching(code addition)
32
3.3 The MOV instruction
MOV dst,src (copy src to dst)
Example:
MOV AL,BL
;
; before : AL=3 BL=7
; after : AL=7 BL=7
Example:
MOV DL,CH
MOV DL,DL
MOV AX,CX
MOV AX,SP
MOV SP,CX ; very dangerous
MOV EDI,EDI
MOV EDI,ESP
MOV AL,CX ; illegal
MOV EDI,CX ; illegal
MOV IP,AX ; illegal
MOV AX,CS ; ok, special case (see below)
MOV DS,AX ; ok, special case (see below)
MOV CS,DX ; special case, illegal
MOV DS,EDI ; illegal
MOV CR0,EAX ; priveleged
MOV DR0,EAX ; ok, special case (see below)
RULE #1: size of src and dst must match
Most instructions support only gp regis- ters
33
Argument types:
• (r)egister
• (m)emory
• (i)mmediate
• (s)pecial register
Argument size:
• (b)yte
• (w)ord
• (d)oubleword
• ...
MOV DL,CH ; brr instruction
34
General template for 2-arg instructions: r m i
r . . . m . . . i . . .
Move-specific template: r m i s
r . . . . m . . . . i . . . . s . . . .
35
Right now:
r m i r X . . m . . . i . . .
Examples:
MOV AL,[100] ; brm
MOV BX,[200] ; wrm
MOV EDI,[400] ; drm
MOV [100],AL ; bmr
MOV [200],BX ; wmr
MOV [400],EDI ; dmr
Thus
r m i r X X . m X . . i . . .
What does [#] really mean? Answer: bytes beginning with byte #. in
MOV AX,[100]
which byte goes where?
36
Examples:
MOV AL,1 ; bri
MOV DX,2 ; wri
MOV EDI,4 ; dri
r m i r X X X m X . . i . . .
Examples:
MOV AL,97 ;
MOV AL,61h ; all four lines are equivalent
MOV AL,01100001b
MOV AL,’a’ ;
...
MOV AL,1000 ; ???
37
No storing into immediates, this would be like
1=x;
in C. Thus:
r m i r X X X m X . . i × × ×
Important: MOV with immediate is a fundamentally different operation from the rr,rm, mr forms.
38
RULE #2: no memory-to-memory (2 exceptions later)
Thus:
r m i r X X X m X × ? i × × ×
MOV [100],1 ; should not compile
RULE #3: size must be known
Correct syntax:
MOV byte ptr [100],1
MOV word ptr [100],1
MOV dword ptr [100],1
MOV qword ptr [100],1 ; 64 bit only
MOV tbyte ptr [100],1 ; ???
What about
MOV [100],AL
MOV byte ptr [100],AL ; unneeded
MOV word ptr [100],AL ; will not compile
Final result: r m i
r X X X m X × X i × × ×
39
Full table (MOV only): r m i s
r X X X X m X × X X i × × × × s X X × ×
3.3.1 Examples
Here is how C/C++ assignments may be compiled:
char c1,c2; c1=c2;
-------------------
MOV AL,c2
MOV c1,AL
short s1,s2; s1=s2;
-------------------
MOV AX,s2
MOV s1,AX
int x,y; x=y;
-------------------
MOV EAX,y;
MOV x,EAX;
40
int x,y,z; x=y=z;
-------------------
MOV EAX,z;
MOV x,EAX;
MOV y,EAX;
int x; x=0;
-------------------
MOV x,0;
int x,y,z; x=y=z=0;
-------------------
MOV x,0
MOV y,0
MOV z,0
perhaps, a better implementation?
MOV EAX,0 ; could be even better
MOV x,EAX
MOV y,EAX
MOV z,EAX
41
Exercise: Exchange bytes in [100] and [101]
MOV AL,[100]
MOV AH,[101]
MOV [100],AH
MOV [101],AL
can this be done in fewer lines of code?
MOV AX,[100]
MOV [100],AH
MOV [101],AL
Note: Byte order matters.
42
3.3.2 Byte order
Consider:
MOV [100],AX
Does
LE,reversed AL go into [100] and AH into [101] or, instead:
BE,normal AH go into [100] and AL into [101]
More than you want to know on Endianness
LE,reversed : Intel, Dec
BE,normal : IBM mainframe, Motorola, Sun
Practical implications:
• it is important to know the endiness of the hardware and the data.
• it is important to be able to swap.
• it is important to be able determine the endiness. How?
Specific example of byte order importance:
short s=1;
FILE *f=fopen("try.dat","wb");
if (!f) { ... error handling ... }
fwrite(&s,1,sizeof(s),f);
fclose(f);
Should create a 2-byte file try.dat. Now,
43
short s;
FILE *f=fopen("try.dat","rb");
if (!f) { ... error handling ... }
fread(&s,1,sizeof(s),f);
fclose(f);
cout << s;
should print the value of s – indeed 1. But: what will happen if we run the Writing program on an Intel comp,
move the data file to a Sun, and run the reading program there?
Exercise: Can a high-level program be written that determines the order of bytes?
44
3.4 The XCHG instruction
XCHG dst,src (exchange src with dst)
XCHG r m i r X X × m X × × i × × ×
Segment and other non-gp registers are not supported. The syntax and examples from MOV apply, except for non-use of non-gp
registers and immediates. Examples (which of the following are valid?)
XCHG AL,AH
XCHG AX,SP
XCHG EAX,EDI
XCHG AL,[400]
XCHG [400],AL ;same as above
XCHG AL,DI
XCHG DI,DS
XCHG EAX,7
XCHG [100],[101]
XCHG AX,AX ; nop?
XCHG DI,DI ; nop?
XCHG CL,CL ; nop?
45
Can a better version of byte swap program be now written?
Better:
MOV AX,[100]
XCHG AL,AH
MOV [100],AX
Yet better:
XCHG AX,[100]
XCHG AL,AH
XCHG [100],AX
Q: can a shorter program be written (perhaps with another instruc- tion)?
46
3.4.1 Binary encoding of XCHG
We only consider accumulator exchanges now. Instructions
XCHG AX,reg
are extra optimized in the intel architecture.
90h XCHG AX,AX
91h XCHG AX,CX
92h XCHG AX,DX
93h XCHG AX,BX
94h XCHG AX,SP
95h XCHG AX,BP
96h XCHG AX,SI
97h XCHG AX,DI
Q: Why the # of registers is a power of 2 ? A: Because this allows to represent registers as in a fixed number of bits.
47
16-bit register representation:
000b AX
001b CX
010b DX
011b BX
100b SP
101b BP
110b SI
111b DI
An emulator may use code like
unsigned short regs[8];
#define AX regs[0]
#define CX regs[1]
#define DX regs[2]
#define BX regs[3]
#define SP regs[4]
#define BP regs[5]
#define SI regs[6]
#define DI regs[7]
Notes:
• this is just an example!
• 8 bit registers have their own 3-bit keys
• 32 bit registers parallel 16 bit registers
• 64 bit registers use 4-bit keys
• The above code should define 8-bit regs properly (f.e. setting AX should set AL,AH too!
48
• The above code should be modified to support 32 bit registers
\texttt{XCHG AX,AX}
is NOP. General encoding scheme of XCHG (with accumulator): 1 0 0 1 0 r e g
This idea is used in other instructions. XCHG without accumulator uses a l8engthier encoding, with first byte 86h/87h.
XCHG encoding
49
NOTE: MOV has several different forms, including optimized forms for the accumulator.
Similar scheme is used for the segment registers:
00b ES
01b CS
50
10b SS
11b DS
3.5 The ADD instruction
ADD dst,src (dst += src)
(proper name should be increment by.) General 2-operand instruction layout applies:
ADD r m i r X X X m X × X i × × ×
Given that syntax of ADD is largely similar to MOV, the examples are sim- ilar:
ADD AX,BX
ADD EAX,ESP
ADD DL,CL
ADD AX,[100]
ADD [150],EAX
ADD AX,DS ; illegal
ADD AX,DL ; illegal
ADD [10],5 ; syntax error
ADD word ptr [10],5 ; fine
C example:
int x,y,z; x=y+z;
-----
MOV EAX,y
51
ADD EAX,z
MOV x,EAX
int x,y,z; x=x+y;
-----
MOV EAX,y
ADD x,EAX
int x,y,z; x=x+25;
-----
ADD x,25
(NOTE: size specification is not required if x is declared to be a double word)
52
Consider
ADD AL,AL
Generally, multiplication by 2 should not be done as multiplication (generally about 3x slower than addition). Writing
int x; x=2*x;
is wrong! One should use either addition or a shift (if available). (What is better depends on the situation and hardware).
Q: Should we replace multiplication by addition in :
int f(int);
int x; x=2*f(x);
More simple examples: Consider
ADD AL,0 ; nop ?
ADD AL,1 ; increment ?
ADD AL,-1 ; decrement ?
ADD AL,AL ; double
53
MOV AL,1 ; AL=1
ADD AL,AL ; AL=2
ADD AL,AL ; AL=?
ADD AL,AL ; AL=?
ADD AL,AL ; AL=?
ADD AL,AL ; AL=?
ADD AL,AL ; AL=?
ADD AL,AL ; AL=?
ADD AL,AL ; AL=?
ADD AL,AL ; AL=?
MOV AL,1 ; AL=1 binary |00000001
ADD AL,AL ; AL=2 binary |00000010
ADD AL,AL ; AL=4 binary |00000100
ADD AL,AL ; AL=8 binary |00001000
ADD AL,AL ; AL=16 binary |00010000
ADD AL,AL ; AL=32 binary |00100000
ADD AL,AL ; AL=64 binary |01000000
ADD AL,AL ; AL=128 binary |10000000
ADD AL,AL ; AL=0 binary 1|00000000 << overflow
ADD AL,AL ; AL=0 binary 00000000
Is this an assembler problem ?
unsigned char c;
c=1;
printf("%d",c); c=c+c;
printf("%d",c); c=c+c;
printf("%d",c); c=c+c;
printf("%d",c); c=c+c;
printf("%d",c); c=c+c;
....
Note: if you like C++ and cout<<, make sure to cast!
54
Q: what would be the output if we use char rather than unsigned char?
Is this a size problem ? Try
MOV AX,1
ADD AX,AX
...
OR
MOV EAX,1
ADD EAX,EAX
...
OR C/C++ versions.
55
Unlike MOV and XCHG, ADD is an arithmetic instruction: it sets flags. Warning: the discussion of the flags is slightly simplified, I’m not con-
sidering the OF. Thus there are slight differences between the behavior described and the actual behavior of the processor. This makes no differ- ence for most programs, but there are rare instances where this matters. In particular, I will consider JS and JL as equivalent, in reality they are not exactly the same.
ZF Zero Flag
SF Sign Flag
CF Carry Flag
OF Overflow Flag
; ZF SF CF
MOV AL,1 ; AL=1 binary |00000001 ? ? ?
ADD AL,AL ; AL=2 binary |00000010 0 0 0
ADD AL,AL ; AL=4 binary |00000100 0 0 0
ADD AL,AL ; AL=8 binary |00001000 0 0 0
ADD AL,AL ; AL=16 binary |00010000 0 0 0
ADD AL,AL ; AL=32 binary |00100000 0 0 0
ADD AL,AL ; AL=64 binary |01000000 0 0 0
ADD AL,AL ; AL=128 binary |10000000 0 1 0
ADD AL,AL ; AL=0 binary 1|00000000 1 0 1 << overflow
ADD AL,AL ; AL=0 binary 00000000 1 0 0
WARNING: This is slightly simplified (there is also OF) Flags can be used to
• implement conditionals (IF, WHILE,...)
• implement “long” arithmetic
• check for overflow
56
3.5.1 Overflow detection
unsigned int x,y,z;
....
x=y+z; // concern about overflow
unsigned int x,y,z;
....
y=0x90000000;
z=0x90000000;
x=y+z; // overflow will occur here, result will be incorrect.
can we check for it like this?
unsigned int x,y,z;
....
if (y+z>0xFFFFFFFF)
error("overflow");
x=y+z;
Correct way:
unsigned int x,y,z;
....
if (y>0xFFFFFFFF-z)
error("overflow");
x=y+z;
57
Exercise: what about signed types?
A: you will need to check both for “positive” overflow (adding two large positive number) and for the “negative;; overflow (adding two large nega- tive numbers).
In assembler, flags report overflow condition – no need for extra check- ing!
3.6 The SUB instruction
SUB dst,src (dst -= src)
(proper name should be decrement by.) General 2-operand instruction layout applies:
SUB r m i r X X X m X × X i × × ×
Given that syntax of SUB is identical to ADD, syntax examples are similar and omitted.
ADD AX,100
SUB AX,-100 ; same as above
;
ADD AX,-100
SUB AX,100 ; same as above
What do these instructions do?
ADD AX,0
SUB AX,0
58
What does this instruction do?
SUB EAX,EAX
Answer: most efficient way to zero up a register.
What is the difference between the two instructions below?
SUB EAX,EAX
MOV EAX,0
Answer: the former is more efficient; the latter is rarely used, only in the situations when flags must be preserved. (an example, involving an if, will be given later.)
Revising example we saw above, more efficient code:
int x,y,z; x=y=z=0;
-------------------
SUB EAX,EAX
MOV x,EAX
MOV y,EAX
MOV z,EAX
with SUB, Carry flag indicates borrowing.
59
3.7 The INC instruction
INC dst (dst++)
Do we write
ADD AX,1
ADD byte ptr [10],1
A: yes, we can, but usually we would use the optimized form General 1-operand instruction layout applies:
INC r m i X X ×
(Same format applies to three more instructions, explained later). Register form is optimized to one-byte encoding:
40h INC AX
41h INC CX
42h INC DX
43h INC BX
44h INC SP
45h INC BP
46h INC SI
47h INC DI
Other forms of INC are encoded in lengthier way beginning with 0FFh and 0FEh.
Warning: this encoding applies to BOTH 16 and 32 registers!
What is better?
60
INC AX
INC AX
;or
ADD AX,2
A: former. But do not do this with memory arguments.
61
3.8 The DEC instruction
DEC dst (dst- -)
DEC r m i X X ×
Comments on INC above are applicable. Optimized form:
48h DEC AX
49h DEC CX
4Ah DEC DX
4Bh DEC BX
4Ch DEC SP
4Dh DEC BP
4Eh DEC SI
4Fh DEC DI
Other forms of DEC are encoded in lengthier way beginning with 0FFh and 0FEh.
62
3.9 The NEG instruction
NEG dst (dst=-dst)
NEG r m i X X ×
How to negate without using NEG?
NEG EAX
;is the same as
SUB EBX,EBX
SUB EBX,EAX
MOV EAX,EBX
Solve equation x = −x ?
MOV AL,x
NEG AL
if AL did not change, does this mean x is 0?
No, it is either 0 or 128! For short, the solutions are 0 and 215=8000h, for 32-bit they are ....
63
#include <stdlib.h>
#include <stdio.h>
int main() {
short s,s1;
s=0; s1=-s; if (s!=s1) printf("for s=%d, changes;\n",s);
else printf("for s=%d, does not change;\n",s);
s=1000;s1=-s; if (s!=s1) printf("for s=%d, changes;\n",s);
else printf("for s=%d, does not change;\n",s);
s=0x8000;s1=-s; if (s!=s1) printf("for s=%d, changes;\n",s);
else printf("for s=%d, does not change;\n",s);
return 0;
}
}
results in
for s=0, does not change;
for s=1000, changes;
for s=-32768, does not change;
3.10 The CMP instruction
CMP dst,src (dst-src)
General 2-operand instruction layout applies: CMP r m i r X X X m X × X i × × ×
Conditionals (if,while) are implemented in two stages:
• compute flags
• do (or not) the operation depending on a flag(or flags) – this is later.
64
Examples:
; if (x==0) ...
;
; compute ZF from value of x
; if (x!=0) ...
;
; compute ZF from value of x
; if (x<0) ...
;
; compute SF from value of x
; if (x<=0) ...
;
; compute SF and ZF from value of x
What about
; if (x==y) ...
;
;
to set the flags, we compute the difference
; if (x==y) ...
;
MOV EAX,x
SUB EAX,y
; this sets ZF for our use
65
; if (x<y) ...
;
MOV EAX,x
SUB EAX,y
; this sets SF for our use
; if (x<y) ...
;
MOV EAX,x
SUB EAX,y
; this sets SF and ZF for our use
Problem: this computation destroys value of x which is often needed. So, instead:
; if (x<y) ...
;
MOV EAX,x
CMP EAX,y
; this sets SF and ZF for our use
Flags are set exactly as for SUB, but EAX retains the value of x.
; if (x==5) ...
;
MOV EAX,x
SUB EAX,5 ; inefficient
;
SUB x,5 ; worse yet: destroys variable
;
CMP x,5 ; fine
66
3.11 Logical or Bitwise
What kind of AND operation should be implemented in hardware? (same question of course can be asked of OR etc but we will concentrate on AND)
• logical AND (in C, &&) ?
• bitwise AND (in C, &) ?
• both ?
Logical AND works with True and False concepts:
1 && 1 = 1
1 && 0 = 0
0 && 1 = 0
0 && 0 = 0
Q: what about 5 && 6 ?
(5 is first converted to 1 (true), 6 is converted to 1 (true), we compute 1&&1 and get result 1).
This is a multi-step operation! Bitwise AND: apply the AND operation to every bit (every column):
0101 == 5
& 0110 == 6
----
0100 == 4
Thus 5 & 6 = 4. Notice that on single bits (or multibit 0 and 1), the results of & and &&
are identical.
1 & 1 = 1
1 & 0 = 0
0 & 1 = 0
0 & 0 = 0
67
We summarize:
• On appropriate logical values, logical and bitwise AND are the same.
• Logical AND is a multistep operation – hard to implement in hard- ware.
• Bitwise AND can be done easily and efficiently (array of AND-gates)
• on non-standard input values, logical AND offers minimal advantages (how often do we care about 5&&6?)
• on non-standard input values, bitwise AND offers huge advantages:
– masks
– sets
– images
3.11.1 Bits and Masks
Most x86 2-operand instructions are encoded as follows:
o p c o d e d w m d r e g r - m .....
for example, for ADD, opcode is 000000, MOV, opcode is 100010, etc (Encoding for forms with accumulator and immediate are similar). The w field states if the operation is word or byte. Register reg=000
means AX if w==1 and AL if w==0. The d field states if the direction is from register (0) or to register(1). So decoding should retrieve these bytes.
currentbyte=code[ip++];
....
wordop=currentbyte & 1;
from_reg=(currentbyte & 2)!=0;
....
68
alternately one can use division and mod (yuck).
3.11.2 Checking for odd/even
Q: how do we check for a number being odd or even?
if (x is odd) { ... }
69
if (x % 2 == 1) { ... }
better:
if (x % 2) { ... }
much better: better:
if (x & 1) { ... }
Q: how do we check for a number is divisible by 4?
if ((x & 3)==0) { ... }
if (!(x & 3) { ... }
Q: how do we check for a number is divisible by 2k?
if ((x & 2k−1)==0) { ... }
70
3.11.3 Sets
How do we represent a set?
Consider {2,3,5,7,11,13} Idea #1: Representation as a link list:
2 ⇒ 3 ⇒ 5 ⇒ 7 ⇒ 11 ⇒ 13 ⇒ #
Idea #2: Representation as an array:
2 3 5 7 11 13
Idea #3: Representation as a bit array (set)
0 0 1 1 0 1 0 1 0 0 0 1 0 1 0 0
Total space : 2 bytes. Search time:
• list Linear
• array Linear or log (binary search)
• set Constant!
Exercise: write the formula
If A,B are similar sets, one can use & to compute the intersection:
unsigned char A[N];
unsigned char B[N];
unsigned char C[N];
for (int i=0;i<N;i++) C[i]=A[i]&B[i];
Set representation is not always the right one, consider {1,10000000}!.
71
3.11.4 Images
Consider images. Actual representation (b/w image): one dimensional bit array. What would this produce?
&
Conclusion: Bitwise AND is a much better choice for assembler implen- tation.
72
3.12 The AND instruction
AND dst,src (dst &= src)
AND r m i r X X X m X × X i × × ×
Syntax is the same as ADD,SUB,... Examples:
MOV AX,5
AND AX,6 ; AX is 4
AND AX,1 ; check evenness
AND AX,7 ; what does this do?
AND AX,0 ; what does this do?
AND AX,AX ; what does this do?
AND AX,CH ; what does this do?
3.13 The TEST instruction
TEST dst,src (dst & src)
TEST is non-descructive AND, cf. CMP/SUB. TEST r m i r X X X m X × X i × × ×
73
Syntax is the same as AND, ADD, SUB,... Encoding is less efficient than for other operations – TEST is less com-
mon. TEST is helpful when we want to extract different bits from the same
number.
3.14 The OR instruction
OR dst,src (dst |= src)
OR r m i r X X X m X × X i × × ×
Syntax is the same as ADD,SUB, AND... In C, bitwise or is coded as |. Hardware implements the bitwise form. Examples:
MOV AX,5
OR AX,6 ; AX is 7
0101 == 5
| 0110 == 6
----
0111 == 7
OR AX,0
(set flags – not as efficient as RR form)
OR AX,AX
74
(set flags – just as efficient as RR form of AND)
OR AX,1
(set last bit to 1; numerically round up to an odd number.)
OR AX,2
(set next to the last last bit to 1.)
OR AX,3
(set last two bits to 1)
For sets (and graphics) OR implements union.
3.15 The XOR instruction
XOR dst,src (dst ˆ= src)
OR r m i r X X X m X × X i × × ×
Syntax is the same as ADD,SUB, OR... Truth table for XOR:
1 XOR 1 = 0
1 XOR 0 = 1
0 XOR 1 = 1
0 XOR 0 = 0
75
XOR AX,AX
(clear register – just as efficient as RR form of SUB)
XOR AX,0
(only sets flags)
XOR AX,1
(toggle last bit)
XOR AX,7
(toggle last three bits)
XOR AX,0xFFFF
(toggle all bits) Involution property of XOR: ((d XOR k) XOR k) == (d XOR (k XOR k)) == (d XOR 0) == d Application #1: temporary lines:
(same idea can be used for the mouse pointer or inverting selected text).
76
Application #2: cypher: ((data XOR key) XOR key) == data
char data[N];
for (int i=0; i<N; i++)
data[i]=data[i] ^ key;
In pure form, variation of a substitution cipher:
char data[N];
char subst[256]; // contains permutation of ASCII chars
for (int i=0; i<N; i++)
data[i]=subst[data[i]];
Not secure, but good for puzzles! Cryptogram puzzles. With key chaining:
char data[N];
for (int i=0; i<N; i++) {
char c;
c=data[i];
data[i]=c ^ key;
key=f(key,c);
}
77
nearly unbreakable; cf. the Type 1 font encryption episode. The unbreakable Adobe encryption was actually implemented with this
code:
unsigned short int r; // this is the key
unsigned short int c1 = 52845;
unsigned short int c2 = 22719;
unsigned char Encrypt(plain) unsigned char plain;
{unsigned char cipher;
cipher = (plain ^ (r>>8));
r = (cipher + r) * c1 + c2;
return cipher;
}
(Naturally, there are myriads of small changes that can be made to the code, beginning with change of the constants... and they would lead to equally strong security!)
Question: why is C not providing ^^ ?
3.16 The NOT instruction
NOT dst (dst= d̃st)
NOT r m i X X ×
Follows the format of INC,DEC, NEG. Example:
XOR AX,0xFFFFh
NOT AX ; same result as above ?
Warning: NOT may produce drastically different results depending on the size of the number!
78
#include <stdlib.h>
#include <stdio.h>
int main() {
unsigned short s; unsigned char c; unsigned int i;
c=1; c=~c; printf("\n negated byte %u ",c);
s=1; s=~s; printf("\n negated short %u ",s);
i=1; i=~i; printf("\n negated int %u ",i);
}
results in
negated byte 254
negated short 65534
negated int 4294967294
3.17 Shift and Rotate operations
This group of instructions allows to shift bits in a register in different ways. The format of all instructions in the group is the same; we will start by
considering the
79
3.17.1 SHL
SHL r/m,amount (SHift Left)
instruction. The first argument of the instruction must be a register or a memory location; the second specified the number of positions the bits of the first argument will be shifted. There are three ways the shift amount can be specified
1. as number 1
2. as register CL (no other register is allowed)
3. as integer number greater than one. We separate this case from the first one since the encoding is different and this format did not exist in the original 8086 (added with 80186).
Here is a simple example of the SHL execution:
MOV AL,11
SHL AL,1
to see what is the resulting value in target, let’s write AL in binary:
0 0 0 0 1 0 1 1
Shifting the bits left by one position would result in
0 0 0 1 0 1 1 0
where the leftmost bit (0) is moved out and a zero is written to the last position. Numerically the value is 22.
It is not an accident that 22 = 11 × 2; the effect of appending a zero to a binary number is multiplication by 2 (just like appending a zero to a decimal number is multiplication by 10). Unlike the “generic” multiplica- tion described in a later section, this is an efficient and cheap operation – moving bits is naturally cheaper than invoking the multiplier.
80
If the first bit of the number were 1, pushing it out of the register will result in “an overflow” and an incorrect result. For example, shifting 128 in an 8-bit register will produce 0. (Notice that the correct value of 256 is impossible on an 8-bit register.)
The first bit, incidentally is not lost – it is moved to the Carry flag; checking the carry flag therefore offers a way to detect an overflow and possibly correct the calculation.
CF= 0 0 0 0 1 0 1 1 0
The other two forms of SHL arguments allow to shift the register by more than one position. Consider
MOV AL,11
SHL AL,3
or
MOV AL,11
MOV CL,3
SHL AL,CL
Both snippets will do the same : shift AL by 3 positions to the left. Three leftmost bits are pushed out, three zeros are appended at the end, changing
CF= ? 0 0 0 0 1 0 1 1
to
CF= 0 0 1 0 1 1 0 0 0
It is an efficient way to multiply by 8 = 23,and in general left shift by n positions is equivalent to multiplication by 2n.
The bits pushed out cannot be all placed into the carry; two of them are irreversibly lost and only the last bit to be pushed out ends up in the carry flag. In the previous example the bit ending up in the carry is shown in red.
81
Unless an overflow occurs, SHL multiplication often works correctly even with negative numbers. For example, -1 (binary 11111111 becomes 11111110, which is indeed -2).
This instruction does not change either zero or sign flags. NOTE: shift instructions like SHL existed on many processors even be-
fore Intel and got incorporated into the C language ( << ). Thus, one can write in C either x=y+y; or x=y<<1; instead of slower x=y*2. Which of the two options is better? In most cases, about the same. The right choice also depends on the particular formula. For instance
y=f(x)*2; // function call
should not be even changed to y=f(x)+f(x); since this would result in a double call of a function! – definitely slower and possibly a bug, if the function has side effects. Using the shift here is certainly fine. On the other hand, some languages do not even offer shift operator making the addition is the only alternative to (slower) multiplication
82
3.17.2 SHR
SHR r/m,amount (SHift Right)
instruction has the same syntax as SHL but shifts the bits to the right. We again look at an example:
MOV AL,11
SHR AL,1
In binary, 11 is
CF= ? 0 0 0 0 1 0 1 1
shifting this pattern right results in the last bit (1) expelled from the register into the carry flag, the remaining bits moving right by one position and a zero written into the leftmost position, obtaining
CF= 1 0 0 0 0 0 1 0 1
(decimal 5). It is not an accident that 5 = 11/2, for unsigned (or nonnega- tive) numbers shifting right by 1 is equivalent to division by 2, and shifting right by n bits is division by 2n. (Analogy: with decimal numbers deleting the last digit is the same as division by 10). when shifting by more than one it is the last bit that shifted-out from the register that ends up in the carry.
83
One more example to consider :
MOV AL,255
SHR AL,1
the result of shifting
CF= ? 1 1 1 1 1 1 1 1
is
CF= 1 0 1 1 1 1 1 1 1
in terms of division we see 127 = 255/2, correct. However, in
MOV AL,-1
SHR AL,1
we see the same exact values (255 is -1), but obviously incorrect result: 127 = −1/2. In fact the results will be always incorrect when shifting a negative number: after the shift the sign bit (1) is replaced by 0!
NOTE: in C/C++ bitshift operators << and >> use SHL and SHR for un- signed (or signed but positive) numbers. Applying them to negative num- bers is supported by syntax but unpredictable, the result may or may not be correct, depending on the compiler (it is correct under BCC). C/C++ syn- tax allows negative values of shift, but the results are again unpredictable (under BCC, the result would be always 0).
Division of negative numbers via a shift is, however, possible in assem- bler with the
84
3.17.3 SAR
SAR r/m,amount (Shift Arithmetic Right)
instruction. SAR is nearly identical to SHR; the only difference is that SHR inserts zeros at the left, SAR duplicates the first bit. Thus number 255 (11111111)
CF= ? 1 1 1 1 1 1 1 1
would become
CF= 1 0 1 1 1 1 1 1 1
after a SAR (as seen above), but after SHR it is
CF= 1 1 1 1 1 1 1 1 1
where the first 1 is copied to the right but also left where it was. SAR produces correct results for division of signed/negative numbers: the calculation in the example above is division of -1 by 2, resulting in -1 – correct result assuming rounding to −∞.
NOTE: Java implements both SAR and SHR as >> and >>>.
85
3.17.4 SAL
For symmetry, Intel also has
SAL r/m,amount (Shift Arithmetic Left)
while SAL is named and even encoded differently, it has the same function- ality as SHL.
86
3.17.5 ROL
ROL r/m,amount (ROtate Left)
rotates bits leftward, the high bit(s) pushed out appear on low end.
MOV AL,79h ; AL is 01111001
ROL AL,1 ; AL is 11110010
ROL AL,1 ; AL is 11100101
MOV CL,2
ROL AL,CL ; AL is 10010111
ROL never loses any bits and does not use the Carry flag. Notice that
ROL AL,8
does not do anything, yet another equivalent of NOP. Likewise
ROL AX,16
does not change any registers.
Exercise: What does ROL AX,8 do?
87
3.17.6 ROR
ROR r/m,amount (ROtate Right)
is similar, except that the bits are rotated right. Notice that the two instruc- tions below are equivalent.
ROL AL,3
ROR AL,5
ROL and ROR are rarely used, albeit this example should be of interest:
ROL word ptr [100],8
88
3.17.7 RCL
RCL r/m,amount (Rotate Carry Left)
rotates the bits in register and the carry flag left. High bits being rotated out are moved to the carry while the carry flag is entered on the low end of the register.
MOV AL,79h ; AL is 01111001 CY=c (undefined)
RCL AL,1 ; AL is 1111001c CY=0
RCL AL,1 ; AL is 111001c0 CY=1
RCL AL,1 ; AL is 11001c01 CY=1
rotation by more than one can be seen as repeated rotations by 1, except done as a single instruction.
89
3.17.8 RCR
RCR r/m,amount (Rotate Carry Right)
is similar to RCL except for the direction of the rotation.
!!! The code below uses some assembler syntax that has not been cov- ered yet !!!
While RCL and RCR may appear strange but they have a number of us- ages. One of them is the ability to shift (that is multiply or divide by a power of two) numbers that are larger than the register size. Assume that x is a 100-byte (800-bit) number stored in memory locations [500] through [599]. We would like to multiply it by 2. Remember that the bytes are stored in reverse order.
SHL byte ptr DS:[500],1 ;high bit is moved to the carry
RCL byte ptr DS:[501],1 ;carry to 2nd byte, high bit to carry.
RCL byte ptr DS:[502],1 ;carry to 3rd byte, high bit to carry.
...
RCL byte ptr DS:[599],1
or, as a loop:
CLC
MOV BX,500
MOV CX,100
L: RCL byte ptr DS:[BX],1
INC BX
LOOP L
90
Exercise: How would you divide this 800-bit number by 2?
Exercise: Would signed and unsigned numbers use the same code?
to multiply by 2k, you need to shift by k positions. This must be done as 1-position shifts repeated k times since the carry can hold only one bit, therefore a double loop is required.
another application for these instructions is shifting a monochrome image by one or several pixels. The code is similar to the one used for shifting long numbers above.
first two forms of all eight shift/rotate instructions are encoded to- gether, using bytecodes D0 through D3. The “register” field in the 2nd byte of the encoding actually denotes the operation. The third form uses C0 and C1, in a similar fashion.
80386 and all newer processors have double-register-shift operations SHLD and SHRD that allows to write more efficient code; we do not describe them here.
91
3.18 Data conversion
The arithmetic instructions studied in the previous sections are sufficient to implement many formulas, but not all. One particular problem is mixing of variables of different sizes, as seen in the following C-language example
int x;
short y;
char z;
x = y+z;
If all three variables were declared the same way (for instance, all three int’s), we could have implemented the computation with already known instructions:
MOV EAX,y
ADD EAX,z
MOV x,EAX
but the size must match rule prevents us from mixing operands of different size. Thus, we are in need for data conversion instructions – ones that change the size of a variable without changing the value.
Before showing the available instructions, let’s understand the exact problem.
In the example above we would like to add byte (char) quantity z with word (short) quanity y. To do this, we need to convert z to also a word quantity that has the value equal to the original value of z, this word quan- tity can be added with y. (We would later need to convert the sum – a word quantity – to a doubleword, to be able to store it into x).
We actually have a way to do this conversion in some cases. For exam- ple, if we assume z to be equal to 1, the following assembler code would work:
MOV AL,z
SUB AH,AH
Bingo: by zeroing-up the high bits in the AX register, we extended the value into a word, and can add it with the value of y:
92
MOV AL,z
SUB AH,AH
ADD AX,y
Assuming y is equal to 3, the result would be 4, stored in AX. To save it into x we can append two more zero bytes:
MOV AL,z
SUB AH,AH
ADD AX,y
MOV word ptr x,AX
MOV word ptr x[2],0
ZERO EXTENSION, used in the example, is in fact a correct solution, just not for the our formula; with diffently chosen numbers it will fail. Consider
int x;
short y=1;
char z=-1;
x = y+z;
ZERO extending z as we did above will result in the value 255 in the AX; x therefore will be computed as 256.
Why is this happening? ZERO EXTENSION always produces non-negative numbers, even if the original value had the sign (high) bit set, the extended value will always have zero in the high position. Thus, it is suitable for un- signed data types, the example
unsigned int x;
unsigned short y;
unsigned char z;
x = y+z;
will work fine with any values of x and y. For signed data types we need a different type of extension. For simplicity, let’s look only at conversion AL ⇒ AX. The value in AL is considered as signed. If it happens to be nonnegative, we can ZERO-EXTEND, as we did above
SUB AH,AH
93
if it happens to negative, then we should instead fill AH with 1s, not 0’s:
MOV AH,0FFh
In general, the extension byte(or bytes) will be filled with the sign of the number, the procedure itself is therefore called SIGN EXTENSION.
One practical difference between the two types of extensions: while the unsigned (ZERO) extension can be implemented efficiently with the instructions we already have, an implementation of the signed extension would require coding the logic like
if (AL>=0) AH=0; else AH=255;
can this pseudocode be written in assembly? Yes, of course. Can it be written efficiently? No.
Exercise: Implement the pseudocode above in assembler.
Therefore, the instruction set provides sign conversion instructions which we will now introduce
CBW (Convert Byte to Word)
CBW, just like other instructions in this group, does not have arguments, they all work on the accumulor. CBW specifically converts AL to AX, using sign extension.
CWD (Convert Word to Doubleword)
CWD converts 2-byte value stored in AX into 4-byte value stored on DX and AX, DX has the low bits.
Why not convert to a single doubleword register? Two reasons: firstly, CWD was added before the processor had 32-bit registers, but see CWDE below.
94
Secondly, in some cases this is actually more convenient; see the examples related to division below.
Our example program, written in assembler, would use both of these instructions:
MOV AL,z
CBW
ADD AX,y
CWD
MOV word ptr x,AX
MOV word ptr x[2],DX
The following two instructions did not exist in the original 8086 and were added only in 80386, when 32-bit registers were first introduced:
CWDE (Convert Word to Doubleword Extended)
sign-extends AX into EAX.
CDQ (Convert Doubleword to Quadword)
sign-extends EAX into EDX:EAX (to store a quadword on a single register one would need a 64-register!)
The four forms given above are the most efficient to use and the code should be written in such a way as to have the data to be sign-converted on the accumulator; unsigned data conversion, on the other hand, can be done on other registers:
MOV BL,z
SUB BH,BH ; now BX has zero-extended value of z.
In addition to these instructions, Intel – beginning with 80386 – offers more general forms
95
MOVSX target,source (MOV with Sign eXtension)
MOVZX target,source (MOV with Zero eXtension)
where target and source are either memory or registers, at least one of the operands must be a register ( memory-to-memory operations are not allowed ), and – unlike the usual “size must match” rule, here the target should be larger than the source.
For example:
MOVSX AX,AL
MOVZX EBX,DL
MOVSX word ptr DS:[10],BL
MOVZX EAX,byte ptr DS:[BX]
Notice that the first instruction in the example is equivalent to CBW – in terms of what it does! – but its encoding is 4 bytes vs only 1 and execution is slower. Thus, the accumulator forms are still the most efficient.
Encoding: CBW is encoded as a single byte 0x98, CWD as a single byte 0x99.
Data conversion instructions are not considered to be arithmetic, there- fore no flags are altered.
WARNING: X86 emulator does not support 32-bit instructions, so CWDE, CDQ, MOVSX, MOVZX would be rejected by it. Maybe one day :P
Now, what about conversion from a larger data size to a smaller one? Such situations surely happen, for example:
short a, int b;
a=b;
96
Well, in general this cannot be done: the data range of int is larger than that of short, so not every int value can be correctly represented as a short. Most compilers would issue a warning and proceed by saving only the low word:
MOV EAX,b
MOV a,AX
half of the EAX register is not saved at all! This may or may not work correctly, depending on the value of b, and assembler cannot solve what is impossible to solve. Assembler, however, offers a simple test that checks if the operation will work correctly:
MOV AX,word ptr b % lo half of b
MOV BX,word ptr b+2 % hi half of b
CWD
CMP DX,BX
JNZ error
MOV a,AX
In other words, if the true value of b can be recovered from the lo half, the number fits in the short range. For unsigned types, the check is even simpler
CMP word ptr b+2,0 % hi half of b
JNZ error
MOV AX,word ptr b % lo half of b
MOV a,AX
3.19 Multiplication and Division
These two operations were left to the end, because of the differences from other operations.
One of the differences is the higher cost: multiplication is slower (processer- dependent, but 3 times slower is generally a correct estimate) from other operations; division is yet slower. (compare this with performing the op- eration on paper: addition and subtraction are easy, multiplication is more
97
difficult, and if forgot how unpleasant division is, review your elementary school notes!).
Because of this one should always try to use these operations sparingly, or – if possible – not use them at all. For example, multiplication by a power of 2 should never be done as multiplication, same goes for division by a power of 2. This is far from the only possible savings.
Another problem – specific to multiplication – is the growth of the length of the result. With binary operations studied so far the lenght of the result is the same as the length of the arguments (AND, OR, . . . ) or longer by only one bit which can be saved in carry (ADD, SUB, . . . ). Multi- plication may double the lentgh – this requires a different type of syntax, and different understanding of the “size must match” rule. (With division, we will see yet another form of syntax and a direct violation of the “size must match” rule.)
Further, not every register is capable of doing these operations. The most common/usable forms require the accumulator (or work best on the accumulator).
Finally, multiplication and division provide different operation for signed and unsigned arithmetic.
With this in mind, let’s look at a sample instruction:
IMUL r/m (Integer (signed) MULtiplication)
We notice that while multiplication operation requires two arguments, the format specifies only one. The register or memory operand is multiplied by the accumulator of the same size as the operand. For example:
IMUL CL ; multiply CL by AL (8 bit)
IMUL word ptr DS:[BX]; multiply memory word by AX (16 bit)
IMUL ESI ; multiply ESI by EAX (32 bit)
the result is saved in the extended accumulator, it is AX for 8 bit operands, DX:AX for 16 bit arguments and EDX:EAX for 32 bit arguments. In all cases the length of the extended accumulator is twice the length of the arguments. This assures that the result is computed correctly, but does not
98
suggest what to do with the result if it is too large to work with (64 bit in a 32-bit program, or 128 bit after a 64-bit multiplication that also exists.)
Example:
MOV AL,7
MOV CL,13
IMUL CL ; results in AX being 7*13=91
MOV AL,255
MOV CL,255
IMUL CL ; results in AX being 1.
in the second multiplication the arguments are interpreted as signed num- bers, so we are actually squaring -1!. (Intel opcode names are not always consistent. IMUL uses the word Integer to mean Signed! Unsigned multiplication that we look at next is also an integer multiplication, all the x86 registers are integer!
The unsigned counterpart is
MUL r/m ((unsigned) MULtiplication)
and follows the same format as IMUL. The results, however, may be differ- ent:
MOV AL,7
MOV CL,13
MUL CL ; results in AX being 7*13=91
MOV AL,255
MOV CL,255
MUL CL ; results in AX being 0xFE01 = 65025
We now turn our attention to division. In addition to the already men- tioned, it has one extra feature: division is not always possible. But let us begin with the format:
99
IDIV r/m (Integer (signed) DIVision)
Once again, some of the arguments are implicit, the explicit argument is the Divisor in the operation. The quotent will appear in the accumulator, of the same size as the divisor; the divident is taken from the extended accumulator. For example:
MOV AX,100
MOV BL,7
IDIV BL ; result (14) in AL
...
IDIV word ptr ds:[10]
; DX:AX is divided by memory word, result in AX
...
IDIV ESI ; EDX:EAX is divided by ESI, result in EAX
Division operations compute the remainder, which is stored in the 2nd half of the extended accumular (AH, DX, EDX ), this makes separate MOD operation unnecessary.
DIV r/m ( (unsigned) DIVision)
it the unsigned counterpart of IDIV
Exercise: find inputs for which IDIV and DIV produce different results.
In some cases division cannot be performed and results in an exception (often meaning a program crash or even an OS crash:
SUB CX,CX
IDIV CX
100
division by zero is the primary example of it, the above example will fail regardless of the value of the divident.
In case of a division overflow (division by zero is one example of it, but not the only one), the hardware will call INT 0 (see explanation of interrupts and handlers which is TBW). What happens next fully depends on the installed interrupt handler; the outcome may be an error message and program termination, crash of the operating system, or a normal exit from the program with a error message like your program has performed an illegal operation.
Division overflow may occur with a non-zero divisor too:
MOV AX,1000
MOV CL,2
IDIV CL
While we are dividing by 2, this division cannot be done: 1000/2 is 500, too large for a byte register (AL). The outcome is the same as with division by zero, and in fact some exception handlers may report this as a division by zero – it is not.
In the next example, division result (same 500) is to be placed into 2 byte accumulator.
MOV AX,1000
MOV CX,2
IDIV CX
this appears possible, but an overflow still may occur.
Exercise: Understand why
In general to avoid overflows one should use longer registers, a byte division is very overflow prone.
IMUL, unlike the other instructions in this group, has additional formats, including multiplication by an immediate.
TODO : discuss MULDIV TODO : discuss DIVMOD
101
3.20 LEA : Load effective address
LEA s,t (Load Effective Address)
s=&t;
LEA is not an arithmeric operations sensu stricta, but it is closely related to address computation used in arithmetic instruction.
We begin with a closer look at what arithmetic instructions actually do, for example
ADD AX,[BX+SI+10]
Two parts of the computation are:
1. compute the memory address : BX+SI+10
2. retrieve the data from the computed address and add it to the AX register.
We notice that our sample instruction actually does three additions, not one!
The first part of the computation is exactly what LEA does, and using LEA we can break the addition into two steps:
LEA DI,[BX+SI+10]
ADD AX,[DI]
This is an equivalent, and obviously slower computation – we bring it for illustration purposes only, not a suggestion for coding.
LEA is more restrictive than other 2-operand instruction in terms of the allowed operands: the second operand must be a memory (otherwise the concept of an address does not apply; the first argument is therefore always a register.
LEA, not being an arithmetic instructions, does not affect flags. The size must match rule does not apply to LEA – it actually does not
move data. LEA allows the arguments to be of different sizes, albeit such forms of the instruction are rarely useful. The following are all valid exam- ples with different sizes of the arguments (16 and 32):
102
LEA DI,[BX+SI+3] ; 16 and 16
LEA EDI,[BX+SI+3] ; 32 and 16, 0-extension used
LEA DI,[EAX+EBX+3] ; 16 and 32, truncation
LEA EDI,[EAX+EBX+3] ; 32 and 32
With static addresses, LEA does not accomplish anything more than MOV can do :
msg DB ’Hello, World!’
LEA SI,msg
MOV SI,offset msg
(but notice the difference in the syntax). In this case, MOV is slightly more efficient (3-byte encoding rather than 4) and some assemblers (TASM!) will actually compile LEA as MOV!
With addresses that include registers, LEA can be emulated as two or three instructions:
nums DD 1000 dup(?)
......
LEA AX,nums[BX]
......
MOV AX,offset nums
ADD AX,BX
One important but rarely known detail is that the ability of LEA to per- form “free” additions translates to its ability to do free multiplications, lead- ing to a new fast way to multiply by small numbers.
Recall1 that in 32-bit mode the addresses are in the form
[base*scale+index+offset]
with base being any 32-bit register but ESP, index being any 32-bit register, scale is 1,2,4 or 8.
Now, consider the following example:
1actually this section has not been written!—maybe it will be yet
103
LEA EAX,[EAX+EAX*2]
This is a valid 32-bit addressing instruction, we can use the same regis- ter as the “index” and the “base”. The effect of the instruction is multipli- cation by 3, not surprisingly faster than IMUL. Using
LEA EAX,[EAX+EAX*2]
LEA EAX,[EAX+EAX*4]
to multiply by 15 is also faster than IMUL, despite being two instructions.
LEA EAX,[EAX+EAX*2]
SHL EAX,4
is a faster way to multiply by 12, and so on. For many small numbers there are similar tricks.
3.21 Long integers: ADC and SBB
Integer arithmetic can be performed on numbers that consists of more bits than the register size. This was done already on 8-bit processors in the Dark Ages of CP/M; exactly the same approach works on modern 32- and 64- processors to work with numbers that are yet longer. While the technique is the same, the usability of it is less now since for most computations even 32-bit integers are sufficient.
We will first work out the technique mathematically using 8-bit registers only, while trying to add 16-bit numbers. (The techique is fully scalable, the choice of small register sie is to make the examples easier to understand.)
Consider this calculation:
short x,y,z;
z=x+y;
Using 16-bit registers we can write it in assembler as
; program 1
MOV AX,x
ADD AX,y
MOV z,AX
104
We now assume that the 16-bit registers are not available and try to perform it using 8-bit registers only. How about this code?
; program 2
MOV AL,byte ptr x
ADD AL,byte ptr y
MOV byte ptr z,AL
MOV AH,byte ptr x[1]
ADD AH,byte ptr y[1]
MOV byte ptr z[1],AH
for some values it will indeed work. For example, for x=1234h and y=5678h, the result would be 68ACh, same in both programs:
1234h 12h 34h
+ 5678h + 56h 78h
----- -------
68ACh 68h ACh
For others, the result would different. If x=80h (128 decimal) and y=80h, the 16-bit program will produce 100h (256 decimal) correctly, whereas the 8-bit program above will produce 0: adding 80h with itself on an 8-bit register gives 0!.
0080h 00h 80h
+ 0080h + 00h 80h
----- -------
0100h 00h 00h
The source of the problem is the carry produced by the first 8-bit addi- tion: we are not using it. 80h+80h indeed results in 8 zero bits, but also a carry, indicating an extra one to be added to the 9th bit – but we are ignoring it.
It is actually possible to correct the above program without introducing new instructions:
; program 3
MOV AL,byte ptr x
ADD AL,byte ptr y
105
MOV byte ptr z,AL
MOV AH,byte ptr x[1]
JNC skip
INC AH
skip:
ADD AH,byte ptr y[1]
MOV byte ptr z[1],AH
(the INC instruction adds the previously ignored carry), but the Intel in- struction set has a shortcut that would result in a faster code:
; program 4
MOV AL,byte ptr x
ADD AL,byte ptr y
MOV byte ptr z,AL
MOV AH,byte ptr x[1]
ADC AH,byte ptr y[1]
MOV byte ptr z[1],AH
The new ADC instruction adds the 2nd argument and the carry to the 1st argument – this is exactly the computation needed.
ADC t,s; t=t+s+carry (ADd with Carry)
Except for the addition of the carry, ADC follows exactly the same rules as ADD, including the syntax and setting of the flags.
We now scale the example: how about adding 128-bit numbers? This cannot be done with a single addition on any of the current processors – you would need 128-bit registers for that!
In this example we will assume that x occupies 16 bytes (128 = 16 × 8) of memory beginning with address 100, y and z (same size both) are located at 200 and 300. We further assume that the bytes in out 128-bit integers are written in the Intel reversed order. The objective is, as before, to compute z=x+y.
How about this?
106
; program 5
MOV EAX,dword ptr x
ADD EAX,dword ptr y
MOV dword ptr z,EAX
MOV EAX,dword ptr x[4]
ADC EAX,dword ptr y[4]
MOV dword ptr z[4],EAX
MOV EAX,dword ptr x[8]
ADC EAX,dword ptr y[8]
MOV dword ptr z[8],EAX
MOV EAX,dword ptr x[12]
ADC EAX,dword ptr y[12]
MOV dword ptr z[12],EAX
notice that this computation could have been done using only two 64-bit additions, or using 8 16-bit additions, or using 16 8-bit additions. The code would look similar in all cases; naturally the code using longer registers will have fewer instructions and would run faster.
The code shown in the previous program can be extended to handle interegs of any size. It may be preferrable, however, to write this code as a loop. We notice that all the editions are done using the ADC instruction, except for the very first one, done by ADD – this is because there is no carry on the very first addition. We can use ADC for all additions if we are certain that the initial value of the carry flag is 0. The loop form of the previous program thus becomes:
; program final??
CLC ; clear carry
MOV CX,4
SUB BX,BX
L: MOV EAX,dword ptr x[BX]
ADC EAX,dword ptr y[BX]
MOV dword ptr z[BX],EAX
ADD BX,4
LOOP L
NOTE: the program above should be seen just an idea; to make it work one needs to ensure that the carry flag is not corrupted by other instruc- tions! – this is left to the reader.
The new instruction
107
CLC (CLear Carry)
sets carry to 0, thus allowing us to use ADC for the first loop iteration. We may as well introduce two other instructions that also alter the carry
flag.
STC (SeT Carry)
set carry to 1
CMC (CompleMent Carr)
toggle the carry bit Yet longer numbers can be handled with the code above by simply in-
creasing the number of times the loop is executed. The execution time will be proportional to O(n/r) where n is the bitlength of the numbers, and r is the bitlength of the register.
A similar problem for subtraction will not be examined in details. We shall only say that the analog of ADC for subtraction is the SBB instruction:
SBB t,s; t=t-s-carry (SuBtract with Borrow)
and all of the above examples can be adapted to subtraction by changing all ADD’s to SUB’s and all ADC’s to SBB’s.
108
3.21.1 24-bit case
As a special example, consider 24-bit (3 byte) numbers.
Exercise: Why would one even want to look at such?
Addition of such numbers can be done as three byte size additions or as one byte size and one word size.
; program 6
x db 3 dup(?)
y db 3 dup(?)
....
MOV AL,byte ptr x[0]
ADD AL,byte ptr y[0]
MOV byte ptr z[0],AL
MOV AL,byte ptr x[1]
ADC AL,byte ptr y[1]
MOV byte ptr z[1],AL
MOV AL,byte ptr x[2]
ADC AL,byte ptr y[2]
MOV byte ptr z[2],AL
Notice that even in column addition of decimal numbers, chunks do not need to be of the same size:
923
+ 278
---
we can add 23+78, obtaining 01 and carry of 1! In programs below we break the numbers into byte+word and word+byte
chunks.
109
; program 6a
x db 3 dup(?)
y db 3 dup(?)
....
MOV AL,byte ptr x[0]
ADD AL,byte ptr y[0]
MOV byte ptr z[0],AL
MOV AX,word ptr x[1]
ADC AX,word ptr y[1]
MOV word ptr z[1],AL
; program 6b
x db 3 dup(?)
y db 3 dup(?)
....
MOV AX,word ptr x[0]
ADD AX,word ptr y[0]
MOV word ptr z[0],AX
MOV AL,byte ptr x[2]
ADC AL,byte ptr y[2]
MOV byte ptr z[2],AL
Exercise: Are 6a and 6b equivalent?
3.21.2 other operations
• Bitwise AND, OR, XOR, NOT simply repeat the same operation on each chunk (loops for long data).
• Shifts propagate carry.
• NEG is left as a (nice!) exercise.
• (I)MUL is more complicated
• (I)DIV is even more complicated
110
The idea of long MUL: Assume x,y,z is twice the size of the register that is capable of multipli-
cation (N-bit long). Then x=xh ×R + xl, y=yh ×R + yl, where xl,yl are the low halves of
the values, xh and yh are the high halves, R is 2N.
x∗y = (xh∗R+xl)∗(yh∗R+yl) = xhyh∗R2 + (xhyl +xlyh)∗R+xlyl)
The formula contains 4 multiplications (multiplications by R are simply data movements)
Assume now that x and y are 64 bit, z is 128 bit.
x dd ?,?
y dd ?,?
z dd ?,?,?,?
SUB EAX,EAX
MOV z[8],EAX
MOV z[12],EAX
MOV EAX,x[0]
MUL y[0]
MOV z[0],EAX
MOV z[4],EDX
MOV EAX,x[0]
MUL y[4]
ADD z[4],EAX
ADC z[8],EDX
; ADC z[12],0
MOV EAX,x[4]
MUL y[0]
ADD z[4],EAX
ADC z[8],EDX
ADC z[12],0
111
MOV EAX,x[4]
MUL EDX
ADD z[8],EAX
ADC z[12],EDX
This seems to required 4 multiplications (and generally m*N numbers would require m2 multiplications); in reality only 3 are needed. (Toom’s algorith – google for more)
Idea:
x∗y = (xh∗R+xl)∗(yh∗R+yl) = xhyh∗R2 + (xhyl +xlyh)∗R+xlyl)
requires 4 multiplications, but
x∗y = (xh∗R+xl)∗(yh∗R+yl) = xhyh∗R2+[(xh+xl)∗(yh+yl)−xhyh−xlyl]∗R+xlyl)
requires only 3:
• xh ∗yh
• xl ∗yl
• (xh + xl) ∗ (yh + yl)
Division algorithms
112
Chapter 4
Jumps
4.1 Unconditional Jumps
JMP label (unconditionally JuMP (goto) to label)
where label is any identifier, not a keyword, unique. Some scope rules apply.
JMP lab
lab:
JMP lab
(Jumps can go both forward and backward, multiple JMP instructions may target the same label).
l: JMP l
(fine, but infinite loop)
113
NOTE: label is a keyword, so this id cannot be used. In fact, the colon : is an abbreviation for label near.
Other forms of JMP syntax exist, including
JMP r/m (jump(goto) to address stored in r/m)
Only a trivial example for now:
MOV AX,offset lab
JMP AX
....
lab:
(offset is a keyword, cf. the “take address” & operator in C).
114
4.2 Conditional Jumps
J<cc> label (Jump based on cond code)
General idea: conditional jumps implement an equivalent of
if (<condition>) goto lab;
where <condition> is usually expressed in flags. NOTE: of course, this is exactly the programming style not recommended
for high-level languages!
115
4.2.1 Using Zero Flag
if (ZeroFlag) goto lab;
corresponds to an assembler instruction, JZ.
JZ label (Jump if Zero flag)
Example:
; int x;
;
; if (x!=0)
; { <BODY> }
;
CMP x,0
JZ skip
; <BODY>
skip:
if x is already on a register, CMP may not be needed.
; int x,y,z;
;
; if (x+y+z!=0)
; { <BODY> }
;
MOV EAX,x
ADD EAX,y
ADD EAX,z
JZ skip
; <BODY>
skip:
116
JE label (Jump if equal)
is a synomym of JZ
; int x,y;
;
; if (x!=y)
; { <BODY> }
;
MOV EAX,x
SUB EAX,y
JE skip
; <BODY>
skip:
JNZ label (Jump if not Zero flag)
JNE label (Jump if not Equal)
; int x;
;
; if (x==0)
; { <BODY> }
;
MOV EAX,x
CMP EAX,0
JNZ skip
; <BODY>
skip:
117
; int x,y;
;
; if (x==y)
; { <BODY> }
;
MOV EAX,x
SUB EAX,y
JNE skip
; <BODY>
skip:
NOTE: The four (actually, two) instructions JZ, JE, JNZ, JNE are based on the Zero Flag only.
NOTE: The N “prefix” in the instruction name is used with other instruc- tions.
NOTE: Conditions in conditional jumps are always reversed: we specify skipping, not doing!
118
4.2.2 Using Sign Flag
JS label (Jump if Sign)
JNS label (Jump if Not Sign)
; int x,y;
;
; if (x<0)
; { <BODY> }
;
CMP x,0
JNS skip
; <BODY>
skip:
119
JL label (Jump if Less)
JLE label (Jump if Less or Equal)
JG label (Jump if Greater)
JGE label (Jump if Greater or Equal)
JNL label (Jump if Not Less)
JNLE label (Jump if Not Less or Equal)
JNG label (Jump if Not Greater)
JNGE label (Jump if Not Greater or Equal)
(There are only 4 different instructions above!)
120
; int x,y;
;
; if (x<=y)
; { <BODY> }
;
MOV EAX,x
SUB EAX,y
JNLE skip ; reverse Less-or-Equal cond
; <BODY>
skip:
121
WARNING: the explanation below is simplified, actually JS and JL are different, and the Overflow flag plays a role. For nearly all practical code the results are the same. For more details (and true picture) check this.
Why this extra mess? Because CMP may overflow!
How does this work? Primary instructions: JL (==JNGE), JLE (==JNG), JNL (==JGE), JNLE
(==JG). We assume JL ≈ JS. Then
JL jumps if sign flag is set
JLE jumps if sign or zero flag is set
JNL jumps if sign flag is not set
JNLE jumps if neither sign or zero flag is set
122
4.2.3 A few examples
Example: implement ABS function
abs(x) = {−x if x < 0
x if x ≥ 0 Buggy!
; EAX=abs(EAX)
OR EAX,EAX
JNZ done
NEG EAX
done:
Fix the error above! The correct code is
; EAX=abs(EAX)
OR EAX,EAX
JNS done
NEG EAX
done:
Example: implement SIGN function (Sign definition:
sign(x) =
{−1 if x < 0 0 if x = 0 1 if x > 0
Actually
∀x,x = abs(x) × sign(x)
; EAX=sign(EAX)
OR EAX,EAX
JZ done
123
MOV EAX,1
JG done
NEG EAX
done:
Example: What does this do?
MOV CX,10
SUB AX,AX
l: ADD AX,CX
DEC CX
JNZ l
124
These instructions are not suitable for unsigned arithmetic! Example
MOV AL,-1
CMP AL,1
JL lab
...
lab:
This works correctly, JMP is taken
MOV AL,255
CMP AL,1
JL lab
...
lab:
Counter-intuitively, the jump is taken, this is because 255 = −1 and is interpreted as such!
Correct way to program this is
MOV AL,255
CMP AL,1
JB lab
...
lab:
4.2.4 Using Carry Flag
JC label (Jump if Carry)
JNC label (Jump if Not Carry)
125
MOV EAX,x
ADD EAX,y
JC error ; overflow !
...
error:
126
JB label (Jump if Below)
JBE label (Jump if Below or Equal)
JA label (Jump if Above)
JAE label (Jump if Above or Equal)
JNB label (Jump if Not Below)
JNBE label (Jump if Not Below or Equal)
JNA label (Jump if Not Above)
JNAE label (Jump if Not Above or Equal)
Only 4 different instructions here, further, JB==JC, JNB==JNC.
127
Exercise: Work out how the flags are used
4.2.5 Mixing signed and unsigned
Consider two examples:
//example S
int x,y;
if (x<y) ....
//example U
unsigned int x,y;
if (x<y) ....
In Example S, compiler will generate a CMP instruction, followed by an inverted signed jump, specifically here JNL.
In Example U, compiler will generate a CMP instruction, followed by an inverted unsigned jump, specifically here JNB.
This makes it easy for the user — he does not need to think which instruction to use. But there is a problem:
//example M
int x; unsigned y;
if (x<y) ....
Using either signed or unsigned instruction will work incorrectly in some cases! Compilers would typically issue a warning (f.e. “signed un- signed mismatch”, often ignored by the programmer.
How to do this correctly? Let’s consider a smaller case first.
128
//example MB
unsigned char x, /*signed*/ char y;
if (x<y) ....
8-bit comparison will not work. But conversion into a range that in- cludes both char and unsigned char will:
//example MB1
unsigned char x, /*signed*/ char y;
short sx,sy;
sx=x; sy=y;
if (sx<sy) ....
or, simply:
//example MB1
unsigned char x, /*signed*/ char y;
if ((short)x<(short)y) ....
Since short is a signed datatype, generated instruction will be signed. When comparing signed and unsigned shorts, we can convert both to
int (in fact, we could have used int above too). But what do we do when comparing signed and unsigned int’s?
In assembler, the above is of course also doable, but so is the 32-bit comparison!
(The code cannot be shown at this time : we bypassed conversion in- structions).
In C a proper comparison function can be written too.
129
4.3 Jumps encoding
opc del
where opc is the opcode (0x74 for JE, 0x78 for JS,...) and del is a signed byte.
IP is first incremented to point to the next instruction, then, if the jump is taken, IP=IP+del;
The short form of JMP is encoded similarly, using the opcode byte 0xEB
0xEB del
The near form of JMP uses opcode 0xE9 and 2-byte (16-bit segs) or 4- byte (32-bit segs) delta.
16 bit seg : 0xEB d1 d2
32 bit seg : 0xE9 d1 d2 d3 d4
If conditional jump cannot reach its target, one can overjump:
JZ lab
; more than 128 bytes
lab:
may not compile, but
JNZ temp
JMP lab
temp:
; more than 128 bytes
lab:
130
will.
80386+ has alternative longer forms of conditional jumps instructions, with 16-bit delta, they begin with bytes 0Fh 8?h ..
4.4 LOOPS
Revisiting the previously seen example:
MOV CX,10
SUB AX,AX
l: ADD AX,CX
DEC CX
JNZ l
We notice that adding numbers 10 down to 1 is more efficient than doing 1 to 10.
131
MOV CX,1
SUB AX,AX
l: ADD AX,CX
INC CX
CMP CX,11 ; EXTRA
JNZ l
Generally, running loops toward 0 is more efficient!. We next notice that we actually implemented a do while loop.
; cx=1; ax=0; do { ax=ax+cx; cx++; } while (cx<=10);
What about a while ?
; cx=1; ax=0; while (cx<=10) {ax=ax+cx; cx++; }
MOV CX,1
SUB AX,AX
l: CMP CX,10 ; EXTRA
JA done
ADD AX,CX
INC CX
JMP l ; EXTRA
done:
Each execution of while in n iteration loop requires 2n jumps vs n for do while.
Generally, do-while is more efficient than while!.
We will not consider for – generally equivalent to while. Returning to
MOV CX,10
SUB AX,AX
l: ADD AX,CX
DEC CX
JNZ l
This is a very common form of a loop, and Intel allows to optimize this further.
132
LOOP label (CX–; if (CX>0) JMP label)
Thus:
MOV CX,10
SUB AX,AX
l: ADD AX,CX
LOOP l
for one instruction fewer!
We can generalize this into
MOV CX,n
SUB AX,AX
l: ADD AX,CX
LOOP l
But what will happen if supplied n is 0 ?
One solution:
MOV CX,n
SUB AX,AX
OR CX,CX
JZ done
l: ADD AX,CX
LOOP l
done:
Again, Intel provides a shortcut,
133
JCXZ label (if (CX==0) JMP label)
Thus:
MOV CX,n
SUB AX,AX
JCXZ done
l: ADD AX,CX
LOOP l
done:
LOOP and JCXZ are encoded similarly to conditional jumps. LOOP’s op- code is 0E2h, JCXZ’s is 0xE3.
Mentioning only: there are also less commonly needed LOOPZ and LOOPNZ; All four instructions can work off ECX.
Double LOOPs should save CX!
MOV CX,n
L1:
....
**SAVE** CX
MOV CX,m
L2:
....
LOOP L2
**RESTORE** CX
LOOP L1
SAVE/RESTORE can be done using another reg (for example, XCHG CX,DX), memory, or stack (usually the best).
134
One more “real” example:
MOV CX,100
MOV AX,1
MOV BX,1
L: ; print AX
ADD BX,AX
XCHG BX,AX
LOOP L
4.5 Control statements templates
; goto lab;
;
JMP <lab>
lab:
; if <COND> <BODY>
;
; convert condition into a flag
JN<cond> skip
<BODY>
skip:
135
; if <COND> <BODY1> ;else <BODY2>
;
; convert condition into a flag
JN<cond> lelse
<BODY1>
JMP done
lelse:
<BODY2>
done:
; while (true) <BODY>
; for (;;) <BODY>
lagain:
<BODY>
JMP lagain
ldone: ; may be used for break in <BODY>
; while <COND> <BODY>
;
lagain:
; convert condition into a flag
JN<cond> ldone:
<BODY>
136
JMP lagain
ldone:
; do <BODY> while <COND>;
;
lagain:
<BODY>
; convert condition into a flag
J<cond> lagain
; while <COND> { <BODY1> break; <BODY2>}
;
lagain:
; convert condition into a flag
JN<cond> ldone:
<BODY1>
JMP ldone
<BODY2>
JMP lagain
ldone:
137
; while <COND> { <BODY1> continue; <BODY2>}
;
lagain:
; convert condition into a flag
JN<cond> ldone:
<BODY1>
JMP lcont
<BODY2>
lcont:
JMP lagain
ldone:
NOT covered: for, switch, and how to deal with more complex expres- sions (see “complete and shortcut” section).
138
Chapter 5
Variables
5.1 Declaring variables
Assembler language includes operators for declaring variables. The primitive data declarations operators correspond to the primitive
types in the assembler language:
139
[name] DB value[s] (Declare Byte)
[name] DW value[s] (Declare Word)
[name] DD value[s] (Declare Doubleword)
[name] DQ value[s] (Declare Qword)
[name] DT value[s] (Declare Tenbyte)
The [name] field of the declaration should be a unique identifier; this field is usually present but not required. The [name] field is used to refer to the variable and if absent, there is no way to refer to it.
The value field refers to the initial value of the variable, this field is always required.
Examples:
year dw 2020
a1 db ’a’ ; as char
a2 db 97 ; as decimal
a3 db 61h ; as hex
a4 db 01100001b ; as binary
dwrd dd 12345678
notice that a1,a2,a3 and a4 all provide the same initial value, in different
140
formats. ’a’ is 97 since the character ’a’ is in the 97th position in the ASCII table.
Multiple initialization values are allowed, in such cases the declared variable is actually an array:
primes DW 2,3,5,7,11,13
str1 DB ’H’,’e’,’l’,’l’,’o’
str2 DB ’Hello’
str3 DB "H",’e’,"L","L",’o’
str4 DB "HELLO"
str5 DB ’HE’,"LLO"
All five strings above contain identical data: both single of double quotes are allowed – but single quotes must match single quotes and double quotes must match double. Further, strings can be enteres either as strings or as sequences of characters.
s DB ’Hello, World’,0
shows a zero-terminated string (C language string format). One would use single quotes to enter a string that includes double
quotes and vice versa:
m DB ’"Hello", he said’
and will have to break the string into parts if it includes both single and double quotes
m DB ’"Don’,"’",’t", he said’
Long sequence of initialization values can be broken into multiple dec- larations:
msg DB "ERROR: you must not use"
DB "this operator the way you do"
DB "please delete the program and"
DB ’read the textbook’,0
141
in the example above we do not name the lines after the first – the program will not refer to them.
Often, one needs to calculate the length of a string like msg above. This can be done during the run time, by counting the characters up to the terminator, or during compilation:
msg DB "ERROR: you must not use"
DB "this operator the way you do"
DB "please delete the program and"
DB ’read the textbook’
msgend DB 0
msglen DW offset msgend-offset msg
....
MOV CX,offset msgend-offset msg
The offset operator is similar to the address operator & in C/C++. offset msgend-offset msg is a constant, computed during the compile time, it is equal to the number of the characters in the message, excluding the terminating null character. With this value known we actually do not need the null terminator at all and can instead use
msg DB "ERROR: you must not use"
DB "this operator the way you do"
DB "please delete the program and"
DB ’read the textbook’
msgend LABEL BYTE
msglen DW offset msgend-offset msg
....
MOV CX,offset msgend-offset msg
where LABEL declares a byte type variable with NO initial values. LABEL is used instead of DB since the latter will always allocate at least one byte. Other types that may follow LABEL include WORD, DWORD, QWORD, TENBYTE.
To specify an uninitialized variable, use “?”:
x DD ?
y DD ?
142
Declared variable can be referred to in code:
x DW ?
y DW ?
z DW ?
.....
MOV AX,x
ADD AX,y
MOV z,AX
Notice that declared variables have size and the usual size rules apply:
MOV x,10 ; OK, size known from x
MOV AL,x ; error, size mismatch
MOV AL,byte ptr x ; OK, size casted to byte
The dup operator can be used to specify large arrays without having to list all the initial values:
ar1 DW 1,1,1,1,1
ar2 DW 5 dup(1)
two arrays above contain exactly the same data. More practically useful example would be
ar3 DW 1000 dup(?)
– a 1000-element unitialized array.
Initialized and unitialized entries can be used within the same declara- tion:
aa DW 1,2,?,?,5,6
fib DW 1,1,998 dup(?)
the fib array will be used to compute the Fibonacci numbers below. First two elements are initialized statically, the subsequent entries will be com- puted.
143
Data declarations are not executable instructions and in general one should separate them from code so they do not get executed. This is usually accomplished by placing them into a separate segment. If there is only one segment (.com files), then the typical approach is to structure the program as follows:
start: ; entry point of the program
JMP real_start
.....
data and procedure declarations
.....
real_start:
; code begins here
because of overjumping the data, it will never get executed. An exception to the above is the case when we know the exactly which
instructions the data corresponds to. For example,
MOV AX,BX
DB 90h
ADD SI,DI
is totally safe, 90h is the NOP operation! While there is no good reason to code NOP as data, similar approach can be used to enter an instruction that is not supported by the assembler compiler, or a form of the instruction assembler will not produce.
There are two different ways to encode ADD BX,CX as machine code. One of the is normally produced by the assembler compiler, the other can only be entered using db’s.
Exercise: Do it.
5.2 Using arrays
This section contains several examples of common snippets of code. Example 1: initialize a 256-element byte array to contain all possible
256 characters in ascending order.
144
ba DB 256 dup(?)
.....
SUB BX,BX
MOV CX,256
l: MOV ba[BX],BL
INC BX
LOOP l
notice that we use the same register as both index and the current value to be stored; this is possible only because of the byte size of the data.
Example 2: initialize a 1000-element word array to contain numbers 0 through 999 in ascending order.
nums DW 1000 dup(?)
.....
SUB AX,AX
SUB BX,BX
MOV CX,1000
l: MOV nums[BX],AX
INC AX
ADD BX,2
LOOP l
Exercise: Explain what would happen if BX is incremented only by 1.
Generally, the index in loops like above needs to be incremented by the byte size of the data – the index counts bytes, and not elements. One very important implication is that in high language loops of this type there is a hidden multiplication by the element size, depending on the compiler it may or may not be implemented as a multiplication.
Consider two ways to implement the above code in C :
short s[1000]; int i;
for (i=0; i<1000; i++)
s[i]=i;
145
this code has a hidden multiplication by 2, index i needs to be translated into an offset to compute the address of s[i]. The other way to code the example
short s[1000]; int i; short *p=s;
for (i=0; i<1000; i++)
*p++=i;
does not have a hidden multiplication and often would result in better com- piled code.
Example 3: for collection, let us also initialize a 1000-element double- word (int) array to contain numbers 0 through 999 in ascending order.
nums DD 1000 dup(?)
.....
SUB EAX,EAX
SUB BX,BX
MOV CX,1000
l: MOV nums[BX],EAX
INC EAX
ADD BX,4
LOOP l
Example 4: Compute sum and average of the elements in a 1000- element word array (in this and subsequent examples we assume nums to be initialized with some values first.)
nums DW 1000 dup(?)
sum DW ?
ave DW ?
.....
SUB AX,AX
SUB BX,BX
MOV CX,1000
l: ADD AX,nums[BX]
ADD BX,2
LOOP l
MOV sum,AX
CWD
146
MOV BX,1000
IDIV BX
MOV ave,AX
(note that the answer will be rounded down to an integer) Example 5: Find the largest element in an array of signed numbers
nums DW 1000 dup(?)
max DW ?
.....
MOV AX,nums[0]
MOV BX,2
MOV CX,999
l: CMP AX,nums[BX]
JGE skip ; use JAE for unsigned
MOV AX,nums[BX]
skip:ADD BX,2
LOOP l
MOV max,AX
Use JLE and JBE if searching for the smallest entry. Example 6: Compute Fibonacci numbers
fibs DW 1,1,998 dup(?)
.....
MOV BX,4
MOV CX,998
l: MOV AX,fibs[BX-4]
ADD AX,fibs[BX-2]
MOV fibs[BX],AX
ADD BX,2
LOOP l
147
Exercise: Would all the values be computed correctly?
Example 7: Find value t in an array of numbers
nums DW 1000 dup(?)
max DW ?
......
MOV AX,t
SUB BX,BX
MOV CX,1000
l: CMP AX,nums[BX]
JE found
ADD BX,2
LOOP l
not_found:
.....
found:
.....
A more interesting program would be a binary search... it not all that difficult in assembler. We assume nums contains signed numbers in ascend- ing order.
nums DW 1000 dup(?)
......
MOV AX,t
MOV SI,0 ; left interval bound
MOV DI,999*2 ;right interval bound
l: CMP SI,DI
JA not_found
148
MOV BX,SI
ADD BX,DI
SHR BX,2
SHL BX,1 ; middle point
CMP t,nums[BX]
JE found
JG right
left:
MOV DI,BX
SUB DI,2
JMP l
right:
MOV SI,BX
ADD SI,2
JMP l
....
found:
....
not_found:
....
Exercise: Reverse the elements in an word array.
Exercise: Implement bubble sort.
149
5.3 Memory addressing syntax
16 bit addressing offers only the following nine schemes of entering the offset
• const
• BX+const
• BP+const (***)
• SI+const
• DI+const
• BX+SI+const
• BX+DI+const
• BP+SI+const (***)
• BP+DI+const (***)
Notes: How to remember: Base, or Index, or Both, or Neither
Double register addressing allows to efficiently access two dimensional array, but more often is used to access dynamic one-dimensional arrays (base points to the beginning of an array, index “indexes” it.)
This limitation allows for a compact encoding of the addressing scheme. Despite nine choices offered, only three bits are needed.
Options marked with (***) imply SS: segment use (others default to DS:).
150
5.4 Video memory in text mode.
For the default text video mode, the video memory begins in the middle of the B band, corresponding to the segment value B800. (the reasons are historical).
Let us begin with an example showing the idea, we will specify the exact rules later.
MOV AX,0B800h
MOV ES,AX ; ES=>video!
MOV byte ptr ES:[0],’A’
INT 20h ; terminate the program
If compiled (you would need to add extra lines to the file per assembler language syntax requirements) and run this program will deposit letter A into the upper left corner of the screen!
The general layout of video memory is the following: The screen has 25 rows and 80 columns of characters; each cell cor-
responds to two bytes in the video memory. The even bytes (0, 2, 4, ...) are the ASCII characters, the odd bytes define the attributes (colors) of the character in the preceding even address. Let us provide an example of this:
MOV AX,0B800h
MOV ES,AX ; ES=>video!
MOV byte ptr ES:[0],’A’
MOV byte ptr ES:[1],71h
INT 20h ; terminate the program
151
71h defines the attributes of the letter A on screen. The first nibble (7) is the background color, the second (1) is the foreground color: A would appear as blue on white, following these definitions:
• 0 – black (0x000000)
• 1 – blue (0x0000AA)
• 2 – green (0x00AA00)
• 3 – cyan (0x00AAAA)
• 4 – red (0xAA0000)
• 5 – magenta (0xAA00AA)
• 6 – brown (0xAA5500)
• 7 – white/light gray (0xAAAAAA)
• 8 – (dark) gray (0x555555)
• 9 – bright blue (0x5555FF)
• 10 – bright green (0x55FF55)
• 11 – bright cyan (0x55FFFF)
• 12 – bright red (0xFF5555)
• 13 – bright magenta (0xFF55FF)
• 14 – yellow (0xFFFF55)
• 15 – bright white (0xFFFFFF)
(In some modes the high bit of the background color indicates blinking rather than color, other variations of interpretation exist).
While MOV is the most common instruction to change video memory, any other instruction can be used. Consider:
152
MOV AX,0B800h
MOV ES,AX ; ES=>video!
MOV byte ptr ES:[0],’A’
MOV CX,256
L: INC byte ptr ES:[0]
LOOP L
INT 20h ; terminate the program
The program above will show A, then change it to B,.... eventually coming back to an A.
Will you be actually able to see the letters changing onscreen? Well.. no! This will happen too fast to notice.
Exercise: Fix it
Writing to video memory is the best way of doing full-screen output, most suitable for tables and games; other methods are used to produce scrollable console output (see INT 21h).
Assuming that the rows are numbered up-to-down 0 through 24, and the columns are numbered left-to-right 0 through 79, to write at position (X,Y) we should store information at offset 160 ∗ Y + 2 ∗ X in the video segment.
153
Chapter 6
Project #1
Due Date: 04/19/2021 Goal: implement Game of 1024. This a new simple board game(rather a puzzle), you can play the online
version Here. Project specifics:
1. Use direct write to the video memory (seg 0xB800).
2. Output will be explained(today), input can be done using int 16h (look up).
3. Recommendation to use color and line draw characters (2nd half of the ASCII OEM set)
4. Projects are individual and should not be copies of source code found elsewhere; such submissions will not be accepted.
5. You do not need to follow the layout exactly, shortcuts that simplify programming but essentially keep the game the same are fine.
154
Basic grading guidelines:
• C something works (but perhaps not really playable)
• B playable with glitches/deficiencies.
• A enjoyable to play (well... to the degree the entire idea of such puzzle is)
6.0.1 Program outline
Hopefully, of some help – this is an overall structure of the program one can use.
0 Initialize the program, display.
1 Initialize the configuration
2 Display the board
3 Wait for a key (int 16h)
4 If the key indicates a move of a piece, update the configuration, go to step 2.
5 .If the key indicates new game, go to step 1.
6 If the key indicates quit, quit.
7 Ignore the key, go to step 3.
6.0.2 Submission
Submit the source (.asm) and the executable (.com or .exe), ideally by email.
Please not that some email providers kill attachments of certain types. Google is likely to reject .com or .exe, CCNY email does not like .asm.
To avoid problems : rename the files to give them “safe” extension, then zip(or rar) files together. Please do not name the archive “project.zip”, rather use your name to name the archive.
(Example: rename project.com to project.com.txt)
155
When submitting, CC: yourself – this way you would see that the project does come through.
Submission can be made either my CCNY email or to mv@panix.com (the latter address does not kill any attachments!).
Submission is your responsibility.
156
6.1 How to compile and run a program.
Assembler distribution (Download, unpack, make sure it works. The distribution contains DOS-
BOX, TASM, TLINK). At this time we only need to know how to create a binary from an
assembler program, a sample HELLO.ASM is provided.
.model tiny
.code
org 100h
start:
; BEGIN BODY
mov dx, offset hello
mov ah, 9
int 21h
mov ah, 4ch
int 21h
hello db ’Hello, world.’,13,10,’$’
; END BODY
end start
The part between BEGIN BODY and END BODY is replaced by your own code.
To compile: (assuming you are in the directory where hello.asm is unpacked; under DOSBOX or 32-bit OS)
C> BIN\TASM hello
This will create file HELLO.OBJ. To link
C>BIN\TLINK hello /t
157
This will create file HELLO.COM. To run
C>Hello
This will say (guess what?) This output method (console write) will not work right for full-screen
apps, write into video directly, instead:
.model tiny
.code
org 100h
start: jmp real_start
; place your variables and/or procedures here.
real_start:
; BEGIN BODY
MOV AX,0B800h
MOV ES,AX ; ES=>video!
MOV byte ptr ES:[0],’A’
MOV CX,256
L: INC byte ptr ES:[0]
LOOP L
INT 20h ; terminate the program
end start
158
Chapter 7
Stack and procedures
Hardware stack discussed in the chapter is not a necessary feature of a hardware, many earlier designs did not provide it. However, having it helps in many ways, most notably in the ability to implement procedures efficiently.
7.1 Using stack
We will begin working with the stack by using it. For this section it is not important just how it works psysically, we only state that there is – somewhere in memory – a data structure, implemented in hardware, that parallels the stack software data structure. Namely, we can push things on the stack, we can pop them back, and the operations work in the LIFO fashion: Last In, First Out.
Let us begin with the syntax:
PUSH w/d m/r (PUSH word or doubleword)
POP w/d m/r (POP word or doubleword)
Notice that byte operations are not supported. Further, notice that push-
159
ing or poping memory actually results in a memory-to-memory operation! These stack operations are one of the exceptions to the general no memory- to-memory rule.
PUSH AX
....
POP AX
The example above shows saving a register on the stack and later restor- ing it. This would be helpful if the register – and we always are short on them – is needed for some other purpose in the code between PUSH and POP. This situation is very common and use of PUSH+POP tends to be shorter and more efficient than any alternative solution (saving a temporary variable, recomputing AX to restore the old value,...)
The code above will work if the omitted part does not have any stack operations, or if they are balanced. For example,
PUSH AX
POP AX
is fine, while
PUSH AX
PUSH CX
POP AX
or
PUSH AX
POP CX
POP AX
will not work right and – as will see later – will quite probably lead to a program crash.
A more sophisticated example is a typical double loop. We would like to execute a piece of code N times and then the entire loop M times. Irrespec- tively of exactly what we want to do, the optimal implementation would use LOOP instruction and thus look like this:
160
MOV CX,M
L1:
...
MOV CX,N
L2:
...
LOOP L2
...
LOOP L1
As written this code will not work correctly since the inner loop resets the value of the counter for the other loop! We therefore change it to
MOV CX,M
L1:
...
**save** CX
MOV CX,N
L2:
...
LOOP L2
**restore** CX
...
LOOP L1
**save** and **restore** are not assembler operations, just indica- tions of action to undertake. They can be implemented in multiple ways, PUSH and POP are generally the easiest implementation:
MOV CX,M
L1:
...
PUSH CX
MOV CX,N
L2:
...
LOOP L2
POP CX
...
LOOP L1
161
but something different can be done, for example save in another register:
MOV CX,M
L1:
...
MOV DX,CX
MOV CX,N
L2:
...
LOOP L2
MOV CX,DX
...
LOOP L1
this is actually a bit faster than using the stack, but one may not have an available register.
When saving multiple registers (or memory variables), one retrieves them in the reverse order:
PUSH AX
PUSH BX
PUSH CX
...
POP CX
POP BX
POP AX
Consider an example where we don’t follow the LIFO ordering:
PUSH BX
PUSH AX
POP BX
POP AX
This code would restore BX with the value previously in AX and vice versa, in effect this example is equivalent to
XCHG AX,BX
162
and in this specific case using single XCHG is obviously much better. How- ever, this would not be the case with segment registers, also supported by PUSH and POP:
PUSH DS
PUSH ES
POP DS
POP ES
there is no XCHG alternative, since XCHG does not allow segment registers as arguments!
One can also push and pop the FLAGS register, albeit the syntax is a bit different since FLAGS is not a keyword.
PUSHF (PUSH Flags register)
POPF (POP Flags register)
These instruction can be used to examine or change the contents of the FLAGS register. Consider:
PUSHF
POP AX ; now AX has all the bits of the FLAGS
OR AL,1 ; change the last bit of flags
PUSH AX
POPF ; now FLAGS are changed -- its last bit is set to 1.
What did we do exactly? To get the answer we need to know just what the changed bit of FLAGS is. The last bit happens to be the Carry Flag, and our code set it to one.
Using five instructions for just this is an overkill, STC would have done the same, but we do not have instructions like STC for other flags! – and to work with them we would have to be use code similar to the above.
(Related instructions SAHF and LAHF are not discussed here.)
163
PUSH immed16 (PUSH immediate)
This form was added in 80186. There is no POP counterpart. 80386 and newer processors allow the immediate argument to be a
doubleword, adding new syntax:
push 123h ; push word
push dword 123456h ; push dword
while pushing doubleword can be accomplished with a single instruc- tion (prefixed by 066h), some assemblers compile it as two word-size PUSH instructions.
Noticing that many programs include many sequences of PUSH register or POP register instructions like
PUSH AX
PUSH BX
PUSH CX
....
in 80286 Intel added shortcut instructions
PUSHA (PUSH All)
POPA (POP All)
PUSHA pushes AX,BX,CX,DX,SI and DI; POPA pops them all back. PUSHA and POPA are one-byte instructions, so use of them is always more efficient space-wise. Time-wise, they are roughtly equivalent to three sin- gle register PUSHes or POPs, thus replacing a sequence of three or more PUSH(POP) instructions by a single PUSHA(POPA) is warranted, but only two is not. in 80386 Intel added 32-bit support to PUSHes and POPs:
164
1. arguments of PUSH and POP can be doubleword (32bit) registers and memory locations
2. new instructions PUSHFD, POPFD push and pop the EFLAGS register.
3. new instructions PUSHAD, POPAD push and pop 32-bit regs: EAX, EBX, ECX, EDX, ESI, EDI.
7.1.1 Encoding of PUSH/POP instructions
Register forms of PUSH and POP are the most efficient, they use 1-byte opcodes: 50h-57h for PUSH and 58h-5Fh for POP
One byte opcodes are also used for segment register commands:
PUSH ES ; 06h
POP ES ; 07h
PUSH CS ; 0Eh
;
PUSH SS ; 16h
POP SS ; 17h
PUSH DS ; 1Eh
POP DS ; 1Fh
For the flags register
PUSHF ; 09Ch
POPF ; 09Dh
notice the absense of POP CS : this instruction actually does not exist, and putting it into the source code would cause a compilation error. The reason is that resetting the code segment register without simulteniously chang- ing the instruction pointer will almost always lead to a non-working code. The byte opcode OFh where this instruction would have been if it existed was undefined in 8086 and later used as a prefix for a large set of new instructions.
The forms of PUSH and POP for memory arguments are less used and therefore are implemented less efficiently: they use multibyte sequences with opcodes 8Fh for PUSH and 0FFh for POP.
PUSH immediate encoding uses opcode 68h (followed by a word con- stant) or 6Ah (followed by a signed byte constant, extended to a word).
165
7.2 Stack implementation
We now turn our attention to the actual stack implementation: what hap- pens when we push and pop data?
The stack is controlled by two registers: SS indicates the segment where the stack is; SP points to the end of the stock. (We avoid using the usual word “top” for the reasons seen below). The stack segment may be a sepa- rate segment or be the same as other program segments; in a .com program all segment registers are the same and therefore stack is physically the same segment as code and data.
When an item is being pushed on the stack, SP is decremented by the item’s size (2 – for 16-bit arguments or 4 – for 32-bit) and the data is copied to the memory location SS:[SP]. When an item is popped from the stack, data from the memory location SS:[SP] is copied to the argument and SP is incremented by the argument size.
Notice that unlike the usual stack data structure the stack grows down, not up. This does not affect the functionality and makes it more convenient to check for stack overflow. There is, however, a problem with terminology: if in a normal stack the last element resides on the “top” of the stock, here it would reside in the “bottom”; to avoid confusion we shall use the term end of stack instead of “top”.
7.2.1 Stack Balancing
We now look more carefully at the meaning of “stack balancing” men- tioned in the previous section. While in simple examples balancing may mean merely matching push’es and pop’s, its true definition is preserving the value of SP. an decrease in SP corresponds to bytes being pushed; an increase in SP corresponds to bytes being popped; both amounts should be the same and thus the value of SP should not change. This does not necesserily mean equal number of PUSHes and POPs.
Some examples:
PUSH AX
PUSH BX
POP CX
POP DX
166
the stack is balanced; there is nothing wrong in popping information to different registers if this is what the logic of the program dictates.
PUSH AX
POP EAX
the stack is not balanced: 2 bytes were pushed and then 4 bytes are popped off.
PUSH ESI
POP AX
POP DX
the stack is balanced (despite having one PUSH and two POPs). This snip- pet actually breaks the 32-bit integer in ESI into its low and high 16-bit parts.
PUSHA
POPA
the stack is balanced.
PUSHA
POP DI
POP SI
POP DX
POP CX
POP BX
POP AX
the stack is balanced.
167
7.2.2 Working with Stack Pointer
We shall start with an example, as usual
PUSH AX
....
POP AX
In the snippet above, assume that we do not really need the restored value of AX, it will never be used. We cannot neglect removing it from the stack, this would unbalance the stack, but we can remove it merely by adjusting the stack pointer:
PUSH AX
....
ADD SP,2
the stack is balanced, despite having a PUSH and no POP’s at all. More than a single PUSH can be balanced by an adjustment of SP:
PUSH AX
PUSH EDX
PUSH ECX
....
ADD SP,10
and more complicated scenarios are possible :
MOV BP,SP ; save SP to BP
...
MOV CX,n
L: PUSH CX
LOOP L ; n pushes
...
MOV SP,BP ; undo them all at once, restoring SP
Notice that while usually one would Save/Restore on the stack, it is not going to work in the last example!
168
The flip side of SP controlling the stack is that while SP supports the same arithmetic instructions as other general purpose registers, performing them on SP will almost always lead to a disasterous unbalancing of the stack. For example, XOR SP,0123h is a legal instruction, but it would not be trivial to find a situation where executing this instruction would not result in a crash. The three most common operations with SP, and still not universally save are
ADD SP,n ; remove n bytes from the stack
MOV SP,prev ; set SP to previous safe value
SUB SP,n ; allocate n bytes on the stack
The last operation is the most interesting. Subtrackting 2 bytes from SP is equivalent to pushing 2 undefined bytes on the stack, this effectively allocates a 2-byte variable. Subtracking n likewise allocates |n| bytes. This is a very cheap way to get dynamic memory, albeit limited by the stack size and thus not suitable for allocating large blocks – that would require a heap manager and/or using OS calls for allocation.
A typical allocation/deallocation sequence is
SUB SP,n ; allocate n bytes on the stack
MOV reg,SP ; use reg (often BP) to access the dynamic blocks
.....
; any balanced code
.....
ADD SP,n ; free
7.3 Procedures
The most important application of stack is the implementation of proce- dures1. Before looking at Intel specifics, we shall look at generic procedure implementation on any hardware.
We start by looking a simple procedure : no arguments, no return value2, no local variables and scopes. All of these extra features can be
1cf.function in C/C++ 2cf. void in C/C++
169
added later, for now we only care about implementing a call. Calling a procedure is nothing but transferring control to it (think goto
or JUMP) with one extra feature: the ability to return to the place in the code from where the call was made. Thus,a procedure implementation can be accomplished by two instructions:
CALL p == **save** IP ; JMP p
RET == **restore** IP
Just like before, specifics of implemenations of **save** and **restore** would depend on the hardware. If it has has stack, it would be most reason- able to use it, and in fact this is how instructions CALL and |RET| work on Intel processors: **save** is PUSH and **restore** is POP. If the hardware does not provide a stack, it becomes the responsibility of the programmer to implement a software stack and software equivalents of PUSH and POP; without having some kind of stack, nested procedure calls simply cannot be implemented. Naturally, the software approach actually used on some computers (IBM mainframe) would always be less efficient than a hardware stack as well as complicate programming.
On Intel processor the two key instructions are exactly as described:
CALL p (== PUSH IP, JMP p)
RET (== POP IP)
notice that the value of IP pushed by CALL corresponds to the address of the next instruction after CALL, otherwise an infinite loop of calls may result.
Simple arguments and return values can be passed on registers, this is actually the most efficient way in assembler.
With this in mind, let us implement the abs function as an assembler procedure. We will use the accumulator AX for both the input and output value:
170
abs: ; compute ABS value of AX, return in AX
OR AX,AX
JNS done
NEG AX
done:
RET
....
MOV AX,-3
CALL abs
; return here, with AX being +3
MOV AX,5
CALL abs
; return here, with AX being 5
No additional syntax is required but we will consider a later alternative syntax which makes the procedure more “structured-looking”.
One more simple example:
min: ; compute min of numbers in AX and BX
CMP AX,BX
JL L
XCHG AX,BX
L: RET ; return in AX
notice that this function considers numbers as signed, an unsigned version would use JB rather than JL. Further, this function also computes the max: this is the value returned in BX. Multiple return values are fine in assembler procedures, but such functions cannot be called from high-level language : there is no syntax in, for example, C/C++ to express this!
Two important rules apply to procedures: (1) one must never walk into a procedure and (2) procedure should balance the stack. (Exceptions to these rules are possible!)
Let us illustrate:
MOV DX,SI
ADD DX,DI
171
abs: ; compute ABS value of AX, return in AX
OR AX,AX
JNS done
NEG AX
done:
RET
In the example above we get to execute the procedure by walking into it, not by calling. It would execute (it does not matter that AX is not defined) but where would it return? We never pushed the return address on the stack! RET would still work, taking an undefined value from the stack and sending us into a random location in memory where before long we will encounter illegal instructions of some kind. A crash.
(As we will see later, walking into procedures may be a valid idea in some cases – specifically when we know what is on the stack. In most cases, we don’t.)
The unbalanced stack would lead to the same type of troubles. Con- sider:
p:
PUSH CX
PUSH BX
POP BX
RET
....
CALL p
When RET is executed, the return address is not at the end of the stack, instead the value of CX is. This value would define where we will end up after the return, and most likely it is not a place where code execution would succeed.
Alternatively, consider the case when we POP more than PUSH.
p:
PUSH CX
POP CX
172
POP BX
RET
....
CALL p
Here we removed the return address from the stack. RET would take what- ever value is at the end of the stack, and again we are not likely to survive this.
To summarize: it is your responsibility to ensure that at the time of RET being executed, it pops the return address and not something else.
7.3.1 Structured syntax
Most assembler programs use alternative syntax, based on the same CALL/RET ideas. In our previous examples we wrote:
p:
.....
RET
CALL p
This indicates where the procedure begins (label p:), but not where it physically ends, there may be more code after the RET and there may be multiple RET statements inside a body of the procedure. We can alterna- tively write
p proc
.....
RET
p endp
CALL p
173
where proc and endp are assembler keywords which must match: every proc should be followed eventually by a endp that has the same label. Both types of syntax produce the same code. In addition to endp clearly showing where the code ends, there are other advantages like declarations of local labels and variables within proc–endp blocks. while in most cases one would prefer to use proc–endp syntax for more structured code, in some cases this does not work – namely, when proce- dures overlap. This scenario is possible in assembler, an example will be supplied shortly (see ¶).
7.3.2 Printing a number.
We will provide more interesting examples of procedures now. In high level languages, printing a number is usually done by a call to
supplied library. In assembler this task, like many others, is left to the programmer (who can, of course, include it in a library he uses). The procedures shown below rely on a procedure printchar which can be im- plemented easily in different ways, depending on what kind of printing we want! The choices include
• print to the video memory, via segment 0B800h at a specified screen location
• print at cursor position via int 10h
• print to console using int 21h
• print to a file
• print to a string
Out procedure for printing a number can do all of these, depending on the choice of the printchar.
For the option of writing to the console, printchar can be coded as
printchar: ; argument AL=char
PUSH DX
MOV AH,6
MOV DL,AL
INT 21h
174
POP DX
RET
For the option of writing to the video memory, printchar can be coded as
printchar: ; argument AL=char, DI=current screen position
; ES assumed to be 0B800h
MOV ES:[DI],AL
ADD DI,2
RET
A number of other alternatives can be written, including “write-to-string” and “write-to-file”.
Our printint procedure would use argument in AX, treated as an un- signed short integer.
We will split the number into digits and print them one by one. To split the number we will use repeated division by 10; on each step the remainder would be a digit to be printed, and the quotent would be split further. We will stop when the quotent becomes zero. Notice that we do not know how many digits the number contains, and also that we cannot simply print the digits as we obtain them: this would lead to the number written backward. Therefore, we will store the obtained digits and once we have them all, we will print them in the reversed order. In a high level language store would be most likely store in an array; in assembler we will use more convenient stack.
printnum: ; argument: AX
SUB CX,CX ; counter of digits
MOV BX,10 ; print in base 10
l1: SUB DX,DX ; extend # to DX:AX
IDIV BX ; quotent=AX, remainder=DX
PUSH DX ; save on stack
INC CX
OR AX,AX ; are we done?
JNZ l1 ; if remained!=0 continue
l2: POP AX ; restore a digit
175
ADD AX,’0’; convert to ascii char
CALL printchar
LOOP l2
RET
This is the entire procedure. Make sure you understand this code fully before continuing!
If we want to print a signed number, we will have to slightly modify the above code. The idea is to check if the number is negative; if it is, we will print a minus and negate the number. We then proceed with printing the number as unsigned!
printsignedint: ;argument: AX
OR AX,AX
JNS l3
NEG AX
XCHG AX,BX
MOV AL,’-’
CALL printchar
XCHG AX,BX
l3: CALL printnum
RET
Another possible improvement would be give printnum an extra param- eter to specify how to print the number : decimal, binary, hex, octal... in fact, what about simply making BX (see the code of printnum) a parameter indicating the base? Then BX==10 will print decimal, BX==2 will print bi- nary, BX==3 will print in base 3 (not that it is easy to find a reason why you would want this!), etc.
Exercise: which bases are supported?
Exercise: what needs to be fixed for hex to work?
176
7.3.3 Printing a number in hex
We will provide a different program for printing a hexidecimal number. The printint routine in the previous section always prints a number using as many digits as needed, so 000Ah will be printed as one digit, whereas 1234h will be printed as four. For hex dumps and other application we would prefer to print a byte value using two digits (padding with zeroes if needed), word value using four digits (padding with zeroes if needed), and a doubleword value using eight digits (padding with zeroes if needed) – this way the output would align nicer.
While it is possible (how?) to add this option to printnum, we will show a totally different code to demonstrate a couple of assembler tricks.
We will use an array of hex digits
hexd DB ’0123456789ABCDEF’
and first implement a procedure for printing a single hex digit
print_nibble: ; input AL -- must be 0..15
MOV BL,AL
SUB BH,BH
MOV AL,hexd[BX]
CALL printchar
RET
(This code can be written better using the XLAT instruction – you may want to take a look).
Out next task is to print a byte, to do this we will split it into two digits.
print_byte: ; input AL
PUSH AX
SHR AL,4 ; AL is the low nibble now
CALL print_nibble
POP AX
AND AL,15 ; AL is the high nibble now
CALL print_nibble
RET
We can optimize the code by overlapping (!) both procedures as follows:
177
print_byte: ; input AL
PUSH AX
SHR AL,4 ; AL is the low nibble now
CALL print_nibble
POP AX
AND AL,15 ; AL is the high nibble now
print_nibble: ; input AL -- must be 0..15
MOV BL,AL
SUB BH,BH
MOV AL,hexd[BX]
CALL printchar
RET
Notice that this is a rare case when we can walk into a procedure, with- out a call!–and this is of course faster. Further, this is one case where using proc and endp directives would not be a good idea.
The same trick will be used now to upgrade the code to printing a word:
print_word: ; input AX
PUSH AX
MOV AL,AH
CALL print_byte
POP AX
print_byte: ; input AL
PUSH AX
SHR AL,4 ; AL is the low nibble now
CALL print_nibble
POP AX
AND AL,15 ; AL is the high nibble now
print_nibble: ; input AL -- must be 0..15
MOV BL,AL
SUB BH,BH
MOV AL,hexd[BX]
CALL printchar
RET
We now have tree procedures overlapping each other, each having its own entry point.
178
Exercise: add printdword to the code
NOTE: in high level languages it is easy to have a procedure with mul- tiple exit points: for example, in C/C++ we can have several return state- ments in one procedure. But is it possible to have multiple entry points like we see in the above code? For most languages the answer is no, but there are exceptions that provide syntax for this (PL/I).
179
Chapter 8
Interrupts
Assembler instructions do not define a complete language suitable for pro- gramming: some tasks require interraction with the operating system. These include console I/O (with the exception of the output via screen memory), work with files, memory allocation, process management. Surely, one can implement these tasks in assembler, but then one would have to write huge amounts of code which will use data structures internal to the OS... effec- tively reinventing the wheel. To make it worse, any change to the hardware or the OS will require reworking this code, and under protected mode oper- ating systems user application will not have sufficient privelege to execute this code.
The solution is to call routines provided by the OS... but now there is a question to resolve: where are they located? How would we find them? Surely they are not going to be in fixed addresses, changes to the OS or the system configuration would shift their position in memory.
The Intel solution to this is to have the OS store the addresses of system routines in a fixed place: the interrupt table. This allows an application to call provided routines by number, the address is computed by a simple lookup into the interrupt table.
Specifically, the interrupt table is an array of 256 addresses (each con- taining the offset and the segment parts of the address of the system rou- tine, usually called interrupt handler); it is located at the very begin- ning of the system memory (0000:0000) and occupies exactly one kilobyte (256 ∗ 4 = 1024).
180
INT n (≈ CALL inttable[n])
where n is an unsigned byte. INT is similar to CALL but pushes on the stack three values: offset (IP), segment (CS) and flags. (pushing segment is needed since the system routine is likely to be in a different segment from the caller, flags are saved to prevent interrupts from irreversibly modifying them, see below). Because of additional values pushed on the stack, returning from an inter- rupt can not be done with RET; instead one uses the
IRET (Interrupt RETurn)
instruction to come back, it pops all three values off the stack. Some of the important interrupt numbers are INT 20h (terminate), INT 10h (video), INT 16h (keyboard), and INT 21h. In this chapter we will go sys- tematically through many interrupts numbers, but before doing this, let us systematize the interrupts.
*More detailed descriptions of the interrupts mentioned in the next sec- tion will be given later).
8.1 Interrupt types
Far the most common usage of interrupts is via the INT instruction: this is indeed a call to the operating system to perform a particular service. We specify the interrupt number and the required parameters, given in registers – each interrupt has its own syntax. In the mainframe interrupt classification (there is no PC classification!) such an interrupt would be called an SVC, for SuperVisor Call.
Other types of interrupts occur without an explicit INT instruction. The hardware interrupt (perhaps better called “hardware fault”) oc-
curs if the hardware detects a malfunction – an example of this would be INT 2. No instruction explicit instruction INT 2 is needed in the code, the
181
routine corresponding to this interrupt would be called regardless of what is being executed, the moment a problem is detected.
The software interrupt (perhaps better called “software fault”) occurs if the program is trying to execute an instruction that cannot be executed. Examples include trying to execute an opcode that does not correspond to a valid instruction, or division by zero/overflow on division (INT 0).
The external interrupt is triggered by an event that is external to the processor that must be processed immediately. One example of such an event is pressing a key on the keyboard. Unless the key is processed or saved right away, it may be forgotten! Therefore an INT 9 is issued, and the called routine saves the key, together with the modifiers (status of the Shift, Control and Alt keys). One other example is the timer interrupt (INT 8).
The debug interrupt is intended for use in system software like debug- gers and profilers. Debug interrupts may occur because of explicit instruc- tions (INT 1, INT 3) or without them.
The mainframe classification also includes CPU-to-CPU interrurps: sig- nals from one processor to another. This is very similar to external inter- rupts.
Here is an example how an external interrupt may occur.
CMP AX,BX
; <<<=====
JL LAB
....
LAB:
Assume that a key on the keyboard is pressed when the CMP instruction was executed. Then the control will be switched to the INT 9 routine, we will execute JL only upon return. Naturally if the INT 9 routine alters flags our code will not work right (and this is why flags are saved automatically).
The same type of problem may occur with other non-SVC interrupts. With SVC, on the other hand, this is never an issue – and in fact some
of the SVC interrupts use the carry flag to provide the return.
182
Different interrupts were added by different vendors at different times. We distinguish the following ranges:
• 0..7: these interrupt numbers were assigned by Intel; these interrupts interplay with other instructions and hardware.
• 5..1Fh,40h: these interrupt numbers were assigned by IBM at the time original IBM PC was designed1.
• 20h..33h: these interrupt numbers were assigned by Microsoft, the routines point to the OS (DOS and Windows).
Other interrupt numbers may be uses by software apps and viruses, such interrupts may work only on some systems and not others.
Notice that this classification is mostly “chronological”: first Intel de- signed the CPU (and reserved the first few numbers), then IBM built the computer (and reserved some more numbers), and finally Microsoft added the OS (and again reserved some more numbers). Numbers beyond 40h were never reserved.
The special case of 40h is because this interrupt number corresponds to low-level workings of the to hard disk controller, and there were no hard disks in the very first PC. Similarly 33h corresponds to the mouse – again a late arrival.
We now look at the interrupts in a systematic way.
8.2 INT 0
INT 0 is a software interrupt, triggered by a division overflow. What hap- pens next depends on the interrupt handler installed, either in the OS or in the program library:
• Under original DOS, any program that does not install its own han- dler, the control is given to the BIOS handler. It switches the monitor to 40 column mode, displays DIVIDE BY ZERO and hangs the com- puter. High level language programs typically would install a new handler automatically, assembler programs requite the user to supply the handler.
1notice that Int 5 occurs both as an Intel and as IBM interrupt – this will be explained below
183
• Under Windows, any program that does not install its own handler, Windows display Blue Screen of Death, usually allowing the applica- tion to be terminated, sometimes resulting in a system crash.
• Typical runtime library handler in a high level language would display a message like Runtime error ... and terminate.
While INT 0 is not particularly important (do not divide by zero and you don’t need to know it), we use it to illustrate some general points.
Here is how an assembler program can install a handler:
Int0Handler:
; say something
MOV AX,4CFFh
INT 21h ; (terminate with code 255)
IRET ; unneeded in this particular handler
....
SUB AX,AX
MOV ES,AX
MOV word ptr ES:[0*4+0],offset Int0Handler
MOV word ptr ES:[0*4+2],CS
.....
The code above is simplified, in most cases the following corrections are needed:
1. It is best to save the old interrupt handler pointer and restore it upon program exit.
2. Fully replacing interrupt handler is possible only if one does not need functionality offered by the original handler. With INT 0 it is indeed fine, but with INT 21h (DOS) it is not: the moment the vector is replaced the entire file system and much else becomes unavailable to the application. Same applies to most other interrupts that do something generally important. In cases like this one should ensure that the new interrupt handler calls the original one, you can only add code before or after the call to provide more functionality (See example in the section on the antivirus software – TBW)
184
3. Finally, under protected mode operating systems it is not possible to change the interrupt table by simply storing data into it (as we do above). Instead one uses functions 25h and 35h of INT 21h to accomplist this change in a controlled fashion. (going through these methods is safer since then the OS knows about alterations to the interrupt table and can undo them when the program is exited or task switches to another program.)
INT 0 would not normally appear in a program source – we expect this interrupt to be triggered by IDIV or DIV – what do you think would happen if one writes
INT 0
? – well, the processor would think you actually divided by 0! – and you would get exactly the same response as if you had.
8.3 INT 1
INT 1 is a debug interrupt that implements singlestep: ability to execute exactly one machine instruction. (We will not describe it in this book)
8.4 INT 2
INT 2 is a hardware interrupt that occurs if the processor detects and in- valid change of memory – indicating a memory failure. (We will not de- scribe it in this book)
8.5 INT 3
INT 3 is a debug interrupt that implements breakpoint: ability to stop ex- ecution when a specific condition (for example IP reaching the address of a set breakpoint, or value of a variable changing) occurs. (We will not describe it in this book)
185
8.6 INT 4
INT 4 is a software interrupt used for detection of overflows (alternative name : INTO.
8.7 INT 5
INT 5 serves two entirely different purposes; Intel implemented it as a soft- ware interrupt used for detection of out-of-range conditions, whereas IBM used the same number to implement an external interrupt.
(This will be expanded)
8.8 INT 8
8.9 INT 9
8.10 INT 10h
8.11 INT 13h
8.12 INT 16h
INT 16h provides a series of functions for reading the keybord. The most useful is function 0:
• AH= 00h ; wait for a key
• =>AL ASCII value of the key pressed
• =>AH AH scan code of the key pressed
Some of the keyboard keys do not have a corresponding ASCII character value, for them the returned ASCII value is 0. Such keys include cursor control keys and Function keys, as well as “normal” keys in combination with Control or Alt keys. To distingush among them one looks at the scan codes (low bytes in the table below):
186
Key Normal Shifted w/Ctrl w/Alt
A 1E61 1E41 1E01 1E00
B 3062 3042 3002 3000
C 2E63 2E42 2E03 2E00
D 2064 2044 2004 2000
E 1265 1245 1205 1200
F 2166 2146 2106 2100
G 2267 2247 2207 2200
H 2368 2348 2308 2300
I 1769 1749 1709 1700
J 246A 244A 240A 2400
K 256B 254B 250B 2500
L 266C 264C 260C 2600
M 326D 324D 320D 3200
N 316E 314E 310E 3100
O 186F 184F 180F 1800
P 1970 1950 1910 1900
Q 1071 1051 1011 1000
R 1372 1352 1312 1300
S 1F73 1F53 1F13 1F00
T 1474 1454 1414 1400
U 1675 1655 1615 1600
V 2F76 2F56 2F16 2F00
W 1177 1157 1117 1100
X 2D78 2D58 2D18 2D00
Y 1579 1559 1519 1500
Z 2C7A 2C5A 2C1A 2C00
Key Normal Shifted w/Ctrl w/Alt
1 0231 0221 7800
2 0332 0340 0300 7900
3 0433 0423 7A00
4 0534 0524 7B00
5 0635 0625 7C00
6 0736 075E 071E 7D00
7 0837 0826 7E00
187
8 0938 092A 7F00
9 0A39 0A28 8000
0 0B30 0B29 8100
Key Normal Shifted w/Ctrl w/Alt
- 0C2D 0C5F 0C1F 8200
= 0D3D 0D2B 8300
[ 1A5B 1A7B 1A1B 1A00
] 1B5D 1B7D 1B1D 1B00
; 273B 273A 2700
’ 2827 2822
‘ 2960 297E
\ 2B5C 2B7C 2B1C 2600 (== Alt-L)
, 332C 333C
. 342E 343E
/ 352F 353F
Key Normal Shifted w/Ctrl w/Alt
F1 3B00 5400 5E00 6800
F2 3C00 5500 5F00 6900
F3 3D00 5600 6000 6A00
F4 3E00 5700 6100 6B00
F5 3F00 5800 6200 6C00
F6 4000 5900 6300 6D00
F7 4100 5A00 6400 6E00
F8 4200 5B00 6500 6F00
F9 4300 5C00 6600 7000
F10 4400 5D00 6700 7100
F11 8500 8700 8900 8B00
F12 8600 8800 8A00 8C00
Key Normal Shifted w/Ctrl w/Alt
BackSpace 0E08 0E08 0E7F 0E00
Del 5300 532E 9300 A300
Down Arrow 5000 5032 9100 A000
188
End 4F00 4F31 7500 9F00
Enter 1C0D 1C0D 1C0A A600
Esc 011B 011B 011B 0100
Home 4700 4737 7700 9700
Ins 5200 5230 9200 A200
Keypad 5 4C35 8F00
Keypad * 372A 9600 3700
Keypad - 4A2D 4A2D 8E00 4A00
Keypad + 4E2B 4E2B 4E00
Keypad / 352F 352F 9500 A400
Left Arrow 4B00 4B34 7300 9B00
PgDn 5100 5133 7600 A100
PgUp 4900 4939 8400 9900
PrtSc.... 7200
Right Arrow 4D00 4D36 7400 9D00
SpaceBar 3920 3920 3920 3920
Tab 0F09 0F00 9400 A500
Up Arrow 4800 4838 8D00 9800
8.13 INT 20h
Terminates the program. There are no parameters for this function call. Alternative ways of terminating the program are via INT 21h, function
calls 0 (no other parameters) and 4Ch (AL is the return code).
8.14 INT 21h
In terms of the number and variety of functions, INT 21h is the definite leader. Nearly the entire OS support is done via INT 21h. Rather than listing the entire set, we will concentrate only on one group of functions, especially important for applications: File operations.
Before getting into the specifics, let us outline the general idea of how file systems are implemented in programming languages.
There are two different entities called files:
• Physical file stored usually on a secondary devide (drive)
189
• Logical file variable defined in the programming language.
Logical file variable is defined in different data types, depending on the language and/or system:
• Integer number : FORTRAN, BASIC, Unix IO in C, DOS
• Structure : CP/M
• Pointer to a structure: ANSI C
• Stream: C++
Since programming code rarely examines file variables, only passes them between function calls, the actual format is not overly important.
We note that Microsoft uses the term handle to refer to file variables. Handle is actually just an integer, a different word is intended to underscore that one should not look at the value too closely – the number may mean nothing to the programmer, only the OS can make sense of it. (In reality, in many cases the number is informative.)
Work with file begins by establishing a correspondence between these two entities. Physical file is identified by name (string), and tied to a vari- able. We typically call this Opening a file. This makes the subsequent calls to file operations more efficient: whereas opening a file requires directory (disk) search and is therefore costly, subsequent calls use already available information about the physical file. Furthermore, more than a single file variable can correspond to one physical file, allowing two different types of work with a file to be done simulteniously.
Generally this is the only time when the file name appears in the syntax, all the subsequent calls refer to the file variable.
Some systems separate opening a file (which requires to the file to exist and does not create a new physical file) from Creating a file (which always creates new physical file) leading to having two functions, some lump both operations together.
During the operations with the file (read, write, seek) position file ma- chinery maintains the current offset in the file (read/write pointer). Usually it is initialized to the beginning of the file and is automatically incremented by the number of files read or written. When the application reads or writes
190
entire file from the beginning to the end (sequential access), automatic up- dating of the read/write pointer makes it unnecessary to track it. For appli- cations that want to read or write particular parts of the file (direct access) it becomes necessary to be able able to reposition the read/write pointer (direct access files). This is also a needed feature to determine the size of the file before reading it – even if reading itself is to be done sequentially.
A minimal set of functions to support sequential working with files would include opening (and/or creating) a file, reading, writing and clos- ing. To support direct access files one would also need a seek function.
(This is the minimal list; typical file systems, including the ones in INT 21h, have many other functions, some of them will be mentioned after we master the important ones)
In INT 21h there are two different sets of functions supporting these features: the original set that uses file variables in the form of structures was inhereted from CP/M; the newer set, with file variables being integer number, was modeled after UNIX.
We will describe the second set: the usage of the original CP/M function calls was shortlived and UNIX-style calls are much easier to use.
This set consists of the following functions, listed together with their ANSI C (header stdio.h) and UNIX C (header unix.h) analogs:
3Ch Create file (ansi: fopen, unix: creat).
3Dh Open file (ansi: fopen, unix: open).
3Eh Close file (ansi: fclose, unix: close).
3Fh Read file (ansi: fread, unix: read).
40h Write file (ansi: fwrite, unix: write).
42h Seek file (ansi: fseek and ftell, unix: lseek and ltell).
(Notice that fopen does both opening and creating, while open and creat are different functions; INT 21h follows the UNIX way.)
While every function of INT 21h has its own syntax, let us begin by outlining the general scheme of register usage. For input parameters:
AH : function number.
191
BX : file handle (if applicable).
CX : count (if applicable).
DX : pointer to data (if applicable).
the output is given by the carry flag and the AX register. If carry is set, the call failed, and the AX contains the error code; if no carry, then AX contains the result. Some of the common error codes are
• 0 invalid function number (ah value)
• 1 invalid handle (bx value)
• 2 file not found (on open, for example)
• 3 path not found (on open or create, for example)
• 5 access denied (attempt to open a read only file for writing, for ex- ample)
• 8 out of memory (rarely with file operations)
8.14.1 Copying a file
First five of the file functions will be demonstrated in a single example, a complete program that copies a file. To simplify the problem we will provide the filenames hardwired in the program, in a real application they should be obtained from the command line – this will be demonstrated later.
Here are the variables used by our program:
inname db "input.dat",0 ;name of the input file
outname db "output.dat",0 ;name of the output file
inh dw ? ;input file handle
outh dw ? ;output file handle
b dw N dup (?) ;data buffer
We begin by opening the input file:
192
MOV AX,3D00h ;open for read
MOV DX,offset inname
INT 21h
JC error
MOV inh,AX ;save the handle
The error handler should print an error message and terminate the pro- gram – there is no point in continuing if the input file cannot be opened. A more sophisticated program would examine the error code (in AX) and issue correct diagnostics, in our application this is not necessary and we will use the same error handler for all possible errors; the handler can be simply this:
errmsg DB "ERROR!",13,10,"$"
......
error:
MOV AH,09
MOV DX,offset errmsg
INT 21h
MOV AX,4CFFh
INT 21h ;terminate
We next attempt to create the output file:
MOV AH,3Ch ;create file
SUB CX,CX ;normal file
MOV DX,offset outname
INT 21h
JC error
MOV outh,AX ;save the handle
Before getting to the actual work of reading and writing the data, let us show the end of the program:
MOV AH,3Eh ;close input
MOV BX,inh
INT 21h
JC error
MOV AH,3Eh ;close output
193
MOV BX,outh
INT 21h
JC error
Is this part really necessary? For the input file, only to return the re- source to the OS – this is more about good style of programming. For the output file : absolutely. If an output file is not closed, part of the data we wrote to it may not be actually saved – write’s are delayed by the OS to optimize the performance and the final syncronization only occurs during closing.
The errors in these calls should not be happening – the handles we supply came from the OS and should be valid. But in a larger program variables inh or outh may be accidentally overwritten, we may as well have diagnostics for this kind of problem.
If nothing else added to our program, no data would be copied... but running the program should still result in a file output.dat being created in the current directory, this file should have size zero.
We now proceed with with read and write operations. Firstly, a simpli- fied version:
MOV AH,3Fh ;read
MOV BX,inh
MOV CX,1
MOV DX,offset b
INT 21h
JC error
MOV AH,40h ;write
MOV BX,outh
MOV CX,1
MOV DX,offset b
INT 21h
JC error
Adding this code would copy exactly one byte from the input to output. Since our aim is to copy the entire file, we either need to know the number of bytes in the input or be able to detect the end of file.
To determine the size of the file, one can use the seek function (ah=42h). An example will follow, but in the copy program it is the other approach
194
that is easier2. Recall that the read function, when successful, would re- turn the number of bytes read. Since we only ask for 1 byte, it can possibly return only the values 1 (byte was read) or 0 (it could not read one byte == end-of-file condition).
This allows us to construct a loop:
again: MOV AH,3Fh ;read
MOV BX,inh
MOV CX,1
MOV DX,offset b
INT 21h
JC error
OR AX,AX
JZ done
MOV AH,40h ;write
MOV BX,outh
MOV CX,1
MOV DX,offset b
INT 21h
JC error
JMP again
done:
We now have a working copy program: it will correctly copy a file of any size. But running this program will show abysmal performance – it will be visibly much slower than the OS’ command copy.
The source of the problem is reading/writing one byte at a time: ac- cessing external device is slow and it is best to make fewer accessing to it, having each access do more work. In our case this simply means: copy more than one byte at a time. Here is an updated version:
again: MOV AH,3Fh ;read
MOV BX,inh
MOV CX,N
MOV DX,offset b
INT 21h
JC error
2and avoids a nasty situation when the file size cannot be expressed as a 32-bit integer!
195
OR AX,AX
JZ done
MOV CX,AX
MOV AH,40h ;write
MOV BX,outh
MOV DX,offset b
INT 21h
JC error
JMP again
done:
Here N signifies the number of bytes to read on each iteration (==the buffer size); notice that we always write the number of bytes we have read.
Assuming N is 100 and the input file had 323 bytes, the first time we will read 100 bytes and write 100 bytes, same will happen on the second and third iterations, but the fourth time we will read only 23 bytes and write them out. The fifth iteration will result in no bytes read and exit from the loop.
What kind of value should we choose for N for the optimal perfor- mance?
Two considerations here: firstly, powers of 2 are most efficient. sec- ondly, while generally larger values of N will be more efficient than smaller values, increasing N beyond about 4k tends not to improve the performance noticeably while consumes more memory for the buffer. The exact value of optimal N would depend on the hardware, but 4K (4096) is generally a good choice.
8.14.2 INT 21h,function 09h Write string to console
• AH= 09h
• DS:DX= string to write, terminated with the $ character.
8.14.3 INT 21h,function 3Ch Create File
• AH= 3Ch
• CX= file attributes (0 for normal file)
196
• DS:DX= points to the file name (0-terminated string, it may include drive and path)
• => AX handle
Common error codes: 3,5
8.14.4 INT 21h,function 3Dh Open File
• AH= 3Dh
• AL= (0 for read, 1 for write, 2 for both)
• DS:DX= points to the file name (0-terminated string, it may include drive and path)
• => AX handle
This function sets read/write pointer to the beginning of the file. Common error codes: 2,3,5
8.14.5 INT 21h,function 3Eh Close File
• AH= 3Eh
• BX= handle
This function sets read/write pointer to the beginning of the file being created.
Common error codes: 1
8.14.6 INT 21h,function 3Fh Read File
• AH= 3Fh
• BX= handle
• CX= number of bytes to read
• DS:DX= offset of buffer to receive file data
197
• =>AX number of bytes actually read
Read operation advances read/write pointer by the number of bytes read.
Common error codes: 1
8.14.7 INT 21h,function 40h Write File
• AH= 40h
• BX= handle
• CX= number of bytes to write
• DS:DX= offset of buffer to write
• =>AX number of bytes actually written
Write operation advances read/write pointer by the number of bytes written.
Common error codes: 1
8.14.8 INT 21h,function 42h Seek file
• AH= 42h
• AL= Seek method: 0==from beginning, 1==from current, 2==from end
• BX= handle
• CX:DX= offset to move to (relative to the method!)
• =>DX:AX new offset in the file (absolute, that is relative to the begin- ning of the file)
Common error codes: 1 Examples: Move the r/w pointer to location (n*0x10000+m)
198
MOV AX,4200h
MOV BX,handle
MOV CX,n
MOV DX,m
INT 21h
Rewind:
MOV AX,4200h
MOV BX,handle
MOV CX,0
MOV DX,0
INT 21h
Skip m bytes
MOV AX,4201h
MOV BX,handle
MOV CX,0
MOV DX,m
INT 21h
“Unget” one byte
MOV AX,4201h
MOV BX,handle
MOV CX,-1
MOV DX,-1
INT 21h
Get current position (in DX:AX).
MOV AX,4201h
MOV BX,handle
MOV CX,0
MOV DX,0
INT 21h
Move to the end of the file–this can be used to append or find out the file size.
199
MOV AX,4202h
MOV BX,handle
MOV CX,0
MOV DX,0
INT 21h
8.15 INT 33h
200
Chapter 9
Project #2
Due Date: 05/17/2021, 8pm. Goal: implement Text-to-binary and Binary-to-text conversions for the
purpose of sending email attachments, independent from the MIME stan- dard. (Test that this actually works!)
Basic idea: the protocol used by email is suitable only for text, transmission of bi-
nary files via email is only possible by converting binary files to text and embedding them within email body.
Different conversion algorithms are possible. Implement Base64-type algorithm (for a “B” grade), or that and Ascii-85 type algorithm (for an “A” grade).
Example. The smallest program that can be made in assembler is built out of a
single instruction:
int 20h
Compiled into a .com file (say, try.com) this will be a two-byte binary file, consisting of bytes 0CDh and 020h.
We will convert it to text. Notice: in all cases the text file is going to be larger than the original,
the overhead is smaller if the used alphabet is larger.
Decimal 205,32. Alphabet size=11. File size is about 3.5x the size of the original.
201
Hex CD20. Alphabet size=16. File size is about 2x the size of the original.
Base64 [:((. Alphabet size=64 File size is about 1.33x the size of the origi- nal. See the algorithm below
Ascii85 ???. Alphabet size=85. File size is about 1.25x the size of the origi- nal.
The idea of the Base64-type algorithm: Every 3 bytes of the input (any bytes) are converted into 4 bytes of the output (text chars only). If we do not have 3 bytes, we pad. In our example:
1 1 0 0 1 1 0 1 0 0 1 0 0 0 0 0 x x x x x x x x
We rearrange these 3 8-bit bytes into 4 6-bit “bytes”
1 1 0 0 1 1 0 1 0 0 1 0 0 0 0 0 x x x x x x x x
Since bytes are really 8 bit, we add padding:
- - 1 1 0 0 1 1 - - 0 1 0 0 1 0 - - 0 0 0 0 x x - - x x x x x x
The new bytes contain values in the 0..63 range, adding 40 to them moves them to the ASCII range ( This is not the only formula possible, of course; f.e. one may prefer converting character set that includes all letters and digits.) In our example the four obtained values are 51, 18, 0, 0; after shifting by 40 they become 91,58,40,40 or [:((.
How does the email may look like? We need to separate the ascii (text) and converted-to ascii parts. We
need header and footer for the data. For example:
Here is some normal message text. Trying encodings:
$!ATTB file="try.com" size="2" encoding=10
202
205,32.
$!ATTE
$!ATTB file="try.com" size="2" encoding=16
CD20
$!ATTE
$!ATTB file="try.com" size="2" encoding=64
[:((
$!ATTE
(The author is not suggesting any particular syntax).
Refs: Base64. Base85.
Technical note: email body is subject to line length limits; newlines should be inserted during conversion to text to break the lines, during con- version back to binary they should be ignored.
The projects must be individual and not downloads of copy one finds on web.
203
Chapter 10
Sample code
10.1 Reading in an entire file
In this example we assume that the we want to read an entire file (its name in the inname variable) into buffer buff). The example is simplified: it cannot read more than the buffer size (10k). More general code would dynamically allocate buff using for example INT 21h function 48h.
inname DB "C:\work\test.dat",0
buff DB 10000 dup (?)
handle DW ?
; open the file
MOV DX,offset inname
MOV AX,3D00h ;open for read
INT 21h
JC error ; file not found
MOV handle,AX
SUB CX,CX
SUB DX,DX
MOV BX,handle
MOV AX,4202h
INT 21h ; position to the end
JC error ; this should not happen
204
OR DX,DX
JNZ error ; the file is too large
CMP AX,10000
JA error ; the file is too large
PUSH AX ; file size
SUB CX,CX
SUB DX,DX
MOV BX,handle
MOV AX,4200h
INT 21h ;rewind file
MOV AX,3Fh
MOV BX,handle
MOV DX,offset buff
POP CX
INT 21h ; read the entire file
MOV AX,3Eh
MOV BX,handle
INT 21h ; close file
..
10.2 Reading command line
Let us begin with an example: imagine an application was invoked with the command
C> mycopy a.dat b.dat
The text typed on the command line has two parts :
• mycopy is used by the operating system to locate an application (mycopy.exe, mycopy.com, . . . ) and invoke it1.
1Generally the application does not need to know how it was invoked, but this infor- mation can be obtained elsewhere.
205
• a.dat b.dat is used by the application, the operating system cannot know the format of the arguments and passes them to the application. We will refer to this part as the “command line” below. Notice that it begins right after the application name, with a space.
(The argument part) of the command line is passed to the application as one string in a special segment called PSP (Program Segment Prefix). The length of the command line is passed in the byte at the offset 80h of the PSP, the command line itself follows from the next byte (offset 81h) and is terminated with the carriage return character (0Dh).
In a .com application there is only one segment, therefore PSP is the same segment as the code, data, or stack. In an .exe PSP is a separate segment; initially DS points to it.
Since the command line arguments are given as a single string we need to parse the string. The following code illustrates how this can be done for the mycopy example; our goal is to extract the two filenames from the command line string. We assume that our application is a .com file, thus we would not have to worry about the segments.
inname db 128 dup (0)
outname db 128 dup (0)
......
MOV SI,81h
L1: MOV AL,[SI] ; loop to skip spaces
INC SI
CMP AL,0Dh
JZ error ; no arguments are given!
CMP AL,’ ’
JBE L1
MOV DI,offset inname
L2: MOV [DI],AL ; loop to extract inname
INC DI
MOV AL,[SI]
INC SI
CMP AL,’ ’
JAE L2
206
MOV byte ptr [DI],0
INC DI
The code above should be followed by two more loops : to skip spaces between the filenames and then to extract outname. These loops are nearly identical to the loops shown above and is left to the reader
Notice that the code above can be shortened with the string instruc- tions.
207
Chapter 11
Branch prediction
One the most important issues in assembler program performance is that of Branch prediction: the code that does not contain conditional branches executes considerably faster than the one that has. The earliest models of the x86 architecture did not employ branch prediction and this was not an issue, as the branch prediction got introduced and became more sophisti- cated, the importance of the issue grew.
The goal of cutting down on conditional jumps was addressed by both
• the programmers (by finding new ways to code – see the abs() func- tion example
• the manufacturer – Intel introduced new instructions that make cod- ing without conditional jumps possible in some selected special cases, SET<cc> and CMOV<cc> are notable cases.
11.1 The abs() function
Of all functions one can think of, abs() is about the simplest, it is therefore odd that one can find something interesting in its implementation
The standard code for the abs() function is shown here:
abs: ; assume EAX is the argument
OR EAX,EAX
JNS l
NEG EAX
l: RET
208
There are two inefficiencies here (and a bug – do you see what the bug is?). One inefficiency comes from the code being a function – an inline implementation would be better. The other – from the conditional jump (JNS). Is it possible to implement abs() without it?
Surprisingly, yes:
abs: ; assume EAX is the argument
CDQ
XOR EAX,EDX
SUB EAX,EDX
RET
Naturally, the code would be much faster, especially on modern proces- sors.1.
It is common to want to compute not abs(x) but rather abs(x-y). Can we use the same trick? For example:
MOV EAX,x
SUB EAX,y
CDQ
XOR EAX,EDX
SUB EAX,EDX
RET
The answer – oddly! – is a NO. We will leave it to the reader to realize the problem and find a counter-example.
1This trick was discovered in very early days and even used by some compilers – for example good ol’ Turbo/Borland pascal implemented abs() as an inline, via the 16-bit version of exactly this code
209
Chapter 12
String Instructions
The instuctions explained in this chapter are non-essential in terms of abil- ity to program: everything they do can be accomplished by other means. However, for they offer superior performance and programming conve- nience for some of the more common and important tasks.
The word string in the name of this group should not be taken too se- riously : yes, we will see some examples related to strings, but in general these are instructions that work with any blocks of data in memory, strings are just one special case.
12.1 MOVS: copying memory data
We begin by considering a common problem: copy a chunk of memory from one place to another. In C/C++ the problem would look like this:
char a[N],b[N]; //N is a constant
// copy b to a
for (int i=0; i<N; i++)
a[i]=b[i];
this is actually a bad way to code, we will come back to the C example later, but for now the task is to write an assembler equivalent.
The problem resstated in assembler is
210
a DB N dup(?)
b DB N dup(?)
; copy b to a
and we can solve it by the following loop:
P1:
MOV CX,N
SUB BX,BX
L1: MOV AL,b[BX]
MOV a[BX],AL
INC BX
LOOP L1
or, alternatively:
P2:
MOV CX,N
MOV SI,offset b
MOV DI,offset a
L1: MOV AL,[SI]
MOV [DI],AL
INC SI
INC DI
LOOP L1
Either code works, but there is a much faster program that will use new instruction
MOVSB (MOV String Byte)
While MOVSB does not have any arguments, it uses several registers im- plicitly. The logic of the instruction is
byte ES:[DI] = byte DS:[SI]
SI = SI +- 1
DI = DI +- 1
211
Whether SI and DI are incremented or decremented depends on the Direction Flag; increment is done if the flag is clear (DF==0), decrement if it is set (DF==1). We will assume for now that both index registers are incremented – this is the most common case.
NOTE: MOVSB is an exception to the general no-memory-to-memory rule (as are some other string instructions).
We can use this instruction to copy a single byte of memory (case N==1), instead of the usual
MOV AL,b
MOV a,AL
we can write
MOV SI,offset b
MOV DI,offset a
MOVSB
This is clearly not an improvement: we replaced a 2-instruction program with a 3-instruction one – and possibly even longer, if the DS or ES segments or the direction flag also need to be set. But for copying more than one byte the balance is different:
P3: MOV CX,N
MOV SI,offset b
MOV DI,offset a
L: MOVSB
LOOP L
In P3 the body of the loop is only one line, comparing to two in P1 or three in P3! this code therefore will run faster, about twice. The savings are coming from the index registers being advanced automatically – we do not need extra instructions to update them. While P3 is a faster program indeed, we will be able to speed it up – about twice! – two more times.
For the next version we will need another new instruction
212
REP (REPeat next instruction)
REP is a special form of loop: the body is exactly one instruction, the next one, to be executed CX times. Since out loop’s body is exactly one instruction long, we can use REP instead of LOOP:
P4: MOV CX,N
MOV SI,offset b
MOV DI,offset a
REP
MOVSB
This would run about twice faster (specifics depend on the processor), but why? LOOP is a conditional jump, and all conditional jumps break in- struction prefetching or parallel execution: the processor cannot predict whether a conditional jump going to be taken or not, thus it cannot opti- mize execution beyond the jump. But REP does not suffer from this prob- lem: the processor knows in advance to execute the next instruction a given number of times.
NOTE: Assembler syntax allows to write REP on the same line as the body of the loop instruction1. This, we can change the above example to its shorter equivalent:
P5: MOV CX,N
MOV SI,offset b
MOV DI,offset a
REP MOVSB
NOTE: While REP can be used with any assembler instruction (for ex- ample REP NOP or REP CALL ... is fine), it is only with string instructions that REP leads to useful code.
NOTE: The string instructions are mostly used as part of REP or similar loops, but we will see exceptions.
The next version will be again much faster than the previous, and again, we will use yet another instruction:
1yes, two opcodes in one line!—only here
213
MOVSW (MOVe String Word)
word ES:[DI] = word DS:[SI]
SI = SI +- 2
DI = DI +- 2
MOVSW moves two bytes at a time (and updates the index registers by 2, to point to the next item, just as we need). This allows us to make the loop twice shorter, assuming that N is even:
P6: MOV CX,N/2
MOV SI,offset b
MOV DI,offset a
REP MOVSW
Since a single MOVSB takes as much time as MOVSW, this is exactly twice faster. And, if desired, we can go further:
MOVSD (MOVe String Dword)
dword ES:[DI] = dword DS:[SI]
SI = SI +- 4
DI = DI +- 4
MOVSD moves four bytes at a time, here we must assume that N is divisible by 4:
P7: MOV CX,N/4
MOV SI,offset b
MOV DI,offset a
REP MOVSD
Can we use word or doubleword MOVS if N is odd (or not divisible by 4)? Yes, but we will only work out the program for MOVSW. For odd N we can code:
214
P8: MOV CX,N/2
MOV SI,offset b
MOV DI,offset a
REP MOVSW
MOVSB
(We use a standalone MOVSB to copy one remaining byte). The more interesting situation is when we do not know if N is even or
odd at the compile time. One way to solve the problem would be to text N and use the code from the even and odd examples, something like this:
P9:
MOV SI,offset b
MOV DI,offset a
MOV CX,N
TEST CX,1
JNZ odd
even:SHR CX,1
REP MOVSW
JMP done
odd: SHR CX,1
REP MOVSW
MOVSB
done:
but shorter,prettier and faster code is possible :
P10:
MOV SI,offset b
MOV DI,offset a
MOV CX,N
SHR CX,1
REP MOVSW
RCL CX,1
REP MOVSB
The code is faster because of the absence of condtional jumps.
215
Exercise: Understand how this works!
One can optimize the code further using MOVSD, we will not do this but leave this as an exercise.
Let us now return to the original problem: how to copy an array in C? The correct answer is
char a[N],b[N]; //N is a constant
// copy b to a
memcpy(a,b,N);
where memcpy is a function that is typically written in assembler and is equivalent to either P10 or its MOVSD-based analog. memcpy (this stands for memory copy) is a part of standard C/C++ and is found in the string.h header; use of memcpy allows one to replace a slow C loop by a fast assem- bler one.
We will see other useful functions in string.h later.
12.1.1 Deleting from a character array
In this and the following section we will consider a typical problem that comes up when working with strings: deleting and inserting characters. A sample deletion problem we will look at is to remove one character (say, D) from the array buff.
buff DB "ABCDEFGHI...XYZ"
This is a perfect problem for string move:
MOV SI,offset buff+4 ;-->E
MOV DI,offset buff+3 ;-->D
MOV CX,len ;number of bytes after D
REP MOVSB
Notice that we do not try to optimize this with MOVSW etc; notice also that in any application we probably keep the length of the string stored somewhere and we would need to update (decrement it) as well.
Similar code can be used to delete more than one character.
216
12.1.2 Inserting into a character array
The reverse problem would be to insert a character or a string into another string, for example we may want to insert "789" after C. This task has to be broken into two: (1) we should make space by shifting DEF.... to the right by three positions; then (2) we will copy "789" into the provided space.
Consider the following code:
MOV SI,offset buff+3
MOV DI,offset buff+6
MOV CX,len ; number of bytes after C
REP MOVSB
While this would indeed create three spaces in the array, it would also destroy the data. While REP MOVSB is a super-optimized form of a loop, it is still a loop and copying the data to the right would override and destroy the contents of buff. Very first copy, for example, would copy D over G and G will be irreversibly gone.
217
- Syllabus
- Preliminary material
- Introduction #1: looking at new hardware
- Introduction #2: History
- Introduction #3: Fundamentals
- x86 CPU
- 8086 registers -- full list
- General addressing scheme
- Back to History: Original IBM PC (1981)
- Back to History: 32 bit OS?)
- Instructions
- Overall structure of asm program
- The NOP instruction
- The MOV instruction
- Examples
- Byte order
- The XCHG instruction
- Binary encoding of XCHG
- The ADD instruction
- Overflow detection
- The SUB instruction
- The INC instruction
- The DEC instruction
- The NEG instruction
- The CMP instruction
- Logical or Bitwise
- Bits and Masks
- Checking for odd/even
- Sets
- Images
- The AND instruction
- The TEST instruction
- The OR instruction
- The XOR instruction
- The NOT instruction
- Shift and Rotate operations
- SHL
- SHR
- SAR
- SAL
- ROL
- ROR
- RCL
- RCR
- Data conversion
- Multiplication and Division
- LEA : Load effective address
- Long integers: ADC and SBB
- 24-bit case
- other operations
- Jumps
- Unconditional Jumps
- Conditional Jumps
- Using Zero Flag
- Using Sign Flag
- A few examples
- Using Carry Flag
- Mixing signed and unsigned
- Jumps encoding
- LOOPS
- Control statements templates
- Variables
- Declaring variables
- Using arrays
- Memory addressing syntax
- Video memory in text mode.
- Project #1
- Program outline
- Submission
- How to compile and run a program.
- Stack and procedures
- Using stack
- Encoding of PUSH/POP instructions
- Stack implementation
- Stack Balancing
- Working with Stack Pointer
- Procedures
- Structured syntax
- Printing a number.
- Printing a number in hex
- Interrupts
- Interrupt types
- INT 0
- INT 1
- INT 2
- INT 3
- INT 4
- INT 5
- INT 8
- INT 9
- INT 10h
- INT 13h
- INT 16h
- INT 20h
- INT 21h
- Copying a file
- INT 21h,function 09h Write string to console
- INT 21h,function 3Ch Create File
- INT 21h,function 3Dh Open File
- INT 21h,function 3Eh Close File
- INT 21h,function 3Fh Read File
- INT 21h,function 40h Write File
- INT 21h,function 42h Seek file
- INT 33h
- Project #2
- Sample code
- Reading in an entire file
- Reading command line
- Branch prediction
- The abs() function
- String Instructions
- MOVS: copying memory data
- Deleting from a character array
- Inserting into a character array