Computer Architecture Midterm

mgarzona
Chapter2.docx

Figure 9: The organization of a simple computer.

· Computer Systems Organization

Computer organization

· A digital computer is an interconnected system of processors, memories and input/output devices.

· In the following we focus on each of these 3 categories.

Figure 9 illustrates the organization of a digital computer.

· CPU (Central Processing Unit) - the “brain” of the computer.

Its function is to execute the programs stored in the main memory by

fetching instructions,

examining them,

executing them one after the other.

· A bus:

connects the components of a computer,

is a collection of wires (parallel, serial) for transmiting address, data and control signals,

external buses connect the CPU to the memory and the I/O devices,

internal buses are found inside the CPU.

1

A modern computer today

Processor

Desktop workstation

Mobile

Core i7X 990x 3.20 GHz Quad

A5 (ARM Cortex-A9 MPCore 1GHz)

Memory

36 GB DDR3

512 MB DDR2

Secondary storage

2 x 2 TB SATA 7200 rpm HDD

64 GB SSD

Keyboard, mouse

Optimus Maximus

none

Videocard

AMD Radeon HD6870 2 GB

Integrated on-chip (PowerVR SGX543MP2)

Monitor

2 x 30” LCD

9.7” LCD multitouch

Network card

Intel(R) PRO/1000 EB

Wi-Fi, 3G

Optical drive

Blu-ray Disk Burner

-

2.1 Processors

CPU Organization

· Control Unit

fetches instructions from the main memory and determines their type.

· ALU

performs operations needed to carry out the instructions,

e.g.: addition, boolean AND.

· Registers

small high-speed memory used to store temporary results and certain control information,

general purpose and special purpose, e.g.:

Program Counter (PC) - points to the next instruction to be exe-cuted,

Instruction Register (IR) - holds the instruction that is currently executed.

Data Path

· The registers, together with the ALU and several buses form the data path.

· Instructions:

register-memory

· fetch words from memory,

· store words back into memory.

register-register (data path cycle)

· fetch two operands from registers into the input registers,

· bring the content into the ALU input registers,

2

Figure 10: The data path of a typical von Neumann machine.

· perform operation into the ALU,

· store result back intro a register.

Example: Figure 10 illustrates an addition operation on a typical von Neumann machine.

The data path cycle is the core of the CPU.

Instructions execution

· The CPU executes the instructions in a series of small steps:

1. Fetch the next instruction from memory into an instruction register.

2. Change the program counter to point to the next instruction.

3. Determine the type of instruction just fetched.

4. If the instruction uses a word in the memory, determine where it is.

5. Fetch the word, if needed, into a CPU register.

6. Execute the instruction.

7. Go to step 1 to begin executing the next instruction.

· The above sequence is the fetch-decode-execute cycle, central to the op-eration of all computers.

· The way the CPU works can be described into some programming lan-guage. Figure 11 illustrate such an interpreter.

3

Figure 11: An interpreter for a simple computer (written in Java).

4

Interpretation of instructions

· Instruction interpreter - a program that can imitate the functions of a CPU (fetch, examine, execute).

· Hardware (processor) – software (interpreter) equivalence!

· After having specified the machine language L, design decision for L:

direct hardware implementation,

implementation of a software interpreter,

hybrid solutions.

· Interpreter: breaks the instructions of L into small steps the interpreter’s machine becomes much simpler and cheap than the target machine for L.

· Instructions were initially simple (1950’s), then more and more complex.

· Transition to complex instructions: sequence of simple instructions occurring frequently or special cases (floating point, array operations).

· Hardware implementation of complex instructions more powerful computers. However, instruction compatibility requirements and the rising cost of software development created the need to implement complex instructions even on low-end computers.

So, how to build a low-cost computer?

· Economic factors:

hardware instructions - higher cost (Cray supercomputers).

solution: interpretation.

RISC vs. CISC

CISC (Complex Instruction Set Computer) (term coined later, in opposi-tion to RISC.)

since 1950’s the instructions were interpreted,

more and more instructions (DEC VAX: several hundred instructions, more than 200 ways to specify operands in each instr.),

interpretation provided a way to add new instructions quickly,

interpretation provised easy bug fixing,

interpretation ensured backward compatibility.

RISC (Reduced Instruction Set Computer)

1980, David Patterson and Carlo Séquin (Berkeley): VLSI CPU without interpretation,

coined the term RISC and named the CPU RISC I (! SPARC),

1981 John Hennessy (Stanford): MIPS CPU (! MIPS),

no need to be backward compatibility to clog the design,

5

key idea: instructions can be issued quickly and executed in hard-ware (the key to good performance: designing instructions that could be issued quickly).

RISC vs. CISC ?

RISC supports’ belief: even if a RISC machine takes four or five instr.to do what a CISC machine does in one instruction, if RISC instructions are 10x faster (they are not interpreted), RISC wins.

However, nothing like this has happened. Why not?

too much invested in CISC software, hardware, RISC has not taken over,

hybrid approaches (CISC architecture with RISC core - Intel Pentium + successors):

CISC was able to contain RISC core that execute the simplest instruction, while interpreting complicated instructions in the usual CISC way

Design principles for modern computers

· All instructions directly executed by hardware.

CISC instructions should be broken down into RISC instructions.

· Maximize the rate at which the instructions are issued.

it is less important how long instructions will take to execute,

parallelism could play an important role.

· Instructions should be easy to decode.

instructions should be regular, with fixed length and a small number of fields.

· Only Loads and Stores should reference the memory.

operands for all other operations should come from and return to registers.

· Provide plenty of registers

access to memory is slow ! the more registers, the better.

· Attention: The above principles are relative to the current technological resources and limitations!

6

Figure 12: (a) A five stage pipeline. (b) The state of each stage as a function of time (nine clock cycles).

Improving computer performance

· Brute force: make chips faster by increasing the clock speed. (Limited by the technological development at a fixed moment in time.)

· Parallelism: doing more things at once - get more performance for a given clock speed.

Instruction level parallelism - get more instructins per second out of the machine.

Processor (thread) level parallelism - use more processors to share the work on the same problem.

Data level parallelism - spread data over multiple system (e.g. “cloud”).

Instruction-level parallelism: pipelines

· Pipelining: divide instruction execution into small steps, each supported by a dedicated piece of hardware (as illustrated in Figure 12).

Instruction level parallelism: superscalar architectures

· “One pipeline good, two pipelines better.”

pairs of instructions that do not conflict over resources or do not depend on eachother can be sent on different pipelines,

correctness of this process is ensured by the compiler, or using extra hardware,

7

Figure 13: Dual five-stage pipeline with a common instruction fetch unit.

Figure 14: A superscalar processor with five functional units.

Example: Intel processors (Pentium):

· the u pipeline: arbitrary instructions,

· the v pipeline: integer and floating-point instructions.

· this led to important performance gains (Pentium up to 2 times faster than 486).

Figure 13 illustrates the use of two pipelines.

· “Two pipelines good, four pipelines better”?

Actually, no - too much hardware duplication.

· For high-end CPUs: one pipeline, but with multiple functional units, as illustrated in Figure 14.

8

Figure 15: An array processor of the ILLIAC IV type.

Processor-level parallelism: array computers

· Array processors - large number of identical processors, performing the same sequence of instructions on different sets of data.

· ILLIAC IV, 1972 (4 quadrants, each quadrant having 8 x 8 square grid of processor-memory elements), A single control unit per quadrant broadcast a single instruction to all processors,

· Architecture known as SIMD (Single Instruction-stream Multiple Data), different form from the standard von Neumann architecture.

· The design of an ILLIAC IV quadrant is illustrated in Figure 15.

· A modern incarnation is the Fermi GPU, illustrated in Figure 16.

· Vector processor -Like a SIMD it is very efficient at executing a sequence of operations on pairs of data element, with the difference that all of operations are performed in a single, heavily pipelined functional unit (Cray-1, 1974).

· No array computers are in use today, but the same principle is used in Intel x86-* processors for multimedia instructions (MMX, SSE).

Processor-level parallelism: multiprocessors/multicomputers

· Multiprocessors - systems with multiple full-blown CPUs sharing a common memory

9

Figure 16: SIMD core of the Fermi GPU.

easy to build for small number of processors (up to 64), difficult for more.

Figure 17 illustrates possible implementation schemes for multipro-cessor systems.

· Multicomputers - systems of interconnected computers, each with its memory, but no common memory.

several topologies for the connections: 2D, 3D, grid, trees, rings,

up to 10000 CPUs.

Hybrid approaches - multicomputers that simulate a shared memory.

Processor clock

· Instruction execution is driven by a high-frequency processor clock. e.g.

3.5 GHz computer (3.5 billion Herz),

the clock ticks 3.5 billion times a second,

· Execution time (clock cycles):

instructions may take 1, 2, 3, : : : clock cycles,

different instructions have different execution time,

10

processor speed is not a measure of processor power (megaherz/gigaherz myth).

Examples:

· AMD vs. Intel,

· Intel vs. PowerPC,

· Intel Pentium M vs. Pentium 4.

2.2 Primary Memory / Secondary Memory / Input/Out-put

11

Primary Memory

The Primary Memory

Primary Memory

· Memory = the part of a computer where programs and data are stored.

· Primary memory (fast, small, volatile) vs. secondary memory (slow, large, persistent).

· Synonyms (British) – storage, store (more and more used to refer to disk storage).

Bits

· basic unit of memory (holding 0 or 1);

· binary arithmetic : “more efficient” (= easier to distinguish between 2 physical values rather than between 10);

· Example: BCD(*) (Binary Coded Decimal) vs. binary:

1944:

­ BCD:

00001 1001 0100 0100

· binary: 0000011110011000

BCD:0­9999

binary:0­65,535

(*) [use 4 bits to represent one digit, 16 combinations possible, 6 not used, the rest 0 ,..., 9]

12

Primary Memory

Memory Addresses

· memories consist of a number of cells (locations) – used to store information;

· each cell has a number, which gives its address;

· if an address is expressed on m bits, the maximum number of addressable cells is 2m.

The organization of a 96 bit memory, on 8 bits (a),

12 bits (b) and 16 bits (c), respectively.

Primary Memory

Memory Addresses (continued)

· a cell = smallest addressable unit;

· most computer manufacturers have standardized a 8­bit cell, called byte;

· bytes are grouped into words;

· word = instructions operate on entire words

32 bit machines will have 32 bit registers;

64 bit machines will have 64 bit registers.

Ordering bytes:

· big endian, left­to­right (SPARC, IBM mainframes);

· little endian (b), right­to­left (Intel).

· Problems with data transfer from one format to the other!

If we store “Jim Smith, age 21, department 260 (=1x256+4)

13

Primary Memory

Units for Measuring Memory

The multiplication factor for the next unit is 1024 = 210 .

Primary Memory

Error Correcting Codes

· memories can occasionally produce errors;

· to avoid such problems, memories use error­detecting or error­correcting codes;

· an n bit word (codeword) usually contains m data bits and r redundant bits:

n = m + r

· Hamming distance = the minimum number of bits that are different between 2 codewords;

· Example:

10001001 and 10110001 have the Hamming distance 3;

· with Hamming distance d, d single bit errors are required to convert from one into another;

· for a n bit codeword (n = m + r), there are 2n codewords that can be formed, of which only 2m are legal; whenever the computer detects an illegal codeword, an error has occurred;

· Example: parity bits.

14

Hamming’s algorithm using parity bits

In a Hamming code,

· r parity bits are added to an m-bit word.

· m+r bits are numbered starting at 1, not 0

· all bits whose bit number is a power of 2 are parity bits; the rest are used for data.

(Example) for a 16-bit word, 5 parity bits (bit 1,2,4,8,16) are added total 21 bits

· Each parity bit checks specific bit positions; the parity bit is set so that the total number of 1s in the checked position is even (when we use even parity system).

(Example) Bit 1 checks bit 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21

Bit 2 checks bit 2, 3, 6, 7, 10, 11, 14, 15, 18, 19

Bit 4 checks bit 4, 5, 6, 7, 12, 13, 14, 15, 20, 21

Bit 8 checks bit 8, 9, 10, 11, 12, 13, 14,15

Bit 16 checks bit 16, 17, 18, 19, 20, 21

In general, bit b is checked by those parity bits ,,, such that .

· Construction of the Hamming code for the memory word 1111000010101110 by adding 5 check bits to the 16 data bits.

· What happen if bit 5 is inverted by an electrical surge on the power line?

· So, the new codeword : 001001100000101101110

(a). Parity bit 1 incorrect (1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 contain five 1s)

(b). Parity bit 2 correct (2, 3, 6, 7, 10, 11, 14, 15, 18, 19 contain six 1s)

(c). Parity bit 4 incorrect (4, 5, 6, 7, 12, 13, 14, 15, 20, 21 contain five 1s)

(d). Parity bit 8 correct (8, 9, 10, 11, 12, 13, 14,15 contain two 1s)

(e). Parity bit 16 correct (16, 17, 18, 19, 20, 21 contain four 1s)

(a) should be even. The incorrect bit must be one of them

(d) should be even. The incorrect bit must be one of them

By the way, (b) is correct. So, eliminate 7, 15

(c) is also correct. So, eliminate 13

(e) is also correct. So, eliminate 21

The only bit left is bit 5, which is the one in error. In this way, errors can be corrected

15

Primary Memory

Cache Memory

· historically, CPUs have always been much faster than memories;

· this leads to inefficiencies in execution (CPU has to wait for the data from the memory);

· technologically, it is possible to have fast enough memories, but this is not economically viable;

· solution: use a small amount of very fast – cache memory at the CPU level;

· the most heavily used words are kept in the cache;

· CPU first looks in the cache, and only then in the main memory;

· logically, the cache lies between the main memory and the CPU, in practice it can be located in several other places.

Primary Memory

Cache Memory (continued)

· programs run faster if the data they need is in the cache and not in the main memory;

HOW TO ACHIEVE THAT ?

· locality principle: memory references made in a short time interval tend to use only a small fraction of the total memory the basis for all caching systems.

· if a word is read or written k times in a short interval, then 1 slow reference to the main memory

is needed, and k­1 references to cache – the bigger the k the better the performance;

· memories and cache are organized in cache lines: whenever a word is needed from the memory, the whole cache line will be loaded;

· Using the locality principle, main memory and caches are divided up into fixed-sized blocks (cache lines)

· cache design issues:

· size of cache: 16KB, ...., 2 MB, 6MB – the bigger the more expensive the CPU;

· organization of cache;

· whether instructions and data are kept in different caches:

· unified cache: simpler to build, balance between instructions and data;

· split cache (Harvard architecture): allows parallel accesses, good with pipelined CPU

· number of caches: not uncommon to have one on the CPU, one off chip but still in the same package, and a third one further away.

16

Primary Memory

Memory Packaging

· initially, memory chips were separate units (1Kb­1Mb), plugged directly in sockets;

· at present, memory chips ( 8 or 16 chips) are packaged into circuit boards, which are sold as units;

· memory units:

· SIMM (Single Inline Memory Module):

· 72 connectors;

· transfer 32 bits per clock cycle.

· typical sizes: 32MB, 64MB, ...

· DIMM (Double Inline Memory Module):

· 84 gold­plated connectors;

· 64 bits transfer;

· typical sizes: 64MB, ...

· variants: SO­DIMM (Small Outline DIMM) – laptops.

· Usually no error detecting/correcting features, as average error rate: 1/10 years.

Primary Memory

RAM Types

· SRAM (Static RAM)

· used for cache, motherboard memories, digital signal processing;

· retains its contents as long as power remains applied;

· NVRAM (Non­Volatile RAM)

· user­programmable memory chip whose data is

retained when power to the chip is turned off;

· modems, MP3 players;

· DRAM (Dynamic RAM)

· stores each bit of data in a separate capacitor;

· use refresh logic to refresh the memory;

· bigger, better, faster, more:

· Fast Page Mode DRAM;

· EDO RAM (Extended Data Out DRAM);

· SDRAM (Synchronous DRAM) ;

· DDR SDRAM (Double Data Rate Synchronous DRAM);

· RDRAM (Rambus DRAM).

17

Secondary Memory

The Secondary Memory

Secondary Memory

Memory Hierarchies

· The trouble with the registers, cache, main memory is that although more and more is available, there isn't enough to go around to store all that needs to be stored.

· Example: 50.000.000 books (U.S. Library of Congress) each with 1 MB of textand & 1 Mb pictures would fit on about 100 TB.

· Solution: use a multilevel memory hierarchy.

18

Secondary Memory

Memory Hierarchies (continued)

· On the 5 level memory hierarchy, as we move down, 3 parameters increase:

1. the access time:

· registers (nanoseconds),

· cache (small multiple of registers),

· main memory (tens of nanoseconds),

[~~~~~ huge gap ~~~~~]

· disc access (at least 10 milliseconds),

· tape, optical disc (seconds, taking into account handling).

2. the storage capacity:

· registers (~ 128 bytes),

· cache (a few MB),

· main memory (~ 1 GB),

· magnetic disks (hundreds of GB, now TB),

· etc...

3. number of bits/$ (or favorite currency)

· prices change rapidly,

· main memory: $/ MB;

· magnetic disks: cents/ MB;

· magnetic tape: $/GB.

Secondary Memory

Performance Comparison

Operation

Clock cycles

CPU instructions

1­3

Register access

1

Cache access

1

Main memory access

10

Disk seek time

10,000,000

19

Secondary Memory

The Hard Disk

Secondary Memory

Magnetic Disks

· magnetic disk =

· one or more aluminum platters with a magnetizable coating;

· a disk head floating above the surface on an air cushion, at the end of an arm;

· when a positive or negative current passes through the head, it magnetizes the surface of the platter (write);

· when the head passes over the platter area, a positive or negative current is induced

on the head (read);

The geometry of a disk track (the circular sequence of bits written while the disk makes a complete rotation):

20

Secondary Memory

Organization of Magnetic Disks

· tracks are divided up in sectors of fixed length;

· sectors = preamble (allows synchronization of the head) + data (512 bytes) + error correcting code (ECC);

· EEC: Hamming code (single errors) or Reed­Solomon code (multiple errors);

· between each sector there is an intersection gap;

· unformatted disk capacity – total of sectors, preambles, data, EEC, intersection gaps;

· formatted disk capacity – just the data capacity;

· storage capacity is influenced by:

· radial density: number of tracks per radial centimeter (800­2000);

· linear bit densities: number of bits per track centimeter (~ 100.000);

· most disks consist of several platters stacked vertically, each with its own arm and head;

· all arms are ganged together, so they move to different radial positions all at once;

· cylinder = the set of tracks at a given radial position;

Secondary Memory

Performance

· a disk with 4 platters:

· factors that influence the performance of a magnetic disk:

· seek: the time needed to move the heads at a certain radial position (5­15 ms);

· rotational latency: the time needed for the desired sector to be placed under the head (4­8 ms);

· rotation speeds: 4200, 5400, 7200 ... RPM;

· transfer rates 5­20 MB/sec;

· burst rate: transfer rate when the head is over the first bit in the sector;

· sustained rate: much smaller than the burst rate.

21

Secondary Memory

Disk Controllers

· disk controllers are chips (sometimes full CPUs) associated with each drive;

· their task:

· accept commands from software: READ, WRITE, FORMAT;

· control the arm motion;

· detect, correct errors;

· buffering of multiple sectors;

· caching sectors read for potential future use

· remapping bad sectors.

Floppy Disks

· miniature disks (difference: head on the surface, small capacity – up to 1.44 MB);

· cheap, unreliable !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!; [always make multiple copies]

· surprisingly not yet extinct (spring­summer 2008, in fact essential to the economy!!!!)

· (retro­cool? ­ not likely).

Secondary Memory

IDE disks

· Integrated Drive Electronics – mid 1980s

· controller integrated with the driver;

· BIOS calling convention: not changed for reason of backward compatibility;

· because of backward compatibility, it could address sectors with 4 bits for the head,

6 bits for the sectors, 10 bits for the cylinder max disk:

16 heads, 63 sectors, 1024 cylinders ~ 1,032,192 sectors ~ 528 MB;

· EIDE (Extended IDE)

· support for a second addressing scheme, LBA (Logical Block Addressing);

· 228 ­1 sectors;

· the controller has to convert LBA addresses to head, sector, cylinder addresses;

· other improvements: ability to control up to 4 drives, higher transfer rate;

· can also control CD­ROM drives, tape drives, etc.

· ANSI standard: ATAPI (Advanced Technology Attachment Packet Interface);

: Increased the addressable space & transfer rate to 30MB/sec

· This technology is comparatively cheap, widely used for the garden variety PCs;

· Drawback: internal only (limitations on cable length);

· The terms (E)IDE and ATA are used interchangeably.

22

Secondary Memory

SCSI Disks

· Small Computer System Interface (“scuzzy”)– 1986

:Not different from IDE disks in terms of the organization of cylinders, tracks, and sectors, but have different interface and much higher transfer rate.

· since 1986, increasingly fast versions have been standardized:

· SCSI 1: 8 data bits, 5 MHz bus, 5 MB/sec transfer,

· Fast SCSI: 8 data bits, 10 MHz bus, 10 MB/sec transfer,

· Wide Fast SCSI: 16 data bits, 10 MHz bus, 20 MB/sec transfer,

· Ultra SCSI: 8 data bits, 20 MHz bus, 20 MB/sec transfer,

· Wide Ultra SCSI: 16 data bits, 20 MHz bus, 40 MB/sec transfer,

· Ultra2 SCSI: 8 data bits, 40 MHz bus, 40 MB/sec transfer,

· Wide Ultra2 SCSI: 16 data bits, 40 MHz bus, 80 MB/sec transfer,

· Ultra3 SCSI: 16 data bits, 40 Mhz DDR bus, 160 MB/sec transfer, CRC, error correction (1999),

· Ultra­320 SCSI: 16 data bits, 80 Mhz DDR, 320 MB/sec,

· Ultra­640 SCSI – pushing the hardware limits, very short cables, impractical,

· New research and development: Serial SCSI (various approaches);

· standard disk controllers for UNIX workstations (Sun, HP, SGI), Macs, high­end Intels;

· SCSI: controller + several devices (hard­disks, CD­ROMs, scanners, etc.)

· each device has a unique ID;

· each device has 2 connectors: one for input and one for output;

· each output connected to the input of the next device, last device terminated;

· SCSI controllers and peripherals act as initiators or targets:

· Usually, the controller, acting as initiator, issues commands to disks

· commands and responses occur in phases,

· arbitration allows all the devices to run at once (as opposed to EIDE, where only one

device can run at a time;

Secondary Memory

SUMMARY: Disk Performance

23

24

Secondary Memory

RAID

· Redundant Array of Inexpensive [/Independent] Disks:

· as opposed to SLED (Single Large Expensive Disk);

· specific disk organization can improve performance/reliability;

· a RAID controller + many disks (RAID SCSI obvious choice);

· visible to the operating system as a single disk.

· RAID levels:

Level 0: ­ distribute data over multiple disks: stripping;

­ parallel I/O, reliability potentially worse than SLED;

Level 1: ­ RAID 1 + backup;

­ distributed (fast) read / write twice ­ bad;

­ backup disks (fault tolerance);

Level 2: ­ working on word basis;

­ distribute bits over a large number of disks (and add Hamming code);

Level 3: ­ simplified version of level 2: use parity bit instead;

­ fault tolerance in case of 1 drive failing;

Level 4: ­ like level 0, with 1 disk used for parity;

­ fault tolerance, but poor performance for small updates;

Level 5: ­ like level 0, with parity bits distributed among the disks;

­ good performance, but in event of a crash, reconstructing the content is difficult;

Secondary Memory

RAID Levels

Other RAID configurations

· combinations are possible (e.g. RAID 1+0/0+1/5+0);

· proprietary RAID formats: RAID 1.5, 7, S, Z, ServerRAID 1E, etc.

Secondary Memory

CD-ROMs

CD-Recordables

CD-Writables

DVDs

Input/Output Devices

Buses

Terminals (keyboards, touch screens, flat panels displays, video RAM)

Mice

Game Controllers

Printers

Telecommunication Equipment (Modems, Digital Subscriber Lines, Internet cables)

Digital Cameras

Character Codes

25