CDA-4101 Lecture 3 Notes



Processors: Parallelism

  • Instruction Level Parallelism - multiple instructions executing on the same CPU
  • Processor Level Parallelism - using multiple CPUs to execute instructions

Pipelining


Pipelining?

Monte-Carlo Assembly Line

Henry Ford meets CPUs

Early Pipelining: Prefetch Buffers


Before Prefetch Buffers

IF = Instruction fetch, EX = Instruction execute, In = Instruction n

    Time
    t1  t2  t3  t4  t5  t6
I1  IF  EX
I2          IF  EX
I3                  IF  EX


After Prefetch Buffers

IF = Instruction fetch, EX = Instruction execute, In = Instruction n

    Time
    t1  t2  t3  t4  t5  t6
I1  IF  EX
I2      IF  EX
I3          IF  EX
I4              IF  EX
I5                  IF  EX


Five Stage Pipeline

    Time
    t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11
I1  IF  ID  OF  EX  WB
I2      IF  ID  OF  EX  WB
I3          IF  ID  OF  EX  WB
I4              IF  ID  OF  EX  WB
I5                  IF  ID  OF  EX  WB
I6                      IF  ID  OF  EX  WB
I7                          IF  ID  OF  EX  WB
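
The diagram above can be generated mechanically. The following is a minimal sketch, assuming an ideal pipeline with no stalls in which instruction i enters IF in cycle i. The function names (pipeline_schedule, total_cycles, render) are made up for illustration.

```python
STAGES = ["IF", "ID", "OF", "EX", "WB"]

def pipeline_schedule(num_instructions, stages=STAGES):
    """Return {instruction number: {cycle: stage name}} for an ideal pipeline.

    Instruction i (1-based) enters the first stage in cycle i and then
    advances one stage per cycle, with no stalls.
    """
    return {
        i: {i + offset: name for offset, name in enumerate(stages)}
        for i in range(1, num_instructions + 1)
    }

def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles until the last instruction leaves the last stage: n + k - 1."""
    return num_instructions + num_stages - 1

def render(schedule, cycles):
    """Format a schedule as a text timing diagram."""
    header = "    " + " ".join(f"t{t:<3}" for t in range(1, cycles + 1))
    lines = [header.rstrip()]
    for i, row in schedule.items():
        cells = " ".join(f"{row.get(t, ''):<4}" for t in range(1, cycles + 1))
        lines.append((f"I{i:<3}" + cells).rstrip())
    return "\n".join(lines)

n = 7
print(render(pipeline_schedule(n), total_cycles(n)))
```

For 7 instructions and 5 stages, total_cycles gives 7 + 5 - 1 = 11, which matches the t1 to t11 span of the diagram.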

Pipeline Math

Example

  • Assume each stage takes 20 nanoseconds.
  • Without pipelining:
    • instruction latency is 100 ns
    • execute 10 instructions per microsecond, or 10 MIPS
  • With pipelining:
    • cycle time is 20 nanoseconds
    • execute 50 instructions per microsecond, or 50 MIPS.
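
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the helper name mips is an assumption, and the pipelined figure assumes the pipeline stays full.

```python
STAGE_TIME_NS = 20  # each stage takes 20 ns (from the example above)
NUM_STAGES = 5      # five-stage pipeline: IF, ID, OF, EX, WB

def mips(ns_per_instruction):
    """Millions of instructions per second, given ns between completions."""
    instructions_per_second = 1e9 / ns_per_instruction
    return instructions_per_second / 1e6

# Without pipelining, one instruction completes per full latency.
latency_ns = NUM_STAGES * STAGE_TIME_NS
unpipelined = mips(latency_ns)

# With a full pipeline, one instruction completes every cycle.
cycle_ns = STAGE_TIME_NS
pipelined = mips(cycle_ns)

print(f"latency: {latency_ns} ns")
print(f"unpipelined: {unpipelined:.0f} MIPS, pipelined: {pipelined:.0f} MIPS")
```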

Definitions:

  • nanosecond (ns) - one billionth of a second (1 x 10^-9 secs.)
  • microsecond (μs) - one millionth of a second (1 x 10^-6 secs.)
  • millisecond (ms) - one thousandth of a second (1 x 10^-3 secs.)
  • MIPS - millions of instructions per second.

Pipelining Realities

There is an apparent factor of 5 speed-up in a 5 stage pipeline.
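
Even with no hazards at all, the factor of 5 is only approached, not reached, because the pipeline takes a few cycles to fill. A minimal sketch, assuming n instructions on an ideal k-stage pipeline take n + k - 1 cycles versus n * k cycles unpipelined (the function name speedup is illustrative):

```python
def speedup(n, k):
    """Speedup of an ideal k-stage pipeline over unpipelined execution."""
    unpipelined_cycles = n * k
    pipelined_cycles = n + k - 1  # k cycles for the first, then 1 per instruction
    return unpipelined_cycles / pipelined_cycles

for n in (1, 7, 100, 10_000):
    print(n, round(speedup(n, 5), 3))
```

The ratio grows toward 5 as n increases but stays below it for any n greater than 1.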

Problems:


Pipeline Hazard Types

Examples to follow.
  • Data Hazards - when an instruction needs data that is not yet available.
  • Structural Hazards - when the same hardware is needed by more than one instruction in the pipeline.
  • Control Hazards (Branch Hazards) - when changes to the program counter affect the pipeline execution.

Structural Hazards

    Time
    t1  t2  t3  t4  t5  t6  t7  t8
I1  IF  ID  OF  EX  WB
I2      IF  ID  OF  EX  WB
I3          IF  ID  OF  EX  WB

Structural Hazard Remedy: Duplicate Hardware


Structural Hazard Remedy: Stalling

    Time
    t1  t2  t3  t4  t5  t6  t7  t8
I1  IF  ID  OF  EX  WB
I2      IF  ID  OF  EX  WB
I3          --  IF  ID  OF  EX  WB

    Time
    t1  t2  t3  t4  t5  t6  t7  t8
I1  IF  ID  OF  EX  WB
I2      IF  ID  OF  EX  WB
I3          --  --  IF  ID  OF  EX
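
One way to read the two stalling diagrams is as a conflict over a single memory port. The sketch below rests on that assumption, which these notes do not state: IF and OF both need the same port, and an instruction is held back before IF until none of its stages would collide with an earlier instruction. The names SHARED and schedule_with_stalls are illustrative.

```python
STAGES = ["IF", "ID", "OF", "EX", "WB"]

# Assumption for illustration: IF and OF both need the one memory port.
SHARED = {"IF", "OF"}

def schedule_with_stalls(num_instructions):
    """Return the cycle in which each instruction performs IF.

    An instruction is delayed until none of its shared-resource stages
    falls in a cycle already claimed by an earlier instruction.
    """
    busy = set()   # cycles in which the shared port is already claimed
    starts = []
    earliest = 1
    for _ in range(num_instructions):
        start = earliest
        while any(start + offset in busy
                  for offset, name in enumerate(STAGES) if name in SHARED):
            start += 1  # one stall cycle
        for offset, name in enumerate(STAGES):
            if name in SHARED:
                busy.add(start + offset)
        starts.append(start)
        earliest = start + 1
    return starts

print(schedule_with_stalls(3))
```

Under this assumption I3 cannot fetch in t3 (I1 is in OF) or in t4 (I2 is in OF), so it fetches in t5, which is the two-stall case shown in the second diagram.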


Data Hazards

Consider the following 2-instruction code using our 5 stage pipeline:
    I1: R3 := R1 + R2
    I2: R4 := R3 + 10

    Time
    t1  t2  t3  t4  t5  t6  t7  t8
I1  IF  ID  OF  EX  WB
I2      IF  ID  OF  EX  WB


Data Hazard Remedy: Stalling

    Time
    t1  t2  t3  t4  t5  t6  t7  t8
I1  IF  ID  OF  EX  WB
I2      IF  ID  --  --  OF  EX  WB


Data Hazard Remedy: Data Forwarding

    Time
    t1  t2  t3  t4  t5  t6  t7  t8
I1  IF  ID  OF  EX  WB
I2      IF  ID  --  OF  EX  WB
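
The stall counts in the stalling and forwarding diagrams can be reproduced with a small model. This is a sketch under an assumption inferred from those diagrams: a dependent instruction may do its OF only in a cycle after the producer's WB (no forwarding) or after the producer's EX (forwarding). The names PRODUCER and stalls_needed are illustrative.

```python
# Cycle in which the producer (fetched in cycle 1) occupies each stage.
PRODUCER = {"IF": 1, "ID": 2, "OF": 3, "EX": 4, "WB": 5}

def stalls_needed(distance, forwarding=False):
    """Stall cycles for a consumer issued `distance` instructions later.

    Without forwarding the operand is readable only after the producer's
    WB; with forwarding it is available right after the producer's EX.
    """
    ready_after = PRODUCER["EX"] if forwarding else PRODUCER["WB"]
    unstalled_of = PRODUCER["OF"] + distance  # consumer's OF with no stalls
    return max(0, ready_after + 1 - unstalled_of)

print(stalls_needed(1))                   # I2 right after I1, no forwarding
print(stalls_needed(1, forwarding=True))  # I2 right after I1, with forwarding
```

For I2 issued directly after I1, the model gives two stalls without forwarding and one with it, as in the diagrams.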


Data Hazard Remedy: Instruction Reordering

    Time
    t1  t2  t3  t4  t5  t6  t7  t8
I1  IF  ID  OF  EX  WB
I3      IF  ID  OF  EX  WB
I4          IF  ID  OF  EX  WB
I2              IF  ID  OF  EX  WB
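
Reordering is only legal when the instructions being swapped do not depend on each other. A minimal sketch of that check follows; I3 and I4 are hypothetical instructions invented for the example (the notes do not give them), and the helpers independent and sink are illustrative names.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "name dest srcs")

def independent(a, b):
    """True if a and b can be swapped: neither reads or overwrites
    the register the other one writes."""
    return (a.dest not in b.srcs
            and b.dest not in a.srcs
            and a.dest != b.dest)

def sink(program, idx, slots):
    """Move program[idx] later by up to `slots` positions, stopping at
    the first instruction it is not independent of."""
    prog = list(program)
    moved = 0
    while (moved < slots and idx + 1 < len(prog)
           and independent(prog[idx], prog[idx + 1])):
        prog[idx], prog[idx + 1] = prog[idx + 1], prog[idx]
        idx += 1
        moved += 1
    return prog

program = [
    Instr("I1", "R3", ("R1", "R2")),  # R3 := R1 + R2
    Instr("I2", "R4", ("R3",)),       # R4 := R3 + 10
    Instr("I3", "R7", ("R5", "R6")),  # hypothetical, unrelated to I1 and I2
    Instr("I4", "R9", ("R8",)),       # hypothetical, unrelated to I1 and I2
]

print([i.name for i in sink(program, 1, 2)])
```

With these made-up instructions, I2 moves behind I3 and I4, giving the order I1, I3, I4, I2 shown in the diagram, so I1's result is written back before I2 fetches its operands.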


Types of Data Hazards


Control Hazards

         Time
         t1  t2  t3  t4  t5  t6  t7  t8
I1       IF  ID  OF  EX  WB
I2           IF  ID  OF  EX  WB
I3 | I4          IF  ID  OF  EX  WB

Control Hazard Remedies: Instruction Reordering


Real Pipelining

Pentium I:
  • 5 stage pipeline (instructions)
  • 9 stage pipeline (floating point)
Pentium III:
  • 10 stage pipeline
  • Two ALUs and can operate at 1/2 clock cycle
Pentium IV:
  • Transistor Count: 55 million
  • 20 stage pipeline
UltraSparc III:
  • Transistor Count: 29 million
  • 14 stage pipeline

Superscalar Architectures

Pentium
  • Even the Pentium I had two pipelines.
UltraSparc III
  • Six execution pipelines (2 integer, 2 FP/VIS, 1 load/store, 1 addressing)

Superscalar Configurations


Processor Level Parallelism


Flynn's Taxonomy* of Parallel Processors

*Flynn, M. J., Some Computer Organizations and Their Effectiveness, IEEE Transactions on Computers, C-21 (9), pp. 948-960, Sept. 1972.

Images of parallel machines borrowed from: http://www.cems.uwe.ac.uk/teaching/notes/PARALLEL/ARCHITEC


SISD (no parallelism)


SIMD


Illiac IV
  • array processor - the same instruction executes on all processors, only the data varies
  • vector processor - similar to array processor, only a single instruction executes by passing many data words through a heavily pipelined ALU

Cray 1

Cray SV1

MIMD


Motherboard
(2 processors, shared-memory)
MIMD can use either:
  • shared memory (a.k.a., multi-processor machine) - communicate through shared memory
  • distributed memory (a.k.a., multi-computer machine) - communicate through message passing
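
The two communication styles can be contrasted in a few lines. This is only a model, not real multiprocessor code: Python threads stand in for processors, a lock-protected variable stands in for shared memory, and a queue.Queue stands in for a message channel. The function names are illustrative.

```python
import queue
import threading

def run_workers(worker, chunks):
    """Start one thread per chunk and wait for all of them to finish."""
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def shared_memory_sum(chunks):
    """Workers add partial sums into one shared variable, guarded by a lock."""
    total = {"value": 0}
    lock = threading.Lock()

    def worker(chunk):
        partial = sum(chunk)
        with lock:  # only one worker may update the shared value at a time
            total["value"] += partial

    run_workers(worker, chunks)
    return total["value"]

def message_passing_sum(chunks):
    """Workers update nothing shared; each sends its partial sum as a message."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))

    run_workers(worker, chunks)
    return sum(mailbox.get() for _ in chunks)

chunks = [list(range(1, 51)), list(range(51, 101))]
print(shared_memory_sum(chunks), message_passing_sum(chunks))
```

In the shared-memory version the coordination problem is protecting the shared variable; in the message-passing version it is deciding who sends what to whom.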

Cray Y-MP
(8 processors, shared-memory)

Multiprocessor Arrangements


Multicomputers

  • Rely on message passing for communication and coordination
  • Too many connections are costly and unmanageable
  • Which processors can talk to which other processors?
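
To see why full connectivity becomes costly, it helps to count links. The sketch below compares a fully connected network with two common sparser topologies, a ring and a hypercube. These two are chosen here as examples and are not taken from the notes; the function names are illustrative.

```python
def fully_connected_links(n):
    """Every pair of the n processors has its own link: n(n-1)/2."""
    return n * (n - 1) // 2

def ring_links(n):
    """Each processor links to its two neighbours (assumes n >= 3)."""
    if n < 3:
        raise ValueError("a ring needs at least 3 processors")
    return n

def hypercube_links(n):
    """Each of the n = 2^d processors has d links; each link is shared by two."""
    d = n.bit_length() - 1
    if n < 1 or 2 ** d != n:
        raise ValueError("hypercube size must be a power of two")
    return n * d // 2

for n in (8, 64):
    print(n, fully_connected_links(n), ring_links(n), hypercube_links(n))
```

The fully connected count grows with the square of the number of processors, which is the cost the bullets above refer to. Sparser topologies use fewer links, but then some processors cannot talk to each other directly.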

Distributed Memory MIMD

Message Passing Topologies


MISD?


Flavors of Parallelism


Final Observation

  • A typical PC is sort of a fully connected multi-computer
  • Multiple processors:
    • CPU
    • video card
    • bus controllers
    • disk controllers
    • etc.
  • Control signals are the messages
  • Common communication bus gives full connectivity