

#### BCS-29 Advanced Computer Architecture

**Pipelined Processing** 

Arithmetic Pipeline Design, Other ILP Architectures

**Multiplication**

Consider as an example the multiplication of two 8-bit integers, $A \times B = P$, where $P$ is the 16-bit product. This fixed-point multiplication can be written as the summation of eight partial products, as shown below:

$$P = A \times B = P_0 + P_1 + P_2 + \dots + P_7$$

|    |   |   |   |   |   |   |   |   | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | = | $A$   |
|----|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|-------|
| ×) |   |   |   |   |   |   |   |   | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | = | $B$   |
|    |   |   |   |   |   |   |   |   | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | = | $P_0$ |
|    |   |   |   |   |   |   |   | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | = | $P_1$ |
|    |   |   |   |   |   |   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | = | $P_2$ |
|    |   |   |   |   |   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | = | $P_3$ |
|    |   |   |   |   | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | = | $P_4$ |
|    |   |   |   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | = | $P_5$ |
|    |   |   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | = | $P_6$ |
| +) |   | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | = | $P_7$ |
|    | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | = | $P$   |
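The decomposition into partial products can be checked with a short sketch. It assumes unsigned 8-bit operands and reuses the values of A and B from the table; the function name `partial_products` is illustrative and not part of the original material.

```python
def partial_products(a: int, b: int, width: int = 8) -> list[int]:
    """Return the shifted partial products P_0 .. P_{width-1} of a * b.

    Assumes unsigned operands that fit in `width` bits.
    """
    products = []
    for i in range(width):
        bit = (b >> i) & 1                     # i-th bit of the multiplier
        products.append((a << i) if bit else 0)
    return products


A = 0b10110101  # multiplicand from the table (181)
B = 0b10010011  # multiplier from the table (147)

pps = partial_products(A, B)
P = sum(pps)    # P = P0 + P1 + ... + P7

for i, pp in enumerate(pps):
    print(f"P{i} = {pp:016b}")
print(f"P  = {P:016b}")   # last line printed: P  = 0110011111101111
```

Only the partial products whose multiplier bit is 1 (here $P_0$, $P_1$, $P_4$ and $P_7$) are non-zero, which matches the rows of the table.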

# Pipeline Multiplication





#### Carry Propagate Adder





#### **Carry Save Adder**
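A carry-save adder reduces three operands to two, a bitwise sum vector and a shifted carry vector, without propagating carries between bit positions. The following is a minimal sketch assuming that standard definition; the function names and the simple reduction order are illustrative choices, not taken from the slides.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """Reduce three operands to a (sum, carry) pair with no carry propagation."""
    s = x ^ y ^ z                             # per-bit sum, carries ignored
    c = ((x & y) | (y & z) | (x & z)) << 1    # per-bit majority, shifted left
    return s, c


def csa_reduce_sum(operands: list[int]) -> int:
    """Add operands using carry-save steps, then one carry-propagate addition."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = carry_save_add(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, c]                # three operands become two
    return sum(ops)   # the ordinary addition stands in for the carry-propagate adder


# Partial products of 10110101 x 10010011 (the 8-bit example values).
pps = [(0b10110101 << i) if (0b10010011 >> i) & 1 else 0 for i in range(8)]
print(csa_reduce_sum(pps) == 0b10110101 * 0b10010011)   # True
```

Each carry-save step preserves the total, since x + y + z equals the sum vector plus the carry vector, so only the final two-operand addition needs a carry-propagate adder.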







MMMUT, Gorakhpur

## Floating Point Addition

- A linear pipeline with four functional stages. The inputs are two normalised floating-point numbers, $a \times 2^p$ and $b \times 2^q$.
- The output is a normalised floating-point number $d \times 2^s$, which is the sum of the two inputs.
- The hardware units other than the latches can all be implemented using combinational logic.
- If the time delay of the interface latches is 10 ns and the time delays of the four stages are 60, 50, 90 and 80 ns, respectively, then the cycle time of the pipeline can be chosen to be 100 ns: the slowest stage (90 ns) plus one latch delay (10 ns).
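The cycle-time choice in the last bullet can be reproduced with a few lines of arithmetic. This is a sketch using the delays quoted above; the variable names are illustrative.

```python
stage_delays_ns = [60, 50, 90, 80]  # delays of the four stages, from the text
latch_delay_ns = 10                 # interface latch delay, from the text

# The clock period must cover the slowest stage plus one latch.
cycle_time_ns = max(stage_delays_ns) + latch_delay_ns

# Corresponding clock rate in MHz (1000 ns per microsecond).
clock_rate_mhz = 1000 / cycle_time_ns

print(cycle_time_ns)    # 100
print(clock_rate_mhz)   # 10.0
```

Once the pipeline is full, one sum can be delivered every cycle, that is, every 100 ns under these figures.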





# **Combined Adder and Multiplier**





#### **Reservation Table for Multiplication**



| Stage / Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| A | X |   |   |   |   |   |   |
| B |   | X | X |   |   |   |   |
| C |   |   | X | X |   |   |   |
| D |   |   |   |   | X |   | X |
| E |   |   |   |   |   | X |   |
| F |   |   |   |   |   |   |   |
| G |   |   |   |   |   |   |   |
| H |   |   |   |   |   |   |   |

#### **Reservation Table for Addition**



| Stage / Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| A | Y |   |   |   |   |   |   |   |   |
| B |   |   |   |   |   |   |   |   |   |
| C |   |   |   | Y |   |   |   |   |   |
| D |   |   |   |   |   |   |   |   | Y |
| E |   |   |   |   |   |   |   | Y |   |
| F |   | Y | Y |   |   |   |   |   |   |
| G |   |   |   |   | Y |   |   |   |   |
| H |   |   |   |   |   | Y | Y |   |   |
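One use of a reservation table is to find the initiation intervals at which two operations of the same kind would need the same stage in the same cycle. The sketch below transcribes the two tables (stage name mapped to the cycles marked in its row) and computes those intervals. The term "forbidden latency" is the usual name for this in pipeline scheduling and does not appear in the tables themselves; the dictionary layout and function name are illustrative.

```python
from itertools import combinations

# Reservation tables: stage -> clock cycles in which that stage is busy.
MULTIPLY = {"A": [1], "B": [2, 3], "C": [3, 4], "D": [5, 7], "E": [6]}
ADD = {"A": [1], "C": [4], "D": [9], "E": [8], "F": [2, 3], "G": [5], "H": [6, 7]}


def forbidden_latencies(table: dict[str, list[int]]) -> set[int]:
    """Intervals at which two initiations of the same function collide in some stage."""
    forbidden = set()
    for cycles in table.values():
        # Two marks in one row, d cycles apart, mean a second operation
        # started d cycles later would need that stage at the same time.
        for t1, t2 in combinations(sorted(cycles), 2):
            forbidden.add(t2 - t1)
    return forbidden


print(forbidden_latencies(MULTIPLY))   # {1, 2}
print(forbidden_latencies(ADD))        # {1}
```

Under this reading, a second multiplication cannot start 1 or 2 cycles after the first, and a second addition cannot start 1 cycle after the first.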

#### Superscalar Architectures



- Superscalar processors attempt to issue multiple instructions per cycle
  - However, essential dependencies are specified by sequential ordering so operations must be processed in sequential order
  - This proves to be a performance bottleneck that is very expensive to overcome
- Program contains no explicit information regarding dependencies that exist between instructions
- Dependencies between instructions must be determined by the hardware
  - It is only necessary to determine dependencies with sequentially preceding instructions that have been issued but not yet completed
- Compiler may re-order instructions to facilitate the hardware's task of extracting parallelism
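The dependency check described above, comparing a new instruction only against preceding instructions that have been issued but not yet completed, can be sketched as follows. The `Instr` record (one destination register, several source registers) and the RAW/WAR/WAW labels are assumptions made for illustration; the slides do not specify an instruction format or name the dependency types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, any number of sources."""
    name: str
    dest: str
    srcs: tuple[str, ...]


def dependencies(new: Instr, in_flight: list[Instr]) -> list[tuple[str, str]]:
    """Return (kind, earlier instruction) pairs that `new` depends on."""
    found = []
    for old in in_flight:                      # issued but not yet completed
        if old.dest in new.srcs:
            found.append(("RAW", old.name))    # new reads what old writes
        if new.dest in old.srcs:
            found.append(("WAR", old.name))    # new overwrites what old reads
        if new.dest == old.dest:
            found.append(("WAW", old.name))    # both write the same register
    return found


in_flight = [
    Instr("i1", "r1", ("r2", "r3")),
    Instr("i2", "r4", ("r1", "r5")),
]
print(dependencies(Instr("i3", "r5", ("r4", "r6")), in_flight))
# [('RAW', 'i2'), ('WAR', 'i2')]
print(dependencies(Instr("i4", "r7", ("r8", "r9")), in_flight))
# []
```

An instruction with an empty result, such as `i4` here, could be issued in parallel with the in-flight instructions; hardware has to perform a comparison of this kind for every candidate in the issue window each cycle.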

#### Superscalar Architectures



| Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 |
|----|----|----|----|----|----|
| IF | ID | EX | WB |    |    |
| IF | ID | EX | WB |    |    |
| IF | ID | EX | WB |    |    |
|    | IF | ID | EX | WB |    |
|    | IF | ID | EX | WB |    |
|    | IF | ID | EX | WB |    |
|    |    | IF | ID | EX | WB |
|    |    | IF | ID | EX | WB |
|    |    | IF | ID | EX | WB |

- Superscalar:
  - Issue parallelism = IP = n inst / cycle
  - Operation latency = OP = 1 cycle
  - Peak IPC = n instr / cycle (n x speedup?)

## Superscalar Performance



Estimate the ideal execution time of N independent instructions through a k-stage pipeline.

- The time required by the scalar base machine (single pipeline) is

$$T(1,1) = k + N - 1 \quad \text{(base cycles)}$$

- The ideal execution time required by an m-issue superscalar machine is

$$T(m,1) = k + \frac{N - m}{m} \quad \text{(base cycles)}$$

- The ideal speedup of the superscalar machine over the base machine is

$$
\begin{aligned}
S(m,1) &= \frac{T(1,1)}{T(m,1)} \\
       &= \frac{k + N - 1}{k + (N - m)/m} \\
       &= \frac{m(k + N - 1)}{N + m(k - 1)}
\end{aligned}
$$

As $N \to \infty$, the speedup approaches $m$.
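The speedup expression can be evaluated numerically to see how it behaves as N grows. The sketch below implements the formulas for T(1,1), T(m,1) and S(m,1) directly; the values k = 4 and m = 3 are illustrative and do not come from the slides.

```python
def t_base(k: int, n: int) -> float:
    """T(1,1): base cycles for n instructions on the scalar base pipeline."""
    return k + n - 1


def t_superscalar(m: int, k: int, n: int) -> float:
    """T(m,1): ideal base cycles on an m-issue superscalar machine."""
    return k + (n - m) / m


def speedup(m: int, k: int, n: int) -> float:
    """S(m,1) = T(1,1) / T(m,1)."""
    return t_base(k, n) / t_superscalar(m, k, n)


k, m = 4, 3   # illustrative pipeline depth and issue width
for n in (3, 30, 300, 30000):
    # Cross-check against the closed form m(k + N - 1) / (N + m(k - 1)).
    closed_form = m * (k + n - 1) / (n + m * (k - 1))
    assert abs(speedup(m, k, n) - closed_form) < 1e-12
    print(n, round(speedup(m, k, n), 3))
```

For short instruction streams the pipeline fill time dominates and the speedup is well below m (1.5 for N = 3 with these values); for long streams it tends towards the issue width m.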

## **Other ILP Architectures**

#### Superpipelined Architecture:

- Cycle time = 1/m of the baseline cycle time
- Issue parallelism = IP = 1 inst / minor cycle
- Operation latency = OP = m minor cycles
- Peak IPC = m instr / major cycle (m x speedup?)





#### VLIW (Very Long Instruction Word) Architecture:

- Issue parallelism = IP = n inst / cycle
- Operation latency = OP = 1 cycle
- Peak IPC = n instr / cycle = 1 VLIW / cycle


#### Superpipelined-Superscalar Architecture:

- Issue parallelism = IP = n inst / minor cycle
- Operation latency = OP = m minor cycles
- Peak IPC = n x m instr / major cycle
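The peak IPC figures quoted for the different ILP architectures follow one pattern: issue width n multiplied by the number of minor cycles m per major (base) cycle. The sketch below tabulates them side by side; the helper `peak_ipc` and the sample values n = 3, m = 2 are illustrative assumptions.

```python
def peak_ipc(n: int = 1, m: int = 1) -> int:
    """Peak instructions per major (base) cycle.

    n: instructions issued per (minor) cycle
    m: minor cycles per major cycle (degree of superpipelining)
    """
    return n * m


n, m = 3, 2   # illustrative values
machines = {
    "base scalar": peak_ipc(),
    "superscalar": peak_ipc(n=n),
    "superpipelined": peak_ipc(m=m),
    "VLIW": peak_ipc(n=n),   # n operations packed into one long instruction
    "superpipelined-superscalar": peak_ipc(n=n, m=m),
}

for name, ipc in machines.items():
    print(f"{name:28s} {ipc} instr / major cycle")
```

These are peak rates only; the rate actually achieved depends on how many independent instructions the hardware or compiler can find.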



