

# 2-way Superscalar P6 Processor

Group 17: Yinwei Dai, Tianchi Zhang, Xiaoxue Zhong, Ramchandra Apte

# Overall Design






- Dispatch stage
  - Dispatch only when the RS (20 entries) and ROB (32 entries) are not full
  - 2 way associative I-cache
- Issue stage
  - FU: 8 ALU, 4 Multiplier (4 stages), 4 Load-and-store Units, 4 Branch Calculators
- Execute stage
  - LSQ
    - Out of order load and in order store
    - Forward store data to load (same size)
    - 4-entry load queue and 12-entry store queue
- Complete stage
  - Function selector (Priority selector and Rotating priority selector)
- Retire stage
  - D-cache (write through)
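The function selector in the complete stage has to choose among functional units that finish in the same cycle. The sketch below contrasts a fixed priority selector with a rotating one in a small behavioral Python model. The function names, the `grants` parameter, and the rule for advancing the starting position are assumptions for illustration, not the project's RTL.

```python
def priority_select(requests, grants=1):
    """Fixed priority: the lowest-numbered requester always wins."""
    picked = []
    for i, req in enumerate(requests):
        if req and len(picked) < grants:
            picked.append(i)
    return picked


def rotating_priority_select(requests, start, grants=1):
    """Rotating priority: search from `start` and wrap around.

    Returns (picked, next_start). The next search starts just after the
    last granted requester (an assumed rotation rule).
    """
    n = len(requests)
    picked = []
    for offset in range(n):
        i = (start + offset) % n
        if requests[i] and len(picked) < grants:
            picked.append(i)
    next_start = (picked[-1] + 1) % n if picked else start
    return picked, next_start
```

With `grants=2`, the model matches a 2-way machine that completes up to two results per cycle. The rotating version avoids starving high-numbered units when low-numbered ones request every cycle.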



### Advanced features

- 2-way superscalar
- I-cache
  - Prefetching, early branch resolution, 2-way associative
- D-cache
  - 2-way associative, dual-ported, non-blocking, victim cache
- Branch Predictor
  - Return address stack
- GUI debugger



# I-Cache, D-Cache and LSQ

#### I-Cache

- 2-way associative
- Prefetching
- Early branch resolution

#### D-Cache

- 2-way associative
- Dual-ported, non-blocking
- Victim cache
- Write-through
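To make the interaction between the 2-way sets, the victim cache, and write-through concrete, here is a minimal behavioral sketch in Python. The set count, victim size, block size, LRU replacement, and the no-write-allocate policy are all assumptions chosen for illustration; the dual-ported, non-blocking behavior is not modeled, and each block is treated as a single value.

```python
from collections import OrderedDict


class DCache:
    """2-way set-associative, write-through cache with a small victim cache."""

    def __init__(self, num_sets=16, victim_entries=4, block_bytes=8):
        self.num_sets = num_sets              # assumed geometry
        self.block_bytes = block_bytes
        self.victim_entries = victim_entries
        # Each set maps block number -> data, ordered from LRU to MRU.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.victim = OrderedDict()           # oldest victim first

    def _block(self, addr):
        return addr // self.block_bytes

    def _fill(self, block, data):
        ways = self.sets[block % self.num_sets]
        if block not in ways and len(ways) == 2:
            # Set is full: move the LRU way into the victim cache.
            old_block, old_data = ways.popitem(last=False)
            self.victim[old_block] = old_data
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)
        ways[block] = data
        ways.move_to_end(block)

    def read(self, addr, memory):
        block = self._block(addr)
        ways = self.sets[block % self.num_sets]
        if block in ways:
            ways.move_to_end(block)
            return "hit", ways[block]
        if block in self.victim:
            # Victim hit: swap the block back into its set.
            data = self.victim.pop(block)
            self._fill(block, data)
            return "victim_hit", data
        data = memory.get(block, 0)
        self._fill(block, data)
        return "miss", data

    def write(self, addr, data, memory):
        block = self._block(addr)
        memory[block] = data                  # write-through: memory always updated
        ways = self.sets[block % self.num_sets]
        if block in ways:
            ways[block] = data
            ways.move_to_end(block)
        elif block in self.victim:
            self.victim[block] = data         # keep the victim copy coherent
```

Because every store reaches memory, evicted lines are never dirty, so moving a line into or out of the victim cache needs no writeback.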

#### LSQ

- 4-entry out-of-order load queue
- 16-entry in-order store queue
- RAW forwarding
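RAW forwarding can be sketched as a search of the older stores, youngest first, when a load executes. The Python model below forwards only on an exact address and size match, following the "same size" rule stated earlier. Stalling on a partial overlap or on an older store whose address is not yet known is an assumed policy, and `StoreEntry` and `lookup_load` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StoreEntry:
    addr: Optional[int]  # None until the store address has been computed
    size: int            # access size in bytes
    data: int


def lookup_load(older_stores: List[StoreEntry], addr: int,
                size: int) -> Tuple[str, Optional[int]]:
    """Decide how a load gets its data.

    `older_stores` holds the stores older than the load, oldest first.
    Returns ("forward", data), ("stall", None), or ("memory", None).
    """
    for st in reversed(older_stores):          # youngest older store first
        if st.addr is None:
            return ("stall", None)             # unknown address: be conservative
        overlap = st.addr < addr + size and addr < st.addr + st.size
        if not overlap:
            continue
        if st.addr == addr and st.size == size:
            return ("forward", st.data)        # same address, same size
        return ("stall", None)                 # partial overlap: wait for the store
    return ("memory", None)                    # no conflict: read the D-cache
```

Searching youngest first guarantees the load sees the most recent older store to its address.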













## D-Cache

[Figure: D-cache read and write paths]















### **Branch/Address Prediction**

#### BTB

- 32-entry direct-mapped BTB with dual ports
- Stores PC-relative target addresses
- Updated in the complete stage
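A direct-mapped BTB that stores PC-relative targets can be modeled in a few lines. In this Python sketch the index bits (PC[6:2]), the tag (the remaining upper bits), and the 32-bit address width are assumptions; the slides only fix the entry count, the mapping, and the PC-relative storage.

```python
class BTB:
    """32-entry direct-mapped branch target buffer (behavioral model)."""

    ENTRIES = 32
    ADDR_MASK = 0xFFFFFFFF  # assumed 32-bit addresses

    def __init__(self):
        self.valid = [False] * self.ENTRIES
        self.tag = [0] * self.ENTRIES
        self.offset = [0] * self.ENTRIES   # target minus branch PC

    @staticmethod
    def _index(pc):
        return (pc >> 2) & 0x1F            # assumed index: PC[6:2]

    @staticmethod
    def _tag(pc):
        return pc >> 7                     # assumed tag: bits above the index

    def predict(self, pc):
        """Return the predicted target, or None on a BTB miss."""
        i = self._index(pc)
        if self.valid[i] and self.tag[i] == self._tag(pc):
            return (pc + self.offset[i]) & self.ADDR_MASK
        return None

    def update(self, pc, target):
        """Called when a taken branch completes."""
        i = self._index(pc)
        self.valid[i] = True
        self.tag[i] = self._tag(pc)
        self.offset[i] = target - pc
```

Storing the offset instead of the full target lets hardware use a narrower field when most branches jump a short distance; the Python model keeps full precision for simplicity.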

#### Branch Predictor

- 32-entry PHT of 2-bit saturating counters, initialized to 2'b01
- Local predictor indexed by PC[6:2]
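The predictor described above can be sketched as a table of 32 counters indexed by PC[6:2]. This Python model assumes the usual encoding in which the counter's upper bit is the prediction (0 and 1 predict not taken, 2 and 3 predict taken), so the initial value 2'b01 corresponds to weakly not taken. The class and method names are hypothetical.

```python
class LocalPredictor:
    """32-entry pattern history table of 2-bit saturating counters."""

    def __init__(self):
        self.pht = [0b01] * 32             # every counter starts at 2'b01

    @staticmethod
    def index(pc):
        return (pc >> 2) & 0x1F            # PC[6:2]

    def predict(self, pc):
        """True means predict taken (counter value 2 or 3)."""
        return self.pht[self.index(pc)] >= 2

    def update(self, pc, taken):
        """Move the counter toward the actual outcome, saturating at 0 and 3."""
        i = self.index(pc)
        if taken:
            self.pht[i] = min(self.pht[i] + 1, 3)
        else:
            self.pht[i] = max(self.pht[i] - 1, 0)
```

Starting at 2'b01 means a single taken outcome is enough to flip the prediction, while a branch that is never taken stays predicted not taken from the first encounter.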

#### RAS

- 8-entry return address stack
- Circular stack pointer for the latest 8 return addresses
- Add a reg\_empty in case of squash
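A behavioral sketch of the return address stack follows. The pointer wraps modulo 8, so a ninth push overwrites the oldest entry. How the design uses its empty register is not spelled out, so this model assumes a squash simply marks the stack empty; the `count` field is a modeling convenience, and the `empty` property stands in for that register.

```python
class ReturnAddressStack:
    """8-entry circular return address stack (behavioral model)."""

    SIZE = 8

    def __init__(self):
        self.entries = [0] * self.SIZE
        self.top = 0       # index of the next free slot, wraps modulo SIZE
        self.count = 0     # number of valid entries, at most SIZE

    @property
    def empty(self):
        return self.count == 0

    def push(self, return_addr):
        """On a call: record the return address, overwriting the oldest if full."""
        self.entries[self.top] = return_addr
        self.top = (self.top + 1) % self.SIZE
        self.count = min(self.count + 1, self.SIZE)

    def pop(self):
        """On a return: predicted return address, or None if the stack is empty."""
        if self.empty:
            return None
        self.top = (self.top - 1) % self.SIZE
        self.count -= 1
        return self.entries[self.top]

    def squash(self):
        """Assumed recovery policy: discard all speculative entries."""
        self.count = 0
```

When `pop` returns `None`, the front end would fall back on another source for the target, such as the BTB.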



### **Branch Predictor Performance**





# Testing and Final Results

#### Testing strategies

- Wrote thorough testbenches for individual components such as the ROB and map table.
- Ran scripts to compare the writeback and program output files.
- Adopted incremental testing.
- Passed all the public test cases in both simulation and synthesis.
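The output comparison can be sketched as a line-by-line diff that reports the first divergence, which usually points at the first instruction that committed incorrectly. This is a minimal Python sketch, not the group's actual script; the function names and file paths are hypothetical.

```python
from itertools import zip_longest


def first_mismatch(expected_lines, actual_lines):
    """Return (line_number, expected, actual) for the first difference, else None.

    Trailing whitespace is ignored; a missing line shows up as None.
    """
    pairs = zip_longest(expected_lines, actual_lines)
    for n, (exp, act) in enumerate(pairs, start=1):
        if exp is None or act is None or exp.rstrip() != act.rstrip():
            return n, exp, act
    return None


def compare_files(expected_path, actual_path):
    """Compare two output files, e.g. a reference writeback log and a new one."""
    with open(expected_path) as f_exp, open(actual_path) as f_act:
        return first_mismatch(f_exp.readlines(), f_act.readlines())
```

Running the same comparison on both the writeback log and the program output catches errors that leave final memory correct but commit a wrong intermediate value.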

#### Final results

- Synthesized Clock Period: 11ns
- Average CPI with all advanced features added: 2.23
- Total cycles / total instructions: 1.444
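Neither CPI nor clock period determines run time on its own: average time per instruction is their product. The short Python sketch below applies that to the numbers reported above and adds a break-even helper for judging whether a feature that lowers CPI is worth a longer period. The function names are hypothetical.

```python
def time_per_instruction_ns(cpi, period_ns):
    """Average time per instruction = cycles per instruction x clock period."""
    return cpi * period_ns


def break_even_period_ns(old_cpi, old_period_ns, new_cpi):
    """Longest clock period at which the lower-CPI design is still no slower."""
    return old_cpi * old_period_ns / new_cpi


PERIOD_NS = 11.0  # synthesized clock period reported above

# Roughly 15.9 ns per instruction using total cycles / total instructions.
aggregate = time_per_instruction_ns(1.444, PERIOD_NS)

# Roughly 24.5 ns per instruction using the average CPI figure.
averaged = time_per_instruction_ns(2.23, PERIOD_NS)
```

A feature is a net win only if the period it forces stays below `break_even_period_ns` for the CPI it achieves.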



### Trade-off between CPI and Period





### Final Performance



