Raptor is an out-of-order, single-issue RISC-V processor core with register renaming, a reorder buffer (ROB), reservation stations, and virtual memory support. The commit stage supports dual commit – up to 2 instructions can retire from the ROB per cycle when consecutive entries are both ready.
ISA: RV32/RV64 I + M (mul/div) + A (atomics: LR/SC, AMO) + C (compressed) + Zicsr + Zifencei + Sv32 MMU
The core supports configurable RV32 and RV64 modes via a compile-time switch (YSYX_RV64). When YSYX_RV64 is defined, XLEN=64 and all datapath, register file, AXI bus, and DPI-C interfaces widen to 64 bits. RV64 adds W-variant instructions (ADDIW, SLLIW, etc.) with 32-bit result sign-extension.
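
The 32-bit result sign-extension of the W-variant instructions can be illustrated with a small behavioural model. This is a sketch only: the helper names (`sext32_to_64`, `addiw`, `slliw`) are hypothetical and do not come from the RTL, and the immediate is assumed to be passed as an already sign-extended Python integer.

```python
MASK32 = 0xFFFF_FFFF

def sext32_to_64(value: int) -> int:
    """Sign-extend the low 32 bits of value into a 64-bit register image."""
    low = value & MASK32
    if low & 0x8000_0000:          # bit 31 set: replicate it into bits 63..32
        low |= 0xFFFF_FFFF_0000_0000
    return low

def addiw(rs1: int, imm: int) -> int:
    """ADDIW: add in 32 bits, ignore the upper half of rs1, sign-extend the result."""
    return sext32_to_64((rs1 + imm) & MASK32)

def slliw(rs1: int, shamt: int) -> int:
    """SLLIW: shift the low 32 bits left by a 5-bit amount, sign-extend the result."""
    return sext32_to_64((rs1 << (shamt & 0x1F)) & MASK32)
```

For example, `addiw(0x7FFF_FFFF, 1)` overflows into bit 31, so the 64-bit register image has all upper bits set.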
Pipeline Overview -- 3 major partitions
+- FRONTEND (in-order, speculative) ----------------------+
| IFU -- Instruction Fetch Unit (3-state FSM) |
| +- L1I (direct-mapped I-cache, TLB, Sv32 PTW, IFQ) |
| +- BPU (bimodal PHT, BTB, GHR, RSB) |
| IDU -- Instruction Decode Unit |
| +- RVC expansion + Chisel-generated decoders |
+---------------------------------------------------------+
+- BACKEND (rename -> dispatch -> execute -> commit) -----+
| RNU -- Register Naming Unit (pure rename) |
| +- RNQ (rename queue, circular, RIQ_SIZE) |
| +- Freelist (rnu_fl_if, circular FIFO, PHY_SIZE) |
| +- MapTable (rnu_mt_if, spec MAP + committed RAT) |
| PRF -- Physical Register File (top-level, 2R/2W) |
| ROU -- Re-Order Unit (ROB + dispatch) |
| +- UOQ (dispatch queue, circular, IIQ_SIZE) |
| +- ROB (rob_entry_t[], ROB_SIZE, head/tail) |
| +- Operand bypass (EXU/IOQ -> dispatch) |
| EXU -- Execution Unit (out-of-order) |
| +- RS (reservation station, RS_SIZE) |
| | +- ALU (RV32I combinational) |
| | +- MUL (RV32M, Booth's / iterative div) |
| +- IOQ (in-order queue, IOQ_SIZE, for ld/st/amo) |
| CMU -- Commit Unit (broadcast retire info, in-order) |
| CSR -- M/S-mode CSR file + trap/interrupt handling |
+---------------------------------------------------------+
+- MEMORY SUBSYSTEM --------------------------------------+
| LSU -- Load/Store Unit |
| +- STQ (store temp queue, speculative, SQ_SIZE) |
| +- SQ (store queue, committed, SQ_SIZE) |
| +- Store-to-load forwarding (CAM, virtual address) |
| L1D -- Data Cache (2-way set-assoc, banked SRAM, RMW) |
| +- Reservation register (LR/SC atomics) |
| TLB -- Translation Lookaside Buffer (reusable module) |
| +- ITLB (L1I), DTLB (L1D load), DSTLB (L1D store) |
| PTW -- Page Table Walker (Sv32, reusable module) |
| +- IPTW (L1I), DPTW (L1D) |
| BUS -- AXI4 master bridge (L1I/L1D arbitration) |
| +- CLINT (mtime timer, periodic interrupt) |
+---------------------------------------------------------+
The processor has a variable-depth pipeline with 8 logical stages in the common case (L1I hit, ALU instruction, no dependency stall). All stage boundaries use valid/ready handshaking; queues (RNQ, UOQ, RS, IOQ, ROB) decouple stages and absorb backpressure.
IF0 IF1 ID RN DI IS/EX WB CM
+-----++------++------++-------++--------++-------++-----++------+
|L1I ||IFU ||IDU ||RNU ||ROU ||EXU ||ROB ||CMU |
|SRAM ||FSM ||decode||RNQ-> ||UOQ-> ||RS->ALU ||->WB ||bcast |
|read ||latch ||latch ||rename ||dispatch||issue ||state||retire|
|(spec)| | | pipe | +ROB | | | |
+--+--++--+---++--+---++--+----++---+----++--+----++--+--++--+---+
| | | | | | | |
v v v v v v v v
1 cyc 1 cyc 1 cyc 1 cyc 1 cyc 1 cyc 0 cyc 1 cyc
| Stage | Module | Register boundary |
|---|---|---|
| IF0 | L1I | SRAM address latched |
| IF1 | IFU | inst, pc, pc_ifu registered |
| ID | IDU | uop_t, operands registered |
| RN | RNU | rn_pipe_{pr1,pr2,prd,prs,uop} registered |
| DI | ROU | UOQ consumed, ROB entry allocated |
| IS/EX | EXU | RS/IOQ entries updated |
| WB | ROB | ROB state -> ROB_WB |
| CM | CMU | Architectural state committed |
Cycle 0: IF0 -- L1I SRAM speculative read (pc_ifu -> banks)
Cycle 1: IF1 -- IFU latches instruction (L1I hit, tag match)
Cycle 2: ID -- IDU decodes, latches uop_t
Cycle 3: RN -- RNU: RNQ enqueue + rename comb logic
Cycle 4: RN -- RNU: rename pipe register latched (rn_pipe_valid)
Cycle 5: DI -- ROU: UOQ enqueue + PRF pre-read + bypass
Cycle 6: DI -- ROU: UOQ dequeue -> dispatch to EXU + ROB alloc
Cycle 7: EX -- EXU: RS issue + ALU (comb) -> writeback (ROB_EX->ROB_WB)
Cycle 8: CM -- CMU: commit (ROB_WB, head ready) -> retire + broadcast
Instruction latency: ~9 cycles (first instruction, empty pipeline)

Throughput: 1 IPC (single-issue, fully pipelined with all queues flowing)
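
The latency and throughput figures above combine into a rough best-case cycle estimate. The sketch below assumes an ideal stream: every fetch hits L1I, every instruction is a single-cycle ALU op, and no queue ever stalls. The function names are hypothetical and the numbers are only as good as those assumptions.

```python
PIPELINE_FILL_CYCLES = 9  # cycles 0..8 of the walkthrough above

def ideal_cycles(num_insts: int, fill: int = PIPELINE_FILL_CYCLES) -> int:
    """Best-case cycles to retire num_insts: fill the pipe once, then 1 per cycle."""
    if num_insts <= 0:
        return 0
    return fill + (num_insts - 1)

def ideal_ipc(num_insts: int) -> float:
    """Best-case IPC for a run of num_insts instructions (approaches 1.0)."""
    cycles = ideal_cycles(num_insts)
    return num_insts / cycles if cycles else 0.0
```

Under these assumptions a 100-instruction run takes 108 cycles, so the fill cost is only visible on short runs or after a flush.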
Loads go through the IOQ path and interact with L1D:
Cycle N+0: DI -- IOQ enqueue (load dispatched from UOQ)
Cycle N+1: IS -- IOQ head ready, drives exu_lsu.raddr -> L1D SRAM spec read
Cycle N+2: EX -- L1D tag match (LD_A), SRAM data ready -> data_hit
Cycle N+3: WB -- exu_ioq_bcast.valid -> ROB_WB, PRF written, result broadcast
Load-use latency: 3 cycles from IOQ issue to result broadcast (L1D hit)
Stores are split across execution and commit:
Execute: IOQ computes store address + data -> STQ buffer (speculative)
Commit: ROB head retires store -> SQ enqueue (committed)
Drain: SQ head -> L1D write-through -> BUS -> AXI4 (background)
Store-to-load forwarding: loads check STQ (speculative) then SQ (committed) via virtual address CAM, youngest-match-wins.
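
The lookup order described above (STQ first, then SQ, youngest match wins) can be sketched as a simple search. This is a behavioural illustration, not the RTL: `StoreEntry` and `forward_load` are hypothetical names, each queue is assumed to be ordered oldest to youngest, every entry is assumed to be older than the load, and only exact full-word address matches are modelled (no byte masks or partial overlap).

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class StoreEntry:
    vaddr: int  # virtual address used for the forwarding compare
    data: int   # store data

def forward_load(vaddr: int,
                 stq: List[StoreEntry],
                 sq: List[StoreEntry]) -> Optional[int]:
    """Return forwarded data for a load, or None if the load must go to L1D.

    Speculative stores (STQ) are younger than committed stores (SQ), so the
    STQ is searched first; within each queue the youngest entry is checked first.
    """
    for queue in (stq, sq):
        for entry in reversed(queue):   # youngest entry is at the end
            if entry.vaddr == vaddr:
                return entry.data
    return None
```

A hardware CAM performs all compares in parallel and uses a priority encoder to pick the youngest hit; the nested loop here only reproduces the selection result.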
On misprediction detected at commit (CM stage):
cmu_bcast.flush_pipe=1 -> all speculative state purged
cmu_bcast.cpc (correct PC)

+- frontend (in-order, speculative) ----------------------+
| v- [BUS <-load- AXI4] |
| IFU[l1i,tlb,ifq] =issue=> IDU |
| ^- bpu[btb(COND/DIRE/INDR/RETU),pht,ghr,rsb] |
+---------------------------------------------------------+
| ifu_idu_if
+- backend (rename -> dispatch -> execute -> commit) -----+
| IDU =issue=> RNU[rnq,freelist,maptable] |
| RNU =rename=> ROU[uoq] |
| ROU[uoq] -dispatch-> ROU[rob(rob_entry_t)] |
| =dispatch=> EXU[rs] / EXU[ioq] |
| |
| EXU[rs] =writeback=> ROU[rob] (via exu_rou_if) |
| ^- ALU :alu ops |
| +- MUL :mult/div (Booth's / iterative) |
| EXU[ioq] =writeback=> ROU[rob] (via exu_ioq_bcast_if) |
| ^- CSR :Zicsr read |
| +- AMO :atomics (LR/SC/AMO*) |
| |
| ROU[rob] =commit=> CMU & LSU[sq] & CSR |
| CMU =broadcast=> frontend: IFU[pc], BPU[update] |
+---------------------------------------------------------+
| exu_lsu_if, rou_lsu_if, lsu_l1d_if
+- memory subsystem --------------------------------------+
| LSU[stq] (speculative store temp queue) |
| LSU[sq ] (committed store queue -> L1D/BUS) |
| L1D[dtlb,dstlb,dptw] -> BUS -> AXI4 (write-through) |
| TLB (reusable: itlb, dtlb, dstlb) |
| PTW (reusable: iptw, dptw -- Sv32 walker) |
+---------------------------------------------------------+
IFU (ysyx_ifu.sv)
- 3-state FSM (IDLE -> VALID -> STALL)

BPU (ysyx_bpu.sv)
- PHT: PHT_SIZE entries (default 512), bimodal indexing
- BTB: BTB_SIZE total entries (default 128 = 64 sets x 2 ways), 7-bit tag, XOR-hash index, 1-bit LRU replacement per set
  - COND (conditional), DIRE (direct jump), INDR (indirect), RETU (return)
- RSB: RSB_SIZE entries (default 8)
- fence_time

L1I (ysyx_l1i.sv)
- 2^L1I_LEN entries (default 64, 6-bit index)
- 2^L1I_LINE_LEN words (default 4 words per line)
- ysyx_sram_1r1w instances (one per word position) with synchronous read (1-cycle latency)
- pc_ifu_d1 register tracks PC stability; sram_data_ready = (pc_ifu_d1 == pc_ifu) ensures SRAM output matches the current fetch address before asserting valid
- ysyx_tlb instance (u_itlb, default 4 entries, ASID-aware, combinational lookup)
- ysyx_ptw instance (u_iptw), shares AXI read channel with cache fill; ptw_req issued on TLB miss in IDLE
- ysyx_pkg::addr_valid()
- fence.i invalidation support
- FSM states (IDLE, PTWAIT, TRAP, RD_A, RD_0, RD_1, FINA)

IDU (ysyx_idu.sv)
- FSM (IDLE, VALID) with valid/ready handshake
- ysyx_idu_decoder_c (RVC expansion) and ysyx_idu_decoder (Chisel-generated)
- csr_addr_valid() function validates all M-mode + S-mode CSR addresses
- Early resteer (pnpc != pc + inst_len && !ben && !jen && !jren). Sends a one-shot resteer pulse + resteer_pc back to IFU via ifu_idu_if and corrects pnpc in the downstream UOP to avoid a redundant commit-time flush
- uop_t micro-op struct

RNU (ysyx_rnu.sv)
- RNQ: RIQ_SIZE entries (default 4), buffers uops from IDU
- Freelist (ysyx_rnu_freelist.sv): Circular FIFO, PHY_SIZE entries (default 64)
- MapTable (ysyx_rnu_maptable.sv): Dual-table design
  - map_snapshot / rat_snapshot for PRF debug view

PRF (ysyx_prf.sv)
- PHY_SIZE entries (default 64), 2 read ports + 2 write ports
- Top-level (ysyx.sv), shared resource
- prf_valid[] bits per physical register
- prf_transient[] bits for speculative writes
  - exu_rou_if (ALU/RS writeback)
  - exu_ioq_bcast_if (IOQ/LSU/CSR writeback)
  - exu_prf_if (operand fetch for ROU dispatch)
- rf[] (committed view via RAT), rf_map[] (speculative view via MAP)

ROU (ysyx_rou.sv)
- UOQ: IIQ_SIZE entries (default 4)
- ROB: rob_entry_t struct array, ROB_SIZE entries (default 8), head/tail pointers
  - ROB_EX (executing) -> ROB_WB (written back) -> ROB_CM (committed)
  - ROB_WB + store queue ready
  - (clint_trap)
- rou_exu_if (dispatch to EXU), rou_csr_if (CSR commit), rou_lsu_if (store commit)

EXU (ysyx_exu.sv)
- RS: RS_SIZE entries (default 4)
  - ysyx_exu_alu (combinational), ysyx_exu_mul (multi-cycle)
- IOQ: IOQ_SIZE entries (default 4), circular FIFO
- ALU (ysyx_exu_alu.sv): Purely combinational RV32I ops (ADD/SUB/SLT/XOR/OR/AND/SLL/SRL/SRA/comparisons)
- MUL (ysyx_exu_mul.sv): RV32M ops with two modes:
  - Fast mode (YSYX_M_FAST): single-cycle (for simulation)
- exu_l1d_if
- difftest_skip from LSU. Propagated via exu_rou.difftest_skip and exu_ioq_bcast.difftest_skip to ROB.

CMU (ysyx_cmu.sv)
- rpc (retire PC), cpc (correct/redirect PC), branch resolution, flush_pipe, fence_i, fence_time, time_trap
- (pmu_inst_retire)
- YSYX_DPI_C_NPC_DIFFTEST_SKIP_REF at commit time when rou_cmu.difftest_skip is set (MMIO loads/stores, CSR time reads)

CSR (ysyx_csr.sv)
- ecall, ebreak, mret, sret
- Trap delegation (medeleg / mideleg)
- MSTATUS <-> SSTATUS mirroring
- mcycle / time counters
- priv, satp_ppn, satp_asid, immu_en/dmmu_en, tvec, interrupt_en

LSU (ysyx_lsu.sv)
- STQ: SQ_SIZE entries, speculative stores before commit; stores virtual address (stq_waddr) for forwarding comparison
- SQ: SQ_SIZE entries, committed stores pending L1D write; tracks both physical address (sq_waddr for bus write-through) and virtual address (sq_vaddr for forwarding)
- (LS_S_V, LS_S_R)
- Store-to-load forwarding on virtual address (sq_vaddr), age-ordered youngest-match-wins
- difftest_skip propagated from L1D through to EXU

L1D (ysyx_l1d.sv)
- 2^L1D_LEN sets (default 32, 5-bit index)
- 2^L1D_LINE_LEN words per line (default 2), parameterized
- ysyx_sram_1r1w instances (one per word position via gen_data_bank generate loop) with synchronous read (1-cycle latency)
- Valid bits and tags held in registers (l1d_valid[set][word], l1d_tag[set][word]), enabling simultaneous load/store hit check and fast fence invalidation
- tag_hit (combinational from register tags in LD_A, guarded by sram_bypass_r to prevent write-first bypass corruption) and data_hit (SRAM data ready). Bus requests suppressed by tag_hit; load data returned by data_hit.
- Partial-store read-modify-write (l1d_rmw, rmw_merged_data). RMW only fires in IDLE when the SRAM read port is free (partial_store_rmw). Falls back to invalidation when the read port is busy.
- (L1D_LEN+L1D_LINE_LEN+OFFSET_BITS < 12)
  - RMW steers sram_raddr to store’s set when active.
- ysyx_tlb instance (u_dtlb, default 4 entries, ASID-aware, combinational lookup)
- ysyx_tlb instance (u_dstlb) for store address translation
- ysyx_ptw instance (u_dptw) for both load and store TLB misses; ptw_req issued on TLB miss in IDLE, stlb_mmu flag distinguishes store vs load PTW
- ysyx_pkg::addr_cacheable()
- fence_time invalidates entire cache
- FSM states (IDLE, PTWAIT, TRAP, LD_A, LD_D)

TLB (ysyx_tlb.sv)
- Parameterized entry count (ENTRIES, default 4), ASID-aware
- lookup_vtag + lookup_asid compared against all entries simultaneously
- fill_valid (posedge clock), only fills on miss (no duplicate entries)
- (rr_ptr)
- flush input (used by fence_time)
- u_itlb (L1I), u_dtlb (L1D load), u_dstlb (L1D store)

PTW (ysyx_ptw.sv)
- FSM (IDLE -> LVL1 -> LVL0)
  - LVL1: Reads first-level PTE using vpn[1]
  - LVL0: Reads second-level PTE using vpn[0] (if LVL1 was non-leaf)
- PTE.R || PTE.X -> superpage (LVL1 leaf) or regular page (LVL0 leaf)
- done (translation complete), fault (page fault), result_ptag/result_vtag (for TLB fill)
- (bus_arvalid/bus_araddr/bus_rdata)
- (ppn_a[2])
- u_iptw (L1I), u_dptw (L1D)

BUS (ysyx_bus.sv)
- (LD_A, LD_AS, LD_D)
- (LS_S_A, LS_S_W, LS_S_B)
- L1I / L1D / TLBI / TLBD
- l1d_load_is_mmio flag (latched on L1D load request), propagated via l1d_bus.difftest_skip to commit stage

CLINT (ysyx_clint.sv)
- mtime counter
- Periodic interrupt (mtime[18:0] == 0x40000)
- mtime based on address

Interfaces (ysyx_if.svh, ysyx_*_if.svh)

| Interface | Direction | Description |
|---|---|---|
| ifu_bpu_if | IFU -> BPU | PC for prediction, NPC + taken back |
| ifu_l1i_if | IFU -> L1I | PC fetch request, inst + trap response |
| ifu_idu_if | IFU <-> IDU | Forward: inst + PC + predicted NPC; Backward: early resteer (resteer + resteer_pc) |
| idu_rnu_if | IDU -> RNU | Decoded uop + operands + arch reg IDs |
| rnu_rou_if | RNU -> ROU | Renamed uop + physical reg mappings |
| rou_exu_if | ROU -> EXU | Dispatched uop + operands + ROB dest |
| rou_lsu_if | ROU -> LSU | Store commit (addr/data/alu) |
| rou_csr_if | ROU -> CSR | CSR write + trap/system on commit |
| rou_cmu_if | ROU -> CMU | Commit info (PC, branch, fence, flush) |
| exu_rou_if | EXU -> ROU | RS writeback (result, branch, CSR, trap) |
| exu_ioq_bcast_if | EXU -> ROU/PRF/LSU | IOQ broadcast (ld/st/CSR result) |
| exu_prf_if | ROU -> PRF | Operand read (2 ports, valid check) |
| exu_lsu_if | EXU(IOQ) -> LSU | Load request (addr/alu/atomic) |
| exu_csr_if | EXU -> CSR | CSR read port |
| exu_l1d_if | EXU -> L1D | Store MMU + SC reservation check |
| cmu_bcast_if | CMU -> all | Retire broadcast (flush, fence, branch) |
| csr_bcast_if | CSR -> all | Priv mode, SATP, MMU enable, tvec |
| lsu_l1d_if | LSU -> L1D | Load/store data path |
| l1i_bus_if | L1I -> BUS | I-cache miss read |
| l1d_bus_if | L1D -> BUS | D-cache miss read + write-through |
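
The backward (early resteer) direction of ifu_idu_if can be sketched from the condition the IDU uses, `pnpc != pc + inst_len && !ben && !jen && !jren`. The model below is illustrative only: the function name is hypothetical, `ben`/`jen`/`jren` are assumed to mean that the decoded instruction is a branch, a direct jump, or a register jump, and the resteer target is assumed to be the sequential PC.

```python
from typing import Optional

MASK32 = 0xFFFF_FFFF

def early_resteer(pc: int, pnpc: int, inst_len: int,
                  ben: bool, jen: bool, jren: bool) -> Optional[int]:
    """Return resteer_pc if decode can already prove the predicted NPC wrong.

    Only non-control-flow instructions qualify: for them the next PC must be
    pc + inst_len (2 for a compressed instruction, 4 otherwise), so any other
    predicted NPC is a misprediction that need not wait for commit.
    """
    seq_pc = (pc + inst_len) & MASK32
    if pnpc != seq_pc and not ben and not jen and not jren:
        return seq_pc   # one-shot resteer back to the IFU
    return None         # prediction is consistent, or resolved later in the backend
```

Branches and jumps return `None` here because their targets are not known at decode in this model; they are left to the commit-time flush path.
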
RNU internal interfaces (ysyx_rnu_internal_if.svh)

| Interface | Description |
|---|---|
| rnu_fl_if | Freelist: alloc_req/alloc_pr (rename), dealloc (commit), flush recovery |
| rnu_mt_if | MapTable: 3 read ports (rs1, rs2, rd_old), spec write (MAP), commit write (RAT) |
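
How the freelist and the two map tables cooperate can be sketched with a small behavioural model. Everything here is an assumption made for illustration: the `Renamer` class and its method names are hypothetical, `prs` is assumed to be the previous mapping of rd (freed when the instruction commits), x0 is assumed never to be renamed, and flush recovery is assumed to rebuild the freelist from the committed RAT.

```python
from collections import deque

class Renamer:
    def __init__(self, arch_regs: int = 32, phys_regs: int = 64):
        self.phys_regs = phys_regs
        self.rat = list(range(arch_regs))   # committed mapping (RAT)
        self.map = list(self.rat)           # speculative mapping (MAP)
        self.free = deque(range(arch_regs, phys_regs))  # circular FIFO of free regs

    def rename(self, rs1: int, rs2: int, rd: int):
        """Return (pr1, pr2, prd, prs), or None when the freelist is empty (stall)."""
        pr1, pr2 = self.map[rs1], self.map[rs2]   # read ports rs1, rs2
        prs = self.map[rd]                        # read port rd_old
        if rd == 0:
            return pr1, pr2, 0, 0                 # x0 needs no new register
        if not self.free:
            return None
        prd = self.free.popleft()                 # alloc_req / alloc_pr
        self.map[rd] = prd                        # speculative write (MAP)
        return pr1, pr2, prd, prs

    def commit(self, rd: int, prd: int, prs: int) -> None:
        if rd == 0:
            return
        self.rat[rd] = prd                        # commit write (RAT)
        self.free.append(prs)                     # dealloc the old mapping

    def flush(self) -> None:
        """Discard speculative state: MAP <- RAT, free every register not in RAT."""
        self.map = list(self.rat)
        live = set(self.rat)
        self.free = deque(p for p in range(self.phys_regs) if p not in live)
```

With the default sizes, 32 physical registers back the architectural state and the other 32 are available for in-flight destinations, so rename stalls once 32 uncommitted writers are outstanding.
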
Configuration (ysyx_config.svh)

| Parameter | Default | Description |
|---|---|---|
| YSYX_XLEN | 32 | Register width |
| YSYX_I_EXTENSION | 1 | RV32I base |
| YSYX_M_EXTENSION | 1 | M extension (mul/div) |
| YSYX_M_FAST | 1 | Single-cycle mul/div (sim mode) |
| YSYX_L1I_LINE_LEN | 2 | L1I line: 2^2 = 4 words |
| YSYX_L1I_LEN | 6 | L1I entries: 2^6 = 64 |
| YSYX_L1I_N_WAYS | 1 | L1I ways (1 = direct-mapped) |
| YSYX_PHT_SIZE | 512 | PHT entries |
| YSYX_BTB_SIZE | 128 | BTB total entries (64 sets x 2 ways) |
| YSYX_RSB_SIZE | 8 | Return stack entries |
| YSYX_RIQ_SIZE | 4 | Rename queue (RNQ) entries |
| YSYX_IIQ_SIZE | 4 | Dispatch queue (UOQ) entries |
| YSYX_ROB_SIZE | 8 | Reorder buffer entries |
| YSYX_RS_SIZE | 4 | Reservation station entries |
| YSYX_IOQ_SIZE | 4 | In-order queue entries |
| YSYX_SQ_SIZE | 8 | Store queue entries |
| YSYX_L1D_LINE_LEN | 1 | L1D line: 2^1 = 2 words per line |
| YSYX_L1D_LEN | 5 | L1D sets: 2^5 = 32 |
| YSYX_L1D_N_WAYS | 2 | L1D ways (2-way set-associative) |
| YSYX_DUAL_COMMIT | defined | Dual commit: retire up to 2 ROB entries/cycle |
| YSYX_ISSUE_WIDTH | 1 | Instructions dispatched per cycle |
| YSYX_REG_SIZE | 32 | Architectural registers |
| YSYX_PHY_SIZE | 64 | Physical registers |
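
The log2-style parameters above translate into storage sizes as follows. This is a back-of-the-envelope sketch, not a statement about the RTL: it assumes a cache word is 4 bytes and that the data array scales as sets x ways x words per line. `WORD_BYTES`, `BTB_WAYS`, and `cache_bytes` are names introduced here for illustration.

```python
# Defaults copied from the parameter table above.
CONFIG = {
    "YSYX_L1I_LINE_LEN": 2, "YSYX_L1I_LEN": 6, "YSYX_L1I_N_WAYS": 1,
    "YSYX_L1D_LINE_LEN": 1, "YSYX_L1D_LEN": 5, "YSYX_L1D_N_WAYS": 2,
    "YSYX_BTB_SIZE": 128, "YSYX_REG_SIZE": 32, "YSYX_PHY_SIZE": 64,
}

WORD_BYTES = 4  # assumption: one cache word is 32 bits
BTB_WAYS = 2    # from the table's "64 sets x 2 ways"

def cache_bytes(index_bits: int, line_len_bits: int, ways: int,
                word_bytes: int = WORD_BYTES) -> int:
    """Data capacity = 2^index_bits sets * ways * 2^line_len_bits words * word size."""
    return (1 << index_bits) * ways * (1 << line_len_bits) * word_bytes

L1I_BYTES = cache_bytes(CONFIG["YSYX_L1I_LEN"], CONFIG["YSYX_L1I_LINE_LEN"],
                        CONFIG["YSYX_L1I_N_WAYS"])
L1D_BYTES = cache_bytes(CONFIG["YSYX_L1D_LEN"], CONFIG["YSYX_L1D_LINE_LEN"],
                        CONFIG["YSYX_L1D_N_WAYS"])
BTB_SETS = CONFIG["YSYX_BTB_SIZE"] // BTB_WAYS
RENAME_REGS = CONFIG["YSYX_PHY_SIZE"] - CONFIG["YSYX_REG_SIZE"]  # spare physical regs
```

Under these assumptions the defaults give a 1 KiB L1I, a 512 B L1D, 64 BTB sets, and 32 physical registers beyond the architectural set.
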
Key types (ysyx_pkg.sv)

| Type | Description |
|---|---|
| uop_t | Micro-op: decoded instruction fields (alu, branch, mem, CSR, trap, pc, inst, imm) |
| prd_t | Physical register descriptor: op1/op2 values + pr1/pr2/prd/prs mappings |
| rob_state_t | ROB entry state enum: ROB_CM (committed), ROB_WB (written-back), ROB_EX (executing) |
| rob_entry_t | Full ROB entry: phys regs, arch rd, state, branch/jump, memory, atomics, CSR, trap, fence, difftest_skip, inst/PC |
| addr_cacheable() | Function: returns true if address is in a cacheable region (mrom, flash, psram, sdram) |
| addr_valid() | Function: returns true if address is in any valid memory-mapped region |
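
The rob_state_t lifecycle together with dual commit (up to 2 consecutive ready entries retire from the ROB head per cycle) can be sketched as below. This is a simplified model: the enum encodings and the `commit_cycle` helper are hypothetical, and the extra conditions a real commit stage checks (store queue readiness, traps, flushes) are deliberately left out.

```python
from enum import Enum
from typing import List, Tuple

class RobState(Enum):
    ROB_CM = 0   # committed (slot is free); numeric values are assumptions
    ROB_WB = 1   # written back, ready to retire
    ROB_EX = 2   # still executing

def commit_cycle(rob: List[RobState], head: int, count: int,
                 width: int = 2) -> Tuple[List[int], int, int]:
    """Retire up to `width` consecutive ROB_WB entries starting at the head.

    Returns (retired slot indices, new head, new occupancy). Retirement stops
    at the first entry that is not ROB_WB, which keeps commit in program order.
    """
    retired = []
    size = len(rob)
    while len(retired) < width and count > 0 and rob[head] is RobState.ROB_WB:
        rob[head] = RobState.ROB_CM
        retired.append(head)
        head = (head + 1) % size   # circular head pointer
        count -= 1
    return retired, head, count
```

A second entry only retires in the same cycle when the head entry does, so a long-latency instruction at the head blocks both commit slots.
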
Mermaid diagram of the microarchitecture:
flowchart TD
subgraph FE["Frontend (in-order)"]
IFU["IFU (3-state FSM)"]
L1I["L1I (direct-mapped, ITLB, PTW)"]
IFQ["IFQ (2-entry)"]
IDU["IDU (RVC + decoder)"]
BPU["BPU (PHT/BTB/GHR/RSB)"]
end
subgraph BE["Backend (rename -> dispatch -> execute -> commit)"]
subgraph RNU_TOP["RNU (pure rename)"]
RNQ["RNQ (RIQ_SIZE)"]
FL["Freelist (rnu_fl_if)"]
MT["MapTable (rnu_mt_if)"]
end
PRF["PRF (2R/2W, valid+transient)"]
ROU["ROU (UOQ + ROB: rob_entry_t[])"]
EXU["EXU"]
RS["RS (RS_SIZE)"]
IOQ["IOQ (IOQ_SIZE, in-order)"]
ALU["ALU (RV32I)"]
MUL["MUL (RV32M, Booth)"]
CMU["CMU (retire broadcast)"]
CSR["CSR (M/S-mode, Sv32)"]
end
subgraph MEM["Memory Subsystem"]
LSU["LSU (STQ + SQ, forwarding)"]
L1D["L1D (2-way, banked SRAM, RMW)"]
TLB["TLB (reusable: ITLB, DTLB, DSTLB)"]
PTW["PTW (reusable: IPTW, DPTW)"]
BUS["BUS (AXI4 bridge)"]
CLINT["CLINT (mtime, timer IRQ)"]
end
BPU -->|"npc,taken"| IFU
IFU -->|"ifu_l1i_if"| L1I
L1I --- IFQ
L1I -->|"inst"| IFU
IFU -->|"ifu_idu_if"| IDU
IDU -->|"idu_rnu_if"| RNU_TOP
RNQ --> FL & MT
RNU_TOP -->|"rnu_rou_if"| ROU
ROU -->|"exu_prf_if"| PRF
ROU -->|"rou_exu_if"| EXU
EXU --> RS & IOQ
RS --> ALU & MUL
IOQ -->|"exu_lsu_if"| LSU
IOQ -->|"exu_csr_if"| CSR
RS -->|"exu_rou_if"| ROU
IOQ -->|"exu_ioq_bcast_if"| ROU
EXU -->|"write"| PRF
ROU -->|"rou_cmu_if"| CMU
ROU -->|"rou_csr_if"| CSR
ROU -->|"rou_lsu_if"| LSU
CMU -->|"cmu_bcast_if"| IFU & BPU
LSU -->|"lsu_l1d_if"| L1D
EXU -->|"exu_l1d_if"| L1D
L1I --- TLB & PTW
L1D --- TLB & PTW
L1I -->|"l1i_bus_if"| BUS
L1D -->|"l1d_bus_if"| BUS
BUS <-->|"AXI4"| AXI["AXI4 Master"]
CLINT --- BUS
CSR -->|"csr_bcast_if"| L1I & L1D & EXU