Why is used Tomasula's Algorithm, explain it.
Why Tomasulo’s Algorithm is Used and How It Works
Tomasulo’s Algorithm is a dynamic scheduling technique used in modern processors to achieve out-of-order execution. It helps a CPU keep its pipelines busy by resolving data hazards at runtime, rather than relying solely on the compiler. This boosts Instruction-Level Parallelism (ILP) and overall performance, especially when some instructions have long execution latencies (like memory or floating-point operations).
Why Tomasulo’s Algorithm is Used
- Reduces pipeline stalls: Minimizes waiting due to data hazards (dependencies) by allowing independent instructions to proceed.
- Enables out-of-order execution: Instructions can execute as soon as their operands are ready, not strictly in program order.
- Eliminates false dependencies: Uses register renaming to remove WAR (Write After Read) and WAW (Write After Write) hazards.
- Improves hardware utilization: Keeps multiple functional units (ALU, multiplier, load/store) busy concurrently.
- Adapts at runtime: Handles unpredictable memory latencies and branch outcomes better than static scheduling.
Key Concepts and Components
- Reservation Stations (RS): Small buffers in front of each functional unit that hold pending operations and their operands (or tags if operands are not yet ready).
- Register Renaming (via Tags): Instead of directly reading from architectural registers, instructions refer to tags that represent the producer of a value. This removes WAR and WAW hazards.
- Common Data Bus (CDB): A broadcast network used by functional units to publish results. Any RS or register waiting on a matching tag captures the value instantly.
- Load/Store Buffers: Specialized reservation stations for memory operations. They compute effective addresses and manage memory dependencies.
- Register File with Status: Each register tracks either a ready value or a tag indicating which RS will produce it.
- (Modern extension) Reorder Buffer (ROB): Not in the original IBM 360/91 design but widely used today with Tomasulo-style scheduling to commit results in order and provide precise exceptions.
How Tomasulo’s Algorithm Works (Core Steps)
- Issue (Dispatch):
- The next instruction is decoded in program order.
- If a suitable reservation station is free, the instruction is placed there.
- For each source operand:
- If the operand is ready, its value is copied into the RS.
- If not ready, the RS stores the producer’s tag and waits for it.
- The destination register is renamed to the RS’s tag, removing WAR/WAW hazards.
- Execute:
- When all operands in an RS are ready (no pending tags), the functional unit begins execution.
- Different operations may have different latencies; multiple RS entries can execute in parallel on different units.
- Write Result (Broadcast on CDB):
- When execution completes, the result and the producing tag are broadcast on the CDB.
- Any RS or register waiting for that tag captures the value and marks the operand as ready.
- This wake-up mechanism often triggers new instructions to start executing immediately.
- (With ROB) Commit:
- Results are written to the architectural state in program order to ensure precise exceptions and correct recovery on mispredictions.
What Hazards Are Handled?
- RAW (Read After Write): Managed by waiting on the correct producing tag; execution starts only when true operands are ready.
- WAR and WAW: Eliminated by register renaming; different writes map to different tags, avoiding false conflicts.
- Structural: Mitigated using multiple functional units and reservation stations; issue can stall only when no RS is available.
- Memory dependencies: Load/store buffers compute addresses; loads can proceed when it’s safe (no older conflicting store).
Mini Example
; Example sequence 1: R3 <- R1 + R2 2: R5 <- R3 * R4 3: R3 <- R6 + R7 Issue: - Inst 1 reserves RS AddA, dest tag T1 -> R3=Tag(T1) - Inst 2 reserves RS MulB, needs R3 but sees Tag(T1), so waits - Inst 3 reserves RS AddC, dest tag T2 -> R3 now renamed to Tag(T2) (no WAW hazard) Execute: - AddA runs first, produces T1 - CDB broadcasts T1; MulB captures R3 value and starts - AddC can run independently; both AddC and MulB proceed without false conflicts
Advantages
- High ILP through dynamic, out-of-order execution.
- Effective hazard handling via register renaming and CDB broadcasts.
- Better performance on variable-latency operations (memory, FP).
Limitations
- Hardware complexity and area/power overhead (RS entries, CDB bandwidth).
- Original design lacks precise exceptions; modern CPUs add a Reorder Buffer.
- CDB can become a bottleneck as core width scales; designs may use multiple buses or alternative wake-up/bypass networks.
Quick Summary
Tomasulo’s Algorithm is used to dynamically schedule instructions and execute them out of order by employing reservation stations, register renaming, and a common data bus. It dramatically reduces stalls from data hazards, eliminates false dependencies, and improves pipeline utilization—making it a foundational technique in advanced CPU microarchitecture.
