2. strobe Program Memory output into instruction register (INSTR)
3. decode instruction and read Register File (RFRD)
4. strobe Register File output (OPS)
5. execution or Unified Memory access (ALU)
6. write Register File (RFWR)
Each pipeline stage is essentially an independent state machine.
Each pipeline stage receives values from the previous one, in a shift-like flow. Only the `terminal' registers hold data that is actually used; the earlier ones serve only for synchronization.
For example, this is how a particular hardware resource request flows through pipeline stages s3, s4 until it is processed in s5:
Exceptions to this `normal' flow are the stall and flush actions, which can independently stall or reset to zero (force a nop into) any stage. Another exception is when several registers in such a chain are actually used, not only the terminal one.
Apart from the (main) pipeline stages above (stages s1-s6), there are a number of pipeline stages only needed by a few instructions (such as 16 bit arithmetic, some of the skips, returns): s61, s51, s52, s53 and s54. During these pipeline stages, the main stages are stalled.
Stages s1, s2 are common to all instructions. They bring the instruction from Program Memory (PM) into the instruction register (instruction fetch stages).
During stage s3, the instruction just read from PM is decoded. That is, the following pipeline stages (s4, s5, s6, s61, s51, s52, s53, s54) are instructed what to do, by means of dedicated registers.
At a given moment, a pipe stage can perform one of the following actions:
- load: the registers in that stage are loaded with:
  - values from the previous stage, if that stage is not s1, s2, or s3;
  - particular values set by the Program Memory manager, if that stage is s1 or s2;
  - values from the instruction decoder, if that stage is s3.
- flush (execute nop): all registers in that stage are reset to zero.
- stall: all registers in that stage are kept unchanged.
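The three per-stage actions, and the shift-like flow between stages, can be sketched in a small Python model. This is a hypothetical illustration, not part of the pAVR VHDL sources; the register name `rf_wr_rq` is made up for the example.

```python
# Hypothetical model of one pipe stage's clock-edge behavior:
# load (shift values from the previous stage), flush (force a nop),
# or stall (hold current values).

def step_stage(regs, prev_regs, stall, flush):
    """Return the stage's registers after one clock edge."""
    if stall:                      # stall: keep all registers unchanged
        return dict(regs)
    if flush:                      # flush: reset all registers to zero
        return {name: 0 for name in regs}
    return dict(prev_regs)         # load: shift-like flow from previous stage

# A hardware resource request flowing through stages s3 -> s4 -> s5:
s3 = {"rf_wr_rq": 1}               # request issued by the decoder into s3
s4 = {"rf_wr_rq": 0}
s5 = {"rf_wr_rq": 0}

# clock 1: s4 takes s3's values, s5 takes s4's old values
s5, s4 = step_stage(s5, s4, False, False), step_stage(s4, s3, False, False)
# clock 2: the request reaches the terminal register in s5
s5, s4 = step_stage(s5, s4, False, False), step_stage(s4, {"rf_wr_rq": 0}, False, False)
```

Only the terminal register (here, the copy in s5) is actually consumed; the earlier copies just keep the request synchronized with its instruction.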
Hardware resource managing
Pipeline stages can request access to hardware resources. Access to hardware resources is done via dedicated hardware resource managers (one manager per hardware resource; one VHDL process per manager).
Main hardware resources:
Register File (RF)
Bypass Unit (BPU)
Bypass Register 0 (Bypass chain 0) (BPR0)
Bypass Register 1 (Bypass chain 1) (BPR1)
Bypass Register 2 (Bypass chain 2) (BPR2)
IO File (IOF)
Status Register (SREG)
Stack Pointer (SP)
Arithmetic and Logic Unit (ALU)
Data Access Control Unit (DACU)
Program Memory (PM)
Stall and Flush Unit (SFU)
Only one such request can be received by a given resource at a time. If multiple accesses to a resource are requested simultaneously, its access manager asserts an error during simulation; this indicates a design bug.
The pipeline is built so that each resource is normally accessed during a fixed pipeline stage:
RF is normally read in s3 and written in s6.
IOF is normally read/written in s5.
DM is normally read/written in s5.
DACU is normally read/written in s5.
PM is normally read in s1.
However, exceptions can occur. For example, LPM instructions need to read PM in stage s5. Also, loads/stores must be able to read/write RF in stage s5.
Exceptions are handled at the hardware resource managers level.
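The one-request-per-resource discipline can be modeled with a small Python sketch (hypothetical, not from the pAVR sources): each manager services at most one request per clock and treats simultaneous requests as a design bug, like the simulation-time assertion in the VHDL managers.

```python
# Hypothetical model of a hardware resource manager (one per resource):
# at most one request is serviced per clock; two simultaneous requests
# signal a design bug, as the VHDL manager's simulation assert would.

class ResourceManager:
    def __init__(self, name):
        self.name = name
        self.requests = []          # (stage, operation) pairs this clock

    def request(self, stage, operation):
        self.requests.append((stage, operation))

    def clock(self):
        """Service the pending request; error out on a conflict."""
        if len(self.requests) > 1:
            raise AssertionError(
                f"{self.name}: simultaneous requests {self.requests} (design bug)")
        served = self.requests[0] if self.requests else None
        self.requests = []
        return served

rf = ResourceManager("RF")
rf.request("s3", "read")            # the normal RF read, in s3
served = rf.clock()                 # -> ("s3", "read")
```

Exceptional accesses (e.g. a load reading RF in s5) go through the same `request` path; avoiding the two-requests-per-clock case is exactly the SFU's job, described next.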
Stall and Flush Unit
Because of the exceptions above, different pipeline stages can compete for a given hardware resource, so a mechanism must be provided to handle hardware resource conflicts. The SFU implements this function by arbitrating hardware resource requests: it stalls some instructions (some pipeline stages) while allowing others to execute.
Stall handling is done through two sets of signals:
SFU requests (SFU inputs)
SFU control signals (SFU outputs)
There is one pair of stall-flush control signals for each of the pipeline stages s1, s2, s3, s4, s5, s6.
Each instruction has an embedded stall behavior, that is decoded by the instruction decoder.
Various instructions in the pipeline, in different execution phases, access the SFU in exactly the same way they access any other hardware resource: through SFU access requests.
The SFU prioritizes stall/flush/branch/skip/nop requests and postpones younger instructions until older instructions free the hardware resources (including the SFU itself). The postponing is done through the stall-flush controls, on a per-pipeline-stage basis.
The `SFU rule': when a resource conflict appears, the older instruction wins.
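The SFU rule can be illustrated with a hypothetical Python sketch (not from the pAVR sources): since a deeper pipe stage holds an older instruction, arbitration reduces to picking the deepest requesting stage and stalling the rest.

```python
# Hypothetical sketch of the `SFU rule': on a resource conflict, the
# older instruction wins. Deeper stages hold older instructions, so the
# deepest requesting stage wins and the younger requesters are stalled.

STAGE_ORDER = ["s1", "s2", "s3", "s4", "s5", "s6"]   # s6 holds the oldest

def arbitrate(requesting_stages):
    """Return (winner, stages_to_stall) for one contested resource."""
    winner = max(requesting_stages, key=STAGE_ORDER.index)
    stalled = [s for s in requesting_stages if s != winner]
    return winner, stalled

# An LPM in s5 competing with the normal instruction fetch in s1 for
# Program Memory access:
winner, stalled = arbitrate(["s1", "s5"])
# winner == "s5" (older instruction), stalled == ["s1"] (postponed)
```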
Some instructions need to insert a nop before the instruction `wave front', for freeing hardware resources normally used by younger instructions. For example, loads must `steal' the Register File read port 1 from younger instructions.
Nops are inserted by stalling certain pipe stages and flushing other, or possibly the same, stages.
Other instructions need a nop after the instruction wave front, for the previous instruction to complete and free hardware resources. For example, stores must wait a clock, until the previous instruction frees the Register File write port.
The two situations differ considerably from the point of view of the control structure. In the second situation, the instruction must stall and flush itself, which raises additional problems. These problems are solved by introducing a dedicated noping state machine in stage s4, whose only purpose is to insert at most one nop after any instruction. On the other hand, introducing nops before an instruction wave front is straightforward, as any instruction can stall/flush younger instructions by means of SFU requests.
The shadow protocol
Let's consider the following situation: a load instruction reads the Data Memory during pipe stage s5. Suppose that on the next clock, an older instruction stalls s6, the stage during which the Data Memory output was supposed to be written into the Register File. After another clock the stall is removed and s6 requests to write the Register File, but the Data Memory output has changed during the stall: corrupted data would be written into the Register File. With the shadow protocol, the Data Memory output is saved during the stall. When the stall is removed, the Register File is written with the saved data.
If a pipe stage is not permitted to place hardware resource requests, then every memory-like entity in that stage is marked as having its output `shadowed', and its associated shadow register is written with the corresponding data output. Otherwise, it is marked `unshadowed'.
As long as a memory-like entity is marked `shadowed', it is read (by whatever entity needs it) from its associated shadow register, rather than directly from its data output.
To keep shadowing correct across multiple successive stalls, a memory-like entity is shadowed only if it is not already shadowed.
Basically, the condition that shadows a memory-like entity's output is `hardware resource requests are disabled during that stage'. However, there are exceptions. For example, LPM family instructions steal Program Memory access by stalling the instruction that would normally be fetched at that time. By stalling, hardware resource requests become disabled in that pipe stage; still, LPM family instructions must be able to access the Program Memory output directly. Here, the PM must not be shadowed, even though during its pipe stage s2 (during which PM is normally accessed) all hardware requests are disabled by default.
Fortunately, there are only a few such exceptions (holes in the shadow protocol). Overall, the shadow protocol is still a good idea, as it permits natural, automatic handling of a number of registers placed in delicate areas.
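The protocol above can be sketched in Python (a hypothetical model, not from the pAVR sources; the Data Memory values are made up):

```python
# Hypothetical model of the shadow protocol: while requests are disabled
# in a stage, the memory-like output is latched into a shadow register
# and readers see the latched value instead of the live, possibly
# changed, data output.

class ShadowedOutput:
    def __init__(self):
        self.shadowed = False
        self.shadow = None

    def clock(self, live_output, requests_disabled):
        if requests_disabled:
            if not self.shadowed:        # shadow only on the first stall
                self.shadow = live_output
                self.shadowed = True
        else:
            self.shadowed = False        # unshadow when requests resume

    def read(self, live_output):
        return self.shadow if self.shadowed else live_output

dm = ShadowedOutput()
dm.clock(live_output=0xAB, requests_disabled=True)   # stall begins: latch 0xAB
dm.clock(live_output=0xFF, requests_disabled=True)   # DM output changes meanwhile
saved = dm.read(live_output=0xFF)                    # still reads the saved 0xAB
```

The `shadow only if not already shadowed' check in `clock` is what makes multiple successive stalls safe: the value latched at the first stall survives until requests are re-enabled.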
Branch prediction with hashed branch prediction table and 2 bit predictor.
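As a hypothetical sketch of this proposal (none of this exists in pAVR; the table size and hash are assumptions), a hashed table of 2-bit saturating counters could look like:

```python
# Hypothetical sketch of the proposed predictor: a table indexed by a
# hash of the branch address, each entry a 2 bit saturating counter
# (0..1 predict not taken, 2..3 predict taken).

TABLE_SIZE = 64                    # assumed size, a power of two

table = [1] * TABLE_SIZE           # initialized to `weakly not taken'

def index(pc):
    return pc % TABLE_SIZE         # trivial hash: low address bits

def predict(pc):
    return table[index(pc)] >= 2   # True = predict taken

def update(pc, taken):
    i = index(pc)
    if taken:
        table[i] = min(3, table[i] + 1)
    else:
        table[i] = max(0, table[i] - 1)

# A loop branch becomes predicted taken after two taken outcomes, and a
# single not-taken outcome does not flip it back (2 bit hysteresis):
update(0x123, True)
update(0x123, True)
```

The 2-bit counter's hysteresis is the point: a loop-closing branch mispredicts only once per loop exit instead of twice.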
Super-RAM interfacing to Program Memory.
A super-RAM is a classic RAM with two supplemental lines: a mem_rq input and a mem_ack output. The device that reads/writes the super-RAM knows it may place an access request only when the memory signals it is ready via mem_ack. Only then may it place an access request via mem_rq.
A super-RAM is a super-class of the classic RAM. That is, a super-RAM becomes a classic RAM if it ignores mem_rq and continuously holds mem_ack at 1.
The super-RAM protocol is flexible enough that, as an extreme example, it can interface the Program Memory to the controller serially (!). That is, about 2-3 wires instead of 38, without modifying anything in the controller. Of course, this would come with a very large speed penalty, but it allows choosing the most advantageous compromise between the number of wires and speed. The only thing to be done is to add a serial-to-parallel converter that complies with the super-RAM protocol.
After pAVR is made super-RAM compatible, it can still run from a regular RAM, as it does now, by ignoring the two extra lines. Thus, nothing is removed, only added. No speed penalty should be paid.
A simple way to add the super-RAM interface is to force nops into the pipeline as long as the serial-to-parallel converter works on an instruction word.
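The handshake can be modeled with a hypothetical Python sketch (not from the pAVR sources; the wait-state count and addresses are made up). A classic RAM is the degenerate case with zero wait states, i.e. mem_ack permanently high:

```python
# Hypothetical model of the super-RAM handshake: the master may raise
# mem_rq only while the memory reports mem_ack = 1. A classic RAM is
# the special case that keeps mem_ack at 1 forever (wait_states = 0).

class SuperRam:
    def __init__(self, wait_states=0):
        self.wait_states = wait_states   # 0 -> behaves as a classic RAM
        self.busy_for = 0                # clocks until mem_ack rises again
        self.mem = {}

    @property
    def mem_ack(self):
        return self.busy_for == 0

    def access(self, addr, data=None):
        """Master side: place mem_rq; legal only while mem_ack is high."""
        assert self.mem_ack, "mem_rq raised while mem_ack low"
        self.busy_for = self.wait_states
        if data is None:
            return self.mem.get(addr, 0)
        self.mem[addr] = data

    def clock(self):
        if self.busy_for:
            self.busy_for -= 1

slow_pm = SuperRam(wait_states=3)        # e.g. PM behind a serial link
slow_pm.access(0x10, data=0x9508)        # write while mem_ack is high
while not slow_pm.mem_ack:               # pipeline forces nops meanwhile
    slow_pm.clock()
word = slow_pm.access(0x10)              # read back once mem_ack rises
```

The wait loop is exactly where the nops of the previous paragraph would be forced into the pipeline.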
Modify stall handling so that no nops are required after the instruction wave front; the instructions could take care of themselves. The idea is that a request to a hardware resource that is already in use by an older instruction could automatically generate a stall. This would:
- generally simplify instruction handling;
- make average instruction execution slightly faster.
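A hypothetical sketch of this proposed mechanism (nothing like this exists in pAVR yet; the stage/resource names are illustrative):

```python
# Hypothetical sketch of the proposed change: a request to a resource
# still held by an older instruction automatically stalls the requester,
# with no explicit after-wave-front nop scheduling.

def resolve(requests, busy):
    """requests: {stage: resource wanted}; busy: {resource: owner stage}.
    Returns the set of stages to stall automatically this clock."""
    stalled = set()
    for stage, resource in requests.items():
        owner = busy.get(resource)
        if owner is not None and owner != stage:
            stalled.add(stage)       # postpone the younger requester
    return stalled

# A store in s4 wants the RF write port still held by an instruction
# in s6: the store simply waits a clock, automatically.
stalls = resolve({"s4": "RF_write"}, {"RF_write": "s6"})
```

This would replace the dedicated noping state machine in s4 with a uniform busy-resource check.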
Generated on Tue Dec 31 20:26:30 2002 for Pipelined AVR microcontroller by