
RE: [fpu] FPU operations





>From: owner-fpu@opencores.org [mailto:owner-fpu@opencores.org]On Behalf
>Of Damjan Lampret
>
>
...
>Well not really necessary. I took your fmul and replaced mul24 with
>carefully balanced 3-stage pipelined multiplier and if I remember
>correctly it operated at 400 MHz (complete fmul). I don't know about
>fasu but it looks to me that a 2-stage pipeline would be enough to be

Cool!

>balanced with fmul. CPU doesn't really care since every insn can take
>different number of clk cycles to complete.

That's interesting. How do you handle inter-instruction dependencies?
For example, if I do the following:

mul fp1,fp2		! floating point multiply
cmp fp1, fp4	! compare two floating point registers
jle xyz		! jump if less than or equal

Before you can execute the compare, which I would guess would execute
in one cycle, you have to wait for the multiply to complete, which might
take 4 cycles. And you cannot continue execution because you do not
know which way the jump will go (OK, you can do branch prediction etc.,
but that's beyond my point).

I don't quite understand how you can have variable instruction
execution times and maintain 1 CPI. Perhaps you could elaborate?

Looking at your emails that keep on coming :*), I think we miscommunicated
here?

> > Now, more complex operations (for example square root) might be
> > variable-length execution units that might signal to the CPU when they
> > are done. Damjan, can this be done with OR1K? The CPU would have to
> > monitor the completion and write the result back out of order. If
> > another operation
>
>If number of clk cycles per insn is fixed then it is no problem. If it
>depends upon operands then FPU should provide output stall to stall
>CPU.

Not operands, but the operation type: add, sub, mul, ..., sin, cos, tan.
Sorry I was not clear here. What I meant is that basic execution units
like add, sub, mul and div might all take 4 cycles. More complex execution
units like sqrt, sin, ln etc. might take different execution times,
depending on the actual operation. So, for example, a sqrt might take
8 cycles, an ln might take 12 cycles, etc.

You don't want to stall the CPU at all if possible. I believe
this should be handled by the compiler's instruction scheduling.
Unoptimized code will force the CPU to stall.

The stall logic, IMHO, should be in the CPU. I believe the CPU should
know how long every instruction takes and handle the stalling
internally.

> > port for the FPU register file (which I assume is kept inside the
>
>FPU has FP register file which is not the same as integer register
>file. This is so that superscalar will be easier to do and also we
>don't need so many write ports on current register files.

Right, I understand this, but my point was that the register file should
be located inside the CPU, not the FPU. I thought it would be easier to
handle it that way. I'm thinking the CPU will load the FP registers from
memory at some point as well, and might move registers around etc. I
have always seen the FPU as a pure math block, the way Weitek did it in the '80s.

>--damjan

BTW: Please keep in mind that I'm still working on the add/sub and mul
cores. It looks like the normalization, special cases and rounding will
add a lot of logic... The ones that are posted are only "outlines":
they can't handle all corner cases and do not do rounding (I think I
documented that as well).

rudi

