Internal review of Zvfhmin/Zvfh extensions before public review


Krste Asanovic
 

These RISC-V vector extensions to handle IEEE FP16 were defined prior to
ratification of the vector specification, but were left out of RVV 1.0
as they were not to be included in the base V extension.

The extension text is available on the following branch in the vector
spec repo:

https://github.com/riscv/riscv-v-spec/tree/zvfh

and the relevant page of the PDF is extracted and attached below.

Note that these do not add any new vector FP opcodes; they just define
the obvious behavior for the case where SEW=16.
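
For example (a minimal sketch; the registers and vector configuration
are arbitrary), the existing encodings suffice once Zvfh gives SEW=16
its FP16 meaning:

    # With Zvfh, the existing opcodes operate on IEEE FP16 when SEW=16.
    vsetvli  t0, a0, e16, m1, ta, ma  # select SEW=16
    vle16.v  v8, (a1)                 # load a vector of FP16 values
    vle16.v  v9, (a2)
    vfadd.vv v8, v8, v9               # FP16 add, same opcode as FP32/FP64
    vse16.v  v8, (a1)                 # store the FP16 results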

We would like to push this out for public review in preparation for
ratification, but would first like to allow some time for any internal
comments.

Krste


Guy Lemieux
 

bfloat16 is arguably just as important as the binary16 encoding.

This makes me wonder:

a) assigning the term "half precision" to the binary16 encoding gives
it a "here first" monopoly and leaves bfloat16 out in the cold

b) can we reconcile the naming conventions for {binary16, bfloat16}
encodings? e.g., perhaps Zvf16bfloat and Zvf16ieee

c) can the extension proposal also include similar text for bfloat16?

d) perhaps we should look beyond the two 16b encodings and allow for
other more general variations, e.g., Zvf5e10m vs Zvf8e7m

e) eventually implementations will want to support multiple encodings
concurrently... but we have no road map. We should consider what such
a road map looks like before approving this extension.

Both (d) and (e) will likely require CSR changes. We probably don't
want to have a proliferation of data conversion instructions (cf SIMD
instruction extensions that change the execution width), so a generic
approach is needed. Following the SEW, LMUL, and VL examples, we would
need some CSR bits that define the current encoding (cf SEW?) and the
target encoding (cf EEW?), or rely upon immediate ISA bits to define
the to/from encodings.
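
To sketch what I mean (purely hypothetical; every name below is
invented for illustration, and nothing like it exists in any spec):

    # HYPOTHETICAL, for illustration only; all names are invented.
    # Two small CSR fields name the encodings, by analogy to SEW/EEW:
    #   vfenc  - encoding of the current SEW (binary16, bfloat16, ...)
    #   vfdenc - target encoding for format-conversion instructions
    csrwi       vfenc, 1   # invented: SEW=16 now means bfloat16
    vfcvt.enc.v v8, v4     # invented: convert from vfenc to vfdenc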

Whatever we do for 16 bits, we will have to do it again for 8-bit floats.
It is already looking like they need multiple encodings (e.g., 4- vs
5-bit exponents) to avoid the range limitations inherent in any 8b
format.

Thanks,
Guy


Krste Asanovic
 

On Mon, 3 Oct 2022 13:43:53 -0700, Guy Lemieux <guy.lemieux@...> said:
| bfloat16 is arguably just as important as the binary16 encoding.
| This makes me wonder:

| a) assigning the term "half precision" to the binary16 encoding gives
| it a "here first" monopoly and leaves bfloat16 out in the cold

Well, it was "here first", a long time before BF16, and is part of
IEEE FP standards, and is still being used.

There is a proposal in progress to add scalar/vector BF16, so it is
an exaggeration to say we are leaving it "out in the cold".

| b) can we reconcile the naming conventions for {binary16, bfloat16}
| encodings? e.g., perhaps Zvf16bfloat and Zvf16ieee

The ratified scalar FP16 extensions set a precedent here for FP16
naming.

| c) can the extension proposal also include similar text for bfloat16?

There is a separate proposal for BF16 in progress. The two types do
not have symmetric uses. FP16 work tends to do more compute in FP16
format, whereas BF16 is predominantly used to mul-acc into FP32.
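
To illustrate the asymmetry with existing mnemonics (a sketch; under
Zvfh these operate on IEEE FP16, and a BF16 variant would need its own
definition):

    vsetvli    t0, a0, e16, m1, ta, ma
    # FP16 style: compute stays in the 16-bit format
    vfmacc.vv  v8, v16, v24    # FP16 * FP16 + FP16 -> FP16
    # BF16 style: multiply 16-bit inputs, accumulate into FP32
    vfwmacc.vv v8, v16, v24    # 16b * 16b + FP32 -> FP32 (widening)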

| d) perhaps we should look beyond the two 16b encodings and allow for
| other more general variations, e.g., Zvf5e10m vs Zvf8e7m

Sure, as demand arises.

| e) eventually implementations will want to support multiple encodings
| concurrently... but we have no road map. We should consider what such
| a road map looks like before approving this extension.

No options are foreclosed by the FP16 proposal.

It does not consume additional opcodes or CSR bits. The existing
opcodes were designed for IEEE FP and the SEW=16 encoding was
understood to represent this FP16 extension.

| Both (d) and (e) will likely require CSR changes.

Sure, but everything going forward has to be backward-compatible.

| We probably don't
| want to have a proliferation of data conversion instructions (cf SIMD
| instruction extensions that change the execution width), so a generic
| approach is needed.

However it is encoded (instructions or CSR bits), the different
conversion operations will be there in the hardware. So far, we have
decided not to support more than 2x difference in FP formats in a
single conversion instruction.
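
Concretely (a sketch using the existing widening convert), anything
beyond 2x is a chain of steps, e.g. FP16 to FP64:

    vsetvli t0, a0, e16, m1, ta, ma
    vfwcvt.f.f.v v8, v4     # FP16 -> FP32 (SEW to 2*SEW), exact
    vsetvli t0, a0, e32, m2, ta, ma
    vfwcvt.f.f.v v12, v8    # FP32 -> FP64, exact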

| Following the SEW, LMUL, and VL examples, we would
| need some CSR bits that define the current encoding (cf SEW?) and the
| target encoding (cf EEW?), or rely upon immediate ISA bits to define
| the to/from encodings.

Sure, we need some bits somewhere that tell the machine what to do.

| Whatever we do for 16 bits, we will have to do it again for 8-bit floats.
| It is already looking like they need multiple encodings (e.g., 4- vs
| 5-bit exponents) to avoid the range limitations inherent in any 8b
| format.

Different data types have different use cases, and hence different
sets of sensible operations to support; i.e., it doesn't make sense to
orthogonalize datatype and arithmetic operation encodings if the
cross-product is a very sparse space.

FP8 is very different from FP16 and BF16. For one thing, a single
mul-add operation in an application will want different formats on its
two multiply inputs. The number of bits needed to specify the
different FP8 formats is greater than the number needed to specify all
the previous FP formats combined.

Given that we can add more FP datatypes later and that support for
them will take time to architect and finalize, I can't see the
rationale for blocking the FP16 proposal which just fills out the
existing ratified schema for IEEE and which is already supported by
multiple vendors.

Krste




Roger Ferrer Ibanez
 

Hi Krste,

On 3/10/22 23:15, Krste Asanovic wrote:
| There is a proposal in progress to add scalar/vector BF16, so it is
| an exaggeration to say we are leaving it "out in the cold".

Is this already in some public document? Do you have a link handy?

Thanks a lot.

Kind regards,
Roger

--
Roger Ferrer Ibáñez - roger.ferrer@...
Barcelona Supercomputing Center - Centro Nacional de Supercomputación


Krste Asanovic
 

https://github.com/riscv/riscv-bfloat16



Guy Lemieux
 

This proposal is only for BF16. Should we look forward to proposals
from DLFloat, TF16, and others? Won't this overly encumber the ISA,
like the ever-growing list of SIMD instructions? It seems like an OK
way to add compatibility, but not a way to make CPUs more cost
effective, higher performance, or lower power.

The whole purpose of BF16 is to enable cheaper, more efficient
hardware dedicated to BF16. Instead, this proposal bolts on BF16 to
the **entire** Zfh and Zfhmin specs, which rely upon the entire *F*
and possibly entire *V* spec. Maybe we should call it the Bloat16
format instead?

A BF16 extension should be stand-alone and only depend on a greatly
reduced subset of the F or V extensions (not the full ones). Hence,
this requires a lot more careful planning. Any "format conversion"
instructions should not be used only for BF16, but be overloaded to
allow the user to choose among the other formats (perhaps by
designating them in a CSR, perhaps using an instruction similar to
vsetvl/vsetvli). This avoids an explosion of opcodes.

I know it's easy to criticize without a concrete alternative proposal
-- I'm sorry that I do not offer that.

Guy



Roger Ferrer Ibanez
 

Hi,

(Apologies for the repeated message; I am using the web interface due
to some problems with our email, and I think I just replied to Guy
directly.)

We're looking at something like this so we can support two formats
for SEW=16 (IEEE binary16 and bfloat16) along with a couple of formats
for SEW=8 (1-4-3 and 1-5-2). We added an extra bit in vtype to select
an alternate format, and we found we needed another bit for
conversions between formats. Our goal was to avoid adding new
instructions, hence our reliance on extending vtype. Those bits are
only relevant for FP operations and FP conversions.
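
Concretely, the sort of thing we have in mind (a sketch only; the
field names here are invented):

    # HYPOTHETICAL vtype extension sketch; field names are invented.
    #   vtype.altfmt  = 0: SEW=16 is binary16, SEW=8 is 1-5-2
    #   vtype.altfmt  = 1: SEW=16 is bfloat16, SEW=8 is 1-4-3
    #   vtype.altfmt2 : selects the alternate format for the other
    #                   operand of a format-conversion instruction
    vsetvli t0, a0, e16, m1, ta, ma   # existing instruction; the extra
                                      # bits would ride along in vtype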

IMO, while we can make this work in our case, I don't think it scales
if there is a proliferation of floating-point formats where one SEW
has more than two formats. Also, our approach might be too general, as
it potentially extends to all operations. We'll refine it eventually.

Kind regards,
Roger


Kiran Gunnam
 

Hi all,
There is a very active effort from IEEE. I hope someone from the
RISC-V group can join IEEE P3109 as a stakeholder. I see Ken being
active in both RISC-V and P3109.

I am pleased to report that the IEEE P3109 Standards Working Group on
Arithmetic Formats for Machine Learning is now getting a lot of
traction and is considering proposals from several groups: Nvidia,
Intel, and ARM have a joint proposal; Graphcore, AMD, and Qualcomm
have a joint proposal; Tesla has a proposal; and D-matrix has a
proposal.

It would be great to have representation from more companies and
stakeholders to drive consensus in arriving at a new standard.

To express interest in joining this working group, please submit your
contact information and follow the instructions at
https://forms.gle/77JX8gqMy3kk62bg6


Regards,
Kiran Gunnam
Chair, IEEE P3109 WG




Krste Asanovic
 

On Wed, 5 Oct 2022 17:42:44 -0700, Guy Lemieux <guy.lemieux@...> said:
| This proposal is only for BF16. Should we look forward to proposals
| from DLFloat, TF16, and others?

Yes.

| Won't this overly encumber the ISA,
| like the ever-growing list of SIMD instructions?

Yes.

I don't know of any way to provide operations on new datatypes without
adding new instructions (whether encoded in the instruction stream or
via additional control bits in CSRs).

| It seems like an OK
| way to add compatibility, but not a way to make CPUs more cost
| effective, higher performance, or lower power.

How do you propose to do any of these without adding new instructions
for the datatypes you need to support?

| The whole purpose of BF16 is to enable cheaper, more efficient
| hardware dedicated to BF16. Instead, this proposal bolts on BF16 to
| the **entire** Zfh and Zfhmin specs, which rely upon the entire *F*
| and possibly entire *V* spec. Maybe we should call it the Bloat16
| format instead?

BF16 is used with FP32. Do you know of any real BF16 application that
doesn't?

| A BF16 extension should be stand-alone and only depend on a greatly
| reduced subset of the F or V extensions (not the full ones). Hence,
| this requires a lot more careful planning.

BF16 was designed to work with FP32. The "careful planning" was
already done by many others.
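
For the record (a minimal scalar sketch; a0 is assumed to hold the
address of a bfloat16 value): BF16 is the top half of an FP32 bit
pattern, so widening to FP32 is just a load and a shift, with no
rounding:

    lhu     t0, 0(a0)     # load the 16-bit bfloat16 pattern
    slli    t0, t0, 16    # move it into the upper half of a 32-bit word
    fmv.w.x ft0, t0       # reinterpret the bits as FP32 -- exact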

| Any "format conversion"
| instructions should not be used for only BF16, but be overloaded and
| allow the user to choose among the other formats (perhaps by
| designating them in a CSR, perhaps using an instruction similar to
| vsetvl/vsetvli). This avoids an explosion in opcodes.

No - this just encodes new instructions using CSR bits rather than new
opcodes. CSR state adds more hardware complexity than opcodes.

| I know it's easy to criticize without a concrete alternative proposal

Indeed.

| -- I'm sorry that I do not offer that.

| Guy

Krste





Krste Asanovic
 

We can delay, but not prevent, the need to have greater than 32b
instructions. All general-purpose architectures with a large set of
vector instructions have >32b instructions (although ARM and POWER try
to pretend they're just two 32b instructions). CSR-state-encoding
approaches are annoying to both hardware implementers and software
toolchains.

I'll repeat that most alternative datatypes need a much smaller set of
supported operators than IEEE FP, and some have unique operators not
present in IEEE, so it does not make sense to encode the full
cross-product of datatypes and operators for all datatypes.

Krste
