Re: [RISC-V] [tech] [RISC-V] [tech-*] STRATEGIC FEATURE COEXISTENCE was:([tech-fast-int] usefulness of PUSHINT/POPINT from [tech-code-size])

Guy Lemieux
 

Thanks Tim, I think that sums it up nicely.

I just wanted to put a pointer out to the original post that I made on isa-dev regarding opcode sharing / management:

It was very much motivated by the need to share the custom opcode blocks. Now it seems the official extensions will run out of opcode space too, so either we go 64b or we learn to share :-)

There may be some useful tidbits in the dialog that followed from that point forward.

Incidentally, for things like the vector extension, I think 64b encodings are inevitable. The opcode space is needed, particularly to allow orthogonality in all of the different operand modes. Plus, there is little harm — the majority of code is still scalar, and every 64b vector instruction replaces a handful of scalar operations so it is actually more compact than the scalar ISA already.

Ciao,
Guy



On Mon, Nov 2, 2020 at 5:58 PM Tim Vogt <tim.vogt@...> wrote:

This is definitely a subject ripe for broader discussion. In the FPGA special interest group we've been working on a framework for combining separately authored custom function units for the last year or so, and one of the key debate topics has been how to manage the instruction encoding space. As I read through this email thread, I can safely say that we've discussed all of the same topics and all of the same approaches listed here at one point or another. For our particular scope, we've converged on a bank-switching style implementation using a CSR, but that's a very tactical approach for our specific goals.
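
For concreteness, a minimal C sketch of what such CSR-driven bank switching could look like in a decoder; the CSR name (xsel), the handler names, and the use of the custom-0 major opcode are all invented for illustration, not anything the SIG has specified:

```c
/* Minimal sketch of CSR-selected opcode bank switching, assuming an
 * invented bank-select CSR ("xsel") and two separately authored custom
 * function units sharing the custom-0 major opcode. Nothing here is
 * from a ratified RISC-V spec. */
#include <stdint.h>

#define MAJOR_OPCODE_MASK 0x7Fu
#define OPCODE_CUSTOM0    0x0Bu   /* custom-0 major opcode, bits [6:0] */

typedef void (*cfu_handler_t)(uint32_t insn);

static void cfu_a_execute(uint32_t insn) { (void)insn; /* vendor A's unit */ }
static void cfu_b_execute(uint32_t insn) { (void)insn; /* vendor B's unit */ }

/* One entry per bank; the active bank is chosen by the xsel CSR. */
static const cfu_handler_t bank[] = { cfu_a_execute, cfu_b_execute };
static uint32_t xsel;   /* invented CSR: 0 selects vendor A, 1 vendor B */

void decode_custom(uint32_t insn) {
    if ((insn & MAJOR_OPCODE_MASK) == OPCODE_CUSTOM0) {
        /* Identical encodings mean different things in different banks. */
        bank[xsel & 1u](insn);
    }
    /* ... otherwise fall through to standard decode ... */
}
```
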
Since this is a common problem that keeps coming up in multiple contexts, it's worth exploring whether we can come up with a more strategic framework and get consensus within the community, to avoid fragmentation and constant reinvention of the wheel.



--Tim Vogt

--
Tim Vogt
Distinguished Engineer
Lattice Semiconductor
Office: 503-268-8068







From: tech@... <tech@...> On Behalf Of David Horner via lists.riscv.org
Sent: Monday, October 26, 2020 9:16 AM
To: Allen Baum <allen.baum@...>
Cc: Robert Chyla <Robert.Chyla@...>; tech-code-size@...; Greg Favor <gfavor@...>; Tariq Kurd <tariq.kurd@...>; Bill Huffman <huffman@...>; tech-fast-int@...; jeremy.bennett@...; tech@...; tech-vector-ext@...; tech-p-ext@...; tech-bitmanip@... <tech-bitmanip@...>
Subject: Re: [RISC-V] [tech] [RISC-V] [tech-*] STRATEGIC FEATURE COEXISTENCE was:([tech-fast-int] usefulness of PUSHINT/POPINT from [tech-code-size])







 





On 2020-10-26 12:48 a.m., Allen Baum wrote:
Are we talking about something that is effectively bank switching the opcodes here?
That is one approach. It is a consideration that has recently been mentioned wrt misa.
Something like that was proposed very early on, using a CSR (like MISA maybe - the details are lost to me) to enable and disable them.

I remember Luke Kenneth Casson Leighton <lkcl@...> was in on the discussions.

A variety of CSR and related approaches were considered.

The specific issue that brought it up is if someone developed a custom extension, did a lot of work, and then some other extension came along that stepped on those opcodes - and the implementation wanted to use both of them.
The author thought it was pretty obvious this kind of thing was going to happen. I don't think that exact scenario will, but running out of standard 32b opcodes with ratified extensions might.

Exactly.

Also in lkcl's case the "vectorization" extension of all opcodes is [was proposed] of this nature.

We're already starting to look at the long tail - extensions that are specialized to specific workloads, but highly advantageous to them.
I'm guessing we will get to the point that these extensions will not have to coexist inside a single app, though - so a bank switching approach (non-user mode at the least, perhaps not within an app at all) could potentially work, but it sounds ugly to make the tools understand the configuration.
Agreed. Thus the uni-opcode approach, which can coexist with any of these strategies but provides a framework to manage them (just as ASCII and EBCDIC extensions are comparably managed).


On Sat, Oct 24, 2020 at 8:23 AM ds2horner <ds2horner@...> wrote:

These are all important considerations.

However, what they have in common when considering Allen's question:

This discussion is bringing up an issue that needs wider discussion about extensions in general.

is that they are all tactical considerations in the context of our current framework of instruction space allocation. What we will find is that these trade-off considerations will reinforce the dilemma that Allen raises: how do we manage these conflicting "necessities/requirements" of different target environments?

I have hinted at it already: we need not only tactical analysis of feature tradeoff in different domains but a strategic approach to support them.

The concern is nothing new. It has been raised, if only obliquely, many times prior on the [google] groups.riscv.org (-dev, -sw especially) and lists.riscv.org TG threads.
The vector group, especially, has grappled with it in the context of current V encoding being a subset of a [hypothetical] 64 bit encoding.

Specific proposals have been mentioned, but there was then no political will or, perhaps more fairly, no common perception that there was a compelling reason to work systematically to address it. The [then] common thinking was that 48 and 64 bit instruction spaces would be used as 32 and 16 bit are exhausted, and everyone would be happy. Well, that naive hope has not materialized, and many are envisioning clashes that will hurt RISC-V progress, either fragmentation or stagnation, as tactical approaches and considerations are implemented or debated.

Previously two major strategic approaches were hinted at, even if they were not outright proposed.

Hardware Support - this has been explicitly proposed in many flavours, and is currently in the minds of many.
     The idea is a mode shift analogous to
        arm's transition to thumb and back and
        intel's myriad of operating modes: real, protected, virtual, long and their disparate instantiations.
     I agree that implementations should have considerable freedom on how to provide hardware select-able functionality.
     However, a proposed framework to support that should be provided by riscv.org.
     Recent discussion and document tweaks about misa (Machine ISA register) suggest that this mechanism,
          though valuable, is inadequate as robust support for the explosion of features.
     An expanded framework will be necessary, perhaps along the lines of the two level performance counters definitions.
     The conflict with overlapping mappings of groups of instructions to the same encoding space is not easily addressed by this mechanism.

which leads us to

Software Support:

The Generalized Proposal:
All future extensions are not mapped to a fixed exclusive universal encoding,
but rather to appropriately sized [based initially off 32 isize] minor [22-bit], major [25-bit] or quadrant [30-bit] encoding,
that is allocated to the appropriate instruction encoding at link/load time to match the hardware [or hardware dynamic configuration, as above].
This handles the green field encodings.
Each feature could have a default minor/major/quadrant encoding designation.

Brown field can also be managed: simply, if the related co-encoded feature is present; with more complexity, and perhaps extensive opcode mapping, if blended into other features' encodings.

An implementation method would be to have a fixed exclusive universal prefix for each feature.
Each instruction would then be emitted by the compiler as a [prefix]:[instruction with default encoding] pair.
If the initial prefixes are also nops [most of which are currently designated as hints],
then the code would be executable on machines that use the default mapping
without any link/load intervention [at lower performance granted].

This approach is backward compatible for the other established extensions:
most notably F which consumes 7 major opcode spaces [and *only* 5 with Zfinx (Zifloat?)] and
then AMO which also consumes the majority of a major opcode.

This strategic change has a number of immediate and significant benefits:
  1) custom reserved major op codes effectively become unreserved as "standard" extensions can be mapped there also.
       The custom reserved nature will then only be the designated default allocation, "standard extensions" will not default to them.
  2) as mentioned above, if the prefix is a nop then link/load support is not needed for direct execution support [only efficiency].
  3) the transition to higher bit encodings can be simplified, as easily as the compiler emitting the designated prefix for that feature that encodes for 64 bit instructions.
So, two assigned fixed exclusive encodings per feature may be useful, one a 64bit encoding and one a nop.

I do not intend to stifle any of the tactical discussions of co-usefulness of features and profile domains.
These are meaningful and useful considerations.

Rather, I hope that by having a framework for coexistence of features, those discussions can proceed in a more guided way,
so that discoveries can be incorporated into a framework-centric corpus of understanding of trade-offs and cooperative benefits of features/profiles.


On 2020-10-23 11:45 p.m., Robert Chyla wrote:
I agree with Greg's statements. For me 'code-size' is very important for small, deeply embedded/IoT-class systems.

Work in other groups (bitmanip) will also benefit code size, but it is not the primary focus, I think, as these will also improve code speed.

Linux-like big processors usually have DDR RAM and code size is 'unlimited'.
It should not hurt, as code-size advances will benefit such big systems, but we should not forget about 'cheap to implement' = 'logic size' factors.

IMO 'code-size' and 'code-speed' will be pulling the same rug (ISA space) in opposite directions. We must balance it properly - having the rug in one piece is IMO most important.

Regards,
/Robert

On 10/23/2020 5:11 PM, Greg Favor wrote:
It seems like a TG, probably through the statement of its charter, should clearly define what types or classes of systems it is focused on optimizing for (if there is an intended focus) and what types or classes of systems it does not expect to be appropriate for. More concretely, it seems like there are a few TGs developing extensions oriented towards embedded real-time systems and/or low-cost embedded systems. These are extensions that would probably not be implemented in full-blown Linux-class systems. Those extensions don't need to worry about being acceptable to such system designs, and can optimize for the requirements and constraints of their target class(es) of systems.

Unless I'm mistaken, this TG falls in that category. And as long as the charter captures this, then the extension it produces can be properly evaluated against its goals and target system applications (and not be judged wrt other classes of systems). And key trade-off considerations - like certain types of implementation approaches being acceptable or unacceptable for the target system applications - should probably be agreed upon early on.

Greg

On Fri, Oct 23, 2020 at 4:34 PM Allen Baum <allen.baum@...> wrote:
This discussion is bringing up an issue that needs wider discussion about extensions in general.
RISC-V is intended to be an architecture that supports an extremely wide range of implementations,
ranging from very low gate count microcontrollers to high-end superscalar out-of-order processors.
How do we evaluate an extension that only makes sense at one end or the other?

I don't expect vector or even hypervisor extensions in a low gate count system.
There are other extensions that are primarily aimed at specific application areas as well.

A micro-sequenced (e.g. push/pop[int]) op might be fairly trivial to implement in a low gate count system
(e.g. without VM, but with PMPs) and have significant savings in code size, power, and increased performance.
They may have none of those, or less significant, advantages in a high-end implementation --
and/or might be very difficult or costly to implement in them (e.g. for TLB miss, interrupt, & exception handling).
(I am not claiming that these specific ops do, but just pretend there is one like that.)

Should we avoid defining instructions and extensions like that?
Or just allow that some extensions just don't make sense for some class of implementation?
Are there guidelines we can put in place to help make those decisions?
This same (not precisely the same) kind of issue is rearing its head in other places, e.g. range-based CMOs.


--
Regards,
Robert Chyla, Lead Engineer, Debug and Trace Probe Software
IAR Systems
1211 Flynn Rd, Unit 104
Camarillo, CA 93012 USA
Office: +1 805 383 3682 x104
E-mail: Robert.Chyla@... Website: www.iar.com

Re: Sparse Matrix-Vector Multiply (again) and Bit-Vector Compression

lidawei14@...
 

Hi all,

If I use EDIV to compute SpMV y = A * x in size r * c blocks, I might have to load size r of y and size c of x; these are shorter than VL = r * c. Is there an efficient way to do this with current support?

If I would like to use a mask to compress, for VL = 16, I can store a 16-bit value to memory and load it into a GPR; how can I then transform it into a vector mask?

Thank you,
Dawei


Re: Sparse Matrix-Vector Multiply (again) and Bit-Vector Compression

Krste Asanovic
 



On Oct 27, 2020, at 2:12 AM, lidawei14 via lists.riscv.org <lidawei14=huawei.com@...> wrote:

Hi all,

Thank you Nick for the reply.
    
I saw EDIV will not be included in v1.0, any issues to be resolved? Can I have a look at the discussion page on EDIV?

The email list has the archived discussion, which should include discussion of EDIV.

The main reason not to include it in v1.0 is that it has many details to work through, and resolving these would delay v1.0 by many months.

Krste


Thanks a lot,
Dawei


Re: Sparse Matrix-Vector Multiply (again) and Bit-Vector Compression

lidawei14@...
 

Hi all,

Thank you Nick for the reply.
    
I saw EDIV will not be included in v1.0, any issues to be resolved? Can I have a look at the discussion page on EDIV?

Thanks a lot,
Dawei


Re: [RISC-V] [tech-*] STRATEGIC FEATURE COEXISTENCE was:([tech-fast-int] usefulness of PUSHINT/POPINT from [tech-code-size])

David Horner
 


On 2020-10-26 12:48 a.m., Allen Baum wrote:
Are we talking about something that is effectively bank switching the opcodes here?
That is one approach. It is a consideration that has recently been mentioned wrt misa.
Something like that was proposed very early on, using a CSR (like MISA maybe - the details are lost to me) to enable and disable them.

I remember Luke Kenneth Casson Leighton <lkcl@...> was in on the discussions.

A variety of CSR and related approaches were considered.

The specific issue that brought it up is if someone developed a custom extension, did a lot of work, and then some other extension came along that stepped on those opcodes - and the implementation wanted to use both of them.
The author thought it was pretty obvious this kind of thing was going to happen. I don't think that exact scenario will, but running out of standard 32b opcodes with ratified extensions might.

Exactly.

Also in lkcl's case the "vectorization" extension of all opcodes is [was proposed] of this nature.

We're already starting to look at the long tail - extensions that are specialized to specific workloads, but highly advantageous to them.
I'm guessing we will get to the point that these extensions will not have to coexist inside a single app, though - so a bank switching approach (non-user mode at the least, perhaps not within an app at all) could potentially work, but it sounds ugly to make the tools understand the configuration.
Agreed. Thus the uni-opcode approach, which can coexist with any of these strategies but provides a framework to manage them (just as ASCII and EBCDIC extensions are comparably managed).


On Sat, Oct 24, 2020 at 8:23 AM ds2horner <ds2horner@...> wrote:

These are all important considerations.

However, what they have in common when considering Allen's question:

This discussion is bringing up an issue that needs wider discussion about extensions in general.

is that they are all tactical considerations in the context of our current framework of instruction space allocation. What we will find is that these trade-off considerations will reinforce the dilemma that Allen raises: how do we manage these conflicting "necessities/requirements" of different target environments?

I have hinted at it already: we need not only tactical analysis of feature tradeoff in different domains but a strategic approach to support them.

The concern is nothing new. It has been raised, if only obliquely, many times prior on the [google] groups.riscv.org (-dev, -sw especially) and  lists.riscv.org TG threads.
The vector group, especially,  has grappled with it in the context of current V encoding being a subset of a [hypothetical] 64 bit encoding.

Specific proposals have been mentioned, but there was then no political will or, perhaps more fairly, no common perception that there was a compelling reason to work systematically to address it. The [then] common thinking was that 48 and 64 bit instruction spaces would be used as 32 and 16 bit are exhausted, and everyone would be happy. Well, that naive hope has not materialized, and many are envisioning clashes that will hurt RISC-V progress, either fragmentation or stagnation, as tactical approaches and considerations are implemented or debated.

Previously two major strategic approaches were hinted at, even if they were not outright proposed.

Hardware Support - this has been explicitly proposed in many flavours, and is currently in the minds of many.
     The idea is a mode shift analogous to
        arm's transition to thumb and back and
        intel's myriad of operating modes: real, protected, virtual, long and their disparate instantiations.
     I agree that implementations should have considerable freedom on how to provide hardware select-able functionality.
     However, a proposed framework to support that should be provided by riscv.org.
     Recent discussion and document tweaks about misa (Machine ISA register) suggest that this mechanism,
          though valuable, is inadequate as robust support for the explosion of features.
     An expanded framework will be necessary, perhaps along the lines of the two level performance counters definitions.
     The conflict with overlapping mappings of groups of instructions to the same encoding space is not easily addressed by this mechanism.

which leads us to

Software Support:

The Generalized Proposal:
All future extensions are not mapped to a fixed exclusive universal encoding,
but rather to appropriately sized [based initially off 32 isize] minor [22-bit], major[25-bit] or quadrant [30-bit] encoding,
that is allocated to the appropriate instruction encoding at link/load time to match the hardware [or hardware dynamic configuration, as above].
This handles the green field encodings.
Each feature could have a default minor/major/quadrant encoding designation.

Brown field can also be managed: simply, if the related co-encoded feature is present; with more complexity, and perhaps extensive opcode mapping, if blended into other features' encodings.

An implementation method would be to have a fixed exclusive universal prefix for each feature.
Each instruction would then be emitted by the compiler as a [prefix]:[instruction with default encoding] pair.
If the initial prefixes are also nops [most of which are currently designated as hints],
then the code would be executable on machines that use the default mapping
without any link/load intervention [at lower performance granted].
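
For illustration, a minimal C sketch of the link/load-time rewrite described above; the prefix value, opcode numbers, and function names are invented assumptions, since the proposal only fixes the general shape ([prefix]:[instruction] pairs with a per-feature exclusive prefix):

```c
/* Hypothetical sketch of the link/load-time rewrite described above.
 * FEATURE_X_PREFIX and the opcode values are invented for illustration;
 * the proposal only requires a fixed exclusive prefix per feature and a
 * default encoding. */
#include <stddef.h>
#include <stdint.h>

#define FEATURE_X_PREFIX  0x0000100Bu  /* invented: encodes as a hint/nop */
#define MAJOR_OPCODE_MASK 0x7Fu

/* Move the instruction from feature X's default major opcode to the major
 * opcode this particular hardware implements it under. */
static uint32_t remap_for_hw(uint32_t insn, uint32_t hw_major) {
    return (insn & ~MAJOR_OPCODE_MASK) | (hw_major & MAJOR_OPCODE_MASK);
}

void rewrite_text_segment(uint32_t *text, size_t n_words, uint32_t hw_major) {
    for (size_t i = 0; i + 1 < n_words; i++) {
        if (text[i] == FEATURE_X_PREFIX) {
            /* Because the prefix executes as a nop, unmodified code still
             * runs on hardware using the default mapping; this pass is
             * only needed when the hardware maps the feature elsewhere. */
            text[i + 1] = remap_for_hw(text[i + 1], hw_major);
        }
    }
}
```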

This approach is backward compatible for the other established extensions:
most notably F which consumes 7 major opcode spaces [and *only* 5 with Zfinx (Zifloat?)] and
then AMO which also consumes the majority of a major opcode.

This strategic change has a number of immediate and significant benefits:
  1) custom reserved major op codes effectively become unreserved as "standard" extensions can be mapped there also.
       The custom reserved nature will then only be the designated default allocation, "standard extensions" will not default to them.
  2) as mentioned above, if the prefix is a nop then link/load support is not needed for direct execution support [only efficiency].
  3) the transition to higher bit encodings can be simplified, as easily as the compiler emitting the designated prefix for that feature that encodes for 64 bit instructions.
So, two assigned fixed exclusive encodings per feature may be useful, one a 64bit encoding and one a nop.

I do not intend to stifle any of the tactical discussions of co-usefulness of features and profile domains.
These are meaningful and useful considerations.

Rather, I hope that by having a framework for coexistence of features, those discussions can proceed in a more guided way,
so that discoveries can be incorporated into a framework-centric corpus of understanding of trade-offs and cooperative benefits of features/profiles.


On 2020-10-23 11:45 p.m., Robert Chyla wrote:
I agree with Greg's statements. For me 'code-size' is very important for small, deeply embedded/IoT-class systems.

Work in other groups (bitmanip) will also benefit code size, but it is not the primary focus, I think, as these will also improve code speed.

Linux-like big processors usually have DDR RAM and code size is 'unlimited'.
It should not hurt, as code-size advances will benefit such big systems, but we should not forget about 'cheap to implement' = 'logic size' factors.

IMO 'code-size' and 'code-speed' will be pulling the same rug (ISA space) in opposite directions. We must balance it properly - having the rug in one piece is IMO most important.

Regards,
/Robert

On 10/23/2020 5:11 PM, Greg Favor wrote:
It seems like a TG, probably through the statement of its charter, should clearly define what types or classes of systems it is focused on optimizing for (if there is an intended focus) and what types or classes of systems it does not expect to be appropriate for. More concretely, it seems like there are a few TGs developing extensions oriented towards embedded real-time systems and/or low-cost embedded systems. These are extensions that would probably not be implemented in full-blown Linux-class systems. Those extensions don't need to worry about being acceptable to such system designs, and can optimize for the requirements and constraints of their target class(es) of systems.

Unless I'm mistaken, this TG falls in that category.  And as long as the charter captures this, then the extension it produces can be properly evaluated against its goals and target system applications (and not be judged wrt other classes of systems).  And key trade-off considerations - like certain types of implementation approaches being acceptable or unacceptable for the target system applications - should probably be agreed upon early on.

Greg

On Fri, Oct 23, 2020 at 4:34 PM Allen Baum <allen.baum@...> wrote:
This discussion is bringing up an issue that needs wider discussion about extensions in general.
RISC-V is intended to be an architecture that supports an extremely wide range of implementations,
ranging from very low gate count microcontrollers to high-end superscalar out-of-order processors.
How do we evaluate an extension that only makes sense at one end or the other?

I don't expect vector or even hypervisor extensions in a low gate count system.
There are other extensions that are primarily aimed at specific application areas as well.

A micro-sequenced (e.g. push/pop[int]) op might be fairly trivial to implement in a low gate count system
(e.g. without VM, but with PMPs) and have significant savings in code size, power, and increased performance.
They may have none of those, or less significant, advantages in a high-end implementation --
and/or might be very difficult or costly to implement in them (e.g. for TLB miss, interrupt, & exception handling).
(I am not claiming that these specific ops do, but just pretend there is one like that.)

Should we avoid defining instructions and extensions like that?
Or just allow that some extensions just don't make sense for some class of implementation?
Are there guidelines we can put in place to help make those decisions?
This same (not precisely the same) kind of issue is rearing its head in other places, e.g. range-based CMOs.


--
Regards,
Robert Chyla, Lead Engineer, Debug and Trace Probe Software
IAR Systems
1211 Flynn Rd, Unit 104
Camarillo, CA  93012 USA
Office: +1 805 383 3682 x104
E-mail: Robert.Chyla@... Website: www.iar.com


Re: [RISC-V] [tech-*] STRATEGIC FEATURE COEXISTENCE was:([tech-fast-int] usefulness of PUSHINT/POPINT from [tech-code-size])

David Horner
 

My take: This is analogous to ASCII (7-bit) and EBCDIC (8-bit) both competing in the 8-bit byte-addressable character space.

Initial solutions were fragmentation, then code pages (select-able character sets).

Eventually Unicode became the standard that allowed universal adoption and definition, and downsizing to domains that needed a specific 8-bit byte encoding/mapping: printers, ttys, etc.

Just as the C extension initially relied on the linker/loader to do the code replacement, so too would the uni-opcode approach initially rely on the linker/loader.

As the tool chain becomes more sophisticated, self-conforming software-to-hardware configuration (à la Linux) will be developed.


On 2020-10-24 11:23 a.m., ds2horner wrote:

These are all important considerations.

However, what they have in common when considering Allen's question:

This discussion is bringing up an issue that needs wider discussion about extensions in general.

is that they are all tactical considerations in the context of our current framework of instruction space allocation. What we will find is that these trade-off considerations will reinforce the dilemma that Allen raises: how do we manage these conflicting "necessities/requirements" of different target environments?

I have hinted at it already: we need not only tactical analysis of feature tradeoff in different domains but a strategic approach to support them.

The concern is nothing new. It has been raised, if only obliquely, many times prior on the [google] groups.riscv.org (-dev, -sw especially) and  lists.riscv.org TG threads.
The vector group, especially,  has grappled with it in the context of current V encoding being a subset of a [hypothetical] 64 bit encoding.

Specific proposals have been mentioned, but there was then no political will or, perhaps more fairly, no common perception that there was a compelling reason to work systematically to address it. The [then] common thinking was that 48 and 64 bit instruction spaces would be used as 32 and 16 bit are exhausted, and everyone would be happy. Well, that naive hope has not materialized, and many are envisioning clashes that will hurt RISC-V progress, either fragmentation or stagnation, as tactical approaches and considerations are implemented or debated.

Previously two major strategic approaches were hinted at, even if they were not outright proposed.

Hardware Support - this has been explicitly proposed in many flavours, and is currently in the minds of many.
     The idea is a mode shift analogous to
        arm's transition to thumb and back and
        intel's myriad of operating modes: real, protected, virtual, long and their disparate instantiations.
     I agree that implementations should have considerable freedom on how to provide hardware select-able functionality.
     However, a proposed framework to support that should be provided by riscv.org.
     Recent discussion and document tweaks about misa (Machine ISA register) suggest that this mechanism,
          though valuable, is inadequate as robust support for the explosion of features.
     An expanded framework will be necessary, perhaps along the lines of the two level performance counters definitions.
     The conflict with overlapping mappings of groups of instructions to the same encoding space is not easily addressed by this mechanism.

which leads us to

Software Support:

The Generalized Proposal:
All future extensions are not mapped to a fixed exclusive universal encoding,
but rather to appropriately sized [based initially off 32 isize] minor [22-bit], major[25-bit] or quadrant [30-bit] encoding,
that is allocated to the appropriate instruction encoding at link/load time to match the hardware [or hardware dynamic configuration, as above].
This handles the green field encodings.
Each feature could have a default minor/major/quadrant encoding designation.

Brown field can also be managed: simply, if the related co-encoded feature is present; with more complexity, and perhaps extensive opcode mapping, if blended into other features' encodings.

An implementation method would be to have a fixed exclusive universal prefix for each feature.
Each instruction would then be emitted by the compiler as a [prefix]:[instruction with default encoding] pair.
If the initial prefixes are also nops [most of which are currently designated as hints],
then the code would be executable on machines that use the default mapping
without any link/load intervention [at lower performance granted].

This approach is backward compatible for the other established extensions:
most notably F which consumes 7 major opcode spaces [and *only* 5 with Zfinx (Zifloat?)] and
then AMO which also consumes the majority of a major opcode.

This strategic change has a number of immediate and significant benefits:
  1) custom reserved major op codes effectively become unreserved as "standard" extensions can be mapped there also.
       The custom reserved nature will then only be the designated default allocation, "standard extensions" will not default to them.
  2) as mentioned above, if the prefix is a nop then link/load support is not needed for direct execution support [only efficiency].
  3) the transition to higher bit encodings can be simplified, as easily as the compiler emitting the designated prefix for that feature that encodes for 64 bit instructions.
So, two assigned fixed exclusive encodings per feature may be useful, one a 64bit encoding and one a nop.

I do not intend to stifle any of the tactical discussions of co-usefulness of features and profile domains.
These are meaningful and useful considerations.

Rather, I hope that by having a framework for coexistence of features, those discussions can proceed in a more guided way,
so that discoveries can be incorporated into a framework-centric corpus of understanding of trade-offs and cooperative benefits of features/profiles.


On 2020-10-23 11:45 p.m., Robert Chyla wrote:
I agree with Greg's statements. For me 'code-size' is very important for small, deeply embedded/IoT-class systems.

Work in other groups (bitmanip) will also benefit code size, but it is not the primary focus, I think, as these will also improve code speed.

Linux-like big processors usually have DDR RAM and code size is 'unlimited'.
It should not hurt, as code-size advances will benefit such big systems, but we should not forget about 'cheap to implement' = 'logic size' factors.

IMO 'code-size' and 'code-speed' will be pulling the same rug (ISA space) in opposite directions. We must balance it properly - having the rug in one piece is IMO most important.

Regards,
/Robert

On 10/23/2020 5:11 PM, Greg Favor wrote:
It seems like a TG, probably through the statement of its charter, should clearly define what types or classes of systems it is focused on optimizing for (if there is an intended focus) and what types or classes of systems it does not expect to be appropriate for. More concretely, it seems like there are a few TGs developing extensions oriented towards embedded real-time systems and/or low-cost embedded systems. These are extensions that would probably not be implemented in full-blown Linux-class systems. Those extensions don't need to worry about being acceptable to such system designs, and can optimize for the requirements and constraints of their target class(es) of systems.

Unless I'm mistaken, this TG falls in that category.  And as long as the charter captures this, then the extension it produces can be properly evaluated against its goals and target system applications (and not be judged wrt other classes of systems).  And key trade-off considerations - like certain types of implementation approaches being acceptable or unacceptable for the target system applications - should probably be agreed upon early on.

Greg

On Fri, Oct 23, 2020 at 4:34 PM Allen Baum <allen.baum@...> wrote:
This discussion is bringing up an issue that needs wider discussion about extensions in general.
RISC-V is intended to be an architecture that supports an extremely wide range of implementations,
ranging from very low gate count microcontrollers to high-end superscalar out-of-order processors.
How do we evaluate an extension that only makes sense at one end or the other?

I don't expect vector or even hypervisor extensions in a low gate count system.
There are other extensions that are primarily aimed at specific application areas as well.

A micro-sequenced (e.g. push/pop[int]) op might be fairly trivial to implement in a low gate count system
(e.g. without VM, but with PMPs) and have significant savings in code size, power, and increased performance.
They may have none of those, or less significant, advantages in a high-end implementation --
and/or might be very difficult or costly to implement in them (e.g. for TLB miss, interrupt, & exception handling).
(I am not claiming that these specific ops do, but just pretend there is one like that.)

Should we avoid defining instructions and extensions like that?
Or just allow that some extensions just don't make sense for some class of implementation?
Are there guidelines we can put in place to help make those decisions?
This same (not precisely the same) kind of issue is rearing its head in other places, e.g. range-based CMOs.


--
Regards,
Robert Chyla, Lead Engineer, Debug and Trace Probe Software
IAR Systems
1211 Flynn Rd, Unit 104
Camarillo, CA  93012 USA
Office: +1 805 383 3682 x104
E-mail: Robert.Chyla@... Website: www.iar.com


Re: [RISC-V] [tech-*] STRATEGIC FEATURE COEXISTENCE was:([tech-fast-int] usefulness of PUSHINT/POPINT from [tech-code-size])

Allen Baum
 

Are we talking about something that is effectively bank switching the opcodes here?
Something like that was proposed very early on, using a CSR (like MISA maybe - the details are lost to me) to enable and disable them.
The specific issue that brought it up is if someone developed a custom extension, did a lot of work, and then some other extension came along that stepped on those opcodes - and the implementation wanted to use both of them.
The author thought it was pretty obvious this kind of thing was going to happen. I don't think that exact scenario will, but running out of standard 32b opcodes with ratified extensions might. 
We're already starting to look at the long tail - extensions that are specialized to specific workloads, but highly advantageous to them.
I'm guessing we will get to the point that these extensions will not have to coexist inside a single app, though - so a bank switching approach (non-user mode at the least, perhaps not within an app at all) could potentially work, but it sounds ugly to make the tools understand the configuration.


On Sat, Oct 24, 2020 at 8:23 AM ds2horner <ds2horner@...> wrote:

These are all important considerations.

However, what they have in common when considering Allen's question:

This discussion is bringing up an issue that needs wider discussion about extensions in general.

is that they are all tactical considerations in the context of our current framework of instruction space allocation. What we will find is that these trade-off considerations will reinforce the dilemma that Allen raises: how do we manage these conflicting "necessities/requirements" of different target environments?

I have hinted at it already: we need not only tactical analysis of feature tradeoff in different domains but a strategic approach to support them.

The concern is nothing new. It has been raised, if only obliquely, many times prior on the [google] groups.riscv.org (-dev, -sw especially) and  lists.riscv.org TG threads.
The vector group, especially,  has grappled with it in the context of current V encoding being a subset of a [hypothetical] 64 bit encoding.

Specific proposals have been mentioned, but there was then no political will or, perhaps more fairly, no common perception that there was a compelling reason to work systematically to address it. The [then] common thinking was that 48 and 64 bit instruction spaces would be used as 32 and 16 bit are exhausted, and everyone would be happy. Well, that naive hope has not materialized, and many are envisioning clashes that will hurt RISC-V progress, either fragmentation or stagnation, as tactical approaches and considerations are implemented or debated.

Previously two major strategic approaches were hinted at, even if they were not outright proposed.

Hardware Support - this has been explicitly proposed in many flavours, and is currently in the minds of many.
     The idea is a mode shift analogous to
        arm's transition to thumb and back and
        intel's myriad of operating modes: real, protected, virtual, long and their disparate instantiations.
     I agree that implementations should have considerable freedom on how to provide hardware select-able functionality.
     However, a proposed framework to support that should be provided by riscv.org.
     Recent discussion and document tweaks about misa (Machine ISA register) suggest that this mechanism,
          though valuable, is inadequate as robust support for the explosion of features.
     An expanded framework will be necessary, perhaps along the lines of the two level performance counters definitions.
     The conflict with overlapping mappings of groups of instructions to the same encoding space is not easily addressed by this mechanism.

which leads us to

Software Support:

The Generalized Proposal:
All future extensions are not mapped to a fixed exclusive universal encoding,
but rather to appropriately sized [based initially off 32 isize] minor [22-bit], major[25-bit] or quadrant [30-bit] encoding,
that is allocated to the appropriate instruction encoding at link/load time to match the hardware [or hardware dynamic configuration, as above].
This handles the green field encodings.
Each feature could have a default minor/major/quadrant encoding designation.

Brown field can also be managed: simply, if the related co-encoded feature is present; with more complexity, and perhaps extensive opcode mapping, if blended into other features' encodings.

An implementation method would be to have a fixed exclusive universal prefix for each feature.
Each instruction would then be emitted by the compiler as a [prefix]:[instruction with default encoding] pair.
If the initial prefixes are also nops [most of which are currently designated as hints],
then the code would be executable on machines that use the default mapping
without any link/load intervention [at lower performance granted].

This approach is backward compatible for the other established extensions:
most notably F which consumes 7 major opcode spaces [and *only* 5 with Zfinx (Zifloat?)] and
then AMO which also consumes the majority of a major opcode.

This strategic change has a number of immediate and significant benefits:
  1) custom reserved major op codes effectively become unreserved as "standard" extensions can be mapped there also.
       The custom reserved nature will then only be the designated default allocation, "standard extensions" will not default to them.
  2) as mentioned above, if the prefix is a nop then link/load support is not needed for direct execution support [only efficiency].
  3) the transition to higher bit encodings can be simplified, as easily as the compiler emitting the designated prefix for that feature that encodes for 64 bit instructions.
So, two assigned fixed exclusive encodings per feature may be useful, one a 64bit encoding and one a nop.

I do not intend to stifle any of the tactical discussions of co-usefulness of features and profile domains.
These are meaningful and useful considerations.

Rather, I hope that by having a framework for coexistence of features, those discussions can proceed in a more guided way,
so that discoveries can be incorporated into a framework-centric corpus of understanding of trade-offs and cooperative benefits of features/profiles.


On 2020-10-23 11:45 p.m., Robert Chyla wrote:
I agree with Greg's statements. For me 'code-size' is very important for small, deeply embedded/IoT-class systems.

Work in other groups (bitmanip) will also benefit code size, but it is not the primary focus, I think, as these will also improve code speed.

Linux-like big processors usually have DDR RAM and code size is 'unlimited'.
It should not hurt, as code-size advances will benefit such big systems, but we should not forget about 'cheap to implement' = 'logic size' factors.

IMO 'code-size' and 'code-speed' will be pulling the same rug (ISA space) in opposite directions. We must balance it properly - having the rug in one piece is IMO most important.

Regards,
/Robert

On 10/23/2020 5:11 PM, Greg Favor wrote:
It seems like a TG, probably through the statement of its charter, should clearly define what types or classes of systems it is focused on optimizing for (if there is an intended focus) and what types or classes of systems it does not expect to be appropriate for. More concretely, it seems like there are a few TGs developing extensions oriented towards embedded real-time systems and/or low-cost embedded systems. These are extensions that would probably not be implemented in full-blown Linux-class systems. Those extensions don't need to worry about being acceptable to such system designs, and can optimize for the requirements and constraints of their target class(es) of systems.

Unless I'm mistaken, this TG falls in that category.  And as long as the charter captures this, then the extension it produces can be properly evaluated against its goals and target system applications (and not be judged wrt other classes of systems).  And key trade-off considerations - like certain types of implementation approaches being acceptable or unacceptable for the target system applications - should probably be agreed upon early on.

Greg

On Fri, Oct 23, 2020 at 4:34 PM Allen Baum <allen.baum@...> wrote:
This discussion is bringing up an issue that needs wider discussion about extensions in general.
RISC-V is intended to be an architecture that supports an extremely wide range of implementations,
ranging from very low gate count microcontrollers to high-end superscalar out-of-order processors.
How do we evaluate an extension that only makes sense at one end or the other?

I don't expect vector or even hypervisor extensions in a low gate count system.
There are other extensions that are primarily aimed at specific application areas as well.

A micro-sequenced (e.g. push/pop[int]) op might be fairly trivial to implement in a low gate count system
(e.g. without VM, but with PMPs) and have significant savings in code size, power, and increased performance.
They may have none of those, or less significant, advantages in a high-end implementation --
and/or might be very difficult or costly to implement in them (e.g. for TLB miss, interrupt, & exception handling).
(I am not claiming that these specific ops do, but just pretend there is one like that.)

Should we avoid defining instructions and extensions like that?
Or just allow that some extensions just don't make sense for some class of implementation?
Are there guidelines we can put in place to help make those decisions?
This same (not precisely the same) kind of issue is rearing its head in other places, e.g. range-based CMOs.


--
Regards,
Robert Chyla, Lead Engineer, Debug and Trace Probe Software
IAR Systems
1211 Flynn Rd, Unit 104
Camarillo, CA  93012 USA
Office: +1 805 383 3682 x104
E-mail: Robert.Chyla@... Website: www.iar.com


Re: change "raise illegal instruction" -> "reserved" for static encodings

Krste Asanovic
 

There is text in some places stating this, but it is also something that needs to get reemphasized in the base ISA manual intro.

Krste

On Oct 25, 2020, at 4:37 PM, Roger Espasa <roger.espasa@...> wrote:

Did you keep/add text encouraging implementations to indeed raise illegal on reserved encodings? I went through the patch (rather quickly) and did not see it.

Roger 

On Mon, 26 Oct 2020 at 00:32, Krste Asanovic <krste@...> wrote:

I'm working through updates to vector spec, and one part of clean up
is changing text where it has mandatory raising of illegal instruction
exceptions on unsupported encodings to instead state the encoding is
"reserved".  This will allow future extensions to use these encodings,
while current implementations can continue (and are encouraged) to
raise illegal instruction exceptions on unsupported reserved
encodings.

We didn't talk through this in meetings, but I assume this will be
uncontroversial.

https://github.com/riscv/riscv-v-spec/commit/96a9bc96640b9a12bf8868634671e501f2a24b77

Krste







Re: change "raise illegal instruction" -> "reserved" for static encodings

Roger Espasa
 

Did you keep/add text encouraging implementations to indeed raise illegal on reserved encodings? I went through the patch (rather quickly) and did not see it.

Roger 

On Mon, 26 Oct 2020 at 00:32, Krste Asanovic <krste@...> wrote:

I'm working through updates to vector spec, and one part of clean up
is changing text where it has mandatory raising of illegal instruction
exceptions on unsupported encodings to instead state the encoding is
"reserved".  This will allow future extensions to use these encodings,
while current implementations can continue (and are encouraged) to
raise illegal instruction exceptions on unsupported reserved
encodings.

We didn't talk through this in meetings, but I assume this will be
uncontroversial.

https://github.com/riscv/riscv-v-spec/commit/96a9bc96640b9a12bf8868634671e501f2a24b77

Krste






change "raise illegal instruction" -> "reserved" for static encodings

Krste Asanovic
 

I'm working through updates to vector spec, and one part of clean up
is changing text where it has mandatory raising of illegal instruction
exceptions on unsupported encodings to instead state the encoding is
"reserved". This will allow future extensions to use these encodings,
while current implementations can continue (and are encouraged) to
raise illegal instruction exceptions on unsupported reserved
encodings.
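
Illustratively, the decode-time difference reads something like the following C sketch, with invented helper names; the point is that trapping on a reserved encoding is encouraged rather than mandated:

```c
/* Sketch of the behavioral intent described above, with invented helper
 * names: a "reserved" encoding is not required to trap, but current
 * implementations are encouraged to raise illegal-instruction, keeping
 * the encoding free for future extensions to define. */
#include <stdbool.h>
#include <stdint.h>

bool encoding_is_implemented(uint32_t insn);  /* invented helpers */
void raise_illegal_instruction(void);
void execute(uint32_t insn);

void decode(uint32_t insn) {
    if (!encoding_is_implemented(insn)) {
        /* Reserved, not "must trap": a future extension may define it,
         * but trapping today is the encouraged behavior. */
        raise_illegal_instruction();
        return;
    }
    execute(insn);
}
```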

We didn't talk through this in meetings, but I assume this will be
uncontroversial.

https://github.com/riscv/riscv-v-spec/commit/96a9bc96640b9a12bf8868634671e501f2a24b77

Krste


Vector Task Group minutes, 2020/10/23

Krste Asanovic
 

Reminder: No meeting next Friday October 30.

Date: 2020/10/23
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~12
Current issues on github: https://github.com/riscv/riscv-v-spec

There is no task group meeting next week (October 30).

We have identified a potential resource for building a SAIL model for the V extension.

We will investigate various paths to building a compliance suite, including the Imperas-released suite.

Issues discussed:

#575 Are the reserved bits of vtype WARL?

No. An implementation can set vill if the configuration is not supported.
While implementations are allowed to trap on reserved values, we don't
want to mandate data-dependent traps. vsetvli instructions have an 11b
value written to vtype (the sign bit of the 12b imm field selects the
immediate/register forms of vsetvl). For future extension, we could
view vtype as an 11b CSR, with additional CSRs added for more bits
later.
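
A small C sketch of the field split as worded in these minutes (illustrative only, not the normative spec text):

```c
/* Sketch of the vsetvl/vsetvli field split described above: a 12-bit
 * immediate field whose sign bit selects the register form, leaving an
 * 11-bit value to be written to vtype. Follows the minutes' wording,
 * not the spec text. */
#include <stdint.h>

void split_vset_imm(uint32_t insn) {
    uint32_t imm12    = (insn >> 20) & 0xFFFu; /* I-type immediate field  */
    uint32_t reg_form = (imm12 >> 11) & 1u;    /* sign bit: vsetvl form   */
    uint32_t vtype11  = imm12 & 0x7FFu;        /* 11 bits written to vtype */
    (void)reg_form;
    (void)vtype11;
}
```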

#519 Reserve wider EEW

Proposal to reserve EEWs of 128-1024 for v1.0 ratification, with the anticipation of using these encodings later as proposed. Agreed.

#522 Forbid register overlap in reductions?

This issue was closed, as agreed that there were no additional
constraints above those needed for regular renaming or restart.

#544 Proposal to add a "addvl" instruction

This would include a multiply immediate to scale vlenb by the number of vector registers to be saved/restored, to reduce code size.

It was noted that Zba will reduce the cost in some cases as
shift-and-add instructions can perform some small immediate
multiplies.
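
For example, with Zba's sh1add/sh2add/sh3add (rd = rs2 + (rs1 << 1/2/3)), several of the common small multiples of vlenb become single instructions; the C below shows the shift-and-add forms a compiler could emit (function names are illustrative):

```c
/* Illustration of the Zba point above: scaling vlenb by a small
 * register-save count with one shift-and-add instead of a multiply.
 * sh1add/sh2add/sh3add compute rs2 + (rs1 << 1/2/3), so each body below
 * is a single instruction with Zba. */
#include <stdint.h>

uintptr_t scale_vlenb_3(uintptr_t vlenb) { return (vlenb << 1) + vlenb; } /* sh1add */
uintptr_t scale_vlenb_5(uintptr_t vlenb) { return (vlenb << 2) + vlenb; } /* sh2add */
uintptr_t scale_vlenb_9(uintptr_t vlenb) { return (vlenb << 3) + vlenb; } /* sh3add */
```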

The consensus was that more experience was needed with software use
cases before adding any new instruction.


Re: [RISC-V] [tech-*] STRATEGIC FEATURE COEXISTENCE was:([tech-fast-int] usefulness of PUSHINT/POPINT from [tech-code-size])

David Horner
 

These are all important considerations.

However, what they have in common when considering Allen's question:

This discussion is bringing up an issue that needs wider discussion about extensions in general.

is that they are all tactical considerations in the context of our current framework of instruction space allocation. What we will find is that these trade-off considerations will reinforce the dilemma that Allen raises: how do we manage these conflicting "necessities/requirements" of different target environments?

I have hinted at it already: we need not only tactical analysis of feature tradeoff in different domains but a strategic approach to support them.

The concern is nothing new. It has been raised, if only obliquely, many times prior on the [google] groups.riscv.org (-dev, -sw especially) and  lists.riscv.org TG threads.
The vector group, especially,  has grappled with it in the context of current V encoding being a subset of a [hypothetical] 64 bit encoding.

Specific proposals have been mentioned, but there was then no political will or, perhaps more fairly, no common perception that there was a compelling reason to work systematically to address it. The [then] common thinking was that 48 and 64 bit instruction spaces would be used as 32 and 16 bit are exhausted, and everyone would be happy. Well, that naive hope has not materialized, and many are envisioning clashes that will hurt RISC-V progress, either fragmentation or stagnation, as tactical approaches and considerations are implemented or debated.

Previously two major strategic approaches were hinted at, even if they were not outright proposed.

Hardware Support - this has been explicitly proposed in many flavours, and is currently in the minds of many.
     The idea is a mode shift analogous to
        arm's transition to thumb and back and
        intel's myriad of operating modes: real, protected, virtual, long and their disparate instantiations.
     I agree that implementations should have considerable freedom on how to provide hardware select-able functionality.
     However, a proposed framework to support that should be provided by riscv.org.
     Recent discussion and document tweaks about misa (Machine ISA register) suggest that this mechanism,
          though valuable, is inadequate as robust support for the explosion of features.
     An expanded framework will be necessary, perhaps along the lines of the two level performance counters definitions.
     The conflict with overlapping mappings of groups of instructions to the same encoding space is not easily addressed by this mechanism.

which leads us to

Software Support:

The Generalized Proposal:
All future extensions are not mapped to a fixed exclusive universal encoding,
but rather to appropriately sized [based initially off 32 isize] minor [22-bit], major[25-bit] or quadrant [30-bit] encoding,
that is allocated to the appropriate instruction encoding at link/load time to match the hardware [or hardware dynamic configuration, as above].
This handles the green field encodings.
Each feature could have a default minor/major/quadrant encoding designation.

Brown field can also be managed: simply, if the related co-encoded feature is present; with more complexity, and perhaps extensive opcode mapping, if blended into other features' encodings.

An implementation method would be to have a fixed exclusive universal prefix for each feature.
Each instruction would then be emitted by the compiler as a [prefix]:[instruction with default encoding] pair.
If the initial prefixes are also nops [most of which are currently designated as hints],
then the code would be executable on machines that use the default mapping
without any link/load intervention [at lower performance granted].

This approach is backward compatible for the other established extensions:
most notably F which consumes 7 major opcode spaces [and *only* 5 with Zfinx (Zifloat?)] and
then AMO which also consumes the majority of a major opcode.

This strategic change has a number of immediate and significant benefits:
  1) custom reserved major op codes effectively become unreserved as "standard" extensions can be mapped there also.
       The custom reserved nature will then only be the designated default allocation, "standard extensions" will not default to them.
  2) as mentioned above, if the prefix is a nop then link/load support is not needed for direct execution support [only efficiency].
  3) the transition to higher bit encodings can be simplified, as easily as the compiler emitting the designated prefix for that feature that encodes for 64 bit instructions.
So, two assigned fixed exclusive encodings per feature may be useful, one a 64bit encoding and one a nop.

I do not intend to stifle any of the tactical discussions of co-usefulness of features and profile domains.
These are meaningful and useful considerations.

Rather, I hope that by having a framework for coexistence of features, those discussions can proceed in a more guided way,
so that discoveries can be incorporated into a framework-centric corpus of understanding of trade-offs and cooperative benefits of features/profiles.


On 2020-10-23 11:45 p.m., Robert Chyla wrote:
I agree with Greg's statements. For me 'code-size' is very important for small, deeply embedded/IoT-class systems.

Work in other groups (bitmanip) will also benefit code size, but it is not the primary focus, I think, as these will also improve code speed.

Linux-like big processors usually have DDR RAM and code size is 'unlimited'.
It should not hurt, as code-size advances will benefit such big systems, but we should not forget about 'cheap to implement' = 'logic size' factors.

IMO 'code-size' and 'code-speed' will be pulling the same rug (ISA space) in opposite directions. We must balance it properly - having the rug in one piece is IMO most important.

Regards,
/Robert

On 10/23/2020 5:11 PM, Greg Favor wrote:
It seems like a TG, probably through the statement of its charter, should clearly define what types or classes of systems it is focused on optimizing for (if there is an intended focus) and what types or classes of systems it does not expect to be appropriate for. More concretely, it seems like there are a few TGs developing extensions oriented towards embedded real-time systems and/or low-cost embedded systems. These are extensions that would probably not be implemented in full-blown Linux-class systems. Those extensions don't need to worry about being acceptable to such system designs, and can optimize for the requirements and constraints of their target class(es) of systems.

Unless I'm mistaken, this TG falls in that category.  And as long as the charter captures this, then the extension it produces can be properly evaluated against its goals and target system applications (and not be judged wrt other classes of systems).  And key trade-off considerations - like certain types of implementation approaches being acceptable or unacceptable for the target system applications - should probably be agreed upon early on.

Greg

On Fri, Oct 23, 2020 at 4:34 PM Allen Baum <allen.baum@...> wrote:
This discussion is bringing up an issue that needs wider discussion about extensions in general.
RISC-V is intended to be an architecture that supports an extremely wide range of implementations,
ranging from very low gate count microcontrollers, to high end superscalar out-of-order processors.
How do we evaluate an extension that only makes sense at one end or the other?

I don't expect a vector, or even hypervisor extensions in a low gate count system.
There are other extensions that are primarily aimed at specific applications areas as well.

A micro-sequenced (e.g. push/pop[int]) op might be fairly trivial to implement in a low gate count system
(e.g. without VM, but with PMPs) and have significant savings in code size and power, and increased performance.
They may have none of those advantages, or less significant ones, in a high end implementation --
and/or might be very difficult or costly to implement in them (e.g. for TLB miss, interrupt, & exception handling).
 (I am not claiming that these specific ops do, but just pretend there is one like that)

Should we avoid defining instructions and extensions like that? 
Or just allow that some extensions just don't make sense for some class of implementation?
Are there guidelines we can put in place to help make those decisions? 
This same (not precisely the same) kind of issue is rearing its head in other places, e.g. range based CMOs.


--
Regards,
Robert Chyla, Lead Engineer, Debug and Trace Probe Software
IAR Systems
1211 Flynn Rd, Unit 104
Camarillo, CA  93012 USA
Office: +1 805 383 3682 x104
E-mail: Robert.Chyla@... Website: www.iar.com


Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

David Horner
 


On 2020-10-21 6:33 p.m., swallach wrote:
i have not totally been following this discussion.  but at convex we handled this very simply

if VL = 0, no vector operation was executed; the vector instruction completed and sequential operation proceeded.

to the best  of my  knowledge this never came up as an issue

https://github.com/riscv/riscv-v-spec/issues/587#issuecomment-711087236

To clarify, Andrew's reading of the spec has vstart >= vl behaviour superseding the vl=0 implied behaviour.

Thus some vector instructions are executed even when vl=0. vfirst and vpopc are two of them.




--------------------------------------------------------------------




On Fri, 16 Oct 2020 18:14:48 -0700, Andy Glew Si5 <andy.glew@...> said:

|     [DH]: I see this openness/lack of arbitrary constraint  as  precisely  the strength of RISCV.
|     Limiting vector operations due to current constraints in software (Linux does it this way, compilers cannot optimize that formulation [yet])
|     or hardware (reminiscent of delayed branch because prediction was too expensive) is short sighted.
|     A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases  AND guarantees forward progress under
|     SOFTWARE control.

| Sure. You could guarantee forward progress, e.g. by allowing no more than 10 successive "first fault" with VL=0, and requiring trap on element zero
| on the 11th.   That check could be in hardware, or it could be in
| the software that's calling the FF instruction.

I don't want us to rathole on how to guarantee forward progress for
vl=0 case, but do want to note that this kind of forward progress is
nasty to guarantee, implying there's long-lasting microarch state to
keep around - what if you're context swapped out before you get to the
11th?  Do you have to force the first one after a context swap to not
trim?  What if there's a sequence of ff's and second one goes back to
vl=0?
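
As a sketch of what Andy's software-side form of that check might look like -- vle_ff and vle_trap are hypothetical helpers, not real intrinsics; only the retry-counter control flow matters here:

    #include <stddef.h>

    /* Hypothetical helpers -- these are NOT real intrinsics or library calls.
     * vle_ff:   fault-on-first load of up to 'avl' elements; returns the trimmed
     *           vector length, which on the platform imagined here may be 0.
     * vle_trap: ordinary load that traps on any fault, so element 0 either
     *           completes or raises a genuine exception. */
    extern size_t vle_ff(const char *p, size_t avl);
    extern size_t vle_trap(const char *p, size_t avl);

    size_t load_with_progress(const char *p, size_t avl)
    {
        enum { RETRY_LIMIT = 10 };
        /* Software analogue of "no more than 10 successive vl=0 results":
         * after RETRY_LIMIT empty returns, fall back to the trapping load. */
        for (int tries = 0; tries < RETRY_LIMIT; tries++) {
            size_t vl = vle_ff(p, avl);
            if (vl > 0)
                return vl;       /* forward progress achieved */
        }
        return vle_trap(p, avl); /* forces progress or a real trap */
    }

Keeping the counter in software avoids the long-lasting microarchitectural state that makes the hardware-side version nasty across context swaps.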



Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

swallach
 

i have not totally been following this discussion.  but at convex we handled this very simply

if VL = 0, no vector operation was executed; the vector instruction completed and sequential operation proceeded.

to the best  of my  knowledge this never came up as an issue



--------------------------------------------------------------------




On Fri, 16 Oct 2020 18:14:48 -0700, Andy Glew Si5 <andy.glew@...> said:

|     [DH]: I see this openness/lack of arbitrary constraint  as  precisely  the strength of RISCV.
|     Limiting vector operations due to current constraints in software (Linux does it this way, compilers cannot optimize that formulation [yet])
|     or hardware (reminiscent of delayed branch because prediction was too expensive) is short sighted.
|     A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases  AND guarantees forward progress under
|     SOFTWARE control.

| Sure. You could guarantee forward progress, e.g. by allowing no more than 10 successive "first fault" with VL=0, and requiring trap on element zero
| on the 11th.   That check could be in hardware, or it could be in
| the software that's calling the FF instruction.

I don't want us to rathole on how to guarantee forward progress for
vl=0 case, but do want to note that this kind of forward progress is
nasty to guarantee, implying there's long-lasting microarch state to
keep around - what if you're context swapped out before you get to the
11th?  Do you have to force the first one after a context swap to not
trim?  What if there's a sequence of ff's and second one goes back to
vl=0?



Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

krste@...
 

- [tech-cmo] so they don't get bothered with this off-topic discussion

On Fri, 16 Oct 2020 18:14:48 -0700, Andy Glew Si5 <andy.glew@sifive.com> said:
| [DH]: I see this openness/lack of arbitrary constraint  as  precisely  the strength of RISCV.
| Limiting vector operations due to current constraints in software (Linux does it this way, compilers cannot optimize that formulation [yet])
| or hardware (reminiscent of delayed branch because prediction was too expensive) is short sighted.
| A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases  AND guarantees forward progress under
| SOFTWARE control.

| Sure. You could guarantee forward progress, e.g. by allowing no more than 10 successive "first fault" with VL=0, and requiring trap on element zero
| on the 11th.   That check could be in hardware, or it could be in
| the software that's calling the FF instruction.

I don't want us to rathole on how to guarantee forward progress for
vl=0 case, but do want to note that this kind of forward progress is
nasty to guarantee, implying there's long-lasting microarch state to
keep around - what if you're context swapped out before you get to the
11th? Do you have to force the first one after a context swap to not
trim? What if there's a sequence of ff's and second one goes back to
vl=0?

| But this does not need to be in the RISC-V architectural standard. Not yet.

Let's agree on this and move on.

|  As long as the VL=0  encoding is free,  not used for  some other purpose, you can do that in your implementation.

|  Your implementation might not be able to pass the RISC-V architectural tests for FF, which I assume will probably assert an error if they find FF
| with VL=0.  But if your hardware has a chicken bit to reduce the threshold of VL=0 FFs to zero, or if you have a binary translator from the
| compliance tests to your software-guaranteed forward progress, sure.

|  Build something like that, so there's a lot of people who want it, and a few years from now we can put it into a future version of the vector
| standard.

To be clear, if this is ever done, it will be with a separate
encoding, not expanding behavior of current instructions. Returning
vl=0 is not a "free" part of encoding. Software might rightly want to
take advantage of knowing vl>0 so you cannot allow same instruction to
return vl=0 after the fact, so need a different opcode/mode.

Krste


| On 10/16/2020 4:48, David Horner wrote:

| First I am very happy that "arbitrary decisions by the micro-architecture" allow reduction of vl to any [non-zero] value.

| Even if such appear "random".

| On 2020-10-16 2:01 a.m., krste@berkeley.edu wrote:

| - I'm sure there's probably
| papers out there with this already).

| Exactly.
| I see this openness/lack of arbitrary constraint  as  precisely  the strength of RISCV.
| Limiting vector operations due to current constraints in software (Linux does it this way, compilers cannot optimize that formulation [yet])
| or hardware (reminiscent of delayed branch because prediction was too expensive) is short sighted.

| A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases  AND guarantees forward progress under
| SOFTWARE control.

| I see it as no different [in fundamental principle] than other cases such as RVI integer divide by zero behaviour that does not trap but can
| be  readily checked for.
| Also RVI integer overflow that if you want to check for it is at most a few instructions including the branch.

| (sending replies to vector list - as this is off topic for CMOs)

| My opinion is that baking SIMT execution model into ISA for purposes
| of exposing microarchitectural performance (i.e., cache misses)
| exposes too much of the machine, forcing application software to add
| extra retry loops (2nd nested loop inside of stripmining) and forcing
| system software to deal with complex traps.

|   [  Random historical connection - having a partial completion mask based
|      on cache misses is a vector version of the Stanford proposal for
|      "informing memory operations" where scalar core can branch on cache miss.
|                 https://dl.acm.org/doi/10.1145/232974.233000 ]
|   Most of the benefit for SIMT execution around microarchitectural
| hiccups can be obtained under the hood in the microarchitecture (and
| there are several hundred ISCA/MICRO/HPCA papers on doing that - I
| might be exaggerating, but only slightly - and I know Andy worked in
| this space at some point), and should outperform putting this handling
| into software.

| That said, I think it's OK to allow FF V loads to stop anywhere past
| element 0 including at a long-latency cache miss, mainly because it
| doesn't change anything in software model.

| I'm not sure it will really help perf that much in practice.  While
| it's easy to construct an example where it looks like it would help, I
| think in general most loops touch multiple vector operands, hardware
| prefetchers do well on vector streams, vector units are more efficient
| on larger chunks, scatter-gathers missing in cache limit perf anyway,
| etc., so it's probably a fairly brittle optimization (yes, you could
| add a predictor to figure out whether to wait for the other elements
| or go ahead with a partial vector result - I'm sure there's probably
| papers out there with this already).

| Krste

| On Fri, 16 Oct 2020 04:03:17 +0000, "Bill Huffman" <huffman@cadence.com> said:

| | My take is the same as Andrew has outlined below.
| |       Bill

| | On 10/15/20 4:30 PM, andrew@sifive.com wrote:

| |     EXTERNAL MAIL
|     |     Forwarding this to tech-vector-ext; couple comments below.
|     |     On Thu, Oct 15, 2020 at 2:33 PM Andy Glew Si5 <andy.glew@sifive.com> wrote:

| |         In vector meeting last Friday  I listened to both Krste and David Horner's  different opinions about fault-on-first and vector
| length trimming. I realized (and may have
| |         convinced other attendees) that the  RISC-V "fault-on-first"  vector length trimming need not be done just for things like
| page-faults.
|         |         Fault-on-first could be done for the first long latency cache miss, as long as vector element zero has been completed, 
| because vector element zero is the forward progress
| |         mechanism.
|         |         Indeed, IMHO the correct semantic requirement for fault-on-first is that it completes the  element zero of the
| operation,  but that it can randomly stop with the appropriate
| |         indication for vector length  trimming at any point in the middle of the instruction.
|         |     Indeed, I've found other microarchitectural reasons to favor this approach (e.g., speculating through mask-register values). 
| Enumerating all cases in which the length might be
| |     trimmed seems like a fool's errand, so just saying it can be truncated to >= 1 for any reason is the way to go.

| |         This is part of what David Horner wants.   However, it does not give him the  fault-on-first with zero length complete
| mechanism.   It could, if there were something else in
| |         the system that guaranteed forward progress
|         |     My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward
| progress, and cleanly matches the while-loop vectorization
| |     model.

| |         ---+ Expanded

| |         From vector meeting last Friday: trimming, fault-on-first.  I realized that it is similar to the forms of SW visible non-faulting
| speculative loads some machines, especially
| |         VLIWs, have. However, instead of delivering a NaN or NaT, it is non-faulting except for vector element 0, where it faults. The
| NaT-ness is implied by trimmed vector length.
| |         It could be implied by a mask showing which vector operations had completed.

| |         All such SW non-faulting loads need a "was this correct" operation, which might just be a faulting load and a comparison. 
| Software control flow must fall through such a
| |         check operation,  and through a redo of the faulting load if necessary. In scalar, non-faulting and faulting loads are different
| instructions, so there must be a branch.

| |         The RISC-V Fault-on-first approach  has the correctness check for non-faulting implied by redoing the instruction.  i.e. it is
| its own non-faulting check.  it gets away with
| |         this because the trimmed vector length indicates which parts were valid and which were not. Forward progress is guaranteed by trapping on
| vector element zero, i.e. never allowing a trim
| |         to zero length.   if a non-faulting vector approach was used instead of fault-on-first, it could return a vector complete mask,
| but to make forward progress it would have to
| |         guarantee that at least one vector element had completed.

| |         David Horner's desire for fault-on-first that may have performed no operations at all is (1)  reasonable IMHO (I think I managed
| to explain that to Krste), but (2) would
| |         require some other mechanism for forward progress. E.g. instead of trapping on element zero, the bitmask that I described above.
| Which is almost certainly a bigger
| |         architectural change than RISC-V should make it this time.

| |         Although more and more I am happier that I included such a completion bitmask in nearly every vector instruction set that I've
| ever done. Particularly those vector instruction
| |         sets that were supposed to implement SIMT efficiently. (I think of SIMT as a programming model that is implemented on top of what
| amounts to a vector instruction set and
| |         microarchitecture.  https://pharr.org/matt/papers/ispc_inpar_2012.pdf ).  It would be unfortunate for such an SIMT program to
| lose  work completed after the first fault.

| |         MORAL:  fault-on-first may be suitable for vector load that might speculate past the end of the vector -  where the length is 
| not known or inconvenient when the vector load
| |         instruction is started. Fault-on-first is  suboptimal for running SIMT on top of vectors.   i.e. fault-on-first  is the
| equivalent of precise exceptions for in order
| |         execution,  and for a single thread executing vector instructions, whereas  completion mask  allows out of order within a vector
| and/or vector length  threading.

| |         IMHO an important realization I made in that meeting is that fault-on-first does not need to be just about faulting. It is
| totally fine to have the fault-on-first stuff
| |         return up to the first really long latency cache miss, as long as it always guarantees that at least vector element zero was
| complete. Because vector element zero complete
| |         is what guarantees forward progress.

| |         Furthermore, it is not even required that fault-on-first stop at the first page-fault. An implementation could actually choose to
| actually implement a page-fault that did
| |         copy-on-write or  swapped in from disk.   but that would be visible to the operating system, not the user program.  However, such
| an OS implementation  would have to
| |         guarantee that it would not kill a process as a result  of a true permissions error page-fault. Or, if the virtual memory
| architecture made the distinction between
| |         permissions faults and the sorts of page-fault that is for disk swapping or copy-on-write or copy  on read,  the OS does not need
| to be involved.

| |          EVERYTHING about fault-on-first is a microarchitecture security/information leak channel and/or a virtualization hole. (Unless
| you only trim only on true faults and not  COW
| |         or COR or disk swappage-faults).   However,  fault-on-first on any page-fault is a much  lower bandwidth  information leak 
| channel  than is fault-on-first on long latency
| |         cache misses.  so a general purpose system might choose to implement fault-on-first on any page-fault, but might not want to
| implement fault-on-first on any cache miss.
| |         However, there are some systems for which that sort of security issue is not a concern. E.g. a data center or embedded system
| where all of the CPUs are dedicated to a single
| |         problem. In which case, if they can gain performance by doing fault-on-first on particular long latency cache misses, power to
| them!

| |         Interestingly, although fault-on-first on long latency cache misses is a high-bandwidth information leak, it is actually  much
| less of a virtualization hole than
| |         fault-on-first for page-faults.   The operating system or hypervisor has very little control over cache misses.  the OS and
| hypervisor have almost full control over
| |         page-faults.  The usual rule in security and virtualization is that an application should not be able to detect that it has had
| an "innocent"  page-fault, such as COW or COR
| |         or disk swapping.

| |         --
| |         --- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis
|     |

| --
| --- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis


Re: Sparse Matrix-Vector Multiply (again) and Bit-Vector Compression

Nick Knight
 

Hi Dawei,

Unfortunately the V-extension does not currently feature block-matrix multiplication instructions, including the bit-masked versions that you're describing. Some of these operations are intended to be addressed by a future (sub-) extension (perhaps EDIV). 

Presently, you can apply your scheme directly to the restricted class of BCSR matrices with 1-dimensional blocks. You can imagine various (nonzero-structure-dependent) tradeoffs between using r-by-1 and 1-by-c blocks.

More generally you can recursively compose (singly or doubly compressed) block-CS{R,C} formats, encoding indices using bitvectors or lists at any level of the recursion.

Best,
Nick


On Tue, Oct 20, 2020 at 2:04 AM lidawei14 via lists.riscv.org <lidawei14=huawei.com@...> wrote:
Hi,

Perhaps instead of using bit vector to encode an entire matrix, we can encode a sub block.
There is a common sparse matrix format called BCSR that blocks the non-zero values of CSR, so that we can reduce col_ind[] storage and reuse vector x.
The main disadvantage of BCSR is we have to pad zeros, where we can actually use a bit mask to encode the nonzeros of a sub block, as in Nagendra's bit vector implementation, so that the overhead can be avoided.
I could not find good reduction instructions for tiled matrix vector multiplications if we have multiple rows in a block.

One sub block:
A =
a b
0 d 
Corresponding x:
x =
e
f
Bit vector:
1 1 0 1
Computation:
a b 0 d 
e f e f
 
fmul = ae bf 0e df 
accumulate (reduction) ae+bf,0e+df
(Note we can skip that zero computation using bit mask).
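
A scalar C sketch of that bit-masked block computation, using the block, mask, and per-row reduction from the example above (the function and concrete values are purely illustrative):

    #include <stdio.h>

    /* Scalar model of one bit-masked 2x2 block, following the example:
     *   A = [a b; 0 d], x = [e; f], bit vector 1101 (row-major, 1 = nonzero).
     * Only masked-in products are computed, so the padded zero costs nothing. */
    static void block_spmv_2x2(const double vals[4], unsigned mask,
                               const double x[2], double y[2])
    {
        for (int r = 0; r < 2; r++) {
            double acc = 0.0;
            for (int c = 0; c < 2; c++) {
                int idx = r * 2 + c;
                if (mask & (1u << (3 - idx)))   /* bit vector read left to right */
                    acc += vals[idx] * x[c];    /* fmul */
            }
            y[r] += acc;                        /* per-row reduction: ae+bf, df */
        }
    }

    int main(void)
    {
        const double vals[4] = { 1.0, 2.0, 0.0, 4.0 };  /* a b 0 d */
        const double x[2]    = { 5.0, 6.0 };            /* e f */
        double y[2]          = { 0.0, 0.0 };
        block_spmv_2x2(vals, 0xDu /* 1101 */, x, y);
        printf("y = [%g, %g]\n", y[0], y[1]);           /* 17 and 24 */
        return 0;
    }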

Thanks,
Dawei


Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

David Horner
 

You're incorrectly characterizing FoF below.  The FoF loads are not
intended for software to dynamically probe the microarch state to
check for possible faults

That  is not what I am advocating.

(though it can be misused that way).  The
point is to support software vector-length speculation, where whether
an access is really needed is not known ahead of time.

That is not precisely the full use case.
Rather, your intended use case is:

When the application is assured that a constrained load can succeed
     [the system guarantees that a termination condition for the load exists,
      that it is detectable from the data read up to and including the end point,
      and that all the data from the start point to the end point is readable],
   then FoF provides a convenient and expedited way to advance through the load.

And if you define "not known ahead of time" to mean before each successive load, then that time frame is not precisely true either.
The load could be performed one unit at a time, and each time the need would be known.
The unit requested could be of arbitrary length [successive packets of Ethernet data or crypto segments].
I'm not trying to be obtuse and oppositional.
The value of FoF is to avoid the complexities of such tracking,
but if an EE were to reasonably guarantee that the data to be loaded
can be speculatively read up to a page boundary, then FoF is not needed,
nor does it necessarily provide any hard advantage over the regular strided load. 
[some implementations may detect such things as debug breakpoints and not trigger them, but as far as the software is concerned it has the speculative to-the-end-of-the-page guarantee, thus it will be content even if the debugger is annoyed]


The FoF loads are not
intended for software to dynamically probe the microarch state to
check for possible faults (though it can be misused that way).

The detection of microarch state is incidental to the characterization I attribute to FoF.
And it is not only microarch state that can be revealed but system and EE level state.

FoF fails in situations that are not covered by your use case.
Specifically, what does the EE do when it detects a situation in which forward progress is not possible,
e.g. the data requested is not mapped into the process?
As I understand your use case, the [standard] FoF load is aborted, and its process as well.
The "enhanced/dangerous" FoF load will be allowed vl=0 to identify the "abort" case.
Consider this scenario:
A process requests the EE to map pages to scan into another process' [e.g. a child's] address space,
and the asynchronous [child] co-process does the scanning.
FoF return vl=0 is eminently suited to this use case.
It is certainly possible to add to the handshaking/synchronization process the current end point of the data
that would need to be checked as each page is processed.
This can be substantial overhead and delay.

It is certainly possible to ensure that each request overreaches the natural page alignment.
However, as FoF allows the processor to reduce vl at any point, it could continually reduce vl so that it is better aligned to cache, anticipating that the following request will be optimized. The program will still work, and detect potential page failures, but the false positives could be substantial, and even more costly and substantially variable across implementations [not to mention the EE thinking the process is attempting a side-channel attack].

These use cases argue for a vl=0 return. And as I mentioned before, these use cases will motivate the EE to return vl=0, even without the application using the "new/corrupted" FoF encoding for which vl=0 is allowed.
 

On Tue, Oct 20, 2020 at 5:08 AM Krste Asanovic <krste@...> wrote:

If we allow regular FoF loads to return with vl=0, we must provide a
forward-progress guarantee, otherwise the instructions are practically
unusable.

I believe I have shown practical uses above.

 The forward-progress guarantee must not add overhead to the
common cases where returning vl=0 serves no useful purpose.

I certainly agree. But when does returning vl=0 serve no useful purpose?

this is difficult to describe, especially when code may have several
FoF loads in a stripmine loop.  If allowing FoF loads to return vl=0
requires application overhead to support the forward-progress
guarantee, then we should have a separate encoding for that
instruction so that the common case is not burdened by the esoteric
case.

There are different forward-progress guarantees.
As I mentioned before, a separate encoding will not provide a practical benefit.
Once the new encoding is introduced,
legacy processors will just have their EE emulate it by allowing a vl=0 return
under the same conditions, and the link editor will replace the new FoF with the old.




The FoF instructions allow software vector-length speculation in a
safe way, where the first element is checked as normal and raises any
necessary traps to the OS, while the later elements are not processed
if they're problematic.  Only if software attempts to actually process
the later elements, because processing the earlier elements deems it
necessary, is the required trap actioned.

The trap is serviced by the OS not the application.  Most commonly, it
will be a page fault, sometimes a protection violation.  Neither are
reported to the application (in general), because the application can
do nothing about these traps.  This is different from the other cases
you bring up (integer overflow, FP flags).


As mentioned before, if we think outside the box of the "classic" use case,
there certainly are meaningful and significant ways that applications can
handle EE level events (analogous to divide by zero).


There is no difficulty in providing forward progress on FoF loads in a
microarchitecture, as otherwise regular vector loads wouldn't work.
FoF loads are only a small modification to regular vector loads,
basically flushing the pipeline to change vl on a trap instead of
taking the trap and setting vstart.

The only way I would contemplate allowing trimming to vl=0 for the 1.0
standard was if there was a forward-progress guarantee that did not
burden regular uses of FoF loads. 


The default case is just such a non-burdensome approach.

Check vl=0 if you are not guaranteed to succeed.
Ignore vl=0 at your peril if you are unsure (you could end up in an infinite loop).
Ignore vl=0 if you are guaranteed not to read past valid memory.

Also, the guarantee would have to
actually enable some improvement in an implementation (as otherwise,
no one would choose to trim to 0, and we can then keep the spec
simple).

The spec will need to address this case in any event, even if only to say we do not recommend the EE return with vl=0.
The spec cannot mandate that the EE not return vl=0. Certification does not extend to runtime-constrained EEs.
Code needs to be aware that this can happen.
The net is, I don't believe the "prohibition" significantly simplifies the spec.
It may actually make it more contentious.

You simplified integer divide relative to other ISAs that mandated a trap for divide by zero.
With this approach we mandate a trap for FoF when vl=0 would be sufficient.

Whereas it is inevitable that the EE will do the sensible thing and
return vl=0 when forward progress [within reasonable constraints] is not possible.

 

Krste


>>>>> On Sat, 17 Oct 2020 22:39:37 -0400, "David Horner" <ds2horner@...> said:

| On 2020-10-17 6:49 p.m., krste@... wrote:
|| - [tech-cmo] so they don't get bothered with this off-topic discussion
||
||||||| On Fri, 16 Oct 2020 18:14:48 -0700, Andy Glew Si5 <andy.glew@...> said:
|| |     [DH]: I see this openness/lack of arbitrary constraint  as  precisely  the strength of RISCV.
|| |     Limiting vector operations due to current constraints in software (Linux does it this way, compilers cannot optimize that formulation [yet])
|| |     or hardware (reminiscent of delayed branch because prediction was too expensive) is short sighted.
|| |     A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases  AND guarantees forward progress under
|| |     SOFTWARE control.
||
|| | Sure. You could guarantee forward progress, e.g. by allowing no more than 10 successive "first fault" with VL=0, and requiring trap on element zero
|| | on the 11th.   That check could be in hardware, or it could be in
|| | the software that's calling the FF instruction.
||
|| I don't want us to rathole on how to guarantee forward progress for
|| vl=0 case, but do want to note that this kind of forward progress is
|| nasty to guarantee, implying there's long-lasting microarch state to
|| keep around - what if you're context swapped out before you get to the
|| 11th?  Do you have to force the first one after a context swap to not
|| trim?  What if there's a sequence of ff's and second one goes back to
|| vl=0?

| Krste: I gather your answer is more in the context of lr/sc type forward
| guarantees, instructions that are designed not to trap when delivering
| on their primary function.

|     So  I agree that determining an appropriate deferred trap is
| problematic.

|      However, the intent of vl*ff is to operate in an environment
| anticipating exception behaviour.

|       It is the instruction's raison d'être.  If the full vl is expected
| to always be returned [with very, very few exceptions] we would not have
| this instruction, but rather direct the EE to reduce vl or abort the
| process.

|      So rather than a rathole we have the Elephant-In-The-Room. What
| does the EE do when deferred forward progress is not possible?

|      Given that the application is anticipating "trouble" with the read
| memory access, does it make sense to only address the "safe" case?

|      With float exceptions RISCV does not provide trap handlers, but
| rather FFlags for the application to electively check.

|      With integer overflow or zero divide RISCV does not provide trap
| handlers, but requires the application to include code to detect the
| condition.

|      Trap handlers for vl*ff are only incidental. They are no more
| special to vl*ff than any other of the vl*, or the RVI lw,lh,lb, etc.

|      In apparent contradiction to the spec, a valid implementation can
| "trap" as it would for the non-ff but not service the fault, only reduce
| the vl accordingly until the fault occurs on the first element.

|      Thus central to the functioning of the instruction is what happens
| when the fault occurs on the first element.

|      Punting to the handler is not an answer. Return at least one
| element or trap does not define the operational characteristics [even if
| it may arguably be an ISA architectural answer].

|      There is nothing prohibiting the trap from returning vl=0. And I
| argue that EEs will indeed elect to do that when there can be no forward
| progress [e.g. the requested address is mapped execute only].

|      Platforms will stipulate a behaviour and vl=0 will be a choice.
| What we should try to address is how to allow greatest portability and
| least software fragmentation.

|    I believe this should be accomplished exactly as was effected for the
| integer overflow. Exclude the checking code if you do not need it, and
| include it if you are not assured that it is superfluous.

|   In other words vl=0 must be handled, either by avoidance or
| explicitly as an indication that there is nothing to process.

|| | But this does not need to be in the RISC-V architectural standard. Not yet.
||
|| Let's agree on this and move on.
| There is no value in ignoring the issue.
||


Vector TG meeting minutes 2020/10/16

Krste Asanovic
 

Date: 2020/10/16
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~16
Current issues on github: https://github.com/riscv/riscv-v-spec

# Definition of Done checklist

We discussed what items on the "definitions of done" checklist are not
currently underway, including the architectural compliance tests and
the SAIL formal model.

# 550 Names/contents of initial vector subsets.

We discussed the proposals for initial vector subsets for
microcontroller/embedded use cases. The proposed five subsets are
Zve32x, Zve32f, Zve64x, Zve64f, Zve64d.

The consensus was to not include Zvamo in these, as some embedded
memory systems do not have caches, and many interconnects do not
support atomics.

The vrgather instruction was discussed for omission, but the consensus was
that if indexed memory accesses were to be supported (which had
agreement), then vrgather was relatively easy to add and was very
useful, especially in low-end embedded systems with limited memory
systems.

Reductions were discussed for omission, but it was felt that they're
commonly used and they would fragment the software base too much if
left out.

There was some discussion about supporting only LMUL=1, but it was noted
this would not allow use of the widening and narrowing instructions.
Another proposal was to support only LMUL<=1, as this would simplify
some implementations while still supporting the full instruction set.
This would form an even more minimal base.

There was no clear consensus on requiring a minimum VLEN>ELEN. This
would interact with a proposal to allow LMUL<=1, in that a longer VLEN
might be mandated if LMUL<=1 since larger LMULs could not be used to
hold longer vectors.

We did not reach agreement on whether 64b integer multiplies should be
included in Zve64f subset.


Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

Krste Asanovic
 

If we allow regular FoF loads to return with vl=0, we must provide a
forward-progress guarantee, otherwise the instructions are practically
unusable. The forward-progress guarantee must not add overhead to the
common cases where returning vl=0 serves no useful purpose. I believe
this is difficult to describe, especially when code may have several
FoF loads in a stripmine loop. If allowing FoF loads to return vl=0
requires application overhead to support the forward-progress
guarantee, then we should have a separate encoding for that
instruction so that the common case is not burdened by the esoteric
case.

You're incorrectly characterizing FoF below. The FoF loads are not
intended for software to dynamically probe the microarch state to
check for possible faults (though it can be misused that way). The
point is to support software vector-length speculation, where whether
an access is really needed is not known ahead of time.

The FoF instructions allow software vector-length speculation in a
safe way, where the first element is checked as normal and raises any
necessary traps to the OS, while the later elements are not processed
if they're problematic. Only if software attempts to actually process
the later elements, because processing the earlier elements deems it
necessary, is the required trap actioned.
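
As a sketch of the while-loop (stripmine) pattern this enables -- vsetvl, vle8_ff, and find_zero are hypothetical stand-ins for vector intrinsics, not real API names; the control flow is the point:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-ins for vector intrinsics -- NOT real API names.
     * vsetvl:    requests up to 'avl' elements, returns the granted vl.
     * vle8_ff:   fault-on-first byte load; returns the trimmed vl (>= 1).
     * find_zero: index of the first zero byte in buf[0..vl), or -1 if none. */
    extern size_t vsetvl(size_t avl);
    extern size_t vle8_ff(uint8_t *dst, const uint8_t *src, size_t vl);
    extern long   find_zero(const uint8_t *buf, size_t vl);

    /* strlen-style scan: the length is unknown, so each load may speculate
     * past the terminator into an unmapped page.  Element 0 is always a valid
     * byte of the string, so it completes or traps as normal, while
     * problematic later elements are simply trimmed away rather than faulting. */
    size_t vec_strlen(const char *s)
    {
        const uint8_t *p = (const uint8_t *)s;
        uint8_t buf[256];
        size_t len = 0;

        for (;;) {
            size_t vl = vsetvl(sizeof buf);
            vl = vle8_ff(buf, p + len, vl);   /* may be trimmed, but vl >= 1 */
            long z = find_zero(buf, vl);
            if (z >= 0)
                return len + (size_t)z;
            len += vl;                        /* trimmed vl still makes progress */
        }
    }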

The trap is serviced by the OS not the application. Most commonly, it
will be a page fault, sometimes a protection violation. Neither are
reported to the application (in general), because the application can
do nothing about these traps. This is different from the other cases
you bring up (integer overflow, FP flags).

There is no difficulty in providing forward progress on FoF loads in a
microarchitecture, as otherwise regular vector loads wouldn't work.
FoF loads are only a small modification to regular vector loads,
basically flushing the pipeline to change vl on a trap instead of
taking the trap and setting vstart.

The only way I would contemplate allowing trimming to vl=0 for the 1.0
standard was if there was a forward-progress guarantee that did not
burden regular uses of FoF loads. Also, the guarantee would have to
actually enable some improvement in an implementation (as otherwise,
no one would choose to trim to 0, and we can then keep the spec
simple).

Krste


On Sat, 17 Oct 2020 22:39:37 -0400, "David Horner" <ds2horner@gmail.com> said:
| On 2020-10-17 6:49 p.m., krste@sifive.com wrote:
|| - [tech-cmo] so they don't get bothered with this off-topic discussion
||
||||||| On Fri, 16 Oct 2020 18:14:48 -0700, Andy Glew Si5 <andy.glew@sifive.com> said:
|| | [DH]: I see this openness/lack of arbitrary constraint  as  precisely  the strength of RISCV.
|| | Limiting vector operations due to current constraints in software (Linux does it this way, compilers cannot optimize that formulation [yet])
|| | or hardware (reminiscent of delayed branch because prediction was too expensive) is short sighted.
|| | A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases  AND guarantees forward progress under
|| | SOFTWARE control.
||
|| | Sure. You could guarantee forward progress, e.g. by allowing no more than 10 successive "first fault" with VL=0, and requiring trap on element zero
|| | on the 11th.   That check could be in hardware, or it could be in
|| | the software that's calling the FF instruction.
||
|| I don't want us to rathole on how to guarantee forward progress for
|| vl=0 case, but do want to note that this kind of forward progress is
|| nasty to guarantee, implying there's long-lasting microarch state to
|| keep around - what if you're context swapped out before you get to the
|| 11th? Do you have to force the first one after a context swap to not
|| trim? What if there's a sequence of ff's and second one goes back to
|| vl=0?

| Krste: I gather your answer is more in the context of lr/sc type forward
| guarantees, instructions that are designed not to trap when delivering
| on their primary function.

|    So  I agree that determining an appropriate deferred trap is
| problematic.

|     However, the intent of vl*ff is to operate in an environment
| anticipating exception behaviour.

|      It is the instruction's raison d'être.  If the full vl is expected
| to always be returned [with very, very few exceptions] we would not have
| this instruction, but rather direct the EE to reduce vl or abort the
| process.

|     So rather than a rathole we have the Elephant-In-The-Room. What
| does the EE do when deferred forward progress is not possible?

|     Given that the application is anticipating "trouble" with the read
| memory access, does it make sense to only address the "safe" case?

|     With float exceptions RISCV does not provide trap handlers, but
| rather FFlags for the application to electively check.

|     With integer overflow or zero divide RISCV does not provide trap
| handlers, but requires the application to include code to detect the
| condition.

|     Trap handlers for vl*ff are only incidental. They are no more
| special to vl*ff than any other of the vl*, or the RVI lw,lh,lb, etc.

|     In apparent contradiction to the spec, a valid implementation can
| "trap" as it would for the non-ff but not service the fault, only reduce
| the vl accordingly until the fault occurs on the first element.

|     Thus central to the functioning of the instruction is what happens
| when the fault occurs on the first element.

|     Punting to the handler is not an answer. Return at least one
| element or trap does not define the operational characteristics [even if
| it may arguably be an ISA architectural answer].

|     There is nothing prohibiting the trap from returning vl=0. And I
| argue that EEs will indeed elect to do that when there can be no forward
| progress [e.g. the requested address is mapped execute only].

|     Platforms will stipulate a behaviour and vl=0 will be a choice.
| What we should try to address is how to allow greatest portability and
| least software fragmentation.

|   I believe this should be accomplished exactly as was effected for the
| integer overflow. Exclude the checking code if you do not need it, and
| include it if you are not assured that it is superfluous.

|  In other words vl=0 must be handled, either by avoidance or
| explicitly as an indication that there is nothing to process.

|| | But this does not need to be in the RISC-V architectural standard. Not yet.
||
|| Let's agree on this and move on.
| There is no value in ignoring the issue.
||
|| |  As long as the VL=0  encoding is free,  not used for  some other purpose, you can do that in your implementation.
||
|| |  Your implementation might not be able to pass the RISC-V architectural tests for FF, which I assume will probably assert an error if they find FF
|| | with VL=0.  But if your hardware has a chicken bit to reduce the threshold of VL=0 FFs to zero, or if you have a binary translator from the
|| | compliance tests to your software-guaranteed forward progress, sure.
||
|| |  Build something like that, so there's a lot of people who want it, and a few years from now we can put it into a future version of the vector
|| | standard.
||
|| To be clear, if this is ever done, it will be with a separate
|| encoding, not expanding behavior of current instructions. Returning
|| vl=0 is not a "free" part of encoding. Software might rightly want to
|| take advantage of knowing vl>0 so you cannot allow same instruction to
|| return vl=0 after the fact, so need a different opcode/mode.
| And it is precisely because this backward compatibility is not managed
| if we tactically ignore vl=0 that we must address it, and allow vl=0 for
| V1.0.
||
|| Krste
||
||
|| | On 10/16/2020 4:48, David Horner wrote:
||
|| | First I am very happy that "arbitrary decisions by the micro-architecture" allow reduction of vl to any [non-zero] value.
||
|| | Even if such appear "random".
||
|| | On 2020-10-16 2:01 a.m., krste@berkeley.edu wrote:
||
|| | - I'm sure there's probably
|| | papers out there with this already).
||
|| | Exactly.
|| | I see this openness/lack of arbitrary constraint  as  precisely  the strength of RISCV.
|| | Limiting vector operations due to current constraints in software (Linux does it this way, compilers cannot optimize that formulation [yet])
|| | or hardware (reminiscent of delayed branch because prediction was too expensive) is short sighted.
||
|| | A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases  AND guarantees forward progress under
|| | SOFTWARE control.
||
|| | I see it as no different [in fundamental principle] than other cases such as RVI integer divide by zero behaviour that does not trap but can
|| | be  readily checked for.
|| | Also RVI integer overflow that if you want to check for it is at most a few instructions including the branch.
||
|| | (sending replies to vector list - as this is off topic for CMOs)
||
|| | My opinion is that baking SIMT execution model into ISA for purposes
|| | of exposing microarchitectural performance (i.e., cache misses)
|| | exposes too much of the machine, forcing application software to add
|| | extra retry loops (2nd nested loop inside of stripmining) and forcing
|| | system software to deal with complex traps.
||
|| |   [  Random historical connection - having a partial completion mask based
|| |      on cache misses is a vector version of the Stanford proposal for
|| |      "informing memory operations" where scalar core can branch on cache miss.
|| |                 https://dl.acm.org/doi/10.1145/232974.233000 ]
|| |   Most of the benefit for SIMT execution around microarchitectural
|| | hiccups can be obtained under the hood in the microarchitecture (and
|| | there are several hundred ISCA/MICRO/HPCA papers on doing that - I
|| | might be exaggerating, but only slightly - and I know Andy worked in
|| | this space at some point), and should outperform putting this handling
|| | into software.
||
|| | That said, I think it's OK to allow FF V loads to stop anywhere past
|| | element 0 including at a long-latency cache miss, mainly because it
|| | doesn't change anything in software model.
||
|| | I'm not sure it will really help perf that much in practice.  While
|| | it's easy to construct an example where it looks like it would help, I
|| | think in general most loops touch multiple vector operands, hardware
|| | prefetchers do well on vector streams, vector units are more efficient
|| | on larger chunks, scatter-gathers missing in cache limit perf anyway,
|| | etc., so it's probably a fairly brittle optimization (yes, you could
|| | add a predictor to figure out whether to wait for the other elements
|| | or go ahead with a partial vector result - I'm sure there's probably
|| | papers out there with this already).
||
|| | Krste
||
|| | On Fri, 16 Oct 2020 04:03:17 +0000, "Bill Huffman" <huffman@cadence.com> said:
||
|| | | My take is the same as Andrew has outlined below.
|| | |       Bill
||
|| | | On 10/15/20 4:30 PM, andrew@sifive.com wrote:
||
|| | |     EXTERNAL MAIL
|| |     |     Forwarding this to tech-vector-ext; couple comments below.
|| |     |     On Thu, Oct 15, 2020 at 2:33 PM Andy Glew Si5 <andy.glew@sifive.com> wrote:
||
|| | |         In vector meeting last Friday  I listened to both Krste and David Horner's  different opinions about fault-on-first and vector
|| | length trimming. I realized (and may have
|| | |         convinced other attendees) that the  RISC-V "fault-on-first"  vector length trimming need not be done just for things like
|| | page-faults.
|| |         |         Fault-on-first could be done for the first long latency cache miss, as long as vector element zero has been completed,
|| | because vector element zero is the forward progress
|| | |         mechanism.
|| |         |         Indeed, IMHO the correct semantic requirement for fault-on-first is that it completes the  element zero of the
|| | operation,  but that it can randomly stop with the appropriate
|| | |         indication for vector length  trimming at any point in the middle of the instruction.
|| |         |     Indeed, I've found other microarchitectural reasons to favor this approach (e.g., speculating through mask-register values).
|| | Enumerating all cases in which the length might be
|| | |     trimmed seems like a fool's errand, so just saying it can be truncated to >= 1 for any reason is the way to go.
||
|| | |         This is part of what David Horner wants.   However, it does not give him the  fault-on-first with zero length complete
|| | mechanism.   It could, if there were something else in
|| | |         the system that guaranteed forward progress
|| |         |     My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward
|| | progress, and cleanly matches the while-loop vectorization
|| | |     model.
||
|| | |         ---+ Expanded
||
|| | |         From vector meeting last Friday: trimming, fault-on-first.  I realized that it is similar to the forms of SW visible non-faulting
|| | speculative loads some machines, especially
|| | |         VLIWs, have. However, instead of delivering a NaN or NaT, it is non-faulting except for vector element 0, where it faults. The
|| | NaT-ness is implied by trimmed vector length.
|| | |         It could be implied by a mask showing which vector operations had completed.
||
|| | |         All such SW non-faulting loads need a "was this correct" operation, which might just be a faulting load and a comparison.
|| | Software control flow must fall through such a
|| | |         check operation,  and through a redo of the faulting load if necessary. In scalar, non-faulting and faulting loads are different
|| | instructions, so there must be a branch.
||
|| | |         The RISC-V Fault-on-first approach  has the correctness check for non-faulting implied by redoing the instruction.  i.e. it is
|| | its own non-faulting check.  it gets away with
|| | |         this because the trimmed vector length indicates which parts were valid and which were not. Forward progress is guaranteed by trapping on
|| | vector element zero, i.e. never allowing a trim
|| | |         to zero length.   if a non-faulting vector approach was used instead of fault-on-first, it could return a vector complete mask,
|| | but to make forward progress it would have to
|| | |         guarantee that at least one vector element had completed.
||
|| | |         David Horner's desire for fault-on-first that may have performed no operations at all is (1)  reasonable IMHO (I think I managed
|| | to explain that to Krste), but (2) would
|| | |         require some other mechanism for forward progress. E.g. instead of trapping on element zero, the bitmask that I described above.
|| | Which is almost certainly a bigger
|| | |         architectural change than RISC-V should make it this time.
||
|| | |         Although more and more I am happier that I included such a completion bitmask in nearly every vector instruction set that I've
|| | ever done. Particularly those vector instruction
|| | |         sets that were supposed to implement SIMT efficiently. (I think of SIMT as a programming model that is implemented on top of what
|| | amounts to a vector instruction set and
|| | |         microarchitecture.  https://pharr.org/matt/papers/ispc_inpar_2012.pdf ).  It would be unfortunate for such an SIMT program to
|| | lose  work completed after the first fault.
||
|| | |         MORAL:  fault-on-first may be suitable for vector load that might speculate past the end of the vector -  where the length is
|| | not known or inconvenient when the vector load
|| | |         instruction is started. Fault-on-first is  suboptimal for running SIMT on top of vectors.   i.e. fault-on-first  is the
|| | equivalent of precise exceptions for in order
|| | |         execution,  and for a single thread executing vector instructions, whereas  completion mask  allows out of order within a vector
|| | and/or vector length  threading.
||
|| | |         IMHO an important realization I made in that meeting is that fault-on-first does not need to be just about faulting. It is
|| | totally fine to have the fault-on-first stuff
|| | |         return up to the first really long latency cache miss, as long as it always guarantees that at least vector element zero was
|| | complete. Because vector element zero complete
|| | |         is what guarantees forward progress.
||
|| | |         Furthermore, it is not even required that fault-on-first stop at the first page-fault. An implementation could actually choose to
|| | actually implement a page-fault that did
|| | |         copy-on-write or  swapped in from disk.   but that would be visible to the operating system, not the user program.  However, such
|| | an OS implementation  would have to
|| | |         guarantee that it would not kill a process as a result  of a true permissions error page-fault. Or, if the virtual memory
|| | architecture made the distinction between
|| | |         permissions faults and the sorts of page-fault that is for disk swapping or copy-on-write or copy  on read,  the OS does not need
|| | to be involved.
||
|| | |          EVERYTHING about fault-on-first is a microarchitecture security/information leak channel and/or a virtualization hole. (Unless
|| | you trim only on true faults and not COW
|| | |         or COR or disk swappage-faults).   However,  fault-on-first on any page-fault is a much  lower bandwidth  information leak
|| | channel  than is fault-on-first on long latency
|| | |         cache misses.  so a general purpose system might choose to implement fault-on-first on any page-fault, but might not want to
|| | implement fault-on-first on any cache miss.
|| | |         However, there are some systems for which that sort of security issue is not a concern. E.g. a data center or embedded system
|| | where all of the CPUs are dedicated to a single
|| | |         problem. In which case, if they can gain performance by doing fault-on-first on particular long latency cache misses, power to
|| | them!
||
|| | |         Interestingly, although fault-on-first on long latency cache misses is a high-bandwidth information leak, it is actually  much
|| | less of a virtualization hole than
|| | |         fault-on-first for page-faults.   The operating system or hypervisor has very little control over cache misses.  the OS and
|| | hypervisor have almost full control over
|| | |         page-faults.  The usual rule in security and virtualization is that an application should not be able to detect that it has had
|| | an "innocent"  page-fault, such as COW or COR
|| | |         or disk swapping.
||
|| | |         --
|| | |         --- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis
|| |     |
||
|| | --
|| | --- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis


|


Re: Sparse Matrix-Vector Multiply (again) and Bit-Vector Compression

lidawei14@...
 

Hi,

Perhaps instead of using bit vector to encode an entire matrix, we can encode a sub block.
There is a common sparse matrix format called BCSR that blocks the non-zero values of CSR, so that we can reduce col_ind[] storage and reuse vector x.
The main disadvantage of BCSR is we have to pad zeros, where we can actually use a bit mask to encode the nonzeros of a sub block, as in Nagendra's bit vector implementation, so that the overhead can be avoided.
I could not find good reduction instructions for tiled matrix vector multiplications if we have multiple rows in a block.

One sub block:
A =
a b
0 d 
Corresponding x:
x =
e
f
Bit vector:
1 1 0 1
Computation:
a b 0 d 
e f e f
 
fmul = ae bf 0e df 
accumulate (reduction) ae+bf,0e+df
(Note we can skip that zero computation using bit mask).

Thanks,
Dawei
