
[RFC] Drafting a formal v1.0 release for RVV C Intrinsic API

eop.chen@...
 

Hi all,
 
We (SiFive) are going to draft a formal v1.0 release for the RVV C
intrinsic API. Next week we will provide a roadmap, including time
reserved for comments on what is left on the table and needs to be cleared
before the release. All existing issues will be settled: those that have
converged will be closed, and open ones will be tagged as "resolve for v1.0"
or "resolve after v1.0" so that we can bring them up for discussion in future
meeting(s).
 
Here are some initial thoughts on items before the release:
 
- Release a generator script that produces the current C intrinsic API. Other
  languages that seek to implement the intrinsics will be able to leverage this.
- Release a PDF version, a better-formatted document for the RVV C intrinsic API.
  We hope to expand the RVV user base by providing a more polished document.
- Schedule timelines for requesting comments on current items. Maybe
  a monthly meeting? We hope to gather more input and reach consensus.

Our take on the release is to consider the completeness of the current intrinsic
APIs, do minimal fixes, and leave the current implementation "as-is" for
v1.0.
 
Looking forward to your input; we hope we can close this by the end of this year.

 
This post is also cc'ed to the vector TG, toolchain & runtime TG, graphic
TG and HPC TG.
Link to RFC issue under rvv-intrinsic-doc

Regards,

eop Chen


Re: Notice of Group Archival

Jeff Scheel <jeff@...>
 

Krste has requested that this group not be archived due to pending work on Zvfh and Zvfhmin extensions.  Based on this, we will wait until this work completes to proceed with archival.

Thanks!
-Jeff

--
Jeff Scheel (he/him/his)
Linux Foundation, RISC-V Technical Program Manager


Notice of Group Archival

Jeff Scheel <jeff@...>
 

Community members,

The Vector Extension Task Group community has completed its work and is slated to be deactivated and archived on August 15, 2022.  If you believe that this decision has been made in error and the action should be postponed, please send an email to help@... with an explanation.

Future discussions previously addressed by the group, should be addressed to:

Thanks,
-Jeff

--
Jeff Scheel (he/him/his)
Linux Foundation, RISC-V Technical Program Manager


Re: Seeking inputs for evaluating vector ABI design

Kito Cheng
 

Hi Peter:

> The relevance to the present discussion is that RTCG may require detailed
> configuration discovery ABI/API that goes beyond the ABI for functional
> code. I hope the work of the relevant group(s) will take this into
> consideration.

The discovery ABI/API is being discussed elsewhere and might not be
part of the psABI; it is more of a Linux-specific matter. SiFive
folks and Rivos folks have been discussing the design of a
configuration discovery API, which might include a set of new system calls
that give user-space programs a mechanism to get more detailed
information. That should become public soon (at least
before Linux Plumbers 2022, I guess).

On Wed, Jul 27, 2022 at 12:49 AM L Peter Deutsch <ghost@...> wrote:

I would like to emphasize Zalman Stern's point about trading off hardware
economy for dynamic software optimization, in the context of a larger
comment about optimizing compiled code for RISC-V. The specification of RVV
is designed very well to work well across a variety of hardware
implementations without requiring different code, but IMO one of the great
truths of system design is that "compilation beats interpretation," and in
this context, execution-time parameterization as defined for RVV is a form
of interpretation that, like many kinds of interpretation, trades space and
time overhead for convenience.

For it to be most effective, the representation *from* which run-time code
is generated must be sufficiently high-level: the higher the level, the
greater the opportunities to tailor the code to the hardware. Not having
experience with vector-amenable computation, I can't say anything more
specific, other than to note the historical tug of war between, on the one
hand, compilers that recognize vectorizable constructs in low-level
languages like C, and on the other, very high-level languages like APL or
Halide.

The relevance to the present discussion is that RTCG may require detailed
configuration discovery ABI/API that goes beyond the ABI for functional
code. I hope the work of the relevant group(s) will take this into
consideration.

--

L Peter Deutsch :: Aladdin Enterprises :: Healdsburg, CA & Burnaby, BC


Re: Seeking inputs for evaluating vector ABI design

Kito Cheng
 

Hi Zalman:

Defining a standardized vector ABI means we can have a common interface
and agreement among different compilers and libraries. It doesn't mean we
must use it everywhere: we have several ways to do
dynamic function version selection (e.g. ifunc), and it won't change the
existing software ecosystem (meaning there is NO need to recompile everything
with the vector ABI). It's an extension of the standardized ABI.

I can imagine that library functions optimized with RVV might not
give a performance gain on every HW platform, but I think that would be a
software optimization issue rather than an ABI issue :)
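To illustrate the kind of dynamic function version selection mentioned above, here is a minimal Python sketch of a resolver that runs once and binds a name to one of several implementations, loosely in the spirit of ifunc. Everything in it is a stand-in for illustration: `strlen_chunked` only models a vector version by scanning a fixed number of bytes per step, and `hw_caps` is a hypothetical capability set, not a real discovery API.

```python
from typing import Callable, FrozenSet


def strlen_scalar(buf: bytes) -> int:
    """Byte-at-a-time reference version (works on any core)."""
    n = 0
    while buf[n] != 0:
        n += 1
    return n


def strlen_chunked(buf: bytes, chunk_bytes: int = 16) -> int:
    """Stand-in for a vector version: scans chunk_bytes bytes per step."""
    offset = 0
    while offset < len(buf):
        pos = buf.find(0, offset, offset + chunk_bytes)
        if pos >= 0:
            return pos
        offset += chunk_bytes
    raise ValueError("no terminating zero byte")


def resolve_strlen(hw_caps: FrozenSet[str]) -> Callable[[bytes], int]:
    """Resolver: runs once and picks an implementation (hypothetical caps)."""
    return strlen_chunked if "v" in hw_caps else strlen_scalar


# Bound once, e.g. at load time. Callers keep using the same name and the
# same pointer-based signature regardless of which version was selected.
strlen = resolve_strlen(frozenset({"i", "m", "a", "c", "v"}))
```

Because both versions take the same pointer-style argument, callers do not need to be recompiled when the selected implementation changes.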

On Wed, Jul 27, 2022 at 8:13 AM Bruce Hoult <bruce@...> wrote:

On Wed, Jul 27, 2022 at 4:21 AM Zalman Stern via lists.riscv.org <zalman=google.com@...> wrote:

A plea to not design the future around vague and ill-considered use cases...

First of all, discussion of libc functions such as strcmp is irrelevant to this thread, as they do not have vector register arguments. They pass pointers to arguments in memory and use (and always will use) the standard ABI, not an augmented Vector ABI as Kito is proposing.



A fundamental issue here is that a vector unit built for high performance computing will potentially have a massive amount of state and can have very long delays in fully completing inflight operations. (Multiple 100k cycles of delay on a context switch to quiesce the vector unit is not out of the question.) Thus, using this hardware for string operations comes with costs that may not be measured in microbenchmarks. Discussing it for strcmp strikes me as far more of a benchmark gaming opportunity than sound engineering, but in as much as there may be legitimate use cases for this, they almost certainly have to allow developer choice.

The C string library is generally used for legacy/convenience on small strings. People with real performance on the table use something else. Yes it still matters, but if we're looking at really using a vector unit for text handling, an interface that is not pointers and zero termination based is almost certainly required. The opportunity is more to design that API than to shoehorn vectors under libc.

If you have a machine with the properties you describe, and having a machine run both some heavy HPC task and some trivial task that uses the vector unit for strcpy() on the same core results in a severe overall performance penalty then you might indeed be advised not to do that. Run those lightweight spoiler tasks on different cores, or install a libc that doesn't use the vector unit.

For everyone else with desktop PCs or phones or cloud servers etc, the vector unit should be used as much as possible! ARM seem to be intending to vectorise every loop in every program. I don't know if or when they will achieve that, or whether RISC-V compilers will do the same, but in the meantime getting memcpy(), memset(), strlen(), strcpy(), strcmp() and all their friends to use the vector unit is low hanging fruit that can instantly make a measurable improvement to every program on the machine.

I ran some benchmarks of memcpy() and strcpy() on an Allwinner D1 machine (which has only 128 bit vector registers) 15 months ago (April 2021). Not only was in-cache performance often doubled, the "which version do I choose?" overhead for small sizes was reduced a lot.

https://hoult.org/d1_memcpy.txt
https://hoult.org/d1_strcpy.txt

That machine has some quirks. Of course it is implementing RVV draft 0.7.1, but functions such as these are binary-compatible between that draft and RVV 1.0. It has only 128 bit vector registers, whereas it looks as if SiFive for example are intending 256 bit minimum. Most vector instructions on the D1 (C906 core) take 3*LMUL cycles regardless of whether the actual vector might use fewer than LMUL registers.


Re: RISCV Vector Compliance Test Suite

Allen Baum
 

If you're in a hurry, Imperas has developed a set of vector tests also, and they're likely very comprehensive.
I don't know which configurations are supported though.

On Mon, Jul 25, 2022 at 5:54 AM Kito Cheng <kito.cheng@...> wrote:
FYI: https://github.com/riscv-software-src/riscv-tests/pull/400

On Mon, Jul 25, 2022 at 8:49 PM Alexander Podoplelov
<alexander.podoplelov@...> wrote:
>
> Also, could you please inform me about RISC-V Vector compliance tests v1.0?
>
> On 25.07.2022 13:52, Umer Shahid wrote:
>
> Great, thanks for letting me know.
>
> Regards,
> Umer
>
> On Mon, Jul 25, 2022 at 1:51 PM Krste Asanovic <krste@...> wrote:
>>
>> Xi Wang has developed vector compliance tests at RIOS lab,
>> Krste
>>
>> On Jul 25, 2022, at 12:10 AM, Umer Shahid <umer.shahid@...> wrote:
>>
>> Hello all,
>> I hope you are fine, safe, and healthy. I want to know if there is any test suite or platform which can be used to run RISC-V Vector compliance tests? We, in our team, have started to work on RVV version 1.0 compliance testing but we are unable to find any suitable test suite to generate or run our tests on it. If any team is working on it or anybody knows someone who has worked in this domain then please connect this thread to that person.
>>
>> Regards,
>> Umer
>>
>>
>
>
> --
> Umer Shahid
> Member Technical Staff
> 10xEngineers
> Mobile: +92-334-4072836
> Email: umer.shahid@..., umershahid@...,pk
>
>






Re: Seeking inputs for evaluating vector ABI design

Bruce Hoult
 

On Wed, Jul 27, 2022 at 4:21 AM Zalman Stern via lists.riscv.org <zalman=google.com@...> wrote:
> A plea to not design the future around vague and ill-considered use cases...

First of all, discussion of libc functions such as strcmp is irrelevant to this thread, as they do not have vector register arguments. They pass pointers to arguments in memory and use (and always will use) the standard ABI, not an augmented Vector ABI as Kito is proposing.

 
> A fundamental issue here is that a vector unit built for high performance computing will potentially have a massive amount of state and can have very long delays in fully completing inflight operations. (Multiple 100k cycles of delay on a context switch to quiesce the vector unit is not out of the question.) Thus, using this hardware for string operations comes with costs that may not be measured in microbenchmarks. Discussing it for strcmp strikes me as far more of a benchmark gaming opportunity than sound engineering, but in as much as there may be legitimate use cases for this, they almost certainly have to allow developer choice.
>
> The C string library is generally used for legacy/convenience on small strings. People with real performance on the table use something else. Yes it still matters, but if we're looking at really using a vector unit for text handling, an interface that is not pointers and zero termination based is almost certainly required. The opportunity is more to design that API than to shoehorn vectors under libc.

If you have a machine with the properties you describe, and having a machine run both some heavy HPC task and some trivial task that uses the vector unit for strcpy() on the same core results in a severe overall performance penalty then you might indeed be advised not to do that. Run those lightweight spoiler tasks on different cores, or install a libc that doesn't use the vector unit.

For everyone else with desktop PCs or phones or cloud servers etc, the vector unit should be used as much as possible! ARM seem to be intending to vectorise every loop in every program. I don't know if or when they will achieve that, or whether RISC-V compilers will do the same, but in the meantime getting memcpy(), memset(), strlen(), strcpy(), strcmp() and all their friends to use the vector unit is low hanging fruit that can instantly make a measurable improvement to every program on the machine.

I ran some benchmarks of memcpy() and strcpy() on an Allwinner D1 machine (which has only 128 bit vector registers) 15 months ago (April 2021). Not only was in-cache performance often doubled, the "which version do I choose?" overhead for small sizes was reduced a lot.

https://hoult.org/d1_memcpy.txt
https://hoult.org/d1_strcpy.txt

That machine has some quirks. Of course it is implementing RVV draft 0.7.1, but functions such as these are binary-compatible between that draft and RVV 1.0. It has only 128 bit vector registers, whereas it looks as if SiFive for example are intending 256 bit minimum. Most vector instructions on the D1 (C906 core) take 3*LMUL cycles regardless of whether the actual vector might use fewer than LMUL registers.
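The "3*LMUL cycles regardless of the actual vector length" quirk can be made concrete with a small back-of-the-envelope model. This is only a sketch under stated assumptions: it counts one vector load and one vector store per strip-mined iteration, charges each 3*LMUL cycles as described above, and ignores memory latency, loop overhead and everything else. The function name and the cost model are illustrative, not measurements.

```python
def d1_copy_vector_cycles(n_bytes: int, lmul: int, vlen_bits: int = 128) -> int:
    """Estimate vector load/store cycles for copying n_bytes (toy model)."""
    if lmul not in (1, 2, 4, 8):
        raise ValueError("integer LMUL must be 1, 2, 4 or 8")
    bytes_per_iteration = (vlen_bits // 8) * lmul
    iterations = -(-n_bytes // bytes_per_iteration)  # ceiling division
    instructions_per_iteration = 2  # assumption: one load plus one store
    cycles_per_instruction = 3 * lmul  # the quoted D1 behaviour
    return iterations * instructions_per_iteration * cycles_per_instruction


# A 16-byte copy fits in one register, yet LMUL=8 still pays for eight.
small_lmul1 = d1_copy_vector_cycles(16, lmul=1)
small_lmul8 = d1_copy_vector_cycles(16, lmul=8)
# For a large copy that is a multiple of the group size, the totals match.
large_lmul1 = d1_copy_vector_cycles(1024, lmul=1)
large_lmul8 = d1_copy_vector_cycles(1024, lmul=8)
```

Under this model a large LMUL costs nothing extra on long copies but is wasteful on short ones, which is one reason small sizes may deserve a separate path on such a core.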
 


Re: Seeking inputs for evaluating vector ABI design

ghost
 

I would like to emphasize Zalman Stern's point about trading off hardware
economy for dynamic software optimization, in the context of a larger
comment about optimizing compiled code for RISC-V. The specification of RVV
is designed very well to work well across a variety of hardware
implementations without requiring different code, but IMO one of the great
truths of system design is that "compilation beats interpretation," and in
this context, execution-time parameterization as defined for RVV is a form
of interpretation that, like many kinds of interpretation, trades space and
time overhead for convenience.

For run-time code generation (RTCG) to be most effective, the representation
*from* which run-time code is generated must be sufficiently high-level: the
higher the level, the greater the opportunities to tailor the code to the
hardware. Not having experience with vector-amenable computation, I can't say
anything more specific, other than to note the historical tug of war between,
on the one hand, compilers that recognize vectorizable constructs in low-level
languages like C, and on the other, very high-level languages like APL or
Halide.

The relevance to the present discussion is that RTCG may require detailed
configuration discovery ABI/API that goes beyond the ABI for functional
code. I hope the work of the relevant group(s) will take this into
consideration.

--

L Peter Deutsch :: Aladdin Enterprises :: Healdsburg, CA & Burnaby, BC


Re: Seeking inputs for evaluating vector ABI design

Zalman Stern
 

A plea to not design the future around vague and ill-considered use cases...

A fundamental issue here is that a vector unit built for high performance computing will potentially have a massive amount of state and can have very long delays in fully completing inflight operations. (Multiple 100k cycles of delay on a context switch to quiesce the vector unit is not out of the question.) Thus, using this hardware for string operations comes with costs that may not be measured in microbenchmarks. Discussing it for strcmp strikes me as far more of a benchmark gaming opportunity than sound engineering, but in as much as there may be legitimate use cases for this, they almost certainly have to allow developer choice.

The C string library is generally used for legacy/convenience on small strings. People with real performance on the table use something else. Yes it still matters, but if we're looking at really using a vector unit for text handling, an interface that is not pointers and zero termination based is almost certainly required. The opportunity is more to design that API than to shoehorn vectors under libc.

Having a heterogeneous set of cores on an SoC is a given at this point. The small cores likely will not have a vector unit at all, but if one is going to push the vector extension into general purpose workloads, there will be pressure to have a small implementation on smaller cores and a high performance one on big cores. Fortunately the instruction set allows scaling the hardware implementation cleanly, but making all that seamless in software is tricky. A design that allows per thread constraints on which available vector units are acceptable is perhaps the thing to try for. Ideally this would be somewhat dynamic and setting and unsetting the constraints would be cheap. Note that many mainstream operating systems effectively ban this sort of hardware design and relegate big vector units to an accelerator role. The programming model for the accelerator is completely different than for the general purpose CPU. This has to change.

With variable length vectors, there are also going to inherently be costs in supporting the dynamic size. The stack frame layout will need to support variable length slots or plan for a large maximum size, etc. Allowing one to constrain the compilation to a specific size is potentially a big win for cases where the hardware is known (e.g. firmware) or when doing just in time compilation. Specialization, having one or more versions of a routine optimized for known hardware, is also very likely to be a win over support for fully dynamic size vectors in many cases. Allowing the calling convention to support fixed size layout when it is known is important. Providing a means to efficiently dispatch to specialized routines is a good idea as well. (E.g. a restricted dynamic linking mechanism that has zero runtime overhead.)

-Z-
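The stack-frame trade-off described above (variable-length slots versus planning for a large maximum) can be put into numbers with a short sketch. It assumes only that a vector register holds VLEN/8 bytes and that RVV 1.0 caps VLEN at 65536 bits; the function name and the "spill area" framing are illustrative, not part of any ABI proposal.

```python
MAX_VLEN_BITS = 65536  # architectural upper bound on VLEN in RVV 1.0


def spill_area_bytes(vlen_bits: int, n_vregs: int, lmul: int = 1) -> int:
    """Bytes needed to spill n_vregs register groups of size LMUL."""
    is_power_of_two = vlen_bits > 0 and (vlen_bits & (vlen_bits - 1)) == 0
    if not is_power_of_two or not 8 <= vlen_bits <= MAX_VLEN_BITS:
        raise ValueError("VLEN must be a power of two between 8 and 65536")
    return n_vregs * lmul * (vlen_bits // 8)


# Known hardware (e.g. firmware or JIT): slots have a fixed, small size.
fixed_128 = spill_area_bytes(128, n_vregs=4)
# Length-agnostic code that avoids dynamic slots must plan for the maximum.
worst_case = spill_area_bytes(MAX_VLEN_BITS, n_vregs=4)
```

The gap between the two numbers is the cost that either dynamic frame layout or specialization for a known VLEN is meant to avoid.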




On Tue, Jul 26, 2022 at 7:11 AM Kito Cheng <kito.cheng@...> wrote:
Hi Jan:


>> NOTE: We don't have a complete compiler auto vectorizer
>> implementation, especially the ability for those math functions, so
>> we'll rewrite the vectorized version by hand for evaluation.
>
> Might this implementation of math functions be helpful? It already supports RVV via intrinsics.

Thanks for your amazing work! I think that it is very useful, it saves
us time to re-implement those functions with RVV :)






Re: Seeking inputs for evaluating vector ABI design

Kito Cheng
 

Hi Jan:


>> NOTE: We don't have a complete compiler auto vectorizer
>> implementation, especially the ability for those math functions, so
>> we'll rewrite the vectorized version by hand for evaluation.
>
> Might this implementation of math functions be helpful? It already supports RVV via intrinsics.

Thanks for your amazing work! I think it is very useful; it saves
us the time of re-implementing those functions with RVV :)


Re: Seeking inputs for evaluating vector ABI design

Jan Wassenberg
 

Hi Kito,

> NOTE: We don't have a complete compiler auto vectorizer
> implementation, especially the ability for those math functions, so
> we'll rewrite the vectorized version by hand for evaluation.

Might this implementation of math functions be helpful? It already supports RVV via intrinsics.


Seeking inputs for evaluating vector ABI design

Kito Cheng
 

Hi:

I am Kito from the RISC-V psABI group. We've defined a basic vector
ABI, which allows functions to use vector registers internally; that
can be used to optimize several libraries like libc, e.g. we can
use vector instructions to accelerate memory and string
manipulation functions like strcmp or memcpy.

However, we are still missing a complete vector ABI, which includes a vector
calling convention and a vector library interface for the RISC-V vector
extension. That's a high-priority job for the psABI group this year. One of
the major goals of this mail is to seek potential benchmarks for evaluating
the design of the vector ABI and to make sure no item is missing from the
plan, so any feedback is appreciated!

# Vector Calling Convention (Highest priority)

The vector calling convention will include the following items:
- Define a vector calling convention variant that allows programs to pass
values of scalable vector types (e.g. vint32m1_t) in vector
registers.
- Define a vector calling convention variant that allows programs to pass
values of fixed-length vector types (e.g. int32x4_t) in vector registers.
- Vector function signature/mangling

# Vector Libraries Interface

- Interface for math functions, e.g. a vector version of the sin function:
define the function name, the function signature, and the behavior for tail
and masked-off elements.
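To make "the behavior for tail and masked-off elements" concrete, the sketch below models one possible policy in plain Python: elements at index vl and beyond (the tail) and elements whose mask bit is clear keep the previous destination value, i.e. an "undisturbed" policy. This is only an illustration of what an interface definition has to pin down; the name `vsin_undisturbed` is hypothetical and nothing here reflects a decision by the psABI group.

```python
import math
from typing import List, Sequence


def vsin_undisturbed(
    dest: Sequence[float],
    src: Sequence[float],
    mask: Sequence[bool],
    vl: int,
) -> List[float]:
    """Model of a masked vector sin with undisturbed tail and masked-off lanes."""
    if not 0 <= vl <= len(dest) or len(src) != len(dest) or len(mask) != len(dest):
        raise ValueError("inconsistent operand lengths or vl")
    out = list(dest)
    for i in range(vl):  # body elements only; the tail is never touched
        if mask[i]:  # masked-off body elements keep the old value
            out[i] = math.sin(src[i])
    return out


result = vsin_undisturbed(
    dest=[9.0, 9.0, 9.0, 9.0],
    src=[0.0, 1.0, 0.0, 2.0],
    mask=[True, False, True, True],
    vl=3,
)
```

An "agnostic" policy would instead leave the values of those lanes unspecified to the caller, which is exactly the kind of choice the library interface needs to state.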

# Benchmarks:

We would like to collect any benchmarks that contain function calls
inside kernel functions, since we need to evaluate the design of the
calling convention, such as how many registers are used to pass parameters and
return values, and the allocation of callee-saved and caller-saved
registers.

We are currently considering the following benchmarks to evaluate the
design of the calling convention:

- TSVC
- PolyBenchC

NOTE: We don't have a complete compiler auto-vectorizer
implementation, especially for those math functions, so
we'll rewrite the vectorized versions by hand for evaluation.


Thanks!


Re: RISCV Vector Compliance Test Suite

Kito Cheng
 

FYI: https://github.com/riscv-software-src/riscv-tests/pull/400

On Mon, Jul 25, 2022 at 8:49 PM Alexander Podoplelov
<alexander.podoplelov@...> wrote:

Also, could you please inform me about RISC-V Vector compliance tests v1.0?

On 25.07.2022 13:52, Umer Shahid wrote:

Great, thanks for letting me know.

Regards,
Umer

On Mon, Jul 25, 2022 at 1:51 PM Krste Asanovic <krste@...> wrote:

Xi Wang has developed vector compliance tests at RIOS lab,
Krste

On Jul 25, 2022, at 12:10 AM, Umer Shahid <umer.shahid@...> wrote:

Hello all,
I hope you are fine, safe, and healthy. I want to know if there is any test suite or platform which can be used to run RISC-V Vector compliance tests? We, in our team, have started to work on RVV version 1.0 compliance testing but we are unable to find any suitable test suite to generate or run our tests on it. If any team is working on it or anybody knows someone who has worked in this domain then please connect this thread to that person.

Regards,
Umer


--
Umer Shahid
Member Technical Staff
10xEngineers
Mobile: +92-334-4072836
Email: umer.shahid@..., umershahid@...,pk


Re: RISCV Vector Compliance Test Suite

Alexander Podoplelov
 

Also, could you please inform me about RISC-V Vector compliance tests v1.0?

On 25.07.2022 13:52, Umer Shahid wrote:

Great, thanks for letting me know. 

Regards,
Umer

On Mon, Jul 25, 2022 at 1:51 PM Krste Asanovic <krste@...> wrote:
Xi Wang has developed vector compliance tests at RIOS lab,
Krste

On Jul 25, 2022, at 12:10 AM, Umer Shahid <umer.shahid@...> wrote:

Hello all,
I hope you are fine, safe, and healthy. I want to know if there is any test suite or platform which can be used to run RISC-V Vector compliance tests? We, in our team, have started to work on RVV version 1.0 compliance testing but we are unable to find any suitable test suite to generate or run our tests on it. If any team is working on it or anybody knows someone who has worked in this domain then please connect this thread to that person.

Regards,
Umer



--
Umer Shahid 
Member Technical Staff
10xEngineers
Mobile: +92-334-4072836


Re: RISCV Vector Compliance Test Suite

Umer Shahid
 

Great, thanks for letting me know. 

Regards,
Umer

On Mon, Jul 25, 2022 at 1:51 PM Krste Asanovic <krste@...> wrote:
Xi Wang has developed vector compliance tests at RIOS lab,
Krste

On Jul 25, 2022, at 12:10 AM, Umer Shahid <umer.shahid@...> wrote:

Hello all,
I hope you are fine, safe, and healthy. I want to know if there is any test suite or platform which can be used to run RISC-V Vector compliance tests? We, in our team, have started to work on RVV version 1.0 compliance testing but we are unable to find any suitable test suite to generate or run our tests on it. If any team is working on it or anybody knows someone who has worked in this domain then please connect this thread to that person.

Regards,
Umer



--
Umer Shahid 
Member Technical Staff
10xEngineers
Mobile: +92-334-4072836


Re: RISCV Vector Compliance Test Suite

Krste Asanovic
 

Xi Wang has developed vector compliance tests at RIOS lab,
Krste

On Jul 25, 2022, at 12:10 AM, Umer Shahid <umer.shahid@...> wrote:

Hello all,
I hope you are fine, safe, and healthy. I want to know if there is any test suite or platform which can be used to run RISC-V Vector compliance tests? We, in our team, have started to work on RVV version 1.0 compliance testing but we are unable to find any suitable test suite to generate or run our tests on it. If any team is working on it or anybody knows someone who has worked in this domain then please connect this thread to that person.

Regards,
Umer


RISCV Vector Compliance Test Suite

Umer Shahid
 

Hello all,
I hope you are fine, safe, and healthy. I want to know if there is any test suite or platform which can be used to run RISC-V Vector compliance tests? We, in our team, have started to work on RVV version 1.0 compliance testing but we are unable to find any suitable test suite to generate or run our tests on it. If any team is working on it or anybody knows someone who has worked in this domain then please connect this thread to that person.

Regards,
Umer


Re: Vector element groups

Krste Asanovic
 

On Fri, 15 Jul 2022 09:10:49 -0700, Earl Killian <earl.killian@...> said:
| While I share some concern about the cited language, as this is a concept, and not a spec, I think the time to require checking
| would be when individual specs implement the concept. I would think it would require some pretty good justification to not have
| an exception.

On further thought, I do think it makes sense to require raising an
illegal instruction exception when vl is not a multiple of the element
group size, rather than leaving that case reserved. I will be updating the doc
with the rationale.

Krste
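The rule discussed above (raise an illegal instruction exception when vl is not a multiple of the element group size) can be sketched as a simple check. This is a model for illustration only: it assumes the element group size in elements is EGW/SEW, and the names `IllegalInstruction` and `element_groups` are made up here rather than taken from the proposal.

```python
class IllegalInstruction(Exception):
    """Stand-in for the trap a hart would raise."""


def element_groups(vl: int, sew_bits: int, egw_bits: int) -> int:
    """Return the number of whole element groups covered by vl, or trap."""
    if egw_bits % sew_bits != 0:
        raise ValueError("EGW must be a multiple of SEW in this model")
    group_size = egw_bits // sew_bits  # elements per group
    if vl % group_size != 0:
        raise IllegalInstruction(
            f"vl={vl} is not a multiple of element group size {group_size}"
        )
    return vl // group_size


# With SEW=32 and a 128-bit group, each group has four elements.
groups = element_groups(vl=8, sew_bits=32, egw_bits=128)
```

Trapping, rather than leaving the case reserved, gives software a deterministic outcome when it sets vl incorrectly.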


Re: Vector element groups

Krste Asanovic
 

On Fri, 15 Jul 2022 09:10:49 -0700, Earl Killian <earl.killian@...> said:
| On another topic, I have this vague feeling that it would be best if we had VL and SEW always set for vector instructions, and
| not be implicit in the opcode, but I have not fleshed out this thought. Perhaps someone who has thought about it more would
| like to elucidate the issues?

We already have vector loads and stores with static EEW in the
instruction, which ignore dynamic SEW. Future 64-bit encodings would
also have static EEWs in the instruction. If static encoding space had been
available, we would not have had dynamic SEW at all.

The current EG proposal does require vl to be set.

Krste
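For readers less familiar with how a static EEW interacts with the dynamic settings: in RVV 1.0, vector loads and stores that encode EEW in the instruction use an effective LMUL of EMUL = (EEW / SEW) * LMUL, and the encoding is reserved if EMUL falls outside 1/8 to 8. The sketch below just evaluates that relation; the function name is illustrative.

```python
from fractions import Fraction


def load_store_emul(eew_bits: int, sew_bits: int, lmul: Fraction) -> Fraction:
    """EMUL for a load/store with static EEW under the current SEW and LMUL."""
    emul = Fraction(eew_bits, sew_bits) * lmul
    if not Fraction(1, 8) <= emul <= 8:
        raise ValueError(f"EMUL={emul} is out of range: encoding is reserved")
    return emul


# Byte load while SEW=32, LMUL=1: the data occupies a quarter of a register.
emul_narrow = load_store_emul(eew_bits=8, sew_bits=32, lmul=Fraction(1))
# 16-bit load while SEW=32, LMUL=2: exactly one register.
emul_one = load_store_emul(eew_bits=16, sew_bits=32, lmul=Fraction(2))
```

This keeps the element count per instruction tied to vl while the element width comes from the opcode.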


Re: Vector element groups

Nicolas Brunie
 

Hi Yann,
   I think Ken is referencing the optimization of splitting the sha256's state in two and merging rounds. It is for example described here : https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sha-extensions.html

Regards,
Nicolas 

On Tue, Jul 19, 2022 at 01:47, Yann Loisel <yann.loisel@...> wrote:
Hi Ken
I'm not sure I follow you on the 128-bit inputs and outputs for SHA256.
The spec speaks about 8 32-bit working variables, a, b, c, ..., g, h, used as the current state, so 256 bits, and hence a group of eight 32-bit values.
Could you please elaborate?
An algorithmic representation could be helpful for the overall discussion too.
Thanks
yann

On Fri, Jul 15, 2022 at 8:07 PM Ken Dockser <kad@...> wrote:
Thanks for putting this concept proposal together, Krste.

I have several initial comments and questions:
  1. I am all for the concept of element groups. As you point out, they are especially useful in cryptography where we need to operate on data sizes that are greater than the (current) largest SEW=64.
  2. SHA256 is used as an example in the document, but with an incorrect number of elements in a group. To be competitive, SHA256 instructions require 128-bit (not 256-bit) inputs and outputs. More specifically, the outputs need to be a group of 4 (not 8) 32-bit values and each of the three inputs need to be 4 32-bit values.
  3. If LMUL is allowed to be used for the concatenation of registers in narrow implementations (i.e., VLEN < EGW) to form the groups, what would be the meaning of LMUL>1 for wider implementations (i.e., when VLEN >= EGW)?
  4. How would vl be interpreted when VLMUL is used for the concatenation of registers to form groups?
  5. Would instructions that are based on element groups be required to support the use of LMUL to form those groups on narrower implementations? Or, could the instruction be defined to require a minimum VLEN?
Thanks,
Ken
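Questions 3 to 5 turn on how many registers must be concatenated to hold one element group when VLEN < EGW. A minimal sketch of that arithmetic follows, assuming an element group is EGS elements of SEW bits each and that integer LMUL (up to 8) is what forms the group; it ignores fractional LMUL and is a reading of the questions, not of the proposal text. The function names are illustrative.

```python
def group_width_bits(egs: int, sew_bits: int) -> int:
    """Element group width (EGW) for egs elements of sew_bits each."""
    return egs * sew_bits


def min_lmul_for_group(egw_bits: int, vlen_bits: int) -> int:
    """Smallest integer LMUL whose register group holds one element group."""
    lmul = max(1, -(-egw_bits // vlen_bits))  # ceiling division, at least 1
    if lmul > 8:
        raise ValueError("element group does not fit even at LMUL=8")
    return lmul


# SHA-256 state as eight 32-bit values versus four 32-bit values per operand.
egw_eight = group_width_bits(egs=8, sew_bits=32)
egw_four = group_width_bits(egs=4, sew_bits=32)
lmul_eight_on_128 = min_lmul_for_group(egw_eight, vlen_bits=128)
lmul_four_on_128 = min_lmul_for_group(egw_four, vlen_bits=128)
```

On a VLEN=128 implementation the 256-bit grouping would need LMUL=2 just to form one group, while the 128-bit grouping fits in a single register, which is where the question about the meaning of LMUL>1 on wider machines comes from.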



--

Yann Loisel
Principal Security Architect
