Seeking inputs for evaluating vector ABI design


Kito Cheng
 

Hi:

I am Kito from the RISC-V psABI group. We have defined a basic vector
ABI, which allows functions to use vector registers internally; that
can already be used to optimize several libraries such as libc, e.g.
we can use vector instructions to accelerate memory and string
manipulation functions like strcmp or memcpy.

However, we are still missing a complete vector ABI, including a
vector calling convention and a vector library interface for the
RISC-V vector extensions. That is a high-priority job for the psABI
group this year. One of the major goals of this mail is to seek
potential benchmarks for evaluating the design of the vector ABI and
to make sure no item is missing from the plan, so any feedback is
appreciated!

# Vector Calling Convention (Highest priority)

The vector calling convention will include the following items:
- Define a vector calling convention variant that allows programs to
pass values of scalable vector types (e.g. vint32m1_t) in vector
registers.
- Define a vector calling convention variant that allows programs to
pass values of fixed-length vector types (e.g. int32x4_t) in vector
registers.
- Vector function signature/name mangling.

# Vector Libraries Interface

- An interface for math functions, e.g. a vector version of the sin
function: define the function name, the function signature, and the
behavior for tail and masked-off elements.

# Benchmarks:

We would like to collect any benchmarks that contain function calls
inside kernel functions, since we need to evaluate design choices in
the calling convention such as how many registers are used to pass
parameters and return values, and how registers are split between
callee-saved and caller-saved.

Currently we are considering the following benchmarks to evaluate the
design of the calling convention:

- TSVC
- PolyBenchC

NOTE: We don't have a complete compiler auto-vectorizer
implementation, especially the ability to vectorize calls to those
math functions, so we'll write the vectorized versions by hand for
evaluation.


Thanks!


Jan Wassenberg
 

Hi Kito,

> NOTE: We don't have a complete compiler auto-vectorizer
> implementation, especially the ability to vectorize calls to those
> math functions, so we'll write the vectorized versions by hand for
> evaluation.

Might this implementation of math functions be helpful? It already supports RVV via intrinsics.


Kito Cheng
 

Hi Jan:


>> NOTE: We don't have a complete compiler auto-vectorizer
>> implementation, especially the ability to vectorize calls to those
>> math functions, so we'll write the vectorized versions by hand for
>> evaluation.
>
> Might this implementation of math functions be helpful? It already supports RVV via intrinsics.

Thanks for your amazing work! I think it will be very useful; it saves
us the time of re-implementing those functions with RVV :)


Zalman Stern
 

A plea to not design the future around vague and ill-considered use cases...

A fundamental issue here is that a vector unit built for high performance computing will potentially have a massive amount of state and can have very long delays in fully completing inflight operations. (Multiple 100k cycles of delay on a context switch to quiesce the vector unit is not out of the question.) Thus, using this hardware for string operations comes with costs that may not be measured in microbenchmarks. Discussing it for strcmp strikes me as far more of a benchmark gaming opportunity than sound engineering, but in as much as there may be legitimate use cases for this, they almost certainly have to allow developer choice.

The C string library is generally used for legacy/convenience on small strings. People with real performance on the table use something else. Yes it still matters, but if we're looking at really using a vector unit for text handling, an interface that is not pointers and zero termination based is almost certainly required. The opportunity is more to design that API than to shoehorn vectors under libc.

Having a heterogeneous set of cores on an SoC is a given at this point. The small cores likely will not have a vector unit at all, but if one is going to push the vector extension into general purpose workloads, there will be pressure to have a small implementation on smaller cores and a high performance one on big cores. Fortunately the instruction set allows scaling the hardware implementation cleanly, but making all that seamless in software is tricky. A design that allows per thread constraints on which available vector units are acceptable is perhaps the thing to try for. Ideally this would be somewhat dynamic and setting and unsetting the constraints would be cheap. Note that many mainstream operating systems effectively ban this sort of hardware design and relegate big vector units to an accelerator role. The programming model for the accelerator is completely different than for the general purpose CPU. This has to change.

With variable length vectors, there are also going to inherently be costs in supporting the dynamic size. The stack frame layout will need to support variable length slots or plan for a large maximum size, etc. Allowing one to constrain the compilation to a specific size is potentially a big win for cases where the hardware is known (e.g. firmware) or when doing just in time compilation. Specialization, having one or more versions of a routine optimized for known hardware, is also very likely to be a win over support for fully dynamic size vectors in many cases. Allowing the calling convention to support fixed size layout when it is known is important. Providing a means to efficiently dispatch to specialized routines is a good idea as well. (E.g. a restricted dynamic linking mechanism that has zero runtime overhead.)

-Z-










ghost
 

I would like to emphasize Zalman Stern's point about trading off hardware
economy for dynamic software optimization, in the context of a larger
comment about optimizing compiled code for RISC-V. The specification of RVV
is designed very well to work well across a variety of hardware
implementations without requiring different code, but IMO one of the great
truths of system design is that "compilation beats interpretation," and in
this context, execution-time parameterization as defined for RVV is a form
of interpretation that, like many kinds of interpretation, trades space and
time overhead for convenience.

For it to be most effective, the representation *from* which run-time code
is generated must be sufficiently high-level: the higher the level, the
greater the opportunities to tailor the code to the hardware. Not having
experience with vector-amenable computation, I can't say anything more
specific, other than to note the historical tug of war between, on the one
hand, compilers that recognize vectorizable constructs in low-level
languages like C, and on the other, very high-level languages like APL or
Halide.

The relevance to the present discussion is that RTCG may require detailed
configuration discovery ABI/API that goes beyond the ABI for functional
code. I hope the work of the relevant group(s) will take this into
consideration.

--

L Peter Deutsch :: Aladdin Enterprises :: Healdsburg, CA & Burnaby, BC


Bruce Hoult
 

On Wed, Jul 27, 2022 at 4:21 AM Zalman Stern via lists.riscv.org <zalman=google.com@...> wrote:
> A plea to not design the future around vague and ill-considered use cases...

First of all, discussion of libc functions such as strcmp is irrelevant to this thread, as they do not have vector register arguments. They pass pointers to arguments in memory and use (and always will use) the standard ABI, not an augmented Vector ABI as Kito is proposing.

 
> A fundamental issue here is that a vector unit built for high performance computing will potentially have a massive amount of state and can have very long delays in fully completing inflight operations. (Multiple 100k cycles of delay on a context switch to quiesce the vector unit is not out of the question.) Thus, using this hardware for string operations comes with costs that may not be measured in microbenchmarks. Discussing it for strcmp strikes me as far more of a benchmark gaming opportunity than sound engineering, but in as much as there may be legitimate use cases for this, they almost certainly have to allow developer choice.

> The C string library is generally used for legacy/convenience on small strings. People with real performance on the table use something else. Yes it still matters, but if we're looking at really using a vector unit for text handling, an interface that is not pointers and zero termination based is almost certainly required. The opportunity is more to design that API than to shoehorn vectors under libc.

If you have a machine with the properties you describe, and running both some heavy HPC task and some trivial task that uses the vector unit for strcpy() on the same core results in a severe overall performance penalty, then you might indeed be advised not to do that. Run those lightweight spoiler tasks on different cores, or install a libc that doesn't use the vector unit.

For everyone else with desktop PCs or phones or cloud servers etc, the vector unit should be used as much as possible! ARM seem to be intending to vectorise every loop in every program. I don't know if or when they will achieve that, or whether RISC-V compilers will do the same, but in the meantime getting memcpy(), memset(), strlen(), strcpy(), strcmp() and all their friends to use the vector unit is low hanging fruit that can instantly make a measurable improvement to every program on the machine.

I ran some benchmarks of memcpy() and strcpy() on an Allwinner D1 machine (which has only 128 bit vector registers) 15 months ago (April 2021). Not only was in-cache performance often doubled, the "which version do I choose?" overhead for small sizes was reduced a lot.

https://hoult.org/d1_memcpy.txt
https://hoult.org/d1_strcpy.txt

That machine has some quirks. Of course it is implementing RVV draft 0.7.1, but functions such as these are binary-compatible between the draft and the ratified spec. It has only 128 bit vector registers, whereas it looks as if SiFive for example are intending 256 bit minimum. Most vector instructions on the D1 (C906 core) take 3*LMUL cycles regardless of whether the actual vector might use fewer than LMUL registers.
 


Kito Cheng
 

Hi Zalman:

Defining a standardized vector ABI means we can have a common
interface and agreement among different compilers and libraries; it
doesn't mean we must use it everywhere. We already have several ways
to do dynamic function version selection (e.g. ifunc), and it won't
change the existing software ecosystem (i.e. there is NO need to
recompile everything with the vector ABI); it's an extension of the
standardized ABI.

I can imagine that library functions optimized with RVV might not show
a performance gain on every HW platform, but I think that would be a
software optimization issue rather than an ABI issue :)



Kito Cheng
 

Hi Peter:

> The relevance to the present discussion is that RTCG may require detailed
> configuration discovery ABI/API that goes beyond the ABI for functional
> code. I hope the work of the relevant group(s) will take this into
> consideration.
The discovery ABI/API is being discussed elsewhere and might not be
part of the psABI; it is more like Linux-specific material. SiFive
folks and Rivos folks have had some discussion about the design of the
configuration discovery API, which might include a set of new system
calls to give user-space programs a mechanism to get more detailed
information. That will be made public soon (at least before Linux
Plumbers 2022, I guess).
