Re: Seeking inputs for evaluating vector ABI design

Kito Cheng

Hi Zalman:

Defining a standardized vector ABI means we can have a common interface
and agreement among different compilers and libraries. It doesn't mean we
must use it everywhere: we already have several ways to do dynamic
function version selection (e.g. ifunc), and it won't change the
existing software ecosystem (meaning NO need to recompile everything
with the vector ABI). It's an extension of the standardized ABI.
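
To illustrate the kind of dynamic function version selection mentioned above, here is a minimal sketch in portable C. It is not ifunc itself (ifunc resolves at load time via the dynamic linker); it is a function-pointer analogue that resolves on first call. The names cpu_has_vector, my_strlen and resolve_strlen are hypothetical, and the "vector" variant is only a placeholder that falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hardware probe. A real one might read hwcap bits;
 * here it always reports "no vector unit". */
static int cpu_has_vector(void) { return 0; }

/* Baseline implementation using only the standard ABI. */
static size_t my_strlen_scalar(const char *s) {
    size_t n = 0;
    while (s[n] != '\0') n++;
    return n;
}

/* Placeholder for an RVV-optimized version (assumption: a real build
 * would put vector code here). Same signature, same calling convention. */
static size_t my_strlen_vector(const char *s) {
    return my_strlen_scalar(s);
}

typedef size_t (*strlen_fn)(const char *);

/* Picks an implementation once, based on what the hardware offers. */
static strlen_fn resolve_strlen(void) {
    return cpu_has_vector() ? my_strlen_vector : my_strlen_scalar;
}

static strlen_fn strlen_impl; /* NULL until the first call */

/* Public entry point: callers never need to know which variant runs.
 * Note: this lazy initialization is not thread-safe as written. */
size_t my_strlen(const char *s) {
    if (strlen_impl == NULL) strlen_impl = resolve_strlen();
    assert(strlen_impl != NULL);
    return strlen_impl(s);
}
```

The point of the sketch is that callers link against one symbol and existing binaries keep working; only the selected body differs per machine.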

I could imagine that library functions optimized with RVV might not
yield a performance gain on every HW platform, but that would be a software
optimization issue rather than an ABI issue, I think :)

On Wed, Jul 27, 2022 at 8:13 AM Bruce Hoult <bruce@...> wrote:

On Wed, Jul 27, 2022 at 4:21 AM Zalman Stern via <> wrote:

A plea to not design the future around vague and ill-considered use cases...

First of all, discussion of libc functions such as strcmp is irrelevant to this thread, as they do not have vector register arguments. They pass pointers to arguments in memory and use (and always will use) the standard ABI, not an augmented Vector ABI as Kito is proposing.

A fundamental issue here is that a vector unit built for high performance computing will potentially have a massive amount of state and can have very long delays in fully completing in-flight operations. (Multiple 100k cycles of delay on a context switch to quiesce the vector unit is not out of the question.) Thus, using this hardware for string operations comes with costs that may not be measured in microbenchmarks. Discussing it for strcmp strikes me as far more of a benchmark gaming opportunity than sound engineering, but inasmuch as there may be legitimate use cases for this, they almost certainly have to allow developer choice.

The C string library is generally used for legacy/convenience on small strings. People with real performance on the table use something else. Yes it still matters, but if we're looking at really using a vector unit for text handling, an interface that is not pointers and zero termination based is almost certainly required. The opportunity is more to design that API than to shoehorn vectors under libc.

If you have a machine with the properties you describe, and running both a heavy HPC task and some trivial task that uses the vector unit for strcpy() on the same core results in a severe overall performance penalty, then you might indeed be advised not to do that. Run those lightweight spoiler tasks on different cores, or install a libc that doesn't use the vector unit.

For everyone else with desktop PCs or phones or cloud servers etc, the vector unit should be used as much as possible! ARM seem to be intending to vectorise every loop in every program. I don't know if or when they will achieve that, or whether RISC-V compilers will do the same, but in the meantime getting memcpy(), memset(), strlen(), strcpy(), strcmp() and all their friends to use the vector unit is low hanging fruit that can instantly make a measurable improvement to every program on the machine.

I ran some benchmarks of memcpy() and strcpy() on an Allwinner D1 machine (which has only 128 bit vector registers) 15 months ago (April 2021). Not only was in-cache performance often doubled, the "which version do I choose?" overhead for small sizes was reduced a lot.

That machine has some quirks. Of course it is implementing RVV draft 0.7.1, but functions such as these are binary-compatible between that draft and RVV 1.0. It has only 128 bit vector registers, whereas it looks as if SiFive for example are intending 256 bit minimum. Most vector instructions on the D1 (C906 core) take 3*LMUL cycles regardless of whether the actual vector might use fewer than LMUL registers.
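
As a rough illustration of what the 3*LMUL figure implies, here is a small cost model in C. It is only a sketch under stated assumptions: 128-bit (16-byte) registers, one vector instruction per strip of the buffer, and no memory latency or loop overhead. The function names are hypothetical and the numbers are not measurements.

```c
#include <assert.h>
#include <stddef.h>

enum {
    VLEN_BYTES = 16,     /* 128-bit vector registers, as on the D1 */
    CYCLES_PER_LMUL = 3  /* the 3*LMUL figure quoted above */
};

/* Maximum bytes one vector instruction can cover at a given LMUL. */
static size_t bytes_per_insn(unsigned lmul) {
    return (size_t)VLEN_BYTES * lmul;
}

/* Modelled cost of one vector instruction: depends only on LMUL,
 * not on how many elements are actually active. */
static unsigned insn_cycles(unsigned lmul) {
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return CYCLES_PER_LMUL * lmul;
}

/* Modelled cycles for one instruction stream (one instruction per
 * strip) to walk over n bytes. */
static unsigned long stream_cycles(size_t n, unsigned lmul) {
    size_t per = bytes_per_insn(lmul);
    size_t strips = (n + per - 1) / per; /* ceiling division */
    return (unsigned long)strips * insn_cycles(lmul);
}
```

Under this model a 16-byte operation costs 3 cycles per instruction at LMUL=1 but 24 at LMUL=8, while a 128-byte operation costs 24 either way, so a large LMUL only hurts on short inputs.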
