Re: Seeking inputs for evaluating vector ABI design
On Wed, Jul 27, 2022 at 4:21 AM Zalman Stern via lists.riscv.org <zalman=google.com@...> wrote:
First of all, discussion of libc functions such as strcmp is irrelevant to this thread, as they do not have vector register arguments. They pass pointers to arguments in memory and use (and always will use) the standard ABI, not an augmented Vector ABI as Kito is proposing.
If you have a machine with the properties you describe, and running both a heavy HPC task and some trivial task that uses the vector unit for strcpy() on the same core results in a severe overall performance penalty, then you might indeed be advised not to do that. Run those lightweight spoiler tasks on different cores, or install a libc that doesn't use the vector unit.
For everyone else with desktop PCs or phones or cloud servers etc, the vector unit should be used as much as possible! ARM seem to be intending to vectorise every loop in every program. I don't know if or when they will achieve that, or whether RISC-V compilers will do the same, but in the meantime getting memcpy(), memset(), strlen(), strcpy(), strcmp() and all their friends to use the vector unit is low-hanging fruit that can instantly make a measurable improvement to every program on the machine.
I ran some benchmarks of memcpy() and strcpy() on an Allwinner D1 machine (which has only 128-bit vector registers) 15 months ago (April 2021). Not only was in-cache performance often doubled, but the "which version do I choose?" overhead for small sizes was also reduced a lot.
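To illustrate why the "which version do I choose?" overhead shrinks, here is a rough portable C model of a strip-mined vector copy. It is only a sketch, not the code I benchmarked: model_vsetvl_e8() and model_vector_memcpy() are made-up names, the inner memcpy() merely stands in for a vle8.v/vse8.v pair, and VLEN=128 with LMUL=8 is an assumption chosen to match the D1. The point is that one loop handles every size, because the granted vector length simply shrinks for short or trailing chunks.

```c
#include <stddef.h>
#include <string.h>

/* Assumed parameters: 128-bit vector registers (as on the D1), LMUL=8. */
enum { VLEN_BITS = 128, LMUL = 8, VLMAX_BYTES = VLEN_BITS / 8 * LMUL };

/* Hypothetical stand-in for vsetvli with 8-bit elements: grants
 * min(requested, VLMAX). Real hardware is allowed to grant other
 * values when the request lies between VLMAX and 2*VLMAX. */
static size_t model_vsetvl_e8(size_t avl) {
    return avl < VLMAX_BYTES ? avl : VLMAX_BYTES;
}

/* Strip-mined copy: the same loop serves 1 byte or 1 MB, with no
 * separate small-size, alignment, or tail-handling paths. */
void *model_vector_memcpy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n > 0) {
        size_t vl = model_vsetvl_e8(n); /* elements handled this pass */
        memcpy(d, s, vl);               /* stands in for vle8.v + vse8.v */
        d += vl;
        s += vl;
        n -= vl;
    }
    return dst;
}
```

With these assumed parameters a 300-byte copy takes three passes of 128, 128 and 44 bytes, and a 5-byte copy takes a single pass with vl=5.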
That machine has some quirks. Of course it implements RVV draft 0.7.1 rather than the ratified 1.0, but functions such as these are binary-compatible between the two versions. It has only 128-bit vector registers, whereas it looks as if SiFive for example are intending 256-bit minimum. Most vector instructions on the D1 (C906 core) take 3*LMUL cycles regardless of whether the actual vector might use fewer than LMUL registers.
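To make the 3*LMUL quirk concrete, here is a toy cost model in C. It is a sketch under loud assumptions, not a measurement: it counts only one vector load and one vector store per loop pass, assumes both follow the 3*LMUL rule, and ignores vsetvli, scalar loop overhead and the memory system. The function name model_copy_cycles() is made up for illustration.

```c
#include <stddef.h>

enum { VLEN_BYTES = 16 }; /* 128-bit vector registers, as on the D1 */

/* Toy estimate of vector-instruction cycles to copy n bytes.
 * Assumption: each pass issues one load and one store, and each
 * costs 3*LMUL cycles however few registers the data really fills. */
unsigned long model_copy_cycles(size_t n, unsigned lmul) {
    size_t bytes_per_pass = (size_t)VLEN_BYTES * lmul;
    size_t passes = (n + bytes_per_pass - 1) / bytes_per_pass; /* ceil */
    return (unsigned long)passes * 2u * 3u * lmul;
}
```

Under this model a 16-byte copy costs 6 cycles at LMUL=1 but 48 at LMUL=8, even though the data fits in a single register, while a 128-byte copy costs 48 cycles either way. So on this core a large LMUL penalises short operations without speeding up long ones, at least as far as the vector instructions themselves go.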