Re: Seeking inputs for evaluating vector ABI design

Zalman Stern

A plea to not design the future around vague and ill-considered use cases...

A fundamental issue here is that a vector unit built for high-performance computing will potentially have a massive amount of state and can take very long to fully complete in-flight operations. (Delays of hundreds of thousands of cycles on a context switch to quiesce the vector unit are not out of the question.) Thus, using this hardware for string operations carries costs that microbenchmarks will not measure. Discussing it for strcmp strikes me as far more of a benchmark-gaming opportunity than sound engineering, but in as much as there may be legitimate use cases for this, they almost certainly have to allow developer choice.

The C string library is generally used for legacy/convenience on small strings. Anyone with real performance on the table uses something else. Yes, it still matters, but if we're looking at really using a vector unit for text handling, an interface that is not based on pointers and zero termination is almost certainly required. The opportunity is more to design that API than to shoehorn vectors under libc.
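To make the point concrete, here is a minimal sketch of what such an interface could look like: a (pointer, length) "span" carries the extent up front, so a vectorized implementation can process full-width chunks without scanning for a NUL terminator. The names and layout are purely illustrative, not a proposed standard; the scalar memcmp body is a stand-in for a vector kernel.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative only: a length-carrying string view. Knowing len up
 * front lets a vector implementation load aligned chunks of known
 * size instead of speculating past a possible terminator. */
typedef struct {
    const char *data;
    size_t      len;
} str_span;

/* Three-way compare of two spans. A vector backend could replace
 * the memcmp with wide compares over min(a.len, b.len) bytes. */
int span_cmp(str_span a, str_span b)
{
    size_t n = a.len < b.len ? a.len : b.len;
    int c = memcmp(a.data, b.data, n);
    if (c != 0)
        return c;
    /* Equal prefix: the shorter span orders first. */
    return (a.len > b.len) - (a.len < b.len);
}
```

The same shape (explicit lengths, no hidden scans) is what most serious string libraries already converge on.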

Having a heterogeneous set of cores on an SoC is a given at this point. The small cores likely will not have a vector unit at all, but if one is going to push the vector extension into general-purpose workloads, there will be pressure to have a small implementation on the small cores and a high-performance one on the big cores. Fortunately the instruction set allows scaling the hardware implementation cleanly, but making all of that seamless in software is tricky. A design that allows per-thread constraints on which available vector units are acceptable is perhaps the thing to aim for. Ideally this would be somewhat dynamic, and setting and unsetting the constraints would be cheap. Note that many mainstream operating systems effectively ban this sort of hardware design and relegate big vector units to an accelerator role, where the programming model is completely different from that of the general-purpose CPU. This has to change.
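No such OS interface exists today, so the following is purely an illustrative sketch of the shape such a per-thread constraint could take: a thread-local mask naming which classes of vector implementation the thread will accept, with set/restore operations cheap enough (a TLS write, not a syscall) to bracket short regions. All names here are invented.

```c
#include <stdint.h>

/* Hypothetical: classes of vector hardware a thread may be
 * scheduled onto. A real design would be negotiated with the OS. */
enum vec_class {
    VEC_NONE  = 0,      /* scalar-only cores acceptable   */
    VEC_SMALL = 1 << 0, /* small in-order vector unit     */
    VEC_BIG   = 1 << 1, /* high-performance vector unit   */
};

/* Default: any vector unit is acceptable. The scheduler (in this
 * sketch) would consult this mask when migrating the thread. */
static _Thread_local uint32_t vec_allowed = VEC_SMALL | VEC_BIG;

/* Install a new constraint mask; returns the old one so callers
 * can restore it on the way out, keeping the operation cheap and
 * nestable. */
static inline uint32_t vec_constrain(uint32_t mask)
{
    uint32_t old = vec_allowed;
    vec_allowed = mask;
    return old;
}
```

A latency-sensitive region would call vec_constrain(VEC_SMALL) on entry and restore the returned mask on exit, avoiding a migration onto a big unit whose quiesce cost exceeds the work.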

With variable-length vectors, there are also inherent costs in supporting the dynamic size. The stack frame layout will need to support variable-length slots or plan for a large maximum size, etc. Allowing one to constrain the compilation to a specific size is potentially a big win for cases where the hardware is known (e.g. firmware) or when doing just-in-time compilation. Specialization, having one or more versions of a routine optimized for known hardware, is also very likely to be a win over fully dynamic-size vectors in many cases. Allowing the calling convention to support fixed-size layout when the size is known is important. Providing a means to efficiently dispatch to specialized routines is a good idea as well. (E.g. a restricted dynamic-linking mechanism that has zero runtime overhead.)
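One familiar way to get near-zero-overhead dispatch is the ifunc pattern: resolve a function pointer once, based on the detected vector length, so every later call goes straight to the specialized routine. The sketch below is illustrative; detect_vlen_bits() is a placeholder (on RISC-V it might read the vlenb CSR), and the "specialized" kernel is just a scalar loop standing in for a fixed-VLEN vector body.

```c
#include <stddef.h>

typedef float (*dot_fn)(const float *, const float *, size_t);

/* Fallback for unknown hardware: plain scalar loop. */
static float dot_generic(const float *a, const float *b, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Stand-in for a kernel compiled assuming 256-bit vectors; a real
 * one would use fixed-size vector code and a fixed-size ABI. */
static float dot_vlen256(const float *a, const float *b, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Placeholder hardware probe; assumed value for illustration. */
static size_t detect_vlen_bits(void) { return 256; }

static float dot_resolve(const float *, const float *, size_t);

/* First call lands on the resolver; it rewrites the pointer, so
 * subsequent calls pay only the indirect-call cost. */
static dot_fn dot = dot_resolve;

static float dot_resolve(const float *a, const float *b, size_t n)
{
    dot = (detect_vlen_bits() >= 256) ? dot_vlen256 : dot_generic;
    return dot(a, b, n);
}
```

A linker-assisted version of the same idea (resolving at load time rather than first call) is what "restricted dynamic linking with zero runtime overhead" would buy.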


On Tue, Jul 26, 2022 at 7:11 AM Kito Cheng <kito.cheng@...> wrote:
Hi Jan:

>> NOTE: We don't have a complete compiler auto vectorizer
>> implementation, especially the ability for those math functions, so
>> we'll rewrite the vectorized version by hand for evaluation.
> Might this implementation of math functions be helpful? It already supports RVV via intrinsics.

Thanks for your amazing work! I think it is very useful; it saves
us the time of re-implementing those functions with RVV :)