Seeking inputs for evaluating vector ABI design
Kito Cheng
Hi:
I am Kito from the RISC-V psABI group. We've defined a basic vector ABI, which allows a function to use vector registers internally; that can already be used to optimize several libraries like libc, e.g. we can use vector instructions to accelerate memory and string manipulation functions like strcmp or memcpy. However, we are still missing a complete vector ABI, which includes a vector calling convention and a vector library interface for the RISC-V vector extensions. That's a high-priority job for the psABI group this year. One of the major goals of this mail is to seek potential benchmarks for evaluating the design of the vector ABI and to make sure no item is missing from the plan, so any feedback is appreciated!

# Vector Calling Convention (Highest priority)

The vector calling convention will include the following items:
- Define a vector calling convention variant that allows programs to pass values with scalable vector types (e.g. vint32m1_t) in vector registers.
- Define a vector calling convention variant that allows programs to pass values with fixed-length vector types (e.g. int32x4_t) in vector registers.
- Vector function signature/mangling.

# Vector Library Interface

- Interface for math functions, e.g. a vector version of the sin function: define the function name, the function signature, and the behavior for tail and masked-off elements.

# Benchmarks

We would like to collect any benchmarks that contain function calls inside the kernel function, since we need to evaluate design choices of the calling convention such as how many registers are used to pass parameters and return values, and the allocation of callee-saved and caller-saved registers. Currently we are considering the following benchmarks to evaluate the design of the calling convention:
- TSVC
- PolyBenchC

NOTE: We don't have a complete compiler auto-vectorizer implementation, especially for those math functions, so we'll write the vectorized versions by hand for evaluation.

Thanks!
Jan Wassenberg
Hi Kito,

> NOTE: We don't have a complete compiler auto vectorizer

Might this implementation of math functions be helpful? It already supports RVV via intrinsics.
Kito Cheng
Hi Jan:
> Might this implementation of math functions be helpful? It already supports RVV via intrinsics.

Thanks for your amazing work! I think it is very useful; it saves us time re-implementing those functions with RVV :)
Zalman Stern
A plea to not design the future around vague and ill-considered use cases... The C string library is generally used for legacy/convenience on small strings. People with real performance on the table use something else. Yes, it still matters, but if we're really looking at using a vector unit for text handling, an interface that is not based on pointers and zero termination is almost certainly required. The opportunity is more to design that API than to shoehorn vectors under libc.

Having a heterogeneous set of cores on an SoC is a given at this point. The small cores likely will not have a vector unit at all, but if one is going to push the vector extension into general-purpose workloads, there will be pressure to have a small implementation on smaller cores and a high-performance one on big cores. Fortunately the instruction set allows scaling the hardware implementation cleanly, but making all that seamless in software is tricky. A design that allows per-thread constraints on which available vector units are acceptable is perhaps the thing to try for. Ideally this would be somewhat dynamic, and setting and unsetting the constraints would be cheap.

Note that many mainstream operating systems effectively ban this sort of hardware design and relegate big vector units to an accelerator role. The programming model for the accelerator is completely different from that of the general-purpose CPU. This has to change.

With variable-length vectors, there are also inherently going to be costs in supporting the dynamic size. The stack frame layout will need to support variable-length slots or plan for a large maximum size, etc. Allowing one to constrain the compilation to a specific size is potentially a big win for cases where the hardware is known (e.g. firmware) or when doing just-in-time compilation. Specialization, having one or more versions of a routine optimized for known hardware, is also very likely to be a win over support for fully dynamic-size vectors in many cases.

Allowing the calling convention to support a fixed-size layout when it is known is important. Providing a means to efficiently dispatch to specialized routines is a good idea as well (e.g. a restricted dynamic linking mechanism that has zero runtime overhead).

-Z-

On Tue, Jul 26, 2022 at 7:11 AM Kito Cheng <kito.cheng@...> wrote: Hi Jan:
ghost
I would like to emphasize Zalman Stern's point about trading off hardware economy for dynamic software optimization, in the context of a larger comment about optimizing compiled code for RISC-V.

The RVV specification is designed very well to work across a variety of hardware implementations without requiring different code, but IMO one of the great truths of system design is that "compilation beats interpretation," and in this context, execution-time parameterization as defined for RVV is a form of interpretation that, like many kinds of interpretation, trades space and time overhead for convenience. For it to be most effective, the representation *from* which run-time code is generated must be sufficiently high-level: the higher the level, the greater the opportunities to tailor the code to the hardware.

Not having experience with vector-amenable computation, I can't say anything more specific, other than to note the historical tug of war between, on the one hand, compilers that recognize vectorizable constructs in low-level languages like C, and on the other, very high-level languages like APL or Halide.

The relevance to the present discussion is that run-time code generation (RTCG) may require a detailed configuration-discovery ABI/API that goes beyond the ABI for functional code. I hope the work of the relevant group(s) will take this into consideration.

-- L Peter Deutsch :: Aladdin Enterprises :: Healdsburg, CA & Burnaby, BC
Bruce Hoult
On Wed, Jul 27, 2022 at 4:21 AM Zalman Stern via lists.riscv.org <zalman=google.com@...> wrote:
First of all, discussion of libc functions such as strcmp is irrelevant to this thread, as they do not have vector register arguments. They pass pointers to arguments in memory and use (and always will use) the standard ABI, not an augmented vector ABI such as Kito is proposing.

If you have a machine with the properties you describe, and running both some heavy HPC task and some trivial task that uses the vector unit for strcpy() on the same core results in a severe overall performance penalty, then you might indeed be advised not to do that. Run those lightweight spoiler tasks on different cores, or install a libc that doesn't use the vector unit. For everyone else, with desktop PCs or phones or cloud servers etc., the vector unit should be used as much as possible!

ARM seem to be intending to vectorise every loop in every program. I don't know if or when they will achieve that, or whether RISC-V compilers will do the same, but in the meantime getting memcpy(), memset(), strlen(), strcpy(), strcmp() and all their friends to use the vector unit is low-hanging fruit that can instantly make a measurable improvement to every program on the machine.

I ran some benchmarks of memcpy() and strcpy() on an Allwinner D1 machine (which has only 128-bit vector registers) 15 months ago (April 2021). Not only was in-cache performance often doubled, the "which version do I choose?" overhead for small sizes was reduced a lot.

That machine has some quirks. Of course it implements RVV draft 0.7.1, but functions such as these are binary-compatible between them. It has only 128-bit vector registers, whereas it looks as if SiFive, for example, are intending a 256-bit minimum. Most vector instructions on the D1 (C906 core) take 3*LMUL cycles regardless of whether the actual vector might use fewer than LMUL registers.
Kito Cheng
Hi Zalman:
Defining a standardized vector ABI means we can have a common interface and agreement among different compilers and libraries; it doesn't mean we must use it everywhere. We have several ways of doing dynamic function version selection (e.g. ifunc), and it won't change the existing software ecosystem (meaning NO need to recompile everything with the vector ABI); it's an extension of the standardized ABI.

I can imagine that library functions optimized with RVV might not show a performance gain on every HW platform, but that would be a software optimization issue rather than an ABI issue, I think :)

On Wed, Jul 27, 2022 at 8:13 AM Bruce Hoult <bruce@...> wrote:
Kito Cheng
Hi Peter:
> The relevance to the present discussion is that RTCG may require a detailed configuration-discovery ABI/API

The discovery ABI/API is being discussed in another place, and it might not be part of the psABI; it is more Linux-specific stuff. SiFive folks and Rivos folks have had some discussion about the design of the configuration discovery API; it might include a set of new system calls to let programs in user space get more detailed information. That info will be public soon (at least before Linux Plumbers 2022, I guess).

On Wed, Jul 27, 2022 at 12:49 AM L Peter Deutsch <ghost@...> wrote: