A topic to discuss: lower bound on VLEN.
The upper bound is helpful but even VL-agnostic code sometimes wants at least 128 bits.
Example: N parallel instances of AES (16 bytes each), or N 128-bit results from 64x64 normal or carryless multiplication.
We can get this already (assuming SEW_LMUL1MAX = 64) by setting LMUL=2, but it seems like a poor tradeoff that
software should halve the number of registers/groups, just so that hardware could theoretically have single-element vectors.
Can we mandate VLEN >= 2*SEW_LMUL1MAX, perhaps in a profile? That would help software :)
BTW, are we intending to have the same binaries work on different implementations? It seems the only way to discover SEW_LMUL1MAX
is to try various SEW/LMUL and check for vill. Because LMUL is baked into the intrinsic function name,
software that wants portable binaries would have to recompile all vector code for LMUL=1,2,4,8, and then
pick the first one that works.
That's very burdensome, a profile guaranteeing SEW_LMUL1MAX = 64 or at least LMUL2MAX = 64 would also help a lot.