Attached is what we did at Convex, and it worked quite well, both for compiler-generated stencil code and for runtime routines like convolution and correlation.

Vector first register - C4600

The vector register set of the C4600 Series CPUs contains an additional vector register called the vector first register (VF).

VF specifies the first element of vector register Vi, Vj or Vk accessed by a vector instruction, provided that the MSB of the corresponding 5-bit register select field of the instruction is set. VF cannot be applied to operations on VM.

VF is seven bits in length and may contain a value between 0 and 127. If the value of VF plus the value of VL is greater than 128, the effective value of VL for vector instructions that use VF is 128 minus VF. This effective VL value determines the number of results written to a vector register or VM, or the number of elements stored to memory.
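The clamping rule above can be written out as a small Python sketch. This is an illustrative model only, not Convex code: the function name effective_vl and the assumption that VL ranges from 0 to 128 are mine, while the 128-element limit and the 7-bit VF range come from the text.

```python
VECTOR_LENGTH = 128  # elements per vector register, implied by "128 minus VF"


def effective_vl(vf: int, vl: int) -> int:
    """Return the VL used by a vector instruction that applies VF.

    Hypothetical helper: the name and the range checks are assumptions
    made for this sketch.
    """
    if not 0 <= vf <= 127:
        raise ValueError("VF is a 7-bit value in the range 0..127")
    if not 0 <= vl <= VECTOR_LENGTH:
        raise ValueError("VL is assumed to be in the range 0..128")
    # If the access would run past the end of the register, it is cut short.
    if vf + vl > VECTOR_LENGTH:
        return VECTOR_LENGTH - vf
    return vl


print(effective_vl(10, 64))   # 64: 10 + 64 fits within 128
print(effective_vl(100, 64))  # 28: clamped to 128 - 100
```

Note that the access is truncated rather than wrapped: with VF = 100 and VL = 64, only elements 100 through 127 are touched.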

If the value of VF plus Sj is greater than 127 in the mov Vi,Sj,Sk and mov Si,Sj,Vk instructions, then the selected element of the vector register is equal to (VF plus Sj) mod 128. Therefore, the vector register wraps for these two instructions only.
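As a rough illustration of the wrap-around rule for the two scalar element moves, here is a short Python sketch. The function name selected_element is hypothetical, and it assumes Sj holds a non-negative element index; only the (VF plus Sj) mod 128 formula is taken from the text.

```python
VECTOR_LENGTH = 128  # elements per vector register


def selected_element(vf: int, sj: int) -> int:
    """Element index chosen by a scalar element move when VF is applied.

    Hypothetical helper for this sketch; sj is assumed to be a
    non-negative element index.
    """
    if not 0 <= vf <= 127:
        raise ValueError("VF is a 7-bit value in the range 0..127")
    if sj < 0:
        raise ValueError("sj is assumed to be non-negative in this sketch")
    # Unlike full vector instructions, the index wraps instead of being cut off.
    return (vf + sj) % VECTOR_LENGTH


print(selected_element(100, 20))  # 120: no wrap needed
print(selected_element(120, 10))  # 2: 130 wraps around to element 2
```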

If Vi or Vj of an instruction specifies the same register as Vk of the instruction, and VF is applied to Vk, and VL is greater than VF, then, depending on the hardware implementation, elements of the shared register may be written (as Vk) before they are read (as Vi or Vj). In this case, the result in Vk is architecturally undefined. The instruction merg.x Vi,Vj,Vk has the same behavior if Vi or Vj is the same as Vk.
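To see why the overlapping case is left undefined, the following Python sketch models two possible element orderings for a made-up "add a scalar" operation whose source and destination are the same register, with VF applied to the destination only. Both functions and the operation itself are hypothetical; they do not describe how any particular C4600 implementation behaves, only that two plausible orderings can disagree.

```python
VECTOR_LENGTH = 128  # elements per vector register


def add_read_first(reg, vf, vl, addend):
    """Model where every source element is read before any result is written."""
    n = min(vl, VECTOR_LENGTH - vf)  # effective VL when VF is applied
    src = reg[:n]                    # snapshot of source elements 0..n-1
    out = list(reg)
    for i in range(n):
        out[vf + i] = src[i] + addend
    return out


def add_interleaved(reg, vf, vl, addend):
    """Model where each result is written before later elements are read."""
    n = min(vl, VECTOR_LENGTH - vf)
    out = list(reg)
    for i in range(n):
        # For i >= vf this reads an element that was already overwritten.
        out[vf + i] = out[i] + addend
    return out


reg = list(range(VECTOR_LENGTH))

# VL (4) > VF (2): the read and write ranges overlap, so the models disagree.
print(add_read_first(reg, 2, 4, 100)[:6])   # [0, 1, 100, 101, 102, 103]
print(add_interleaved(reg, 2, 4, 100)[:6])  # [0, 1, 100, 101, 200, 201]

# VL (2) <= VF (2): no overlap, so both models give the same result.
print(add_read_first(reg, 2, 2, 100) == add_interleaved(reg, 2, 2, 100))  # True
```

The two models agree whenever VL does not exceed VF, which matches the condition in the text: only when VL is greater than VF can a destination write land on a source element that has not been read yet.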

