Hi,
In VeLO (https://arxiv.org/pdf/2211.09760.pdf), Section B.3 states that mixing is done as F0(x) + max(σ(F1(σ(F2(x)))), axis=0, keep_dims=True).
However, the hyper_v2 implementation (https://github.com/google/learned_optimization/blob/main/learned_optimization/research/general_lopt/hyper_v2.py#L330-L335) essentially uses only one linear layer instead of two: the input to the second linear layer is x rather than mix_layer (L332).
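For concreteness, here is a minimal sketch of the two readings in plain jax.numpy. The weight matrices w0/w1/w2 are hypothetical stand-ins for F0/F1/F2 (biases omitted); this is my interpretation of the formula and of the code path, not the actual hyper_v2 code.

```python
import jax
import jax.numpy as jnp


def mix_paper(x, w0, w1, w2):
  """Mixing as I read Section B.3:
  F0(x) + max(sigmoid(F1(sigmoid(F2(x)))), axis=0, keep_dims=True)."""
  f0 = x @ w0                        # F0(x)
  mix_layer = jax.nn.sigmoid(x @ w2) # sigma(F2(x))
  f1 = jax.nn.sigmoid(mix_layer @ w1)  # sigma(F1(sigma(F2(x))))
  return f0 + jnp.max(f1, axis=0, keepdims=True)


def mix_code(x, w0, w1, w2):
  """What hyper_v2.py (L330-L335) appears to do: the second projection
  takes x directly, so mix_layer never feeds into F1."""
  f0 = x @ w0
  mix_layer = jax.nn.sigmoid(x @ w2)   # computed but not consumed downstream
  f1 = jax.nn.sigmoid(x @ w1)          # input is x, not mix_layer
  return f0 + jnp.max(f1, axis=0, keepdims=True)


# Quick check with square weights so both variants are shape-compatible.
key = jax.random.PRNGKey(0)
x, w0, w1, w2 = (jax.random.normal(k, (4, 4)) for k in jax.random.split(key, 4))
print(jnp.allclose(mix_paper(x, w0, w1, w2), mix_code(x, w0, w1, w2)))  # False in general
```

The two variants generally produce different outputs, which is why I am asking whether the single-layer path in the code is intentional or a divergence from the paper.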