
Conversation


@FawadHa1der commented Jul 22, 2025

In sumcheck there are scenarios (for example, when the verifier issues a challenge) where one operand in a multiplication is fixed and repeated many times. This pull request exploits that to reduce redundant computation. The technique is especially useful in a bitsliced setting.

When one of the multiplication operands is repeated, multiplication by it becomes a linear map, i.e. there are no non-linear (AND) ops in the map. The map (referred to as the constant mul matrix in the code) is generated by multiplying the constant with the basis elements of the field. This one-time cost is amortized over the many multiplications that use the same fixed operand.
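
As a sketch of this precomputation, assuming a 128-bit field with the canonical bit basis and a hypothetical gf128_mul routine (the actual field, basis, and multiplication depend on the library), the columns of the constant mul matrix are just the constant times each basis element:

// Hypothetical GF(2^128) field multiplication; any correct implementation
// works here, since it is only called 128 times during precomputation.
extern __uint128_t gf128_mul(__uint128_t a, __uint128_t b);

// Column i of the constant mul matrix is c * e_i, where e_i is the basis
// element with only bit i set. Applying the matrix to x then XORs together
// the columns selected by the set bits of x.
void build_constant_mul_cols(__uint128_t c, __uint128_t cols[128])
{
    for (int i = 0; i < 128; ++i)
        cols[i] = gf128_mul(c, (__uint128_t)1 << i);
}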

Since the constant mul matrix is a linear map, we can apply the "Method of Four Russians" to it and significantly reduce the number of XOR operations.
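
For contrast, here is a sketch of applying the matrix naively, one conditional XOR per input bit (mul_const_naive is an illustrative name, not from the PR); the Method of Four Russians replaces groups of these bit-by-bit XORs with precomputed table lookups, as in the byte-table code further down:

__uint128_t mul_const_naive(const __uint128_t cols[128], __uint128_t x)
{
    // Up to 128 conditional XORs of 128-bit values per multiplication.
    __uint128_t acc = 0;
    for (int i = 0; i < 128; ++i)
        if ((x >> i) & 1)
            acc ^= cols[i];
    return acc;
}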

There are a lot of optimizations that can still be done, but this PR is a minimal proof of concept with localized changes. For NUM_VARS: 28, COMPOSITION_SIZE: 4 we get about an 11% improvement in raw computation.

@FawadHa1der (Author)

Probably needs a little more cleanup. I would love to work further on this and have it merged. Please let me know what cleanup and modifications the team wants.

@FawadHa1der force-pushed the matrix_constant_mul branch from 00f11c0 to f6052a8 on July 22, 2025 at 14:36
@FawadHa1der (Author)

The same technique can also be applied to normal, non-bitsliced data. After the precomputation, a 128-bit multiplication can be done in 16 XORs + 16 lookups (byte-indexed tables, 16 × 256 entries) or 32 XORs + 32 lookups (nibble-indexed tables, 32 × 16 entries, which are much smaller). Example C code below.

#include <stdint.h>
#include <arm_neon.h>

// Split a 128-bit value into 16 little-endian bytes.
static inline void u128_to_bytes_le(__uint128_t x, uint8_t out[16])
{
    for (int i = 0; i < 16; ++i)
        out[i] = (uint8_t)(x >> (8 * i));
}

// Precompute, for each byte position, the XOR of the columns selected by
// every possible byte value. Each entry is built incrementally from the
// entry with its lowest set bit cleared, so construction costs one XOR
// per entry (16 * 255 XORs total).
void build_byte_tables_from_cols(const __uint128_t cols[128], __uint128_t T[16][256])
{
    for (int pos = 0; pos < 16; ++pos) {
        T[pos][0] = 0;
        for (int v = 1; v < 256; ++v) {
            int lsb = v & -v;             // lowest set bit of v
            int bit = __builtin_ctz(lsb); // 0..7
            T[pos][v] = T[pos][v ^ lsb] ^ cols[pos*8 + bit];
        }
    }
}

// Multiply by the fixed constant: one table lookup per input byte,
// XOR-accumulated in a NEON register (16 loads + 16 XORs total).
__uint128_t mul_const_neon_bytes(const __uint128_t T[16][256], __uint128_t X)
{
    uint8_t xb[16];
    u128_to_bytes_le(X, xb); // maybe we can remove this?

    uint64x2_t acc = vdupq_n_u64(0);
    for (int pos = 0; pos < 16; ++pos) {
        uint64x2_t t = vld1q_u64((const uint64_t*)&T[pos][ xb[pos] ]);
        acc = veorq_u64(acc, t);
    }

    uint64_t out64[2];
    vst1q_u64(out64, acc);
    __uint128_t y = (__uint128_t)out64[0] | ((__uint128_t)out64[1] << 64);
    return y;
}
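
Putting the pieces together, a usage sketch (build_constant_mul_cols is the hypothetical precomputation shown earlier): precompute once per fixed operand, then reuse the tables for every multiplication by it.

// Precompute once per fixed operand c, then each multiplication by c
// costs only 16 table loads + 16 vector XORs.
void mul_many_by_constant(__uint128_t c, const __uint128_t *xs,
                          __uint128_t *ys, int n)
{
    __uint128_t cols[128];
    static __uint128_t T[16][256];         // 64 KiB of lookup tables
    build_constant_mul_cols(c, cols);      // 128 field muls, one-time
    build_byte_tables_from_cols(cols, T);  // 16 * 255 XORs, one-time
    for (int i = 0; i < n; ++i)
        ys[i] = mul_const_neon_bytes(T, xs[i]);
}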
