rvint

Integer-based mathematical subroutines which implement RISC-V M and B extension functionality on I, and Zmmul instruction sets. There is partial support for the reduced functionality E instruction set where this is easy to do and doesn't involve a performance tradeoff.

These routines are designed to be both concise and efficient, although where those two goals clash, I went with concise.

Division

On 32-bit processors (RV32I, RV32E)

32-bit by 32-bit signed and unsigned division with 32-bit result and remainder.
Fast 32-bit unsigned division by 3
Fast 32-bit unsigned division by 5
Fast 32-bit unsigned division by 6
Fast 32-bit unsigned division by 7
Fast 32-bit unsigned division by 9
Fast 32-bit unsigned division by 10
Fast 32-bit unsigned division by 11
Fast 32-bit unsigned division by 12
Fast 32-bit unsigned division by 13
Fast 32-bit unsigned division by 100
Fast 32-bit unsigned division by 1000
Fast 32-bit signed division by 3
Fast 32-bit signed division by 5
Fast 32-bit signed division by 6
Fast 32-bit signed division by 7
Fast 32-bit signed division by 9
Fast 32-bit signed division by 10
Fast 32-bit signed division by 11
Fast 32-bit signed division by 12
Fast 32-bit signed division by 13
Fast 32-bit signed division by 100
Fast 32-bit signed division by 1000

On 64-bit processors (RV64I)

64-bit by 64-bit signed and unsigned division with 64-bit result and remainder.
Fast 64-bit unsigned division by 3
Fast 64-bit unsigned division by 5
Fast 64-bit unsigned division by 6
Fast 64-bit unsigned division by 7
Fast 64-bit unsigned division by 9
Fast 64-bit unsigned division by 10
Fast 64-bit unsigned division by 11
Fast 64-bit unsigned division by 12
Fast 64-bit unsigned division by 13
Fast 64-bit unsigned division by 100
Fast 64-bit unsigned division by 1000
Fast 64-bit signed division by 3
Fast 64-bit signed division by 5
Fast 64-bit signed division by 6
Fast 64-bit signed division by 7
Fast 64-bit signed division by 9
Fast 64-bit signed division by 10
Fast 64-bit signed division by 11
Fast 64-bit signed division by 12
Fast 64-bit signed division by 13
Fast 64-bit signed division by 100
Fast 64-bit signed division by 1000

Math Macros

abs (absolute value) of 32 or 64 bit register.

Multiplication

On 32-bit processors (RV32E, RV32I):

32-bit by 32-bit signed and unsigned multiplication with 32-bit result.
32-bit by 32-bit signed and unsigned multiplication with 64-bit result.

on 64-bit processors (RV64i):

32-bit by 32-bit signed and unsigned multiplication with 64-bit result.
64-bit by 64-bit signed and unsigned multiplication with 64-bit result.
64-bit by 64-bit signed and unsigned multiplication with 128-bit result.

Multiplication Macros

Multiply 32-bit or 64-bit register by 3.
Multiply 32-bit or 64-bit register by 5.
Multiply 32-bit or 64-bit register by 6.
Multiply 32-bit or 64-bit register by 9.
Multiply 32-bit or 64-bit register by 10.
Multiply 32-bit or 64-bit register by 11.
Multiply 32-bit or 64-bit register by 12.
Multiply 32-bit or 64-bit register by 13.
Multiply 32-bit or 64-bit register by 100.
Multiply 32-bit or 64-bit register by 1000.

Base Conversions & I/O Operations

These operations support 32-bit numbers on 32-bit architectures and 64-bit numbers on 64-bit architectures. All routines are RV32E, RV32I, and RV64I compatible.

ASCII binary to binary.
ASCII unsigned decimal to binary.
ASCII signed decimal to two's complement binary.
ASCII hexadecimal to binary.
binary to ASCII binary.
two's complement binary to ASCII signed decimal.
unsigned binary to unsigned ASCII decimal.
binary to ASCII hexadecimal.

Square Root (RV32E, RV32I, and RV64I compatible)

32-bit integer square root on 32-bit processors.
64-bit integer square root on 64-bit processors.

Greatest Common Divisor (RV32E, RV32I, and RV64I compatible)

32-bit GCD of two unsigned 32-bit numbers on 32-bit processors.
64-bit GCD of two unsigned 64-bit numbers on 64-bit processors.

Least Common Multiple (RV32E, RV32I, and RV64I compatible)

32-bit LCM of two unsigned 32-bit numbers on 32-bit processors.
64-bit LCM of two unsigned 64-bit numbers on 64-bit processors.

Bit Operations (RV32E, RV32I, and RV64I compatible)

Count leading zeroes in 32-bit number on 32-bit processors.
Count leading zeroes in 64-bit number on 64-bit processors.
Count trailing zeroes in 32-bit number on 32-bit processors.
Count trailing zeroes in 64-bit number on 64-bit processors.

Building

clang, lld, and make are assumed. I'm currently using clang 18. The TARGET, ARCH, and ABI should be edited in the Makefile, and the CPU_BITS constant should be set to match in config.s HAS_... flags should be set for available extensions.

A static library, librvint.a, will be created to link against.

Only certain routines support RV32E. The routines that do not currently give compile errors; this will be fixed in the future. RV64E is not a focus as I am unaware of any real world implementations. The Makefile has a CH32V003 target that is an RV32EC_zicsr target.

Tests

The test programs assume Linux syscalls are available. In particular syscall 64 is write and 93 is exit.

Running

I use this emulator to run RV32I code: https://riscv-programming.org/ale/
I'm running RV64 code on a scaleway elastic metal cloud instance: https://labs.scaleway.com/en/em-rv1/

API

Bit manipulation - bits.s:

These routines implement functionality in the ZBB ISA extension for CPUs which do not have this available. Despite being efficiently implemented (they run in O(log n) time), they are heavyweight enough that they are not useful in places like inner loops in division routines.

bits_ctz

Count number of trailing zeroes on CPUs with no ZBB extension. O(log n). RV32E/RV32I/RV64I compatible.

Configuration	Cycles (32)	Cycles (64)
Best Case	3	3
Average Case	~21	~25
Worst Case	23	27

Input

a0 = number

Output

a0 = count of trailing zeroes

bits_clz

Count number of leading zeroes on CPUs with no ZBB extension. O(log n). RV32E/RV32I/RV64I compatible.

Configuration	Cycles (32)	Cycles (64)
Best Case	3	3
Average Case	~21	~25
Worst Case	23	27

Input

a0 = number

Output

a0 = count of leading zeroes

Division - div.s:

Signed and unsigned division for CPUs without dividers.

The initial implementation of divremu came from the division routine in the rvmon monitor. I believe this code originally came from Bruce Hout on reddit. It has been heavily extended to minimize the number of cycles by using ISA extensions and restructuring the code by selectively unrolling parts. The divrem routine is a wrapper which enables signed division.

divremu

Unsigned integer division for CPUs without M extension. RV32E/RV32I/RV64I compatible. Restoring division algorithm. Available in ROLLED (compact) and UNROLLED (faster) versions (this is selected in config.s with the DIVREMU_UNROLLED flag). Worst-case performance:

Configuration	Cycles (32)	Cycles (64)
Base ISA (Rolled)	455	903
Base ISA (Unrolled)	351	695
With Extensions (Rolled)	391	775
With Extensions (Unrolled)	287	567

Input

a0 = dividend a1 = divisor

Output

a0 = quotient a1 = remainder

divrem

Signed integer division that rounds towards zero. RV32E/RV32I/RV64I compatible restoring division algorithm. Uses divremu, so performance depends upon whether that routine is used in ROLLED or UNROLLED form. Worst case performance:

Configuration	Cycles (32)	Cycles (64)
Base ISA (Rolled)	496	944
Base ISA (Unrolled)	392	736
With Extensions (Rolled)	430	814
With Extensions (Unrolled)	326	606

Input

a0 = dividend a1 = divisor

Output

a0 = quotient a1 = remainder

The following division routines are branchless, straight line code that execute in a fixed number of cycles for all inputs, and thus are suitable for cryptographic applications.

Initial implementation was done from Hacker's Delight, 2nd edition. The algorithms were extended to 64 bits, and agressively optimized using the peculiar features of the RISC-V ISA to minimize the cycle count. The algorithmic approach is to use a series expansion to estimate the quotient, then an estimated remainder is calculated, and the quotient is corrected.

div3u

Unsigned integer divide by 3. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	20	22
With Extensions	17	19

Input

a0 = dividend

Output

a0 = quotient

div5u

Unsigned integer divide by 5. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	15	17
With Extensions	14	16

Input

a0 = dividend

Output

a0 = quotient

div6u

Unsigned integer divide by 6. You may be better off calling div3u and shifting the result. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	18	22
With Extensions	17	21

Input

a0 = dividend

Output

a0 = quotient

div7u

Unsigned integer divide by 7. Note that only the base ISA is used here, so there is no accelleration from ISA extensions. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	16	20

Input

a0 = dividend

Output

a0 = quotient

div9u

Unsigned integer divide by 9. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	15	17
With Extensions	14	16

Input

a0 = dividend

Output

a0 = quotient

div10u

Unsigned integer divide by 10. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	16	18
With Extensions	15	17

Input

a0 = dividend

Output

a0 = quotient

div11u

Unsigned integer divide by 11. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	18	20
With Extensions	15	17

Input

a0 = dividend

Output

a0 = quotient

div12u

Unsigned integer divide by 12. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	17	19
With Extensions	16	18

Input

a0 = dividend

Output

a0 = quotient

div13u

Unsigned integer divide by 13. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	18	20
With Extensions	16	18

Input

a0 = dividend

Output

a0 = quotient

div100u

Unsigned integer divide by 100. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	10	21
With Extensions	17	19

Input

a0 = dividend

Output

a0 = quotient

div1000u

Unsigned integer divide by 1000. This basically divides by 10, then by 100. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	37	41
With Extensions	35	39

Input

a0 = dividend

Output

a0 = quotient

div3

Signed integer divide by 3. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	25	27
With Extensions	22	24

Input

a0 = dividend

Output

a0 = quotient

div5

Signed integer divide by 5. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	24	26
With Extensions	23	25

Input

a0 = dividend

Output

a0 = quotient

div6

Signed integer divide by 6. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	27	31
With Extensions	26	30

Input

a0 = dividend

Output

a0 = quotient

div7

Signed integer divide by 7. Note that ISA extensions are not used by this routine. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	25	29

Input

a0 = dividend

Output

a0 = quotient

div9

Signed integer divide by 9. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	24	26
With Extensions	23	25

Input

a0 = dividend

Output

a0 = quotient

div10

Signed integer divide by 10. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	25	27
With Extensions	24	26

Input

a0 = dividend

Output

a0 = quotient

div11

Signed integer divide by 11. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	27	29
With Extensions	24	26

Input

a0 = dividend

Output

a0 = quotient

div12

Signed integer divide by 12. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	26	28
With Extensions	25	27

Input

a0 = dividend

Output

a0 = quotient

div13

Signed integer divide by 13. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	27	31
With Extensions	25	29

Input

a0 = dividend

Output

a0 = quotient

div100

Signed integer divide by 100. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	24	26
With Extensions	22	24

Input

a0 = dividend

Output

a0 = quotient

div1000

Signed integer divide by 1000. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	39	43
With Extensions	36	40

Input

a0 = dividend

Output

a0 = quotient

The below remainder routines are based on Hacker's Delight 2nd Edition, but this and the other remainder routines I didn't implement are of limited utility as it's faster to calculate the quotient and calculate the remainder from that.

I could modify the above routines to also return the remainder in addition to the quotient, which seems useful, but would likely cost 3-4 instructions per routine. Not yet sure if I want to do this or not, I'm interested which approach library users would find most helpful.

These routines run in constant time and are branchless.

mod3u

unsigned integer remainder after division by 3. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	21	23
With Extensions	19	21

Input

a0 = dividend

Output

a0 = quotient

mod3

signed integer remainder after division by 3. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	24	26
With Extensions	22	24

Input

a0 = dividend

Output

a0 = quotient

Algorithms: gcd.s

These routines calculate the greatest common divisor (gcd) and least common multiple (lcm) of two numbers. They can take advantage of the M and Zbb extensions, if available.

gcd

Computes greatest common divisor of two numbers, gcd(u,v). Can take advantage of M and Zbb extensions. With M extension it uses Euclid's algorithm. If M is not present, it computes the gcd via Stein's algorithm. O(log n). RV32E/RV32I/RV64I

Average Cycle Counts

Configuration	Cycles (32)	Cycles (64)
Base ISA	~500	~1000
With Zbb	~150	~300
With M	~100	~400

Input

a0 = u
a1 = v

Output

a0 = result

lcm

Computes the least common multiple of two numbers, lcm(u, v). Computes via lcm(u,v) = (u / gcd(u,v)) * v. Can take advantage of M and Zbb extensions. O(log n). RV32E/RV32I/RV64I

Average cycle counts

Configuration	Cycles (32)	Cycles (64)
Base ISA	~1200	~2500
With Zbb	~750	~1500
With M	~250	~500

Input

a0 = u
a1 = v

Output

a0 = result

I/O and Base Conversion io.s

to_hex

Convert a value in a register to an ASCII hexadecimal string. RV32E/RV32I/RV64I

Cycle Counts

Configuration	Prefix? (a2)	Cycle Count	Notes
8-bit (Byte)	No	31	2 nibbles
32-bit (Word)	No	97	8 nibbles
32-bit (Word)	Yes (0x)	100	Add prefix overhead
64-bit (Long)	No	185	16 nibbles (RV64 only)

Input

a0 = number to convert to ASCII hex
a1 = number of bytes to convert (eg 1, 2, 4, 8)
a2 = 0 do not insert leading 0x, 1 insert leading 0x

Output

a0 = address of nul (\0) terminated buffer with output
a1 = length of string

to_bin

Convert a value in a register to an ASCII binary string. RV32E/RV32I/RV64I

Cycle Counts with a2=3

Configuration	Input Width	Cycle Count
Base ISA	32	438
Base ISA	64	870
With Zbs	32	406
With Zbs	64	806

Input

a0 = number to convert to ascii binary
a1 = number of bytes to convert (eg 1, 2, 4, 8)
a2 = flags (0=none, 1=0b prefix, 2=spaces every 8 bits, 3=both)

Output

a0 = address of nul (\0)-terminated buffer with output
a1 = length of string

to_decu

Convert a value in a register to an ASCII unsigned decimal string. RV32E/RV32I/RV64I/RV128I This routine has an inline implemention of the div10u series expansion in it.

Configuration	Cycles (32)	Cycles (64)	Cycles (128)
Base ISA	230	590	1,450
Base + Zba	194	495	1,220
Base + M	~410	~850	Not recommended

Input

a0 = unsigned number to convert to ascii decimal

Output

a0 = address of nul (\0)-terminated buffer with output
a1 = length of string

to_dec

Convert a value in a register to an ASCII signed decimal string. RV32E/RV32I/RV64I/RV128I This routine is a wrapper around to_decu.

Configuration	Cycles (32)	Cycles (64)	Cycles (128)
Base ISA	~245	~605	~1,465
Base + Zba	~209	~510	~1,235
Base + M	~425	~865	Not recommended

Input

a0 = signed number to convert to ascii decimal

Output

a0 = address of nul (\0)-terminated buffer with output
a1 = length of string

from_hex

Convert an ASCII Hexadecimal string to binary. RV32E/RV32I/RV64I

Cycle count, assuming Hexadecimal string is half digits, half letters.

Cycles (32)	Cycles (64)
90	170

Input

a0 = pointer to ASCII Hexadecimal string, terminated with a non-hex char.

Output

a0 = pointer (advanced to point to non-hex char)
a1 = number
a2 = error check: 0 if no digits found, otherwise 1

from_bin

Convert an ASCII binary string to binary. RV32E/RV32I/RV64I

Cycles (32)	Cycles (64)
331	651

Input

a0 = pointer to ASCII binary string, terminated with a non-binary char (not 0 or 1)

Output

a0 = pointer (advanced to point to non-binary char)
a1 = number
a2 = error check: 0 if no digits found, otherwise 1

from_decu

Read an ASCII unsigned decimal string into a register. The parsing of the value stops when we read the first non-decimal character. RV32E/RV32I/RV64I

Assume 10 digits for 32 bit case, 20 digits for 64 bit case:

Configuration	Cycles (32)	Cycles (64)
Base ISA	131	251
With Zba	121	231

Input

a0 = pointer to number to convert from decimal, terminated with non-decimal char.

Output

a0 = pointer (advanced to point to non-decimal char)
a1 = number
a2 = error check: 0 if no digits found, otherwise 1

from_dec

Read an ASCII signed decimal string into a register. The parsing of the value stops when we read the first non-decimal character.

Assume 10 digits for 32 bit case, 20 digits for 64 bit case:

Configuration	Cycles (32)	Cycles (64)
Base ISA	~151	~271
With Zba	~141	~251

Input

a0 = pointer to number to convert from decimal, terminated with non-decimal char.

Output

a0 = pointer (advanced to point to non-decimal char)
a1 = number
a2 = error check: 0 if no digits found, otherwise 1

Math macros mac.s

abs

A macro to compute the absolute value of a number.

The src and dest registers can be the same if src does not need to be preserved.

Configuration	Cycles
Base ISA	3
Zbb Extension	2

abs dest src scratch

Input

src - input register
scratch - temporary register

Output

dest - output register

Multiplication

nmul

Native word length multiplication (32 on RV32I/E, 64 on RV64I). Intended for CPUs without M extension.

Configuration	Cycles (32)	Cycles (64)	Notes
Base ISA	~270	~540	Scales via bit width
Zbb	~170	~330	Scales with popcount

Input

a0 = multiplicand
a1 = multiplier

Output

a0 = product (low order)

mul32

Unified (RV64E, RV32I, and RV32I) Unsigned/Signed 32x32-bit to 64-bit multiply.

Configuration	Cycles (32)	Cycles (64)
Base ISA	~415	~280
Zbb Extension	~410	~165

Input

a0 = op1
a1 = op2
a2 = signed_flag: 0=unsigned, 1=signed

Output

a0 = product low word
a1 = product high word

m128

64x64 bit to 128-bit multiplication (signed/unsigned) on RV64I CPUs.

Configuration	Cycles
Base ISA	~710
Zbb Extension	~660

Configuration	Cycles
Base ISA	5
Zba Extension	4

mul1000 dest src scratch0

Input

src = source register containing number to multiply.
scratch0 = scratch register used for intermediate computation

Output

dest = destination register containing src*1000

Square Root sqrt.s

isqrt

Integer square root - floor(sqrt(n)) using a binary non-restoring algorithm. RV32E/RV32I/RV64I

Configuration	Cycles (32)	Cycles (64)
Base ISA	~130	~260
Zbb Extension	~105	~ 210

Input

a0 = number to compute square root of

Output

a0 = root

Name		Name	Last commit message	Last commit date
Latest commit History 269 Commits
src		src
LICENSE		LICENSE
README.md		README.md
rvint.png		rvint.png

License

benmesander/rvint

Folders and files

Latest commit

History

Repository files navigation

rvint

Division

On 32-bit processors (RV32I, RV32E)

On 64-bit processors (RV64I)

Math Macros

Multiplication

On 32-bit processors (RV32E, RV32I):

on 64-bit processors (RV64i):

Multiplication Macros

Base Conversions & I/O Operations

Square Root (RV32E, RV32I, and RV64I compatible)

Greatest Common Divisor (RV32E, RV32I, and RV64I compatible)

Least Common Multiple (RV32E, RV32I, and RV64I compatible)

Bit Operations (RV32E, RV32I, and RV64I compatible)

Building

Tests

Running

API

Bit manipulation - bits.s:

bits_ctz

Input

Output

bits_clz

Input

Output

Division - div.s:

divremu

Input

Output

divrem

Input

Output

div3u

Input

Output

div5u

Input

Output

div6u

Input

Output

div7u

Input

Output

div9u

Input

Output

div10u

Input

Output

div11u

Input

Output

div12u

Input

Output

div13u

Input

Output

div100u

Input

Output

div1000u

Input

Output

div3

Input

Output

div5

Input

Output

div6

Input

Output