Official Repository for Nami — an adaptive activation function with built-in strong regularization capabilities.
Nami means "wave" in Japanese — the name comes from its wavy nature in the negative domain.
Unlike typical activation functions, whose outputs settle toward a single value in the negative domain, Nami oscillates there because of its sin component, and it combines that oscillation with the smoothness of tanh. The behavior is controlled by three learnable parameters: `w`, `a`, and `b`.
```
pip install nami-act
```
```python
# for PyTorch
from nami import Torch
nami_torch = Torch.Nami()

# for TensorFlow
from nami import TF
nami_tf = TF.Nami()

# for JAX there are two modules:
# nami is a plain function, Nami is an object for Haiku.
# Flax users can simply wrap the functional module with flax.nn.Module.
from nami.JAX import nami, Nami
```

Supported frameworks:
- PyTorch
- TensorFlow
- JAX (needs some fixes)
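
The wrappers are meant to be used like any other activation module. Here is a minimal usage sketch for PyTorch; it assumes `Torch.Nami()` behaves like a standard `nn.Module` activation, and the tiny model below is only an illustration, not something shipped with the package:

```python
import torch
import torch.nn as nn
from nami import Torch  # assumed package layout, matching the imports above

# Illustrative CNN that swaps ReLU for Nami (assumes Torch.Nami is an nn.Module).
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    Torch.Nami(),                 # drop-in replacement for nn.ReLU()
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)

x = torch.randn(8, 3, 32, 32)     # dummy CIFAR-10-sized batch
logits = model(x)                 # forward pass works like any other torch model
print(logits.shape)               # torch.Size([8, 10])
```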
- Nami consistently leads in high-impact learning phases (e.g., around LR drops), emphasizing its adaptability.
- This resilience is not observed in traditional activations that either saturate early or destabilize under aggressive scheduling.
- Nami’s derivative exhibits a non-monotonic, softly saturating profile—a design that diversifies gradient propagation and avoids neuron inactivation.
- Its flexible form enables localized feature modulation, especially beneficial in deep architectures where gradient dynamics are critical.
Here is a quick comparison of Nami, Swish, and Mish with the same weight initialization, using ResNet-18 on CIFAR-10. More detailed and complex tests are in the benchmarks.
```python
def seed_everything(seed=42):
    # Make runs reproducible across random, NumPy, and PyTorch (CPU and CUDA).
    import random, os
    import numpy as np
    import torch

    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True, warn_only=True)

seed_everything(42)
```

The training ran for 200 epochs, and Nami showed very stable losses throughout, especially in the later epochs. That stability comes from `tanh(x * a)` in the positive domain and `a * sin(x * w) / b` in the negative domain (a minimal code sketch of this form follows the parameter list below). The learnable parameters `w`, `a`, and `b` demand longer runs to learn deeper and more complex information from the data.
- `w` is responsible for maintaining the wavelength; the smaller it is, the smoother the gradients are.
- `a` regulates the spikes of the waves; high waves can capture deeper information, but if it rises too much it may cause overfitting, which is where `b` comes into the picture.
- `b` tackles overfitting by suppressing `a`'s dominance, and increases generalization.
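
For readers who want to see the piecewise form above in code, here is a minimal re-implementation sketch in PyTorch. It only illustrates the `tanh(x * a)` / `a * sin(x * w) / b` split with learnable scalars; the actual Nami module in the package may differ in details such as parameter shapes and initialization:

```python
import torch
import torch.nn as nn

class NamiSketch(nn.Module):
    """Illustrative sketch of the piecewise form described above (not the packaged Nami).

    Positive domain: tanh(a * x)          -> smooth, softly saturating
    Negative domain: a * sin(w * x) / b   -> oscillating ("wavy") branch
    """

    def __init__(self, w: float = 1.0, a: float = 1.0, b: float = 1.0):
        super().__init__()
        # Learnable scalars; the initial values here are an assumption.
        self.w = nn.Parameter(torch.tensor(float(w)))
        self.a = nn.Parameter(torch.tensor(float(a)))
        self.b = nn.Parameter(torch.tensor(float(b)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pos = torch.tanh(self.a * x)                   # x >= 0 branch
        neg = self.a * torch.sin(self.w * x) / self.b  # x < 0 branch
        return torch.where(x >= 0, pos, neg)
```

In this sketch, a smaller `w` stretches the wavelength of the negative branch, a larger `a` raises the wave amplitude, and a larger `b` damps it, which mirrors the roles of the three parameters described above.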
Here is a sample benchmark on ResNet-18:

| Activation | Validation Accuracy | Training Loss | Validation Loss |
|------------|---------------------|---------------|-----------------|
| Nami       | 94.81               | 0.0015        | 0.1963          |
| Mish       | 94.09               | 0.0032        | 0.2424          |
| Swish/SiLU | 94.06               | 0.0024        | 0.2347          |
In conclusion, Nami beats Mish and Swish in both generalization and accuracy! It is well suited to larger and deeper networks, and even in shallow networks it captures more complex information than the other activations tested. Despite having three trainable parameters, its computational cost remains minimal, and convergence is outstanding. For more detailed experiments, visit the benchmarks.
"I'd love to see ML folks fine-tune LLMs or train deep models using Nami, you can share the stats with me here: Gmail, X
Thanks for your support :)






