This project implements a unified hyperbolic representation learning framework for the North American Industry Classification System (NAICS). The system combines multi-channel text encoding, Mixture-of-Experts fusion, hyperbolic contrastive learning, and a hyperbolic graph refinement stage to produce geometry-aware embeddings aligned with the hierarchical structure of the NAICS taxonomy.
The final output is a set of Lorentz-model hyperbolic embeddings suitable for similarity search, hierarchical modeling, graph-based reasoning, and downstream machine learning applications.
The system consists of four sequential stages:
- Multi-channel text encoding – independent transformer-based encoders for title, description, examples, and exclusions.
- Mixture-of-Experts (MoE) fusion – adaptive fusion of the four embeddings using Top-2 gating.
- Hyperbolic contrastive learning – projection into Lorentz space and optimization with Decoupled Contrastive Learning (DCL).
- Hyperbolic Graph Convolutional Refinement (HGCN) – structure-aware refinement using the explicit NAICS parent–child graph.
Each stage is designed to preserve or enhance the hierarchical geometry of NAICS codes.
Each NAICS code includes four distinct text fields:
- Title: Short code name ⟶ Concise category identification
- Description: Detailed explanation of what the code encompasses ⟶ Rich semantic content
- Examples: Representative businesses in this category ⟶ Concrete instantiations
- Excluded: Codes explicitly NOT in this category ⟶ Disambiguation and boundaries
Each field is processed independently using a transformer encoder (LoRA-adapted). This produces four Euclidean embeddings:
- Title: (Embedding_title)
- Description: (Embedding_description)
- Examples: (Embedding_examples)
- Excluded: (Embedding_excluded)
These embeddings serve as inputs to the fusion stage.
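A minimal sketch of one channel encoder, assuming a Hugging Face backbone with the `peft` library (the checkpoint name, LoRA rank, and target modules are illustrative, not the project's actual settings):

```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

CKPT = 'sentence-transformers/all-MiniLM-L6-v2'   # hypothetical backbone
tokenizer = AutoTokenizer.from_pretrained(CKPT)
backbone = AutoModel.from_pretrained(CKPT)

# Wrap the backbone with LoRA adapters (rank and targets are illustrative).
encoder = get_peft_model(backbone, LoraConfig(r=8, lora_alpha=16,
                                              target_modules=['query', 'value']))

def encode_channel(texts: list[str]) -> torch.Tensor:
    # Mean-pool token states into one Euclidean embedding per text.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch['attention_mask'].unsqueeze(-1).float() # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # (B, H)

e_title = encode_channel(['Soybean Farming'])            # Embedding_title, etc.
```

Running the same routine over all four fields (with a shared or per-channel adapter) yields the four embeddings listed above.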
The four channel embeddings are concatenated and passed into a Mixture-of-Experts (MoE) module. Key components include:
- Top-2 gating to route each input to the two most relevant experts.
- Feed-forward expert networks that learn specialized fusion behaviors.
- Auxiliary load-balancing loss to ensure even expert utilization across batches.
This produces a single fused Euclidean embedding (E_fused) per NAICS code.
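A minimal sketch of such a fusion layer (the expert count, hidden width, and the Switch-style load-balancing term are illustrative choices, not the project's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, dim: int, n_experts: int = 4, hidden: int = 1024):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts))

    def forward(self, x: torch.Tensor):
        # x: (B, 4H) concatenation of the four channel embeddings.
        logits = self.gate(x)                               # (B, E)
        top_w, top_i = logits.topk(2, dim=-1)               # Top-2 routing
        top_w = F.softmax(top_w, dim=-1)
        fused = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                hit = top_i[:, slot] == e
                if hit.any():
                    fused[hit] += top_w[hit, slot:slot + 1] * expert(x[hit])
        # Switch-style auxiliary loss: penalizes uneven expert utilization.
        probs = F.softmax(logits, dim=-1)
        frac = F.one_hot(top_i, probs.size(-1)).float().mean(dim=(0, 1))
        aux = (probs.mean(0) * frac).sum() * probs.size(-1)
        return fused, aux                                    # fused = E_fused
```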
To align the latent space with the hierarchical structure of NAICS, embeddings are projected into Lorentz-model hyperbolic space via the exponential map.
The fused Euclidean vector is mapped onto the hyperboloid:
- Uses exponential map at the origin
- Supports learned or fixed curvature
- Ensures numerical stability
The result is a Lorentz embedding (E_hyp).
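A minimal sketch of the projection (clamping and curvature conventions vary between libraries; this follows the common parameterization with curvature -c, c > 0):

```python
import torch

def lorentz_expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    # Map a Euclidean tangent vector v at the origin onto the hyperboloid
    # of curvature -c. Clamping the norm keeps cosh/sinh numerically stable.
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    x_time = torch.cosh(sqrt_c * norm) / sqrt_c              # time-like coordinate
    x_space = torch.sinh(sqrt_c * norm) * v / (sqrt_c * norm)
    return torch.cat([x_time, x_space], dim=-1)              # E_hyp: (..., d+1)
```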
Contrastive learning is performed using Decoupled Contrastive Learning (DCL) with Lorentzian geodesic distances:
d(u, v) = arcosh(-<u, v>_L)
The DCL loss decouples the positive and negative terms:
L = (-pos_sim + logsumexp(neg_sims)).mean()
This formulation provides better gradient flow and numerical stability compared to standard InfoNCE.
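A minimal sketch of this loss in PyTorch (temperature and batch shapes are illustrative; the clamp guards acosh's domain, and the distance assumes unit curvature as in the formula above):

```python
import torch

def lorentz_inner(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Lorentzian inner product <u, v>_L = -u0*v0 + sum_i ui*vi.
    uv = u * v
    return -uv[..., 0] + uv[..., 1:].sum(dim=-1)

def lorentz_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Geodesic distance d(u, v) = arcosh(-<u, v>_L).
    return torch.acosh((-lorentz_inner(u, v)).clamp_min(1.0 + eps))

def dcl_loss(anchor, positive, negatives, temperature=0.1):
    # anchor, positive: (B, d+1); negatives: (B, K, d+1) Lorentz points.
    pos_sim = -lorentz_distance(anchor, positive) / temperature                 # (B,)
    neg_sims = -lorentz_distance(anchor.unsqueeze(1), negatives) / temperature  # (B, K)
    # Decoupled form: the positive term is excluded from the denominator.
    return (-pos_sim + torch.logsumexp(neg_sims, dim=1)).mean()
```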
Negatives include:
- unrelated codes,
- hierarchically distant codes,
- false negatives detected via periodic clustering (masked with -inf).
A curriculum-based procedure removes semantically similar negatives once the embedding space stabilizes:
- Generate embeddings for the dataset.
- Cluster embeddings (e.g., via KMeans).
- Identify negatives sharing the cluster label with the anchor.
- Exclude these from the contrastive denominator.
This prevents the model from incorrectly separating close hierarchical neighbors.
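A sketch of the clustering step using scikit-learn's KMeans (the cluster count, and clustering in this space at all rather than in tangent space, are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def false_negative_mask(embeddings: np.ndarray,
                        anchor_idx: np.ndarray,
                        neg_idx: np.ndarray,
                        n_clusters: int = 256) -> np.ndarray:
    # Cluster all embeddings; negatives sharing the anchor's cluster label
    # are flagged as likely false negatives.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    return labels[anchor_idx][:, None] == labels[neg_idx]   # (B, K) boolean

# Flagged similarities are set to -inf so logsumexp ignores them:
#   neg_sims[mask] = float('-inf')
```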
To fully integrate the explicit hierarchical relationships of NAICS, the system applies a Hyperbolic Graph Convolutional Network as a refinement stage.
Nodes represent NAICS codes, and edges represent parent–child relationships in the taxonomy.
The refinement module includes:
- Two hyperbolic graph convolutional layers
- Tangent-space aggregation and message passing
- Learnable curvature shared across layers
- Exponential and logarithmic maps for manifold transitions
The model optimizes a combined loss with two terms:
- A hyperbolic ranking term that ensures the anchor–positive distance is smaller than the anchor–negative distance, with all distances computed as Lorentz geodesics.
- A radius-consistency term that encourages embeddings at the same hierarchical level to maintain similar hyperbolic radii.
This aligns global and local geometric structure with the NAICS taxonomy.
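A minimal sketch of one refinement layer, reusing `lorentz_expmap0` from the projection sketch above (the dense normalized adjacency, tanh nonlinearity, and single weight matrix are illustrative simplifications; the actual module stacks two such layers):

```python
import torch
import torch.nn as nn

def lorentz_logmap0(x: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    # Inverse of lorentz_expmap0: hyperboloid point -> tangent vector at origin.
    sqrt_c = c ** 0.5
    r = torch.acosh((sqrt_c * x[..., :1]).clamp_min(1.0 + eps)) / sqrt_c
    space = x[..., 1:]
    return r * space / space.norm(dim=-1, keepdim=True).clamp_min(eps)

class HGCNLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.log_c = nn.Parameter(torch.zeros(()))   # learnable shared curvature

    def forward(self, x_hyp: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        c = self.log_c.exp()                         # keep curvature positive
        v = lorentz_logmap0(x_hyp, c)                # manifold -> tangent space
        v = adj @ self.lin(v)                        # parent-child aggregation
        return lorentz_expmap0(torch.tanh(v), c)     # tangent -> manifold
```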
To ensure graph refinement does not erode the global structure captured by the text model, the same hierarchy-aware metrics introduced earlier in the pipeline are logged:
- Cophenetic correlation (with pair counts)
- Spearman correlation
- NDCG@K (configurable list; default 5/10/20)
- Hyperbolic distortion statistics
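As one concrete example, the Spearman correlation between embedding geodesic distances and taxonomy distances can be computed with scipy (a generic sketch; the project's metric code may differ):

```python
import numpy as np
from scipy.stats import spearmanr

def hierarchy_spearman(emb_dist: np.ndarray, tax_dist: np.ndarray) -> float:
    # Rank-correlate pairwise embedding distances with taxonomy (tree)
    # distances over the upper triangle of the two square matrices.
    iu = np.triu_indices_from(emb_dist, k=1)
    return spearmanr(emb_dist[iu], tax_dist[iu]).correlation
```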
To ensure the refinement step preserves global structure while improving local parent retrieval, use the following command:
uv run naics-embedder tools verify-stage4 \
--pre ./output/hyperbolic_projection/encodings.parquet \
--post ./output/hgcn/encodings.parquet \
--distance-matrix ./data/naics_distance_matrix.parquet \
--relations ./data/naics_relations.parquet

The verifier reports cophenetic correlation, NDCG@K, and parent-retrieval accuracy deltas, and fails when degradation exceeds the configurable thresholds (--max-cophenetic-drop, --max-ndcg-drop, --min-local-improvement, --parent-top-k). Integrate this command into your pipeline to ensure that Stage 4 ships only when it demonstrably preserves the global NAICS hierarchy.
Upon completion of all four stages, the system produces:
- High-fidelity hyperbolic embeddings in Lorentz space
- Representations consistent with both text semantics and hierarchical relationships
- Embeddings suitable for:
- hierarchical search and retrieval
- clustering and visualization
- downstream machine learning tasks
- graph-based analytics
+-------------------------------+
| Multi-Channel Text Encoder |
| (Title / Desc / Examples / |
| Excluded via Transformer) |
+---------------+---------------+
|
v
+-------------------------------+
| Mixture-of-Experts |
| Top-2 Gating + Expert MLPs |
| Load-Balanced Fusion Layer |
+---------------+---------------+
|
v
+-------------------------------+
| Hyperbolic Projection |
| (Lorentz Exponential Map) |
+---------------+---------------+
|
v
+-------------------------------+
| Hyperbolic Contrastive Loss |
| (DCL + Lorentz Distance + |
| False Negative Masking) |
+---------------+---------------+
|
v
+-------------------------------+
| HGCN Refinement |
| (Tangent-Space GNN + Curv.) |
+---------------+---------------+
|
v
+-------------------------------+
| Final Lorentz Hyperbolic Emb. |
+-------------------------------+
Clone the repository:
git clone https://github.com/lowmason/naics-embedder.git
cd naics-embedder

Install uv:

pip3 install uv

Install dependencies:

uv sync

Prepare the NAICS dataset with four text channels:
uv run naics-embedder data preprocess
uv run naics-embedder data relations
uv run naics-embedder data distances
uv run naics-embedder data triplets

Or:

uv run naics-embedder data all

The text encoder uses the Structure-Aware Dynamic Curriculum (SADC) scheduler by default. It progresses through three phases in a single run (structural initialization, geometric refinement, and false-negative mitigation), activating the appropriate sampling flags automatically.
Run training with the base config:
uv run naics-embedder train --config conf/config.yaml

Train the refinement model:

uv run naics-embedder train-hgcn --config configs/hgcn.yaml

Use Lorentzian distance:

dist = lorentz_distance(x, y)

Lower values indicate closer hierarchical or semantic similarity.
Project to tangent space or Poincaré ball for plotting.
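For Poincaré-ball plotting, the standard stereographic projection applies (a sketch for unit curvature; scaling conventions vary):

```python
import torch

def lorentz_to_poincare(x: torch.Tensor) -> torch.Tensor:
    # Project unit-curvature hyperboloid points into the open unit
    # (Poincaré) ball for 2D/3D visualization.
    return x[..., 1:] / (1.0 + x[..., :1])
```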
Final embeddings can be used as features for:
- classification models,
- clustering algorithms (in hyperbolic or tangent space),
- retrieval and recommendation systems.
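For example, hierarchical retrieval reduces to a nearest-neighbor query under the Lorentz geodesic distance (reusing `lorentz_distance` from the contrastive-loss sketch; variable names and shapes are illustrative):

```python
# emb: (N, d+1) final Lorentz embeddings; query: (d+1,) one embedded NAICS code.
dists = lorentz_distance(query.unsqueeze(0), emb)      # (N,) geodesic distances
nearest = dists.topk(10, largest=False).indices        # 10 closest codes
```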