111 changes: 89 additions & 22 deletions contrib/models/phi-1_5/README.md
@@ -1,64 +1,131 @@
# Contrib Model: Phi-1.5

NeuronX Distributed Inference implementation of Microsoft Phi-1.5.

## Model Information

- **HuggingFace ID:** `microsoft/phi-1_5`
- **Model Type:** Decoder-only transformer
- **Parameters:** 1.3B
- **License:** MIT

## Architecture Details

Phi-1.5 has several unique architectural features:

- **Partial Rotary Embeddings**: Only 50% of head dimensions use RoPE (`partial_rotary_factor=0.5`)
- **Parallel Residual**: Attention and MLP use the same normalized input (parallel computation)
- **GELU Activation**: Uses GELU (not SwiGLU like LLaMA)
- **LayerNorm**: Uses standard LayerNorm (not RMSNorm)
- **Bias in All Projections**: QKV, output, and MLP projections all have bias
- **Single LayerNorm per Layer**: Only `input_layernorm` (no `post_attention_layernorm`)
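
These choices surface directly in the model's Hugging Face configuration. A minimal sketch of the relevant fields (names follow the `transformers` `PhiConfig` convention; the exact `hidden_act` and `layer_norm_eps` values are assumptions based on the standard Phi-1.5 checkpoint):

```python
# Sketch of the Phi-1.5 config fields behind the architecture notes above.
phi_1_5_config = {
    "hidden_size": 2048,
    "num_attention_heads": 32,
    "num_hidden_layers": 24,
    "partial_rotary_factor": 0.5,  # only half of each head dim gets RoPE
    "hidden_act": "gelu_new",      # GELU, not SwiGLU (assumed default)
    "layer_norm_eps": 1e-5,        # standard LayerNorm, not RMSNorm (assumed)
}

# Derived quantities used by the partial-RoPE code later in this README
head_dim = phi_1_5_config["hidden_size"] // phi_1_5_config["num_attention_heads"]  # 64
rotary_ndims = int(head_dim * phi_1_5_config["partial_rotary_factor"])             # 32
```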

## Validation Results

**Validated:** 2026-02-06
**Configuration:** TP=2, batch_size=1, seq_len=128, bfloat16

### Test Results

| Test | Status | Result |
|------|--------|--------|
| Smoke Test | ✅ PASS | Model loads successfully |
| Token Matching | ✅ PASS | **100% match** (best of multiple prompts) |

### Multi-Prompt Accuracy

| Prompt | Match Rate |
|--------|------------|
| "The largest planet in our solar system is" | 100% |
| "1 + 1 =" | 100% |
| "The color of the sky is" | 100% |
| "The capital of France is" | 71.9% |
| "Water boils at" | 68.8% |

**Status:** ✅ VALIDATED

## Key Implementation Notes

### State Dict Conversion

The HuggingFace Phi model uses different weight names than NeuronX expects:

```python
# HuggingFace -> NeuronX
model.layers.{i}.self_attn.q_proj -> layers.{i}.self_attn.qkv_proj.q_proj
model.layers.{i}.self_attn.k_proj -> layers.{i}.self_attn.qkv_proj.k_proj
model.layers.{i}.self_attn.v_proj -> layers.{i}.self_attn.qkv_proj.v_proj
model.layers.{i}.self_attn.dense -> layers.{i}.self_attn.o_proj.o_proj
model.final_layernorm -> norm
```
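
The mapping above can be sketched as a pure-Python key rename (a simplified illustration; the full conversion in `src/modeling_phi.py` iterates the whole state dict, and MLP and layernorm keys keep their names):

```python
def rename_hf_key(key: str) -> str:
    """Map a HuggingFace Phi weight name to the NeuronX layout shown above."""
    # Strip the 'model.' prefix if present
    if key.startswith("model."):
        key = key[len("model."):]
    # q/k/v projections move under the fused qkv_proj module
    for proj in ("q_proj", "k_proj", "v_proj"):
        key = key.replace(f".self_attn.{proj}.", f".self_attn.qkv_proj.{proj}.")
    # HF calls the attention output projection 'dense'
    key = key.replace(".self_attn.dense.", ".self_attn.o_proj.o_proj.")
    # Final layernorm is just 'norm' on the Neuron side
    if key.startswith("final_layernorm."):
        key = key.replace("final_layernorm.", "norm.")
    return key
```

For example, `rename_hf_key("model.layers.0.self_attn.dense.weight")` yields `layers.0.self_attn.o_proj.o_proj.weight`.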

### Partial Rotary Embeddings

Only the first 50% of head dimensions are rotated:

```python
head_dim = 64 # 2048 / 32
rotary_ndims = int(head_dim * 0.5) # 32

# Split Q/K into rotary and pass-through parts
Q_rot, Q_pass = Q[..., :rotary_ndims], Q[..., rotary_ndims:]
K_rot, K_pass = K[..., :rotary_ndims], K[..., rotary_ndims:]

# Apply RoPE only to rotary parts
Q_rot, K_rot = apply_rotary_pos_emb(Q_rot, K_rot, cos, sin)

# Concatenate back
Q = torch.cat([Q_rot, Q_pass], dim=-1)
K = torch.cat([K_rot, K_pass], dim=-1)
```
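
For a concrete, dependency-free illustration of the same idea on a single head vector (a sketch, not the production kernel; the real implementation operates on batched tensors and precomputed cos/sin caches):

```python
import math

def apply_partial_rope(vec, position, rotary_ndims, base=10000.0):
    """Rotate only the first `rotary_ndims` dims of one head vector;
    the remaining dims pass through unchanged (partial_rotary_factor=0.5)."""
    rot, passthrough = vec[:rotary_ndims], vec[rotary_ndims:]
    half = rotary_ndims // 2
    out = list(rot)
    for i in range(half):
        # HF-style "rotate_half" pairing: dim i rotates with dim i + half
        theta = position / (base ** (2 * i / rotary_ndims))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = rot[i] * c - rot[i + half] * s
        out[i + half] = rot[i + half] * c + rot[i] * s
    return out + list(passthrough)
```

At position 0 the rotation is the identity, and the tail dimensions beyond `rotary_ndims` are always returned untouched.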

### Parallel Residual

Both attention and MLP use the same normalized input:

```python
residual = hidden_states
hidden_states = self.input_layernorm(hidden_states)

# Attention and MLP use same normalized input
attn_output = self.self_attn(hidden_states)
mlp_output = self.mlp(hidden_states)

# Combine both with residual
hidden_states = residual + attn_output + mlp_output
```
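
The structure above can be reduced to a tiny framework-free sketch (toy stand-ins for the norm, attention, and MLP; the real layer uses the NeuronX modules):

```python
def parallel_residual_block(x, layernorm, attn, mlp):
    """Phi-style decoder layer: attention and MLP both branch from the
    same normalized input, and both outputs are added to the residual."""
    normed = layernorm(x)
    return [r + a + m for r, a, m in zip(x, attn(normed), mlp(normed))]

# Toy usage with elementwise stand-ins for the sublayers
identity = lambda xs: xs
double = lambda xs: [2 * v for v in xs]      # pretend "attention"
plus_one = lambda xs: [v + 1 for v in xs]    # pretend "MLP"
parallel_residual_block([1, 2], identity, double, plus_one)
```

Contrast this with a sequential (LLaMA-style) layer, where the MLP would see `residual + attn_output` after a second layernorm rather than the same normalized input.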

## Usage

```python
from transformers import AutoTokenizer
from neuronx_distributed_inference.models.config import NeuronConfig
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config
import torch

# Import model classes from src
from src.modeling_phi import NeuronPhiForCausalLM, PhiInferenceConfig

model_path = "/path/to/phi-1_5/"
compiled_model_path = "/path/to/compiled/"

# Configure
neuron_config = NeuronConfig(
    tp_degree=2,
    batch_size=1,
    seq_len=128,
    torch_dtype=torch.bfloat16,
)

config = PhiInferenceConfig.from_pretrained(model_path, neuron_config=neuron_config)

# Compile and load
model = NeuronPhiForCausalLM(model_path, config)
model.compile(compiled_model_path)
model.load(compiled_model_path)

# Generate
tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("The capital of France is", return_tensors="pt")
# ... (see integration test for full example)
```

@@ -86,10 +153,10 @@ python3 test/integration/test_model.py

## Example Checkpoints

* microsoft/phi-1_5

## Maintainer

Neuroboros Team - Annapurna Labs

**Last Updated:** 2026-02-06
36 changes: 23 additions & 13 deletions contrib/models/phi-1_5/src/modeling_phi.py
@@ -536,12 +536,12 @@ def convert_hf_to_neuron_state_dict(state_dict: dict, config: InferenceConfig) -
- model.final_layernorm.weight/bias
- lm_head.weight/bias

    Neuron format (NeuronAttentionBase expects):
    - embed_tokens.weight
    - layers.{i}.self_attn.qkv_proj.q_proj.weight/bias
    - layers.{i}.self_attn.qkv_proj.k_proj.weight/bias
    - layers.{i}.self_attn.qkv_proj.v_proj.weight/bias
    - layers.{i}.self_attn.o_proj.o_proj.weight/bias
- layers.{i}.mlp.fc1.weight/bias
- layers.{i}.mlp.fc2.weight/bias
- layers.{i}.input_layernorm.weight/bias
@@ -553,19 +553,29 @@ def convert_hf_to_neuron_state_dict(state_dict: dict, config: InferenceConfig) -
    # Convert HF naming to Neuron naming
    new_state_dict = {}
    for key, value in state_dict.items():
        new_key = key

        # Remove 'model.' prefix if present
        if new_key.startswith('model.'):
            new_key = new_key[6:]  # Remove 'model.'

        # Rename attention projections to match NeuronAttentionBase format
        # q_proj -> qkv_proj.q_proj
        if '.self_attn.q_proj.' in new_key:
            new_key = new_key.replace('.self_attn.q_proj.', '.self_attn.qkv_proj.q_proj.')
        elif '.self_attn.k_proj.' in new_key:
            new_key = new_key.replace('.self_attn.k_proj.', '.self_attn.qkv_proj.k_proj.')
        elif '.self_attn.v_proj.' in new_key:
            new_key = new_key.replace('.self_attn.v_proj.', '.self_attn.qkv_proj.v_proj.')
        # dense -> o_proj.o_proj
        elif '.self_attn.dense.' in new_key:
            new_key = new_key.replace('.self_attn.dense.', '.self_attn.o_proj.o_proj.')

        # Rename final layernorm: final_layernorm -> norm
        if new_key.startswith('final_layernorm.'):
            new_key = new_key.replace('final_layernorm.', 'norm.')

        new_state_dict[new_key] = value

    state_dict = new_state_dict
