| 1 | +# SparseFlow Deployment Guide |
| 2 | + |
| 3 | +## Quick Start |
| 4 | + |
| 5 | +### 1. Check GPU Compatibility |
| 6 | +```bash |
| 7 | +python3 -c "import sparseflow; print(sparseflow.check_sparse_support())" |
| 8 | +``` |
| 9 | + |
| 10 | +Requirements: |
| 11 | +- NVIDIA GPU with compute capability ≥ 8.0 (Ampere or newer) |
| 12 | +- CUDA 11.8+ |
| 13 | +- PyTorch 2.0+ |
| 14 | + |
| 15 | +### 2. Analyze Your Deployment |
| 16 | +```bash |
| 17 | +sparseflow-audit --model llama-7b --qps 1000 |
| 18 | +``` |
| 19 | + |
| 20 | +This shows: |
| 21 | +- GPU requirements (dense vs sparse) |
| 22 | +- Annual cost savings |
| 23 | +- Carbon footprint reduction |
| 24 | +- ROI timeline |
| 25 | + |
| 26 | +### 3. Convert Your Model |
| 27 | +```bash |
| 28 | +sparseflow-convert \ |
| 29 | + --input model.pt \ |
| 30 | + --output model_sparse.sf \ |
| 31 | + --validate |
| 32 | +``` |
| 33 | + |
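| 34 | +The `--validate` flag checks that the converted weights follow the 2:4 pattern (at most two nonzero values in every group of four) and reports the accuracy impact. |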
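
As a rough illustration of what a 2:4 pattern is, the sketch below prunes a flat list of weights by magnitude and then checks the result. It is plain Python for readability, not SparseFlow's implementation; `prune_2_4` and `is_2_4_sparse` are hypothetical helper names, and the real converter may choose which values to keep differently.

```python
def prune_2_4(weights):
    """Zero the two smallest-magnitude values in each group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def is_2_4_sparse(weights):
    """True if every group of four holds at most two nonzero values."""
    return len(weights) % 4 == 0 and all(
        sum(1 for v in weights[s:s + 4] if v != 0) <= 2
        for s in range(0, len(weights), 4)
    )


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse_row = prune_2_4(row)
print(sparse_row)                 # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
print(is_2_4_sparse(sparse_row))  # True
```

Half of the weights in every group are dropped, which is why accuracy should be re-checked after conversion.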
| 35 | + |
| 36 | +### 4. Benchmark Performance |
| 37 | +```bash |
| 38 | +sparseflow-benchmark --size 4096x4096 --iterations 100 |
| 39 | +``` |
| 40 | + |
| 41 | +This measures the actual sparse-versus-dense speedup on your hardware. |
| 42 | + |
| 43 | +## Production Deployment |
| 44 | + |
| 45 | +### Step 1: Model Conversion |
| 46 | +```python |
| 47 | +import torch |
| 48 | +from torch import nn |
| 49 | +import sparseflow as sf |
| 50 | + |
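| 51 | +# Load your model (a pickled nn.Module needs weights_only=False on PyTorch 2.6+; only load trusted files) |
| 52 | +model = torch.load("model.pt", weights_only=False) |
| 53 | + |
| 54 | +# Convert Linear layers (snapshot the module list before mutating the model) |
| 55 | +for name, module in list(model.named_modules()): |
| 56 | +    if isinstance(module, nn.Linear): |
| 57 | +        sparse_layer, diff = sf.SparseLinear.from_dense( |
| 58 | +            module, |
| 59 | +            method="magnitude", |
| 60 | +            return_diff=True |
| 61 | +        ) |
| 62 | +        print(f"{name}: {diff['max_error']:.6f} max error") |
| 63 | +        # Replace the layer on its parent module |
| 64 | +        parent_name, _, attr = name.rpartition(".") |
| 65 | +        setattr(model.get_submodule(parent_name), attr, sparse_layer) |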
| 66 | +``` |
| 67 | + |
| 68 | +### Step 2: Validate Accuracy |
| 69 | +```python |
| 70 | +# Test on validation set (evaluate() and val_loader are your own; dense_model is an unconverted copy) |
| 71 | +dense_accuracy = evaluate(dense_model, val_loader) |
| 72 | +sparse_accuracy = evaluate(sparse_model, val_loader) |
| 73 | + |
| 74 | +print(f"Accuracy delta: {sparse_accuracy - dense_accuracy:.4f}") |
| 75 | +``` |
| 76 | + |
| 77 | +Typical accuracy impact is under 0.5% on most tasks; confirm it on your own validation set before deploying. |
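
The comparison above can be turned into a simple deployment gate. The sketch below is one way to do that in plain Python, assuming you already have predictions from both models; `accuracy`, `accuracy_drop_ok`, and the 0.005 threshold (taken from the 0.5% figure) are illustrative choices, not part of the SparseFlow API.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def accuracy_drop_ok(dense_acc, sparse_acc, max_drop=0.005):
    """True if the sparse model loses no more than max_drop absolute accuracy."""
    return (dense_acc - sparse_acc) <= max_drop


# Toy data: the sparse model gets one more example wrong than the dense one
labels       = [0, 1, 1, 0, 1, 0, 1, 1]
dense_preds  = [0, 1, 1, 0, 1, 0, 1, 0]
sparse_preds = [0, 1, 1, 0, 0, 0, 1, 0]

dense_acc = accuracy(dense_preds, labels)
sparse_acc = accuracy(sparse_preds, labels)
print(f"Accuracy delta: {sparse_acc - dense_acc:.4f}")
print("OK to deploy:", accuracy_drop_ok(dense_acc, sparse_acc))
```

On a real validation set the gate is only meaningful if the set is large enough that a 0.5% change is more than a handful of examples.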
| 78 | + |
| 79 | +### Step 3: Deploy |
| 80 | +```python |
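| 81 | +# Inference (use batch size ≥ 32; smaller batches may see no speedup) |
| 82 | +x = torch.randn(32, 4096, device='cuda', dtype=torch.float16) |
| 83 | +y = sparse_model(x)  # up to 2× faster than dense; measure on your hardware |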
| 84 | +``` |
| 85 | + |
| 86 | +## Cost Analysis |
| 87 | + |
| 88 | +### Example: LLaMA 7B @ 1000 QPS |
| 89 | + |
| 90 | +**Dense (baseline):** |
| 91 | +- GPUs: 16× A100-80GB |
| 92 | +- Annual GPU cost: $515K |
| 93 | +- Annual power cost: $67K |
| 94 | +- Total: $582K/year |
| 95 | + |
| 96 | +**SparseFlow:** |
| 97 | +- GPUs: 8× A100-80GB |
| 98 | +- Annual GPU cost: $258K |
| 99 | +- Annual power cost: $34K |
| 100 | +- Total: $292K/year |
| 101 | + |
| 102 | +**Savings:** |
| 103 | +- $290K/year (50% reduction) |
| 104 | +- 14 tons CO₂/year |
| 105 | +- ROI: Immediate |
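
The arithmetic behind these figures can be reproduced with a few lines of Python. The inputs are copied from the example above; the per-GPU hourly rate is a derived number that assumes the GPUs run all 8,760 hours of the year, which the guide does not state.

```python
# Figures from the example above, in thousands of USD per year
dense = {"gpus": 16, "gpu_cost": 515, "power_cost": 67}
sparse = {"gpus": 8, "gpu_cost": 258, "power_cost": 34}


def total(deployment):
    """Annual GPU cost plus annual power cost."""
    return deployment["gpu_cost"] + deployment["power_cost"]


savings = total(dense) - total(sparse)
savings_pct = 100 * savings / total(dense)

# Assumption: continuous operation, 24 h x 365 d = 8760 h per year
per_gpu_hour = dense["gpu_cost"] * 1000 / dense["gpus"] / 8760

print(f"Dense total:  ${total(dense)}K/year")
print(f"Sparse total: ${total(sparse)}K/year")
print(f"Savings:      ${savings}K/year ({savings_pct:.1f}%)")
print(f"Implied GPU rate: ${per_gpu_hour:.2f}/GPU-hour")
```

Swapping in your own GPU count and rates gives a quick sanity check on the numbers that `sparseflow-audit` reports.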
| 106 | + |
| 107 | +## Troubleshooting |
| 108 | + |
| 109 | +### "CUDA not available" |
| 110 | + |
| 111 | +- Check: `nvidia-smi` |
| 112 | +- Install: CUDA Toolkit 11.8+ |
| 113 | + |
| 114 | +### "2:4 sparse not supported" |
| 115 | + |
| 116 | +- Requires: Ampere (SM80) or newer |
| 117 | +- Check: `torch.cuda.get_device_capability()` |
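
A minimal sketch of that check, assuming the only requirement is compute capability 8.0 or higher as stated in the requirements list. `supports_2_4_sparsity` is a hypothetical helper, not a SparseFlow function; `torch.cuda.is_available()` and `torch.cuda.get_device_capability()` are standard PyTorch calls.

```python
def supports_2_4_sparsity(capability):
    """Return True if a (major, minor) compute capability is SM80 or newer."""
    return tuple(capability) >= (8, 0)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        if not torch.cuda.is_available():
            print("CUDA not available: check nvidia-smi and the CUDA Toolkit install")
        else:
            major, minor = torch.cuda.get_device_capability()
            ok = supports_2_4_sparsity((major, minor))
            print(f"SM{major}{minor}: 2:4 sparse supported = {ok}")
```

Tuple comparison handles the minor version correctly, so (7, 5) is rejected while (8, 0), (8, 6), and (9, 0) pass.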
| 118 | + |
| 119 | +### "Slower than dense" |
| 120 | + |
| 121 | +- Check batch size (need ≥32 for speedup) |
| 122 | +- Check GPU utilization |
| 123 | +- Try different tile sizes |
| 124 | + |
| 125 | +## Support |
| 126 | + |
| 127 | +- GitHub Issues: https://github.com/MapleSilicon/SparseFlow/issues |
| 128 | +- Email: engineering@maplesilicon.com |