(fix):Resolve Windows CUDA Flash Attention prebuilt wheels #604

Closed
tonyjohnvan wants to merge 2 commits into ace-step:main from tonyjohnvan:fix/windows-cuda-deps-and-flash-attn

Conversation

tonyjohnvan (Contributor) commented on Feb 16, 2026

Summary

Fix two dependency bugs affecting Windows CUDA users, introduced/exposed by the DGX Spark PR (#575):

  1. Windows uv run fails with an "unsatisfiable" torch resolution: adding win32/AMD64 to required-environments forced uv lock to cross-resolve all four platforms simultaneously. Since uv.lock is gitignored, Windows users had no lockfile, and the implicit uv lock failed to resolve torch==2.7.1+cu128 from the explicit PyTorch index. Fix: remove win32/AMD64 from required-environments (restoring the pre-#575 local-resolution behavior) and upgrade Windows torch to 2.10.0+cu128 to align with Linux x86_64 (see the sketch after this list).

  2. flash-attn unavailable on Python 3.12: the only Windows flash-attn wheel (sdbds, cp311, built against torch 2.7.1) excluded Python 3.12 users entirely. Fix: switch to mjun0812/flash-attention-prebuild-wheels, which provides cp311 and cp312 wheels built against torch 2.10 for both Windows (cu126, backward-compatible with the cu128 runtime) and Linux (cu128).
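
For concreteness, here is a minimal pyproject.toml sketch of the Bug 1 fix (the torch pin and the required-environments change). It is an illustration rather than the PR's literal diff: the index name pytorch-cu128 and the platform entries remaining in required-environments are assumptions, while the version pins, markers, and removal of win32/AMD64 come from the summary above.

```toml
# Sketch of the Bug 1 fix. Not the PR's literal diff: the index name and the
# entries left in required-environments are illustrative assumptions.

[project]
dependencies = [
    # Windows torch stack aligned with Linux x86_64 (was 2.7.1+cu128);
    # Linux and macOS torch entries are unchanged and omitted here.
    "torch==2.10.0+cu128; sys_platform == 'win32' and platform_machine == 'AMD64'",
    "torchvision==0.25.0+cu128; sys_platform == 'win32' and platform_machine == 'AMD64'",
    "torchaudio==2.10.0+cu128; sys_platform == 'win32' and platform_machine == 'AMD64'",
]

[tool.uv]
# win32/AMD64 removed from this list, so Windows resolves locally at sync time
# instead of forcing `uv lock` to cross-resolve all four platforms at once.
required-environments = [
    "sys_platform == 'linux' and platform_machine == 'x86_64'",
    "sys_platform == 'linux' and platform_machine == 'aarch64'",
]

# The "explicit" PyTorch index mentioned above: cu128 wheels are only pulled
# from it for packages pinned to it via tool.uv.sources.
[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true

[tool.uv.sources]
torch = [
    { index = "pytorch-cu128", marker = "sys_platform == 'win32' and platform_machine == 'AMD64'" },
]
```

Bug 2's flash-attn wheel sources are sketched separately under Risk and Compatibility below.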

Scope

Files changed:

  • pyproject.toml — Upgrade Windows torch/torchvision/torchaudio from 2.7.1 to 2.10.0+cu128, add platform_machine == 'AMD64' marker, remove win32 from required-environments
  • acestep/third_parts/nano-vllm/pyproject.toml — Replace sdbds flash-attn wheel with mjun0812 wheels (cp311 + cp312) for Windows; add cp312 wheel for Linux
  • requirements.txt — Mirror the same torch upgrade and flash-attn wheel changes for pip users

Explicitly out of scope:

  • Linux x86_64 torch (unchanged at 2.10.0+cu128)
  • Linux aarch64 / DGX Spark torch (unchanged at 2.10.0+cu130)
  • macOS arm64 torch (unchanged at >=2.9.1)
  • triton / triton-windows dependencies (unchanged)
  • All application code (zero Python source changes)

Risk and Compatibility

| Platform | torch | flash-attn | Status |
|---|---|---|---|
| Windows AMD64 | 2.7.1+cu128 → 2.10.0+cu128 | sdbds cp311 → mjun0812 cp311+cp312 | Changed |
| Linux x86_64 | 2.10.0+cu128 | added cp312 wheel | Additive only |
| Linux aarch64 (DGX Spark) | 2.10.0+cu130 | N/A (SDPA fallback) | Unchanged |
| macOS arm64 | >=2.9.1 | N/A (SDPA fallback) | Unchanged |
  • Windows torch upgrade: all 2.10.0+cu128 wheels verified to exist on the PyTorch index for cp311, cp312, and cp313 (win_amd64)
  • Windows flash-attn cu126 wheels are backward-compatible with the cu128 CUDA runtime, since CUDA minor versions are forward-compatible within the same major version (see the sketch after this list)
  • required-environments removal for Windows: restores the exact behavior that worked before #575; uv resolves locally at sync time instead of requiring cross-platform lockfile resolution
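
The cu126/cu128 compatibility noted above is what makes the mjun0812 Windows wheels usable here. As a rough sketch (again not the literal diff), the prebuilt wheels can be wired up as marker-gated URL sources. Only the Linux cp311 URL below is taken verbatim from this PR; the remaining combinations are indicated by comments because their exact filenames live on the mjun0812 release page.

```toml
[tool.uv.sources]
flash-attn = [
    # Linux x86_64, CPython 3.11, cu128, torch 2.10 (URL referenced in this PR)
    { url = "https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl", marker = "sys_platform == 'linux' and platform_machine == 'x86_64' and python_version == '3.11'" },
    # Linux cp312 (cu128, torch 2.10) and Windows cp311/cp312 (cu126, torch 2.10)
    # entries follow the same shape, each gated by python_version and
    # platform_machine; the cu126 Windows wheels run on a cu128 CUDA runtime
    # because CUDA minor versions are forward-compatible within a major version.
]
```

Marker-gated URL sources keep the dependency itself generic ("flash-attn") while letting each platform/Python combination resolve to a matching prebuilt wheel, which is the pattern the Walkthrough below describes.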

Regression Checks

  • uv lock succeeds (resolves 172 packages)
  • uv sync --dry-run succeeds
  • Lockfile contains correct Windows torch 2.10.0+cu128 wheels (cp311 + cp312)
  • Lockfile contains correct flash-attn entries for all four platform/python combos
  • DGX Spark (aarch64) resolution unchanged in lockfile
  • macOS arm64 resolution unchanged in lockfile
  • Windows end-to-end: uv run acestep (needs Windows tester)
  • Linux end-to-end: flash-attn import on Python 3.12 (needs Linux tester)

Reviewer Notes

  • uv.lock is gitignored (.gitignore line 112) — this is the root cause of Bug 1. Users never receive a lockfile, so uv run triggers uv lock on their machine. Committing uv.lock to version control would be a stronger fix but is a separate discussion.
  • No cp312 flash-attn for Windows + torch 2.7.1 exists anywhere — the sdbds repo only builds cp311, and mjun0812 only builds against torch 2.10. Upgrading Windows torch to 2.10.0 was necessary to unlock cp312 flash-attn support.
  • Follow-up: Consider un-gitignoring uv.lock so users get reproducible installs without needing to resolve dependencies themselves.

Summary by CodeRabbit

  • Chores
    • Updated PyTorch, torchvision, and torchaudio to 2.10.0+cu128 for Windows (AMD64) and aligned Linux x86_64 builds to 2.10.0+cu128.
    • Added/explicitly declared macOS ARM64 (Apple Silicon) CPU/MPS wheel entries and preserved Linux aarch64 configurations.
    • Introduced a platform-generic flash-attn dependency and added curated prebuilt wheel sources for Windows and Linux to broaden compatibility.
    • Simplified platform/Python gating for several native-accelerator wheels to reduce overly specific version constraints.

coderabbitai bot (Contributor) commented on Feb 16, 2026

📝 Walkthrough

Dependency manifests updated: platform- and Python-gated wheel sources and dependency lines for PyTorch, torchvision, torchaudio, Triton, and flash-attn were reworked. Windows entries now include platform_machine == 'AMD64' and bumped PyTorch versions; platform-specific flash-attn wheel URLs were replaced with generic entries, and new wheel source blocks were added.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Nano-vLLM pyproject (acestep/third_parts/nano-vllm/pyproject.toml) | Reworked Triton and flash-attn dependency constraints: removed Python-version-gated platform-specific Triton entries; replaced platform/Python-specific flash-attn wheel URLs with a generic flash-attn entry. |
| Top-level pyproject.toml (pyproject.toml) | Bumped Windows PyTorch/torchvision/torchaudio to 2.10.0+cu128 / 0.25.0+cu128 / 2.10.0+cu128 with sys_platform == 'win32' and platform_machine == 'AMD64' gating; added tool.uv.sources wheel URLs for flash-attn and platform/python markers; added flash-attn dependency gated off macOS/ARM. |
| Requirements manifest (requirements.txt) | Updated platform-specific wheel lines: added/adjusted Windows AMD64 gating, macOS arm64 entries, Linux x86_64 / aarch64 entries for CUDA/CPU variants; replaced Windows-specific flash-attn URL with a generic flash-attn specifier and aligned other platform flash-attn entries. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • (fix):DGX Spark dependencies #575: Modifies the same manifest-level dependency entries (pyproject.toml/requirements.txt) to remap platform/architecture-specific PyTorch, Triton, and flash-attn wheel sources and markers.

Suggested reviewers

  • ChuxiJ

Poem

🐰 I nibbled through manifests, neat and quick,
Swapped wheels and markers, trimmed each wick,
AMD64 now gets its proper gate,
Flash-attn moved and versions update,
Hooray — builds hop forward, light and slick!

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately describes the main fix: resolving Windows CUDA Flash Attention prebuilt wheels configuration, which is the core objective. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Merge Conflict Detection | ✅ Passed | ✅ No merge conflicts detected when merging into main |


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@requirements.txt`:
- Around line 54-56: The Linux entry for "flash-attn" in requirements.txt
currently uses a generic specifier and should be replaced with explicit prebuilt
wheel URLs (like the Windows entries) so pip won't attempt to compile from
source; update the single "flash-attn; sys_platform == 'linux' and
platform_machine == 'x86_64'" line to explicit wheel installs for Linux CPython
3.11 and 3.12 (using python_version == '3.11' and python_version == '3.12'
conditionals) mirroring the style used for the Windows lines and consistent with
the wheel URLs used in acestep/third_parts/nano-vllm/pyproject.toml so users on Linux get
the prebuilt wheels.

tonyjohnvan force-pushed the fix/windows-cuda-deps-and-flash-attn branch from 98447c7 to b151636 on February 16, 2026 07:11
tonyjohnvan marked this pull request as draft on February 16, 2026 07:15
…sh-attn

Two bugs introduced/exposed by the DGX Spark PR (ace-step#575):

1. Adding win32/AMD64 to required-environments forced uv lock to
   cross-resolve all four platforms. Since uv.lock is gitignored,
   Windows users had no lockfile and uv lock failed resolving
   torch==2.7.1+cu128 from the explicit pytorch index.
   Fix: remove win32 from required-environments so Windows resolves
   locally at sync time (restoring pre-ace-step#575 behavior), and upgrade
   Windows torch to 2.10.0+cu128 to align with Linux x86_64.

2. flash-attn only had a Python 3.11 (cp311) wheel for Windows
   (sdbds, built against torch 2.7.1). Python 3.12 users could not
   install flash-attn at all.
   Fix: switch to mjun0812/flash-attention-prebuild-wheels which
   provides cp311 and cp312 wheels built against torch 2.10 for both
   Windows (cu126, backward-compatible with cu128 runtime) and Linux
   (cu128). Added cp312 entries for both platforms.

Co-authored-by: Cursor <cursoragent@cursor.com>
tonyjohnvan force-pushed the fix/windows-cuda-deps-and-flash-attn branch from b151636 to 217ba43 on February 16, 2026 07:18
- Added flash-attn dependency to pyproject.toml and requirements.txt for better platform support.
- Updated flash-attn source URLs in pyproject.toml to include specific wheels for Windows and Linux based on Python versions.
- Removed old flash-attn wheel links from nano-vllm's pyproject.toml to streamline dependency management.
tonyjohnvan marked this pull request as ready for review on February 16, 2026 10:15
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@acestep/third_parts/nano-vllm/pyproject.toml`:
- Around line 14-19: Update the pyproject.toml dependency entries to guard
platform/Python compatibility: make the "flash-attn" dependency conditional
(only Linux x86_64 with Python >=3.11 and Windows x86_64 for supported Python
versions) by adding appropriate environment markers (e.g., sys_platform and
platform_machine and python_version) and tighten "triton-windows" to exclude
Windows ARM by adding platform_machine == 'AMD64'; also update or raise the
requires-python metadata from ">=3.10" to ">=3.11" (or add python_version
markers on flash-attn) so the nano-vllm path dependency won’t force unresolved
wheels on unsupported OS/architectures and Python versions.

Comment on lines 14 to 19
"torch>=2.4.0",
# Triton and Flash Attention are optional on ROCm (Python 3.12) - SDPA fallback used instead
"triton-windows>=3.0.0,<3.4; sys_platform == 'win32' and python_version == '3.11'",
"triton>=3.0.0; sys_platform == 'linux' and python_version == '3.11'",
"triton>=3.0.0; sys_platform == 'linux'",
"triton-windows>=3.0.0,<3.4; sys_platform == 'win32'",
"transformers>=4.51.0",
"flash-attn @ https://github.com/sdbds/flash-attention-for-windows/releases/download/2.8.2/flash_attn-2.8.2+cu128torch2.7.1cxx11abiFALSEfullbackward-cp311-cp311-win_amd64.whl ; sys_platform == 'win32' and python_version == '3.11' and platform_machine == 'AMD64'",
"flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux' and python_version == '3.11' and platform_machine == 'x86_64'",
"flash-attn",
"xxhash",
coderabbitai bot (Contributor) commented:

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, read the pyproject.toml file to see the full dependency section and requires-python
cat -n acestep/third_parts/nano-vllm/pyproject.toml | head -50

Repository: ace-step/ACE-Step-1.5

Length of output: 916


🏁 Script executed:

# Also check if there's a git history or PR context
cd acestep/third_parts/nano-vllm && git log --oneline -10 2>/dev/null || echo "No git history available"

Repository: ace-step/ACE-Step-1.5

Length of output: 137


🏁 Script executed:

# Search for any documentation about flash-attn wheels or platform support
rg -i "flash.attn|triton.windows" --type python --type toml -A 3 -B 3

Repository: ace-step/ACE-Step-1.5

Length of output: 95


🏁 Script executed:

# Search for PR context or discussion about flash-attn wheels
rg "flash.attn|cp311|cp312" --type toml --type md -B 2 -A 2

Repository: ace-step/ACE-Step-1.5

Length of output: 3731


🏁 Script executed:

# Check the git commit for more details about flash-attn update
cd acestep/third_parts/nano-vllm && git show 779bddf --stat 2>/dev/null || echo "Cannot show commit details"

Repository: ace-step/ACE-Step-1.5

Length of output: 50377


🌐 Web query:

flash-attn wheel availability python versions platforms 2024 2025

💡 Result:

Official flash-attn (Dao-AILab) distribution in 2024–2025

  • PyPI does not provide prebuilt wheels for the 2024–2025 releases (it uploads source tarballs only), so pip install flash-attn generally means building from source. This is true for:
    • 2.7.0.post2 (Nov 13, 2024): only flash_attn-2.7.0.post2.tar.gz (no .whl files) and metadata says Python >= 3.8. [2]
    • 2.8.3 (Aug 15, 2025): only flash_attn-2.8.3.tar.gz (no .whl files) and metadata says Python >= 3.9. [1]
  • Upstream states the project is primarily Linux-focused; Windows “might work” but is not the main supported path (and they ask for help setting up prebuilt Windows wheels). [3]

Where wheels were available in practice (community, 2024–2025 timeframe)

Because PyPI is source-only, most people relied on unofficial/community wheel builders:

  • Linux (x86_64 / manylinux): mjun0812/flash-attention-prebuild-wheels publishes large matrices of wheels by Python / PyTorch / CUDA combos. [4]
  • Windows (win_amd64): multiple community repos published wheels for specific setups, e.g.
    • NeedsMoar builds: wheels for Python 3.10 and 3.11 (CUDA-specific). [5]
    • sunsetcoder builds: wheels (notably mentions Python 3.10 availability). [6]

Practical summary for “wheel availability” (2024–2025)

  • If you installed from PyPI in 2024–2025: you were not getting wheels; you were building from source. [1][2]
  • If you needed wheels: you typically used community GitHub/HF wheel repos (Linux coverage tends to be broader than Windows; Windows wheels are often narrower and GPU/CUDA-specific). [3][4][5][6]
  • Aggregators such as flashattn.dev exist to help find matching community wheels by platform/Python/PyTorch/CUDA. [7]

Add platform/Python markers for flash-attn to prevent failed resolution on unsupported configurations.

At Line 18, flash-attn is unconditional in this submodule's pyproject.toml. While the parent project applies these markers at the root level, nano-vllm as a path dependency should also declare them defensively. Community wheels for flash-attn are available only for Linux x86_64 (Python 3.11+) and Windows x86_64 (Python 3.12–3.13); there are no macOS or Windows ARM wheels. If nano-vllm is installed standalone or in an environment where parent markers are not evaluated, installs on Python 3.10, macOS, or Windows ARM will fail during dependency resolution.

Similarly, Line 16 (triton-windows) should exclude Windows ARM by adding platform_machine == 'AMD64' to match the x86_64-only wheel availability.

Additionally, consider raising requires-python from >=3.10 to >=3.11 to align with flash-attn wheel availability, or apply version-specific markers as shown in the parent project's configuration.

✅ Suggested markers (align with root project)
-    "triton-windows>=3.0.0,<3.4; sys_platform == 'win32'",
-    "flash-attn",
+    "triton-windows>=3.0.0,<3.4; sys_platform == 'win32' and platform_machine == 'AMD64'",
+    "flash-attn; (sys_platform == 'linux' and platform_machine == 'x86_64' and python_version >= '3.11') or (sys_platform == 'win32' and platform_machine == 'AMD64' and python_version >= '3.12')",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    "torch>=2.4.0",
-    # Triton and Flash Attention are optional on ROCm (Python 3.12) - SDPA fallback used instead
-    "triton-windows>=3.0.0,<3.4; sys_platform == 'win32' and python_version == '3.11'",
-    "triton>=3.0.0; sys_platform == 'linux' and python_version == '3.11'",
-    "triton>=3.0.0; sys_platform == 'linux'",
-    "triton-windows>=3.0.0,<3.4; sys_platform == 'win32'",
-    "transformers>=4.51.0",
-    "flash-attn @ https://github.com/sdbds/flash-attention-for-windows/releases/download/2.8.2/flash_attn-2.8.2+cu128torch2.7.1cxx11abiFALSEfullbackward-cp311-cp311-win_amd64.whl ; sys_platform == 'win32' and python_version == '3.11' and platform_machine == 'AMD64'",
-    "flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux' and python_version == '3.11' and platform_machine == 'x86_64'",
-    "flash-attn",
-    "xxhash",
+    "torch>=2.4.0",
+    "triton>=3.0.0; sys_platform == 'linux'",
+    "triton-windows>=3.0.0,<3.4; sys_platform == 'win32' and platform_machine == 'AMD64'",
+    "transformers>=4.51.0",
+    "flash-attn; (sys_platform == 'linux' and platform_machine == 'x86_64' and python_version >= '3.11') or (sys_platform == 'win32' and platform_machine == 'AMD64' and python_version >= '3.12')",
+    "xxhash",

ChuxiJ closed this on Feb 16, 2026