
Rust + CI etc #57

Merged
sergeyf merged 11 commits into main from rust on Jan 30, 2026

Conversation

@sergeyf
Collaborator

@sergeyf sergeyf commented Jan 26, 2026

I ran a lot of parity tests so this should be exactly identical. You don't need to review every line but maybe the overall change structure?

@sergeyf sergeyf requested a review from atalyaalon January 26, 2026 05:24
@sergeyf
Collaborator Author

sergeyf commented Jan 26, 2026

@atalyaalon This is ready for review now. I have worked out the kinks with the updated CI.

@atalyaalon

atalyaalon commented Jan 30, 2026

The overall architecture looks good: clean separation of concerns, with the Rust code isolated so it can be developed and tested independently, and zero breakage if Rust is unavailable since it falls back to Python.
I suggest documenting the overall speedup in this PR, either in comments or the README.

Also used Claude to co-review:

PR #57 Review: Rust + CI

Summary

This PR adds a Rust-based featurizer implementation to accelerate pairwise feature generation, along with comprehensive CI/CD workflows for building and publishing both Python and Rust packages. The implementation maintains backward compatibility with automatic fallback to Python when the Rust extension is unavailable.

Architecture and Structure

This PR demonstrates production-grade architectural work:

  • Clean separation of concerns: s2and/ (pure Python) and s2and_rust/ (separate package) with clear boundaries

  • Progressive enhancement pattern: Try Rust → Fallback to Python → Always works (textbook good design)

  • Mature CI/CD: Smart version detection, multi-platform wheels, proper caching

  • Three-tier caching strategy: In-memory (WeakKeyDictionary) → Disk cache → Production mode (memory-only)

  • Comprehensive parity testing: 469 lines of tests ensuring correctness

The complexity is well-justified for a performance optimization layer. The author clearly thought through the design.


Issues to Address

Low Priority (Code Quality)

Issue 1: RAYON_NUM_THREADS side effects (featurizer.py:800)

Setting os.environ["RAYON_NUM_THREADS"] mutates global process state:

os.environ["RAYON_NUM_THREADS"] = str(max(1, n_jobs))
rust_featurizer.featurize_pairs(rust_pairs)

Problems:

  1. This affects ALL subsequent Rayon uses in the process, not just this call
  2. Rayon reads this at thread pool initialization, so changes may not take effect if the pool already exists
  3. In concurrent scenarios, different calls with different n_jobs will race to set this

Best practice:
Set this once at process startup or not at all (let Rayon use its default). If per-call control is needed, the Rust code should accept a thread count parameter:

// In lib.rs, inside the #[pymethods] impl block for RustFeaturizer
fn featurize_pairs(&self, pairs: Vec<(String, String)>, num_threads: Option<usize>) -> PyResult<Vec<Vec<f64>>> {
    if let Some(n) = num_threads {
        // build_global() only takes effect for the first caller; .ok() ignores the error afterwards
        rayon::ThreadPoolBuilder::new().num_threads(n).build_global().ok();
    }
    }
    // ... existing implementation
}

Then from Python:

rust_features = rust_featurizer.featurize_pairs(rust_pairs, num_threads=n_jobs)
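
If changing the Rust signature isn't an option right away, a stopgap on the Python side is to scope the mutation so the previous value is restored afterwards. A minimal sketch (the _rayon_threads helper is illustrative, and note Rayon will still ignore changes once its global pool has been initialized):

import os
from contextlib import contextmanager

@contextmanager
def _rayon_threads(n_jobs: int):
    """Temporarily set RAYON_NUM_THREADS and restore the previous value on exit."""
    prev = os.environ.get("RAYON_NUM_THREADS")
    os.environ["RAYON_NUM_THREADS"] = str(max(1, n_jobs))
    try:
        yield
    finally:
        if prev is None:
            os.environ.pop("RAYON_NUM_THREADS", None)
        else:
            os.environ["RAYON_NUM_THREADS"] = prev

# usage
with _rayon_threads(n_jobs):
    rust_features = rust_featurizer.featurize_pairs(rust_pairs)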

Issue 2: Race condition in cache access (feature_port.py:61, 103-127)

_RUST_FEATURIZER_CACHE uses WeakKeyDictionary without locking:

_RUST_FEATURIZER_CACHE: "weakref.WeakKeyDictionary[ANDData, object]" = weakref.WeakKeyDictionary()

def _get_rust_featurizer(dataset: ANDData, write_cache: Optional[bool] = None) -> Any:
    featurizer = _RUST_FEATURIZER_CACHE.get(dataset)  # Thread 1 and 2 both get None
    if featurizer is None:
        # Both threads create new featurizer
        featurizer = s2and_rust.RustFeaturizer.from_dataset(...)
        _RUST_FEATURIZER_CACHE[dataset] = featurizer  # Race condition

Impact: Multiple threads could create duplicate featurizers (wastes CPU but doesn't corrupt data). This is harmless for most use cases since:

  • Happens only once per dataset
  • Most uses are single-threaded or use multiprocessing (separate processes = no shared state)
  • Creating duplicate featurizers wastes CPU but doesn't cause data corruption

Best practice (following pattern in featurizer.py:50-99 BackgroundCacheWriter):

import threading

_RUST_FEATURIZER_CACHE: "weakref.WeakKeyDictionary[ANDData, object]" = weakref.WeakKeyDictionary()
_RUST_FEATURIZER_CACHE_LOCK = threading.Lock()

def _get_rust_featurizer(dataset: ANDData, write_cache: Optional[bool] = None) -> Any:
    if s2and_rust is None:
        raise RuntimeError(_RUST_NOT_AVAILABLE_MSG)

    # Fast path without lock
    featurizer = _RUST_FEATURIZER_CACHE.get(dataset)
    if featurizer is not None:
        return featurizer

    # Slow path with lock (double-checked locking)
    with _RUST_FEATURIZER_CACHE_LOCK:
        # Double-check after acquiring lock
        featurizer = _RUST_FEATURIZER_CACHE.get(dataset)
        if featurizer is not None:
            return featurizer

        # Create featurizer (only one thread does this)
        use_disk_cache = _env_flag("S2AND_RUST_FEATURIZER_DISK_CACHE", "1") and not _rust_prod_mode(dataset)
        # ... rest of implementation
        _RUST_FEATURIZER_CACHE[dataset] = featurizer

    return featurizer

Recommendation: Only address this if you're targeting multi-threaded usage patterns. For typical single-threaded or multiprocess usage, this is not a concern.


Issue 3: Module loading complexity (feature_port.py:19-57)

The module loading uses 50+ lines of sys.modules manipulation to handle a specific edge case: the s2and_rust/s2and_rust/ directory is a Python package stub that re-exports from the compiled .so. When the repo root is on sys.path but the extension isn't built, Python imports the stub instead of the compiled extension from site-packages.

# Current: complex sys.modules manipulation
def _load_s2and_rust(force_reload: bool = False) -> Optional[ModuleType]:
    if force_reload:
        sys.modules.pop("s2and_rust", None)
        sys.modules.pop("s2and_rust.s2and_rust", None)
    try:
        module = importlib.import_module("s2and_rust")
    except Exception:
        module = _load_s2and_rust_from_site_packages()
    # ... more fallback logic

Context: This complexity exists to work around a known maturin/PyO3 pitfall. The maturin project layout guide explicitly warns about this and recommends the "src layout" to avoid it. Major projects like pydantic-core, tokenizers, and tiktoken avoid this by requiring the extension (fail-fast on import) rather than making it optional.

Recommendation: Follow pydantic-core's convention - prefix the compiled Rust module with underscore (_s2and_rust):

  1. In Cargo.toml, rename the lib to _s2and_rust
  2. Have the Python package import from ._s2and_rust

This is the standard pattern for PyO3/maturin projects (pydantic-core uses _pydantic_core, etc.). It eliminates shadowing and allows the simple import pattern:

try:
    import s2and_rust
except ImportError:
    s2and_rust = None
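
For illustration, a minimal sketch of the Python stub package after such a rename (this assumes the compiled module is exposed as _s2and_rust and that RustFeaturizer is the main export; the actual export list depends on the crate):

# s2and_rust/__init__.py
from ._s2and_rust import RustFeaturizer  # compiled PyO3 extension; no shadowing possible

__all__ = ["RustFeaturizer"]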

Note: The current implementation works correctly - it handles the edge case. The suggestion is to simplify by restructuring, not because the code is broken.


Issue 4: Repeated environment variable parsing (featurizer.py:426, 781-783)

Environment variables are parsed on every function call in featurizer.py, while model.py already implements the correct caching pattern via _use_rust_constraints().

Locations with repeated parsing:

  • featurizer.py:426 (S2AND_USE_RUST_FEATURIZER) - parsed for every signature pair in _single_pair_featurize
  • featurizer.py:781-783 (S2AND_RUST_BATCH, S2AND_RUST_BATCH_THRESHOLD) - parsed on every batch

Already correct in model.py (good pattern to follow):

_USE_RUST_CONSTRAINTS_CACHE: Optional[bool] = None

def _use_rust_constraints() -> bool:
    global _USE_RUST_CONSTRAINTS_CACHE
    if _USE_RUST_CONSTRAINTS_CACHE is None:
        use_rust_feat = os.environ.get("S2AND_USE_RUST_FEATURIZER", "1").lower() in {"1", "true", "yes"}
        use_rust_constraints = os.environ.get("S2AND_USE_RUST_CONSTRAINT", "1").lower() in {"1", "true", "yes"}
        _USE_RUST_CONSTRAINTS_CACHE = use_rust_feat and use_rust_constraints
    return _USE_RUST_CONSTRAINTS_CACHE

Inconsistent in featurizer.py:

def _single_pair_featurize(work_input, index=-1):
    # Called for EVERY signature pair - parses env var each time
    use_rust = os.environ.get("S2AND_USE_RUST_FEATURIZER", "1").lower() in {"1", "true", "yes"}

Impact: The performance impact is negligible (~200ns per call), but this creates inconsistency within the codebase - model.py caches correctly while featurizer.py does not.

Recommendation: For consistency, apply the same caching pattern from model.py to featurizer.py. This is a code quality/consistency issue rather than a performance concern.
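
A minimal sketch of that pattern applied to featurizer.py, using functools.lru_cache instead of a module-level global (the helper name is illustrative):

import os
from functools import lru_cache

@lru_cache(maxsize=None)
def _use_rust_featurizer() -> bool:
    # Parsed once per process, mirroring model.py's cached pattern
    return os.environ.get("S2AND_USE_RUST_FEATURIZER", "1").lower() in {"1", "true", "yes"}

def _single_pair_featurize(work_input, index=-1):
    use_rust = _use_rust_featurizer()
    # ... rest of implementation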


Issue 5: Disk Cache Invalidation (feature_port.py:62, 91-97)

The cache key uses a hardcoded RUST_FEATURIZER_CACHE_VERSION = 2. If extraction logic in the Rust code changes without manually bumping this constant, stale caches produce silent wrong results.

key = (
    f"{dataset.name}_v{FEATURIZER_VERSION}_rv{RUST_FEATURIZER_CACHE_VERSION}"
    ...
)

Recommendation: Use the s2and_rust package version instead of a manual constant. This way the cache auto-invalidates on every release without requiring manual bookkeeping:

rust_version = getattr(s2and_rust, "__version__", "unknown")
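
The cache key would then pick up the package version automatically (sketch; the elided key components stay unchanged):

key = (
    f"{dataset.name}_v{FEATURIZER_VERSION}_rv{rust_version}"
    # ... remaining key components unchanged
)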

Test Coverage

Excellent:

  • Comprehensive parity tests covering various signature/paper configurations
  • Good use of pytest fixtures and parametrization
  • Tests verify edge cases (empty counters, missing specter embeddings, etc.)

Could add:

  • Cache serialization round-trip test
  • Tests for concurrent access patterns (if targeting multi-threaded usage)
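
A sketch of the latter (this assumes the locking from Issue 2 is in place, that the Rust extension is installed, and that the dataset fixture and the s2and.feature_port import path exist as named):

from concurrent.futures import ThreadPoolExecutor

from s2and.feature_port import _get_rust_featurizer

def test_rust_featurizer_cache_is_shared_across_threads(dataset):
    # Every thread should receive the same cached featurizer instance for the same dataset
    with ThreadPoolExecutor(max_workers=8) as pool:
        featurizers = list(pool.map(lambda _: _get_rust_featurizer(dataset), range(32)))
    assert all(f is featurizers[0] for f in featurizers)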

CI/CD Workflow

The release-rust.yml workflow is well-structured with good separation of concerns:

Positives:

  • Version detection logic prevents unnecessary builds
  • Proper matrix builds for multiple Python versions and platforms
  • Good use of caching for Cargo builds
  • Inline Python script works fine (common pattern in GitHub Actions)

Documentation

The README updates are clear and comprehensive. Consider adding:

  • Performance benchmarks showing speedup vs. Python (would justify the complexity)

Recommendations Summary

Low Priority (Code Quality)

  1. Issue 1: RAYON_NUM_THREADS side effects - Consider passing thread count as parameter instead of mutating env
  2. Issue 2: Race condition in cache - Add locking only if targeting multi-threaded patterns
  3. Issue 3: Module loading complexity - Prefix compiled module with underscore (_s2and_rust) per pydantic-core convention
  4. Issue 4: Repeated environment variable parsing - Apply model.py's caching pattern to featurizer.py for consistency

Medium Priority

  1. Issue 5: Disk Cache Invalidation - Use s2and_rust.__version__ in cache key instead of manual constant

Verdict

RECOMMEND MERGE. There are no blocking issues; however, Issue 5 (Disk Cache Invalidation) should be addressed to prevent silent correctness issues when the Rust extraction logic changes. Consider addressing the low-priority issues to improve code quality.

The parity testing gives high confidence in correctness. The overall architecture is excellent - this is production-grade work with clean separation of concerns, robust fallback mechanisms, and comprehensive CI/CD.

Key strengths:

  • Clean two-package structure (s2and + s2and_rust)
  • Progressive enhancement with graceful degradation
  • Comprehensive parity testing
  • Well-designed caching strategy
  • Mature CI/CD with smart build triggers


@atalyaalon atalyaalon left a comment


Overall change structure looks good. See a few comments above.

@atalyaalon

atalyaalon commented Jan 30, 2026

@sergeyf additional medium priority issues found:

Medium Priority

Issue 6: Missing incremental_dont_use_cluster_seeds Parameter

File: s2and/model.py:1119-1126

Code:

value = _get_constraint_value(
    dataset,
    unassigned_signature,
    signature,
    dont_merge_cluster_seeds=self.dont_merge_cluster_seeds,
    rust_featurizer=rust_featurizer,
    use_rust_constraints=use_rust_constraints,
)

Problem: In predict_incremental_helper, the call to _get_constraint_value is missing the incremental_dont_use_cluster_seeds parameter, so the Python get_constraint ends up being called with incremental_dont_use_cluster_seeds defaulting to False. Is that the desired behavior, or is it a potential bug?
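
If bypassing cluster seeds is intended in the incremental path, a sketch of making that explicit (this assumes _get_constraint_value forwards the keyword on to get_constraint; whether True or False is correct is exactly the question above):

value = _get_constraint_value(
    dataset,
    unassigned_signature,
    signature,
    dont_merge_cluster_seeds=self.dont_merge_cluster_seeds,
    incremental_dont_use_cluster_seeds=True,  # pass explicitly so the intended behavior is documented
    rust_featurizer=rust_featurizer,
    use_rust_constraints=use_rust_constraints,
)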

Issue 7: Publish Job Dependency Failure

File: .github/workflows/release-rust.yml:273-305

The Problem

The publish job on line 276 has this dependency chain:

needs: [detect-versions, s2and-dist, wheels-windows, wheels-macos, wheels-linux, sdist]

But the conditions for each job are different:

  • s2and-dist (line 128): s2and_changed == 'true' || force_build == 'true'
  • wheels-windows (line 151): rust_changed == 'true' || force_build == 'true'
  • wheels-macos (line 181): rust_changed == 'true' || force_build == 'true'
  • wheels-linux (line 216): rust_changed == 'true' || force_build == 'true'
  • sdist (line 256): rust_changed == 'true' || force_build == 'true'

Scenario: Only s2and version changes (no Rust changes)

If someone bumps only the pyproject.toml version (e.g., 0.2.0 → 0.2.1) without changing the Rust version:

  • s2and_changed = true
  • rust_changed = false
  • publish_any = true (because s2and_changed is true)

What happens:

  1. s2and-dist runs ✓ (because s2and_changed == true)
  2. wheels-windows skipped (because rust_changed == false)
  3. wheels-macos skipped
  4. wheels-linux skipped
  5. sdist skipped
  6. publish job condition is met (publish_any == true)...

But the publish job has needs: [..., wheels-windows, wheels-macos, wheels-linux, sdist].

In GitHub Actions, when a job in needs is skipped, the dependent job will also be skipped by default. So the publish job will be skipped even though:

  • The intent was to publish s2and
  • The s2and-dist artifact was successfully built
  • The publish job's own if condition evaluates to true

Scenario: Only Rust version changes (no s2and changes)

Conversely, if only Rust version changes:

  • s2and_changed = false
  • rust_changed = true

What happens:

  1. s2and-dist runs skipped
  2. All wheel jobs run
  3. publish job skipped because s2and-dist was skipped

The download step on lines 284-288 will also fail if the dist-s2and artifact doesn't exist.

Suggested Fixes

Split into two independent publish jobs:

publish-s2and:
  if: ... && needs.detect-versions.outputs.s2and_changed == 'true'
  needs: [detect-versions, s2and-dist]

publish-rust:
  if: ... && needs.detect-versions.outputs.rust_changed == 'true'
  needs: [detect-versions, wheels-windows, wheels-macos, wheels-linux, sdist]

Please consider addressing issues 6 and 7.

@sergeyf
Collaborator Author

sergeyf commented Jan 30, 2026

Addressed issues!

@sergeyf sergeyf merged commit 7188ea0 into main Jan 30, 2026
17 checks passed
@sergeyf sergeyf deleted the rust branch January 31, 2026 02:07