
Rust + CI etc #57

Merged
sergeyf merged 11 commits into main from rust on Jan 30, 2026

Conversation

@sergeyf
Collaborator

@sergeyf sergeyf commented Jan 26, 2026

I ran a lot of parity tests so this should be exactly identical. You don't need to review every line but maybe the overall change structure?

@sergeyf sergeyf requested a review from atalyaalon January 26, 2026 05:24
@sergeyf
Collaborator Author

sergeyf commented Jan 26, 2026

@atalyaalon This is ready for review now. I have worked out the kinks with the updated CI.

@atalyaalon

atalyaalon commented Jan 30, 2026

The overall architecture looks good: clean separation of concerns, with the Rust code isolated so it can be developed and tested independently, and zero breakage if Rust is unavailable since it falls back to Python.
I suggest documenting the overall speedup in this PR, either in comments or the README.

Also used Claude to co-review:

PR #57 Review: Rust + CI

Summary

This PR adds a Rust-based featurizer implementation to accelerate pairwise feature generation, along with comprehensive CI/CD workflows for building and publishing both Python and Rust packages. The implementation maintains backward compatibility with automatic fallback to Python when the Rust extension is unavailable.

Architecture and Structure

This PR demonstrates production-grade architectural work:

  • Clean separation of concerns: s2and/ (pure Python) and s2and_rust/ (separate package) with clear boundaries

  • Progressive enhancement pattern: Try Rust → Fallback to Python → Always works (textbook good design)

  • Mature CI/CD: Smart version detection, multi-platform wheels, proper caching

  • Three-tier caching strategy: In-memory (WeakKeyDictionary) → Disk cache → Production mode (memory-only)

  • Comprehensive parity testing: 469 lines of tests ensuring correctness

The complexity is well-justified for a performance optimization layer. The author clearly thought through the design.


Issues to Address

Low Priority (Code Quality)

Issue 1: RAYON_NUM_THREADS side effects (featurizer.py:800)

Setting os.environ["RAYON_NUM_THREADS"] mutates global process state:

os.environ["RAYON_NUM_THREADS"] = str(max(1, n_jobs))
rust_featurizer.featurize_pairs(rust_pairs)

Problems:

  1. This affects ALL subsequent Rayon uses in the process, not just this call
  2. Rayon reads this at thread pool initialization, so changes may not take effect if the pool already exists
  3. In concurrent scenarios, different calls with different n_jobs will race to set this

Best practice:
Set this once at process startup or not at all (let Rayon use its default). If per-call control is needed, the Rust code should accept a thread count parameter:

// In lib.rs, inside the #[pymethods] impl block for RustFeaturizer
fn featurize_pairs(&self, pairs: Vec<(String, String)>, num_threads: Option<usize>) -> PyResult<Vec<Vec<f64>>> {
    if let Some(n) = num_threads {
        // build_global() only takes effect for the first caller; .ok() ignores the error afterwards
        rayon::ThreadPoolBuilder::new().num_threads(n).build_global().ok();
    }
    }
    // ... existing implementation
}

Then from Python:

rust_features = rust_featurizer.featurize_pairs(rust_pairs, num_threads=n_jobs)
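
If changing the Rust signature isn't an option right away, a stopgap on the Python side is to scope the mutation so the previous value is restored afterwards. A minimal sketch (the _rayon_threads helper is illustrative, and note Rayon will still ignore changes once its global pool has been initialized):

import os
from contextlib import contextmanager

@contextmanager
def _rayon_threads(n_jobs: int):
    """Temporarily set RAYON_NUM_THREADS and restore the previous value on exit."""
    prev = os.environ.get("RAYON_NUM_THREADS")
    os.environ["RAYON_NUM_THREADS"] = str(max(1, n_jobs))
    try:
        yield
    finally:
        if prev is None:
            os.environ.pop("RAYON_NUM_THREADS", None)
        else:
            os.environ["RAYON_NUM_THREADS"] = prev

# usage
with _rayon_threads(n_jobs):
    rust_features = rust_featurizer.featurize_pairs(rust_pairs)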

Issue 2: Race condition in cache access (feature_port.py:61, 103-127)

_RUST_FEATURIZER_CACHE uses WeakKeyDictionary without locking:

_RUST_FEATURIZER_CACHE: "weakref.WeakKeyDictionary[ANDData, object]" = weakref.WeakKeyDictionary()

def _get_rust_featurizer(dataset: ANDData, write_cache: Optional[bool] = None) -> Any:
    featurizer = _RUST_FEATURIZER_CACHE.get(dataset)  # Thread 1 and 2 both get None
    if featurizer is None:
        # Both threads create new featurizer
        featurizer = s2and_rust.RustFeaturizer.from_dataset(...)
        _RUST_FEATURIZER_CACHE[dataset] = featurizer  # Race condition

Impact: Multiple threads could create duplicate featurizers (wastes CPU but doesn't corrupt data). This is harmless for most use cases since:

  • Happens only once per dataset
  • Most uses are single-threaded or use multiprocessing (separate processes = no shared state)
  • Creating duplicate featurizers wastes CPU but doesn't cause data corruption

Best practice (following pattern in featurizer.py:50-99 BackgroundCacheWriter):

import threading

_RUST_FEATURIZER_CACHE: "weakref.WeakKeyDictionary[ANDData, object]" = weakref.WeakKeyDictionary()
_RUST_FEATURIZER_CACHE_LOCK = threading.Lock()

def _get_rust_featurizer(dataset: ANDData, write_cache: Optional[bool] = None) -> Any:
    if s2and_rust is None:
        raise RuntimeError(_RUST_NOT_AVAILABLE_MSG)

    # Fast path without lock
    featurizer = _RUST_FEATURIZER_CACHE.get(dataset)
    if featurizer is not None:
        return featurizer

    # Slow path with lock (double-checked locking)
    with _RUST_FEATURIZER_CACHE_LOCK:
        # Double-check after acquiring lock
        featurizer = _RUST_FEATURIZER_CACHE.get(dataset)
        if featurizer is not None:
            return featurizer

        # Create featurizer (only one thread does this)
        use_disk_cache = _env_flag("S2AND_RUST_FEATURIZER_DISK_CACHE", "1") and not _rust_prod_mode(dataset)
        # ... rest of implementation
        _RUST_FEATURIZER_CACHE[dataset] = featurizer

    return featurizer

Recommendation: Only address this if you're targeting multi-threaded usage patterns. For typical single-threaded or multiprocess usage, this is not a concern.


Issue 3: Module loading complexity (feature_port.py:19-57)

The module loading uses 50+ lines of sys.modules manipulation to handle a specific edge case: the s2and_rust/s2and_rust/ directory is a Python package stub that re-exports from the compiled .so. When the repo root is on sys.path but the extension isn't built, Python imports the stub instead of the compiled extension from site-packages.

# Current: complex sys.modules manipulation
def _load_s2and_rust(force_reload: bool = False) -> Optional[ModuleType]:
    if force_reload:
        sys.modules.pop("s2and_rust", None)
        sys.modules.pop("s2and_rust.s2and_rust", None)
    try:
        module = importlib.import_module("s2and_rust")
    except Exception:
        module = _load_s2and_rust_from_site_packages()
    # ... more fallback logic

Context: This complexity exists to work around a known maturin/PyO3 pitfall. The maturin project layout guide explicitly warns about this and recommends the "src layout" to avoid it. Major projects like pydantic-core, tokenizers, and tiktoken avoid this by requiring the extension (fail-fast on import) rather than making it optional.

Recommendation: Follow pydantic-core's convention - prefix the compiled Rust module with underscore (_s2and_rust):

  1. In Cargo.toml, rename the lib to _s2and_rust
  2. Have the Python package import from ._s2and_rust

This is the standard pattern for PyO3/maturin projects (pydantic-core uses _pydantic_core, etc.). It eliminates shadowing and allows the simple import pattern:

try:
    import s2and_rust
except ImportError:
    s2and_rust = None
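
For illustration, a minimal sketch of the Python stub package after such a rename (this assumes the compiled module is exposed as _s2and_rust and that RustFeaturizer is the main export; the actual export list depends on the crate):

# s2and_rust/__init__.py
from ._s2and_rust import RustFeaturizer  # compiled PyO3 extension; no shadowing possible

__all__ = ["RustFeaturizer"]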

Note: The current implementation works correctly - it handles the edge case. The suggestion is to simplify by restructuring, not because the code is broken.


Issue 4: Repeated environment variable parsing (featurizer.py:426, 781-783)

Environment variables are parsed on every function call in featurizer.py, while model.py already implements the correct caching pattern via _use_rust_constraints().

Locations with repeated parsing:

  • featurizer.py:426 (S2AND_USE_RUST_FEATURIZER) - parsed for every signature pair in _single_pair_featurize
  • featurizer.py:781-783 (S2AND_RUST_BATCH, S2AND_RUST_BATCH_THRESHOLD) - parsed on every batch

Already correct in model.py (good pattern to follow):

_USE_RUST_CONSTRAINTS_CACHE: Optional[bool] = None

def _use_rust_constraints() -> bool:
    global _USE_RUST_CONSTRAINTS_CACHE
    if _USE_RUST_CONSTRAINTS_CACHE is None:
        use_rust_feat = os.environ.get("S2AND_USE_RUST_FEATURIZER", "1").lower() in {"1", "true", "yes"}
        use_rust_constraints = os.environ.get("S2AND_USE_RUST_CONSTRAINT", "1").lower() in {"1", "true", "yes"}
        _USE_RUST_CONSTRAINTS_CACHE = use_rust_feat and use_rust_constraints
    return _USE_RUST_CONSTRAINTS_CACHE

Inconsistent in featurizer.py:

def _single_pair_featurize(work_input, index=-1):
    # Called for EVERY signature pair - parses env var each time
    use_rust = os.environ.get("S2AND_USE_RUST_FEATURIZER", "1").lower() in {"1", "true", "yes"}

Impact: The performance impact is negligible (~200ns per call), but this creates inconsistency within the codebase - model.py caches correctly while featurizer.py does not.

Recommendation: For consistency, apply the same caching pattern from model.py to featurizer.py. This is a code quality/consistency issue rather than a performance concern.
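
A minimal sketch of that pattern applied to featurizer.py, using functools.lru_cache instead of a module-level global (the helper name is illustrative):

import os
from functools import lru_cache

@lru_cache(maxsize=None)
def _use_rust_featurizer() -> bool:
    # Parsed once per process, mirroring model.py's cached pattern
    return os.environ.get("S2AND_USE_RUST_FEATURIZER", "1").lower() in {"1", "true", "yes"}

def _single_pair_featurize(work_input, index=-1):
    use_rust = _use_rust_featurizer()
    # ... rest of implementation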


Issue 5: Disk Cache Invalidation (feature_port.py:62, 91-97)

The cache key uses a hardcoded RUST_FEATURIZER_CACHE_VERSION = 2. If extraction logic in the Rust code changes without manually bumping this constant, stale caches produce silent wrong results.

key = (
    f"{dataset.name}_v{FEATURIZER_VERSION}_rv{RUST_FEATURIZER_CACHE_VERSION}"
    ...
)

Recommendation: Use the s2and_rust package version instead of a manual constant. This way the cache auto-invalidates on every release without requiring manual bookkeeping:

rust_version = getattr(s2and_rust, "__version__", "unknown")
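
The cache key would then pick up the package version automatically (sketch; the elided key components stay unchanged):

key = (
    f"{dataset.name}_v{FEATURIZER_VERSION}_rv{rust_version}"
    # ... remaining key components unchanged
)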

Test Coverage

Excellent:

  • Comprehensive parity tests covering various signature/paper configurations
  • Good use of pytest fixtures and parametrization
  • Tests verify edge cases (empty counters, missing specter embeddings, etc.)

Could add:

  • Cache serialization round-trip test
  • Tests for concurrent access patterns (if targeting multi-threaded usage)
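
A sketch of the latter (this assumes the locking from Issue 2 is in place, that the Rust extension is installed, and that the dataset fixture and the s2and.feature_port import path exist as named):

from concurrent.futures import ThreadPoolExecutor

from s2and.feature_port import _get_rust_featurizer

def test_rust_featurizer_cache_is_shared_across_threads(dataset):
    # Every thread should receive the same cached featurizer instance for the same dataset
    with ThreadPoolExecutor(max_workers=8) as pool:
        featurizers = list(pool.map(lambda _: _get_rust_featurizer(dataset), range(32)))
    assert all(f is featurizers[0] for f in featurizers)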

CI/CD Workflow

The release-rust.yml workflow is well-structured with good separation of concerns:

Positives:

  • Version detection logic prevents unnecessary builds
  • Proper matrix builds for multiple Python versions and platforms
  • Good use of caching for Cargo builds
  • Inline Python script works fine (common pattern in GitHub Actions)

Documentation

The README updates are clear and comprehensive. Consider adding:

  • Performance benchmarks showing speedup vs. Python (would justify the complexity)

Recommendations Summary

Low Priority (Code Quality)

  1. Issue 1: RAYON_NUM_THREADS side effects - Consider passing thread count as parameter instead of mutating env
  2. Issue 2: Race condition in cache - Add locking only if targeting multi-threaded patterns
  3. Issue 3: Module loading complexity - Prefix compiled module with underscore (_s2and_rust) per pydantic-core convention
  4. Issue 4: Repeated environment variable parsing - Apply model.py's caching pattern to featurizer.py for consistency

Medium Priority

  1. Issue 5: Disk Cache Invalidation - Use s2and_rust.__version__ in cache key instead of manual constant

Verdict

RECOMMEND MERGE. There are no blocking issues; however, Issue 5 (Disk Cache Invalidation) should be addressed to prevent silent correctness issues when the Rust extraction logic changes. Consider addressing the low-priority issues to improve code quality.

The parity testing gives high confidence in correctness. The overall architecture is excellent - this is production-grade work with clean separation of concerns, robust fallback mechanisms, and comprehensive CI/CD.

Key strengths:

  • Clean two-package structure (s2and + s2and_rust)
  • Progressive enhancement with graceful degradation
  • Comprehensive parity testing
  • Well-designed caching strategy
  • Mature CI/CD with smart build triggers


@atalyaalon atalyaalon left a comment


Overall change structure looks good. See a few comments above.

@atalyaalon

atalyaalon commented Jan 30, 2026

@sergeyf additional medium priority issues found:

Medium Priority

Issue 6: Missing incremental_dont_use_cluster_seeds Parameter

File: s2and/model.py:1119-1126

Code:

value = _get_constraint_value(
    dataset,
    unassigned_signature,
    signature,
    dont_merge_cluster_seeds=self.dont_merge_cluster_seeds,
    rust_featurizer=rust_featurizer,
    use_rust_constraints=use_rust_constraints,
)

Problem: In predict_incremental_helper, the call to _get_constraint_value is missing the incremental_dont_use_cluster_seeds parameter, so the Python get_constraint ends up being called with incremental_dont_use_cluster_seeds defaulting to False. Is that the desired behavior, or is it a potential bug?
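
If bypassing cluster seeds is intended in the incremental path, a sketch of making that explicit (this assumes _get_constraint_value forwards the keyword on to get_constraint; whether True or False is correct is exactly the question above):

value = _get_constraint_value(
    dataset,
    unassigned_signature,
    signature,
    dont_merge_cluster_seeds=self.dont_merge_cluster_seeds,
    incremental_dont_use_cluster_seeds=True,  # pass explicitly so the intended behavior is documented
    rust_featurizer=rust_featurizer,
    use_rust_constraints=use_rust_constraints,
)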

Issue 7: Publish Job Dependency Failure

File: .github/workflows/release-rust.yml:273-305

The Problem

The publish job on line 276 has this dependency chain:

needs: [detect-versions, s2and-dist, wheels-windows, wheels-macos, wheels-linux, sdist]

But the conditions for each job are different:

  • s2and-dist (line 128): s2and_changed == 'true' || force_build == 'true'
  • wheels-windows (line 151): rust_changed == 'true' || force_build == 'true'
  • wheels-macos (line 181): rust_changed == 'true' || force_build == 'true'
  • wheels-linux (line 216): rust_changed == 'true' || force_build == 'true'
  • sdist (line 256): rust_changed == 'true' || force_build == 'true'

Scenario: Only s2and version changes (no Rust changes)

If someone bumps only the pyproject.toml version (e.g., 0.2.0 → 0.2.1) without changing the Rust version:

  • s2and_changed = true
  • rust_changed = false
  • publish_any = true (because s2and_changed is true)

What happens:

  1. s2and-dist runs ✓ (because s2and_changed == true)
  2. wheels-windows skipped (because rust_changed == false)
  3. wheels-macos skipped
  4. wheels-linux skipped
  5. sdist skipped
  6. publish job condition is met (publish_any == true)...

But the publish job has needs: [..., wheels-windows, wheels-macos, wheels-linux, sdist].

In GitHub Actions, when a job in needs is skipped, the dependent job will also be skipped by default. So the publish job will be skipped even though:

  • The intent was to publish s2and
  • The s2and-dist artifact was successfully built
  • The publish job's own if condition evaluates to true

Scenario: Only Rust version changes (no s2and changes)

Conversely, if only Rust version changes:

  • s2and_changed = false
  • rust_changed = true

What happens:

  1. s2and-dist runs skipped
  2. All wheel jobs run
  3. publish job skipped because s2and-dist was skipped

The download step on lines 284-288 will also fail if the dist-s2and artifact doesn't exist.

Suggested Fixes

Split into two independent publish jobs:

publish-s2and:
  if: ... && needs.detect-versions.outputs.s2and_changed == 'true'
  needs: [detect-versions, s2and-dist]

publish-rust:
  if: ... && needs.detect-versions.outputs.rust_changed == 'true'
  needs: [detect-versions, wheels-windows, wheels-macos, wheels-linux, sdist]

Please consider addressing issues 6 and 7.

@sergeyf
Collaborator Author

sergeyf commented Jan 30, 2026

Addressed issues!

@sergeyf sergeyf merged commit 7188ea0 into main Jan 30, 2026
17 checks passed
@sergeyf sergeyf deleted the rust branch January 31, 2026 02:07