Conversation
|
@atalyaalon This is ready for review now. I have worked out the kinks with the updated CI. |
|
The overall architecture looks good: clean separation of concerns: Rust code is isolated - can be developed/tested independently. Also there's zero breakage if Rust unavailable, fallback to Python. Also used Claude to co-review: PR #57 Review: Rust + CISummaryThis PR adds a Rust-based featurizer implementation to accelerate pairwise feature generation, along with comprehensive CI/CD workflows for building and publishing both Python and Rust packages. The implementation maintains backward compatibility with automatic fallback to Python when the Rust extension is unavailable. Architecture and StructureThis PR demonstrates production-grade architectural work:
The complexity is well-justified for a performance optimization layer. The author clearly thought through the design. Issues to AddressLow Priority (Code Quality)Issue 1: RAYON_NUM_THREADS side effects (featurizer.py:800)Setting os.environ["RAYON_NUM_THREADS"] = str(max(1, n_jobs))
rust_featurizer.featurize_pairs(rust_pairs)Problems:
Best practice: // In lib.rs
#[pymethod]
fn featurize_pairs(&self, pairs: Vec<(String, String)>, num_threads: Option<usize>) -> PyResult<Vec<Vec<f64>>> {
if let Some(n) = num_threads {
rayon::ThreadPoolBuilder::new().num_threads(n).build_global().ok();
}
// ... existing implementation
}Then from Python: rust_features = rust_featurizer.featurize_pairs(rust_pairs, num_threads=n_jobs)Issue 2: Race condition in cache access (feature_port.py:61, 103-127)
_RUST_FEATURIZER_CACHE: "weakref.WeakKeyDictionary[ANDData, object]" = weakref.WeakKeyDictionary()
def _get_rust_featurizer(dataset: ANDData, write_cache: Optional[bool] = None) -> Any:
featurizer = _RUST_FEATURIZER_CACHE.get(dataset) # Thread 1 and 2 both get None
if featurizer is None:
# Both threads create new featurizer
featurizer = s2and_rust.RustFeaturizer.from_dataset(...)
_RUST_FEATURIZER_CACHE[dataset] = featurizer # Race conditionImpact: Multiple threads could create duplicate featurizers (wastes CPU but doesn't corrupt data). This is harmless for most use cases since:
Best practice (following pattern in featurizer.py:50-99 BackgroundCacheWriter): import threading
_RUST_FEATURIZER_CACHE: "weakref.WeakKeyDictionary[ANDData, object]" = weakref.WeakKeyDictionary()
_RUST_FEATURIZER_CACHE_LOCK = threading.Lock()
def _get_rust_featurizer(dataset: ANDData, write_cache: Optional[bool] = None) -> Any:
if s2and_rust is None:
raise RuntimeError(_RUST_NOT_AVAILABLE_MSG)
# Fast path without lock
featurizer = _RUST_FEATURIZER_CACHE.get(dataset)
if featurizer is not None:
return featurizer
# Slow path with lock (double-checked locking)
with _RUST_FEATURIZER_CACHE_LOCK:
# Double-check after acquiring lock
featurizer = _RUST_FEATURIZER_CACHE.get(dataset)
if featurizer is not None:
return featurizer
# Create featurizer (only one thread does this)
use_disk_cache = _env_flag("S2AND_RUST_FEATURIZER_DISK_CACHE", "1") and not _rust_prod_mode(dataset)
# ... rest of implementation
_RUST_FEATURIZER_CACHE[dataset] = featurizer
return featurizerRecommendation: Only address this if you're targeting multi-threaded usage patterns. For typical single-threaded or multiprocess usage, this is not a concern. Issue 3: Module loading complexity (feature_port.py:19-57)The module loading uses 50+ lines of # Current: complex sys.modules manipulation
def _load_s2and_rust(force_reload: bool = False) -> Optional[ModuleType]:
if force_reload:
sys.modules.pop("s2and_rust", None)
sys.modules.pop("s2and_rust.s2and_rust", None)
try:
module = importlib.import_module("s2and_rust")
except Exception:
module = _load_s2and_rust_from_site_packages()
# ... more fallback logicContext: This complexity exists to work around a known maturin/PyO3 pitfall. The maturin project layout guide explicitly warns about this and recommends the "src layout" to avoid it. Major projects like pydantic-core, tokenizers, and tiktoken avoid this by requiring the extension (fail-fast on import) rather than making it optional. Recommendation: Follow pydantic-core's convention - prefix the compiled Rust module with underscore (
This is the standard pattern for PyO3/maturin projects (pydantic-core uses try:
import s2and_rust
except ImportError:
s2and_rust = NoneNote: The current implementation works correctly - it handles the edge case. The suggestion is to simplify by restructuring, not because the code is broken. Issue 4: Repeated environment variable parsing (featurizer.py:426, 781-783)Environment variables are parsed on every function call in Locations with repeated parsing:
Already correct in model.py (good pattern to follow): _USE_RUST_CONSTRAINTS_CACHE: Optional[bool] = None
def _use_rust_constraints() -> bool:
global _USE_RUST_CONSTRAINTS_CACHE
if _USE_RUST_CONSTRAINTS_CACHE is None:
use_rust_feat = os.environ.get("S2AND_USE_RUST_FEATURIZER", "1").lower() in {"1", "true", "yes"}
use_rust_constraints = os.environ.get("S2AND_USE_RUST_CONSTRAINT", "1").lower() in {"1", "true", "yes"}
_USE_RUST_CONSTRAINTS_CACHE = use_rust_feat and use_rust_constraints
return _USE_RUST_CONSTRAINTS_CACHEInconsistent in featurizer.py: def _single_pair_featurize(work_input, index=-1):
# Called for EVERY signature pair - parses env var each time
use_rust = os.environ.get("S2AND_USE_RUST_FEATURIZER", "1").lower() in {"1", "true", "yes"}Impact: The performance impact is negligible (~200ns per call), but this creates inconsistency within the codebase - Recommendation: For consistency, apply the same caching pattern from Issue 5: Disk Cache Invalidation (feature_port.py:62, 91-97)The cache key uses a hardcoded key = (
f"{dataset.name}_v{FEATURIZER_VERSION}_rv{RUST_FEATURIZER_CACHE_VERSION}"
...
)Recommendation: Use the rust_version = getattr(s2and_rust, "__version__", "unknown")Test CoverageExcellent:
Could add:
CI/CD WorkflowThe Positives:
DocumentationThe README updates are clear and comprehensive. Consider adding:
Recommendations SummaryLow Priority (Code Quality)
Medium Priority
VerdictRECOMMEND MERGE - No blocking issues, however Issue 5 (Disk Cache Invalidation) should be addressed to prevent silent correctness issues when Rust extraction logic changes. Consider addressing the low priority issues to improve code quality. The parity testing gives high confidence in correctness. The overall architecture is excellent - this is production-grade work with clean separation of concerns, robust fallback mechanisms, and comprehensive CI/CD. Key strengths:
|
atalyaalon
left a comment
There was a problem hiding this comment.
Overall change structure looks good. See a few comments above.
|
@sergeyf additional medium priority issues found: Medium PriorityIssue 6: Missing
|
| Job | Condition |
|---|---|
s2and-dist (line 128) |
s2and_changed == 'true' || force_build == 'true' |
wheels-windows (line 151) |
rust_changed == 'true' || force_build == 'true' |
wheels-macos (line 181) |
rust_changed == 'true' || force_build == 'true' |
wheels-linux (line 216) |
rust_changed == 'true' || force_build == 'true' |
sdist (line 256) |
rust_changed == 'true' || force_build == 'true' |
Scenario: Only s2and version changes (no Rust changes)
If someone bumps only the pyproject.toml version (e.g., 0.2.0 → 0.2.1) without changing the Rust version:
s2and_changed = truerust_changed = falsepublish_any = true(becauses2and_changedis true)
What happens:
s2and-distruns ✓ (becauses2and_changed == true)wheels-windowsskipped (becauserust_changed == false)wheels-macosskippedwheels-linuxskippedsdistskippedpublishjob condition is met (publish_any == true)...
But the publish job has needs: [..., wheels-windows, wheels-macos, wheels-linux, sdist].
In GitHub Actions, when a job in needs is skipped, the dependent job will also be skipped by default. So the publish job will be skipped even though:
- The intent was to publish
s2and - The
s2and-distartifact was successfully built - The
publishjob's ownifcondition evaluates totrue
Scenario: Only Rust version changes (no s2and changes)
Conversely, if only Rust version changes:
s2and_changed = falserust_changed = true
s2and-distskipped- All wheel jobs run ✓
publishjob skipped becauses2and-distwas skipped
The download step on line 284-288 will also fail if dist-s2and artifact doesn't exist.
Suggested Fixes
Split into two independent publish jobs:
publish-s2and:
if: ... && needs.detect-versions.outputs.s2and_changed == 'true'
needs: [detect-versions, s2and-dist]
publish-rust:
if: ... && needs.detect-versions.outputs.rust_changed == 'true'
needs: [detect-versions, wheels-windows, wheels-macos, wheels-linux, sdist]Please consider addressing issues 6 and 7.
|
Addressed issues! |
I ran a lot of parity tests so this should be exactly identical. You don't need to review every line but maybe the overall change structure?