Terminal bench artifacts + adding memory layer #1

edgarpavlovsky · 2025-11-06T20:49:32Z

This pull request introduces new documentation and configuration files, restructures the project layout for clarity, and adds support for terminal-bench integration. The main themes are enforcing consistent development rules (especially around Python version and dependency management), improving project organization, and providing detailed instructions for benchmarking and integration.

Development rules and documentation:

Added .ai-rules.md, .claude.md, .cursorrules, and WARP.md to standardize AI assistant usage, enforce the Python 3.12+ requirement, and mandate uv for dependency management across environments and tools. [1] [2] [3]
Updated README.md to clarify runtime state directory usage, configuration file location, and reorganized the project structure to use a src/ directory for source code and a separate state/ directory for runtime data. [1] [2] [3]

Terminal-bench benchmarking integration:

Added benchmark/README.md and benchmark/USAGE.md with comprehensive instructions for installing, running, and troubleshooting Fireteam as a terminal-bench agent, including real-time logging, state isolation, and advanced usage tips. [1] [2]
Added new adapter package files: benchmark/__init__.py and benchmark/adapters/__init__.py, establishing a clear entry point and import structure for terminal-bench integration. [1] [2]

Configuration and environment updates:

Added ANTHROPIC_API_KEY to .env.example to support Anthropic API integration for Claude agents.

Codebase cleanup:

Removed the old agents/base.py file, likely in favor of the new structure under src/agents/.

These changes collectively improve developer onboarding, enforce best practices, and enable robust benchmarking and integration workflows for Fireteam.

- Created 165 total tests (161 unit + 4 new e2e/integration)
- Added test infrastructure (conftest.py, helpers.py)
- Enhanced MemoryManager with embedding_model parameter
- Added lightweight embedding tests for fast CI
- Added E2E hello world subprocess test
- Added terminal-bench integration test
- Created GitHub Actions workflow with 3 jobs
- Updated documentation and added TODO for improvements
- Fixed config.py .env loading from repo root
- All fast tests passing (163/163)
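The ".env loading from repo root" fix is not shown in this conversation. As a rough sketch of one common approach, the snippet below walks upward from a starting path until it finds a repository marker, then parses a `.env` file there. The helper names (`find_repo_root`, `load_env_file`) and the use of `.git` as the marker are assumptions for illustration, not the actual contents of config.py.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = ".git") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no {marker!r} found above {start}")


def load_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    values = {}
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

Resolving the root from the module's own location (for example `find_repo_root(Path(__file__))`) rather than from the current working directory keeps the lookup stable when tests are launched from a subdirectory.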

- Marked all tests that load Qwen3 model (~1.2GB) as @pytest.mark.slow
- This excludes them from CI fast-tests job
- Fast tests now: 127 tests in ~25s (was 163 tests in ~60s)
- Slow memory tests: 36 tests (use heavy model)
- Lightweight tests: 2 tests (use 80MB model, run in fast job)

This prevents CI timeouts and keeps fast tests truly fast.
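For readers unfamiliar with the marker mechanism, here is a minimal sketch of how a `slow` marker can separate heavy tests from the fast job. The test names and bodies are hypothetical placeholders; only `pytest.mark`, `config.addinivalue_line`, and the `-m` selection flag are standard pytest features.

```python
import pytest


def pytest_configure(config):
    # Normally placed in conftest.py: registers the marker so pytest
    # does not warn about an unknown mark.
    config.addinivalue_line("markers", "slow: loads the large embedding model")


@pytest.mark.slow
def test_search_with_full_model():
    # Hypothetical placeholder for a test that would load the heavy model.
    assert True


def test_search_with_lightweight_model():
    # Unmarked, so it stays in the fast selection.
    assert True
```

A fast job would then run `pytest -m "not slow"`, while a separate job can select the heavy tests with `pytest -m slow`.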

- Use subprocess.call() to stream output directly to console
- Add --livestream flag for real-time terminal-bench output
- Simplify assertions to just check return code
- Remove output parsing (terminal-bench handles success/failure)
- This provides much better observability during long test runs
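The pattern described above can be sketched as follows. Because `subprocess.call()` is invoked without redirecting stdout or stderr, the child inherits the parent's streams and its output appears live; the test then only inspects the exit code. The helper name `run_streaming` is hypothetical, and the child command is a stand-in (the Python interpreter) rather than the real terminal-bench invocation, which is not shown in this conversation.

```python
import subprocess
import sys


def run_streaming(cmd):
    """Run `cmd`, letting its output flow straight to the console.

    Returns the child's exit code; nothing is captured or parsed.
    """
    return subprocess.call(cmd)


# Stand-in command: a short Python one-liner instead of the real benchmark.
exit_code = run_streaming([sys.executable, "-c", "print('hello from child')"])
assert exit_code == 0, f"child exited with {exit_code}"
```

The trade-off is that the test can no longer assert on specific output lines, which matches the commit's decision to leave success and failure reporting to the tool being run.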

Improve state manager configuration and ignore benchmark runs

dffdf8c

- Refactor StateManager to use centralized config for state directory
- Add state module initialization
- Add runs/ directory to .gitignore to exclude benchmark artifacts
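The diff for this commit is not part of the conversation, so the following is only a sketch of what "centralized config for state directory" could look like. The `STATE_DIR` constant, the `FIRETEAM_STATE_DIR` environment variable, the `current.json` file name, and the `save`/`load` methods are all assumptions for illustration, not the project's actual API.

```python
import json
import os
from pathlib import Path
from typing import Optional

# Hypothetical central setting: one place decides where runtime state lives.
STATE_DIR = Path(os.environ.get("FIRETEAM_STATE_DIR", "state"))


class StateManager:
    """Persist a small JSON state file under a configurable directory."""

    def __init__(self, state_dir: Optional[Path] = None):
        # Fall back to the central setting instead of a hard-coded path.
        self.state_dir = Path(state_dir) if state_dir is not None else STATE_DIR
        self.state_dir.mkdir(parents=True, exist_ok=True)
        self.state_file = self.state_dir / "current.json"

    def save(self, state: dict) -> None:
        self.state_file.write_text(json.dumps(state))

    def load(self) -> dict:
        if not self.state_file.exists():
            return {}
        return json.loads(self.state_file.read_text())
```

Accepting an explicit `state_dir` override is what makes state isolation straightforward: a benchmark run or a test can point the manager at a temporary directory without touching the default location.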

edgarpavlovsky force-pushed the e/clean-benchmark branch from 20d9e29 to dffdf8c Compare November 6, 2025 20:52

cleaning

6feb21b

edgarpavlovsky changed the title ~~Add terminal-bench run artifacts and improve state manager configuration~~ Terminal bench artifacts, evolving state management into memeory management Nov 6, 2025

edgarpavlovsky added 2 commits November 6, 2025 15:51

refactor

e87140c

adding memory

3581f61

edgarpavlovsky changed the title ~~Terminal bench artifacts, evolving state management into memeory management~~ Terminal bench artifacts + adding memory layer Nov 6, 2025

edgarpavlovsky and others added 12 commits November 6, 2025 18:17

Update CI badge URL with correct org name

bade035

Fix CI: remove duplicate lightweight test run

7b3c9ce

Lightweight tests are already included in fast tests since they're not marked as slow/e2e/integration. Running them separately caused duplication.

Enable e2e and integration tests on e/* branches for validation

29eb35f

Temporarily running all tests on feature branches to validate before merging to main.

Fix CI conditional logic for PR triggers

1cb7a2a

Use github.head_ref for pull request events to properly detect e/* branches.
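The workflow change itself lives in YAML, but the underlying logic can be illustrated in Python using the environment variables GitHub Actions exposes to every step. `GITHUB_HEAD_REF` is set only for pull request events and holds the source branch, whereas `GITHUB_REF_NAME` on a pull request is a merge ref such as `1/merge`, which is why a prefix check against it would miss `e/*` branches. The function names below are hypothetical.

```python
import os
from typing import Mapping


def current_branch(env: Mapping[str, str] = os.environ) -> str:
    """Return the source branch for PR events, else the pushed ref name."""
    head_ref = env.get("GITHUB_HEAD_REF", "")
    if head_ref:
        # Pull request event: this is the branch the PR was opened from.
        return head_ref
    # Push event: GITHUB_HEAD_REF is empty, so use the short ref name.
    return env.get("GITHUB_REF_NAME", "")


def is_feature_branch(branch: str, prefix: str = "e/") -> bool:
    return branch.startswith(prefix)
```

For example, with `GITHUB_HEAD_REF=e/clean-benchmark` and `GITHUB_REF_NAME=1/merge`, `current_branch` returns the feature branch name and `is_feature_branch` is true.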

Temporarily disable terminal-bench integration tests in CI

9be85bc

Terminal-bench tests need local debugging. E2E tests remain enabled.

- Fast tests (127 tests): ✅ Running
- E2E tests (1 test): ✅ Running
- Integration tests: ⏸️ Disabled for now

Remove duplicate CI runs on push to e/* branches

aa1d465

Only run on:

- Pull requests to main
- Direct pushes to main

This prevents duplicate runs (once on push, once on PR).

Fix: Add timeouts and logging to E2E tests

676b250

Co-authored-by: edgarpavlovsky <edgarpavlovsky@gmail.com>
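The commit body is not shown here, so as a hedged illustration only: one standard way to bound a subprocess-based E2E test is `subprocess.run(..., timeout=...)`, which kills the child and raises `subprocess.TimeoutExpired` when the limit is exceeded. The helper name, logger name, and the choice to return `None` on timeout are assumptions.

```python
import logging
import subprocess
import sys

logger = logging.getLogger("e2e")


def run_with_timeout(cmd, timeout_s):
    """Run `cmd`; return its exit code, or None if it exceeded `timeout_s`."""
    logger.info("starting %s (timeout %.1fs)", cmd, timeout_s)
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        logger.error("timed out after %.1fs: %s", timeout_s, cmd)
        return None
    logger.info("finished with exit code %d", completed.returncode)
    return completed.returncode
```

A test can then fail fast with a clear message instead of hanging until the CI job's own limit is reached.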

feat: Install Claude CLI in CI and add timeouts

016f886

Co-authored-by: edgarpavlovsky <edgarpavlovsky@gmail.com>

Merge pull request #2 from darkresearch/cursor/investigate-slow-e2e-t…

386e98c

…ests-in-ci-58aa

edgarpavlovsky merged commit a9f057e into main Nov 7, 2025
4 checks passed

3 participants
