Mnimi - LLM Cache for Agentic Workflows

Mnimi implements a cache design pattern that modularizes LLM workflows and ensures statistical integrity by enforcing per-component sample independence constraints. Mnimi helps to avoid redundant queries, making runs reproducible, faster, and cheaper, while also enabling deterministic debugging.

This implementation

  • Is single-file, dependency-free, ~300 LOC; to use it, just copy cached_llm.py to your project.
  • Provides a single API function sample(prompt: str, batch: int = 1) -> Iterator[str].
  • Supports agentic workflows like retries that conflict with naive caching.
  • Supports batch sampling (obtaining multiple responses from a single HTTP request).
  • Supports cache slicing (extracting relevant parts of cache).
  • Supports multiple providers (302.ai, Fireworks, CloseAI, XMCP, etc.).

Details of the approach are discussed here:

Statistical Independence Aware Caching for LLM Workflows
Yihan Dai, Dimitrios Stamatios Bouras, Haoxiang Jia, Sergey Mechtaev
LLM4Code@ICSE 2026
https://arxiv.org/abs/2511.22118

Independent, Repeatable and Persistent Sampling

Without caching, each call of sample returns an independent value:

model = AI302("gpt-4o", 1.0)

prompt = "Choose a number between 1 and 1000."

n1 = next(model.sample(prompt))  # "92"
n2 = next(model.sample(prompt))  # "747"

Using Repeatable (an in-memory cache), you can change the semantics of sample so that every call returns exactly the same sequence:

model = Repeatable(model)

n3 = next(model.sample(prompt))  # "131" 
n4 = next(model.sample(prompt))  # "131" - same result

for s in islice(model.sample(prompt), 2):
    print(s)  # "131" "561" - independent within sequence

The model can then be switched back to independent sampling using Independent:

model = Independent(model)

n5 = next(model.sample(prompt))  # "131" 
n6 = next(model.sample(prompt))  # "561"

Note that it will still take values from the underlying in-memory cache.

Persistent is like Repeatable, but its cache persists across runs of your program:

model = AI302("gpt-4o", 1.0)
model = Persistent(model, "~/.llm_cache")

prompt = "Choose a number between 1 and 1000."

n1 = next(model.sample(prompt))  # "92"
n2 = next(model.sample(prompt))  # "92" - same result across runs

Agentic Workflow Patterns

The recommended way to use this cache is as follows:

  1. Use a repeatable/persistent model by default for maximum efficiency and determinism.
  2. Most of your code should not rely on whether model is independent or repeatable.
  3. If your logic relies on independence, start it with model = Independent(model).
  4. If your logic relies on repeatability, start it with model = Repeatable(model).

This example combines persistent caching with retries, which require independent sampling:

model = AI302("gpt-4o", 1.0)
model = Persistent(model, "~/.llm_cache")

ind_model = Independent(model)

for attempt in range(NUM_RETRIES):
    try:
        rep_model = Repeatable(ind_model)
        x = step_1(rep_model)
        y = step_2(rep_model, x)
        z = step_3(rep_model, y)
        break
    except Exception:
        if attempt == NUM_RETRIES - 1:
            raise Exception("did not get good response")

Here, the calls to sample are independent across attempts, but repeatable across the calls to step_1, step_2, and step_3 within each individual attempt.
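As a sketch of what this means in practice, the step bodies might look like the following (the prompts and assertion are purely illustrative, and step_1/step_2 are hypothetical helpers; only sample is part of the API):

def step_1(model):
    # Within one attempt, the Repeatable wrapper returns the same value
    # for the same prompt, regardless of which step asks.
    return next(model.sample("Choose a number between 1 and 1000."))

def step_2(model, x):
    # Re-asking the same prompt yields the same value that step_1 saw.
    assert next(model.sample("Choose a number between 1 and 1000.")) == x
    return next(model.sample(f"Is {x} even or odd?"))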

Cache Slicing

Cache slicing lets you extract a minimal subset of a larger cache for sharing just what's needed for replication. To achieve that, use nested caches:

model = Persistent(model, "/path/to/original_cache/")
model = Persistent(model, "/path/to/sliced_cache")
run_experiment(model)

After execution, sliced_cache contains exactly what was used during this run.

Batch Sampling

Providers differ in max batch size. This API decouples provider limits from your experiment setup:

model = CloseAI("gpt-4o", 1.0, max_batch=10)
model = Persistent(model, "~/.llm_cache")

for r in islice(model.sample(prompt, batch=20), 40):
    process(r)

In this example, because the provider allows only 10 samples per request while the algorithm needs 40 in total (ideally in batches of 20), the code will automatically split the work into four requests of 10 samples each.

Replication Mode

To ensure that your run relies only on the cache (without querying the underlying model), use Persistent(model, cache_dir, replication=True). In case of a cache miss, sample will raise ReplicationCacheMiss.
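For example, a replication run over a shared cache might look like this (the cache path and prompt are illustrative):

model = AI302("gpt-4o", 1.0)
model = Persistent(model, "/path/to/shared_cache", replication=True)

try:
    # Served entirely from the cache; the underlying model is never queried.
    answer = next(model.sample("Choose a number between 1 and 1000."))
except ReplicationCacheMiss:
    # The prompt was not sampled during the original (cached) run.
    ...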

Supported Providers

This code supports any OpenAI-compatible HTTP API, with convenience wrappers for the following providers:

Provider            API key environment variable
302.ai              AI302_API_KEY
CloseAI             CLOSEAI_API_KEY
FireworksAI         FIREWORKS_API_KEY
XMCP                XMCP_API_KEY
OpenAI Compatible   set in code

Since different providers have different names for the same model, the model's cache identifier can be changed by setting alias:

models = [
    CloseAI("gpt-4o", 1.0),
    AI302("deepseek-v3-huoshan", 0.5, alias="deepseek-v3"),
    FireworksAI("accounts/fireworks/models/llama-v3p1-8b-instruct", 1.0, alias="llama-3.1-8b"),
    XMCP("ali/qwen2.5-7b-instruct", 1.0, alias="qwen2.5-7b")
]

Measuring Query Time and Token Use

To get the total time taken by API queries and the total number of input and output tokens:

m = AI302("gpt-4o", 1.0, max_batch=2)

# ... use model m ...

query_time = m.total_query_time()
in_tokens, out_tokens = m.total_token_count()

Structured Output and Auto-Retries

The optional module structured_output provides convenience functions to robustly parse LLM output. It lets you declaratively define the expected output specification:

prompt = """
Choose one programming problem and present it inside <problem>...</problem> tags.

Then provide at least two distinct solution approaches. For each approach, use this exact structure:

- Start with <algorithm>ALGORITHM_NAME</algorithm> on its own line.
- Immediately follow with a fenced Markdown code block containing a complete, runnable implementation.

Ensure each algorithm is meaningfully different (e.g., brute force vs dynamic programming
vs greedy vs divide-and-conquer), not just small variations.
"""

spec = Sequence([
    Tag("problem"),
    Repeat(Sequence([
        Tag("algorithm"),
        Code()
    ]))
])

response = next(model.sample(prompt))

result = parse(spec, response)

print(result)  # ["problem description...", [["algorithm name", "algorithm code"], ...]]

It also enables automatic retries until the LLM output satisfies the specification:

result = query_retry(model, prompt, spec, retries=5)

Furthermore, retries can depend on a data validator:

result = query_retry(model, prompt, spec, retries=5,
                     validator=lambda x: len(x[1]) >= 2)

Development

To run tests, execute

uvx pytest
