This is the official repository for the paper:
MMA: Multimodal Memory Agent
Yihao Lu*, Wanru Cheng*, Zeyu Zhang*†, Hao Tang‡
*Equal contribution. †Project lead. ‡Corresponding author.
Note
Long-horizon multimodal agents often suffer from "Blind Faith"—relying on stale, low-credibility, or conflicting retrieved memories. This triggers overconfident errors in safety-critical scenarios.
We present MMA (Multimodal Memory Agent), a confidence-aware architecture that assigns dynamic reliability scores to memories based on:
- Source Credibility ($S$): Prioritizing trusted users over gossip.
- Temporal Decay ($T$): Discounting stale information.
- Conflict-Aware Consensus ($C_{con}$): Reweighting evidence based on semantic support.
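The exact scoring formula is defined in the paper and the agent code; purely as a non-authoritative sketch of how the three factors could combine into a single reliability weight, the snippet below multiplies them with an exponential time decay. The decay form, the multiplicative combination, and every constant here are assumptions for illustration, not MMA's implementation.

```python
import math

def reliability_score(source_cred: float, age_days: float, consensus: float,
                      decay_rate: float = 0.1) -> float:
    """Illustrative combination of the three factors above (not MMA's exact formula).

    source_cred -- credibility of the memory's source, in [0, 1]      (S)
    age_days    -- how old the memory is                              (drives T)
    consensus   -- conflict-aware agreement with other memories [0,1] (C_con)
    decay_rate  -- hypothetical decay constant, an assumption here
    """
    temporal = math.exp(-decay_rate * age_days)  # T: stale memories count less
    return source_cred * temporal * consensus    # S * T * C_con, multiplicative for illustration

# Example: a week-old memory from a trusted user with moderate semantic support
print(round(reliability_score(source_cred=0.9, age_days=7.0, consensus=0.6), 3))
```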
We also introduce MMA-Bench, a diagnostic benchmark designed to probe Belief Dynamics and Cognitive Robustness. Using this framework, we uncover the "Visual Placebo Effect", where agents become overconfident when presented with irrelevant visual noise.
We recommend creating a unified environment for both the agent and the benchmark.
# 1. Create environment
conda create -n mma python=3.10 -y
conda activate mma
# 2. Install dependencies (Split by module)
pip install -r MMA/requirements.txt # For the Agent (RAG, VectorDB)
pip install -r MMA-Bench/requirements.txt # For Benchmark Gen & Eval (OpenAI, DashScope)

You must configure API keys and data paths before running experiments.
Option A: Environment Variables (Recommended)
Create a .env file in the root directory:
# --- For MMA Agent, Baselines & Judge ---
OPENAI_API_KEY="sk-..." # For GPT-4.1-mini (Baseline) & GPT-4o-mini (Judge)
# --- For MMA-Bench Generation & Qwen Baselines ---
DASHSCOPE_API_KEY="sk-..." # For Qwen3-Max (Logic) & Qwen-Image-Plus (Vision) & Qwen3-VL-Plus (Baseline)

Option B: Dataset Preparation
- LOCOMO: Download `locomo10.json` and place it in `MMA/public_evaluations/data/`.
- FEVER: Download `paper_dev.jsonl` and place it in `MMA/public_evaluations/data/`.
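Before launching experiments, a quick sanity check that keys and datasets are in place can save a failed run. The script below is only a convenience sketch: it assumes `python-dotenv` is available (install it with `pip install python-dotenv` if the requirements files do not pull it in) and that it is run from the repository root.

```python
import os
from pathlib import Path

from dotenv import load_dotenv  # pip install python-dotenv if missing

load_dotenv()  # reads the .env file in the current directory

# API keys described in Option A
for key in ("OPENAI_API_KEY", "DASHSCOPE_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")

# Datasets described above
data_dir = Path("MMA/public_evaluations/data")
for name in ("locomo10.json", "paper_dev.jsonl"):
    print(f"{name}: {'found' if (data_dir / name).exists() else 'MISSING'}")
```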
MMA-Bench evaluates Epistemic Prudence (knowing when to abstain) and Conflict Resolution across 30 programmatically generated cases with 10 temporal sessions each.
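To make the "case with 10 temporal sessions" structure concrete, here is a hypothetical sketch of what one case might hold. The field names and values are illustrative only and are not the benchmark's actual schema; inspect the generated files for the real format.

```python
# Hypothetical shape of a single MMA-Bench case (NOT the real schema).
example_case = {
    "case_id": "case_000",
    "case_type": "B",             # e.g., Inversion / Unknowable, per the result categories below
    "sessions": [                  # 10 temporal sessions of evolving evidence
        {
            "t": 0,
            "text_evidence": "A trusted colleague says the demo is on Friday.",
            "image": "images/case_000_t0.png",
            "source_credibility": 0.9,
        },
        # ... sessions t = 1..9 ...
    ],
    "question": "When is the demo?",
    "gold_behavior": "answer",     # or "abstain" for unknowable cases
}
```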
Generate the "Trust-Trap" evolution graph (Text + Visual Evidence).
Requires Qwen3-Max (Logic) and Qwen-Image-Plus (Vision).
cd MMA-Bench
python -m src.generator --num_cases 30 --output_dir ./data/mma_bench_v1

Evaluate models under different modality settings to detect Visual Anchoring.
# Text Mode (Oracle Captions): Tests pure reasoning logic
python -m src.inference --model gpt-4.1-mini --mode text
# Vision Mode (Raw Images): Tests multimodal conflict resolution
python -m src.inference --model qwen3-vl-plus --mode vision
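To sweep both modalities over several backbones, the two commands above can be wrapped in a small driver. This is just a usage sketch, run from inside `MMA-Bench`; only `gpt-4.1-mini` and `qwen3-vl-plus` appear in this README, so any other model identifier would be an assumption.

```python
import subprocess

# Sweep the documented modes over the two models shown above.
models = ["gpt-4.1-mini", "qwen3-vl-plus"]
modes = ["text", "vision"]

for model in models:
    for mode in modes:
        subprocess.run(
            ["python", "-m", "src.inference", "--model", model, "--mode", mode],
            check=True,  # stop early if any run fails
        )
```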
Compute the Cognitive Reliability (CoRe) Score using the 3-step probe (Verdict, Wager, Reflection).

python -m src.grader --judge_model gpt-4o-mini
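The grader implements the actual CoRe formula. Purely as intuition for how a Verdict/Wager/Reflection probe can yield the negative scores reported below, here is a toy, calibration-style scoring rule; it is not the paper's definition.

```python
def toy_core_score(correct: bool, wager: float, abstained: bool) -> float:
    """Toy calibration-style score, NOT the paper's CoRe formula.

    wager is the confidence the model stakes on its verdict, in [0, 1].
    Prudent abstention earns a small neutral reward; confident errors are punished.
    """
    if abstained:
        return 0.1
    return wager if correct else -wager

print(toy_core_score(correct=False, wager=0.9, abstained=False))  # -0.9: overconfident error
print(toy_core_score(correct=True, wager=0.8, abstained=False))   #  0.8: calibrated success
```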
Evaluate Stability. MMA matches baseline accuracy while reducing standard deviation by 35.2%.

cd MMA/public_evaluations
# Run Ablation Study (Baseline vs. Full MMA vs. w/o Consensus)
python run_fever_eval.py \
--fever_data_path data/paper_dev.jsonl \
--config_path ../configs/mirix.yaml \
--limit 500 \
--seeds "42,922,2025" \
--formula_modes "st,tri"
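The `--formula_modes` flag selects which reliability factors are active in the ablation. The README only states that `st` means Source + Time; reading `tri` as all three factors is an assumption, and the sketch below is an illustration rather than the repository's code.

```python
def ablated_score(s: float, t: float, c: float, mode: str) -> float:
    """Illustrative ablation of the reliability factors (names hypothetical)."""
    if mode == "st":   # Source + Time only, dropping Conflict-Aware Consensus
        return s * t
    if mode == "tri":  # all three factors, assumed to match the full MMA agent
        return s * t * c
    raise ValueError(f"unknown mode: {mode}")

print(ablated_score(0.9, 0.8, 0.5, "st"))   # 0.72
print(ablated_score(0.9, 0.8, 0.5, "tri"))  # 0.36
```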
Evaluate Safety in sparse retrieval. The `st` variant (Source + Time) achieves the highest Utility (609.0) by minimizing hallucinations.

cd MMA/public_evaluations
# Run LOCOMO Evaluation
python run_instance.py \
--agent_name mma \
--dataset LOCOMO \
--global_idx 0 \
--config_path ../configs/mma_gpt4.yaml
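`--global_idx` selects a single conversation; to run the whole suite, the same command can be looped. The index range below assumes `locomo10.json` contains 10 conversations (indices 0 through 9); adjust it if the file differs.

```python
import subprocess

# Run every LOCOMO instance sequentially (run from MMA/public_evaluations).
# The 0-9 range is an assumption based on the locomo10.json file name.
for idx in range(10):
    subprocess.run(
        [
            "python", "run_instance.py",
            "--agent_name", "mma",
            "--dataset", "LOCOMO",
            "--global_idx", str(idx),
            "--config_path", "../configs/mma_gpt4.yaml",
        ],
        check=True,
    )
```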
- Type B (Inversion): MMA achieves 41.18% Vision Dominance (solving the conflict), while the Baseline collapses to 0%.
- Type D (Unknowable): Visual inputs cause a drop in CoRe score (0.69 → -0.38), illustrating the Visual Placebo Effect.
.
├── assets/ # Images for README (Logo)
├── MMA/ # [Agent Framework] Core implementation
│ ├── MMA/ # Source code for the Agent (Memory, Confidence)
│ ├── configs/ # Configuration YAMLs for different backends
│ ├── public_evaluations/ # Evaluation scripts (FEVER, LOCOMO)
│ ├── frontend/ # Web UI for the agent
│ ├── scripts/ # Utility scripts (e.g., DB reset)
│ ├── tests/ # Unit tests
│ └── requirements.txt # Shared dependencies
├── MMA-Bench/ # [Benchmark Toolkit]
│ ├── mma_bench/ # Source code (Generator, Inference, Grader)
│ ├── data/ # Generated Benchmark Data & Images
│ └── requirements.txt # Shared dependencies
└── README.md # Main Documentation
We acknowledge the use of the following resources:
- MIRIX: Foundational memory architecture.
- Base Models: Qwen & GPT (API).
If you find our work useful, please cite:
@article{lu2026mma,
  title={MMA: Multimodal Memory Agent},
  author={Yihao Lu and Wanru Cheng and Zeyu Zhang and Hao Tang},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2026}
}

This project is licensed under the Apache 2.0 License.