M3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

Jie Huang1,* Ruixun Liu1,* Sirui Sun1 Xinyi Yang1 Yin Li2 Yixin Zhu1 Yiwu Zhong1,†
1Peking University 2University of Wisconsin-Madison
* Equal contribution. † Corresponding author.
M3Eval benchmark overview figure
TL;DR: M3Eval systematically evaluates multiple dimensions of multi-modal memory through video tasks grounded in cognitive psychology, offering new insights into the limitations of current multi-modal memory and informing the design of future systems.

Abstract

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial effort in developing video datasets and benchmarks, existing work primarily focuses on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference.

To address this gap, we introduce M3Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks isolating key aspects of memory. Leveraging M3Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors.

We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, whereas our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models.

Video Memory Tasks

For each task, the left shows how the task is constructed; the right is a short demo video for illustration only, not drawn from the dataset.

Task 1

Divided Attention: encoding concurrent information

Divided Attention task figure

This video displays two separate videos side by side, with Video 1 on the left and Video 2 on the right throughout.


Task 2

Memory Interference: robustness to distraction

Memory Interference task figure

This video contains two separate videos played sequentially: Video 1 is played first, followed by Video 2.


Task 3

Interleaved Events: temporal organization

Interleaved Events task figure

This video interleaves two separate videos, A and B. Each is divided into N equal-duration segments, and the segments are concatenated alternately (A1, B1, A2, B2, ..., AN, BN).
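The interleaving described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual construction code: the function names `split_equal` and `interleave` are hypothetical, a video is modeled as a plain list of frames, and dropping leftover frames that do not fill a whole segment is an assumption.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_equal(frames: Sequence[T], n: int) -> List[List[T]]:
    """Split one video into n contiguous segments of equal length.

    Assumption: trailing frames that do not fill a whole segment are dropped.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    seg_len = len(frames) // n
    if seg_len == 0:
        raise ValueError("video has fewer frames than segments")
    return [list(frames[i * seg_len:(i + 1) * seg_len]) for i in range(n)]


def interleave(video_a: Sequence[T], video_b: Sequence[T], n: int) -> List[T]:
    """Concatenate segments alternately: A1, B1, A2, B2, ..., AN, BN."""
    mixed: List[T] = []
    for seg_a, seg_b in zip(split_equal(video_a, n), split_equal(video_b, n)):
        mixed.extend(seg_a)
        mixed.extend(seg_b)
    return mixed


# Toy "videos" of six frames each, split into N = 3 segments of two frames.
video_a = [f"a{i}" for i in range(6)]
video_b = [f"b{i}" for i in range(6)]
mixed = interleave(video_a, video_b, 3)
print(mixed)
```

With these toy inputs, the result alternates two frames from A with two frames from B, so recovering the source of a frame requires tracking temporal position rather than screen position.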


Task 4

N-Back: symbol grounding and memory capacity

N-Back task figure

This video presents a sequence of K clips. The task is to judge whether the last clip matches the clip N positions back in the same symbolic attribute (e.g., scene).
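As a rough sketch of the N-Back judgment, assuming each clip has already been reduced to a single symbolic label (for example, its scene), the ground-truth answer is a comparison between the last label and the label N positions earlier. The function name `n_back_match` and the scene labels below are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def n_back_match(attributes: Sequence[str], n: int) -> bool:
    """Return True if the last clip shares its attribute with the clip n positions back.

    `attributes` holds one symbolic label per clip, in presentation order,
    so len(attributes) plays the role of K.
    """
    k = len(attributes)
    if not 1 <= n < k:
        raise ValueError("need 1 <= n < number of clips")
    return attributes[-1] == attributes[-1 - n]


# K = 4 clips labeled by scene (labels are made up for this example).
scenes = ["kitchen", "beach", "office", "beach"]
print(n_back_match(scenes, 2))  # compares "beach" (last) with "beach" (2 back)
print(n_back_match(scenes, 1))  # compares "beach" (last) with "office" (1 back)
```

The sketch only covers the labeling rule; the model under test sees raw clips and must first ground each clip to its symbol before it can make this comparison.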


Detailed Experimental Results

For each task, "Main Result" presents the primary results; "Further Experiment" provides deeper analysis; and the final "Finding" summarizes the key takeaway.

Task 1

Divided Attention

Main Result

Accuracy (%) on three divided attention metrics in the split-screen setting, either without swaps or with frequent swaps of the two videos' left/right positions. Values in parentheses give the change relative to the no-swapping condition.

| Acc. (%) | No swapping: Source Identification | No swapping: Order Understanding | No swapping: Content Retention | Swapping: Source Identification | Swapping: Order Understanding | Swapping: Content Retention |
|---|---|---|---|---|---|---|
| Human | 89.58 | 90.00 | 92.16 | 81.25 (-8.33) | 85.00 (-5.00) | 86.27 (-5.89) |
| Random | 25.00 | 25.00 | 25.00 | 25.00 (0.00) | 25.00 (0.00) | 25.00 (0.00) |
| Closed-Source Models | | | | | | |
| Gemini-3.1-Pro-Preview | 62.50 | 52.50 | 49.02 | 37.50 (-25.00) | 52.50 (0.00) | 56.86 (+7.84) |
| GPT-5.4 | 27.08 | 35.00 | 47.06 | 35.42 (+8.34) | 30.00 (-5.00) | 49.02 (+1.96) |
| Open-Source Agents | | | | | | |
| VideoLucy | 16.67 | 42.50 | 37.25 | 14.58 (-2.09) | 25.00 (-17.50) | 39.22 (+1.97) |
| M3-Agent | 27.08 | 30.00 | 23.53 | 31.25 (+4.17) | 35.00 (+5.00) | 23.53 (0.00) |
| Open-Source Models | | | | | | |
| Qwen3.5-4B | 18.75 | 25.00 | 31.37 | 14.58 (-4.17) | 22.50 (-2.50) | 33.33 (+1.96) |
| Qwen3-VL-8B-Instruct | 16.67 | 25.00 | 37.25 | 12.50 (-4.17) | 30.00 (+5.00) | 35.29 (-1.96) |
| InternVL3.5-8B | 29.17 | 37.50 | 33.33 | 25.00 (-4.17) | 40.00 (+2.50) | 27.45 (-5.88) |
| Qwen3.5-9B | 35.42 | 25.00 | 25.49 | 18.75 (-16.67) | 30.00 (+5.00) | 13.73 (-11.76) |
| Qwen3.5-27B | 41.67 | 25.00 | 35.29 | 27.08 (-14.59) | 32.50 (+7.50) | 35.29 (0.00) |
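The parenthesized values in the swapping columns appear to be the swapping accuracy minus the corresponding no-swapping accuracy. The snippet below is a small sketch for reading and sanity-checking such cells; the helper names `parse_cell` and `is_consistent` and the rounding tolerance are assumptions made for this illustration, and the numbers are taken from the Human row.

```python
import re
from typing import Tuple

# Matches cells such as "81.25 (-8.33)" or "25.00 (0.00)".
CELL = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*\(([+-]?\d+(?:\.\d+)?)\)\s*$")


def parse_cell(cell: str) -> Tuple[float, float]:
    """Split a 'value (change)' cell into (value, change)."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    return float(match.group(1)), float(match.group(2))


def is_consistent(no_swap: float, swap_cell: str, tol: float = 0.011) -> bool:
    """Check that the reported change equals swap minus no-swap, up to rounding."""
    value, change = parse_cell(swap_cell)
    return abs((value - no_swap) - change) <= tol


# Human, Source Identification: 89.58 without swaps, "81.25 (-8.33)" with swaps.
print(parse_cell("81.25 (-8.33)"))
print(is_consistent(89.58, "81.25 (-8.33)"))
```

A negative change means accuracy dropped once the two videos started swapping sides.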

Further Experiment

Divided Attention further experiment
Attention shifts induced by split-screen interference. For each case, the left panel shows the single-video condition, whereas the right panel shows the split-screen condition. Under the split-screen setting, the model's attention is disrupted, resulting in erroneous responses.
Finding 1: Existing multi-modal models lack robust memory for parallel tasks, probably due to attention confusion across concurrent visual streams.

Task 2

Memory Interference

Main Result

Proactive: the first video (V1) interferes with recall of the second video (V2); retroactive: the second video (V2) interferes with recall of the first video (V1). Delta denotes proactive minus retroactive.

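| Model | Accuracy (%): Proactive | Accuracy (%): Retroactive | Accuracy (%): Delta | Intrusion Rate (%): Proactive | Intrusion Rate (%): Retroactive | Intrusion Rate (%): Delta |
|---|---|---|---|---|---|---|
| Human | 94.55 | 74.55 | 20.00 | 3.64 | 20.00 | -16.36 |
| Random | 25.00 | 25.00 | 0.00 | 50.00 | 50.00 | 0.00 |
| Closed-Source Models | | | | | | |
| Gemini-3.1-Pro-Preview | 63.64 | 54.55 | 9.09 | 23.64 | 30.91 | -7.27 |
| GPT-5.4 | 43.64 | 40.00 | 3.64 | 43.64 | 34.55 | 9.09 |
| Open-Source Agents | | | | | | |
| VideoLucy | 29.09 | 43.64 | -14.55 | 43.64 | 34.55 | 9.09 |
| M3-Agent | 43.64 | 36.36 | 7.28 | 40.00 | 34.55 | 5.45 |
| Open-Source Models | | | | | | |
| Qwen3.5-4B | 29.09 | 38.18 | -9.09 | 45.45 | 38.18 | 7.27 |
| Qwen3-VL-8B-Instruct | 25.45 | 29.09 | -3.64 | 54.55 | 52.73 | 1.82 |
| InternVL3.5-8B | 52.73 | 49.09 | 3.64 | 32.73 | 41.82 | -9.09 |
| Qwen3.5-9B | 29.09 | 38.18 | -9.09 | 50.91 | 41.82 | 9.09 |
| Qwen3.5-27B | 45.45 | 40.00 | 5.45 | 40.00 | 43.64 | -3.64 |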
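As a worked example of the Delta columns (proactive minus retroactive), the sketch below recomputes the Human and Gemini-3.1-Pro-Preview entries from the reported proactive and retroactive values. The function name `interference_delta` is hypothetical, and rounding to two decimals is an assumption that matches the precision of the table.

```python
def interference_delta(proactive: float, retroactive: float) -> float:
    """Delta as defined for this table: proactive minus retroactive."""
    return round(proactive - retroactive, 2)


# Human row: accuracy and intrusion rate.
human_acc_delta = interference_delta(94.55, 74.55)
human_intrusion_delta = interference_delta(3.64, 20.00)

# Gemini-3.1-Pro-Preview row: accuracy.
gemini_acc_delta = interference_delta(63.64, 54.55)

print(human_acc_delta, human_intrusion_delta, gemini_acc_delta)
```

Under this definition, a positive accuracy Delta together with a negative intrusion Delta, as in the Human row, indicates that recall of the first video suffers more than recall of the second, that is, retroactive interference is the stronger of the two.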

Further Experiment

Memory Interference further experiment
Accuracy changes from video repetition. Repeating either the target or interfering video improves accuracy for most models, suggesting repetition as a promising way to enhance model memory.
Finding 2: Retroactive interference exceeds proactive interference in humans, whereas the two occur at comparable levels in multi-modal models. Surprisingly, further experiments show that repeating either the target video or the interfering video can improve performance under interference.

Task 3

Interleaved Events

Main Result

Accuracy (%) on four interleaved reconstruction metrics.

| Model | Source Identification | Order Understanding | Content Retention | False Memory Discrimination |
|---|---|---|---|---|
| Human | 75.95 | 80.00 | 83.64 | 82.11 |
| Random | 25.00 | 25.00 | 25.00 | 25.00 |
| Closed-Source Models | | | | |
| Gemini-3.1-Pro-Preview | 43.04 | 50.00 | 49.09 | 26.32 |
| GPT-5.4 | 43.04 | 40.00 | 47.27 | 7.37 |
| Open-Source Agents | | | | |
| VideoLucy | 30.38 | 23.33 | 43.64 | 40.00 |
| M3-Agent | 27.85 | 40.00 | 21.82 | 15.79 |
| Open-Source Models | | | | |
| Qwen3.5-4B | 30.38 | 20.00 | 41.82 | 23.16 |
| Qwen3-VL-8B-Instruct | 21.52 | 23.33 | 30.91 | 3.16 |
| InternVL3.5-8B | 25.32 | 26.67 | 41.82 | 1.05 |
| Qwen3.5-9B | 26.58 | 40.00 | 25.45 | 7.37 |
| Qwen3.5-27B | 39.24 | 33.33 | 34.55 | 3.16 |

Further Experiment

Interleaved Events further experiment
Accuracy of source grounding. Spatial source uses the split-screen format; temporal source uses the interleaved format. Models perform notably better on spatial than temporal source grounding.
Finding 3: Multi-modal memory is less capable than human memory of structuring temporally interleaved information. Further experiments show that memory source grounding is consistently stronger along the spatial dimension than along the temporal dimension.

Task 4

N-Back

Main Result

N-Back main results
Average accuracy on the N-Back task. Overall performance of each model and of human participants under two symbolic attributes, scene and action, averaged over all K and N configurations.

Further Experiment

N-Back further experiment
Effect of N and K on accuracy. Each video clip is abstracted as a symbol. Sample points from different models are shown with fitted curves and confidence intervals.
Finding 4: Multi-modal models lag far behind humans in symbolic memory tasks. Unlike humans, models show no decay over increasing temporal distance, yet degrade sharply with increasing number of symbols, revealing a fundamental inability to filter irrelevant memory.

BibTeX

@article{m3eval2026,
  title   = {M3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks},
  author  = {Huang, Jie and Liu, Ruixun and Sun, Sirui and Yang, Xinyi and Li, Yin and Zhu, Yixin and Zhong, Yiwu},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026},
}