Comprehending extended audiovisual experiences remains a fundamental challenge for computational systems. Current approaches struggle with the temporal integration and cross-modal associations that humans accomplish effortlessly through hippocampal-cortical networks. We introduce HippoMM, a biologically-inspired architecture that transforms hippocampal mechanisms into computational advantages for multimodal understanding. HippoMM implements three key innovations: (i) hippocampus-inspired pattern separation and completion specifically engineered for continuous audiovisual streams, (ii) explicit short-to-long term memory consolidation that transforms perceptual details into semantic abstractions, and (iii) novel cross-modal associative retrieval pathways enabling modality-crossing queries. Unlike existing retrieval systems with rigid indexes, HippoMM dynamically forms integrated episodic representations through adaptive temporal segmentation and dual-process memory encoding. Evaluations on our challenging HippoVlog benchmark demonstrate that HippoMM significantly outperforms state-of-the-art approaches (78.2% vs. 64.2% accuracy) while providing substantially faster response times (20.4s vs. 112.5s). Our results indicate that translating hippocampal principles offers a promising pathway toward robust and efficient long-form audiovisual AI.
HippoMM introduces a novel architecture inspired by the human hippocampus to handle long audiovisual events. The system consists of three main components: pattern separation and completion for continuous audiovisual streams, short-to-long-term memory consolidation, and cross-modal associative retrieval.
The architecture dynamically segments input streams using adaptive temporal windows and implements a dual-process encoding strategy that balances detailed recall with efficient retrieval.
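The paper does not include code at this point, so the following is a minimal sketch of what adaptive temporal segmentation plus dual-process encoding could look like. The boundary heuristic (cosine similarity to a running segment centroid), the thresholds, and the names `adaptive_segments` and `encode_segment` are our illustrative assumptions, not HippoMM's actual implementation.

```python
import numpy as np

def adaptive_segments(features: np.ndarray, threshold: float = 0.7,
                      min_len: int = 8, max_len: int = 256) -> list:
    """Split per-frame embeddings of shape (T, D) into variable-length windows.

    A boundary opens when cosine similarity to the running segment centroid
    drops below `threshold`, subject to min/max window lengths.
    (Illustrative heuristic; the paper's boundary criterion may differ.)
    """
    segments, start = [], 0
    centroid = features[0].astype(np.float64)
    for t in range(1, len(features)):
        f = features[t]
        sim = f @ centroid / (np.linalg.norm(f) * np.linalg.norm(centroid) + 1e-8)
        if (sim < threshold and t - start >= min_len) or t - start >= max_len:
            segments.append((start, t))        # close the current window
            start, centroid = t, f.astype(np.float64)
        else:
            # Incremental mean keeps the centroid representative of the window.
            centroid += (f - centroid) / (t - start + 1)
    segments.append((start, len(features)))
    return segments

def encode_segment(features: np.ndarray, span: tuple) -> dict:
    """Dual-process encoding: a verbatim trace for detailed recall plus a
    pooled 'gist' vector for fast similarity-based retrieval."""
    start, end = span
    detail = features[start:end]               # full-resolution episodic trace
    gist = detail.mean(axis=0)                 # compact semantic abstraction
    return {"span": span, "detail": detail, "gist": gist}

# Usage: segment a stream of frame/audio-window embeddings, then encode each window.
feats = np.random.randn(1000, 512).astype(np.float32)
memory = [encode_segment(feats, s) for s in adaptive_segments(feats)]
```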
The HippoVlog benchmark contains diverse question types across multiple modalities to evaluate comprehensive understanding of long-form audiovisual content. Below we present representative examples from each category in our evaluation framework; a sketch of how such items might be represented programmatically follows the examples.
**Cross-modal (A+V):** requires both audio and visual cues for understanding.

Q: While the main character is holding the jar labeled 'Cotswold Lavender Soothing Muscle Rub' in the car, what does she mention about their Airbnb experiences?
A: They occasionally encounter rough mattresses.

**Audio-only (A):** primarily depends on audio cues for understanding.

Q: What is the main character's reason for being excited in the conversation?
A: The main character expresses excitement because it is their birthday, as they explicitly state, 'It's literally my birthday today, I'm so excited.'

**Visual-only (V):** primarily depends on visual cues for understanding.

Q: What color is the main character's sweater?
A: The main character is wearing a gray sweater in the provided scene.

**Semantic (S):** requires comprehensive semantic understanding.

Q: What was the main character's sentiment about her trip to Vienna?
A: She hoped her love for the city was conveyed.
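To make the item format concrete, here is a hypothetical representation of HippoVlog items populated with two of the examples above. The `HippoVlogItem` class and its field names (`question`, `answer`, `qtype`) are illustrative assumptions, not the released benchmark schema.

```python
from dataclasses import dataclass

@dataclass
class HippoVlogItem:
    """One HippoVlog QA item (hypothetical schema, for illustration only)."""
    question: str
    answer: str
    qtype: str  # "A+V" (cross-modal), "A" (audio), "V" (visual), or "S" (semantic)

examples = [
    HippoVlogItem(
        question=("While the main character is holding the jar labeled "
                  "'Cotswold Lavender Soothing Muscle Rub' in the car, what "
                  "does she mention about their Airbnb experiences?"),
        answer="They occasionally encounter rough mattresses.",
        qtype="A+V",
    ),
    HippoVlogItem(
        question="What color is the main character's sweater?",
        answer="The main character is wearing a gray sweater in the provided scene.",
        qtype="V",
    ),
]
```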
We evaluate HippoMM on the newly introduced HippoVlog benchmark designed for long audiovisual event understanding. Our approach demonstrates significant improvements over state-of-the-art methods in both accuracy and inference speed.
| Method | PT ↓ (hours) | ART ↓ (seconds) | A+V ↑ | A ↑ | V ↑ | S ↑ | Avg. Acc. ↑ (%) |
|---|---|---|---|---|---|---|---|
| *Prior Methods* | | | | | | | |
| NotebookLM | -- | -- | 28.4 | 23.2 | 28.0 | 26.8 | 26.6 |
| Video RAG | 9.46 | 112.5 | 63.6 | 67.2 | 41.2 | 84.8 | 64.2 |
| *Ablation Studies* | | | | | | | |
| HippoMM w/o DR, AR | 5.09 | **4.14** | 66.8 | 73.2 | 60.4 | 90.0 | 72.6 |
| HippoMM w/o FR, AR | 5.09 | 27.3 | **72.0** | 80.0 | **66.8** | 83.2 | 75.5 |
| HippoMM w/o AR | 5.09 | 11.2 | 68.8 | 80.8 | 65.6 | 92.0 | 76.8 |
| HippoMM (Ours) | 5.09 | 20.4 | 70.8 | **81.6** | **66.8** | **93.6** | **78.2** |
Note: PT: Processing Time; ART: Average Response Time; A+V: cross-modal (audio+visual) accuracy; A: audio-only accuracy; V: visual-only accuracy; S: semantic understanding accuracy. Best results in each column are shown in bold. HippoMM significantly outperforms prior methods on all modality-specific tasks while maintaining efficient processing and response times. Ablation studies demonstrate the importance of each component: Detailed Recall (DR), Fast Retrieval (FR), and Adaptive Reasoning (AR).
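The reported Avg. Acc. column is consistent with the unweighted mean of the four modality accuracies (A+V, A, V, S). The following short check is our reading of the table values above, not the authors' evaluation code:

```python
# Each row: (A+V, A, V, S, reported Avg. Acc.), taken from the table above.
rows = {
    "NotebookLM":         (28.4, 23.2, 28.0, 26.8, 26.6),
    "Video RAG":          (63.6, 67.2, 41.2, 84.8, 64.2),
    "HippoMM w/o DR, AR": (66.8, 73.2, 60.4, 90.0, 72.6),
    "HippoMM w/o FR, AR": (72.0, 80.0, 66.8, 83.2, 75.5),
    "HippoMM w/o AR":     (68.8, 80.8, 65.6, 92.0, 76.8),
    "HippoMM (Ours)":     (70.8, 81.6, 66.8, 93.6, 78.2),
}
for name, (av, a, v, s, avg) in rows.items():
    # The unweighted mean of the four modality accuracies matches Avg. Acc.
    assert abs((av + a + v + s) / 4 - avg) < 0.05, name
```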