To deliver coherent and personalized experiences in long-term conversations, existing approaches typically perform retrieval-augmented response generation by constructing memory banks from conversation history at the turn level, the session level, or through summarization. In this paper, we present two key findings: (1) The granularity of the memory unit matters: turn-level, session-level, and summarization-based methods each exhibit limitations in both memory retrieval accuracy and the semantic quality of the retrieved content. (2) Prompt compression methods, such as LLMLingua-2, can effectively serve as a denoising mechanism, enhancing memory retrieval accuracy across different granularities. Building on these insights, we propose SeCom, a method that constructs the memory bank at the segment level by introducing a conversation Segmentation model that partitions long-term conversations into topically coherent segments, while applying Compression-based denoising on memory units to enhance memory retrieval. Experimental results show that SeCom exhibits a significant performance advantage over baselines on the long-term conversation benchmarks LOCOMO and Long-MT-Bench+. Additionally, the proposed conversation segmentation method demonstrates superior performance on dialogue segmentation datasets such as DialSeg711, TIAGE, and SuperDialSeg.
✅ Memory granularity matters: turn-level, session-level, and summarization-based memory all struggle with retrieval accuracy and with the semantic integrity and relevance of the retrieved context.
✅ Prompt compression methods (e.g., LLMLingua-2) can denoise memory units before retrieval, boosting both retrieval accuracy and response quality.
We first systematically investigate the impact of different memory granularities on conversational agents within the paradigm of retrieval-augmented response generation. Our findings indicate that turn-level, session-level, and summarization-based methods all exhibit limitations in both the accuracy of the retrieval module and the semantic quality of the retrieved content, which ultimately lead to sub-optimal responses.
💡 Long conversations are naturally composed of coherent discourse units. To capture this structure, we introduce a conversation segmentation model that partitions long-term conversations into topically coherent segments, constructing the memory bank at the segment level. During response generation, we directly concatenate the retrieved segment-level memory units as the context, bypassing summarization to avoid the information loss that often occurs when converting dialogues into summaries.
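As a minimal sketch of this construction (the function names and data layout are our assumptions, not the paper's implementation), each memory unit is a verbatim span of consecutive turns, and retrieved units are concatenated as-is:

```python
def build_memory_bank(turns: list[str], bounds: list[tuple[int, int]]) -> list[str]:
    """Store each topically coherent segment verbatim as one memory unit.

    `bounds` holds the (first, last) turn indices produced by the
    segmentation model, contiguous and covering the whole session.
    """
    return ["\n".join(turns[p : q + 1]) for p, q in bounds]

def build_context(retrieved_segments: list[str]) -> str:
    # Retrieved segment-level units are concatenated directly as context,
    # avoiding the information loss of converting dialogue into summaries.
    return "\n\n".join(retrieved_segments)
```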
Figure 1. Illustration of retrieval-augmented response generation with different memory granularities. Turn-level memory is too fine-grained, leading to fragmentary and incomplete context. Session-level memory is too coarse-grained, containing too much irrelevant information. Summary-based methods suffer from information loss that occurs during summarization. Ours (segment-level memory) better captures topically coherent units in long conversations, striking a balance between including relevant, coherent information and excluding irrelevant content. 🎯 indicates the retrieved memory units at the turn or segment level under the same context budget. [0.xx]: similarity between the target query and history content. Turn-level retrieval errors: false negative, false positive.
(a) Response quality as a function of chunk size, given a total budget of 50 turns to retrieve as context.
(b) Retrieval DCG obtained with different memory granularities using BM25-based retriever.
(c) Retrieval DCG obtained with different memory granularities using MPNet-based retriever.
Figure 2. How memory granularity affects (a) the response quality and (b, c) retrieval accuracy.
Inspired by the notion that natural language tends to be inherently redundant, we hypothesize that such redundancy can act as noise for retrieval systems, complicating the extraction of key information. Therefore, we propose removing such redundancy from memory units prior to retrieval by leveraging prompt compression methods such as LLMLingua-2. Figure 3 shows the results.
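As a sketch of this denoising step using the public llmlingua package (the checkpoint name and compression rate below are illustrative defaults, not necessarily the paper's exact settings):

```python
from llmlingua import PromptCompressor  # pip install llmlingua

# LLMLingua-2 token-classification compressor (illustrative checkpoint).
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

def denoise(memory_unit: str, rate: float = 0.75) -> str:
    """Drop redundant tokens so retrieval keys on the informative ones."""
    result = compressor.compress_prompt(memory_unit, rate=rate)
    return result["compressed_prompt"]
```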
(a) Retrieval recall vs. compression rate $\frac{\text{#tokens after compression}}{\text{#tokens before compression}}$. K: number of retrieved segments. Retriever: BM25.
(b) Retrieval recall vs. compression rate $\frac{\text{#tokens after compression}}{\text{#tokens before compression}}$. K: number of retrieved segments. Retriever: MPNet.
(c) Similarity between the query and different dialogue segments. Blue: relevant segments. Orange: irrelevant segments. Retriever: MPNet.
Figure 3. Prompt compression (LLMLingua-2) can serve as an effective denoising technique that enhances the memory retrieval system by: (a, b) improving retrieval recall under varying context budgets K; (c) increasing the similarity between the query and relevant segments while decreasing the similarity with irrelevant ones.
To address the above challenges, we present SeCom, a system that constructs the memory bank at the segment level by introducing a conversation Segmentation model, while applying Compression-based denoising on memory units to enhance memory retrieval.
Given a conversation session $\mathbf{c}$, the conversation segmentation model $f_{\mathcal{I}}$ identifies a set of segment index pairs $\mathcal{I}=\{(p_{k}, q_{k})\}_{k=1}^{K}$, where $K$ denotes the total number of segments within the session $\mathbf{c}$, and $p_{k}$ and $q_{k}$ denote the indices of the first and last interaction turns of the $k$-th segment $\mathbf{s}_{k}$, with $p_{k} \leq q_{k}$ and $p_{k+1} = q_k + 1$. This can be formulated as: $$ f_{\mathcal{I}}(\mathbf{c}) = \{\mathbf{s}_k\}_{k=1}^K, \quad \text{where}~\mathbf{s}_k = \{\mathbf{t}_{p_k}, \mathbf{t}_{p_k+1}, \ldots, \mathbf{t}_{q_k}\} $$ and $\mathbf{t}_i$ denotes the $i$-th interaction turn. We employ GPT-4 as the conversation segmentation model $f_{\mathcal{I}}$. We find that more lightweight models, such as Mistral-7B and even RoBERTa-scale models, can also perform segmentation well, making our approach applicable in resource-constrained environments.
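A minimal sketch of prompting GPT-4 as $f_{\mathcal{I}}$ (the prompt wording and the JSON output contract are our assumptions; the paper's actual segmentation prompt may differ):

```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()

def segment_session(turns: list[str]) -> list[tuple[int, int]]:
    """Return contiguous (p_k, q_k) turn-index pairs covering the session."""
    numbered = "\n".join(f"[{i}] {t}" for i, t in enumerate(turns))
    prompt = (
        "Partition the conversation below into topically coherent segments. "
        "Reply with JSON only: a list of [first_turn_index, last_turn_index] "
        "pairs that are contiguous and cover every turn exactly once.\n\n"
        + numbered
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return [tuple(pair) for pair in json.loads(resp.choices[0].message.content)]
```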
Given a target user request $u^*$ and a context budget $N$, the memory retrieval system $f_R$ retrieves $N$ memory units $\{\mathbf{m}_n\in\mathcal{M}\}_{n=1}^N$ from the memory bank $\mathcal{M}$ as the context for responding to $u^*$. Since the inherent redundancy of natural language can act as noise for the retrieval system, we denoise memory units by removing this redundancy via a prompt compression model $f_{Comp}$ before retrieval: $$ \{\mathbf{m}_n\in\mathcal{M}\}_{n=1}^N \leftarrow f_R(u^*, f_{Comp}(\mathcal{M}), N). $$
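Putting the two ideas together, here is a sketch of the retrieval step with a BM25 retriever over compressed memory units (rank_bm25 usage is standard; the `compress` argument would be a compressor such as the hypothetical `denoise` helper sketched above). Note that, per the formulation, compression is used only for scoring: the retrieved units themselves still come from the original memory bank $\mathcal{M}$.

```python
from typing import Callable
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def retrieve(
    query: str,
    memory_bank: list[str],
    compress: Callable[[str], str],
    n: int = 5,
) -> list[str]:
    """Rank f_Comp(M) against the query, but return units from M itself."""
    compressed = [compress(unit) for unit in memory_bank]  # f_Comp(M)
    bm25 = BM25Okapi([doc.lower().split() for doc in compressed])
    scores = bm25.get_scores(query.lower().split())
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    return [memory_bank[i] for i in top]

# e.g.: context = build_context(retrieve(user_request, memory_bank, denoise, n=5))
```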
We evaluate SeCom against four intuitive approaches and four state-of-the-art models on long-term conversation benchmarks: LOCOMO and Long-MT-Bench+.
@inproceedings{pan2025secom,
title={SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents},
author={Zhuoshi Pan and Qianhui Wu and Huiqiang Jiang and Xufang Luo and Hao Cheng and Dongsheng Li and Yuqing Yang and Chin-Yew Lin and H. Vicky Zhao and Lili Qiu and Jianfeng Gao},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=xKDZAW0He3}
}