RAGLens: Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders

University of Virginia
ICLR 2026

RAGLens is a lightweight hallucination detector that uses sparse autoencoders and interpretable additive modeling to identify, explain, and mitigate unfaithful retrieval-augmented generation outputs.

Introduction

Retrieval-augmented generation improves factuality by grounding model outputs in retrieved evidence, but unfaithful generations remain a persistent challenge. Models can still introduce unsupported details, contradict the source material, or overstate information that is only weakly grounded in the retrieved context.

Existing detectors often rely on expensive external LLM judges or large supervised classifiers. Although these approaches can be effective, they are costly to deploy and often offer limited insight into why a response is flagged as unreliable.

RAGLens approaches this problem through sparse representation probing. By using sparse autoencoders to disentangle internal activations and a generalized additive model to make predictions transparent, the method provides a practical route to faithfulness detection while also supporting explanation and targeted mitigation.

RAGLens Framework

RAGLens treats hallucination detection as a sparse feature extraction problem. Output tokens are encoded with an SAE attached to an LLM hookpoint, informative features are selected with mutual information, and the resulting sparse summaries are passed to an interpretable predictor. The same sparse features can then be used to localize suspicious spans and provide feedback for revision.
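The three stages can be sketched end to end. The snippet below uses synthetic activations in place of real SAE outputs and a logistic head as a stand-in for the interpretable predictor, so all names, dimensions, and the label-generating rule are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for SAE activations: n responses, d sparse features.
n, d, k = 500, 200, 16
sae_feats = rng.exponential(1.0, size=(n, d)) * (rng.random((n, d)) < 0.05)
# Toy faithfulness labels driven by the first three features (invented rule).
labels = (sae_feats[:, :3].sum(axis=1) > 0.5).astype(int)

# Stage 2: rank sparse features by mutual information, keep the top k.
mi = mutual_info_classif(sae_feats, labels, random_state=0)
top = np.argsort(mi)[::-1][:k]

# Stage 3: fit an interpretable predictor on the selected sparse summaries
# (logistic regression stands in here for the GAM used in the paper).
clf = LogisticRegression(max_iter=1000).fit(sae_feats[:, top], labels)
print(clf.score(sae_feats[:, top], labels))
```

The same `top` indices would then be reused to localize suspicious spans, since each retained feature has token-level activations.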

Overview of the RAGLens pipeline for sparse probing, faithfulness prediction, explanation, and mitigation.

RAGLens links sparse internal features to three connected goals: reliable faithfulness prediction, interpretable analysis of hallucination-related behavior, and mitigation through targeted feedback.

Main Results

We first evaluate RAGLens against prompting-based detectors, uncertainty baselines, and prior methods that also exploit internal model representations. Across RAGTruth and Dolly, the sparse-feature view provides a strong and consistent signal for hallucination detection.

Detection Performance

The main benchmark comparison shows that RAGLens performs strongly across both summarization and question answering settings, outperforming prior baselines on the core evaluation suites.

Benchmark comparison of RAGLens against prior hallucination detection methods on RAGTruth and Dolly.

RAGLens consistently outperforms prior detectors on RAGTruth and Dolly, showing that sparse SAE features provide a practical and effective signal for faithfulness prediction.

Internal Signals vs. Self-Judgment

We also compare sparse internal signals with chain-of-thought style self-judgment. This analysis asks whether the model's latent activations contain faithfulness information that is not fully captured by explicit self-evaluation prompts.

Comparison between chain-of-thought self-judgment and sparse internal feature signals for hallucination detection.

Sparse internal features offer a stronger faithfulness signal than chain-of-thought self-judgment across datasets, suggesting that the model's internal knowledge can be more informative than its explicit critique of its own outputs.

Generalization Across Settings

Beyond in-domain performance, RAGLens is designed to capture reusable hallucination cues. The next analyses examine transfer across datasets and subtasks, as well as the layerwise distribution of hallucination-related information across different model families.

Transfer Across Datasets and Subtasks

The following comparison evaluates whether a detector trained in one setting continues to perform well when transferred to other datasets or RAG subtasks.

Transfer results across datasets and RAGTruth subtasks.

RAGLens retains meaningful performance under cross-dataset and cross-subtask transfer, indicating that it learns reusable hallucination cues rather than benchmark-specific heuristics.

Layerwise Behavior Across Models

We further analyze where hallucination-related signals concentrate inside the model. These patterns are informative for choosing effective hookpoints and understanding how faithfulness information is encoded.

Layerwise analysis across Llama and Qwen models on RAGTruth subtasks.

Hallucination-related information is concentrated in specific layers rather than distributed uniformly, supporting the use of targeted layers for faithfulness detection.

Interpretation and Mitigation

A central goal of RAGLens is to remain interpretable after training. Because the detector operates on a small set of sparse features, the selected signals can be examined directly and used to support more targeted mitigation of unfaithful outputs.

Interpretable Sparse Features

The feature analysis below illustrates how individual SAE features can be associated with coherent semantic meanings, activated spans, and learned influence curves in the final predictor.

Examples of sparse features, activated spans, and their learned influence curves.

The selected sparse features are interpretable at multiple levels: they can be summarized semantically, localized to activated spans, and linked directly to the detector's response through additive shape functions.
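To make the "additive shape function" idea concrete, the sketch below decomposes a detector's logit into per-feature contributions that can be read off directly. The two shape functions and feature names are hypothetical illustrations, not the learned curves from the paper.

```python
import numpy as np

# Illustrative additive model: the detector's logit is a sum of per-feature
# shape functions applied to sparse feature activations.
def shape_numeric(a):
    # Hypothetical curve: risk rises as an "unsupported numerics" feature fires.
    return 1.5 * np.tanh(a)

def shape_grounding(a):
    # Hypothetical curve: risk falls as a "well-grounded span" feature fires.
    return -2.0 * np.tanh(a)

def gam_logit(activations, bias=-0.5):
    contribs = {
        "unsupported_numerics": shape_numeric(activations["unsupported_numerics"]),
        "grounded_span": shape_grounding(activations["grounded_span"]),
    }
    return bias + sum(contribs.values()), contribs

logit, contribs = gam_logit({"unsupported_numerics": 2.0, "grounded_span": 0.1})
p_halluc = 1.0 / (1.0 + np.exp(-logit))
# Each entry of `contribs` is exactly that feature's influence on the logit,
# which is what makes the additive predictor transparent.
```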

Revision with Targeted Feedback

Once hallucination-related spans are identified, the same internal signals can be turned into feedback for revision, enabling instance-level and token-level mitigation strategies.

Mitigation results using instance-level and token-level feedback derived from RAGLens.

Targeted feedback, particularly at the token level, reduces hallucination rates as measured by multiple automatic judges and by human annotation, showing that the detector supports both analysis and repair.

Design Analysis

The following studies clarify the main design choices behind RAGLens, including the feature extractor, the number of selected sparse features, the final predictor, and the benefit of sparse features over raw hidden states.

Feature Extractors

The representation study compares SAEs and Transcoders, using both pre- and post-activation signals, to understand which sparse feature space is most effective for faithfulness detection.

Comparison of sparse feature extractors and activation choices for hallucination detection.

The choice of sparse feature extractor materially affects performance, supporting the representation design used in the final RAGLens pipeline.
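For intuition, the pre- vs. post-activation distinction can be sketched with a toy encoder; the dimensions and Gaussian initialization below are assumptions for illustration, not the trained extractors evaluated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 64, 512  # hypothetical hidden and dictionary sizes

# Toy SAE encoder weights (random here; trained in practice).
W_enc = rng.normal(0, 0.1, size=(d_sae, d_model))
b_enc = rng.normal(0, 0.1, size=d_sae)

h = rng.normal(size=d_model)       # hidden state at the chosen hookpoint
z_pre = W_enc @ h + b_enc          # pre-activation: dense, signed scores
z_post = np.maximum(z_pre, 0.0)    # post-activation: sparse, nonnegative codes

print((z_post > 0).mean())         # fraction of features that fire
```

Pre-activation scores preserve signed evidence for every dictionary feature, while post-activation codes are the sparse nonnegative features typically read out; the study above compares which view is more useful for detection.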

Feature Selection

The feature-count analysis examines how performance changes as more sparse features are retained and compares mutual-information ranking with random selection.

Feature-count analysis comparing mutual-information ranking with random selection.

Mutual-information ranking yields much stronger detectors than random feature choice, demonstrating that selecting informative sparse features is a key part of the method.
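The ranking-versus-random comparison can be reproduced in miniature on synthetic data; the data-generating process below is invented purely for illustration (only the first five features carry label signal).

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, d, k = 1000, 300, 20

# Synthetic sparse activations; only features 0-4 influence the label.
X = rng.exponential(1.0, (n, d)) * (rng.random((n, d)) < 0.1)
y = (X[:, :5].sum(axis=1) > 1.0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mi = mutual_info_classif(X_tr, y_tr, random_state=0)
mi_idx = np.argsort(mi)[::-1][:k]                 # top-k by mutual information
rand_idx = rng.choice(d, size=k, replace=False)   # random k-feature baseline

accs = {}
for name, idx in [("mutual-info", mi_idx), ("random", rand_idx)]:
    model = LogisticRegression(max_iter=1000).fit(X_tr[:, idx], y_tr)
    accs[name] = model.score(X_te[:, idx], y_te)
    print(name, round(accs[name], 3))
```

With a fixed feature budget, the informed ranking recovers the signal-bearing features that a random draw almost always misses.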

Predictor Choice

We compare generalized additive models with logistic regression, multilayer perceptrons, and gradient boosting to evaluate how much complexity is needed in the final predictor.

Predictor comparison among logistic regression, GAM, MLP, and XGBoost.

The generalized additive model provides a strong balance between performance and transparency, matching the overall design goal of an interpretable detector.

Sparse Features vs. Hidden States

The final comparison asks whether sparse features are intrinsically more useful than raw hidden states when the detector is built under a comparable feature budget.

Comparison of sparse autoencoder features and hidden states under different feature budgets.

Sparse SAE features provide stronger signals than raw hidden states in the compact-detector setting, highlighting the practical value of sparse representation probing.

Feature Case Studies

The case studies below make the selected sparse features concrete. They illustrate how individual features respond to specific hallucination patterns, including unsupported numeric details, temporal claims, ratings, and context-sensitive summarization errors. Importantly, we show that changing a feature's activation value causally affects the faithfulness of the model's output, and that the identified features are highly sensitive to the given context, confirming their specificity to unfaithful RAG outputs.

Numeric and Temporal Specifics

One recurrent pattern is the tendency to hallucinate precise numeric or time-related details. The feature below activates consistently around this type of unsupported specificity.

Examples of a sparse feature responding to hallucinated numeric and time-related details.

This feature consistently activates on unsupported numeric and temporal specifics. Notably, increasing its activation steers the model toward generating more faithful outputs.
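One common way to realize such steering is sketched below, under the assumption that the intervention adds the feature's SAE decoder direction to the hidden state at the hookpoint; the paper's exact procedure may differ, and the feature index, strength, and dimensions are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 64, 512

W_dec = rng.normal(0, 0.1, size=(d_sae, d_model))  # SAE decoder directions
feature_id, alpha = 42, 3.0                        # hypothetical feature / strength

def steer(hidden_state, feature_id, alpha):
    """Boost one sparse feature by adding its unit decoder direction,
    scaled by alpha, to the hidden state."""
    direction = W_dec[feature_id]
    return hidden_state + alpha * direction / np.linalg.norm(direction)

h = rng.normal(size=d_model)
h_steered = steer(h, feature_id, alpha)
# The steered state would be written back at the hookpoint before decoding,
# so subsequent generation reflects the amplified feature.
```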

Opening Hours

Another feature specializes in a narrower family of hallucinations involving opening hours and day-and-time claims. The examples below also show that increasing this feature's activation leads to more faithful outputs, confirming a causal relationship between the feature and hallucinated opening hours.

Examples of a sparse feature responding to hallucinated opening hours.

Counterfactual Specificity

The counterfactual analysis examines how a selected feature responds when the context is modified in a controlled way, providing a more direct view of feature specificity.

Counterfactual analysis of a sparse feature tracking ungrounded numeric spans in summarization.

The example shows that the feature no longer activates once the source of the hallucinated numeric content is removed from the retrieved context, confirming that it specifically tracks ungrounded numeric information rather than serving as a general detector of numeric outputs.

BibTeX

@inproceedings{xiong2026toward,
  title={Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders},
  author={Guangzhi Xiong and Zhenghao He and Bohan Liu and Sanchit Sinha and Aidong Zhang},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=hgBZP67BkP}
}