Hallucination Analysis in RAG Systems Using Attribution Graphs
Introduction
Retrieval-Augmented Generation (RAG) systems are a crucial component of modern applications built on large language models (LLMs). However, these systems face a critical challenge: hallucinations, instances in which a model generates information that is not grounded in the retrieved knowledge base. This issue is particularly problematic in domains requiring high factual accuracy.
Traditional hallucination detection methods in RAG systems include semantic validation and entity verification. Yet, such approaches are limited to surface-level analysis and fail to account for the underlying mechanisms of LLM computation.
Recent advances in mechanistic interpretability, particularly Anthropic’s development of Circuit Tracing, have opened new opportunities for analyzing the internal computations of LLMs. Circuit Tracing builds attribution graphs that represent the influence of features (neurons) on one another, enabling researchers to trace intermediate computational steps leading to model outputs. In this study, we analyze attribution graphs to identify hallucination-related patterns within RAG systems.
Related Work
Sparse Autoencoders and Transcoders
Sparse Autoencoders (SAEs) decompose the activations of deep learning models into interpretable features, allowing researchers to examine the internal representations of transformer models. An SAE maps activations into a higher-dimensional latent space and reconstructs them under a sparsity penalty that encourages only a small subset of features (neurons) to be active.
Subsequent research extended SAEs into transcoders, trained to reconstruct the output of one neural component from its input. Cross-Layer Transcoders (CLTs) are advanced transcoder architectures that model feature interactions across multiple layers of the network.
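To make the objective concrete, below is a minimal PyTorch sketch of an SAE trained with a reconstruction loss plus an L1 sparsity penalty. The layer shapes, the ReLU nonlinearity, and the l1_coeff value are illustrative assumptions, not the configuration used in the works cited here.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: maps d_model activations into a wider, sparse feature space."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        return self.decoder(f), f         # reconstruction and feature activations

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```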
Existing hallucination detection approaches fall into four broad categories:
1. Statistical methods rely on numerical confidence scores such as perplexity (how well the model's predicted distribution accounts for the observed tokens; a minimal computation is sketched after this list) and uncertainty estimates derived from the output distribution. While computationally efficient, these methods often lack the granularity required for precise hallucination detection.
2. Model-based methods employ secondary LLMs to evaluate response quality. However, these approaches inherit the hallucination risks of the auxiliary models they depend on.
3. Retrieval-based methods assess the alignment between the generated response and the retrieved context, computing direct correspondences between model outputs and the selected knowledge sources.
4. Information-theoretic methods employ heuristic probabilistic measures that analyze token-level distances and output distributions in vector space.
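As an illustration of the statistical category, the following minimal sketch computes sequence perplexity from per-token log-probabilities; the function name and the example values are hypothetical.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity of a generated sequence from per-token natural-log probabilities;
    higher values mean the model was less confident in its own output."""
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)

print(perplexity([-0.1, -0.2, -0.05]))   # confident response, perplexity ~1.12
print(perplexity([-2.3, -1.9, -2.8]))    # uncertain response, perplexity ~10.3
```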
Limitations in Hallucination Detection
Existing hallucination detectors for RAG systems typically adopt binary classification schemes, which fail to capture the full variability and complexity of hallucinations, particularly when hallucinatory segments are embedded within otherwise valid responses. Moreover, most classifiers operate post hoc, without analyzing the internal mechanisms engaged during generation.
Larger classifiers with more parameters can detect richer hallucination patterns, but they often require multiple forward passes of the LLM, limiting scalability.
Circuit Tracing
Circuit Tracing is a mechanistic interpretability methodology that linearizes transformer computations and constructs attribution graphs.
The core idea is that internal computations can be decomposed into linear transformations within the residual stream (the stream of activations carried through the network's residual connections) while nonlinear components are "frozen." These linearized representations are then used to construct and analyze computational graphs for selected prompts by tracing individual computational steps in a replacement model.
Core Components
  • Replacement Model: For a fixed input prompt, the transformer is approximated by a linearized model in which nonlinearities are frozen, yielding a purely linear gradient flow. Cross-Layer Transcoders (CLTs) are inserted into the residual stream, acting as interpretable features that read and write signals across layers. This substitution transforms computations into a fully connected network amenable to classical circuit-tracing methods.
  • Attribution Graphs: Based on the local linear model, graphs are constructed whose nodes correspond to active CLT features, input tokens, reconstruction errors, and output logits. Edges represent direct linear influence computed via derivatives. These components are assembled into an adjacency matrix A encoding influence weights; a toy example is given after this list.
  • Interface: An interactive interface enables exploration of attribution graphs and their features, allowing researchers to highlight key computational mechanisms.
  • Global Weights: Beyond single-prompt analyses, the method allows direct inspection of replacement model parameters (“global weights”).
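To make the graph structure concrete, the following toy example assembles an adjacency matrix A over the node types listed above. The node labels, edge weights, and the A[target, source] convention are illustrative assumptions of this sketch.

```python
import numpy as np

# Toy attribution graph. Nodes are ordered as
# [input tokens | active CLT features | error nodes | output logits],
# and A[t, s] holds the direct linear influence of source node s on target node t.
node_labels = ["tok:Paris", "tok:capital", "feat_12", "feat_407", "err_L5", "logit:France"]
A = np.zeros((len(node_labels), len(node_labels)))

A[2, 0] = 0.8   # "Paris" token activates feature 12
A[3, 1] = 0.5   # "capital" token activates feature 407
A[3, 2] = 0.3   # feature 12 feeds feature 407 (a cross-layer edge)
A[5, 3] = 0.9   # feature 407 drives the "France" logit
A[5, 4] = 0.1   # part of the logit is attributed to a reconstruction-error node

# List the direct contributors to the output logit, strongest first.
for s in np.argsort(-A[5]):
    if A[5, s] > 0:
        print(f"{node_labels[s]} -> {node_labels[5]}: {A[5, s]:.2f}")
```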
Training the Transcoder for Qwen2.5-7B
For scalability and practical considerations, we selected Qwen2.5-7B as the base model, with the following configuration:
  • Parameters: 7.61B (6.53B excluding embeddings)
  • Layers: 28
  • Attention architecture (GQA): 28 heads for Q, 4 for K/V
  • Context window: 131,072 tokens
  • Components: RoPE, SwiGLU, RMSNorm, QKV bias
We adapted the CLT architecture to SwiGLU, yielding a hierarchical structure where each layer maintains its own feature set. Using over 1M text samples, we extracted hidden states from source and target layers of the base LLM, training the transcoder to predict target hidden states via mean squared error (MSE) loss, combined with L1 regularization to promote feature sparsity.
This procedure embeds interpretable linear layers between transformer layers, enabling construction of replacement models and subsequent attribution graph analysis.
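A minimal sketch of this training procedure (MSE reconstruction of the target hidden state plus L1 sparsity) is shown below. The feature dimensionality, the ReLU activation, the l1_coeff value, and the single source-target layer pairing are illustrative assumptions; only d_model = 3584 corresponds to Qwen2.5-7B's hidden size.

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    """Sketch of a transcoder: sparse features read a source layer's hidden
    state and predict a target layer's hidden state."""

    def __init__(self, d_model: int = 3584, d_features: int = 65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, h_source: torch.Tensor):
        f = torch.relu(self.encoder(h_source))   # sparse feature activations
        return self.decoder(f), f                # predicted target hidden state, features

def training_step(model, h_source, h_target, optimizer, l1_coeff: float = 1e-3):
    # MSE between predicted and actual target hidden states, plus L1 sparsity.
    h_pred, f = model(h_source)
    loss = nn.functional.mse_loss(h_pred, h_target) + l1_coeff * f.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```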
Hallucination Analysis
In RAG systems, a hallucination is defined as a case where the model produces unverifiable information or relies on internal knowledge rather than the retrieved context. Each experimental sample therefore consists of a user query, the retrieved context, and the generated response.
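A minimal sketch of such a record is given below; the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RAGSample:
    query: str            # user query
    context: list[str]    # retrieved passages from the knowledge base
    response: str         # generated answer to be checked for hallucination

sample = RAGSample(
    query="When was the Eiffel Tower completed?",
    context=["The Eiffel Tower was completed in 1889 for the World's Fair."],
    response="The Eiffel Tower was completed in 1889.",
)
```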
Groundedness Metric
The core idea is to measure the proportion of influence attributable to input tokens corresponding to the retrieved context relative to the total influence of all active input tokens.
Let C = {c1, c2, ..., ck} denote the set of retrieved-context tokens, and let I_i denote the total influence of node i on the output logits, as computed via the attribution graph. The contextual influence ratio is then
R_ctx = Σ_{i ∈ C} I_i / Σ_{i ∈ Inputs} I_i,
where Inputs denotes the set of all active input-token nodes.
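A minimal sketch of this ratio is shown below, assuming the per-node influence values I_i have already been computed from the attribution graph as described next; using influence magnitudes is a simplifying assumption.

```python
def context_influence_ratio(influence, input_token_ids, context_token_ids) -> float:
    """Share of total input-token influence that comes from retrieved-context tokens.

    `influence[i]` is the total influence I_i of node i on the logits;
    context_token_ids is assumed to be a subset of input_token_ids."""
    total = sum(abs(influence[i]) for i in input_token_ids)
    ctx = sum(abs(influence[i]) for i in context_token_ids)
    return ctx / total if total > 0 else 0.0
```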
The total influence of nodes is computed using a truncated power series:
I_i = [ w^T ( Ã + Ã^2 + ... + Ã^K ) ]_i,
where the index i ranges over nodes corresponding to input tokens, K is the truncation depth, and:
  • w ∈ R^n is the logit weight vector, nonzero only for logit nodes;
  • Ã is the normalized adjacency matrix of the attribution graph.
Normalization is performed row-wise:
Ã_ij = |A_ij| / Σ_m |A_im|,
where A is the adjacency matrix of direct influence weights described above.
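The computation can be sketched as follows; the truncation depth K = 8 and the use of absolute values in the normalization and accumulation are assumptions of this sketch.

```python
import numpy as np

def node_influence_on_logits(A: np.ndarray, w: np.ndarray, K: int = 8) -> np.ndarray:
    """Total influence of every node on the logits via paths of length <= K.

    A[t, s] is the direct linear influence of source node s on target node t,
    and w is the logit weight vector (nonzero only at logit nodes)."""
    # Row-wise normalization: each node's incoming influence sums to 1.
    row_sums = np.abs(A).sum(axis=1, keepdims=True)
    A_tilde = np.abs(A) / np.where(row_sums > 0, row_sums, 1.0)

    influence = np.zeros(A.shape[0])
    step = w.astype(float)
    for _ in range(K):
        step = step @ A_tilde     # one more hop backward from the logits
        influence += step         # accumulates w^T (Ã + Ã^2 + ... + Ã^K)
    return influence
```

At input-token indices, the resulting entries give the I_i values used in the contextual influence ratio above.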
Replacement Score (RS)
The fraction of influence flowing from input tokens to output tokens that is transmitted through feature nodes (neurons) rather than through error nodes.
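One straightforward formalization is sketched below: it compares the logit influence carried by feature nodes with that carried by error nodes. Reusing the influence vector from the previous sketch and taking magnitudes are simplifying assumptions.

```python
def replacement_score(influence, feature_ids, error_ids) -> float:
    """Share of total logit influence carried by feature nodes rather than
    error nodes, using the influence vector from the sketch above."""
    feat = sum(abs(influence[i]) for i in feature_ids)
    err = sum(abs(influence[i]) for i in error_ids)
    return feat / (feat + err) if (feat + err) > 0 else 0.0
```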
Completeness Score (CS)
The fraction of each node's incoming edge weight that is not attributed to error nodes.
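One possible formalization is sketched below: for each node, the fraction of its row-normalized incoming edge weight that does not originate from error nodes, averaged over nodes with incoming edges. The unweighted average is a simplifying assumption.

```python
import numpy as np

def completeness_score(A_tilde: np.ndarray, error_ids: set[int]) -> float:
    """Average fraction of each node's incoming edge weight (in the
    row-normalized matrix Ã) that does not originate from error nodes."""
    err_cols = sorted(error_ids)
    incoming = A_tilde.sum(axis=1)                     # total incoming weight per node
    from_errors = A_tilde[:, err_cols].sum(axis=1)     # incoming weight from error nodes
    has_parents = incoming > 0
    if not has_parents.any():
        return 1.0
    return float(np.mean(1.0 - from_errors[has_parents] / incoming[has_parents]))
```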
Final Groundedness Metric
The final metric combines the context proportion with a graph-quality term derived from RS and CS, weighted by a coefficient α, with α = 0.5 by default.
Binary Detection Rule
A hallucination is detected when:
where T∈(0,1) a tunable threshold, with a default value of T=0.4
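Putting the pieces together, a minimal sketch of the final metric and the detection rule is shown below. Taking the mean of RS and CS as the graph-quality term is an assumption of this sketch; α = 0.5 and T = 0.4 are the defaults stated above.

```python
def groundedness(context_ratio: float, rs: float, cs: float, alpha: float = 0.5) -> float:
    # Convex combination of context share and graph quality; using the mean of
    # RS and CS as the graph-quality term is an assumption of this sketch.
    graph_quality = 0.5 * (rs + cs)
    return alpha * context_ratio + (1.0 - alpha) * graph_quality

def is_hallucination(context_ratio: float, rs: float, cs: float, threshold: float = 0.4) -> bool:
    # Flag the response when groundedness falls below the threshold T.
    return groundedness(context_ratio, rs, cs) < threshold

print(is_hallucination(context_ratio=0.15, rs=0.6, cs=0.5))  # True: weak context grounding
print(is_hallucination(context_ratio=0.85, rs=0.9, cs=0.8))  # False: well-grounded response
```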
Preliminary Results
We evaluated the detector on a balanced test dataset, achieving 85% accuracy in hallucination classification.
Technical Limitations
  • Computational overhead: full detection is orders of magnitude slower than simple text generation.
  • CLT quality dependency: performance depends critically on the trained cross-layer transcoder.
  • Limited token-level precision: context-token alignment can result in false positives.
Conclusion
We present a novel hallucination detection framework for RAG systems grounded in the internal mechanisms of LLMs. By leveraging Circuit Tracing and attribution graphs, we move beyond surface-level methods toward a mechanistic understanding of model computations.
Bibliography
[1] Anthropic Research Team. (2025). Tracing the thoughts of a large language model. Anthropic Research Blog. https://www.anthropic.com/research/tracing-thoughts-language-model

[2] Anthropic Research Team. (2025). Circuit Tracing: Revealing Computational Graphs in Language Models. Transformer Circuits Publication. https://transformer-circuits.pub/2025/attribution-graphs/methods.html

[3] Anthropic Research Team. (2025). On the Biology of a Large Language Model. Transformer Circuits Publication. https://transformer-circuits.pub/2025/attribution-graphs/biology.html

[4] Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models. arXiv:2309.08600v3.

[5] An, Y., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, Y., Liu, Y., Cui, Z., Zhang, Z., & Qiu, Z. (2025). Qwen2.5 Technical Report. arXiv:2412.15115v2.

[6] Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2025). Ragas: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217v2.