Gerlach, Luka (2026) Explaining and Steering Large Language Model Behavior via Sparse Autoencoders: Identifying and Manipulating Interpretable Features in Text Generation. Master's, Universität zu Köln.
Abstract
Large Language Models (LLMs) can be steered by intervening on internal activations at inference time, offering an alternative to prompting and parameter updates. Yet such interventions are often hard to interpret: it remains unclear what exactly a steering direction represents and which side effects it introduces. Motivated by recent mechanistic interpretability results suggesting that Sparse Autoencoders (SAEs) can extract sparse, human-interpretable features from model activations, we develop a steering pipeline that constructs steering directions from a small, inspectable set of SAE latents, directly addressing the interpretability limitations of prior steering methods. We evaluate the method on Ekman-emotion steering (Gemma 2 9B IT and Llama 3.1 8B IT) and on broader concept steering with AxBench (Gemma 2 9B IT). Across benchmarks, our approach improves steering performance over comparable training-free baselines while making interventions more transparent through token-level interpretations of the selected latents. Finally, we present a qualitative bias mitigation case study indicating that our approach can reduce dataset-induced topical bias transfer when building steering directions.
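As a rough illustration of the idea summarized in the abstract, the following sketch builds a steering direction from a handful of SAE latents and adds it to a batch of activations. It is not the thesis's pipeline: the function names, the unit-norm weighted sum of decoder rows, the scalar strength `alpha`, and the random stand-in data are all assumptions made for illustration only.

```python
import numpy as np


def build_steering_direction(decoder, latent_ids, weights=None):
    """Combine selected SAE decoder rows into one unit-norm direction.

    decoder:    (n_latents, d_model) array; a hypothetical SAE decoder matrix.
    latent_ids: indices of the small, inspectable set of latents to use.
    weights:    optional per-latent weights (assumed uniform if omitted).
    """
    rows = decoder[np.asarray(latent_ids)]            # (k, d_model)
    if weights is None:
        weights = np.ones(len(latent_ids))
    direction = np.asarray(weights, dtype=rows.dtype) @ rows
    norm = np.linalg.norm(direction)
    if norm == 0:
        raise ValueError("selected latents cancel out; direction is zero")
    return direction / norm


def steer(activations, direction, alpha):
    """Shift every activation vector by alpha along the steering direction."""
    return activations + alpha * direction


# Stand-in data: random values, not real model activations or SAE weights.
rng = np.random.default_rng(0)
decoder = rng.normal(size=(16, 8))                    # 16 latents, d_model = 8
acts = rng.normal(size=(4, 8))                        # 4 token positions

direction = build_steering_direction(decoder, [2, 5, 11])
steered = steer(acts, direction, alpha=4.0)
```

Because the direction is an explicit combination of a few named latents, each contributing latent can be inspected individually, which is the transparency property the abstract emphasizes.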
| Item URL in elib: | https://elib.dlr.de/224376/ |
|---|---|
| Document Type: | Thesis (Master's) |
| Title: | Explaining and Steering Large Language Model Behavior via Sparse Autoencoders: Identifying and Manipulating Interpretable Features in Text Generation |
| Authors: | Gerlach, Luka |
| DLR Supervisors: | |
| Date: | 2026 |
| Open Access: | Yes |
| Number of Pages: | 108 |
| Status: | Published |
| Keywords: | Large Language Models, Activation Engineering |
| Institution: | Universität zu Köln |
| HGF - Research field: | Aeronautics, Space and Transport |
| HGF - Program: | Space |
| HGF - Program Themes: | Space System Technology |
| DLR - Research area: | Raumfahrt |
| DLR - Program: | R SY - Space System Technology |
| DLR - Research theme (Project): | R - Collaboration of aviation operators and AI systems |
| Location: | Köln-Porz |
| Institutes and Institutions: | Institute of Software Technology; Institute of Software Technology > Intelligent and Distributed Systems |
| Deposited By: | Hecking, Dr. Tobias |
| Deposited On: | 11 May 2026 12:12 |
| Last Modified: | 11 May 2026 12:12 |