
Explaining and Steering Large Language Model Behavior via Sparse Autoencoders: Identifying and Manipulating Interpretable Features in Text Generation

Gerlach, Luka (2026) Explaining and Steering Large Language Model Behavior via Sparse Autoencoders: Identifying and Manipulating Interpretable Features in Text Generation. Master's, Universität zu Köln.


Abstract

Large Language Models (LLMs) can be steered by intervening on internal activations at inference time, offering an alternative to prompting and parameter updates. Yet such interventions are often hard to interpret: it remains unclear what exactly a steering direction represents and which side effects it introduces. Motivated by recent mechanistic interpretability results suggesting that Sparse Autoencoders (SAEs) can extract sparse, human-interpretable features from model activations, we develop a steering pipeline that constructs steering directions from a small, inspectable set of SAE latents, directly addressing the interpretability limitations of prior steering methods. We evaluate the method on Ekman-emotion steering (Gemma 2 9B IT and Llama 3.1 8B IT) and on broader concept steering with AxBench (Gemma 2 9B IT). Across benchmarks, our approach improves steering performance over comparable training-free baselines while making interventions more transparent through token-level interpretations of the selected latents. Finally, we present a qualitative bias mitigation case study indicating that our approach can reduce dataset-induced topical bias transfer when building steering directions.
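
The general idea of steering with a direction built from a few SAE latents can be illustrated with a minimal sketch. It is not the pipeline developed in the thesis: it assumes plain activation addition, a weighted sum of SAE decoder vectors as the direction, and toy data. The function names, latent indices (17, 42), weights, and steering strength are all hypothetical.

```python
import math

def steering_direction(decoder_rows, latent_weights):
    """Combine selected SAE decoder vectors into one unit-norm direction.

    decoder_rows: dict mapping latent index -> decoder vector (list of floats)
    latent_weights: dict mapping latent index -> weight for that latent
    Both are illustrative assumptions, not the thesis's data structures.
    """
    dim = len(next(iter(decoder_rows.values())))
    direction = [0.0] * dim
    for idx, weight in latent_weights.items():
        for j, value in enumerate(decoder_rows[idx]):
            direction[j] += weight * value
    norm = math.sqrt(sum(v * v for v in direction))
    if norm == 0.0:
        raise ValueError("selected latents cancel out; direction is zero")
    return [v / norm for v in direction]

def steer(activation, direction, strength):
    """Activation addition: shift one hidden state along the direction."""
    return [a + strength * d for a, d in zip(activation, direction)]

# Toy example: a 4-dimensional hidden state and two hypothetical latents.
decoder_rows = {
    17: [1.0, 0.0, 0.0, 0.0],
    42: [0.0, 1.0, 0.0, 0.0],
}
latent_weights = {17: 3.0, 42: 4.0}

direction = steering_direction(decoder_rows, latent_weights)
steered = steer([0.5, 0.5, 0.5, 0.5], direction, strength=2.0)
```

Because the direction is a sum over a short, explicit list of latents, each contributing latent can be inspected individually, which is the transparency property the abstract emphasizes. In a real model the `steer` step would be applied to the hidden states of a chosen layer during generation.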

Item URL in elib: https://elib.dlr.de/224376/
Document Type: Thesis (Master's)
Title: Explaining and Steering Large Language Model Behavior via Sparse Autoencoders: Identifying and Manipulating Interpretable Features in Text Generation
Authors:
Authors | Institution or Email of Authors | Author's ORCID iD | ORCID Put Code
Gerlach, Luka | luka.gerlach (at) dlr.de | UNSPECIFIED | UNSPECIFIED
DLR Supervisors:
Contribution | DLR Supervisor | Institution or E-Mail | DLR Supervisor's ORCID iD
Thesis advisor | Diallo, Diaoulé | diaoule.diallo (at) dlr.de | https://orcid.org/0000-0001-9226-0050
Thesis advisor | Hecking, Tobias | Tobias.Hecking (at) dlr.de | https://orcid.org/0000-0003-0833-7989
Date: 2026
Open Access: Yes
Number of Pages: 108
Status: Published
Keywords: Large Language Models, Activation Engineering
Institution: Universität zu Köln
HGF - Research field: Aeronautics, Space and Transport
HGF - Program: Space
HGF - Program Themes: Space System Technology
DLR - Research area: Space (Raumfahrt)
DLR - Program: R SY - Space System Technology
DLR - Research theme (Project): R - Collaboration of aviation operators and AI systems
Location: Köln-Porz
Institutes and Institutions: Institute of Software Technology
Institute of Software Technology > Intelligent and Distributed Systems
Deposited By: Hecking, Dr. Tobias
Deposited On: 11 May 2026 12:12
Last Modified: 11 May 2026 12:12
