Explaining and Steering Large Language Model Behavior via Sparse Autoencoders: Identifying and Manipulating Interpretable Features in Text Generation

Gerlach, Luka (2026) Explaining and Steering Large Language Model Behavior via Sparse Autoencoders: Identifying and Manipulating Interpretable Features in Text Generation. Master's thesis, Universität zu Köln.

Abstract

Large Language Models (LLMs) can be steered by intervening on internal activations at inference time, offering an alternative to prompting and parameter updates. Yet such interventions are often hard to interpret: it remains unclear what exactly a steering direction represents and which side effects it introduces. Motivated by recent mechanistic interpretability results suggesting that Sparse Autoencoders (SAEs) can extract sparse, human-interpretable features from model activations, we develop a steering pipeline that constructs steering directions from a small, inspectable set of SAE latents, directly addressing the interpretability limitations of prior steering methods. We evaluate the method on Ekman-emotion steering (Gemma 2 9B IT and Llama 3.1 8B IT) and on broader concept steering with AxBench (Gemma 2 9B IT). Across benchmarks, our approach improves steering performance over comparable training-free baselines while making interventions more transparent through token-level interpretations of the selected latents. Finally, we present a qualitative bias mitigation case study indicating that our approach can reduce dataset-induced topical bias transfer when building steering directions.
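
The general idea of building a steering direction from a few SAE latents and adding it to model activations can be illustrated with a minimal NumPy sketch. This is not the thesis's pipeline: the function names, the uniform weighting of latents, the unit-norm scaling, the toy random decoder, and the steering strength are all assumptions made for illustration, and the abstract does not say how latents are selected or combined.

```python
import numpy as np

def build_steering_direction(decoder, latent_ids, weights=None):
    """Combine the decoder rows of selected SAE latents into one unit-norm direction.

    decoder: array of shape (n_latents, d_model), one decoder vector per latent.
    latent_ids: indices of the (hypothetically hand-inspected) latents to use.
    weights: optional per-latent weights; uniform weighting is an assumption.
    """
    rows = decoder[np.asarray(latent_ids)]            # shape (k, d_model)
    if weights is None:
        weights = np.ones(len(latent_ids))
    direction = np.asarray(weights, dtype=float) @ rows
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        raise ValueError("selected latents cancel out to a zero direction")
    return direction / norm

def steer(activations, direction, strength):
    """Add the scaled direction to every token position's activation vector."""
    return activations + strength * direction

# Toy stand-ins: a random "decoder" with 16 latents and hidden size 8,
# and random activations for 4 token positions.
rng = np.random.default_rng(0)
decoder = rng.normal(size=(16, 8))
acts = rng.normal(size=(4, 8))

direction = build_steering_direction(decoder, latent_ids=[2, 5, 11])
steered = steer(acts, direction, strength=4.0)
```

Because the direction is a combination of a small number of named latents, each contributing latent can be inspected separately, which is the transparency property the abstract emphasizes.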

elib URL of this record:

https://elib.dlr.de/224376/

Document type:

Thesis (Master's thesis)

Title:

Explaining and Steering Large Language Model Behavior via Sparse Autoencoders: Identifying and Manipulating Interpretable Features in Text Generation

Authors:

Authors	Institution or email address	Author ORCID iD	ORCID Put Code
Gerlach, Luka	luka.gerlach (at) dlr.de	Not specified	Not specified

DLR supervisor:

Contribution type	DLR supervisor	Institution or email address	DLR supervisor ORCID iD
Thesis advisor	Diallo, Diaoulé	diaoule.diallo (at) dlr.de	https://orcid.org/0000-0001-9226-0050
Thesis advisor	Hecking, Tobias	Tobias.Hecking (at) dlr.de	https://orcid.org/0000-0003-0833-7989

Date:

2026

Open Access:

Number of pages:

108

Status:

Published

Keywords:

Large Language Models, Activation Engineering

Institution:

Universität zu Köln

HGF - Research field:

Aeronautics, Space and Transport

HGF - Program:

Space

HGF - Program topic:

Space System Technology

DLR - Focus area:

Space

DLR - Research area:

R SY - Space System Technology

DLR - Sub-area (project, undertaking):

R - Collaboration of Aviation Operators and AI Systems

Location:

Köln-Porz

Institutes and facilities:

Institut für Softwaretechnologie
Institut für Softwaretechnologie > Intelligente und verteilte Systeme

Deposited by:

Hecking, Dr. Tobias

Deposited on:

11 May 2026 12:12

Last modified:

11 May 2026 12:12
