Gerlach, Luka (2026) Explaining and Steering Large Language Model Behavior via Sparse Autoencoders: Identifying and Manipulating Interpretable Features in Text Generation. Master's thesis, Universität zu Köln.
Abstract
Large Language Models (LLMs) can be steered by intervening on internal activations at inference time, offering an alternative to prompting and parameter updates. Yet such interventions are often hard to interpret: it remains unclear what exactly a steering direction represents and which side effects it introduces. Motivated by recent mechanistic interpretability results suggesting that Sparse Autoencoders (SAEs) can extract sparse, human-interpretable features from model activations, we develop a steering pipeline that constructs steering directions from a small, inspectable set of SAE latents, directly addressing the interpretability limitations of prior steering methods. We evaluate the method on Ekman-emotion steering (Gemma 2 9B IT and Llama 3.1 8B IT) and on broader concept steering with AxBench (Gemma 2 9B IT). Across benchmarks, our approach improves steering performance over comparable training-free baselines while making interventions more transparent through token-level interpretations of the selected latents. Finally, we present a qualitative bias mitigation case study indicating that our approach can reduce dataset-induced topical bias transfer when building steering directions.
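The general idea of building a steering direction from a few SAE latents and adding it to an activation can be illustrated with a minimal sketch. This is not the thesis pipeline: the function names `steering_direction` and `steer`, the toy decoder matrix `W_dec`, the uniform weights, the unit-norm scaling, and the strength parameter `alpha` are all assumptions made for illustration, and how the thesis selects and weights latents is not stated in the abstract.

```python
import math

def steering_direction(decoder_rows, latent_ids, weights=None):
    """Weighted sum of the selected SAE decoder rows, scaled to unit length.

    Assumption: each decoder row is the activation-space vector of one latent.
    """
    if weights is None:
        weights = [1.0] * len(latent_ids)  # assumed uniform weighting
    dim = len(decoder_rows[0])
    direction = [0.0] * dim
    for idx, w in zip(latent_ids, weights):
        for j in range(dim):
            direction[j] += w * decoder_rows[idx][j]
    norm = math.sqrt(sum(v * v for v in direction))
    if norm == 0.0:
        raise ValueError("selected latents cancel out; direction is zero")
    return [v / norm for v in direction]

def steer(activation, direction, alpha):
    """Add alpha * direction to a single activation vector."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Toy decoder (hypothetical): 3 latents over a 4-dimensional activation space.
W_dec = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

# Only latents 0 and 1 are selected, so the intervention stays inspectable.
direction = steering_direction(W_dec, latent_ids=[0, 1])
steered = steer([0.5, 0.5, 0.5, 0.5], direction, alpha=2.0)
```

In this toy setting only the first two coordinates of the activation change, which mirrors the transparency argument: the effect of the intervention can be traced back to the small set of chosen latents. In a real model the addition would be applied to hidden states at a chosen layer during generation.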
| elib URL of the record: | https://elib.dlr.de/224376/ |
|---|---|
| Document type: | Thesis (Master's thesis) |
| Title: | Explaining and Steering Large Language Model Behavior via Sparse Autoencoders: Identifying and Manipulating Interpretable Features in Text Generation |
| Authors: | Gerlach, Luka |
| DLR supervisor: | |
| Date: | 2026 |
| Open Access: | Yes |
| Number of pages: | 108 |
| Status: | published |
| Keywords: | Large Language Models, Activation Engineering |
| Institution: | Universität zu Köln |
| HGF - Research field: | Aeronautics, Space and Transport |
| HGF - Program: | Space |
| HGF - Program topic: | Space System Technology |
| DLR - Focus area: | Space |
| DLR - Research area: | R SY - Space System Technology |
| DLR - Subarea (project): | R - Collaboration of aviation operators and AI systems |
| Location: | Köln-Porz |
| Institutes & facilities: | Institut für Softwaretechnologie; Institut für Softwaretechnologie > Intelligente und verteilte Systeme |
| Deposited by: | Hecking, Dr. Tobias |
| Deposited on: | 11 May 2026 12:12 |
| Last modified: | 11 May 2026 12:12 |