Vogt, Julian (2024) Aligning Large Language Models During Inference Time. Master's thesis, Ruhr-Universität Bochum.
PDF, 5 MB
Abstract
Large language models have led to significant advances in natural language processing and have become an integral part of everyday life. While they are able to perform a wide range of tasks with increasing accuracy, sensitive domains such as healthcare or justice place high demands on their safety and reliability. Models that do not follow our ethical standards can produce harmful results and permanently damage trust in artificial intelligence. To mitigate this risk, we developed an alignment technique that operates entirely at inference time. It extracts the activations of a few positive and negative examples during the forward pass, then uses latent-space arithmetic and post-processing to generate effective steering vectors. Misalignment in subsequent forward passes is detected automatically, and the steering vectors are applied until alignment is restored. We started by implementing the Activation Addition technique of Turner et al. and iteratively improved it. In the first iteration, the steering vector was obtained from 50 positive and 50 negative examples instead of just one, resulting in a nearly bias-free mean steering vector. By steering across multiple layers of the transformer stack, we were able to gradually increase alignment without overwriting the information contained in the embeddings. In the third iteration, Welch's t-test was applied to identify and eliminate irrelevant dimensions of the steering vector containing noise and bias, resulting in significant performance improvements. Finally, a self-regulating steering system was developed that uses latent-space arithmetic to detect misalignment in the embeddings at any time and autonomously steers until alignment is restored. The development process was accompanied by an evaluation framework that quantified the alignment and the associated performance loss of each modification, allowing us to adopt only those modifications that yielded an improvement. We also extracted the detection mechanism of the self-regulating steering system and developed a token-wise few-shot text classifier. It uses the same 50 positive and negative examples and the decoder-only model to determine the sentiment at any token position with high accuracy; unlike scalar sentiment-analysis models, it does not get confused when the sentiment changes within a sentence. Our work contributes to comprehensive control over the alignment of LLMs and represents a further step towards safe AI models.
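To make the pipeline the abstract describes more concrete, the following is a minimal sketch, not the thesis's actual implementation: it builds a mean-difference steering vector from positive and negative example activations and uses Welch's t-test to zero out dimensions that do not separate the two groups. It assumes a Hugging Face decoder-only model loaded elsewhere; the function names, the mean-pooling over token positions, and the significance level `alpha=0.05` are illustrative assumptions.

```python
# Hedged sketch: steering vector from 50 positive / 50 negative examples
# with Welch's t-test pruning. All names and thresholds are assumptions.
import torch
from scipy.stats import ttest_ind

def layer_activations(model, tokenizer, texts, layer):
    """Mean-pooled hidden states at one transformer layer, one row per text."""
    rows = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # out.hidden_states[layer]: (1, seq_len, d_model) -> average over tokens
        rows.append(out.hidden_states[layer][0].mean(dim=0))
    return torch.stack(rows)  # (n_examples, d_model)

def build_steering_vector(model, tokenizer, positive, negative, layer, alpha=0.05):
    pos = layer_activations(model, tokenizer, positive, layer)  # e.g. 50 texts
    neg = layer_activations(model, tokenizer, negative, layer)
    # Welch's t-test per dimension: zero out dimensions whose positive and
    # negative activations do not differ significantly (noise / bias).
    _, p = ttest_ind(pos.cpu().numpy(), neg.cpu().numpy(),
                     axis=0, equal_var=False)
    mask = torch.tensor(p < alpha, dtype=torch.float32)
    # Latent-space arithmetic: difference of the two class means, masked.
    return (pos.mean(dim=0) - neg.mean(dim=0)) * mask
```

A call such as `steer = build_steering_vector(model, tokenizer, pos_texts, neg_texts, layer=12)` would yield one steering vector per chosen layer; the abstract's multi-layer variant repeats this for a range of layers.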
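Likewise, one plausible reading of the self-regulating system and the token-wise classifier is sketched below: each token's hidden state is projected onto the steering direction via cosine similarity, and the steering vector is added only at positions flagged as misaligned. The hooked layer range, `strength=4.0`, and `threshold=0.0` are hypothetical parameters, not values from the thesis.

```python
# Hedged sketch of detection-plus-steering via forward hooks;
# parameters and layer choice are illustrative assumptions.
import torch
import torch.nn.functional as F

def token_scores(hidden, steer):
    """Cosine similarity of each token's hidden state with the steering
    direction: > 0 reads as aligned, < 0 as misaligned."""
    return F.cosine_similarity(hidden, steer.view(1, 1, -1), dim=-1)

def install_self_regulating_hooks(blocks, steer, strength=4.0, threshold=0.0):
    """Hook a range of transformer blocks (e.g. model.model.layers[10:14]) so
    the steering vector is added only at positions flagged as misaligned."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        s = steer.to(hidden.device, hidden.dtype)
        misaligned = (token_scores(hidden, s) < threshold).float().unsqueeze(-1)
        hidden = hidden + misaligned * strength * s  # steer only where needed
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return [block.register_forward_hook(hook) for block in blocks]
```

Read per token position, the same cosine scores act as a few-shot sentiment signal: because every position is scored independently, a sentiment flip mid-sentence shows up as a sign change rather than being averaged away, which matches the behavior the abstract claims for the token-wise classifier.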
| elib URL of this record: | https://elib.dlr.de/210009/ |
|---|---|
| Document type: | University thesis (Master's thesis) |
| Title: | Aligning Large Language Models During Inference Time |
| Authors: | Vogt, Julian |
| Date: | 2 December 2024 |
| Published in: | Aligning Large Language Models During Inference Time |
| Open Access: | Yes |
| Number of pages: | 101 |
| Status: | submitted contribution |
| Keywords: | Artificial Intelligence, Safety, Security, Large Language Models, Alignment, Activation Addition |
| Institution: | Ruhr-Universität Bochum |
| Department: | Faculty of Computer Science |
| HGF research area: | Aeronautics, Space and Transport |
| HGF program: | Transport |
| HGF program theme: | Road Transport |
| DLR focus area: | Transport |
| DLR research field: | V ST Straßenverkehr |
| DLR subfield (project): | V - SaiNSOR |
| Location: | other |
| Institutes & facilities: | Institut für KI-Sicherheit |
| Deposited by: | Vogt, Julian |
| Deposited on: | 04 Dec 2024 08:45 |
| Last modified: | 04 Dec 2024 08:45 |