
Aligning Large Language Models During Inference Time

Vogt, Julian (2024) Aligning Large Language Models During Inference Time. Master's thesis, Ruhr-Universität Bochum.

Full text: PDF (5 MB)

Abstract

Large language models have driven significant advances in natural language processing and have become an integral part of everyday life. While they perform a growing range of tasks with increasing accuracy, sensitive domains such as healthcare or justice place high demands on their safety and reliability. Models that do not follow our ethical standards can produce harmful output and permanently damage trust in artificial intelligence. To mitigate this risk, we developed an alignment technique that operates entirely at inference time. It extracts the activations of a few positive and negative examples during the forward pass, then uses latent-space arithmetic and post-processing to generate effective steering vectors. Misalignment in subsequent forward passes is detected automatically, and the steering vectors are applied until alignment is restored.

We started by implementing Turner et al.'s Activation Addition technique and iteratively improved it. In the first iteration, the steering vector was obtained from 50 positive and 50 negative examples instead of just one, yielding a nearly bias-free mean steering vector. By steering over multiple layers of the transformer stack, we were able to increase alignment gradually without overwriting the information contained in the embeddings. In the third iteration, Welch's t-test was applied to identify and eliminate irrelevant dimensions of the steering vector containing noise and bias, resulting in significant performance improvements. Finally, a self-regulating steering system was developed that uses latent-space arithmetic to detect misalignment in the embeddings at any time and autonomously steers until alignment is restored. The development process was accompanied by an evaluation framework that quantified the alignment gain and the associated performance loss of each modification, allowing us to adopt only those changes that provided an improvement.
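The two core operations described above can be illustrated in a short sketch. This is not the thesis's implementation: the function names, the use of numpy arrays of pre-extracted layer activations, and the fixed critical value for the t statistic (rather than a p-value threshold) are illustrative assumptions.

```python
import numpy as np

def mean_steering_vector(pos_acts, neg_acts):
    """Latent-space arithmetic: the difference of the class means over many
    examples gives a nearly bias-free steering direction for one layer.

    pos_acts, neg_acts: arrays of shape (n_examples, hidden_dim) holding
    hidden states extracted at one transformer layer during the forward pass.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def welch_prune(vec, pos_acts, neg_acts, t_crit=1.96):
    """Zero out steering-vector dimensions whose positive and negative
    activations do not differ significantly under Welch's t-test
    (unequal variances, per hidden dimension)."""
    n1, n2 = len(pos_acts), len(neg_acts)
    v1 = pos_acts.var(axis=0, ddof=1)  # sample variances per dimension
    v2 = neg_acts.var(axis=0, ddof=1)
    t = (pos_acts.mean(axis=0) - neg_acts.mean(axis=0)) / np.sqrt(v1 / n1 + v2 / n2)
    pruned = vec.copy()
    pruned[np.abs(t) < t_crit] = 0.0  # insignificant dims carry noise/bias
    return pruned
```

The pruned vector would then be added (with a scaling coefficient) to the hidden states at the chosen layers during generation.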
We then extracted the detection mechanism of the self-regulating steering system and developed a token-wise few-shot text classifier. It uses the same 50 positive and 50 negative examples and the decoder-only model to determine the sentiment at any token position with high accuracy. Unlike scalar sentiment-analysis models, it is not confused when the sentiment changes within a sentence. Our work contributes to comprehensive control over the alignment of LLMs and represents a further step towards safe AI models.
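A token-wise classifier of this kind can be sketched by projecting each token's hidden state onto the direction separating the two example classes. Again a hedged sketch, not the thesis's code: the decision rule (midpoint of the class means as the boundary) and all names are assumptions.

```python
import numpy as np

def token_sentiment(token_acts, pos_acts, neg_acts):
    """Classify sentiment per token position.

    token_acts: (n_tokens, hidden_dim) hidden states of the sequence.
    pos_acts, neg_acts: (n_examples, hidden_dim) few-shot example activations.
    Returns +1 (positive) or -1 (negative) for each token.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    # The midpoint between the class means defines the decision boundary,
    # so a token is classified by which side of it the projection falls on.
    midpoint = (pos_acts.mean(axis=0) + neg_acts.mean(axis=0)) / 2.0
    scores = (token_acts - midpoint) @ direction
    return np.where(scores >= 0, 1, -1)
```

Because the decision is made independently at every token position, the classifier can follow sentiment shifts within a single sentence rather than collapsing them into one scalar score.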

elib URL of this record: https://elib.dlr.de/210009/
Document type: Thesis (Master's thesis)
Title: Aligning Large Language Models During Inference Time
Authors: Vogt, Julian (Institute for AI Safety and Security; ORCID: not specified)
Date: 2 December 2024
Published in: Aligning Large Language Models During Inference Time
Open Access: Yes
Number of pages: 101
Status: submitted
Keywords: Artificial Intelligence, Safety, Security, Large Language Models, Alignment, Activation Addition
Institution: Ruhr-Universität Bochum
Department: Faculty of Computer Science
HGF - Research field: Aeronautics, Space and Transport
HGF - Program: Transport
HGF - Program topic: Road Transport
DLR - Focus area: Transport
DLR - Research theme: V ST Road Transport
DLR - Sub-topic (project): V - SaiNSOR
Location: other
Institutes & facilities: Institut für KI-Sicherheit
Deposited by: Vogt, Julian
Deposited on: 04 Dec 2024 08:45
Last modified: 04 Dec 2024 08:45

