
Aligning Large Language Models During Inference Time

Vogt, Julian (2024) Aligning Large Language Models During Inference Time. Master's thesis, Ruhr-Universität Bochum.

Full text: PDF (5 MB)

Abstract

Large language models have driven significant advances in natural language processing and have become an integral part of everyday life. While they perform a growing range of tasks with increasing accuracy, sensitive domains such as healthcare or justice place high demands on their safety and reliability. Models that do not follow our ethical standards can produce harmful output and permanently damage trust in artificial intelligence. To mitigate this risk, we developed an alignment technique that operates entirely at inference time. It extracts the activations of a few positive and negative examples during the forward pass, then uses latent-space arithmetic and post-processing to generate effective steering vectors. Misalignment in subsequent forward passes is detected automatically, and the steering vectors are applied until alignment is restored.

We started by implementing Turner et al.'s Activation Addition technique and iteratively improved it. In the first iteration, the steering vector was obtained from 50 positive and 50 negative examples instead of just one, yielding a nearly bias-free mean steering vector. By steering over multiple layers of the transformer stack, we were able to increase alignment gradually without overwriting the information contained in the embeddings. In the third iteration, Welch's t-test was applied to identify and eliminate irrelevant dimensions of the steering vector containing noise and bias, resulting in significant performance improvements. Finally, a self-regulating steering system was developed that uses latent-space arithmetic to detect misalignment in the embeddings at any time and autonomously steers until alignment is restored. The development process was accompanied by an evaluation framework that quantified the alignment gain and the associated performance loss of each modification, allowing us to adopt only those changes that provided an improvement.
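The two core operations described above can be illustrated in a short sketch. This is not the thesis's implementation: the function names, the use of numpy arrays of pre-extracted layer activations, and the fixed critical value for the t statistic (rather than a p-value threshold) are illustrative assumptions.

```python
import numpy as np

def mean_steering_vector(pos_acts, neg_acts):
    """Latent-space arithmetic: the difference of the class means over many
    examples gives a nearly bias-free steering direction for one layer.

    pos_acts, neg_acts: arrays of shape (n_examples, hidden_dim) holding
    hidden states extracted at one transformer layer during the forward pass.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def welch_prune(vec, pos_acts, neg_acts, t_crit=1.96):
    """Zero out steering-vector dimensions whose positive and negative
    activations do not differ significantly under Welch's t-test
    (unequal variances, per hidden dimension)."""
    n1, n2 = len(pos_acts), len(neg_acts)
    v1 = pos_acts.var(axis=0, ddof=1)  # sample variances per dimension
    v2 = neg_acts.var(axis=0, ddof=1)
    t = (pos_acts.mean(axis=0) - neg_acts.mean(axis=0)) / np.sqrt(v1 / n1 + v2 / n2)
    pruned = vec.copy()
    pruned[np.abs(t) < t_crit] = 0.0  # insignificant dims carry noise/bias
    return pruned
```

The pruned vector would then be added (with a scaling coefficient) to the hidden states at the chosen layers during generation.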
We then extracted the detection mechanism of the self-regulating steering system and developed a token-wise few-shot text classifier. It uses the same 50 positive and 50 negative examples and the decoder-only model to determine the sentiment at any token position with high accuracy. Unlike scalar sentiment-analysis models, it is not confused when the sentiment changes within a sentence. Our work contributes to comprehensive control over the alignment of LLMs and represents a further step towards safe AI models.
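A token-wise classifier of this kind can be sketched by projecting each token's hidden state onto the direction separating the two example classes. Again a hedged sketch, not the thesis's code: the decision rule (midpoint of the class means as the boundary) and all names are assumptions.

```python
import numpy as np

def token_sentiment(token_acts, pos_acts, neg_acts):
    """Classify sentiment per token position.

    token_acts: (n_tokens, hidden_dim) hidden states of the sequence.
    pos_acts, neg_acts: (n_examples, hidden_dim) few-shot example activations.
    Returns +1 (positive) or -1 (negative) for each token.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    # The midpoint between the class means defines the decision boundary,
    # so a token is classified by which side of it the projection falls on.
    midpoint = (pos_acts.mean(axis=0) + neg_acts.mean(axis=0)) / 2.0
    scores = (token_acts - midpoint) @ direction
    return np.where(scores >= 0, 1, -1)
```

Because the decision is made independently at every token position, the classifier can follow sentiment shifts within a single sentence rather than collapsing them into one scalar score.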

elib URL of this record: https://elib.dlr.de/210009/
Document type: Thesis (Master's thesis)
Title: Aligning Large Language Models During Inference Time
Authors: Vogt, Julian (Institute for AI Safety and Security; ORCID: not specified)
Date: 2 December 2024
Published in: Aligning Large Language Models During Inference Time
Open Access: Yes
Number of pages: 101
Status: submitted
Keywords: Artificial Intelligence, Safety, Security, Large Language Models, Alignment, Activation Addition
Institution: Ruhr-Universität Bochum
Department: Faculty of Computer Science
HGF - Research field: Aeronautics, Space and Transport
HGF - Program: Transport
HGF - Program topic: Road Transport
DLR - Focus area: Transport
DLR - Research theme: V ST Road Transport
DLR - Sub-topic (project): V - SaiNSOR
Location: other
Institutes & facilities: Institut für KI-Sicherheit
Deposited by: Vogt, Julian
Deposited on: 04 Dec 2024 08:45
Last modified: 04 Dec 2024 08:45

