End-to-End Adaptation of LLMs for Low-Resource Languages

Sandaruwan, Udith and Fonseka, Nimesh and Salwathura, Pamith and Alwis, D. M. Sanjeewa and Rajakaruna Wanigasekara, Chathura and LOGEESHAN, V. (2026) End-to-End Adaptation of LLMs for Low-Resource Languages. IEEE Access. IEEE - Institute of Electrical and Electronics Engineers. doi: 10.1109/ACCESS.2026.3693119. ISSN 2169-3536.

PDF - Publisher's version (published version)
3MB

Official URL: https://ieeexplore.ieee.org/document/11517384

Abstract

While Large Language Models (LLMs) have revolutionized information processing, their benefits are disproportionately skewed toward high-resource languages, leaving languages like Sinhala behind. Building on our earlier work, "An Approach to Training and Fine-Tuning Large Language Models for Low-Resource Languages," this extended version presents the complete development and evaluation of a Sinhala LLM obtained by training a LLaMA 3.1 base model for this under-represented language. Developing the Sinhala LLM required addressing challenges inherent to low-resource language modelling. The scarcity of online Sinhala text and the prevalence of printed-only literature required extensive efforts in corpus creation, translation, and noise reduction. Sinhala’s complex orthography further necessitated the design of a custom tokenization strategy aligned with existing pre-trained architectures. Additionally, the process of continued pre-training, fine-tuning, and quantization was constrained by limited hardware and memory resources. Despite these limitations, the trained Sinhala LLM demonstrates significant progress in adapting large-scale architectures to a low-resource context. Experimental results highlight consistent improvements in text generation quality, contextual understanding, and response coherence. This study demonstrates that effective LLMs can be built for low-resource languages even with limited hardware, providing a reproducible framework for researchers facing similar constraints.

elib URL of this record:

https://elib.dlr.de/224501/

Document type:

Journal article

Title:

End-to-End Adaptation of LLMs for Low-Resource Languages

Authors:

Author	Institution or e-mail address	Author ORCID iD	ORCID Put Code
Sandaruwan, Udith	University of Moratuwa	UNSPECIFIED	UNSPECIFIED
Fonseka, Nimesh	University of Moratuwa	UNSPECIFIED	UNSPECIFIED
Salwathura, Pamith	University of Moratuwa	UNSPECIFIED	UNSPECIFIED
Alwis, D. M. Sanjeewa	Decryptogen LLC	UNSPECIFIED	UNSPECIFIED
Rajakaruna Wanigasekara, Chathura	Chathura.Wanigasekara (at) dlr.de	https://orcid.org/0000-0003-4371-6108	217032390
LOGEESHAN, V.	University of Moratuwa	UNSPECIFIED	UNSPECIFIED

Date:

May 2026

Published in:

IEEE Access

Refereed publication:

Open Access:

Gold Open Access:

In SCOPUS:

In ISI Web of Science:

DOI:

10.1109/ACCESS.2026.3693119

Publisher:

IEEE - Institute of Electrical and Electronics Engineers

ISSN:

2169-3536

Status:

Published

Keywords:

Continued Pre-training, Data Scarcity, Large Language Models (LLMs), LoRA Fine-tuning, Low-Resource Languages, Quantization, Transfer Learning

HGF - Research field:

not assigned

HGF - Program:

not assigned

HGF - Program topic:

not assigned

DLR - Research focus:

not assigned

DLR - Research area:

not assigned

DLR - Sub-area (project, undertaking):

not assigned

Location:

Geesthacht

Institutes and facilities:

Institut für Maritime Technologien und Antriebssysteme > Energiekonverter und -systeme

Deposited by:

Rajakaruna Wanigasekara, Chathura

Deposited on:

08 Jun 2026 09:42

Last modified:

10 Jun 2026 12:48
