Sandaruwan, Udith and Fonseka, Nimesh and Salwathura, Pamith and Alwis, D. M. Sanjeewa and Rajakaruna Wanigasekara, Chathura and LOGEESHAN, V. (2026) End-to-End Adaptation of LLMs for Low-Resource Languages. IEEE Access. IEEE - Institute of Electrical and Electronics Engineers. doi: 10.1109/ACCESS.2026.3693119. ISSN 2169-3536.
PDF - Publisher's version (published version), 3MB
Official URL: https://ieeexplore.ieee.org/document/11517384
Abstract
While Large Language Models (LLMs) have revolutionized information processing, their benefits are disproportionately skewed toward high-resource languages, leaving languages like Sinhala behind. Building on our earlier work, "An Approach to Training and Fine-Tuning Large Language Models for Low-Resource Languages", this extended version presents the complete development and evaluation of a Sinhala LLM obtained by training a LLaMA 3.1 base model for this under-represented language. Developing the Sinhala LLM required addressing challenges inherent to low-resource language modelling. The scarcity of online Sinhala text and the prevalence of printed-only literature required extensive efforts in corpus creation, translation, and noise reduction. Sinhala’s complex orthography further necessitated the design of a custom tokenization strategy aligned with existing pre-trained architectures. Additionally, the process of continued pre-training, fine-tuning, and quantization was constrained by limited hardware and memory resources. Despite these limitations, the trained Sinhala LLM demonstrates significant progress in adapting large-scale architectures to a low-resource context. Experimental results highlight consistent improvements in text generation quality, contextual understanding, and response coherence. This study demonstrates that effective LLMs can be built for low-resource languages even with limited hardware, providing a reproducible framework for researchers facing similar constraints.
| elib URL of the record: | https://elib.dlr.de/224501/ |
|---|---|
| Document type: | Journal article |
| Title: | End-to-End Adaptation of LLMs for Low-Resource Languages |
| Authors: | Sandaruwan, Udith; Fonseka, Nimesh; Salwathura, Pamith; Alwis, D. M. Sanjeewa; Rajakaruna Wanigasekara, Chathura; LOGEESHAN, V. |
| Date: | May 2026 |
| Published in: | IEEE Access |
| Refereed publication: | Yes |
| Open Access: | Yes |
| Gold Open Access: | Yes |
| In SCOPUS: | Yes |
| In ISI Web of Science: | Yes |
| DOI: | 10.1109/ACCESS.2026.3693119 |
| Publisher: | IEEE - Institute of Electrical and Electronics Engineers |
| ISSN: | 2169-3536 |
| Status: | published |
| Keywords: | Continued Pre-training, Data Scarcity, Large Language Models (LLMs), LoRA Fine-tuning, Low-Resource Languages, Quantization, Transfer Learning |
| HGF - Research field: | no assignment |
| HGF - Program: | no assignment |
| HGF - Program theme: | no assignment |
| DLR - Research area: | no assignment |
| DLR - Program: | no assignment |
| DLR - Research theme (project): | no assignment |
| Location: | Geesthacht |
| Institutes and institutions: | Institut für Maritime Technologien und Antriebssysteme > Energiekonverter und -systeme |
| Deposited by: | Rajakaruna Wanigasekara, Chathura |
| Deposited on: | 08 Jun 2026 09:42 |
| Last modified: | 10 Jun 2026 12:48 |