Sandaruwan, Udith and Fonseka, Nimesh and Salwathura, Pamith and Alwis, D. M. Sanjeewa and Thembiliyagoda, Champika and Amarathunga, Thimira and Rajakaruna Wanigasekara, Chathura and Logeeshan, V. (2025) An Approach to Training and Fine-tuning Large Language Models for Low-Resource Languages. In: 6th IEEE Annual World AI IoT Congress, AIIoT 2025. IEEE. 2025 IEEE World AI IoT Congress (AIIoT), 2025-05-28 - 2025-05-30, Seattle, WA, USA. doi: 10.1109/AIIoT65859.2025.11105278. ISBN 979-833152508-8.
PDF (705 kB), accessible only within DLR
Official URL: https://ieeexplore.ieee.org/document/11105278
Abstract
The essence of communication is information exchange. However, people living in less connected areas often face socioeconomic and technological barriers that limit their access to crucial information that could improve their livelihoods. Developing a Sinhala Large Language Model (LLM) presents significant challenges. Sinhala is a low-resource language with limited online content, and a substantial portion of its literature is confined to print. Preparing a diverse and clean dataset required significant effort in text generation, translation, and noise reduction. Additionally, Sinhala's complex orthography necessitated custom tokenization strategies compatible with pre-trained models, further complicating the process. Severe hardware limitations and memory bottlenecks were also major constraints during the continued pre-training, fine-tuning, and quantization phases. Despite these challenges, the Sinhala LLM represents a critical step in bridging the digital divide for Sinhala speakers. Although the constraints restrict the model's performance, the work gives important insights into the training of LLMs for underrepresented languages. It contributes to the fast-growing field of low-resource language modeling and paves the way for further research in this domain.
| elib URL of the record: | https://elib.dlr.de/215544/ |
|---|---|
| Document type: | Conference contribution (lecture) |
| Title: | An Approach to Training and Fine-tuning Large Language Models for Low-Resource Languages |
| Authors: | Sandaruwan, Udith; Fonseka, Nimesh; Salwathura, Pamith; Alwis, D. M. Sanjeewa; Thembiliyagoda, Champika; Amarathunga, Thimira; Rajakaruna Wanigasekara, Chathura; Logeeshan, V. |
| Date: | September 2025 |
| Published in: | 6th IEEE Annual World AI IoT Congress, AIIoT 2025 |
| Refereed publication: | Yes |
| Open Access: | No |
| Gold Open Access: | No |
| In SCOPUS: | Yes |
| In ISI Web of Science: | No |
| DOI: | 10.1109/AIIoT65859.2025.11105278 |
| Publisher: | IEEE |
| ISBN: | 979-833152508-8 |
| Status: | published |
| Keywords: | Large Language Models (LLMs), Low-Resource Languages, Sinhala Language Processing, Continued Pretraining |
| Event title: | 2025 IEEE World AI IoT Congress (AIIoT) |
| Event location: | Seattle, WA, USA |
| Event type: | international conference |
| Event start: | 28 May 2025 |
| Event end: | 30 May 2025 |
| HGF - Research field: | not assigned |
| HGF - Program: | not assigned |
| HGF - Program topic: | not assigned |
| DLR - Focus area: | not assigned |
| DLR - Research area: | not assigned |
| DLR - Sub-area (project, undertaking): | not assigned |
| Location: | Geesthacht |
| Institutes and facilities: | Institut für Maritime Energiesysteme > Energiekonverter und -systeme |
| Deposited by: | Rajakaruna Wanigasekara, Chathura |
| Deposited on: | 29 Sep 2025 07:46 |
| Last modified: | 29 Sep 2025 07:46 |