Gadziomski, Patryk Pawel (2025) Extraction of Location References and Topics from Semi-Structured Textual Data from the Open Web Index using Open-Source Large Language Models. Bachelor's thesis, Hochschule der Medien Stuttgart.
This archive cannot provide the full text.
Abstract
This bachelor thesis examines the extraction of location references (addresses) using open-source large language models, as well as the topic classification of webpages with the same approach. The goal is not only to determine whether large language models outperform traditional methods, but also to assess when it might be reasonable to trade off accuracy for a more sustainable extraction pipeline. Therefore, the energy consumption of the models is also analysed. To answer these questions, an experimental pipeline was developed, and the results were evaluated using classification metrics, similarity metrics, qualitative error analysis, energy consumption, and geocoding (spatial accuracy). The results highlight the importance of prompt engineering and model efficiency in terms of both accuracy and energy usage, particularly in the extraction of full addresses and address components. These findings provide valuable guidance for designing efficient and sustainable NLP pipelines for address extraction and web classification.
| elib URL of the record: | https://elib.dlr.de/215838/ |
|---|---|
| Document type: | Thesis (Bachelor's thesis) |
| Title: | Extraction of Location References and Topics from Semi-Structured Textual Data from the Open Web Index using Open-Source Large Language Models |
| Authors: | Gadziomski, Patryk Pawel |
| DLR supervisor: | |
| Date: | 30 June 2025 |
| Open Access: | No |
| Number of pages: | 118 |
| Status: | Published |
| Keywords: | LLM, NLP, geocoding |
| Institution: | Hochschule der Medien Stuttgart |
| Department: | Information Sciences |
| HGF - Research field: | not assigned |
| HGF - Program: | not assigned |
| HGF - Program topic: | not assigned |
| DLR - Focus area: | Digitalization |
| DLR - Research area: | D DAT - Data |
| DLR - Subarea (project, undertaking): | D - OpenSearch@DLR, R - Fernerkundung u. Geoforschung |
| Location: | Oberpfaffenhofen |
| Institutes and institutions: | Deutsches Fernerkundungsdatenzentrum > Georisiken und zivile Sicherheit |
| Deposited by: | Rittlinger, Vanessa |
| Deposited on: | 27 Oct 2025 09:34 |
| Last modified: | 27 Oct 2025 09:34 |