
Towards the Extraction of Location References and Topics from Semi-Structured Textual Data from the Open Web Index using Open-Source Large Language Models

Gadziomski, Patryk Pawel and Pfeffer, Magnus and Rittlinger, Vanessa and Voigt, Stefan (2025) Towards the Extraction of Location References and Topics from Semi-Structured Textual Data from the Open Web Index using Open-Source Large Language Models. CERN. OSSYM - 7th International Open Search Symposium, 2025-10-08 - 2025-10-10, Helsinki, Finland. doi: 10.5281/zenodo.17258462. ISBN 978-92-9083-705-3. ISSN 2957-4935.

This archive cannot provide the full text.

Official URL: https://e-publishing.cern.ch/index.php/OSSYM/issue/view/ossym2025/OSSYM-2025-Proceedings

Abstract

With the steadily growing relevance of the Web as an information source and the associated increase in web content, the systematic extraction of structured information from it is becoming increasingly important. Every day, thousands of social media posts and news articles are published, containing not only thematic but also geospatial information such as location names and addresses. These data are relevant for numerous applications and research fields, including open-data projects like OpenStreetMap, the optimization of search engine indices, and crisis management. The automated extraction of addresses from web sources, particularly from imprint pages, could efficiently capture legal information, such as compliance with the General Data Protection Regulation (GDPR), or improve deep learning models for Named Entity Recognition (NER). Despite the high relevance of this task, existing methods for address extraction from text have so far yielded only limited results due to inconsistent formatting of addresses, the ambiguity of words, and the embedding of addresses in unstructured texts. Since rule-based methods for address extraction achieve only limited quality, the use of Large Language Models (LLMs) is proposed as a promising alternative to specifically extract addresses from imprint pages.

Since the release of GPT-3, LLMs have enabled significant advancements in various fields, particularly in automated text processing. Information extraction, as a subfield of Natural Language Processing (NLP), is gaining relevance due to the availability and functionality of LLMs and has become an active research topic. Given the continuous evolution of these models, this trend is expected to persist. This applies both to the technical development of LLMs and to advancements in prompting methods and the refinement of model output through targeted inputs.

The data used in this study originates from the "Legal" datasets of the Open Web Index of the OpenWebSearch.EU project, which provide substantial amounts of imprint data. The data is restricted to the German language. The extracted dataset is annotated using LLMs, followed by manual correction. The result is an annotated dataset with a manually collected “gold standard” of geolocation samples for comparison and quality assessment. The hit rate of the LLM is documented to establish a well-founded basis for further work. The geolocalization results are documented so that the outputs of different models and the different prompting techniques applied can be compared. Following the extraction, an evaluation is conducted to determine at which level the models can extract relevant geoinformation. Addresses consist of country, postal code, city, street name, and house number. A specific score is assigned to each of these elements. This metric is designed to assess the effectiveness of address extraction using LLMs. Additionally, it is examined whether the German language yields better results for German addresses or whether, in general, the English language enables better extraction.

To make optimal use of spatial data, the websites in the dataset are classified thematically, e.g. by company type or by a more diverse classification. For this classification task, the LLMs are provided with preexisting thematic categories. The websites are classified according to the plain text and the URL of the website. The thematically classified data enables targeted evaluation of the geocoded data points through subsequent visualization and analysis. The data obtained includes all addresses found in the imprint as well as the classification of the website.

The output data from models with many parameters is expected to be complete and will likely surpass previous rule-based and data-driven approaches. However, there remains a possibility that the models, regardless of their parameter size, may not be able to perform address extraction and classification at sufficient quality. Extracting the spatial context in the form of coordinates from the addresses enables a large-scale geographic analysis of imprint entries. The thematic context can indicate the type of institution on the site. Based on the achieved quality and hit rate, the computational power required by each model is analyzed in order to optimize the required computational resources. Therefore, the evaluation not only captures the outputs but also records the number of generated tokens, the resulting costs, and the processing time. LLMs are expected to achieve a significantly higher hit rate in address extraction than conventional methods through targeted prompting.

The knowledge gained from this study can contribute to the improvement of data-driven geospatial text data analysis and can be used in areas such as geospatial search engines and many types of open data projects. It should be noted that this study is work in progress, and the results presented reflect first analysis results.
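The per-element scoring described in the abstract can be illustrated with a minimal sketch. The abstract does not state the actual scores or matching rules, so everything below is an assumption made for illustration: the names `Address`, `WEIGHTS`, and `score_address` are hypothetical, each of the five elements carries the same weight, and a field counts as correct only if it matches the gold standard exactly after whitespace and case normalization.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Address:
    """The five address elements named in the abstract."""
    country: Optional[str] = None
    postal_code: Optional[str] = None
    city: Optional[str] = None
    street: Optional[str] = None
    house_number: Optional[str] = None


# Assumption: equal integer weights. The study's real per-element
# scores are not given in the abstract.
WEIGHTS = {
    "country": 1,
    "postal_code": 1,
    "city": 1,
    "street": 1,
    "house_number": 1,
}


def _normalize(value: Optional[str]) -> str:
    """Collapse whitespace and ignore case; a missing value becomes ''."""
    return " ".join(value.split()).casefold() if value else ""


def score_address(predicted: Address, gold: Address, weights=WEIGHTS) -> float:
    """Return the weighted share of elements that match the gold standard.

    Two missing values count as a match, since the model correctly
    reported that the element is absent.
    """
    total = sum(weights.values())
    earned = 0
    for field in fields(Address):
        same = _normalize(getattr(predicted, field.name)) == _normalize(
            getattr(gold, field.name)
        )
        if same:
            earned += weights[field.name]
    return earned / total


gold = Address("Deutschland", "70569", "Stuttgart", "Nobelstraße", "10")
predicted = Address("deutschland", "70569", " Stuttgart ", "Nobelstraße", "10a")
print(score_address(predicted, gold))
```

In this made-up example four of the five elements agree with the gold standard, and only the house number differs. Exact matching is a deliberately strict choice; a real evaluation might instead award partial credit for near matches such as abbreviated street names.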

elib URL of this entry: https://elib.dlr.de/216498/
Document type: Conference contribution (Talk)
Title: Towards the Extraction of Location References and Topics from Semi-Structured Textual Data from the Open Web Index using Open-Source Large Language Models
Authors:
Author | Institution or e-mail address | Author ORCID iD | ORCID Put Code
Gadziomski, Patryk Pawel | Hochschule der Medien Stuttgart | not specified | not specified
Pfeffer, Magnus | Hochschule der Medien Stuttgart | not specified | not specified
Rittlinger, Vanessa | vanessa.rittlinger (at) dlr.de | https://orcid.org/0009-0000-9246-7174 | not specified
Voigt, Stefan | Stefan.Voigt (at) dlr.de | https://orcid.org/0000-0002-5908-331X | not specified
Date: 2025
Refereed publication: No
Open Access: No
Gold Open Access: No
In SCOPUS: No
In ISI Web of Science: No
DOI: 10.5281/zenodo.17258462
Page range: Page 79
Editors:
Editor | Institution and/or e-mail address of the editor | Editor ORCID iD | ORCID Put Code
Wagner, Andreas | CERN, Geneva, Switzerland | not specified | not specified
Granitzer, Michael | University of Passau, Germany | not specified | not specified
Gütl, Christian | Graz University of Technology, Austria | not specified | not specified
Öster, Per | CSC - IT Center for Science, Finland | not specified | not specified
Sharikadze, Megi | Leibniz Supercomputing Centre, Munich, Germany | not specified | not specified
Voigt, Stefan | Open Search Foundation, Germany / Stefan.Voigt (at) dlr.de | https://orcid.org/0000-0002-5908-331X | not specified
Publisher: CERN
Series name: Proceedings of 7th International Open Search Symposium #ossym2025, CSC – IT Center for Science, Helsinki, Finland and Online, 8-10 October 2025
ISSN: 2957-4935
ISBN: 978-92-9083-705-3
Status: published
Keywords: NLP, LLM, Geocoding, Address extraction, terrabyte
Event title: OSSYM - 7th International Open Search Symposium
Event location: Helsinki, Finland
Event type: international conference
Event start: 8 October 2025
Event end: 10 October 2025
Organizer: CSC – IT Center for Science Ltd.
HGF - Research field: no assignment
HGF - Program: no assignment
HGF - Program theme: no assignment
DLR - Focus area: Digitalization
DLR - Research area: D DAT - Data
DLR - Subarea (project, undertaking): D - OpenSearch@DLR, R - Remote Sensing and Geo Research, R - HPDA Usage
Location: Oberpfaffenhofen
Institutes & facilities: German Remote Sensing Data Center > Geo-Risks and Civil Security
Deposited by: Rittlinger, Vanessa
Deposited on: 19 Nov 2025 12:14
Last modified: 19 Nov 2025 12:14
