Towards the Extraction of Location References and Topics from Semi-Structured Textual Data from the Open Web Index using Open-Source Large Language Models

Gadziomski, Patryk Pawel and Pfeffer, Magnus and Rittlinger, Vanessa and Voigt, Stefan (2025) Towards the Extraction of Location References and Topics from Semi-Structured Textual Data from the Open Web Index using Open-Source Large Language Models. CERN. OSSYM - 7th International Open Search Symposium, 2025-10-08 - 2025-10-10, Helsinki, Finland. doi: 10.5281/zenodo.17258462. ISBN 978-92-9083-705-3. ISSN 2957-4935.

This archive cannot provide the full text.

Official URL: https://e-publishing.cern.ch/index.php/OSSYM/issue/view/ossym2025/OSSYM-2025-Proceedings

Abstract

With the steadily growing relevance of the Web as an information source and the associated increase in web content, the systematic extraction of structured information from it is becoming increasingly important. Every day, thousands of social media posts and news articles are published, containing not only thematic but also geospatial information such as location names and addresses. These data are relevant for numerous applications and research fields, including open-data projects like OpenStreetMap, the optimization of search engine indices, and crisis management. The automated extraction of addresses from web sources, particularly from imprint pages, could efficiently capture legal information, such as compliance with the General Data Protection Regulation (GDPR), or improve deep learning models for Named Entity Recognition (NER). Despite the high relevance of this task, existing methods for address extraction from text have so far yielded only limited results due to inconsistent formatting of addresses, the ambiguity of words, and the embedding of addresses in unstructured texts. Since rule-based methods for address extraction achieve only limited quality, the use of Large Language Models (LLMs) is proposed as a promising alternative to specifically extract addresses from imprint pages.

Since the release of GPT-3, LLMs have enabled significant advancements in various fields, particularly in automated text processing. Information extraction, as a subfield of Natural Language Processing (NLP), is gaining relevance due to the availability and functionality of LLMs and is becoming an active research topic. Given the continuous evolution of these models, this trend is expected to persist. This applies both to the technical development of LLMs and to advancements in prompting methods and the refinement of model output through targeted inputs.

The data used in this study originates from the "Legal" datasets of the Open Web Index of the OpenWebSearch.EU project, which provide substantial amounts of imprint data. The data is restricted to the German language. The extracted dataset is annotated using LLMs, followed by manual correction. The result is an annotated dataset with a manually collected "gold standard" of geolocation samples for comparison and quality assessment. The hit rate of the LLM is documented to establish a well-founded basis for further work.

The geolocalization results are documented in order to compare the outputs of the different models and the different prompting techniques applied. Following the extraction, an evaluation is conducted to determine at which level the models can extract relevant geoinformation. Addresses consist of country, postal code, city, street name, and house number. A specific score is assigned to each of these elements. This metric is designed to assess the effectiveness of address extraction using LLMs. Additionally, it is examined whether the German language yields better results for German addresses or whether the English language enables better extraction in general.

To make optimal use of spatial data, the websites in the dataset are classified thematically, e.g. by company type or by a more diverse classification. For this classification task, the LLMs are provided with preexisting thematic categories. The websites are classified according to the plain text and the URL of the website. The thematically classified data enables targeted evaluation of the geocoded data points through subsequent visualization and analysis. The data obtained includes all addresses found in the imprint as well as the classification of the website.

The output data from models with many parameters is expected to be complete and will likely surpass previous rule-based and data-driven approaches. However, there remains a possibility that the models, regardless of their parameter count, may not be able to perform address extraction and classification at sufficient quality. Extracting the spatial context in the form of coordinates from the addresses enables a large-scale geographic analysis of imprint entries. The thematic context can indicate the type of institution on the site.

Based on the achieved quality and hit rate, the computational power required by each model is analyzed in order to optimize the required computational resources. Therefore, the evaluation not only captures the outputs but also records the number of generated tokens, the resulting costs, and the processing time. LLMs are expected to achieve a significantly higher hit rate in address extraction than conventional methods through targeted prompting. The knowledge gained from this study can contribute to the improvement of data-driven geospatial text data analysis and can be used in areas such as geospatial search engines and many types of open data projects. It should be noted that this study is a work in progress, and the results presented are preliminary.
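
The abstract states that each address element (country, postal code, city, street name, house number) receives its own score, but it does not give the actual values or the matching rule. The following is a minimal sketch of how such a per-element comparison against a gold-standard address might look. The equal weights, the exact-match rule after normalization, and all names (`Address`, `WEIGHTS`, `normalize`, `score_address`) are assumptions for illustration and are not taken from the study.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Address:
    """The five address elements named in the abstract."""
    country: str = ""
    postal_code: str = ""
    city: str = ""
    street: str = ""
    house_number: str = ""


# Hypothetical per-element scores: the abstract does not state the real values,
# so every element is weighted equally here.
WEIGHTS = {
    "country": 1.0,
    "postal_code": 1.0,
    "city": 1.0,
    "street": 1.0,
    "house_number": 1.0,
}


def normalize(value: str) -> str:
    """Case-fold and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(value.casefold().split())


def score_address(predicted: Address, gold: Address) -> float:
    """Return the weighted share of elements that match the gold standard (0.0 to 1.0).

    Assumption: an element counts only on an exact match after normalization.
    Two empty values also count as a match.
    """
    earned = 0.0
    for field in fields(Address):
        if normalize(getattr(predicted, field.name)) == normalize(getattr(gold, field.name)):
            earned += WEIGHTS[field.name]
    return earned / sum(WEIGHTS.values())


# Invented example values, not data from the study.
gold = Address("Deutschland", "12345", "Musterstadt", "Musterstraße", "1")
predicted = Address("deutschland", "12345", "Musterstadt", "Musterstraße", "1a")

print(score_address(predicted, gold))  # 0.8 (the house number differs)
```

Averaging this score over all annotated imprint pages would give one possible aggregate figure for comparing models and prompting techniques. A real evaluation would likely need fuzzier matching, for example for abbreviated street names.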

elib URL of the record:

https://elib.dlr.de/216498/

Document type:

Conference contribution (talk)

Title:

Towards the Extraction of Location References and Topics from Semi-Structured Textual Data from the Open Web Index using Open-Source Large Language Models

Authors:

Authors	Institution or e-mail address	Author ORCID iD	ORCID Put Code
Gadziomski, Patryk Pawel	Hochschule der Medien Stuttgart	not specified	not specified
Pfeffer, Magnus	Hochschule der Medien Stuttgart	not specified	not specified
Rittlinger, Vanessa	vanessa.rittlinger (at) dlr.de	https://orcid.org/0009-0000-9246-7174	not specified
Voigt, Stefan	Stefan.Voigt (at) dlr.de	https://orcid.org/0000-0002-5908-331X	not specified

Date:

2025

Peer-reviewed publication:

No

Open Access:

No

Gold Open Access:

No

In SCOPUS:

No

In ISI Web of Science:

No

DOI:

10.5281/zenodo.17258462

Page range:

Page 79

Editors:

Editors	Institution and/or e-mail address of the editors	Editor ORCID iD	ORCID Put Code
Wagner, Andreas	CERN, Geneva, Switzerland	not specified	not specified
Granitzer, Michael	University of Passau, Germany	not specified	not specified
Gütl, Christian	Graz University of Technology, Austria	not specified	not specified
Öster, Per	CSC - IT Center for Science, Finland	not specified	not specified
Sharikadze, Megi	Leibniz Supercomputing Centre, Munich, Germany	not specified	not specified
Voigt, Stefan	Open Search Foundation, Germany/Stefan.Voigt (at) dlr.de	https://orcid.org/0000-0002-5908-331X	not specified

Publisher:

CERN

Series name:

Proceedings of the 7th International Open Search Symposium #ossym2025, CSC – IT Center for Science, Helsinki, Finland and Online, 8-10 October 2025

ISSN:

2957-4935

ISBN:

978-92-9083-705-3

Status:

published

Keywords:

NLP, LLM, Geocoding, Address extraction, terrabyte

Event title:

OSSYM - 7th International Open Search Symposium

Event location:

Helsinki, Finland

Event type:

International conference

Event start:

8 October 2025

Event end:

10 October 2025

Organizer:

CSC – IT Center for Science Ltd.

HGF - Research field:

not assigned

HGF - Program:

not assigned

HGF - Program topic:

not assigned

DLR - Focus area:

Digitalization

DLR - Research field:

D DAT - Data

DLR - Subfield (project, undertaking):

D - OpenSearch@DLR, R - Remote Sensing and Geo Research, R - HPDA Use

Location:

Oberpfaffenhofen

Institutes and facilities:

German Remote Sensing Data Center > Geo-Risks and Civil Security

Deposited by:

Rittlinger, Vanessa

Deposited on:

19 Nov 2025 12:14

Last modified:

19 Nov 2025 12:14
