Gadziomski, Patryk Pawel and Pfeffer, Magnus and Rittlinger, Vanessa and Voigt, Stefan (2025) Towards the Extraction of Location References and Topics from Semi-Structured Textual Data from the Open Web Index using Open-Source Large Language Models. CERN. OSSYM - 7th International Open Search Symposium, 2025-10-08 - 2025-10-10, Helsinki, Finland. doi: 10.5281/zenodo.17258462. ISBN 978-92-9083-705-3. ISSN 2957-4935.
This archive cannot provide the full text.
Official URL: https://e-publishing.cern.ch/index.php/OSSYM/issue/view/ossym2025/OSSYM-2025-Proceedings
Abstract
With the steadily growing relevance of the Web as an information source and the associated increase in web content, the systematic extraction of structured information from it is becoming increasingly important. Every day, thousands of social media posts and news articles are published, containing not only thematic but also geospatial information such as location names and addresses. These data are relevant for numerous applications and research fields, including open-data projects like OpenStreetMap, the optimization of search engine indices, and crisis management. The automated extraction of addresses from web sources, particularly from imprint pages, could efficiently capture legal information, such as compliance with the General Data Protection Regulation (GDPR), or improve deep learning models for Named Entity Recognition (NER). Despite the high relevance of this task, existing methods for address extraction from text have so far yielded only limited results due to inconsistent formatting of addresses, the ambiguity of words, and the embedding of addresses in unstructured texts. Since rule-based methods for address extraction achieve only limited quality, the use of Large Language Models (LLMs) is proposed as a promising alternative to specifically extract addresses from imprint pages. Since the release of GPT-3, LLMs have enabled significant advancements in various fields, particularly in automated text processing. Information extraction, as a subfield of Natural Language Processing (NLP), is gaining increasing relevance due to the availability and functionality of LLMs and is becoming an active research topic. Given the continuous evolution of these models, this trend is expected to persist. 
This applies both to the technical developments of LLMs and to advancements in prompting methods and the refinement of model output through targeted inputs. The data used in this study originates from the "Legal" datasets of the Open Web Index of the OpenWebSearch.EU project, which provide substantial amounts of imprint data. The data is restricted to the German language. The extracted dataset is annotated using LLMs, followed by manual correction. The result is an annotated dataset with a manually collected “gold standard” of geolocation samples for comparison and quality assessment. The hit rate of the LLM is documented to establish a well-founded basis for further work. The geolocalization results are documented in order to compare the outputs of different models and the different prompting techniques applied. Following the extraction, an evaluation is conducted to determine at which level the models can extract relevant geoinformation. Addresses consist of country, postal code, city, street name, and house number. A specific score is assigned to each of these elements. This metric is designed to assess the effectiveness of address extraction using LLMs. Additionally, it is examined whether the German language yields better results for German addresses or whether, in general, the English language enables better extraction. To make optimal use of spatial data, the websites in the dataset are classified thematically, e.g. by company type or a more diverse classification. For this classification task, the LLMs are provided with preexisting thematic categories. 
The websites are classified according to the plain text and the URL of the website. The thematically classified data enables targeted evaluation of the geocoded data points through subsequent visualization and analysis. The data obtained includes all addresses found in the imprint as well as the classification of the website. The output data from models with many parameters is expected to be complete and will likely surpass previous rule-based and data-driven approaches. However, there remains a possibility that the models, regardless of their parameter size, may not be able to perform address extraction and classification at sufficient quality. Extracting the spatial context in the form of coordinates from the addresses enables a large-scale geographic analysis of imprint entries. The thematic context can indicate the type of institution on the site. Based on the achieved quality and hit rate, the computational power required by each model is analyzed in order to optimize for the required computational resources. Therefore, the evaluation not only captures the outputs but also records the number of generated tokens, the resulting costs, and the processing time. LLMs are expected to achieve a significantly higher hit rate in address extraction than conventional methods through targeted prompting. The knowledge gained from this study can contribute to the improvement of data-driven geospatial text data analysis and can be used in areas such as geospatial search engines and many types of open data projects. It should be noted that this study is work in progress, and the results presented reflect initial analysis results.
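The per-element scoring described in the abstract could be sketched roughly as follows. This is only an illustration, not the authors' metric: the field names, the integer weights in `WEIGHTS`, the `normalize` helper, and the made-up example address are all assumptions introduced here, since the abstract does not state how each element is scored.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Address:
    """The five address elements named in the abstract."""
    country: Optional[str] = None
    postal_code: Optional[str] = None
    city: Optional[str] = None
    street: Optional[str] = None
    house_number: Optional[str] = None


# Hypothetical weights per element (assumption, not taken from the paper).
WEIGHTS = {
    "country": 1,
    "postal_code": 2,
    "city": 2,
    "street": 3,
    "house_number": 2,
}


def normalize(value: Optional[str]) -> str:
    """Case-fold and collapse whitespace; a missing value becomes ''."""
    return " ".join(value.casefold().split()) if value else ""


def address_score(predicted: Address, gold: Address) -> float:
    """Weighted share of elements where the prediction matches the gold label.

    An element that is missing in both addresses counts as a match.
    """
    total = sum(WEIGHTS.values())
    earned = 0
    for field in fields(Address):
        pred_value = normalize(getattr(predicted, field.name))
        gold_value = normalize(getattr(gold, field.name))
        if pred_value == gold_value:
            earned += WEIGHTS[field.name]
    return earned / total


# Made-up example address, for illustration only.
gold = Address("Deutschland", "12345", "Musterstadt", "Musterstraße", "1")
predicted = Address("deutschland", "12345", "Musterstadt", "Musterstr.", "1")
score = address_score(predicted, gold)
```

In this sketch the abbreviated street name does not match the gold label, so only the street weight is lost. Exact matching after normalization is a deliberately strict choice; a fuzzy comparison would be an alternative if abbreviations should receive partial credit.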
| elib URL of the record: | https://elib.dlr.de/216498/ |
|---|---|
| Document type: | Conference contribution (talk) |
| Title: | Towards the Extraction of Location References and Topics from Semi-Structured Textual Data from the Open Web Index using Open-Source Large Language Models |
| Authors: | Gadziomski, Patryk Pawel; Pfeffer, Magnus; Rittlinger, Vanessa; Voigt, Stefan |
| Date: | 2025 |
| Refereed publication: | No |
| Open Access: | No |
| Gold Open Access: | No |
| In SCOPUS: | No |
| In ISI Web of Science: | No |
| DOI: | 10.5281/zenodo.17258462 |
| Page range: | Page 79 |
| Editors: | |
| Publisher: | CERN |
| Series name: | Proceedings of 7th International Open Search Symposium #ossym2025, CSC – IT Center for Science, Helsinki, Finland and Online, 8-10 October 2025 |
| ISSN: | 2957-4935 |
| ISBN: | 978-92-9083-705-3 |
| Status: | published |
| Keywords: | NLP, LLM, Geocoding, Address extraction, terrabyte |
| Event title: | OSSYM - 7th International Open Search Symposium |
| Event location: | Helsinki, Finland |
| Event type: | international conference |
| Event start: | 8 October 2025 |
| Event end: | 10 October 2025 |
| Organizer: | CSC – IT Center for Science Ltd. |
| HGF - Research field: | no assignment |
| HGF - Program: | no assignment |
| HGF - Program topic: | no assignment |
| DLR - Focus area: | Digitalisierung |
| DLR - Research field: | D DAT - Daten |
| DLR - Sub-area (project, undertaking): | D - OpenSearch@DLR, R - Fernerkundung u. Geoforschung, R - HPDA-Nutzung |
| Location: | Oberpfaffenhofen |
| Institutes & facilities: | Deutsches Fernerkundungsdatenzentrum > Georisiken und zivile Sicherheit |
| Deposited by: | Rittlinger, Vanessa |
| Deposited on: | 19 Nov 2025 12:14 |
| Last modified: | 19 Nov 2025 12:14 |