Toponym resolution leveraging lightweight and open-source large language models and geo-knowledge

Hu, Xuke and Kersten, Jens and Klan, Friederike and Farzana, Sheikh Mastura (2024) Toponym resolution leveraging lightweight and open-source large language models and geo-knowledge. International Journal of Geographical Information Science. Taylor & Francis. doi: 10.1080/13658816.2024.2405182. ISSN 1365-8816.

PDF - Publisher's version (published version), 3 MB

Abstract

Toponym resolution is crucial for extracting geographic information from natural language texts, such as social media posts and news articles. Despite the advances in current methods, including state-of-the-art deep learning solutions like GENRE and a sophisticated voting system that integrates seven individual methods, further improving their accuracy remains essential. To achieve this goal, we propose a novel method that combines lightweight, open-source large language models with geo-knowledge. Specifically, we first fine-tune Mistral (7B), Baichuan2 (7B), Llama2 (7B & 13B), and Falcon (7B) to estimate toponyms’ unambiguous reference (e.g., city, state, country) given their contexts. Subsequently, we correct inaccuracies in the generated references and determine their geo-coordinates by sequentially querying the GeoNames, Nominatim, and ArcGIS geocoders until a successful geocoding result is achieved. Our methods outperform 20 existing methods across seven challenging datasets comprising 83,365 toponyms worldwide, with the Mistral-based method leading, followed by the Baichuan2-, Llama2-, and Falcon-based methods. Specifically, the Mistral-based method achieves an Accuracy@161km of 0.91, surpassing GENRE, the best individual method, by 17% and the seven-method composite voting system by 7%. Moreover, our methods are computationally efficient: they are operable on a single general GPU, have modest memory requirements (14 GB for 7B models and 27 GB for 13B models), and exceed both GENRE and the voting system in inference speed.
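The two mechanics mentioned in the abstract, querying geocoders one after another until one succeeds and scoring with Accuracy@161km, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the stub geocoders, and the sample coordinates are all hypothetical; the stubs merely stand in for real GeoNames, Nominatim, and ArcGIS clients so the example runs offline. Accuracy@161km is assumed here to follow its usual definition: the share of toponyms whose predicted point lies within 161 km of the gold point, with distance approximated by the haversine formula on a sphere of radius 6371 km.

```python
from math import asin, cos, radians, sin, sqrt
from typing import Callable, Optional, Sequence, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees
Geocoder = Callable[[str], Optional[Coord]]


def geocode_with_fallback(reference: str, geocoders: Sequence[Geocoder]) -> Optional[Coord]:
    """Query each geocoder in order and return the first successful result."""
    for geocode in geocoders:
        try:
            result = geocode(reference)
        except Exception:
            # Treat a failing service like a miss and move on to the next one.
            result = None
        if result is not None:
            return result
    return None


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def accuracy_at_161km(predicted: Sequence[Optional[Coord]], gold: Sequence[Coord]) -> float:
    """Share of toponyms whose predicted point lies within 161 km of the gold point."""
    hits = sum(
        1 for p, g in zip(predicted, gold)
        if p is not None and haversine_km(p, g) <= 161.0
    )
    return hits / len(gold) if gold else 0.0


# Hypothetical stand-ins for the three services (assumptions, not real clients).
def geonames_stub(reference: str) -> Optional[Coord]:
    return None  # simulates "no match"


def nominatim_stub(reference: str) -> Optional[Coord]:
    return {"Paris, Texas, United States": (33.66, -95.56)}.get(reference)


def arcgis_stub(reference: str) -> Optional[Coord]:
    return (0.0, 0.0)  # only reached if the earlier geocoders fail


chain = [geonames_stub, nominatim_stub, arcgis_stub]
resolved = geocode_with_fallback("Paris, Texas, United States", chain)
score = accuracy_at_161km([resolved, None], [(33.66, -95.55), (48.86, 2.35)])
```

Passing the geocoders in as plain callables keeps the fallback order explicit and makes it easy to substitute real clients later. An unresolved toponym (`None`) counts as a miss in the metric, which is one possible convention and is an assumption of this sketch.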

elib URL of this record: https://elib.dlr.de/208870/
Document type: Journal article
Title: Toponym resolution leveraging lightweight and open-source large language models and geo-knowledge
Authors:
Author | Institution or e-mail address | Author ORCID iD | ORCID Put Code
Hu, Xuke | Xuke.Hu (at) dlr.de | https://orcid.org/0000-0002-5649-0243 | 174099789
Kersten, Jens | jens.kersten (at) dlr.de | https://orcid.org/0000-0002-4735-7360 | Not specified
Klan, Friederike | Friederike.Klan (at) dlr.de | https://orcid.org/0000-0002-1856-7334 | Not specified
Farzana, Sheikh Mastura | Sheikh.Farzana (at) dlr.de | Not specified | Not specified
Date: 24 September 2024
Published in: International Journal of Geographical Information Science
Peer-reviewed publication: Yes
Open Access: Yes
Gold Open Access: No
In SCOPUS: Yes
In ISI Web of Science: Yes
DOI: 10.1080/13658816.2024.2405182
Publisher: Taylor & Francis
ISSN: 1365-8816
Status: Published
Keywords: Geoparsing; toponym resolution; geocoding; large language model
HGF - Research field: Aeronautics, Space and Transport
HGF - Program: Space
HGF - Program topic: Space System Technology
DLR - Research focus: Space
DLR - Research area: R SY - Space System Technology
DLR - Subarea (project, undertaking): R - Big Data und KI für die Entscheidungsunterstützung (Big Data and AI for decision support), D - OpenSearch@DLR
Location: Jena
Institutes and institutions: Institute of Data Science > Data Acquisition and Mobilisation
Deposited by: Hu, Xuke
Deposited on: 19 Dec 2024 10:43
Last modified: 19 Dec 2024 10:43
