Bernahrndt, Marius (2025) An Approach to Generate Training Data for Question-to-AQL Querying Models. Master's thesis, University of Cologne.
PDF (6 MB)
Abstract
Graph databases are powerful tools for representing and querying complex knowledge structures. Query languages such as ArangoDB's AQL are challenging for non-expert users. Leveraging large language models (LLMs) for natural-language interfaces is an obvious step, but their preparation depends on suitable training corpora that map user questions to executable queries. For AQL, such corpora do not yet exist. This thesis introduces an approach to automatically generate such training data. The method combines schema-guided path sampling with LLM verbalization, ensuring that queries remain executable while questions are expressed in natural language. Fine-tuning an instruction-tuned model on the resulting corpus yields robust, well-formed AQL queries with high execution accuracy. Most remaining discrepancies concern semantic aspects such as collection choice, traversal direction, or operator selection, whereas syntax remains largely stable. Overall, the results demonstrate that schema-guided generation with LLM support can provide a faithful and sufficiently broad dataset, enabling the training of functional question-to-AQL models and offering a reproducible foundation for future NL2AQL research and system development.
| elib URL of this record: | https://elib.dlr.de/220679/ |
|---|---|
| Document type: | Thesis (Master's thesis) |
| Title: | An Approach to Generate Training Data for Question-to-AQL Querying Models |
| Authors: | Bernahrndt, Marius |
| DLR supervisor: | |
| Date: | 28 August 2025 |
| Open access: | Yes |
| Number of pages: | 102 |
| Status: | published |
| Keywords: | Graph databases, ArangoDB / AQL, NL2AQL, Automatic training data generation, LLM |
| Institution: | University of Cologne |
| Department: | Faculty of Mathematics and Natural Sciences, Department of Mathematics and Computer Science |
| HGF research field: | not assigned |
| HGF program: | not assigned |
| HGF program topic: | not assigned |
| DLR focus area: | Digitalization |
| DLR research area: | D - not assigned |
| DLR subarea (project, undertaking): | D - MeToDiO, R - Synergieprojekt \| DLR FM \| DLR Foundation Models [EO] |
| Location: | other |
| Institutes & facilities: | Institut für Softwaretechnologie > Intelligente und verteilte Systeme; Institut für Softwaretechnologie |
| Deposited by: | Bernahrndt, Marius |
| Deposited on: | 10 Dec 2025 09:06 |
| Last modified: | 10 Dec 2025 09:06 |