Bensch, Oliver (2021) Novel Approaches For Literature-Based Discovery. Master's thesis, Maastricht University.
Abstract
Literature-Based Discovery (LBD) is a technique for generating novel hypotheses from scientific corpora. The typical LBD procedure comprises three phases. The first phase identifies significant concepts in a text corpus. The second phase extracts relationships between the detected concepts to construct a concept graph. In the third phase, the unconnected concept pairs are ranked; highly ranked pairs can suggest new hypotheses and may indicate novel discoveries. However, LBD is typically applied to medical literature, because this domain has annotated corpora and ontologies, such as the Unified Medical Language System (UMLS), that facilitate knowledge extraction. This work assesses novel approaches to all three phases of LBD for open-domain scientific literature. For concept detection, several methods, such as Named Entity Recognition models, were trained and evaluated on datasets such as STEM-ECR. On this dataset, the model "SciBERT-cased-STEM-ECR" achieved an F1 score of 65.4% and was used to detect concepts in a collection of abstracts retrieved from eLib, the publication server of the German Aerospace Center (DLR). Additionally, SciElectraSmall++ models were trained on a subset of the AMiner corpus, which significantly improved concept detection performance on the STEM-ECR dataset compared to Electra-Small++ models trained on the OpenWebText corpus. For relation extraction, Question Answering models trained on SQuAD 2.0 were first evaluated on the SciERC dataset. The "Electra-large-squad2" model identified 44.7% of relations and concepts with an average word error rate of 23.8% and was used to extract relationships between the detected concepts in the DLR dataset to create a concept graph. To compare the novel approaches with conventional methods, another graph based on tf.idf and co-occurrence was created. Using an online questionnaire, domain experts (N=11) evaluated the created graphs.
The experts classified all six randomly sampled concept pairs detected by the question answering approach, such as "virtual environments"-"USED-FOR"-"interactive exploration", as related. In contrast, only 57.14% of the relations in the graph created using tf.idf and co-occurrence, such as "scales"-"related"-"weight", were judged correct by most of the experts. A search engine was developed to perform manual LBD on the created graphs. The concept networks were grouped by the year of the corresponding abstracts, and it was evaluated whether link prediction methods such as common neighbors could forecast future relationships. Based on the graph up to 2018, common neighbors predicted 65 new relations, of which one matched one of the 443 relations added between 2018 and 2021.
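The link prediction evaluation described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `common_neighbor_scores` and `evaluate_predictions`, the `min_shared` threshold, and the toy edge lists are all hypothetical, and the concept graph is assumed to be undirected even though extracted relations such as USED-FOR have a direction.

```python
from itertools import combinations


def common_neighbor_scores(edges):
    """Score each unconnected node pair by its number of shared neighbors.

    Assumption: edges are treated as undirected (hypothetical simplification).
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # already connected, so not a candidate for prediction
        shared = len(neighbors[u] & neighbors[v])
        if shared > 0:
            scores[(u, v)] = shared
    return scores


def evaluate_predictions(old_edges, new_edges, min_shared=1):
    """Compare pairs predicted from the older graph with edges added later."""
    predicted = {
        pair
        for pair, score in common_neighbor_scores(old_edges).items()
        if score >= min_shared
    }
    actual = {tuple(sorted(edge)) for edge in new_edges}
    return predicted, predicted & actual


# Toy data (hypothetical): graph up to a cutoff year, and edges added afterwards.
old_edges = [("a", "b"), ("b", "c"), ("c", "d")]
new_edges = [("d", "b"), ("a", "e")]

predicted, hits = evaluate_predictions(old_edges, new_edges)
print(len(predicted), "predicted,", len(hits), "confirmed")
```

In this toy graph, two unconnected pairs share a neighbor and one of them appears among the later edges. The abstract reports the analogous counts for the DLR graph: 65 predicted relations, one of which matched.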
elib URL of this record: | https://elib.dlr.de/147246/ |
---|---|
Document type: | Thesis (Master's thesis) |
Title: | Novel Approaches For Literature-Based Discovery |
Authors: | Bensch, Oliver |
Date: | 2021 |
Refereed publication: | No |
Open Access: | Yes |
Number of pages: | 77 |
Status: | unpublished |
Keywords: | Literature-based discovery, Text Mining, Graph Mining, Knowledge Graphs |
Institution: | Maastricht University |
Department: | Department of Data Science and Knowledge Engineering |
HGF - Research field: | Aeronautics, Space and Transport |
HGF - Program: | Space |
HGF - Program topic: | no assignment |
DLR - Research area: | Space |
DLR - Research field: | R - no assignment |
DLR - Research theme (Project): | R - no assignment |
Location: | Köln-Porz |
Institutes and institutions: | Institute for Software Technology; Institute for Software Technology > Intelligent and Distributed Systems |
Deposited by: | Hecking, Dr. Tobias |
Deposited on: | 15 Dec 2021 10:51 |
Last modified: | 15 Dec 2021 10:51 |