Kruspe, Anna and Kersten, Jens and Wiegmann, Matti and Stein, Benno and Klan, Friederike (2018) Classification of Incident-related Tweets: Tackling Imbalanced Training Data using Hybrid CNNs and Translation-based Data Augmentation. Text REtrieval COnference (TREC), 2018-11-14 - 2018-11-16, Gaithersburg, USA.
This archive cannot provide the full text.
Abstract
In this paper, we present our four approaches submitted to the 2018 Text REtrieval Conference (TREC) Incident Streams (IS) track. One of the main challenges in this track is the lack of training data for certain classes defined in the ontology. We therefore took measures to expand the provided data; in a first step, additional tweets were manually selected from CrisisLexT26 and EMTerms for all underrepresented classes ensuring a minimum number of 50 tweets per class. Using this expanded data, we trained three models. The first is a baseline model that uses a logistic regression classifier on word statistics. The second model is a state-of-the-art CNN which considers different frame widths on pre-trained word embeddings. This model was then extended with two identical CNN branches trained on the CrisisLexT26 and CrisisNLP data sets, and a posterior fusion network (third approach). Since all of these models still suffered from a lack of training data, more training examples were generated through a data augmentation technique using automatic round-trip translation. The fourth presented approach is identical to the third one, but was trained on this augmented data set. Finally, we describe our importance ranking procedure for tweets. Our method was implemented by weighting the average importance of the detected class and the tweet's relevance obtained with a classifier trained on the CrisisLexT26 data set.
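As a rough illustration of the ranking step mentioned at the end of the abstract, the sketch below combines the average importance of a tweet's detected class with the tweet's relevance score as a weighted sum. The function names `importance_score` and `rank_tweets`, the weight `alpha`, the assumption that both inputs lie in [0, 1], and all example values are hypothetical; the abstract does not state the actual weighting scheme.

```python
def importance_score(class_importance: float, relevance: float, alpha: float = 0.5) -> float:
    """Weighted combination of two scores, both assumed to lie in [0, 1].

    class_importance: average importance of the detected class (assumed precomputed).
    relevance: relevance score from a separate classifier (assumed given).
    alpha: hypothetical weight on the class importance; not taken from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * class_importance + (1.0 - alpha) * relevance


def rank_tweets(tweets, alpha=0.5):
    """Return the tweets sorted by descending importance score."""
    return sorted(
        tweets,
        key=lambda t: importance_score(t["class_importance"], t["relevance"], alpha),
        reverse=True,
    )


# Made-up example values, for illustration only.
tweets = [
    {"id": "a", "class_importance": 0.9, "relevance": 0.4},
    {"id": "b", "class_importance": 0.3, "relevance": 0.95},
    {"id": "c", "class_importance": 0.8, "relevance": 0.8},
]
ranked = rank_tweets(tweets)
```

With `alpha = 0.5` both signals count equally; moving `alpha` toward 1 would favor the class-level importance over the per-tweet relevance.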
elib URL of the entry: | https://elib.dlr.de/126329/
---|---
Document type: | Conference contribution (talk)
Title: | Classification of Incident-related Tweets: Tackling Imbalanced Training Data using Hybrid CNNs and Translation-based Data Augmentation
Authors: | Kruspe, Anna; Kersten, Jens; Wiegmann, Matti; Stein, Benno; Klan, Friederike
Date: | 2018
Refereed publication: | Yes
Open Access: | No
Gold Open Access: | No
In SCOPUS: | No
In ISI Web of Science: | No
Status: | accepted contribution
Keywords: | social media, tweet, classification, disaster management, imbalanced training data
Event title: | Text REtrieval COnference (TREC)
Event location: | Gaithersburg, USA
Event type: | international conference
Event start date: | 14 November 2018
Event end date: | 16 November 2018
HGF - Research field: | Aeronautics, Space and Transport
HGF - Program: | Space
HGF - Program topic: | no assignment
DLR - Research area: | Space
DLR - Program: | R - no assignment
DLR - Research theme (project): | R - no assignment
Location: | Jena
Institutes and institutions: | Institute of Data Science > Citizen Science
Deposited by: | Klan, Dr. Friederike
Deposited on: | 04 Feb 2019 09:58
Last modified: | 29 Jul 2024 15:15