Kruspe, Anna and Kersten, Jens and Wiegmann, Matti and Stein, Benno and Klan, Friederike (2018) Classification of Incident-related Tweets: Tackling Imbalanced Training Data using Hybrid CNNs and Translation-based Data Augmentation. Text REtrieval COnference (TREC), 2018-11-14 - 2018-11-16, Gaithersburg, USA.
Full text not available from this repository.
Abstract
In this paper, we present our four approaches submitted to the 2018 Text REtrieval Conference (TREC) Incident Streams (IS) track. One of the main challenges in this track is the lack of training data for certain classes defined in the ontology. We therefore took measures to expand the provided data. In a first step, additional tweets were manually selected from CrisisLexT26 and EMTerms for all underrepresented classes, ensuring a minimum of 50 tweets per class. Using this expanded data, we trained three models. The first is a baseline model that uses a logistic regression classifier on word statistics. The second is a state-of-the-art CNN that considers different frame widths on pre-trained word embeddings. The third approach extends this CNN with two identical CNN branches trained on the CrisisLexT26 and CrisisNLP data sets, and a posterior fusion network. Since all of these models still suffered from a lack of training data, more training examples were generated through data augmentation using automatic round-trip translation. The fourth approach is identical to the third, but was trained on this augmented data set. Finally, we describe our importance ranking procedure for tweets, which weights the average importance of the detected class against the tweet's relevance, as obtained from a classifier trained on the CrisisLexT26 data set.
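The importance ranking described at the end of the abstract can be illustrated with a minimal sketch. The abstract does not state how the two quantities are weighted, so the linear combination, the `class_weight` parameter, the class labels, and the per-class importance values below are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical average importance per detected class, scaled to [0, 1].
# These labels and values are placeholders, not figures from the paper.
CLASS_IMPORTANCE: Dict[str, float] = {
    "SearchAndRescue": 0.9,
    "Weather": 0.5,
    "Irrelevant": 0.1,
}


def importance_score(detected_class: str, relevance: float,
                     class_weight: float = 0.5) -> float:
    """Weighted average of class importance and tweet relevance.

    `relevance` stands in for the output of a relevance classifier
    (assumed to be a probability in [0, 1]). `class_weight` is an
    assumed mixing parameter; the abstract does not specify its value.
    """
    if not 0.0 <= relevance <= 1.0:
        raise ValueError("relevance must lie in [0, 1]")
    if not 0.0 <= class_weight <= 1.0:
        raise ValueError("class_weight must lie in [0, 1]")
    class_importance = CLASS_IMPORTANCE[detected_class]
    return class_weight * class_importance + (1.0 - class_weight) * relevance


def rank_tweets(tweets: List[Tuple[str, str, float]],
                class_weight: float = 0.5) -> List[Tuple[str, float]]:
    """Sort (tweet_id, detected_class, relevance) triples by descending score."""
    scored = [(tweet_id, importance_score(cls, rel, class_weight))
              for tweet_id, cls, rel in tweets]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    example = [
        ("t1", "Weather", 0.8),
        ("t2", "SearchAndRescue", 0.5),
        ("t3", "Irrelevant", 0.9),
    ]
    for tweet_id, score in rank_tweets(example):
        print(tweet_id, round(score, 3))
```

With equal weighting, a tweet in a high-importance class can outrank a more relevant tweet in a low-importance class. Changing `class_weight` shifts that balance.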
Item URL in elib: | https://elib.dlr.de/126329/
---|---
Document Type: | Conference or Workshop Item (Speech)
Title: | Classification of Incident-related Tweets: Tackling Imbalanced Training Data using Hybrid CNNs and Translation-based Data Augmentation
Authors: | Kruspe, Anna; Kersten, Jens; Wiegmann, Matti; Stein, Benno; Klan, Friederike
Date: | 2018
Refereed publication: | Yes
Open Access: | No
Gold Open Access: | No
In SCOPUS: | No
In ISI Web of Science: | No
Status: | Accepted
Keywords: | social media, tweet, classification, disaster management, imbalanced training data
Event Title: | Text REtrieval COnference (TREC)
Event Location: | Gaithersburg, USA
Event Type: | International Conference
Event Start Date: | 14 November 2018
Event End Date: | 16 November 2018
HGF - Research field: | Aeronautics, Space and Transport
HGF - Program: | Space
HGF - Program Themes: | other
DLR - Research area: | Space
DLR - Program: | R - no assignment
DLR - Research theme (Project): | R - no assignment
Location: | Jena
Institutes and Institutions: | Institute of Data Science > Citizen Science
Deposited By: | Klan, Dr. Friederike
Deposited On: | 04 Feb 2019 09:58
Last Modified: | 29 Jul 2024 15:15