
Classification of Incident-related Tweets: Tackling Imbalanced Training Data using Hybrid CNNs and Translation-based Data Augmentation

Kruspe, Anna and Kersten, Jens and Wiegmann, Matti and Stein, Benno and Klan, Friederike (2018) Classification of Incident-related Tweets: Tackling Imbalanced Training Data using Hybrid CNNs and Translation-based Data Augmentation. Text REtrieval COnference (TREC), 2018-11-14 - 2018-11-16, Gaithersburg, USA.

Full text not available from this repository.

Abstract

In this paper, we present our four approaches submitted to the 2018 Text REtrieval Conference (TREC) Incident Streams (IS) track. One of the main challenges in this track is the lack of training data for certain classes defined in the ontology. We therefore took measures to expand the provided data; in a first step, additional tweets were manually selected from CrisisLexT26 and EMTerms for all underrepresented classes ensuring a minimum number of 50 tweets per class. Using this expanded data, we trained three models. The first is a baseline model that uses a logistic regression classifier on word statistics. The second model is a state-of-the-art CNN which considers different frame widths on pre-trained word embeddings. This model was then extended with two identical CNN branches trained on the CrisisLexT26 and CrisisNLP data sets, and a posterior fusion network (third approach). Since all of these models still suffered from a lack of training data, more training examples were generated through a data augmentation technique using automatic round-trip translation. The fourth presented approach is identical to the third one, but was trained on this augmented data set. Finally, we describe our importance ranking procedure for tweets. Our method was implemented by weighting the average importance of the detected class and the tweet's relevance obtained with a classifier trained on the CrisisLexT26 data set.
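The importance ranking described at the end of the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting formula, the weight, or the per-class importance values, so everything below is an assumption made for illustration: the convex-combination form, the default `weight` of 0.5, the `CLASS_IMPORTANCE` table and its class labels, and the names `ScoredTweet`, `importance_score`, and `rank_by_importance`. It is not the authors' implementation.

```python
from dataclasses import dataclass

# Hypothetical per-class average importance values in [0, 1]. The abstract
# does not list them; labels and numbers are placeholders for illustration.
CLASS_IMPORTANCE = {
    "SearchAndRescue": 0.9,
    "Donations": 0.5,
    "Irrelevant": 0.1,
}


@dataclass
class ScoredTweet:
    tweet_id: str
    predicted_class: str  # class detected by the tweet classifier
    relevance: float      # assumed: probability from a separate relevance classifier


def importance_score(tweet: ScoredTweet, weight: float = 0.5) -> float:
    """Weight class importance against tweet relevance (assumed convex combination)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    class_part = CLASS_IMPORTANCE[tweet.predicted_class]
    return weight * class_part + (1.0 - weight) * tweet.relevance


def rank_by_importance(tweets, weight: float = 0.5):
    """Return the tweets sorted from most to least important."""
    return sorted(tweets, key=lambda t: importance_score(t, weight), reverse=True)
```

With such a scheme, a tweet assigned to a high-importance class but with low relevance can rank below a highly relevant tweet from a mid-importance class, depending on the chosen weight.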

Item URL in elib: https://elib.dlr.de/126329/
Document Type: Conference or Workshop Item (Speech)
Title: Classification of Incident-related Tweets: Tackling Imbalanced Training Data using Hybrid CNNs and Translation-based Data Augmentation
Authors:
Authors | Institution or Email of Authors | Author's ORCID iD | ORCID Put Code
Kruspe, Anna | UNSPECIFIED | https://orcid.org/0000-0002-2041-9453 | UNSPECIFIED
Kersten, Jens | UNSPECIFIED | UNSPECIFIED | UNSPECIFIED
Wiegmann, Matti | Bauhaus-Universität Weimar | UNSPECIFIED | UNSPECIFIED
Stein, Benno | Bauhaus-Universität Weimar | UNSPECIFIED | UNSPECIFIED
Klan, Friederike | UNSPECIFIED | https://orcid.org/0000-0002-1856-7334 | UNSPECIFIED
Date: 2018
Refereed publication: Yes
Open Access: No
Gold Open Access: No
In SCOPUS: No
In ISI Web of Science: No
Status: Accepted
Keywords: social media, tweet, classification, disaster management, imbalanced training data
Event Title: Text REtrieval COnference (TREC)
Event Location: Gaithersburg, USA
Event Type: international Conference
Event Start Date: 14 November 2018
Event End Date: 16 November 2018
HGF - Research field: Aeronautics, Space and Transport
HGF - Program: Space
HGF - Program Themes: other
DLR - Research area: Raumfahrt
DLR - Program: R - no assignment
DLR - Research theme (Project): R - no assignment
Location: Jena
Institutes and Institutions: Institute of Data Science > Citizen Science
Deposited By: Klan, Dr. Friederike
Deposited On: 04 Feb 2019 09:58
Last Modified: 29 Jul 2024 15:15
