
Twitter Stream Clustering for the Identification and Contextualization of Event-related Tweets

Bongard, Jan Heinrich (2020) Twitter Stream Clustering for the Identification and Contextualization of Event-related Tweets. Master's, Friedrich-Schiller-Universität Jena / DLR Institute of Data Science.

PDF - Registered users only


During a crisis, social media data can provide near-real-time information about current events and their development. Twitter data in particular offers high information value because of its large user base and the easy access to its streaming API. Given the sheer volume of data, the small share of crisis-relevant tweets has to be filtered out, processed, and presented to users in a manageable way. Current approaches rely on pre-trained supervised models that decide on the relevance of each tweet individually. However, these models depend heavily on their training data and transfer only partially to new event types. Many unsupervised methods reduce the stream by filtering on keywords, hashtags, or named entities, so that much information is excluded before it has been analyzed. In contrast to these approaches, this work analyzes the complete content of the stream by combining an interval-based non-parametric Chinese Restaurant Process clustering with a pre-trained feed-forward neural network. The stream is divided into intervals, and each tweet is projected into a numerical vector by pre-trained sentence embeddings. The embeddings are then grouped by clustering. At the end of each interval, the pre-trained neural network decides whether a cluster contains crisis-relevant tweets. With a further developed concept of cluster chains and central centroids, crisis-relevant clusters from different intervals can be synchronized. This hybrid methodology is validated in three steps. First, it is evaluated how well the selected clustering distinguishes tweets from crisis-relevant categories, and the result is compared to a parametric clustering approach. Second, the applicability of the proposed method to the Twitter stream is evaluated quantitatively. Finally, the information content of several clusters is visualized and examined in a qualitative validation.
It is shown that the hybrid approach can significantly improve the results of pre-trained supervised methods. This is especially true for categories in which the supervised methods could not be sufficiently pre-trained due to missing labels. In addition, the semantic grouping of tweets offers a flexible and customizable procedure, resulting in an overview of topic-specific stream content. The proposed method can distinguish between parallel events and sub-events. With the proposed hybrid method, the automated detection and analysis of crisis-relevant tweets can be improved.
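
The interval-based pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the similarity-weighted seating rule, the parameters `alpha`, `sharpness`, `threshold`, and `min_sim`, and the helper names `crp_cluster` and `link_chains` are assumptions made for illustration, and plain lists of floats stand in for the pre-trained sentence embeddings. The relevance decision of the neural network is left out; `link_chains` would be called only with centroids of clusters judged crisis-relevant.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(cluster):
    """Mean vector of a cluster stored as running sum plus member indices."""
    n = len(cluster["members"])
    return [s / n for s in cluster["sum"]]


def sample_index(weights, rng):
    """Draw an index with probability proportional to its weight."""
    r = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(weights) - 1


def crp_cluster(vectors, alpha=1.0, sharpness=50.0, threshold=0.5, seed=0):
    """Single-pass CRP-style clustering of one interval (illustrative only).

    An existing cluster is weighted by its size times a similarity term,
    a new cluster by the concentration parameter alpha, so the number of
    clusters is not fixed in advance.
    """
    rng = random.Random(seed)
    clusters = []
    for idx, v in enumerate(vectors):
        weights = [
            len(c["members"])
            * math.exp(sharpness * (cosine(v, centroid(c)) - threshold))
            for c in clusters
        ]
        weights.append(alpha)  # weight for opening a new cluster
        k = sample_index(weights, rng)
        if k == len(clusters):
            clusters.append({"sum": list(v), "members": [idx]})
        else:
            clusters[k]["sum"] = [s + x for s, x in zip(clusters[k]["sum"], v)]
            clusters[k]["members"].append(idx)
    return clusters


def link_chains(chains, centroids, min_sim=0.8):
    """Append each centroid to the chain whose latest centroid is most
    similar (assumed linking rule), or start a new chain."""
    for c in centroids:
        best, best_sim = None, min_sim
        for chain in chains:
            sim = cosine(c, chain[-1])
            if sim >= best_sim:
                best, best_sim = chain, sim
        if best is None:
            chains.append([c])
        else:
            best.append(c)
    return chains
```

A caller would run `crp_cluster` once per interval, pass each resulting centroid to a relevance classifier, and feed the centroids of the relevant clusters to `link_chains`, so that one chain follows a topic across successive intervals.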

Item URL in elib: https://elib.dlr.de/140299/
Document Type: Thesis (Master's)
Title: Twitter Stream Clustering for the Identification and Contextualization of Event-related Tweets
Authors: Bongard, Jan Heinrich
Institution or Email of Authors: jan.bongard (at) dlr.de
Author's ORCID iD: UNSPECIFIED
Journal or Publication Title: Twitter Stream Clustering for the Identification and Contextualization of Event-related Tweets
Refereed publication: Yes
Open Access: Yes
Gold Open Access: No
In ISI Web of Science: No
Number of Pages: 67
Keywords: Crisis informatics, semantic clustering, Twitter stream, supervised and unsupervised methods
Institution: Friedrich-Schiller-Universität Jena / DLR Institute of Data Science
Department: Department of Geography / Citizen Science
HGF - Research field: Aeronautics, Space and Transport
HGF - Program: Space
HGF - Program Themes: other
DLR - Research area: Space
DLR - Program: R - no assignment
DLR - Research theme (Project): R - no assignment
Location: Jena
Institutes and Institutions: Institute of Data Science > Citizen Science
Deposited By: Kersten, Dr.-Ing. Jens
Deposited On: 12 Jan 2021 10:15
Last Modified: 12 Jan 2021 10:15
