Bongard, Jan Heinrich (2020) Twitter Stream Clustering for the Identification and Contextualization of Event-related Tweets. Master's, Friedrich-Schiller-Universität Jena / DLR Institute of Data Science.
This is the latest version of this item.
PDF (8MB), only accessible within DLR
Abstract
During a crisis, social media data can provide near-real-time information about current events and their development. Twitter data in particular offers high information value because of its large number of users and the easy access to its streaming API. Given the sheer volume of data, the small share of crisis-relevant tweets must be filtered out, processed, and presented to users in a manageable way. Current approaches focus on pre-trained supervised models that decide on the relevance of each tweet. However, supervised models depend heavily on the training data used and are only partially transferable to new event types. Compute-intensive unsupervised methods first reduce the data stream by filtering on keywords, hashtags, or named entities, so much information is excluded before it is analyzed. In contrast to these approaches, this work analyzes the complete content of the stream by combining an interval-based parametric Chinese Restaurant Process clustering with a pre-trained feed-forward neural network. The stream is divided into intervals, and each tweet is projected into a numerical vector by pre-trained sentence embeddings. The embeddings are then grouped by clustering. At the end of each interval, the pre-trained neural network decides whether a cluster contains crisis-relevant tweets. With an extended concept of cluster chains and central centroids, crisis-relevant clusters from different intervals can be synchronized. This hybrid methodology is validated in three steps. First, the selected clustering is evaluated on how well it distinguishes tweets from crisis-relevant categories and is compared to a non-parametric clustering approach. Second, the applicability of the proposed method to the Twitter stream is evaluated quantitatively. Finally, the information content of several clusters is visualized and examined in a qualitative validation.
It is shown that the hybrid approach can significantly improve the results of pre-trained supervised methods. This is especially true for categories in which the supervised methods could not be sufficiently pre-trained due to missing labels. In addition, the semantic grouping of tweets offers a flexible and customizable procedure, resulting in an overview of topic-specific stream content. The proposed method can distinguish between parallel events and sub-events. With the proposed hybrid method, the automated detection and analysis of crisis-relevant tweets can be improved.
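The interval-wise clustering and the linking of clusters across intervals described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the greedy (non-sampling) assignment rule, the similarity floor `min_sim`, the concentration value `alpha`, the linking threshold, and all function names are assumptions chosen for illustration, and the toy 2-D vectors stand in for real sentence embeddings. The relevance decision by the pre-trained neural network is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class Cluster:
    """Running sum of member vectors, so the centroid is cheap to update."""

    def __init__(self, vec):
        self.total = list(vec)
        self.size = 1

    def add(self, vec):
        self.total = [t + v for t, v in zip(self.total, vec)]
        self.size += 1

    def centroid(self):
        return [t / self.size for t in self.total]


def crp_like_cluster(vectors, alpha=0.5, min_sim=0.7):
    """Single-pass, CRP-flavoured clustering of one interval (assumed variant).

    An existing cluster is weighted by size * similarity (larger clusters
    attract more, as in a Chinese Restaurant Process); opening a new cluster
    has the fixed weight alpha. Instead of sampling, the highest weight wins,
    which keeps the sketch deterministic. Clusters below min_sim are ignored.
    """
    clusters, labels = [], []
    for vec in vectors:
        best_idx, best_weight = None, alpha
        for idx, cluster in enumerate(clusters):
            sim = cosine(vec, cluster.centroid())
            if sim < min_sim:
                continue
            weight = cluster.size * sim
            if weight > best_weight:
                best_idx, best_weight = idx, weight
        if best_idx is None:
            clusters.append(Cluster(vec))
            labels.append(len(clusters) - 1)
        else:
            clusters[best_idx].add(vec)
            labels.append(best_idx)
    return labels, clusters


def link_chains(prev_centroids, curr_centroids, min_sim=0.8):
    """Map each current cluster to its most similar previous cluster, or None."""
    links = {}
    for ci, curr in enumerate(curr_centroids):
        best_idx, best_sim = None, min_sim
        for pi, prev in enumerate(prev_centroids):
            sim = cosine(curr, prev)
            if sim >= best_sim:
                best_idx, best_sim = pi, sim
        links[ci] = best_idx
    return links


# Toy "embeddings" for one interval: two directions, i.e. two topics.
interval = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [1, 0.05]]
labels, clusters = crp_like_cluster(interval)
links = link_chains([[1, 0], [0, 1]], [[0.05, 1], [1, 1]])
```

In this sketch, a cluster of the current interval that maps to `None` would start a new chain, while a mapped cluster would continue an existing one; how the thesis actually builds and maintains chains is not specified in the abstract.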
Item URL in elib: | https://elib.dlr.de/140299/
---|---
Document Type: | Thesis (Master's)
Title: | Twitter Stream Clustering for the Identification and Contextualization of Event-related Tweets
Authors: | Bongard, Jan Heinrich
Date: | 2020
Journal or Publication Title: | Twitter Stream Clustering for the Identification and Contextualization of Event-related Tweets
Refereed publication: | Yes
Open Access: | Yes
Number of Pages: | 67
Status: | Published
Keywords: | Crisis informatics, semantic clustering, Twitter stream, supervised and unsupervised methods
Institution: | Friedrich-Schiller-Universität Jena / DLR Institute of Data Science
Department: | Department of Geography / Citizen Science
HGF - Research field: | Aeronautics, Space and Transport
HGF - Program: | Space
HGF - Program Themes: | other
DLR - Research area: | Raumfahrt
DLR - Program: | R - no assignment
DLR - Research theme (Project): | R - no assignment, R - Exploration of citizen science methods, R - QS-Project_04 Big-Data-Plattform
Location: | Jena
Institutes and Institutions: | Institute of Data Science > Citizen Science
Deposited By: | Kersten, Dr.-Ing. Jens
Deposited On: | 12 Jan 2021 10:15
Last Modified: | 30 Nov 2021 14:27
Available Versions of this Item
- Twitter Stream Clustering for the Identification and Contextualization of Event-related Tweets. (deposited 12 Jan 2021 10:15) [Currently Displayed]