Bongard, Jan Heinrich (2020) Twitter Stream Clustering for the Identification and Contextualization of Event-related Tweets. Master's thesis, Friedrich-Schiller-Universität Jena / DLR Institute of Data Science.
This is the latest version of this entry.
PDF (8 MB), accessible within DLR only
Abstract
During a crisis, social media data can provide near-real-time information about current events and their development. In particular, Twitter data offers high information value due to its large number of users and easy access to its streaming API. Given the sheer volume of data, it is necessary to filter out the small share of crisis-relevant tweets, process them, and present them to users in a manageable way. Current approaches focus on pre-trained supervised models that decide on the relevance of each tweet. However, supervised models depend heavily on the training data used and are only partially transferable to new event types. Compute-intensive unsupervised methods first reduce the data stream by filtering for keywords, hashtags, or named entities, so that much information is excluded before it has been analyzed. In contrast to these approaches, this work focuses on analyzing the stream's complete content by combining an interval-based parametric Chinese Restaurant Process clustering with a pre-trained feed-forward neural network. The stream is divided into intervals, and each tweet is projected into a numerical vector by pre-trained sentence embeddings. The embeddings are then grouped by clustering. At the end of each interval, the pre-trained neural network decides whether a cluster contains crisis-relevant tweets. With a further developed concept of cluster chains and central centroids, crisis-relevant clusters from different intervals can be synchronized. This hybrid methodology is validated in three steps. First, it is evaluated how well the selected clustering can distinguish tweets from crisis-relevant categories, and the result is compared to a non-parametric clustering approach. Second, the applicability of the proposed method to the Twitter stream is evaluated quantitatively. Finally, the information content of several clusters is visualized and examined in a qualitative validation.
It is shown that the hybrid approach can significantly improve the results of pre-trained supervised methods. This is especially true for categories in which the supervised methods could not be sufficiently pre-trained due to missing labels. In addition, the semantic grouping of tweets offers a flexible and customizable procedure, resulting in an overview of topic-specific stream content. The proposed method can distinguish between parallel events and sub-events. With the proposed hybrid method, the automated detection and analysis of crisis-relevant tweets can be improved.
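The interval-based clustering step described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: it is a simplified, deterministic Chinese-Restaurant-Process-style assignment under stated assumptions. The function name `cluster_interval`, the parameter `alpha` (weight for opening a new cluster), and the scoring rule (cluster size times cosine similarity to the centroid) are all hypothetical choices made for illustration. The input vectors stand in for the sentence embeddings of one interval.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_interval(vectors, alpha=0.5):
    """Greedy CRP-style clustering of the vectors of one interval.

    Assumption (not from the thesis): an existing cluster k scores
    n_k * max(cos(v, centroid_k), 0), a new cluster scores alpha,
    and each vector goes to the highest-scoring option.
    Returns (labels, centroids).
    """
    sums = []    # per-cluster running sum of member vectors
    counts = []  # per-cluster member count
    labels = []
    for v in vectors:
        best_k, best_score = None, alpha  # opening a new cluster scores alpha
        for k, (s, n) in enumerate(zip(sums, counts)):
            centroid = [x / n for x in s]
            score = n * max(cosine(v, centroid), 0.0)
            if score > best_score:
                best_k, best_score = k, score
        if best_k is None:
            # No existing cluster beats alpha: open a new one.
            sums.append(list(v))
            counts.append(1)
            labels.append(len(sums) - 1)
        else:
            sums[best_k] = [x + y for x, y in zip(sums[best_k], v)]
            counts[best_k] += 1
            labels.append(best_k)
    centroids = [[x / n for x in s] for s, n in zip(sums, counts)]
    return labels, centroids


if __name__ == "__main__":
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    labels, centroids = cluster_interval(toy_embeddings)
    print(labels)
```

In the workflow the abstract describes, a pre-trained neural network would then decide for each resulting cluster whether it contains crisis-relevant tweets, and centroids would be used to link clusters across intervals. Both steps are omitted here.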
elib URL of this entry: | https://elib.dlr.de/140299/
---|---
Document type: | Thesis (Master's thesis)
Title: | Twitter Stream Clustering for the Identification and Contextualization of Event-related Tweets
Authors: | Bongard, Jan Heinrich
Date: | 2020
Published in: | Twitter Stream Clustering for the Identification and Contextualization of Event-related Tweets
Refereed publication: | Yes
Open Access: | Yes
Number of pages: | 67
Status: | Published
Keywords: | Crisis informatics, semantic clustering, Twitter stream, supervised and unsupervised methods
Institution: | Friedrich-Schiller-Universität Jena / DLR Institute of Data Science
Department: | Department of Geography / Citizen Science
HGF - Research field: | Aeronautics, Space and Transport
HGF - Program: | Space
HGF - Program theme: | no assignment
DLR - Research area: | Space
DLR - Program: | R - no assignment
DLR - Research theme (project): | R - no assignment, R - Erforschung Bürgerwissenschaftlicher Methoden, R - QS-Projekt_04 Big-Data-Plattform
Location: | Jena
Institutes and institutions: | Institute of Data Science > Citizen Science
Deposited by: | Kersten, Dr.-Ing. Jens
Deposited on: | 12 Jan 2021 10:15
Last modified: | 30 Nov 2021 14:27
Available versions of this entry
- Twitter Stream Clustering for the Identification and Contextualization of Event-related Tweets. (deposited 12 Jan 2021 10:15) [Currently displayed]